mirror of
https://github.com/yacy/yacy_search_server.git
synced 2026-03-11 16:22:26 -04:00
Problem:
Failed URLs (404, DNS errors, timeouts) are continuously redistributed via DHT, causing infinite recrawl loops and network-wide index pollution. No mechanism exists in YaCy to communicate error status across peers.

Solution - Layer 1 (Proactive Rejection):
- Receiver checks local Solr index for httpstatus_i != 200 BEFORE accepting RWI entries
- Rejects URL immediately if marked as failed previously
- Adds rejected URL hash to errorURL response list
- Prevents index pollution at ingestion time
- Works even if sender doesn't support errorURL protocol (backward compatible)

Solution - Layer 2 (Error Feedback):
- Receiver reports rejected error URLs back to sender via errorURL response parameter
- Sender receives errorURL list and marks those URLs locally as failed
- Sender stops re-distributing these URLs to other peers
- Network-wide error propagation prevents repeated distribution cycles

Implementation Details:
- transferURL.java: Implements proactive rejection + error reporting
  * Checks incoming URL against Solr error status before storing
  * Collects rejected URL hashes in errorURLs StringBuilder
  * Returns errorURL list to sender in response
- Protocol.java: Processes error URL feedback from receiver
  * Extracts errorURL from response
  * Marks reported URLs locally via crawlQueues.errorURL.push()
  * Logs DHT error reports for monitoring

Benefits:
- Dramatically reduces network traffic of broken URLs
- Prevents wasted crawl resources on unreachable targets
- Maintains clean, usable index across distributed network
- Defense-in-depth: two independent layers work together
- Backward compatible: old peers ignore errorURL parameter

Testing:
- Log monitoring shows 'DHT: Received X rejected error URL reports from peer Y'
- Proactive rejection shows 'blocked X URLs' in transfer logs
- Error URLs automatically removed from circulation
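
The interplay of the two layers can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in transferURL.java or Protocol.java: the names DhtErrorFeedbackSketch, Receiver, Sender, receive, processFeedback and selectForTransfer are hypothetical, an in-memory map stands in for the local Solr index (URL hash to httpstatus_i), and a set stands in for the sender's local error marking. It also assumes that a URL hash with no recorded status is accepted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the two layers; none of these types are YaCy classes.
class DhtErrorFeedbackSketch {

    /** Receiver side: stands in for the check performed before storing transferred URLs. */
    static class Receiver {
        // Stand-in for the local Solr index: URL hash -> last recorded httpstatus_i.
        final Map<String, Integer> knownStatus = new HashMap<>();
        // URL hashes that were accepted and stored.
        final Set<String> stored = new HashSet<>();

        /** Stores acceptable hashes and returns the rejected ones (the errorURL list). */
        List<String> receive(List<String> urlHashes) {
            List<String> errorUrls = new ArrayList<>();
            for (String hash : urlHashes) {
                Integer status = knownStatus.get(hash);
                if (status != null && status != 200) {
                    errorUrls.add(hash);   // Layer 1: reject before storing
                } else {
                    stored.add(hash);      // unknown or status 200: accept (assumption)
                }
            }
            return errorUrls;
        }
    }

    /** Sender side: stands in for processing the errorURL feedback. */
    static class Sender {
        // Stand-in for the sender's local record of failed URL hashes.
        final Set<String> failed = new HashSet<>();

        /** Layer 2: mark every reported hash as failed locally. */
        void processFeedback(List<String> errorUrls) {
            failed.addAll(errorUrls);
        }

        /** Returns only the candidates that are not marked as failed. */
        List<String> selectForTransfer(List<String> candidates) {
            List<String> eligible = new ArrayList<>();
            for (String hash : candidates) {
                if (!failed.contains(hash)) {
                    eligible.add(hash);
                }
            }
            return eligible;
        }
    }

    public static void main(String[] args) {
        Receiver receiver = new Receiver();
        receiver.knownStatus.put("hashA", 404);   // previously failed
        receiver.knownStatus.put("hashB", 200);   // previously fine

        Sender sender = new Sender();
        List<String> candidates = Arrays.asList("hashA", "hashB", "hashC");

        List<String> rejected = receiver.receive(candidates);
        sender.processFeedback(rejected);

        System.out.println(rejected);                             // [hashA]
        System.out.println(sender.selectForTransfer(candidates)); // [hashB, hashC]
    }
}
```

In this sketch the receiver's rejection does not depend on the sender doing anything with the returned list, which mirrors the backward-compatibility point above: a sender that ignores the feedback still cannot push a known-failed URL into the receiver's store.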