pr0vieh 5b55319d27 feat: Add two-layer DHT error propagation to prevent broken URL redistribution
Problem:
Failed URLs (404, DNS errors, timeouts) are continuously redistributed via DHT,
causing infinite recrawl loops and network-wide index pollution. No mechanism
exists in YaCy to communicate error status across peers.

Solution - Layer 1 (Proactive Rejection):
- Receiver checks local Solr index for httpstatus_i != 200 BEFORE accepting RWI entries
- Rejects URL immediately if marked as failed previously
- Adds rejected URL hash to errorURL response list
- Prevents index pollution at ingestion time
- Works even if sender doesn't support errorURL protocol (backward compatible)
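
The Layer 1 check can be sketched roughly as follows. This is a minimal illustration, not the code in transferURL.java: the class `ProactiveRejectionSketch`, the `Result` type, and the `Map` standing in for the local Solr lookup of `httpstatus_i` are all hypothetical, and joining rejected hashes with commas is an assumption about the errorURL format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ProactiveRejectionSketch {

    /** Outcome of filtering one incoming batch of URL hashes. */
    public static final class Result {
        public final List<String> accepted = new ArrayList<>();
        // Mirrors the errorURLs StringBuilder named in the commit message.
        public final StringBuilder errorURLs = new StringBuilder();
        public int blocked = 0;
    }

    /**
     * localStatus is a stand-in for the Solr index: URL hash -> httpstatus_i.
     * A missing entry means the URL is unknown locally and is accepted.
     */
    public static Result filter(List<String> incomingHashes, Map<String, Integer> localStatus) {
        Result r = new Result();
        for (String hash : incomingHashes) {
            Integer status = localStatus.get(hash);
            if (status != null && status != 200) {
                // Previously failed: reject and remember the hash for the response.
                if (r.errorURLs.length() > 0) r.errorURLs.append(','); // assumed separator
                r.errorURLs.append(hash);
                r.blocked++;
            } else {
                r.accepted.add(hash);
            }
        }
        return r;
    }
}
```

Because the decision depends only on the receiver's own index, the rejection happens regardless of whether the sender ever reads the returned list.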

Solution - Layer 2 (Error Feedback):
- Receiver reports rejected error URLs back to sender via errorURL response parameter
- Sender receives errorURL list and marks those URLs locally as failed
- Sender stops re-distributing these URLs to other peers
- Network-wide error propagation prevents repeated distribution cycles
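
The sender side of Layer 2 might look like the sketch below. It is an illustration under assumptions, not the code in Protocol.java: `ErrorFeedbackSketch` and its methods are hypothetical, the in-memory `Set` stands in for the call to crawlQueues.errorURL.push(), and the comma-separated value of the `errorURL` response field is an assumed format.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ErrorFeedbackSketch {

    // Stand-in for the sender's local error store.
    private final Set<String> failedHashes = new HashSet<>();

    /**
     * Reads the errorURL field from a transfer response and marks each
     * reported hash as failed. Returns the number of newly marked hashes.
     */
    public int processResponse(Map<String, String> response) {
        String errorURL = response.get("errorURL");
        if (errorURL == null || errorURL.isEmpty()) {
            return 0; // old receiver without the field, or nothing was rejected
        }
        int newlyMarked = 0;
        for (String part : errorURL.split(",")) { // assumed separator
            String hash = part.trim();
            if (!hash.isEmpty() && failedHashes.add(hash)) {
                newlyMarked++;
            }
        }
        return newlyMarked;
    }

    /** Drops hashes already marked as failed before the next distribution round. */
    public List<String> selectForDistribution(List<String> candidates) {
        List<String> out = new ArrayList<>();
        for (String hash : candidates) {
            if (!failedHashes.contains(hash)) {
                out.add(hash);
            }
        }
        return out;
    }
}
```

A response without the field is treated as "nothing to report", which is what keeps the exchange compatible with receivers that predate this change.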

Implementation Details:
- transferURL.java: Implements proactive rejection + error reporting
  * Checks incoming URL against Solr error status before storing
  * Collects rejected URL hashes in errorURLs StringBuilder
  * Returns errorURL list to sender in response

- Protocol.java: Processes error URL feedback from receiver
  * Extracts errorURL from response
  * Marks reported URLs locally via crawlQueues.errorURL.push()
  * Logs DHT error reports for monitoring

Benefits:
- Reduces DHT transfer traffic spent on broken URLs
- Prevents wasted crawl resources on unreachable targets
- Maintains clean, usable index across distributed network
- Defense-in-depth: two independent layers work together
- Backward compatible: old peers ignore errorURL parameter

Testing:
- Log monitoring shows 'DHT: Received X rejected error URL reports from peer Y'
- Proactive rejection shows 'blocked X URLs' in transfer logs
- Reported error URLs are no longer offered to other peers in later transfers
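
For the log monitoring step, a small helper along these lines could total the reports seen in a log excerpt. It is only a sketch: `DhtLogSketch` is a hypothetical class, and the pattern assumes that X in the message above is a decimal count and Y is a peer name without whitespace, which the commit message does not specify.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DhtLogSketch {

    // Assumed shape of the message: X is a number, Y is a whitespace-free peer name.
    private static final Pattern REPORT = Pattern.compile(
        "DHT: Received (\\d+) rejected error URL reports from peer (\\S+)");

    /** Sums the reported counts over all matching log lines. */
    public static int totalReports(List<String> logLines) {
        int total = 0;
        for (String line : logLines) {
            Matcher m = REPORT.matcher(line);
            if (m.find()) {
                total += Integer.parseInt(m.group(1));
            }
        }
        return total;
    }
}
```

A total that stays at zero while transfers are running would suggest that either no receiving peer supports the errorURL response yet or no failed URLs are being offered.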
2026-01-20 22:17:14 +01:00