Robots Proxy Requirements & Architecture
Goals
- Provide a forward HTTP proxy that protects upstream origins from buggy crawlers by enforcing each host's `robots.txt` policy before a request is forwarded.
- Maintain a local cache of parsed robots directives per host so that crawlers can aggressively request documents without repeatedly fetching `robots.txt`.
- Offer a configuration surface that lets operators toggle protections, define crawler identity (User-Agent string), and supply sane defaults for timeouts and retries.
- Lay groundwork for future policies such as request rate limiting, automatic backoff, or per-host overrides without redesigning the core proxy.
- Expose observability hooks (structured logs, metrics counters) that make it easy to monitor allowed/blocked requests and cache freshness.
Non-goals (initial release)
- No HTTPS MITM by default: terminating TLS is opt-in and requires operators to supply/trust a root CA and accept the risks of decrypting crawler traffic. When MITM is disabled, the proxy still operates as a transparent CONNECT tunnel and does not enforce robots rules on encrypted payloads.
- No persistent/shared cache in v1. Everything lives in-memory; later we can add Redis or file-backed caches for multi-node deployments.
- No admin UI or control-plane API beyond config files/CLI flags in the first iteration.
Key Requirements
- Proxy behavior
  - Accept standard HTTP proxy requests and forward them to the origin, returning responses transparently.
  - Support GET/HEAD initially; other verbs can pass through but won't be robots-enforced until defined.
  - Respect upstream response streaming so large payloads do not accumulate in memory.
- Robots enforcement
  - Fetch `robots.txt` for a hostname on first request (or cache miss) using the same proxy client stack.
  - Parse directives per user-agent, falling back to `*` when the configured crawler agent is absent.
  - Deny requests whose path matches a `Disallow` rule before contacting the origin; respond with 403 and log the violation.
  - Operators can choose whether a missing `robots.txt` (HTTP 404) allows all requests or causes the proxy to block them with a 404 response.
- Caching strategy
  - Store parsed directives alongside metadata: fetched timestamp, HTTP status, cache TTL, fetch errors.
  - Re-fetch when the TTL expires or when the origin returned 4xx/5xx during the robots fetch (with exponential backoff to avoid hammering).
  - Provide manual invalidation hooks (CLI or signal) to drop entries without restarting.
- Configuration
  - File-based config (YAML/TOML) or CLI flags specifying listen address, crawler user-agent, cache TTL defaults, concurrency, and timeout settings.
  - Allow per-host overrides (e.g., custom user-agent, forced allow/deny, TTL).
  - Provide a middleware `layers` array so operators can order optional protections (robots, rate limiting, logging) without code changes (a possible config shape is sketched after this list).
- Observability
  - Structured logs for each request: host, path, action (allowed/blocked/bypassed), cache state (hit/miss/stale), latency.
  - Metrics counters for requests processed, robots fetch attempts, cache hits, denied requests, fetch failures.
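One possible shape for this configuration surface, sketched as Rust structs with `serde` derives. Every field and section name here (`listen`, `crawler_user_agent`, `layers`, `missing_policy`, `hosts`) is an illustrative assumption, not a finalized schema.

```rust
// Illustrative config schema only; field names and defaults are assumptions.
use serde::Deserialize;
use std::collections::HashMap;
use std::time::Duration;

#[derive(Debug, Deserialize)]
pub struct Config {
    /// Address the proxy listens on, e.g. "127.0.0.1:8080".
    pub listen: String,
    /// User-Agent the proxy matches against robots.txt groups and sends upstream.
    pub crawler_user_agent: String,
    /// Ordered middleware layers, e.g. ["logging", "robots"].
    pub layers: Vec<String>,
    pub robots: RobotsConfig,
    /// Per-host overrides keyed by hostname.
    #[serde(default)]
    pub hosts: HashMap<String, HostOverride>,
}

#[derive(Debug, Deserialize)]
pub struct RobotsConfig {
    /// Fallback TTL when robots.txt has no Cache-Control, in seconds.
    pub default_ttl_secs: u64,
    /// What to do when robots.txt returns 404: allow everything or block.
    pub missing_policy: MissingPolicy,
    /// Timeout for the robots.txt fetch itself, in seconds.
    pub fetch_timeout_secs: u64,
}

#[derive(Debug, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum MissingPolicy {
    AllowAll,
    Block,
}

#[derive(Debug, Deserialize, Default)]
pub struct HostOverride {
    pub user_agent: Option<String>,
    pub ttl_secs: Option<u64>,
    /// Force the decision for this host regardless of robots.txt.
    pub force: Option<ForceDecision>,
}

#[derive(Debug, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum ForceDecision {
    Allow,
    Deny,
}

impl RobotsConfig {
    pub fn default_ttl(&self) -> Duration {
        Duration::from_secs(self.default_ttl_secs)
    }
}
```

A TOML or YAML file with matching keys would deserialize straight into this via `serde`, keeping CLI flags and file-based config on one code path.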
High-Level Architecture
```
+-----------------------+
|    Proxy Listener     |
| (HTTP CONNECT + TLS)  |
+-----------+-----------+
            |
            v
+-----------------------+      +----------------------+
|   Request Pipeline    |----->|  Origin HTTP Client  |
|  (middleware chain)   |      |   (keepalive pool)   |
+-----+-----------+-----+      +----------+-----------+
      |           |                       ^
      |           |                       |
      |     +-----v-----+         +-------+--------+
      |     |  Robots   |<--------|  Robots Cache  |
      |     |  Policy   |         |  (in-memory)   |
      |     +-----------+         +----------------+
      |
      +--> future policies (rate limiting, rewrite, custom auth)
```
Components
- Proxy Listener: Accepts client connections, parses HTTP proxy protocol, establishes tunnels for HTTPS, and forwards HTTP requests into the pipeline.
- Request Pipeline: Ordered middleware stack; each policy can inspect/modify the request/context. Core policies: logging, robots enforcement, response streaming.
- Robots Policy: Responsible for locating or fetching directives, evaluating the current request path against them, and either allowing the request to continue or short-circuiting with a denial response.
- Robots Cache: In-memory map keyed by host+scheme storing parsed directives plus metadata. Supports TTL, background refresh, and invalidation signals.
- Origin HTTP Client: Performs outbound requests when policies permit them, using connection pooling, timeouts, and retries per config.
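To make the Proxy Listener's tunneling role concrete, here is a minimal sketch of handling an HTTPS `CONNECT` request with Tokio: parse the target from the request line, drain the headers, dial the origin, acknowledge with `200 Connection Established`, and relay bytes in both directions. It assumes plain Tokio, skips error handling, HTTP parsing edge cases, and the MITM path, and is an illustration rather than the actual listener.

```rust
use tokio::io::{AsyncBufReadExt, AsyncWriteExt, BufReader};
use tokio::net::{TcpListener, TcpStream};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:8080").await?;
    loop {
        let (client, _) = listener.accept().await?;
        tokio::spawn(async move {
            let _ = tunnel(client).await;
        });
    }
}

/// Handle a single CONNECT request by opening a raw TCP tunnel to the target.
async fn tunnel(client: TcpStream) -> std::io::Result<()> {
    let mut reader = BufReader::new(client);

    // Read the request line, e.g. "CONNECT example.com:443 HTTP/1.1".
    let mut request_line = String::new();
    reader.read_line(&mut request_line).await?;
    let target = request_line
        .split_whitespace()
        .nth(1)
        .unwrap_or_default()
        .to_string();

    // Drain the remaining request headers until the blank line.
    loop {
        let mut line = String::new();
        reader.read_line(&mut line).await?;
        if line == "\r\n" || line == "\n" || line.is_empty() {
            break;
        }
    }

    // Dial the origin and acknowledge the tunnel to the client.
    // NOTE: any bytes the client pipelined after the headers are dropped here;
    // a real listener must preserve the buffered remainder.
    let mut upstream = TcpStream::connect(target.as_str()).await?;
    let mut client = reader.into_inner();
    client
        .write_all(b"HTTP/1.1 200 Connection Established\r\n\r\n")
        .await?;

    // Relay bytes in both directions until either side closes.
    tokio::io::copy_bidirectional(&mut client, &mut upstream).await?;
    Ok(())
}
```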
Request Flow
- Client sends an HTTP request to the proxy (e.g., `GET http://example.com/page`).
- Proxy Listener normalizes the request and creates a context object containing host, path, scheme, and headers.
- Logging middleware records the start event and attaches a request ID.
- Robots policy extracts the host and consults the cache (a simplified evaluation sketch follows this list):
  - Cache hit + fresh: evaluate rules immediately.
  - Cache miss/stale: acquire the per-host lock, fetch `robots.txt`, parse, update the cache, then evaluate.
- If the path is disallowed, respond with 403, emit log/metric, and terminate the pipeline.
- If allowed, pass the request to later policies (future rate limiting, tracing) before handing off to the Origin HTTP Client.
- Origin client forwards the request and streams the response back to the client, logging completion and updating metrics.
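To illustrate the evaluation step in this flow, here is a simplified matcher over already-parsed rules: it picks the rule group for the configured agent (falling back to `*`) and applies longest-match-wins between `Allow` and `Disallow` prefixes, denying with 403. The real policy would sit behind whichever robots parser crate is chosen; the types and names here are placeholders.

```rust
use std::collections::HashMap;

/// Parsed rules for a single user-agent group.
#[derive(Default, Clone)]
pub struct RuleGroup {
    pub allow: Vec<String>,
    pub disallow: Vec<String>,
}

/// Directives for one host: a map from user-agent token to its rule group.
pub struct RobotsDirectives {
    pub groups: HashMap<String, RuleGroup>,
}

pub enum Decision {
    Allow,
    /// Deny and short-circuit the pipeline with this status (403 per the spec above).
    Deny(u16),
}

impl RobotsDirectives {
    pub fn evaluate(&self, agent: &str, path: &str) -> Decision {
        // Fall back to the wildcard group when the crawler agent has no group.
        let group = self
            .groups
            .get(agent)
            .or_else(|| self.groups.get("*"))
            .cloned()
            .unwrap_or_default();

        // Longest matching prefix wins; ties go to Allow.
        let best_allow = longest_prefix(&group.allow, path);
        let best_disallow = longest_prefix(&group.disallow, path);
        if best_disallow > best_allow {
            Decision::Deny(403)
        } else {
            Decision::Allow
        }
    }
}

/// Length of the longest rule that is a prefix of `path`, or 0 if none match.
/// (Empty rules, `$` anchors, and `*` wildcards are ignored in this sketch.)
fn longest_prefix(rules: &[String], path: &str) -> usize {
    rules
        .iter()
        .filter(|rule| !rule.is_empty() && path.starts_with(rule.as_str()))
        .map(|rule| rule.len())
        .max()
        .unwrap_or(0)
}
```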
Robots Cache & Fetch Strategy
- Key: `<scheme>://<host>:<port>`, ensuring separate entries for HTTP vs HTTPS or custom ports.
- Value: parsed directives (allow/disallow lists with priority), crawl-delay info, sitemap references, and metadata `{fetched_at, expires_at, http_status, etag/last-modified}` (a possible entry layout is sketched after this list).
- Fetching:
  - Use HEAD first to check existence? Simpler: GET `/robots.txt` with standard timeouts.
  - Respect `Cache-Control` headers when setting the TTL; the fallback TTL is configurable (e.g., 1 hour) with min/max bounds.
  - On 404: treat as "allow all" but cache with a short TTL. On 401/403/5xx: treat as a temporary failure, optionally fail closed (configurable), and retry later.
- Concurrency: Use per-host mutex/future to avoid stampedes when multiple crawler requests arrive simultaneously.
- Invalidation: CLI command or signal (e.g., `SIGHUP`) triggers dropping either all cache entries or a specific host.
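A possible layout for the cache entries and the per-host locking that prevents fetch stampedes, assuming Tokio's `Mutex` and placeholder types; the real fetch, parse, and backoff logic is elided.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::Mutex;

/// Cached robots state for one `<scheme>://<host>:<port>` key.
pub struct RobotsEntry {
    pub http_status: u16,
    pub fetched_at: Instant,
    pub expires_at: Instant,
    pub etag: Option<String>,
    /// Stand-in for the parsed rule groups; the real type comes from the parser.
    pub directives: (),
}

impl RobotsEntry {
    pub fn is_fresh(&self) -> bool {
        Instant::now() < self.expires_at
    }
}

/// One slot per origin key; the outer map hands out a per-key lock so only
/// one task fetches robots.txt for a given host at a time.
#[derive(Default)]
pub struct RobotsCache {
    slots: Mutex<HashMap<String, Arc<Mutex<Option<RobotsEntry>>>>>,
}

impl RobotsCache {
    /// Return the slot for `key`, creating an empty one on first use.
    pub async fn slot(&self, key: &str) -> Arc<Mutex<Option<RobotsEntry>>> {
        let mut slots = self.slots.lock().await;
        slots
            .entry(key.to_string())
            .or_insert_with(|| Arc::new(Mutex::new(None)))
            .clone()
    }
}

/// Usage sketch: lock the slot, and only fetch if the entry is missing or stale.
pub async fn ensure_fresh(cache: &RobotsCache, key: &str) {
    let slot = cache.slot(key).await;
    let mut entry = slot.lock().await;
    let needs_fetch = entry.as_ref().map_or(true, |e| !e.is_fresh());
    if needs_fetch {
        // Placeholder for the real fetch + parse; a 404 would get a short TTL here.
        *entry = Some(RobotsEntry {
            http_status: 200,
            fetched_at: Instant::now(),
            expires_at: Instant::now() + Duration::from_secs(3600),
            etag: None,
            directives: (),
        });
    }
}
```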
Extensibility Hooks for Future Policies
- Define a `Policy` trait, e.g. `async fn handle(&self, ctx: &mut RequestContext, next: Next<'_>) -> Result<Response>`. Policies compose Tower-style so they can short-circuit or call the next handler (see the sketch after this list).
- Provide shared context data (request metadata, cache stats, metrics emitters) so policies can make decisions without deep coupling.
- Ship built-in policies: logging, robots enforcement. Later, add rate limiting (token bucket keyed by host), adaptive retry/backoff, response filtering.
- Ensure configuration supports enabling/disabling policies and ordering them as needed.
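A minimal sketch of that contract, assuming the `async-trait` crate and placeholder `RequestContext`/`Response` types; the production pipeline may well be built on `tower` instead, so treat this as an illustration of the short-circuit-or-continue shape rather than the final API.

```rust
use async_trait::async_trait;
use std::sync::Arc;

pub struct RequestContext {
    pub host: String,
    pub path: String,
}

pub struct Response {
    pub status: u16,
}

pub type PolicyResult = Result<Response, Box<dyn std::error::Error + Send + Sync>>;

/// The remaining policies in the chain; calling `run` advances one step and,
/// at the end of the chain, stands in for the Origin HTTP Client.
pub struct Next<'a> {
    rest: &'a [Arc<dyn Policy>],
}

impl<'a> Next<'a> {
    pub async fn run(self, ctx: &mut RequestContext) -> PolicyResult {
        match self.rest.split_first() {
            Some((head, tail)) => head.handle(ctx, Next { rest: tail }).await,
            None => Ok(Response { status: 200 }), // placeholder for the origin call
        }
    }
}

#[async_trait]
pub trait Policy: Send + Sync {
    async fn handle(&self, ctx: &mut RequestContext, next: Next<'_>) -> PolicyResult;
}

/// A pass-through policy that only logs, showing the short-circuit-or-continue shape.
pub struct LoggingPolicy;

#[async_trait]
impl Policy for LoggingPolicy {
    async fn handle(&self, ctx: &mut RequestContext, next: Next<'_>) -> PolicyResult {
        println!("start host={} path={}", ctx.host, ctx.path);
        let result = next.run(ctx).await;
        println!("end host={}", ctx.host);
        result
    }
}
```

A robots policy would implement the same trait and return a 403 response without calling `next.run` when a path is disallowed.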
Technology Choices (tentative)
- Language/runtime: Rust on top of Tokio for async IO, leveraging Hyper for both proxy listener and origin client to get strong performance and memory safety.
- Dependencies:
  - HTTP proxy foundation: `hyper` + `hyper-util` for client/server plumbing, potentially `tower` for middleware ergonomics.
  - Robots parser: an existing crate such as `robotstxt`, or a lightweight custom parser, depending on flexibility needs.
  - Metrics/logging: the `tracing` ecosystem for structured logs and OpenTelemetry exporters, plus the `metrics` or `prometheus` crate for counters.
Testing Strategy
- Unit tests for robots parser wrapper, cache TTL logic, and policy decision outcomes.
- Integration tests with an in-memory HTTP server serving different `robots.txt` permutations, verifying allow/deny behavior and caching.
- Load/smoke tests to ensure the proxy handles concurrent crawler requests and respects TTLs without deadlocks.
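As an example of the unit-test layer, a sketch of a TTL-freshness test; the `Entry` type and the `is_fresh(now)` signature are stand-ins chosen so the clock can be injected, not the real cache API.

```rust
use std::time::{Duration, Instant};

/// Minimal stand-in for the cache freshness rule under test.
struct Entry {
    expires_at: Instant,
}

impl Entry {
    /// Taking `now` as an argument keeps TTL logic testable without sleeping.
    fn is_fresh(&self, now: Instant) -> bool {
        now < self.expires_at
    }
}

#[test]
fn entry_goes_stale_after_ttl() {
    let fetched_at = Instant::now();
    let entry = Entry {
        expires_at: fetched_at + Duration::from_secs(3600),
    };

    // Fresh shortly after the fetch...
    assert!(entry.is_fresh(fetched_at + Duration::from_secs(1)));
    // ...and stale once the TTL has elapsed.
    assert!(!entry.is_fresh(fetched_at + Duration::from_secs(3601)));
}
```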
Open Questions / Future Work
- Should we allow per-client identities (multiple crawler user-agents) simultaneously?
- How will we persist cache metadata across restarts if needed?
- When implementing rate limiting, will the policy need shared state (e.g., Redis) for multi-proxy deployments?
- What is the operator UX for cache inspection (CLI, metrics endpoint)?