# Robots Proxy Requirements & Architecture
## Goals
- Provide a forward HTTP proxy that protects upstream origins from buggy crawlers by enforcing each host's `robots.txt` policy before a request is forwarded.
- Maintain a local cache of parsed robots directives per host so that crawlers can aggressively request documents without repeatedly fetching `robots.txt`.
- Offer a configuration surface that lets operators toggle protections, define crawler identity (User-Agent string), and supply sane defaults for timeouts and retries.
- Lay groundwork for future policies such as request rate limiting, automatic backoff, or per-host overrides without redesigning the core proxy.
- Expose observability hooks (structured logs, metrics counters) that make it easy to monitor allowed/blocked requests and cache freshness.
## Non-goals (initial release)
- HTTPS MITM by default: terminating TLS is strictly opt-in and requires operators to supply/trust a root CA and accept the risks of decrypting crawler traffic. When MITM is disabled, the proxy still operates as a transparent CONNECT tunnel without robots enforcement on encrypted payloads.
- No persistent/shared cache in v1. Everything lives in-memory; later we can add Redis or file-backed caches for multi-node deployments.
- No admin UI or control-plane API beyond config files/CLI flags in the first iteration.
## Key Requirements
1. **Proxy behavior**
   - Accept standard HTTP proxy requests and forward them to the origin, returning responses transparently.
   - Support GET/HEAD initially; other verbs can pass through but are not robots-enforced until policies for them are defined.
   - Respect upstream response streaming so large payloads do not accumulate in memory.
2. **Robots enforcement**
   - Fetch `robots.txt` for a hostname on first request (or cache miss) using the same proxy client stack.
   - Parse directives per user-agent, falling back to `*` when the configured crawler agent is absent.
   - Deny requests whose path matches a `Disallow` rule before contacting the origin; respond with 403 and log the violation.
   - Operators can choose whether a missing `robots.txt` (HTTP 404) allows all requests or causes the proxy to block them with a 404 response.
3. **Caching strategy**
   - Store parsed directives alongside metadata: fetched timestamp, HTTP status, cache TTL, fetch errors.
   - Re-fetch when TTL expires or when the origin returned 4xx/5xx during the robots fetch (with exponential backoff to avoid hammering).
   - Provide manual invalidation hooks (CLI or signal) to drop entries without restarting.
4. **Configuration**
   - File-based config (YAML/TOML) or CLI flags specifying listen address, crawler user-agent, cache TTL defaults, concurrency, and timeout settings.
   - Allow per-host overrides (e.g., custom user-agent, forced allow/deny, TTL).
   - Provide a middleware `layers` array so operators can order optional protections (robots, rate limiting, logging) without code changes; a configuration sketch follows this list.
5. **Observability**
   - Structured logs for each request: host, path, action (allowed/blocked/bypassed), cache state (hit/miss/stale), latency.
   - Metrics counters for requests processed, robots fetch attempts, cache hits, denied requests, fetch failures.
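
To make the configuration item concrete, here is a minimal sketch of how such a config could deserialize with `serde`; the struct and field names (`ProxyConfig`, `HostOverride`, `layers`, and the defaults) are illustrative assumptions, not the final schema.

```rust
use std::collections::HashMap;

use serde::Deserialize;

/// Top-level proxy configuration (illustrative field names, not the final schema).
#[derive(Debug, Deserialize)]
struct ProxyConfig {
    /// Address the proxy listener binds to, e.g. "0.0.0.0:8080".
    listen_addr: String,
    /// User-Agent the robots policy matches against and the origin client sends.
    crawler_user_agent: String,
    /// Ordered middleware chain, e.g. ["logging", "robots", "rate_limit"].
    layers: Vec<String>,
    /// Default TTL for cached robots.txt entries, in seconds.
    #[serde(default = "default_robots_ttl_secs")]
    robots_ttl_secs: u64,
    /// Outbound request timeout, in seconds.
    #[serde(default = "default_origin_timeout_secs")]
    origin_timeout_secs: u64,
    /// Maximum concurrent outbound requests.
    #[serde(default = "default_concurrency")]
    max_concurrency: usize,
    /// Per-host overrides keyed by hostname.
    #[serde(default)]
    hosts: HashMap<String, HostOverride>,
}

/// Per-host overrides: custom agent, forced decision, custom TTL.
#[derive(Debug, Deserialize)]
struct HostOverride {
    user_agent: Option<String>,
    /// "allow" or "deny" to bypass robots evaluation for this host entirely.
    force: Option<String>,
    ttl_secs: Option<u64>,
}

fn default_robots_ttl_secs() -> u64 {
    3600 // one-hour fallback TTL for robots entries
}

fn default_origin_timeout_secs() -> u64 {
    30
}

fn default_concurrency() -> usize {
    64
}
```

A YAML or TOML front end (for example via `serde_yaml` or `toml`) can populate these structs directly, and the `layers` order doubles as the middleware order in the request pipeline.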
## High-Level Architecture
```
+-----------------------+
|    Proxy Listener     |
| (HTTP CONNECT + TLS)  |
+-----------+-----------+
            |
            v
+-----------------------+        +---------------------+
|   Request Pipeline    |------->| Origin HTTP Client  |
|  (middleware chain)   |        |  (keepalive pool)   |
+-----+---------+-------+        +---------+-----------+
      |         |                          ^
      |         |                          |
      |   +-----v-----+            +-------+--------+
      |   |  Robots   |<-----------|  Robots Cache  |
      |   |  Policy   |            |  (in-memory)   |
      |   +-----------+            +----------------+
      |
      +--> future policies (rate limiting, rewrite, custom auth)
```
### Components
- **Proxy Listener**: Accepts client connections, parses HTTP proxy protocol, establishes tunnels for HTTPS, and forwards HTTP requests into the pipeline.
- **Request Pipeline**: Ordered middleware stack; each policy can inspect/modify the request/context. Core policies: logging, robots enforcement, response streaming.
- **Robots Policy**: Responsible for locating or fetching directives, evaluating the current request path against them, and either allowing the request to continue or short-circuiting with a denial response.
- **Robots Cache**: In-memory map keyed by host+scheme storing parsed directives plus metadata. Supports TTL, background refresh, and invalidation signals.
- **Origin HTTP Client**: Performs outbound requests when policies permit them, using connection pooling, timeouts, and retries per config.
## Request Flow
1. Client sends HTTP request to proxy (e.g., `GET http://example.com/page`).
2. Proxy Listener normalizes the request and creates a context object containing host, path, scheme, and headers.
3. Logging middleware records the start event and attaches a request ID.
4. Robots policy extracts the host and consults the cache:
   - **Cache hit + fresh**: Evaluate rules immediately.
   - **Cache miss/stale**: Acquire per-host lock, fetch `robots.txt`, parse, update cache, then evaluate.
5. If the path is disallowed, respond with 403, emit log/metric, and terminate the pipeline.
6. If allowed, pass the request to later policies (future rate limiting, tracing) before handing off to the Origin HTTP Client.
7. Origin client forwards the request and streams the response back to the client, logging completion and updating metrics.
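
The flow above can be compressed into a single handler sketch. Everything here is illustrative: `RequestContext`, `Decision`, and the placeholder rule stand in for the real pipeline, and the `tracing` calls show where the structured log fields from the Observability requirement would be emitted.

```rust
use std::time::Instant;

/// Decision returned by the robots policy for a single request (illustrative).
enum Decision {
    Allowed,
    Blocked,
}

/// Minimal stand-in for the context built by the Proxy Listener.
struct RequestContext {
    request_id: u64,
    host: String,
    path: String,
}

/// Hypothetical robots policy: a fresh cache entry is evaluated directly,
/// a miss/stale entry would trigger a locked fetch before evaluation.
fn evaluate_robots(ctx: &RequestContext) -> Decision {
    // Placeholder rule: block anything under /admin, allow the rest.
    if ctx.path.starts_with("/admin") {
        Decision::Blocked
    } else {
        Decision::Allowed
    }
}

fn handle(ctx: &RequestContext) -> u16 {
    let started = Instant::now();
    // Step 3: logging middleware records the start event with a request ID.
    tracing::info!(request_id = ctx.request_id, host = %ctx.host, path = %ctx.path, "request start");

    match evaluate_robots(ctx) {
        // Step 5: disallowed paths short-circuit with 403 plus a log/metric event.
        Decision::Blocked => {
            tracing::warn!(
                request_id = ctx.request_id,
                host = %ctx.host,
                path = %ctx.path,
                action = "blocked",
                latency_ms = started.elapsed().as_millis() as u64,
                "robots disallow"
            );
            403
        }
        // Steps 6-7: allowed requests continue to later policies and the origin client.
        Decision::Allowed => {
            tracing::info!(
                request_id = ctx.request_id,
                host = %ctx.host,
                action = "allowed",
                latency_ms = started.elapsed().as_millis() as u64,
                "forwarding to origin"
            );
            200
        }
    }
}

fn main() {
    tracing_subscriber::fmt::init();
    let ctx = RequestContext { request_id: 1, host: "example.com".into(), path: "/admin/login".into() };
    assert_eq!(handle(&ctx), 403);
}
```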
## Robots Cache & Fetch Strategy
- **Key**: `<scheme>://<host>:<port>` ensuring separate entries for HTTP vs HTTPS or custom ports.
- **Value**: Parsed directives (allow/disallow lists with priority), crawl-delay info, sitemap references, metadata {fetched_at, expires_at, http_status, etag/last-modified}.
- **Fetching**:
  - Skip a separate HEAD existence check; issue a single GET to `/robots.txt` with standard timeouts.
  - Respect `Cache-Control` headers when setting the TTL; the fallback TTL is configurable (e.g., 1 hour) with min/max bounds.
  - On 404: apply the configured missing-robots policy (allow all by default) and cache the result with a short TTL. On 401/403/5xx: treat as a temporary failure, optionally fail closed (configurable), and retry later.
- **Concurrency**: Use per-host mutex/future to avoid stampedes when multiple crawler requests arrive simultaneously.
- **Invalidation**: CLI command or signal (e.g., `SIGHUP`) triggers dropping either all cache entries or a specific host.
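
A sketch of the cache shape and the per-host stampede protection described above, assuming Tokio primitives; `RobotsEntry`, `RobotsCache`, and the stubbed fetch are illustrative, and a real implementation would perform the `GET /robots.txt` and honor `Cache-Control` inside the locked section.

```rust
use std::{collections::HashMap, sync::Arc, time::{Duration, Instant}};

use tokio::sync::{Mutex, RwLock};

/// Cached robots data for one `<scheme>://<host>:<port>` key (illustrative shape).
#[derive(Clone)]
struct RobotsEntry {
    disallow: Vec<String>,       // parsed Disallow prefixes for the configured agent
    crawl_delay: Option<Duration>,
    fetched_at: Instant,
    expires_at: Instant,         // derived from Cache-Control max-age or the default TTL
    http_status: u16,
}

impl RobotsEntry {
    fn is_fresh(&self) -> bool {
        Instant::now() < self.expires_at
    }
}

/// In-memory cache with one async lock per host to avoid fetch stampedes.
#[derive(Default)]
struct RobotsCache {
    entries: RwLock<HashMap<String, RobotsEntry>>,
    fetch_locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl RobotsCache {
    async fn get_or_fetch(&self, key: &str) -> RobotsEntry {
        // Fast path: a fresh entry is already cached.
        if let Some(entry) = self.entries.read().await.get(key) {
            if entry.is_fresh() {
                return entry.clone();
            }
        }

        // Slow path: serialize fetches for this host only.
        let host_lock = {
            let mut locks = self.fetch_locks.lock().await;
            locks.entry(key.to_string()).or_insert_with(|| Arc::new(Mutex::new(()))).clone()
        };
        let _guard = host_lock.lock().await;

        // Re-check after acquiring the lock: another task may have refreshed the entry.
        if let Some(entry) = self.entries.read().await.get(key) {
            if entry.is_fresh() {
                return entry.clone();
            }
        }

        // Stubbed fetch: a real implementation would GET /robots.txt here.
        let entry = RobotsEntry {
            disallow: vec!["/admin".to_string()],
            crawl_delay: None,
            fetched_at: Instant::now(),
            expires_at: Instant::now() + Duration::from_secs(3600),
            http_status: 200,
        };
        self.entries.write().await.insert(key.to_string(), entry.clone());
        entry
    }
}
```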
## Extensibility Hooks for Future Policies
- Define a `Policy` trait, e.g. `async fn handle(&self, ctx: &mut RequestContext, next: Next<'_>) -> Result<Response>`. Policies compose Tower-style so they can short-circuit or call the next handler.
- Provide shared context data (request metadata, cache stats, metrics emitters) so policies can make decisions without deep coupling.
- Ship built-in policies: logging, robots enforcement. Later, add rate limiting (token bucket keyed by host), adaptive retry/backoff, response filtering.
- Ensure configuration supports enabling/disabling policies and ordering them as needed.
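
A sketch of how the `Policy` trait and Tower-style chaining could look, using the `async-trait` crate for object safety; the `RequestContext`, `Response`, and `Next` types are simplified stand-ins rather than the final API.

```rust
use async_trait::async_trait;

/// Minimal stand-ins for the shared pipeline types (illustrative).
struct RequestContext {
    host: String,
    path: String,
}

struct Response {
    status: u16,
}

type BoxError = Box<dyn std::error::Error + Send + Sync>;

/// Remaining policies in the chain; calling `run` hands the request onward.
struct Next<'a> {
    rest: &'a [Box<dyn Policy>],
}

impl<'a> Next<'a> {
    async fn run(self, ctx: &mut RequestContext) -> Result<Response, BoxError> {
        match self.rest.split_first() {
            Some((policy, rest)) => policy.handle(ctx, Next { rest }).await,
            // End of the chain: a real pipeline would call the Origin HTTP Client here.
            None => Ok(Response { status: 200 }),
        }
    }
}

#[async_trait]
trait Policy: Send + Sync {
    async fn handle(&self, ctx: &mut RequestContext, next: Next<'_>) -> Result<Response, BoxError>;
}

/// Example policy: short-circuit with 403 when the path hits a disallow prefix.
struct RobotsPolicy {
    disallow: Vec<String>,
}

#[async_trait]
impl Policy for RobotsPolicy {
    async fn handle(&self, ctx: &mut RequestContext, next: Next<'_>) -> Result<Response, BoxError> {
        if self.disallow.iter().any(|prefix| ctx.path.starts_with(prefix)) {
            return Ok(Response { status: 403 });
        }
        next.run(ctx).await
    }
}
```

Because each policy receives `Next`, it can either short-circuit (as the robots example does with 403) or delegate, which is exactly the composition needed for optional, reorderable layers.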
## Technology Choices (tentative)
- **Language/runtime**: Rust on top of Tokio for async IO, leveraging Hyper for both proxy listener and origin client to get strong performance and memory safety.
- **Dependencies**:
- HTTP proxy foundation: `hyper` + `hyper-util` for client/server plumbing, potentially `tower` for middleware ergonomics.
- Robots parser: existing crates such as `robotstxt` or a lightweight custom parser depending on flexibility needs.
- Metrics/logging: `tracing` ecosystem for structured logs and OpenTelemetry exporters, plus `metrics` or `prometheus` crate for counters.
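
A minimal sketch of the listener loop under this tentative stack (Tokio + Hyper 1.x + `hyper-util`); the handler is a stub where the middleware pipeline would plug in, and `.with_upgrades()` keeps the door open for CONNECT tunnelling.

```rust
use std::convert::Infallible;

use http_body_util::Full;
use hyper::{body::{Bytes, Incoming}, server::conn::http1, service::service_fn, Request, Response};
use hyper_util::rt::TokioIo;
use tokio::net::TcpListener;

/// Stub handler standing in for the middleware pipeline.
async fn handle(_req: Request<Incoming>) -> Result<Response<Full<Bytes>>, Infallible> {
    Ok(Response::new(Full::new(Bytes::from("robots proxy placeholder"))))
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let listener = TcpListener::bind("127.0.0.1:8080").await?;
    loop {
        let (stream, _peer) = listener.accept().await?;
        let io = TokioIo::new(stream);
        tokio::spawn(async move {
            // `with_upgrades` keeps the connection usable for CONNECT tunnels later.
            if let Err(err) = http1::Builder::new()
                .serve_connection(io, service_fn(handle))
                .with_upgrades()
                .await
            {
                eprintln!("connection error: {err}");
            }
        });
    }
}
```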
## Testing Strategy
- Unit tests for robots parser wrapper, cache TTL logic, and policy decision outcomes.
- Integration tests with an in-memory HTTP server serving different `robots.txt` permutations, verifying allow/deny behavior and caching.
- Load/smoke tests to ensure the proxy handles concurrent crawler requests and respects TTLs without deadlocks.
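
A self-contained unit-test sketch in the spirit of the first bullet; `parse_disallow` and `is_allowed` are deliberately trivial inline stand-ins for the real parser wrapper, not its actual API.

```rust
/// Collect Disallow prefixes that apply to the given agent (trivial stand-in).
fn parse_disallow(robots_txt: &str, agent: &str) -> Vec<String> {
    let mut rules = Vec::new();
    let mut applies = false;
    for line in robots_txt.lines() {
        let line = line.trim();
        if let Some(ua) = line.strip_prefix("User-agent:") {
            let ua = ua.trim();
            applies = ua == "*" || ua.eq_ignore_ascii_case(agent);
        } else if applies {
            if let Some(path) = line.strip_prefix("Disallow:") {
                let path = path.trim();
                if !path.is_empty() {
                    rules.push(path.to_string());
                }
            }
        }
    }
    rules
}

/// Allow the request unless its path matches a collected Disallow prefix.
fn is_allowed(rules: &[String], path: &str) -> bool {
    !rules.iter().any(|prefix| path.starts_with(prefix))
}

#[cfg(test)]
mod tests {
    use super::*;

    const ROBOTS: &str = "User-agent: *\nDisallow: /private\n\nUser-agent: mycrawler\nDisallow: /admin\n";

    #[test]
    fn disallowed_path_is_blocked() {
        let rules = parse_disallow(ROBOTS, "mycrawler");
        assert!(!is_allowed(&rules, "/admin/settings"));
    }

    #[test]
    fn unlisted_path_is_allowed() {
        let rules = parse_disallow(ROBOTS, "mycrawler");
        assert!(is_allowed(&rules, "/public/page.html"));
    }
}
```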
## Open Questions / Future Work
- Should we allow per-client identities (multiple crawler user-agents) simultaneously?
- How will we persist cache metadata across restarts if needed?
- When implementing rate limiting, does policy need shared state (Redis) for multi-proxy deployments?
- What is the operator UX for cache inspection (CLI, metrics endpoint)?