# Robots Proxy Requirements & Architecture
## Goals
- Provide a forward HTTP proxy that protects upstream origins from buggy crawlers by enforcing each host's `robots.txt` policy before a request is forwarded.
- Maintain a local cache of parsed robots directives per host so that crawlers can aggressively request documents without repeatedly fetching `robots.txt`.
- Offer a configuration surface that lets operators toggle protections, define crawler identity (User-Agent string), and supply sane defaults for timeouts and retries.
- Lay groundwork for future policies such as request rate limiting, automatic backoff, or per-host overrides without redesigning the core proxy.
- Expose observability hooks (structured logs, metrics counters) that make it easy to monitor allowed/blocked requests and cache freshness.
## Non-goals (initial release)
- HTTPS MITM is opt-in: terminating TLS requires operators to supply and trust a root CA and to accept the risks of decrypting crawler traffic. When disabled, the proxy still operates as a transparent CONNECT tunnel, without robots enforcement on encrypted payloads.
- No persistent/shared cache in v1. Everything lives in memory; later we can add Redis or file-backed caches for multi-node deployments.
- No admin UI or control-plane API beyond config files/CLI flags in the first iteration.
## Key Requirements
1. **Proxy behavior**
   - Accept standard HTTP proxy requests and forward them to the origin, returning responses transparently.
   - Support GET/HEAD initially; other verbs can pass through but won't be robots-enforced until defined.
   - Respect upstream response streaming so large payloads do not accumulate in memory.
2. **Robots enforcement**
   - Fetch `robots.txt` for a hostname on first request (or cache miss) using the same proxy client stack.
   - Parse directives per user-agent, falling back to `*` when the configured crawler agent is absent.
   - Deny requests whose path matches a `Disallow` rule before contacting the origin; respond with 403 and log the violation.
   - Operators can choose whether a missing `robots.txt` (HTTP 404) allows all requests or causes the proxy to block them with a 404 response.
3. **Caching strategy**
   - Store parsed directives alongside metadata: fetched timestamp, HTTP status, cache TTL, fetch errors.
   - Re-fetch when the TTL expires or when the origin returned 4xx/5xx during the robots fetch (with exponential backoff to avoid hammering).
   - Provide manual invalidation hooks (CLI or signal) to drop entries without restarting.
4. **Configuration**
   - File-based config (YAML/TOML) or CLI flags specifying listen address, crawler user-agent, cache TTL defaults, concurrency, and timeout settings (a configuration sketch follows this list).
   - Allow per-host overrides (e.g., custom user-agent, forced allow/deny, TTL).
   - Provide a middleware `layers` array so operators can order optional protections (robots, rate limiting, logging) without code changes.
5. **Observability**
   - Structured logs for each request: host, path, action (allowed/blocked/bypassed), cache state (hit/miss/stale), latency.
   - Metrics counters for requests processed, robots fetch attempts, cache hits, denied requests, fetch failures.
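
To make the configuration surface in item 4 concrete, here is a minimal sketch of those settings as plain Rust types. Every name here (`ProxyConfig`, `MissingRobotsPolicy`, `HostOverride`, the `layers` list) is an illustrative assumption rather than the project's actual API; a real implementation would likely derive `serde::Deserialize` so the same shape can be loaded from YAML/TOML or overridden by CLI flags.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// How to treat a host whose `robots.txt` fetch returned 404 (requirement 2).
#[derive(Clone, Copy, Debug)]
enum MissingRobotsPolicy {
    AllowAll, // a missing robots.txt permits every request
    Block,    // answer 404 instead of forwarding
}

/// Optional per-host overrides (requirement 4).
#[derive(Clone, Debug, Default)]
struct HostOverride {
    user_agent: Option<String>,
    force_allow: Option<bool>, // Some(true) = always allow, Some(false) = always deny
    ttl: Option<Duration>,
}

/// Top-level configuration, normally loaded from a YAML/TOML file or CLI flags.
#[derive(Clone, Debug)]
struct ProxyConfig {
    listen_addr: String,        // e.g. "0.0.0.0:8080"
    crawler_user_agent: String, // identity used when matching robots groups
    default_ttl: Duration,      // fallback robots cache TTL
    min_ttl: Duration,
    max_ttl: Duration,
    request_timeout: Duration,
    max_concurrency: usize,
    missing_robots: MissingRobotsPolicy,
    layers: Vec<String>, // ordered middleware names, e.g. ["logging", "robots"]
    host_overrides: HashMap<String, HostOverride>,
}

impl Default for ProxyConfig {
    fn default() -> Self {
        Self {
            listen_addr: "127.0.0.1:8080".into(),
            crawler_user_agent: "example-crawler/1.0".into(),
            default_ttl: Duration::from_secs(3600), // 1 hour fallback, per the cache section
            min_ttl: Duration::from_secs(60),
            max_ttl: Duration::from_secs(24 * 3600),
            request_timeout: Duration::from_secs(30),
            max_concurrency: 64,
            missing_robots: MissingRobotsPolicy::AllowAll,
            layers: vec!["logging".into(), "robots".into()],
            host_overrides: HashMap::new(),
        }
    }
}

fn main() {
    let mut config = ProxyConfig::default();
    config.missing_robots = MissingRobotsPolicy::Block; // fail closed when robots.txt is missing
    config.host_overrides.insert(
        "docs.example.com".into(),
        HostOverride { ttl: Some(Duration::from_secs(600)), ..Default::default() },
    );
    println!("{config:#?}");
}
```

With the derives swapped for `serde` ones, loading this from a file would become a single `toml::from_str` or `serde_yaml::from_str` call.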
## High-Level Architecture
```
+-----------------------+
|    Proxy Listener     |
| (HTTP CONNECT + TLS)  |
+-----------+-----------+
            |
            v
+-----------------------+      +---------------------+
|   Request Pipeline    |----->| Origin HTTP Client  |
|  (middleware chain)   |      | (keepalive pool)    |
+-----+---------+-------+      +---------+-----------+
      |         |                        ^
      |         |                        |
      |   +-----v-----+          +-------+--------+
      |   |  Robots   |<---------|  Robots Cache  |
      |   |  Policy   |          |  (in-memory)   |
      |   +-----------+          +----------------+
      |
      +--> future policies (rate limiting, rewrite, custom auth)
```
### Components
- **Proxy Listener**: Accepts client connections, parses the HTTP proxy protocol, establishes tunnels for HTTPS, and forwards HTTP requests into the pipeline.
- **Request Pipeline**: Ordered middleware stack; each policy can inspect or modify the request/context. Core policies: logging, robots enforcement, response streaming.
- **Robots Policy**: Responsible for locating or fetching directives, evaluating the current request path against them, and either allowing the request to continue or short-circuiting with a denial response (a rule-evaluation sketch follows this list).
- **Robots Cache**: In-memory map keyed by host+scheme storing parsed directives plus metadata. Supports TTL, background refresh, and invalidation signals.
- **Origin HTTP Client**: Performs outbound requests when policies permit them, using connection pooling, timeouts, and retries per config.
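
To make the Robots Policy's decision step concrete, the sketch below evaluates a request path under two simplifying assumptions: directives are reduced to plain path prefixes (no `*`/`$` wildcards), and the most specific match wins with ties going to `Allow`, as in RFC 9309. `RobotsRules`, `select_group`, and `is_allowed` are hypothetical names; in practice an existing parser crate such as `robotstxt` could supply this logic.

```rust
use std::collections::HashMap;

/// Parsed directives for one user-agent group (illustrative subset).
#[derive(Debug, Default)]
struct RobotsRules {
    allow: Vec<String>,
    disallow: Vec<String>,
}

impl RobotsRules {
    /// Longest-match evaluation: the most specific matching rule wins,
    /// ties favour Allow, and no match at all means the path is allowed.
    fn is_allowed(&self, path: &str) -> bool {
        let best = |rules: &[String]| {
            rules
                .iter()
                .filter(|prefix| path.starts_with(prefix.as_str()))
                .map(|prefix| prefix.len())
                .max()
        };
        match (best(self.allow.as_slice()), best(self.disallow.as_slice())) {
            (Some(allow), Some(disallow)) => allow >= disallow,
            (None, Some(_)) => false,
            _ => true,
        }
    }
}

/// Pick the group for the configured crawler agent, falling back to `*`.
fn select_group<'a>(groups: &'a HashMap<String, RobotsRules>, agent: &str) -> Option<&'a RobotsRules> {
    groups.get(&agent.to_ascii_lowercase()).or_else(|| groups.get("*"))
}

fn main() {
    let mut groups = HashMap::new();
    groups.insert(
        "*".to_string(),
        RobotsRules { allow: vec!["/public/".into()], disallow: vec!["/".into()] },
    );
    // The configured agent has no group of its own, so we fall back to `*`.
    let rules = select_group(&groups, "example-crawler").expect("wildcard group");
    assert!(rules.is_allowed("/public/page.html")); // the longer Allow prefix wins
    assert!(!rules.is_allowed("/private/secret")); // only Disallow matches
    println!("robots evaluation sketch OK");
}
```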
## Request Flow
1. The client sends an HTTP request to the proxy (e.g., `GET http://example.com/page`).
2. The Proxy Listener normalizes the request and creates a context object containing host, path, scheme, and headers.
3. Logging middleware records the start event and attaches a request ID.
4. The robots policy extracts the host and consults the cache (a concurrency sketch follows this list):
   - **Cache hit + fresh**: Evaluate the rules immediately.
   - **Cache miss/stale**: Acquire the per-host lock, fetch `robots.txt`, parse it, update the cache, then evaluate.
5. If the path is disallowed, respond with 403, emit a log entry and metric, and terminate the pipeline.
6. If allowed, pass the request to later policies (future rate limiting, tracing) before handing off to the Origin HTTP Client.
7. The origin client forwards the request and streams the response back to the client, logging completion and updating metrics.
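
Step 4 is where the concurrency care lives. The sketch below shows one possible shape using Tokio primitives: each host gets its own slot guarded by its own `Mutex`, so simultaneous requests for the same host trigger exactly one `robots.txt` fetch while other hosts proceed independently. `RobotsCache`, `CachedRules`, and `fetch_and_parse` are placeholders, not the project's real types.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::Mutex;

#[derive(Clone, Debug)]
struct CachedRules {
    disallow: Vec<String>, // simplified: prefix rules for the chosen agent group
    fetched_at: Instant,
    ttl: Duration,
}

impl CachedRules {
    fn is_fresh(&self) -> bool {
        self.fetched_at.elapsed() < self.ttl
    }
    fn allows(&self, path: &str) -> bool {
        !self.disallow.iter().any(|prefix| path.starts_with(prefix.as_str()))
    }
}

/// Per-host entries behind per-host locks, so a miss triggers exactly one fetch.
#[derive(Default)]
struct RobotsCache {
    hosts: Mutex<HashMap<String, Arc<Mutex<Option<CachedRules>>>>>,
}

impl RobotsCache {
    async fn check(&self, host: &str, path: &str) -> bool {
        // Grab (or create) this host's slot, then release the outer map lock.
        let slot = {
            let mut hosts = self.hosts.lock().await;
            hosts.entry(host.to_string()).or_default().clone()
        };
        let mut entry = slot.lock().await; // serialises fetches for this host only
        let stale = (*entry).as_ref().map_or(true, |rules| !rules.is_fresh());
        if stale {
            // Cache miss or expired entry: fetch while holding only this host's lock.
            *entry = Some(fetch_and_parse(host).await);
        }
        (*entry).as_ref().map_or(true, |rules| rules.allows(path))
    }
}

/// Stand-in for the real robots.txt fetch + parse performed via the origin client.
async fn fetch_and_parse(_host: &str) -> CachedRules {
    CachedRules {
        disallow: vec!["/admin/".to_string()],
        fetched_at: Instant::now(),
        ttl: Duration::from_secs(3600),
    }
}

#[tokio::main]
async fn main() {
    let cache = RobotsCache::default();
    assert!(cache.check("example.com", "/page").await); // allowed: forward to origin
    assert!(!cache.check("example.com", "/admin/panel").await); // disallowed: answer 403
}
```

Holding one lock per host (rather than a single global lock) keeps a slow robots fetch for one origin from stalling requests to every other origin.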
## Robots Cache & Fetch Strategy
- **Key**: `<scheme>://<host>:<port>`, ensuring separate entries for HTTP vs HTTPS or custom ports.
- **Value**: Parsed directives (allow/disallow lists with priority), crawl-delay info, sitemap references, and metadata {fetched_at, expires_at, http_status, etag/last-modified}.
- **Fetching**:
  - Skip a preliminary HEAD probe; a plain GET of `/robots.txt` with standard timeouts is simpler.
  - Respect `Cache-Control` headers when setting the TTL; the fallback TTL is configurable (e.g., 1 hour) with min/max bounds (a TTL sketch follows this list).
  - On 404: treat as "allow all" (or block, per the configurable missing-robots behavior) and cache with a short TTL. On 401/403/5xx: treat as a temporary failure, optionally fail closed (configurable), and retry later.
- **Concurrency**: Use a per-host mutex/future to avoid stampedes when multiple crawler requests arrive simultaneously.
- **Invalidation**: A CLI command or signal (e.g., `SIGHUP`) drops either all cache entries or a specific host.
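
A small sketch of the key and TTL handling described above, assuming a straightforward parse of `Cache-Control: max-age=N` clamped to configured bounds. The helper names (`cache_key`, `robots_ttl`, `RobotsMeta`) are illustrative only.

```rust
use std::time::{Duration, SystemTime};

/// Cache key: scheme + host + port, so HTTP and HTTPS entries never collide.
fn cache_key(scheme: &str, host: &str, port: u16) -> String {
    format!("{scheme}://{host}:{port}")
}

/// Derive a TTL from `Cache-Control: max-age=N` when present, otherwise use the
/// fallback, and clamp the result into the configured [min, max] bounds.
fn robots_ttl(cache_control: Option<&str>, fallback: Duration, min: Duration, max: Duration) -> Duration {
    let from_header = cache_control.and_then(|value| {
        value
            .split(',')
            .filter_map(|directive| directive.trim().strip_prefix("max-age="))
            .filter_map(|secs| secs.parse::<u64>().ok())
            .map(Duration::from_secs)
            .next()
    });
    from_header.unwrap_or(fallback).clamp(min, max)
}

/// Metadata stored next to the parsed directives.
#[derive(Debug)]
struct RobotsMeta {
    fetched_at: SystemTime,
    expires_at: SystemTime,
    http_status: u16,
    etag: Option<String>,
}

fn main() {
    let ttl = robots_ttl(
        Some("public, max-age=600"),
        Duration::from_secs(3600),   // fallback: 1 hour
        Duration::from_secs(60),     // min bound
        Duration::from_secs(86_400), // max bound
    );
    assert_eq!(ttl, Duration::from_secs(600));

    let now = SystemTime::now();
    let meta = RobotsMeta { fetched_at: now, expires_at: now + ttl, http_status: 200, etag: None };
    println!("{} -> {:?}", cache_key("https", "example.com", 443), meta);
}
```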
## Extensibility Hooks for Future Policies
- Define a `Policy` trait, e.g. `async fn handle(&self, ctx: &mut RequestContext, next: Next<'_>) -> Result<Response>`. Policies compose Tower-style so they can short-circuit or call the next handler (a minimal composition sketch follows this list).
- Provide shared context data (request metadata, cache stats, metrics emitters) so policies can make decisions without deep coupling.
- Ship built-in policies: logging, robots enforcement. Later, add rate limiting (token bucket keyed by host), adaptive retry/backoff, and response filtering.
- Ensure configuration supports enabling/disabling policies and ordering them as needed.
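
Here is one way the `Policy` trait could compose, written with explicitly boxed futures (rather than the `async fn` shown above) so a chain can be stored as `Vec<Box<dyn Policy>>`. Treat the names and shapes as a sketch, not the final API; the request and response types are stand-ins.

```rust
use std::future::Future;
use std::pin::Pin;

type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

/// Minimal stand-ins for the real request context and response types.
#[derive(Debug)]
struct RequestContext {
    host: String,
    path: String,
}
#[derive(Debug)]
struct Response {
    status: u16,
}

/// Tower-style policy: inspect the context, then short-circuit or call `next`.
trait Policy: Send + Sync {
    fn handle<'a>(&'a self, ctx: &'a mut RequestContext, next: Next<'a>) -> BoxFuture<'a, Response>;
}

/// The remaining policies; an empty chain stands in for the origin client.
struct Next<'a> {
    rest: &'a [Box<dyn Policy>],
}

impl<'a> Next<'a> {
    fn run(self, ctx: &'a mut RequestContext) -> BoxFuture<'a, Response> {
        match self.rest.split_first() {
            Some((policy, rest)) => policy.handle(ctx, Next { rest }),
            // End of the chain: pretend the origin client answered 200.
            None => Box::pin(async { Response { status: 200 } }),
        }
    }
}

/// Example policy: deny anything under /admin/ before it reaches the origin.
struct RobotsPolicy;

impl Policy for RobotsPolicy {
    fn handle<'a>(&'a self, ctx: &'a mut RequestContext, next: Next<'a>) -> BoxFuture<'a, Response> {
        if ctx.path.starts_with("/admin/") {
            Box::pin(async { Response { status: 403 } }) // short-circuit with a denial
        } else {
            next.run(ctx) // allowed: continue down the chain
        }
    }
}

#[tokio::main]
async fn main() {
    let chain: Vec<Box<dyn Policy>> = vec![Box::new(RobotsPolicy)];
    let mut ctx = RequestContext { host: "example.com".into(), path: "/admin/panel".into() };
    let response = Next { rest: chain.as_slice() }.run(&mut ctx).await;
    assert_eq!(response.status, 403);
    println!("{} {} -> {}", ctx.host, ctx.path, response.status);
}
```

Boxing the returned future is the usual price for an object-safe async trait; if the project leans on `tower` directly, `tower::Service` plus its layer machinery would play the same role.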
## Technology Choices (tentative)
- **Language/runtime**: Rust on top of Tokio for async IO, leveraging Hyper for both the proxy listener and the origin client to get strong performance and memory safety.
- **Dependencies**:
  - HTTP proxy foundation: `hyper` + `hyper-util` for client/server plumbing, potentially `tower` for middleware ergonomics.
  - Robots parser: an existing crate such as `robotstxt`, or a lightweight custom parser, depending on flexibility needs.
  - Metrics/logging: the `tracing` ecosystem for structured logs and OpenTelemetry exporters, plus the `metrics` or `prometheus` crate for counters.
## Testing Strategy
- Unit tests for the robots parser wrapper, cache TTL logic, and policy decision outcomes (a test sketch follows this list).
- Integration tests with an in-memory HTTP server serving different `robots.txt` permutations, verifying allow/deny behavior and caching.
- Load/smoke tests to ensure the proxy handles concurrent crawler requests and respects TTLs without deadlocks.
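
As a flavour of the "policy decision outcomes" tests, here is a self-contained, library-style sketch built around a deliberately simplified `decide` helper; the real tests would exercise the actual parser wrapper and cache types instead of this stand-in.

```rust
/// Decision the robots policy hands back to the pipeline.
#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Deny, // answered with 403 before contacting the origin
}

/// Simplified decision helper used only by this sketch.
fn decide(disallow: &[&str], path: &str) -> Decision {
    if disallow.iter().any(|prefix| path.starts_with(*prefix)) {
        Decision::Deny
    } else {
        Decision::Allow
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn disallowed_path_is_denied() {
        assert_eq!(decide(&["/private/"], "/private/report.pdf"), Decision::Deny);
    }

    #[test]
    fn unrelated_path_is_allowed() {
        assert_eq!(decide(&["/private/"], "/blog/post"), Decision::Allow);
    }

    #[test]
    fn empty_rule_set_allows_everything() {
        assert_eq!(decide(&[], "/anything"), Decision::Allow);
    }
}
```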
## Open Questions / Future Work
- Should we allow per-client identities (multiple crawler user-agents) simultaneously?
- How will we persist cache metadata across restarts if needed?
- When implementing rate limiting, will policies need shared state (e.g., Redis) for multi-proxy deployments?
- What is the operator UX for cache inspection (CLI, metrics endpoint)?