Robots Proxy Project Checklist

  • Requirements & architecture doc define proxy features, request flow, robots caching strategy, and interfaces for future policies. (See docs/architecture.md.)
  • Skeleton repo set up runtime, dependencies, linters, test harness, and CI stub.
  • Core proxy scaffold implement a basic forward proxy that logs and transparently forwards requests/responses.
  • Robots policy module fetch and cache robots.txt per host, parse rules, and enforce disallow logic per user-agent.
  • Config + middleware wiring enable robots enforcement via config and add hooks for logging and future rate limiting policies.
  • Tests + tooling add unit tests for parser/cache plus an integration test that checks enforcement against a local server.
  • Documentation expand README with setup, usage, and roadmap for rate limiting, admin API, and distributed cache options.

Setup

Requires Rust 1.75+ (stable toolchain). Clone the repo and run:

cargo build

Running the proxy

cargo run --release -- --config config.yaml

If no config path is provided, the proxy loads config.{yaml,toml,json} if present and falls back to environment variables (ROBOTS_PROXY__*).
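
A rough sketch of that lookup order, assuming the config crate (the actual loader in src/settings.rs may be organized differently): an explicit --config path wins, otherwise config.{yaml,toml,json} is probed, and ROBOTS_PROXY__* environment variables override file values.

use config::{Config, ConfigError, Environment, File};

fn load_settings(explicit_path: Option<&str>) -> Result<Config, ConfigError> {
    let mut builder = Config::builder();
    builder = match explicit_path {
        // A --config path given on the command line takes precedence.
        Some(path) => builder.add_source(File::with_name(path)),
        // Otherwise probe config.yaml / config.toml / config.json if present.
        None => builder.add_source(File::with_name("config").required(false)),
    };
    // ROBOTS_PROXY__SECTION__KEY environment variables override file values.
    builder
        .add_source(
            Environment::with_prefix("ROBOTS_PROXY")
                .prefix_separator("__")
                .separator("__"),
        )
        .build()
}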

Configuration

Defaults are defined in src/settings.rs (also described in docs/architecture.md). Key sections:

  • listen_addr: interface/port to bind.
  • user_agent: crawler identity for robots.txt evaluation.
  • robots: enable flag, cache TTL, and fail_open toggle.
  • robots.fail_open: when true, network/HTTP failures fetching robots.txt (timeouts, 5xx, DNS errors) allow the request to continue; when false, those failures return a 403.
  • robots.missing_policy: choose whether a missing robots.txt (HTTP 404) blocks the request (deny) or is treated as allow (allow). Both behaviors are sketched in the code after this list.
  • middleware.layers: ordered middleware list (current kinds: robots, logging, rate_limit). Unknown kinds are skipped with a log warning so you can preconfigure upcoming policies.
  • mitm: controls HTTPS interception. Set enabled: true, point to a CA cert/key pair, and set per-host cert cache TTL + validity window.
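
The robots failure handling boils down to a small decision. Here is a rough sketch with illustrative types only (RobotsConfig, MissingPolicy, FetchOutcome, and may_continue are not the project's actual names):

enum MissingPolicy { Allow, Deny }

struct RobotsConfig {
    fail_open: bool,
    missing_policy: MissingPolicy,
}

enum FetchOutcome {
    Fetched(String), // robots.txt body retrieved; the parsed rules decide later
    Missing,         // HTTP 404: the host publishes no robots.txt
    Failed,          // timeout, 5xx, DNS error, ...
}

/// Whether the request may continue (true) or is rejected with a 403 (false).
fn may_continue(cfg: &RobotsConfig, outcome: &FetchOutcome) -> bool {
    match outcome {
        FetchOutcome::Fetched(_) => true, // actual allow/deny comes from the rules
        FetchOutcome::Missing => matches!(cfg.missing_policy, MissingPolicy::Allow),
        FetchOutcome::Failed => cfg.fail_open,
    }
}

fn main() {
    let cfg = RobotsConfig { fail_open: false, missing_policy: MissingPolicy::Allow };
    assert!(may_continue(&cfg, &FetchOutcome::Missing)); // 404 -> allowed
    assert!(!may_continue(&cfg, &FetchOutcome::Failed)); // fetch error -> 403
}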

Example config:

listen_addr: 0.0.0.0:8080
user_agent: my-crawler
robots:
  enabled: true
  cache_ttl_secs: 600
  fail_open: false
  missing_policy: allow # or "deny" to reject when robots.txt is missing
middleware:
  layers:
    - kind: robots
      enabled: true
    - kind: logging
      enabled: false
mitm:
  enabled: true
  ca_cert_path: ./dev_ca.pem
  ca_key_path: ./dev_ca.key
  cert_cache_ttl_secs: 600
  cert_validity_days: 7

Generating and installing the MITM CA

  1. Create a private CA: generate a keypair and a self-signed root certificate, and store both securely. Example (adjust subject fields as needed):
    openssl req -x509 -newkey rsa:4096 -days 365 -nodes \
      -keyout dev_ca.key -out dev_ca.pem \
      -subj "/CN=robots-proxy-dev-ca"
    
    The proxy only needs PEM files on disk; guard dev_ca.key with OS file permissions or a secrets manager because compromise lets attackers mint trusted certificates.
  2. Point the proxy at the CA: update mitm.ca_cert_path/mitm.ca_key_path in your config to reference the files above and restart the proxy. The proxy then issues leaf certs per CONNECT target using this CA.
  3. Distribute the root to clients: crawlers must trust the proxy CA or TLS handshakes will fail. Install dev_ca.pem into each client's trust store, e.g.:
    • macOS: sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain dev_ca.pem
    • Debian/Ubuntu: copy the PEM to /usr/local/share/ca-certificates/robots-proxy.crt, then run sudo update-ca-certificates
    • Custom crawler bundles: append the PEM to the library/application trust bundle and redeploy. When automating, version the CA, rotate it periodically, and remove stale trust entries as part of each rotation.
  4. Verify and monitor: run an HTTPS request through the proxy with an agent that trusts the root and confirm robots decisions appear in the logs (see the sketch below). Keep the CA private key offline or in an HSM for production deployments.
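
For the verification in step 4, a small standalone client that trusts dev_ca.pem and routes through the proxy is enough. The sketch below is not part of the proxy codebase; it assumes the reqwest crate with its blocking feature, and the proxy address and target URL are placeholders:

// Trust the dev CA, send an HTTPS request through the proxy, and print the
// status. A robots-disallowed path should come back as 403 when enforcement
// is active.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let ca = std::fs::read("dev_ca.pem")?;
    let client = reqwest::blocking::Client::builder()
        .add_root_certificate(reqwest::Certificate::from_pem(&ca)?)
        .proxy(reqwest::Proxy::all("http://127.0.0.1:8080")?)
        .build()?;

    let resp = client.get("https://example.com/").send()?;
    println!("status through proxy: {}", resp.status());
    Ok(())
}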

Development workflow

  • cargo fmt
  • cargo clippy --all-targets --all-features
  • cargo test

CI (.github/workflows/ci.yml) runs all of the above on push/PR. docs/architecture.md outlines the design, middleware pipeline, and open questions.

Adding a middleware layer

  1. Implement a Tower Layer/Service in src/proxy/service.rs (or a new module) that maps Request<Incoming> to ProxyResponse; a minimal pass-through example follows this list.
  2. Add a new LayerKind variant in src/settings.rs and expose configuration knobs (e.g., rate limits, logging options).
  3. Update build_service to recognize the new kind, stack your layer, and ensure it is guarded behind a config flag so it can be toggled/ordered via middleware.layers.
  4. Write unit tests for the layer's behavior (similar to the validation/robots tests already in src/proxy/service.rs).
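
A pass-through Tower layer makes a reasonable starting point. In the sketch below, NoopLayer and NoopMiddleware are placeholder names; the real layers in src/proxy/service.rs use the project's concrete request/response types:

use std::task::{Context, Poll};
use tower::{Layer, Service};

#[derive(Clone)]
pub struct NoopLayer;

impl<S> Layer<S> for NoopLayer {
    type Service = NoopMiddleware<S>;
    fn layer(&self, inner: S) -> Self::Service {
        NoopMiddleware { inner }
    }
}

#[derive(Clone)]
pub struct NoopMiddleware<S> {
    inner: S,
}

impl<S, Req> Service<Req> for NoopMiddleware<S>
where
    S: Service<Req>,
{
    type Response = S::Response;
    type Error = S::Error;
    type Future = S::Future;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        self.inner.poll_ready(cx)
    }

    fn call(&mut self, req: Req) -> Self::Future {
        // Inspect or mutate `req` here (rate limiting, logging, etc.)
        // before delegating to the wrapped service.
        self.inner.call(req)
    }
}

Registering the layer then comes down to steps 2 and 3 above: add a LayerKind variant, stack the layer in build_service, and guard it behind its config flag so it can be toggled and ordered via middleware.layers.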

Integration tests

tests/proxy_integration.rs exercises the proxy service end-to-end by spinning up a local HTTP origin server and calling the middleware pipeline. Run via cargo test --test proxy_integration. If the environment forbids binding loopback sockets (e.g., in sandboxed CI), the tests skip automatically.
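
The skip behavior hinges on whether a loopback listener can be bound at all. Roughly, with illustrative test and helper names (the real code in tests/proxy_integration.rs may differ):

use std::net::TcpListener;

#[test]
fn robots_enforced_end_to_end() {
    // Port 0 asks the OS for any free port; failure usually means the
    // sandbox forbids binding sockets, so the test bails out quietly.
    let origin = match TcpListener::bind("127.0.0.1:0") {
        Ok(listener) => listener,
        Err(err) => {
            eprintln!("skipping integration test: cannot bind loopback ({err})");
            return;
        }
    };
    let origin_addr = origin.local_addr().unwrap();

    // ... start the origin server on `origin`, build the proxy service,
    // request a robots-disallowed path, and assert the proxy returns 403 ...
    let _ = origin_addr;
}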

Roadmap

  • Rate limiting middleware (per-host token bucket with optional Redis backend)
  • Logging/tracing middleware wired to tracing subscribers
  • Admin/API for cache inspection and manual invalidation
  • Distributed cache support (Redis or persistent store)
  • Integration tests hitting a local HTTP server to validate robots enforcement end-to-end

HTTPS MITM feature plan

  • Research MITM TLS requirements: root CA generation, secure storage, and client trust distribution.
  • Design certificate issuance pipeline: per-origin leaf cert caching, CSR signing, rotation strategy.
  • Integrate TLS termination for CONNECT: accept CONNECT, perform TLS handshake with client using generated cert, establish upstream TLS, pipe data.
  • Ensure robots enforcement hooks into decrypted HTTP traffic (HTTP/1.1 initially).
  • Add configuration for MITM (enable flag, CA paths, key protections) and document security considerations.
  • Extend tests to cover MITM flow (unit tests for cert issuance, integration test with trusted root).