# Robots Proxy Project Checklist
- [x] Requirements & architecture doc: defines proxy features, request flow, robots caching strategy, and interfaces for future policies. (See `docs/architecture.md`.)
- [x] Skeleton repo: sets up the runtime, dependencies, linters, test harness, and CI stub.
- [x] Core proxy scaffold: implements a basic forward proxy that logs and transparently forwards requests/responses.
- [x] Robots policy module: fetches and caches robots.txt per host, parses rules, and enforces disallow logic per user-agent.
- [x] Config + middleware wiring: enables robots enforcement via config and adds hooks for logging and future rate-limiting policies.
- [x] Tests + tooling: adds unit tests for the parser/cache plus an integration test that checks enforcement against a local server.
- [ ] Documentation: expand the README with setup, usage, and a roadmap for rate limiting, an admin API, and distributed cache options.
## Setup
Requires Rust 1.75+ (stable toolchain). Clone the repo and run:
```sh
cargo build
```
## Running the proxy
```sh
cargo run --release -- --config config.yaml
```
If no config path is provided, the proxy loads `config.{yaml,toml,json}` if present and falls back to environment variables (`ROBOTS_PROXY__*`).
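For example, hypothetical overrides via environment variables (the `ROBOTS_PROXY__` prefix is given above; the `__` separator for nested keys is an assumption about the settings loader):
```sh
# Hypothetical: override the listen address and a nested robots flag.
export ROBOTS_PROXY__LISTEN_ADDR=127.0.0.1:8080
export ROBOTS_PROXY__ROBOTS__FAIL_OPEN=true
cargo run --release
```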
## Configuration
Defaults are defined in `src/settings.rs` (also described in `docs/architecture.md`). Key sections:
- `listen_addr`: interface/port to bind.
- `user_agent`: crawler identity for robots.txt evaluation.
- `robots`: enable flag, cache TTL, and the failure-handling toggles below.
  - `robots.fail_open`: when `true`, network/HTTP failures fetching `robots.txt` (timeouts, 5xx responses, DNS errors) allow the request to continue; when `false`, those failures return a 403.
  - `robots.missing_policy`: whether a missing `robots.txt` (HTTP 404) blocks the request (`deny`) or is treated as an allow (`allow`).
- `middleware.layers`: ordered middleware list (current kinds: `robots`, `logging`, `rate_limit`). Unknown kinds are skipped with a log warning so you can preconfigure upcoming policies.
- `mitm`: controls HTTPS interception. Set `enabled: true`, point to a CA cert/key pair, and set per-host cert cache TTL + validity window.
Example config:
```yaml
listen_addr: 0.0.0.0:8080
user_agent: my-crawler
robots:
  enabled: true
  cache_ttl_secs: 600
  fail_open: false
  missing_policy: allow # or "deny" to reject when robots.txt is missing
middleware:
  layers:
    - kind: robots
      enabled: true
    - kind: logging
      enabled: false
mitm:
  enabled: true
  ca_cert_path: ./dev_ca.pem
  ca_key_path: ./dev_ca.key
  cert_cache_ttl_secs: 600
  cert_validity_days: 7
```
### Generating and installing the MITM CA
1. **Create a private CA.** Generate a keypair and a self-signed root certificate, and store both securely. Example (adjust subject fields as needed):
```sh
openssl req -x509 -newkey rsa:4096 -days 365 -nodes \
  -keyout dev_ca.key -out dev_ca.pem \
  -subj "/CN=robots-proxy-dev-ca"
```
The proxy only needs PEM files on disk; guard `dev_ca.key` with OS file permissions or a secrets manager because compromise lets attackers mint trusted certificates.
2. **Point the proxy at the CA.** Update `mitm.ca_cert_path`/`mitm.ca_key_path` in your config to reference the files above and restart the proxy. The proxy then issues leaf certs per CONNECT target using this CA.
3. **Distribute the root to clients.** Crawlers must trust the proxy CA or TLS handshakes will fail. Install `dev_ca.pem` into each client's trust store, e.g.:
- macOS: `sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain dev_ca.pem`
- Debian/Ubuntu: copy the PEM to `/usr/local/share/ca-certificates/robots-proxy.crt` then run `sudo update-ca-certificates`
- Custom crawler bundles: append the PEM to the library/application trust bundle and redeploy.
When automating, version the CA, roll it periodically, and remove stale trust entries when rotating.
4. **Verify and monitor.** Run an HTTPS request through the proxy with a client that trusts the root and confirm that robots decisions appear in the logs. Keep the CA private key offline or in an HSM for production deployments.
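A quick smoke test, assuming the proxy listens on port 8080 as in the sample config (`--cacert` points `curl` at the dev root without installing it system-wide):
```sh
# Tunnel an HTTPS request through the proxy; the leaf cert it mints for
# example.com should validate against the dev CA.
curl --proxy http://127.0.0.1:8080 --cacert dev_ca.pem https://example.com/
```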
## Development workflow
- `cargo fmt`
- `cargo clippy --all-targets --all-features`
- `cargo test`
CI (`.github/workflows/ci.yml`) runs all of the above on push/PR. `docs/architecture.md` outlines the design, middleware pipeline, and open questions.
### Adding a middleware layer
1. Implement a Tower `Layer`/`Service` in `src/proxy/service.rs` (or a new module) that wraps `Request<Incoming>` → `ProxyResponse`; a minimal sketch follows this list.
2. Add a new `LayerKind` variant in `src/settings.rs` and expose configuration knobs (e.g., rate limits, logging options).
3. Update `build_service` to recognize the new kind, stack your layer, and ensure it is guarded behind a config flag so it can be toggled/ordered via `middleware.layers`.
4. Write unit tests for the layer's behavior (similar to the validation/robots tests already in `src/proxy/service.rs`).
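A minimal sketch of step 1, assuming only the Tower traits the pipeline builds on; `LogLayer`/`LogService` are hypothetical names, and a real layer would wrap the crate's `Request<Incoming>`/`ProxyResponse` types rather than a generic request:
```rust
use std::task::{Context, Poll};
use tower::{Layer, Service};

/// Hypothetical layer that logs each request before delegating to the
/// wrapped service; a real implementation would use `tracing` instead.
#[derive(Clone)]
pub struct LogLayer;

impl<S> Layer<S> for LogLayer {
    type Service = LogService<S>;

    fn layer(&self, inner: S) -> Self::Service {
        LogService { inner }
    }
}

#[derive(Clone)]
pub struct LogService<S> {
    inner: S,
}

impl<S, Req> Service<Req> for LogService<S>
where
    S: Service<Req>,
    Req: std::fmt::Debug,
{
    type Response = S::Response;
    type Error = S::Error;
    type Future = S::Future;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        self.inner.poll_ready(cx)
    }

    fn call(&mut self, req: Req) -> Self::Future {
        // Inspect (or rewrite) the request here before it reaches the inner service.
        println!("proxying request: {req:?}");
        self.inner.call(req)
    }
}
```
Keeping the service generic over the request type makes the layer easy to unit-test against a stub inner service.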
### Integration tests
`tests/proxy_integration.rs` exercises the proxy service end-to-end by spinning up a local HTTP origin server and calling the middleware pipeline. Run via `cargo test --test proxy_integration`. If the environment forbids binding loopback sockets (e.g., in sandboxed CI), the tests skip automatically.
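The skip guard can be as simple as this sketch (an assumed pattern for illustration, not the actual test code):
```rust
use std::net::TcpListener;

#[test]
fn proxy_forwards_requests() {
    // Sandboxed environments may forbid binding loopback sockets; bail out
    // gracefully instead of failing the suite.
    let listener = match TcpListener::bind("127.0.0.1:0") {
        Ok(l) => l,
        Err(e) => {
            eprintln!("skipping integration test: cannot bind loopback: {e}");
            return;
        }
    };
    // The real test would serve a stub origin on `listener` and drive the
    // middleware pipeline against its address.
    let origin_addr = listener.local_addr().expect("listener has a local address");
    assert!(origin_addr.port() > 0);
}
```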
## Roadmap
- Rate limiting middleware (per-host token bucket with optional Redis backend)
- Logging/tracing middleware wired to `tracing` subscribers
- Admin/API for cache inspection and manual invalidation
- Distributed cache support (Redis or persistent store)
- Integration tests hitting a local HTTP server to validate robots enforcement end-to-end
## HTTPS MITM feature plan
- [x] Research MITM TLS requirements: root CA generation, secure storage, and client trust distribution.
- [x] Design certificate issuance pipeline: per-origin leaf cert caching, CSR signing, rotation strategy.
- [x] Integrate TLS termination for CONNECT: accept CONNECT, perform TLS handshake with client using generated cert, establish upstream TLS, pipe data.
- [x] Ensure robots enforcement hooks into decrypted HTTP traffic (HTTP/1.1 initially).
- [x] Add configuration for MITM (enable flag, CA paths, key protections) and document security considerations.
- [x] Extend tests to cover MITM flow (unit tests for cert issuance, integration test with trusted root).