# Robots Proxy Project Checklist
- [x] Requirements & architecture doc: defines proxy features, request flow, robots caching strategy, and interfaces for future policies. (See `docs/architecture.md`.)
- [x] Skeleton repo: sets up the runtime, dependencies, linters, test harness, and CI stub.
- [x] Core proxy scaffold: implements a basic forward proxy that logs and transparently forwards requests/responses.
- [x] Robots policy module: fetches and caches robots.txt per host, parses rules, and enforces disallow logic per user-agent.
- [x] Config + middleware wiring: enables robots enforcement via config and adds hooks for logging and future rate-limiting policies.
- [x] Tests + tooling: adds unit tests for the parser/cache plus an integration test that checks enforcement against a local server.
- [ ] Documentation: expand the README with setup, usage, and a roadmap for rate limiting, an admin API, and distributed cache options.
## Setup
Requires Rust 1.75+ (stable toolchain). Clone the repo and run:
```sh
cargo build
```
## Running the proxy
```sh
cargo run --release -- --config config.yaml
```
If no config path is provided, the proxy loads `config.{yaml,toml,json}` if present and falls back to environment variables (`ROBOTS_PROXY__*`).
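For example, hypothetical overrides via environment variables (the `ROBOTS_PROXY__` prefix is given above; the `__` separator for nested keys is an assumption about the settings loader):
```sh
# Hypothetical: override the listen address and a nested robots flag.
export ROBOTS_PROXY__LISTEN_ADDR=127.0.0.1:8080
export ROBOTS_PROXY__ROBOTS__FAIL_OPEN=true
cargo run --release
```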
## Configuration
Defaults are defined in `src/settings.rs` (also described in `docs/architecture.md`). Key sections:
- `listen_addr`: interface/port to bind.
- `user_agent`: crawler identity for robots.txt evaluation.
- `robots`: enable flag, cache TTL, and the failure-handling toggles below.
  - `robots.fail_open`: when `true`, network/HTTP failures fetching `robots.txt` (timeouts, 5xx responses, DNS errors) allow the request to continue; when `false`, those failures return a 403.
  - `robots.missing_policy`: whether a missing `robots.txt` (HTTP 404) blocks the request (`deny`) or is treated as an allow (`allow`).
- `middleware.layers`: ordered middleware list (current kinds: `robots`, `logging`, `rate_limit`). Unknown kinds are skipped with a log warning so you can preconfigure upcoming policies.
- `mitm`: controls HTTPS interception. Set `enabled: true`, point to a CA cert/key pair, and set per-host cert cache TTL + validity window.
Example config:
```yaml
listen_addr: 0.0.0.0:8080
user_agent: my-crawler
robots:
  enabled: true
  cache_ttl_secs: 600
  fail_open: false
  missing_policy: allow # or "deny" to reject when robots.txt is missing
middleware:
  layers:
    - kind: robots
      enabled: true
    - kind: logging
      enabled: false
mitm:
  enabled: true
  ca_cert_path: ./dev_ca.pem
  ca_key_path: ./dev_ca.key
  cert_cache_ttl_secs: 600
  cert_validity_days: 7
```
### Generating and installing the MITM CA
1. **Create a private CA.** Generate a keypair and a self-signed root certificate, and store both securely. Example (adjust subject fields as needed):
```sh
openssl req -x509 -newkey rsa:4096 -days 365 -nodes \
  -keyout dev_ca.key -out dev_ca.pem \
  -subj "/CN=robots-proxy-dev-ca"
```
The proxy only needs PEM files on disk; guard `dev_ca.key` with OS file permissions or a secrets manager because compromise lets attackers mint trusted certificates.
2. **Point the proxy at the CA.** Update `mitm.ca_cert_path`/`mitm.ca_key_path` in your config to reference the files above and restart the proxy. The proxy then issues leaf certs per CONNECT target using this CA.
3. **Distribute the root to clients.** Crawlers must trust the proxy CA or TLS handshakes will fail. Install `dev_ca.pem` into each client's trust store, e.g.:
- macOS: `sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain dev_ca.pem`
- Debian/Ubuntu: copy the PEM to `/usr/local/share/ca-certificates/robots-proxy.crt` then run `sudo update-ca-certificates`
- Custom crawler bundles: append the PEM to the library/application trust bundle and redeploy.
When automating, version the CA, roll it periodically, and remove stale trust entries when rotating.
4. **Verify and monitor.** Run an HTTPS request through the proxy with a client that trusts the root and confirm that robots decisions appear in the logs. Keep the CA private key offline or in an HSM for production deployments.
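A quick smoke test, assuming the proxy listens on port 8080 as in the sample config (`--cacert` points `curl` at the dev root without installing it system-wide):
```sh
# Tunnel an HTTPS request through the proxy; the leaf cert it mints for
# example.com should validate against the dev CA.
curl --proxy http://127.0.0.1:8080 --cacert dev_ca.pem https://example.com/
```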
## Development workflow
- `cargo fmt`
- `cargo clippy --all-targets --all-features`
- `cargo test`
CI (`.github/workflows/ci.yml`) runs all of the above on push/PR. `docs/architecture.md` outlines the design, middleware pipeline, and open questions.
### Adding a middleware layer
1. Implement a Tower `Layer`/`Service` in `src/proxy/service.rs` (or a new module) that wraps `Request<Incoming>` → `ProxyResponse`; a minimal sketch follows this list.
2. Add a new `LayerKind` variant in `src/settings.rs` and expose configuration knobs (e.g., rate limits, logging options).
3. Update `build_service` to recognize the new kind, stack your layer, and ensure it is guarded behind a config flag so it can be toggled/ordered via `middleware.layers`.
4. Write unit tests for the layer's behavior (similar to the validation/robots tests already in `src/proxy/service.rs`).
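A minimal sketch of step 1, assuming only the Tower traits the pipeline builds on; `LogLayer`/`LogService` are hypothetical names, and a real layer would wrap the crate's `Request<Incoming>`/`ProxyResponse` types rather than a generic request:
```rust
use std::task::{Context, Poll};
use tower::{Layer, Service};

/// Hypothetical layer that logs each request before delegating to the
/// wrapped service; a real implementation would use `tracing` instead.
#[derive(Clone)]
pub struct LogLayer;

impl<S> Layer<S> for LogLayer {
    type Service = LogService<S>;

    fn layer(&self, inner: S) -> Self::Service {
        LogService { inner }
    }
}

#[derive(Clone)]
pub struct LogService<S> {
    inner: S,
}

impl<S, Req> Service<Req> for LogService<S>
where
    S: Service<Req>,
    Req: std::fmt::Debug,
{
    type Response = S::Response;
    type Error = S::Error;
    type Future = S::Future;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        self.inner.poll_ready(cx)
    }

    fn call(&mut self, req: Req) -> Self::Future {
        // Inspect (or rewrite) the request here before it reaches the inner service.
        println!("proxying request: {req:?}");
        self.inner.call(req)
    }
}
```
Keeping the service generic over the request type makes the layer easy to unit-test against a stub inner service.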
### Integration tests
`tests/proxy_integration.rs` exercises the proxy service end-to-end by spinning up a local HTTP origin server and calling the middleware pipeline. Run via `cargo test --test proxy_integration`. If the environment forbids binding loopback sockets (e.g., in sandboxed CI), the tests skip automatically.
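The skip guard can be as simple as this sketch (an assumed pattern for illustration, not the actual test code):
```rust
use std::net::TcpListener;

#[test]
fn proxy_forwards_requests() {
    // Sandboxed environments may forbid binding loopback sockets; bail out
    // gracefully instead of failing the suite.
    let listener = match TcpListener::bind("127.0.0.1:0") {
        Ok(l) => l,
        Err(e) => {
            eprintln!("skipping integration test: cannot bind loopback: {e}");
            return;
        }
    };
    // The real test would serve a stub origin on `listener` and drive the
    // middleware pipeline against its address.
    let origin_addr = listener.local_addr().expect("listener has a local address");
    assert!(origin_addr.port() > 0);
}
```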
## Roadmap
- Rate limiting middleware (per-host token bucket with optional Redis backend)
- Logging/tracing middleware wired to `tracing` subscribers
- Admin/API for cache inspection and manual invalidation
- Distributed cache support (Redis or persistent store)
- Integration tests hitting a local HTTP server to validate robots enforcement end-to-end
## HTTPS MITM feature plan
- [x] Research MITM TLS requirements: root CA generation, secure storage, and client trust distribution.
- [x] Design certificate issuance pipeline: per-origin leaf cert caching, CSR signing, rotation strategy.
- [x] Integrate TLS termination for CONNECT: accept CONNECT, perform TLS handshake with client using generated cert, establish upstream TLS, pipe data.
- [x] Ensure robots enforcement hooks into decrypted HTTP traffic (HTTP/1.1 initially).
- [x] Add configuration for MITM (enable flag, CA paths, key protections) and document security considerations.
- [x] Extend tests to cover MITM flow (unit tests for cert issuance, integration test with trusted root).