# WARC/ARC Injection Flow

This document summarizes how WARC/ARC injection works in this codebase,
which components are involved, and what parsing behaviors and limitations
exist today.

## Entry Points

- HTTP injection endpoint: `sendPageInject()` in `src/PageInject.cpp` parses
  request params into an `InjectionRequest` and forwards it via `Msg7`.
- CLI injection command: `inject` in `src/main.cpp` parses input files and
  sends HTTP requests directly to `/inject` or `/admin/inject`.
- UDP injection handler: `handleRequest7()` in `src/PageInject.cpp` receives
  a serialized `InjectionRequest` (msg type `0x07`) and calls
  `XmlDoc::injectDoc()` in `src/XmlDoc.cpp`.
- WARC/ARC detection: `Url::isWarc()` / `Url::isArc()` in `src/Url.cpp` detect
  `.warc`, `.warc.gz`, `.arc`, `.arc.gz`. This influences indexing behavior.
- WARC load-balancing: `getHostToHandleInjection()` in `src/PageInject.cpp`
  treats WARC URLs specially to spread archive work across hosts.

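The suffix-based detection can be illustrated with a minimal sketch. `ArchiveKind`, `endsWith()`, and `classifyArchiveUrl()` are hypothetical names, not the actual `Url::isWarc()` / `Url::isArc()` code, and the sketch assumes a plain case-sensitive suffix match on the whole URL string.

```cpp
#include <string>

// Hypothetical classification result; not a type from the codebase.
enum class ArchiveKind { None, Warc, Arc };

// Returns true if `s` ends with `suffix`.
static bool endsWith(const std::string &s, const std::string &suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Classifies a URL by the four suffixes listed above.
ArchiveKind classifyArchiveUrl(const std::string &url) {
    if (endsWith(url, ".warc") || endsWith(url, ".warc.gz")) return ArchiveKind::Warc;
    if (endsWith(url, ".arc") || endsWith(url, ".arc.gz")) return ArchiveKind::Arc;
    return ArchiveKind::None;
}
```

Under these assumptions, `classifyArchiveUrl("http://example.com/crawl.warc.gz")` yields `ArchiveKind::Warc`, while a URL with a query string after the suffix would not match.
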
## High-Level Flow

1) Client submits `/admin/inject` (or CLI equivalent) with a target URL.
2) `sendPageInject()` builds an `InjectionRequest` via
   `setInjectionRequestFromParms()` and sends it to a host using
   `Msg7::sendInjectionRequestToHost()`.
3) The CLI `inject` path builds HTTP requests and sends them via
   `s_tcp.sendMsg()` without using the UDP injection request format.
4) The receiving host runs `handleRequest7()`, deserializes the request,
   builds an `XmlDoc`, and calls `XmlDoc::injectDoc()`.
5) `XmlDoc::injectDoc()` calls `set4()` to configure the document and then
   `indexDoc()`.
6) `XmlDoc::indexDoc()` checks if the URL is a WARC/ARC container
   (`m_firstUrl.isWarc()` or `m_firstUrl.isArc()`). If so, it calls
   `indexWarcOrArc()` and does not index the container itself.

## WARC/ARC Streaming and Parsing

### CLI `inject` Parsing (Separate Codepath)

The CLI `inject` command has its own parser for WARC/ARC and for the
`+++URL:` format:

- WARC/ARC parsing lives in `doInjectWarc()` / `doInjectArc()` in
  `src/main.cpp`.
- It reads fixed-size blocks from disk, scans for WARC/ARC headers, and then
  sends per-record HTTP requests to `/admin/inject`.
- This code does not reuse `XmlDoc::indexWarcOrArc()` or the streaming
  pipe/`getUtf8ContentInFile()` logic.

### Download / Stream

`XmlDoc::getUtf8ContentInFile()` streams the WARC/ARC content:

- For normal URLs, this code is only used for large files; WARC/ARC always
  uses it.
- It constructs a shell pipeline: `wget ... | zcat | mbuffer ...`
  and opens it with `gbpopen()` for streaming reads.
- The file descriptor is set non-blocking and a read callback is registered.

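The general pattern of opening a shell pipeline and reading it through a non-blocking descriptor can be sketched with plain POSIX calls. This is an illustration only: it uses `popen()` rather than the codebase's `gbpopen()`, the hypothetical `readPipeNonBlocking()` takes an arbitrary command instead of the real pipeline (whose flags are not reproduced here), and a `poll()` loop stands in for the registered read callback.

```cpp
#include <cerrno>
#include <cstdio>
#include <string>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

// Runs `cmd` through the shell and collects its stdout via a non-blocking
// descriptor. Returns false if the pipe could not be set up or a read failed.
bool readPipeNonBlocking(const char *cmd, std::string *out) {
    FILE *pipe = popen(cmd, "r");
    if (!pipe) return false;
    int fd = fileno(pipe);
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        pclose(pipe);
        return false;
    }
    bool ok = true;
    char chunk[4096];
    for (;;) {
        ssize_t n = read(fd, chunk, sizeof(chunk));
        if (n > 0) { out->append(chunk, (size_t)n); continue; }
        if (n == 0) break;                 // EOF: the writer closed the pipe
        if (errno == EINTR) continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            // No data yet: wait until the descriptor becomes readable.
            struct pollfd pfd = { fd, POLLIN, 0 };
            if (poll(&pfd, 1, -1) < 0 && errno != EINTR) { ok = false; break; }
            continue;
        }
        ok = false;                        // unexpected read error
        break;
    }
    pclose(pipe);
    return ok;
}
```

In an event-driven server the `poll()` call would be replaced by returning to the event loop and resuming in the read callback, so the thread is never parked on a slow download.
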
### Buffering

`XmlDoc::readMoreWarc()` incrementally reads from the pipe into `m_fileBuf`:

- Uses a fixed buffer sized to `(5 * MAXWARCRECSIZE) + 1`.
- Tracks processed bytes via `m_fptr`/`m_fptrEnd`.
- Uses `memmove()` only when necessary to avoid repeated shifting.
- Can skip over oversized records by advancing `m_fptr`.

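The "shift only when necessary" idea can be shown with a small sliding window over a fixed buffer. `StreamWindow` is a hypothetical class written for illustration; its `m_begin`/`m_end` offsets play the role that `m_fptr`/`m_fptrEnd` play in the description above, and the compaction counter exists only to make the behavior observable.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical sliding window over a fixed-capacity buffer.
class StreamWindow {
public:
    explicit StreamWindow(size_t capacity) : m_buf(capacity), m_begin(0), m_end(0) {}

    size_t unread() const { return m_end - m_begin; }
    const char *data() const { return m_buf.data() + m_begin; }
    int compactions() const { return m_compactions; }

    // Copies up to `len` bytes in; returns how many bytes were accepted.
    size_t append(const char *src, size_t len) {
        size_t tail = m_buf.size() - m_end;
        if (len > tail && m_begin > 0) {
            // Compact only when the free tail is too small for the new bytes.
            std::memmove(m_buf.data(), m_buf.data() + m_begin, unread());
            m_end -= m_begin;
            m_begin = 0;
            ++m_compactions;
            tail = m_buf.size() - m_end;
        }
        size_t n = len < tail ? len : tail;
        std::memcpy(m_buf.data() + m_end, src, n);
        m_end += n;
        return n;
    }

    // Marks `n` bytes as processed (a parsed record, or a skipped one).
    void consume(size_t n) {
        m_begin += (n < unread() ? n : unread());
        if (m_begin == m_end) m_begin = m_end = 0;  // empty: reset without copying
    }

private:
    std::vector<char> m_buf;
    size_t m_begin, m_end;
    int m_compactions = 0;
};
```

Consuming a record only advances an offset; bytes are moved at most once per refill, and only when the tail of the buffer cannot hold the incoming data.
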
### Record Parsing

`XmlDoc::indexWarcOrArc()` scans the buffer and extracts record metadata:

WARC:
- Finds `WARC/` within a small window from `m_fptr`.
- Splits headers at `\r\n\r\n`.
- Looks up:
  - `Content-Length` (required)
  - `WARC-Target-URI` (required)
  - `WARC-Type` (must be `response`)
  - `Content-Type` (must be `application/http; msgtype=response`)
  - `WARC-Date` (optional)
  - `WARC-IP-Address` (optional)
- Skips any non-`response` records.

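The WARC rules above can be sketched as a small header parser. `WarcHeaderInfo`, `fieldValue()`, and `parseWarcHeader()` are hypothetical names, not code from `src/XmlDoc.cpp`; the sketch assumes the record starts exactly at the beginning of the input and deliberately keeps the exact, case-sensitive matching described above.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical holder for the fields listed above.
struct WarcHeaderInfo {
    std::string targetUri;
    std::string date;          // optional, empty if absent
    std::string ipAddress;     // optional, empty if absent
    long long contentLength = -1;
    size_t headerLen = 0;      // bytes up to and including "\r\n\r\n"
};

// Returns the value of header `name` inside `hdr`, or "" if absent.
static std::string fieldValue(const std::string &hdr, const std::string &name) {
    std::string key = "\r\n" + name + ":";
    size_t pos = hdr.find(key);
    if (pos == std::string::npos) return "";
    size_t start = pos + key.size();
    while (start < hdr.size() && hdr[start] == ' ') ++start;
    size_t end = hdr.find("\r\n", start);
    if (end == std::string::npos) end = hdr.size();
    return hdr.substr(start, end - start);
}

// Parses one WARC record header at the start of `buf`. Returns false if the
// header is incomplete, malformed, or not an indexable response record.
bool parseWarcHeader(const std::string &buf, WarcHeaderInfo *out) {
    if (buf.compare(0, 5, "WARC/") != 0) return false;
    size_t term = buf.find("\r\n\r\n");
    if (term == std::string::npos) return false;  // header not complete yet
    std::string hdr = buf.substr(0, term);
    if (fieldValue(hdr, "WARC-Type") != "response") return false;
    if (fieldValue(hdr, "Content-Type") != "application/http; msgtype=response")
        return false;
    std::string len = fieldValue(hdr, "Content-Length");
    out->targetUri = fieldValue(hdr, "WARC-Target-URI");
    if (len.empty() || out->targetUri.empty()) return false;
    char *endp = nullptr;
    out->contentLength = std::strtoll(len.c_str(), &endp, 10);
    if (*endp != '\0' || out->contentLength < 0) return false;
    out->date = fieldValue(hdr, "WARC-Date");
    out->ipAddress = fieldValue(hdr, "WARC-IP-Address");
    out->headerLen = term + 4;
    return true;
}
```

The payload of an accepted record then starts at `headerLen` and spans `contentLength` bytes. A real parser would also need to distinguish "need more data" from "skip this record", which this sketch folds into a single `false`.
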
ARC:
- Scans for the next `\nhttp://` or `\nhttps://`.
- Parses a single header line containing:
  - URL
  - IP string
  - timestamp
  - content type
  - content length
- Skips non-indexable content types.

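A minimal sketch of parsing such an ARC header line follows. `ArcHeaderInfo` and `parseArcHeaderLine()` are hypothetical names; the sketch assumes the five fields appear in the order listed above, separated by whitespace, and it is more tolerant of spacing than the existing strict parser is described to be.

```cpp
#include <sstream>
#include <string>

// Hypothetical holder for the five fields of an ARC record header line.
struct ArcHeaderInfo {
    std::string url;
    std::string ip;
    std::string timestamp;
    std::string contentType;
    long long contentLength = -1;
};

// Parses "URL IP timestamp content-type length" from a single line.
// Returns false unless exactly five whitespace-separated fields are present,
// the length is a non-negative integer, and the URL is http(s).
bool parseArcHeaderLine(const std::string &line, ArcHeaderInfo *out) {
    std::istringstream in(line);
    if (!(in >> out->url >> out->ip >> out->timestamp >> out->contentType
             >> out->contentLength)) return false;
    std::string extra;
    if (in >> extra) return false;          // unexpected sixth field
    if (out->contentLength < 0) return false;
    return out->url.compare(0, 7, "http://") == 0 ||
           out->url.compare(0, 8, "https://") == 0;
}
```

Because stream extraction treats `\r` as whitespace, a line ending in `\r\n` parses the same way as one ending in `\n`.
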
### Subdocument Injection

For each WARC/ARC record, `indexWarcOrArc()`:

- Builds a `Msg7` and `InjectionRequest`.
- Copies the HTTP payload into `msg7->m_contentBuf` to avoid buffer reuse
  issues.
- Injects the record via `Msg7::sendInjectionRequestToHost()`.
- Sets metadata fields, including:
  - `m_firstIndexed`/`m_lastSpidered` from `WARC-Date`
  - `m_injectDocIp` from `WARC-IP-Address`
  - Additional JSON metadata key `gbcapturedate`
- Waits for all sub-injections to complete before finishing.

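The final step, waiting for every sub-injection before finishing, is commonly handled with an outstanding-request counter. The sketch below is one way to express that; `InjectionTracker` and its methods are hypothetical and not taken from `XmlDoc` or `Msg7`, and it assumes all calls happen on a single event-loop thread.

```cpp
#include <functional>
#include <utility>

// Hypothetical tracker: counts in-flight sub-injections and fires `onDone`
// once the stream is exhausted and every reply has arrived.
class InjectionTracker {
public:
    explicit InjectionTracker(std::function<void()> onDone)
        : m_onDone(std::move(onDone)) {}

    void launched() { ++m_outstanding; }                  // a request was sent
    void replied() { --m_outstanding; maybeFinish(); }    // a reply came back
    void streamFinished() { m_streamDone = true; maybeFinish(); }
    int outstanding() const { return m_outstanding; }

private:
    void maybeFinish() {
        if (m_streamDone && m_outstanding == 0 && !m_fired) {
            m_fired = true;  // make sure completion runs exactly once
            m_onDone();
        }
    }

    std::function<void()> m_onDone;
    int m_outstanding = 0;
    bool m_streamDone = false;
    bool m_fired = false;
};
```

Checking both conditions in one place means completion fires correctly whether the last reply arrives before or after the end of the archive stream.
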
## Current Parsing Limitations and Improvement Ideas

These are suggestions based on the current parsing logic in
`src/XmlDoc.cpp` and related helpers.

1) Header matching is case-sensitive and string-based.
   - WARC fields should be matched case-insensitively.
   - Parse Content-Type parameters instead of strict string equality.
2) `WARC/` discovery only scans a small window.
   - The loop only checks ~10 bytes for `WARC/`, which can fail if
     records are not aligned at `m_fptr`.
3) `Content-Length` and record bounds are trusted too much.
   - Validate that `recContentLen` is sane and does not exceed buffer
     bounds before using it to advance pointers.
4) No support for WARC 1.1 or non-standard line endings.
   - Accept `\n\n` header terminators and `WARC/1.1`.
5) ARC parsing expects a very strict header layout.
   - Handle extra whitespace and `\r\n` and report structured errors.
6) Content-Type filtering is narrow.
   - Consider allowing `application/http; msgtype=response` with optional
     parameters or mixed casing.
7) Skipping large records can desync parsing.
   - The skip-ahead logic adjusts pointers but does not validate that the
     next record boundary is sane.
8) Error reporting is log-only.
   - Track metrics for skipped records, parse failures, and oversized
     entries to aid debugging.
9) URI normalization and validation are minimal.
   - Trim whitespace around `WARC-Target-URI` and validate UTF-8.
10) Unit tests are missing for parser edge cases.
    - Add fixture-based tests for WARC/ARC parsing, especially for:
      - Mixed-case headers
      - Extra whitespace and line endings
      - Large `Content-Length`
      - Non-response records
      - Missing required fields

11) CLI parsing is duplicated and divergent.
    - Refactor WARC/ARC parsing into a shared helper used by both
      `src/main.cpp` (CLI) and `src/XmlDoc.cpp` (server indexing).
    - The helper should accept a stream or buffer window and yield
      record metadata + payload slices without modifying the underlying
      buffer (so both streaming and file-based readers can share logic).

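Items 1 and 6 both come down to comparing the Content-Type value structurally rather than as a raw string. A minimal sketch of such a check is shown below; `normalizeToken()` and `isHttpResponseContentType()` are hypothetical helpers, and the sketch assumes unquoted parameter values.

```cpp
#include <cctype>
#include <string>

// Lower-cases ASCII letters and drops surrounding whitespace.
static std::string normalizeToken(const std::string &s) {
    size_t b = 0, e = s.size();
    while (b < e && std::isspace((unsigned char)s[b])) ++b;
    while (e > b && std::isspace((unsigned char)s[e - 1])) --e;
    std::string out = s.substr(b, e - b);
    for (char &c : out) c = (char)std::tolower((unsigned char)c);
    return out;
}

// Returns true if a Content-Type value names application/http with a
// msgtype=response parameter, regardless of case, spacing, parameter
// order, or extra parameters.
bool isHttpResponseContentType(const std::string &value) {
    size_t semi = value.find(';');
    if (normalizeToken(value.substr(0, semi)) != "application/http") return false;
    while (semi != std::string::npos) {
        size_t next = value.find(';', semi + 1);
        std::string param = value.substr(
            semi + 1,
            next == std::string::npos ? std::string::npos : next - semi - 1);
        size_t eq = param.find('=');
        if (eq != std::string::npos &&
            normalizeToken(param.substr(0, eq)) == "msgtype" &&
            normalizeToken(param.substr(eq + 1)) == "response") return true;
        semi = next;
    }
    return false;
}
```

With this approach `Application/HTTP;msgtype=Response; charset=utf-8` is accepted, while `application/http; msgtype=request` and `text/html` are still rejected.
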
## Draft Helper API Proposal

Goal: centralize WARC/ARC parsing so both server-side injection
(`XmlDoc::indexWarcOrArc()`) and CLI injection (`doInjectWarc/doInjectArc`)
use the same parsing and validation logic.

### Suggested Types

- `struct ArchiveRecord`:
  - `char *url`, `int32_t urlLen`
  - `char *payload`, `int64_t payloadLen`
  - `int64_t captureTime`
  - `int32_t ip` (0 if unknown)
  - `char contentType` or parsed `int32_t ct`
  - `bool hasHttpResponse`

- `enum ArchiveType { ARCHIVE_WARC, ARCHIVE_ARC }`

- `class ArchiveParser`:
  - `bool init(ArchiveType type)`
  - `ParseResult parseNext(const char *buf, int64_t bufLen, int64_t *consumed)`
  - `ArchiveRecord getRecord() const`
  - `int32_t lastError() const`

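Written out as declarations, the proposal might look like the sketch below. Several details are assumptions rather than part of the proposal: the `ParseResult` enumerators (the proposal names the type but does not define it), the choice of the parsed `int32_t ct` variant for the content type, the private members, and the use of `const char *` for the slices, chosen here because `parseNext()` takes a `const` buffer.

```cpp
#include <cstdint>

enum ArchiveType { ARCHIVE_WARC, ARCHIVE_ARC };

// Assumed result codes; not defined by the proposal itself.
enum ParseResult { PARSE_OK, PARSE_NEED_MORE, PARSE_SKIP };

struct ArchiveRecord {
    const char *url = nullptr;      // slice of the caller's buffer
    int32_t urlLen = 0;
    const char *payload = nullptr;  // slice of the caller's buffer
    int64_t payloadLen = 0;
    int64_t captureTime = 0;
    int32_t ip = 0;                 // 0 if unknown
    int32_t ct = 0;                 // parsed content type code
    bool hasHttpResponse = false;
};

class ArchiveParser {
public:
    // Selects WARC or ARC parsing rules.
    bool init(ArchiveType type);
    // Examines the window and reports how many bytes the caller may discard.
    ParseResult parseNext(const char *buf, int64_t bufLen, int64_t *consumed);
    // Valid only after parseNext() accepted a record.
    ArchiveRecord getRecord() const;
    int32_t lastError() const;

private:
    ArchiveType m_type = ARCHIVE_WARC;
    ArchiveRecord m_record;
    int32_t m_lastError = 0;
};
```

Because the record holds pointers into the caller's buffer, it is only valid until that buffer is compacted or refilled; callers that inject asynchronously would still need to copy the payload.
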
### Integration Points

- `XmlDoc::indexWarcOrArc()`:
  - Use `ArchiveParser::parseNext()` on `m_fileBuf` windows.
  - When a record is ready, build `InjectionRequest` and inject.
  - Maintain existing stream buffering but remove ad-hoc header parsing.

- `doInjectWarc()` / `doInjectArc()` in `src/main.cpp`:
  - Replace current scanning logic with `ArchiveParser`.
  - Keep the CLI file-read loop but let parser decide record boundaries.

### Behavior Notes

- Parser should accept:
  - `WARC/1.0` and `WARC/1.1`
  - Header field names case-insensitively
  - `\r\n\r\n` and `\n\n` header terminators
  - Content-Type parameters (e.g., `application/http; msgtype=response; charset=...`)
- Record rejection should return a structured error and the number of bytes
  to skip so callers can continue streaming safely.

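The "reject, report bytes to skip, keep going" contract can be illustrated from the caller's side. In this sketch, `ParseFn` is modeled on the proposed `parseNext()` signature, while the `ParseResult` enumerators, `DrainStats`, and `drainWindow()` are hypothetical and exist only for this example.

```cpp
#include <cstdint>
#include <functional>

// Assumed result codes; the proposal does not define them.
enum ParseResult { PARSE_OK, PARSE_NEED_MORE, PARSE_SKIP };

// Callable with the same shape as the proposed ArchiveParser::parseNext().
using ParseFn =
    std::function<ParseResult(const char *buf, int64_t bufLen, int64_t *consumed)>;

struct DrainStats {
    int accepted = 0;
    int skipped = 0;
};

// Feeds the window to `parse` until it asks for more data. Returns the
// number of bytes the caller may discard from the front of the window.
int64_t drainWindow(const char *buf, int64_t bufLen, const ParseFn &parse,
                    DrainStats *stats) {
    int64_t off = 0;
    while (off < bufLen) {
        int64_t consumed = 0;
        ParseResult r = parse(buf + off, bufLen - off, &consumed);
        if (r == PARSE_NEED_MORE) break;
        // Guard against a parser that reports no progress or overruns.
        if (consumed <= 0 || consumed > bufLen - off) break;
        if (r == PARSE_OK) ++stats->accepted; else ++stats->skipped;
        off += consumed;
    }
    return off;
}
```

A rejected record advances the offset exactly like an accepted one, so a single malformed entry does not stall the stream, and the counters give a natural place to hang the skip metrics.
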
## Implementation Status

- [x] Add shared `ArchiveParser` types and first-pass WARC/ARC parsing
- [x] Wire CLI `inject` (`doInjectWarc`/`doInjectArc`) to `ArchiveParser`
- [ ] Migrate server-side `XmlDoc::indexWarcOrArc()` to `ArchiveParser`
- [ ] Add parser-focused tests and fixtures for edge cases

## Key Files

- `src/PageInject.cpp`
- `src/PageInject.h`
- `src/XmlDoc.cpp`
- `src/XmlDoc.h`
- `src/Url.cpp`
- `src/Url.h`