WARC/ARC Injection Flow
This document summarizes how WARC/ARC injection works in this codebase, which components are involved, and what parsing behaviors and limitations exist today.
Entry Points
- HTTP injection endpoint: sendPageInject() in src/PageInject.cpp parses request params into an InjectionRequest and forwards it via Msg7.
- CLI injection command: inject in src/main.cpp parses input files and sends HTTP requests directly to /inject or /admin/inject.
- UDP injection handler: handleRequest7() in src/PageInject.cpp receives serialized InjectionRequest (msg type 0x07) and calls XmlDoc::injectDoc() in src/XmlDoc.cpp.
- WARC/ARC detection: Url::isWarc() / Url::isArc() in src/Url.cpp detect .warc, .warc.gz, .arc, .arc.gz. This influences indexing behavior.
- WARC load-balancing: getHostToHandleInjection() in src/PageInject.cpp treats WARC URLs specially to spread archive work across hosts.
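The extension check described above can be sketched as follows. This is a minimal illustration, not the code in src/Url.cpp: `ArchiveKind` and `classifyArchiveUrl()` are hypothetical names, and ignoring the query string and fragment is an assumption about the desired behavior. It requires C++17 for `std::string_view`.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for Url::isWarc() / Url::isArc(); the real
// implementations in src/Url.cpp may behave differently.
enum class ArchiveKind { None, Warc, Arc };

static bool endsWith(std::string_view s, std::string_view suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Classify a URL by its path suffix. Assumption: anything after '?' or '#'
// is ignored so that "a.warc.gz?x=1" still counts as a WARC container.
ArchiveKind classifyArchiveUrl(std::string_view url) {
    std::size_t cut = url.find_first_of("?#");
    std::string_view path = url.substr(0, cut);
    if (endsWith(path, ".warc") || endsWith(path, ".warc.gz"))
        return ArchiveKind::Warc;
    if (endsWith(path, ".arc") || endsWith(path, ".arc.gz"))
        return ArchiveKind::Arc;
    return ArchiveKind::None;
}
```

A caller would branch on the result to decide whether to hand the URL to the archive path or index it as a normal document.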
High-Level Flow
- Client submits /admin/inject (or CLI equivalent) with a target URL.
- sendPageInject() builds an InjectionRequest via setInjectionRequestFromParms() and sends it to a host using Msg7::sendInjectionRequestToHost().
- The CLI inject path builds HTTP requests and sends them via s_tcp.sendMsg() without using the UDP injection request format.
- The receiving host runs handleRequest7(), deserializes the request, builds an XmlDoc, and calls XmlDoc::injectDoc().
- XmlDoc::injectDoc() calls set4() to configure the document and then indexDoc().
- XmlDoc::indexDoc() checks if the URL is a WARC/ARC container (m_firstUrl.isWarc() or m_firstUrl.isArc()). If so, it calls indexWarcOrArc() and does not index the container itself.
WARC/ARC Streaming and Parsing
CLI inject Parsing (Separate Codepath)
The CLI inject command has its own parser for WARC/ARC and for the
+++URL: format:
- WARC/ARC parsing lives in doInjectWarc() / doInjectArc() in src/main.cpp.
- It reads fixed-size blocks from disk, scans for WARC/ARC headers, and then sends per-record HTTP requests to /admin/inject.
- This code does not reuse XmlDoc::indexWarcOrArc() or the streaming pipe / getUtf8ContentInFile() logic.
Download / Stream
XmlDoc::getUtf8ContentInFile() streams the WARC/ARC content:
- For normal URLs, this code is only used for large files; WARC/ARC always uses it.
- It constructs a shell pipeline: wget ... | zcat | mbuffer ... and opens it with gbpopen() for streaming reads.
- The file descriptor is set non-blocking and a read callback is registered.
Buffering
XmlDoc::readMoreWarc() incrementally reads from the pipe into m_fileBuf:
- Uses a fixed buffer sized to (5 * MAXWARCRECSIZE) + 1.
- Tracks processed bytes via m_fptr / m_fptrEnd.
- Uses memmove() only when necessary to avoid repeated shifting.
- Can skip over oversized records by advancing m_fptr.
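The "shift only when necessary" idea can be sketched with a small fixed-capacity buffer. `SlidingBuffer` and its members are hypothetical and only loosely mirror m_fileBuf, m_fptr, and m_fptrEnd; the real readMoreWarc() logic may differ in when it compacts.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical sliding buffer: consumed bytes are reclaimed lazily, so
// memmove() runs only when an append would otherwise not fit.
class SlidingBuffer {
public:
    explicit SlidingBuffer(std::size_t capacity) : buf_(capacity) {}

    // Append up to len bytes; returns how many bytes were accepted.
    std::size_t append(const char *data, std::size_t len) {
        if (buf_.size() - end_ < len && begin_ > 0) compact();
        std::size_t room = buf_.size() - end_;
        std::size_t n = len < room ? len : room;
        if (n > 0) std::memcpy(buf_.data() + end_, data, n);
        end_ += n;
        return n;
    }

    // Mark n bytes as processed (clamped to the unread size).
    void consume(std::size_t n) {
        std::size_t avail = end_ - begin_;
        begin_ += n < avail ? n : avail;
        if (begin_ == end_) begin_ = end_ = 0;  // empty: reset without copying
    }

    const char *data() const { return buf_.data() + begin_; }
    std::size_t size() const { return end_ - begin_; }
    std::size_t compactions() const { return compactions_; }

private:
    // Move the unread tail to the front of the buffer.
    void compact() {
        std::memmove(buf_.data(), buf_.data() + begin_, end_ - begin_);
        end_ -= begin_;
        begin_ = 0;
        ++compactions_;
    }

    std::vector<char> buf_;
    std::size_t begin_ = 0;       // first unprocessed byte
    std::size_t end_ = 0;         // one past the last valid byte
    std::size_t compactions_ = 0; // how often memmove() was needed
};
```

Skipping an oversized record corresponds to calling `consume()` with the record length, possibly across several reads if the record is larger than the buffer.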
Record Parsing
XmlDoc::indexWarcOrArc() scans the buffer and extracts record metadata:
WARC:
- Finds WARC/ within a small window from m_fptr.
- Splits headers at \r\n\r\n.
- Looks up:
  - Content-Length (required)
  - WARC-Target-URI (required)
  - WARC-Type (must be response)
  - Content-Type (must be application/http; msgtype=response)
  - WARC-Date (optional)
  - WARC-IP-Address (optional)
- Skips any non-response records.
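As a rough illustration of the WARC steps above, the sketch below splits a header block at \r\n\r\n, collects the named fields, and applies the same strict, case-sensitive checks. `WarcHeader`, `parseWarcHeader()`, and `isIndexableResponse()` are hypothetical names; the code in src/XmlDoc.cpp works on raw pointers into m_fileBuf rather than on `std::string`.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical header summary; not a type from the codebase.
struct WarcHeader {
    std::map<std::string, std::string> fields;
    std::size_t headerLen = 0;  // bytes up to and including "\r\n\r\n"
};

// Parse the header block at the start of buf. Returns false if buf does not
// start with "WARC/" or the "\r\n\r\n" terminator has not arrived yet.
bool parseWarcHeader(const std::string &buf, WarcHeader &out) {
    if (buf.compare(0, 5, "WARC/") != 0) return false;
    std::size_t end = buf.find("\r\n\r\n");
    if (end == std::string::npos) return false;
    out.fields.clear();
    out.headerLen = end + 4;
    std::size_t pos = buf.find("\r\n") + 2;  // skip the version line
    while (pos < end) {
        std::size_t eol = buf.find("\r\n", pos);  // never past `end`
        std::string line = buf.substr(pos, eol - pos);
        std::size_t colon = line.find(':');
        if (colon != std::string::npos) {
            std::size_t v = line.find_first_not_of(" \t", colon + 1);
            out.fields[line.substr(0, colon)] =
                v == std::string::npos ? std::string() : line.substr(v);
        }
        pos = eol + 2;
    }
    return true;
}

// Mirror the strict checks listed above (exact, case-sensitive matches).
bool isIndexableResponse(const WarcHeader &h) {
    auto get = [&](const char *key) {
        auto it = h.fields.find(key);
        return it == h.fields.end() ? std::string() : it->second;
    };
    return !get("Content-Length").empty() &&
           !get("WARC-Target-URI").empty() &&
           get("WARC-Type") == "response" &&
           get("Content-Type") == "application/http; msgtype=response";
}
```

Because the lookups are exact string matches, a record with `content-length:` or an extra Content-Type parameter would be skipped, which is the behavior the limitations section below calls out.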
ARC:
- Scans for the next \nhttp:// or \nhttps://.
- Parses a single header line containing:
  - URL
  - IP string
  - timestamp
  - content type
  - content length
- Skips non-indexable content types.
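A minimal sketch of parsing the five-field ARC header line follows. `ArcHeader` and `parseArcHeaderLine()` are hypothetical, and the whitespace-separated field order is taken from the list above; the parser in src/XmlDoc.cpp may tokenize the line differently.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical container for the fields of one ARC record header line.
struct ArcHeader {
    std::string url;
    std::string ip;
    std::string timestamp;
    std::string contentType;
    int64_t contentLen = -1;
};

// Parse "URL IP timestamp content-type content-length".
// Returns false if a field is missing, the length is not a non-negative
// integer, or there are extra tokens after the length.
bool parseArcHeaderLine(const std::string &line, ArcHeader &out) {
    std::istringstream in(line);
    ArcHeader h;
    if (!(in >> h.url >> h.ip >> h.timestamp >> h.contentType >> h.contentLen))
        return false;
    if (h.contentLen < 0) return false;
    std::string extra;
    if (in >> extra) return false;  // unexpected trailing token
    out = h;
    return true;
}
```

Stream extraction skips runs of whitespace, so this sketch tolerates repeated spaces and a trailing \r, at the cost of not reporting which field was malformed.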
Subdocument Injection
For each WARC/ARC record, indexWarcOrArc():
- Builds a Msg7 and InjectionRequest.
- Copies the HTTP payload into msg7->m_contentBuf to avoid buffer reuse issues.
- Injects the record via Msg7::sendInjectionRequestToHost().
- Sets metadata fields, including:
  - m_firstIndexed / m_lastSpidered from WARC-Date
  - m_injectDocIp from WARC-IP-Address
  - Additional JSON metadata key gbcapturedate
- Waits for all sub-injections to complete before finishing.
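The copy-then-wait pattern in this list can be sketched with an outstanding-request counter. `SubInjectionTracker` is a hypothetical class, not part of the codebase; the actual code tracks Msg7 objects and their callbacks, and this sketch is single-threaded.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tracker for per-record sub-injections.
class SubInjectionTracker {
public:
    explicit SubInjectionTracker(std::function<void()> onAllDone)
        : onAllDone_(std::move(onAllDone)) {}

    // Copy the payload so the parse buffer can be reused right away,
    // then count the request as outstanding. Returns the request id.
    std::size_t launch(const char *payload, std::size_t len) {
        payloads_.emplace_back(payload, len);
        ++outstanding_;
        return payloads_.size() - 1;
    }

    const std::string &payload(std::size_t id) const { return payloads_[id]; }

    // Called when one sub-injection replies.
    void complete(std::size_t id) {
        payloads_[id].clear();
        --outstanding_;
        maybeFinish();
    }

    // Called once the archive stream has been fully read.
    void finishedReading() {
        readerDone_ = true;
        maybeFinish();
    }

private:
    // Fire the completion callback exactly once, after the reader is done
    // and every launched request has completed.
    void maybeFinish() {
        if (readerDone_ && outstanding_ == 0 && !fired_) {
            fired_ = true;
            if (onAllDone_) onAllDone_();
        }
    }

    std::function<void()> onAllDone_;
    std::vector<std::string> payloads_;
    std::size_t outstanding_ = 0;
    bool readerDone_ = false;
    bool fired_ = false;
};
```

Requiring both conditions avoids finishing early when the first few replies arrive while the stream is still being read.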
Current Parsing Limitations and Improvement Ideas
These are suggestions based on the current parsing logic in
src/XmlDoc.cpp and related helpers.
- Header matching is case-sensitive and string-based.
  - WARC fields should be matched case-insensitively.
  - Parse Content-Type parameters instead of strict string equality.
- WARC/ discovery only scans a small window.
  - The loop only checks ~10 bytes for WARC/, which can fail if records are not aligned at m_fptr.
- Content-Length and record bounds are trusted too much.
  - Validate that recContentLen is sane and does not exceed buffer bounds before using it to advance pointers.
- No support for WARC 1.1 or non-standard line endings.
  - Accept \n\n header terminators and WARC/1.1.
- ARC parsing expects a very strict header layout.
  - Handle extra whitespace and \r\n and report structured errors.
- Content-Type filtering is narrow.
  - Consider allowing application/http; msgtype=response with optional parameters or mixed casing.
- Skipping large records can desync parsing.
  - The skip-ahead logic adjusts pointers but does not validate that the next record boundary is sane.
- Error reporting is log-only.
  - Track metrics for skipped records, parse failures, and oversized entries to aid debugging.
- URI normalization and validation is minimal.
  - Trim whitespace around WARC-Target-URI and validate UTF-8.
- Unit tests are missing for parser edge cases.
  - Add fixture-based tests for WARC/ARC parsing, especially for:
    - Mixed-case headers
    - Extra whitespace and line endings
    - Large Content-Length
    - Non-response records
    - Missing required fields
- CLI parsing is duplicated and divergent.
  - Refactor WARC/ARC parsing into a shared helper used by both src/main.cpp (CLI) and src/XmlDoc.cpp (server indexing).
  - The helper should accept a stream or buffer window and yield record metadata + payload slices without modifying the underlying buffer (so both streaming and file-based readers can share logic).
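Two of the suggestions above, lenient Content-Type matching and a bounded Content-Length, might look like the sketch below. `isHttpResponseType()` and `parseContentLength()` are hypothetical helpers, and `maxLen` stands in for whatever limit the caller derives from MAXWARCRECSIZE or the remaining buffer. Quoted parameter values such as msgtype="response" are not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>

static std::string lowerAscii(std::string s) {
    for (char &c : s)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

static std::string trim(const std::string &s) {
    std::size_t b = s.find_first_not_of(" \t");
    if (b == std::string::npos) return std::string();
    std::size_t e = s.find_last_not_of(" \t");
    return s.substr(b, e - b + 1);
}

// True for application/http carrying a msgtype=response parameter, in any
// letter case and with any additional parameters.
bool isHttpResponseType(const std::string &value) {
    std::string v = lowerAscii(value);
    std::size_t semi = v.find(';');
    if (trim(v.substr(0, semi)) != "application/http") return false;
    while (semi != std::string::npos) {
        std::size_t next = v.find(';', semi + 1);
        std::size_t len =
            next == std::string::npos ? std::string::npos : next - semi - 1;
        std::string param = trim(v.substr(semi + 1, len));
        std::size_t eq = param.find('=');
        if (eq != std::string::npos &&
            trim(param.substr(0, eq)) == "msgtype" &&
            trim(param.substr(eq + 1)) == "response")
            return true;
        semi = next;
    }
    return false;
}

// Accept only plain decimal digits, and reject values above maxLen so the
// result is safe to use for pointer arithmetic.
bool parseContentLength(const std::string &value, int64_t maxLen, int64_t &out) {
    std::string v = trim(value);
    if (v.empty() || v.size() > 18) return false;  // 18 digits fit in int64_t
    int64_t n = 0;
    for (char c : v) {
        if (c < '0' || c > '9') return false;
        n = n * 10 + (c - '0');
    }
    if (n > maxLen) return false;
    out = n;
    return true;
}
```

Rejecting the record outright when the length is out of range leaves the caller free to resynchronize on the next WARC/ marker instead of advancing by an untrusted amount.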
Draft Helper API Proposal
Goal: centralize WARC/ARC parsing so both server-side injection
(XmlDoc::indexWarcOrArc()) and CLI injection (doInjectWarc/doInjectArc)
use the same parsing and validation logic.
Suggested Types
- struct ArchiveRecord:
  - char *url, int32_t urlLen
  - char *payload, int64_t payloadLen
  - int64_t captureTime
  - int32_t ip (0 if unknown)
  - char contentType or parsed int32_t ct
  - bool hasHttpResponse
- enum ArchiveType { ARCHIVE_WARC, ARCHIVE_ARC }
- class ArchiveParser:
  - bool init(ArchiveType type)
  - ParseResult parseNext(const char *buf, int64_t bufLen, int64_t *consumed)
  - ArchiveRecord getRecord() const
  - int32_t lastError() const
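Written out as declarations, the proposed types might look like this. The proposal names ParseResult without defining it, so the enumerators below are placeholders; choosing `int32_t ct` over a character content type, and the default member values, are likewise assumptions made only for illustration. Method bodies are intentionally omitted.

```cpp
#include <cstdint>

enum ArchiveType { ARCHIVE_WARC, ARCHIVE_ARC };

// Hypothetical result codes; the proposal does not specify these.
enum ParseResult {
    PARSE_RECORD_READY,    // getRecord() returns a complete record
    PARSE_NEED_MORE_DATA,  // the buffer window ends mid-record
    PARSE_SKIP_RECORD,     // record rejected; *consumed says how far to skip
    PARSE_ERROR            // unrecoverable; see lastError()
};

struct ArchiveRecord {
    char *url = nullptr;      // points into the caller's buffer
    int32_t urlLen = 0;
    char *payload = nullptr;  // points into the caller's buffer
    int64_t payloadLen = 0;
    int64_t captureTime = 0;
    int32_t ip = 0;           // 0 if unknown
    int32_t ct = 0;           // parsed content type (assumed representation)
    bool hasHttpResponse = false;
};

class ArchiveParser {
public:
    bool init(ArchiveType type);
    ParseResult parseNext(const char *buf, int64_t bufLen, int64_t *consumed);
    ArchiveRecord getRecord() const;
    int32_t lastError() const;
};
```

Returning slices that point into the caller's buffer keeps the parser free of copies, which matches the requirement that it not modify the underlying buffer.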
Integration Points
- XmlDoc::indexWarcOrArc():
  - Use ArchiveParser::parseNext() on m_fileBuf windows.
  - When a record is ready, build InjectionRequest and inject.
  - Maintain existing stream buffering but remove ad-hoc header parsing.
- doInjectWarc() / doInjectArc() in src/main.cpp:
  - Replace current scanning logic with ArchiveParser.
  - Keep the CLI file-read loop but let parser decide record boundaries.
Behavior Notes
- Parser should accept:
  - WARC/1.0 and WARC/1.1
  - Header field names case-insensitively
  - \r\n\r\n and \n\n header terminators
  - Content-Type parameters (e.g., application/http; msgtype=response; charset=...)
- Record rejection should return a structured error and the number of bytes to skip so callers can continue streaming safely.
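The version and terminator rules above could be implemented along these lines. `findHeaderEnd()` and `isSupportedWarcVersion()` are hypothetical helper names, and working on `std::string` instead of a raw buffer window is a simplification.

```cpp
#include <cstddef>
#include <initializer_list>
#include <string>

// Locate the end of a header block, accepting "\r\n\r\n" or "\n\n",
// whichever comes first. On success, bodyStart is the offset of the first
// payload byte.
bool findHeaderEnd(const std::string &buf, std::size_t &bodyStart) {
    std::size_t crlf = buf.find("\r\n\r\n");
    std::size_t lf = buf.find("\n\n");
    if (crlf == std::string::npos && lf == std::string::npos) return false;
    if (lf == std::string::npos ||
        (crlf != std::string::npos && crlf < lf))
        bodyStart = crlf + 4;
    else
        bodyStart = lf + 2;
    return true;
}

// True if buf starts with a WARC/1.0 or WARC/1.1 version line.
bool isSupportedWarcVersion(const std::string &buf) {
    for (const char *v : {"WARC/1.0", "WARC/1.1"}) {
        std::string s(v);
        if (buf.compare(0, s.size(), s) != 0) continue;
        if (buf.size() == s.size()) return true;
        char next = buf[s.size()];
        if (next == '\r' || next == '\n') return true;
    }
    return false;
}
```

The two terminators cannot match at the same offset, since "\r\n\r\n" never contains two consecutive \n characters, so taking the earlier match is unambiguous.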
Implementation Status
- Add shared ArchiveParser types and first-pass WARC/ARC parsing
- Wire CLI inject (doInjectWarc/doInjectArc) to ArchiveParser
- Migrate server-side XmlDoc::indexWarcOrArc() to ArchiveParser
- Add parser-focused tests and fixtures for edge cases
Key Files
- src/PageInject.cpp
- src/PageInject.h
- src/XmlDoc.cpp
- src/XmlDoc.h
- src/Url.cpp
- src/Url.h