
Zachary D. Rowitsch 90c6e75e3b Begind unifying warc handling between cmd line and spidering injection

2025-12-29 21:01:12 -05:00

8.0 KiB


WARC/ARC Injection Flow

This document summarizes how WARC/ARC injection works in this codebase, which components are involved, and what parsing behaviors and limitations exist today.

Entry Points

HTTP injection endpoint: sendPageInject() in src/PageInject.cpp parses request params into an InjectionRequest and forwards it via Msg7.
CLI injection command: inject in src/main.cpp parses input files and sends HTTP requests directly to /inject or /admin/inject.
UDP injection handler: handleRequest7() in src/PageInject.cpp receives a serialized InjectionRequest (msg type 0x07) and calls XmlDoc::injectDoc() in src/XmlDoc.cpp.
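WARC/ARC detection: Url::isWarc() / Url::isArc() in src/Url.cpp detect URLs ending in .warc, .warc.gz, .arc, or .arc.gz. A match causes XmlDoc::indexDoc() to process the file as a container of records instead of indexing it as a single document.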
WARC load-balancing: getHostToHandleInjection() in src/PageInject.cpp treats WARC URLs specially to spread archive work across hosts.
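
The extension check described above can be sketched as a plain suffix match. This is an illustration only: hasSuffix(), looksLikeWarc(), and looksLikeArc() are hypothetical names, and the real Url::isWarc() / Url::isArc() in src/Url.cpp may apply additional rules (for example around query strings or letter case).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if `s` ends with `suffix` (case-sensitive).
static bool hasSuffix(const std::string &s, const std::string &suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Illustrative stand-ins for Url::isWarc() / Url::isArc(); not the
// actual implementations.
bool looksLikeWarc(const std::string &url) {
    return hasSuffix(url, ".warc") || hasSuffix(url, ".warc.gz");
}

bool looksLikeArc(const std::string &url) {
    return hasSuffix(url, ".arc") || hasSuffix(url, ".arc.gz");
}
```

Note that ".warc" does not end in ".arc" (the character before "arc" is "w", not "."), so the two checks do not overlap.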

High-Level Flow

The client submits a request to /admin/inject (or uses the CLI equivalent) with a target URL.
sendPageInject() builds an InjectionRequest via setInjectionRequestFromParms() and sends it to a host using Msg7::sendInjectionRequestToHost().
The CLI inject path builds HTTP requests and sends them via s_tcp.sendMsg() without using the UDP injection request format.
The receiving host runs handleRequest7(), deserializes the request, builds an XmlDoc, and calls XmlDoc::injectDoc().
XmlDoc::injectDoc() calls set4() to configure the document and then indexDoc().
XmlDoc::indexDoc() checks if the URL is a WARC/ARC container (m_firstUrl.isWarc() or m_firstUrl.isArc()). If so, it calls indexWarcOrArc() and does not index the container itself.

WARC/ARC Streaming and Parsing

CLI inject Parsing (Separate Codepath)

The CLI inject command has its own parser for WARC/ARC and for the +++URL: format:

WARC/ARC parsing lives in doInjectWarc() / doInjectArc() in src/main.cpp.
It reads fixed-size blocks from disk, scans for WARC/ARC headers, and then sends per-record HTTP requests to /admin/inject.
This code does not reuse XmlDoc::indexWarcOrArc() or the streaming pipe/getUtf8ContentInFile() logic.

Download / Stream

XmlDoc::getUtf8ContentInFile() streams the WARC/ARC content:

For normal URLs, this streaming path is used only for large files; WARC/ARC URLs always go through it.
It constructs a shell pipeline: wget ... | zcat | mbuffer ... and opens it with gbpopen() for streaming reads.
The file descriptor is set non-blocking and a read callback is registered.

Buffering

XmlDoc::readMoreWarc() incrementally reads from the pipe into m_fileBuf:

Uses a fixed buffer sized to (5 * MAXWARCRECSIZE) + 1.
Tracks processed bytes via m_fptr/m_fptrEnd.
Calls memmove() to compact the buffer only when necessary, which avoids shifting data on every read.
Can skip over oversized records by advancing m_fptr.
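
The buffering strategy above can be sketched as a sliding window over a fixed buffer. This is a minimal sketch under stated assumptions, not the code in XmlDoc::readMoreWarc(): StreamWindow and its members are hypothetical, with begin and end playing the role of m_fptr and m_fptrEnd as offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical sliding window over a fixed-size buffer.
struct StreamWindow {
    std::vector<char> buf;
    size_t begin = 0;  // first unprocessed byte (like m_fptr)
    size_t end = 0;    // one past the last byte read (like m_fptrEnd)

    explicit StreamWindow(size_t capacity) : buf(capacity) {}

    size_t unprocessed() const { return end - begin; }
    size_t tailSpace() const { return buf.size() - end; }

    // Compact only when the tail cannot hold `want` more bytes, so the
    // unprocessed data is not shifted on every read.
    void makeRoom(size_t want) {
        if (tailSpace() >= want || begin == 0) return;
        std::memmove(buf.data(), buf.data() + begin, unprocessed());
        end -= begin;
        begin = 0;
    }

    // Copy up to `n` bytes in; returns how many were accepted.
    size_t append(const char *src, size_t n) {
        makeRoom(n);
        size_t take = n < tailSpace() ? n : tailSpace();
        std::memcpy(buf.data() + end, src, take);
        end += take;
        return take;
    }

    // Mark `n` bytes as processed (a parsed or skipped record).
    void consume(size_t n) {
        assert(n <= unprocessed());
        begin += n;
    }
};
```

Skipping an oversized record fits the same model: the caller consumes the bytes it has and remembers how many more to discard from later reads.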

Record Parsing

XmlDoc::indexWarcOrArc() scans the buffer and extracts record metadata:

WARC:

Finds WARC/ within a small window from m_fptr.
Splits headers at \r\n\r\n.
Looks up:
- Content-Length (required)
- WARC-Target-URI (required)
- WARC-Type (must be response)
- Content-Type (must be application/http; msgtype=response)
- WARC-Date (optional)
- WARC-IP-Address (optional)
Skips any non-response records.
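
As a rough illustration of the WARC steps above, the following sketch splits the header block at \r\n\r\n and looks up the listed fields. WarcHeader, fieldValue(), and parseWarcHeader() are hypothetical names, and the matching is deliberately exact and case-sensitive to mirror the behavior this document describes; the actual code in XmlDoc::indexWarcOrArc() may order its checks differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical result type; field names are illustrative only.
struct WarcHeader {
    std::string targetUri, type, contentType, date, ip;
    long long contentLength = -1;
    size_t headerBytes = 0;  // bytes up to and including "\r\n\r\n"
};

// Return the value of "Name: value" within `headers`, or "" if absent.
static std::string fieldValue(const std::string &headers,
                              const std::string &name) {
    size_t pos = headers.find("\r\n" + name + ":");
    if (pos == std::string::npos) return "";
    pos += 2 + name.size() + 1;  // skip "\r\n", the name, and ':'
    size_t eol = headers.find("\r\n", pos);
    if (eol == std::string::npos) eol = headers.size();
    while (pos < eol && headers[pos] == ' ') ++pos;
    return headers.substr(pos, eol - pos);
}

// Parse one record header at the start of `buf`. Returns false if the
// header is incomplete or a required field is missing. `isResponse`
// tells the caller whether to inject the record or skip it.
bool parseWarcHeader(const std::string &buf, WarcHeader &out,
                     bool &isResponse) {
    if (buf.compare(0, 5, "WARC/") != 0) return false;
    size_t end = buf.find("\r\n\r\n");
    if (end == std::string::npos) return false;  // need more data
    std::string headers = buf.substr(0, end);

    out.headerBytes = end + 4;
    out.targetUri = fieldValue(headers, "WARC-Target-URI");
    out.type = fieldValue(headers, "WARC-Type");
    out.contentType = fieldValue(headers, "Content-Type");
    out.date = fieldValue(headers, "WARC-Date");          // optional
    out.ip = fieldValue(headers, "WARC-IP-Address");      // optional

    std::string len = fieldValue(headers, "Content-Length");
    if (len.empty() || out.targetUri.empty()) return false;
    out.contentLength = std::strtoll(len.c_str(), nullptr, 10);

    isResponse = out.type == "response" &&
                 out.contentType == "application/http; msgtype=response";
    return true;
}
```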

ARC:

Scans for the next \nhttp:// or \nhttps://.
Parses a single header line containing:
- URL
- IP string
- timestamp
- content type
- content length
Skips non-indexable content types.
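
The single-line ARC header can be sketched as five whitespace-separated fields in the order listed above. ArcHeader and parseArcHeaderLine() are hypothetical names. Because this sketch uses stream extraction, it tolerates runs of whitespace and is therefore more lenient than the strict layout the current parser expects.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical holder for the five ARC header fields.
struct ArcHeader {
    std::string url, ip, timestamp, contentType;
    long long contentLength = -1;
};

// Parse one ARC header line (without its trailing newline).
// Returns false if a field is missing, the length is negative,
// or the URL does not start with http:// or https://.
bool parseArcHeaderLine(const std::string &line, ArcHeader &out) {
    std::istringstream in(line);
    ArcHeader h;
    if (!(in >> h.url >> h.ip >> h.timestamp >> h.contentType >>
          h.contentLength)) {
        return false;
    }
    if (h.contentLength < 0) return false;
    if (h.url.compare(0, 7, "http://") != 0 &&
        h.url.compare(0, 8, "https://") != 0) {
        return false;
    }
    out = h;
    return true;
}
```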

Subdocument Injection

For each WARC/ARC record, indexWarcOrArc():

Builds a Msg7 and InjectionRequest.
Copies the HTTP payload into msg7->m_contentBuf to avoid buffer reuse issues.
Injects the record via Msg7::sendInjectionRequestToHost().
Sets metadata fields, including:
- m_firstIndexed/m_lastSpidered from WARC-Date
- m_injectDocIp from WARC-IP-Address
- Additional JSON metadata key gbcapturedate
Waits for all sub-injections to complete before finishing.
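
This document does not say how indexWarcOrArc() tracks outstanding sub-injections. One common pattern, shown here purely as an assumption, is a counter that fires a completion callback once every launched injection has reported back and no more will be launched. PendingInjections and its methods are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical tracker for in-flight sub-injections.
class PendingInjections {
public:
    explicit PendingInjections(std::function<void()> onAllDone)
        : m_onAllDone(std::move(onAllDone)) {}

    // Call once per record handed to the injection layer.
    void launched() { ++m_outstanding; }

    // Call from each injection's completion callback.
    void completed() {
        assert(m_outstanding > 0);
        --m_outstanding;
        maybeFinish();
    }

    // Call when the archive stream is exhausted.
    void doneLaunching() {
        m_doneLaunching = true;
        maybeFinish();
    }

private:
    void maybeFinish() {
        if (m_doneLaunching && m_outstanding == 0 && !m_fired) {
            m_fired = true;  // fire exactly once
            m_onAllDone();
        }
    }

    std::function<void()> m_onAllDone;
    int m_outstanding = 0;
    bool m_doneLaunching = false;
    bool m_fired = false;
};
```

Requiring both conditions prevents finishing early when the first few injections complete before the rest of the archive has been read.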

Current Parsing Limitations and Improvement Ideas

These are suggestions based on the current parsing logic in src/XmlDoc.cpp and related helpers.

Header matching is case-sensitive and string-based.
- WARC fields should be matched case-insensitively.
- Parse Content-Type parameters instead of strict string equality.
WARC/ discovery only scans a small window.
- The loop only checks ~10 bytes for WARC/, which can fail if records are not aligned at m_fptr.
Content-Length and record bounds are used without sufficient validation.
- Validate that recContentLen is sane and does not exceed buffer bounds before using it to advance pointers.
No support for WARC 1.1 or non-standard line endings.
- Accept \n\n header terminators and WARC/1.1.
ARC parsing expects a very strict header layout.
- Handle extra whitespace and \r\n and report structured errors.
Content-Type filtering is narrow.
- Consider allowing application/http; msgtype=response with optional parameters or mixed casing.
Skipping large records can desync parsing.
- The skip-ahead logic adjusts pointers but does not validate that the next record boundary is sane.
Error reporting is log-only.
- Track metrics for skipped records, parse failures, and oversized entries to aid debugging.
URI normalization and validation are minimal.
- Trim whitespace around WARC-Target-URI and validate UTF-8.
Unit tests are missing for parser edge cases.
- Add fixture-based tests for WARC/ARC parsing, especially for:
  - Mixed-case headers
  - Extra whitespace and line endings
  - Large Content-Length
  - Non-response records
  - Missing required fields
CLI parsing is duplicated and divergent.
- Refactor WARC/ARC parsing into a shared helper used by both src/main.cpp (CLI) and src/XmlDoc.cpp (server indexing).
- The helper should accept a stream or buffer window and yield record metadata + payload slices without modifying the underlying buffer (so both streaming and file-based readers can share logic).
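
To make the Content-Type suggestions above concrete, here is one possible sketch that compares the media type and the msgtype parameter case-insensitively and ignores any other parameters. The function names are hypothetical, and quoted parameter values are not handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// ASCII-only lower-casing; header tokens are ASCII.
static std::string lowerAscii(std::string s) {
    for (char &c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

// Strip leading and trailing spaces and tabs.
static std::string trim(const std::string &s) {
    size_t b = s.find_first_not_of(" \t");
    if (b == std::string::npos) return "";
    size_t e = s.find_last_not_of(" \t");
    return s.substr(b, e - b + 1);
}

// True if `value` is media type application/http with a
// msgtype=response parameter, in any position and any letter case.
bool isHttpResponseContentType(const std::string &value) {
    const size_t npos = std::string::npos;
    std::string v = lowerAscii(value);
    size_t semi = v.find(';');
    if (trim(v.substr(0, semi)) != "application/http") return false;

    while (semi != npos) {
        size_t next = v.find(';', semi + 1);
        size_t len = (next == npos) ? npos : next - semi - 1;
        std::string param = trim(v.substr(semi + 1, len));
        size_t eq = param.find('=');
        if (eq != npos && trim(param.substr(0, eq)) == "msgtype" &&
            trim(param.substr(eq + 1)) == "response") {
            return true;
        }
        semi = next;
    }
    return false;
}
```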

Draft Helper API Proposal

Goal: centralize WARC/ARC parsing so both server-side injection (XmlDoc::indexWarcOrArc()) and CLI injection (doInjectWarc/doInjectArc) use the same parsing and validation logic.

Suggested Types

struct ArchiveRecord:
- char *url, int32_t urlLen
- char *payload, int64_t payloadLen
- int64_t captureTime
- int32_t ip (0 if unknown)
- char contentType or parsed int32_t ct
- bool hasHttpResponse
enum ArchiveType { ARCHIVE_WARC, ARCHIVE_ARC }
class ArchiveParser:
- bool init(ArchiveType type)
- ParseResult parseNext(const char *buf, int64_t bufLen, int64_t *consumed)
- ArchiveRecord getRecord() const
- int32_t lastError() const
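
Written out as C++ declarations, the suggested types might look like the sketch below. Several details are assumptions rather than part of the proposal: ParseResult is named above but never defined, so its enumerators are invented for illustration; the pointers are const because parseNext() takes a const buffer; the int32_t ct alternative is chosen for the content type; and the inline method bodies are placeholders. parseNext() is declared only.

```cpp
#include <cassert>
#include <cstdint>

enum ArchiveType { ARCHIVE_WARC, ARCHIVE_ARC };

// Hypothetical enumerators; the proposal does not define ParseResult.
enum ParseResult {
    PARSE_RECORD_READY,    // getRecord() holds a complete record
    PARSE_NEED_MORE_DATA,  // buffer ends before the record does
    PARSE_SKIP_RECORD,     // record rejected; *consumed says how far to skip
    PARSE_ERROR            // see lastError()
};

struct ArchiveRecord {
    const char *url = nullptr;      // slice into the caller's buffer
    int32_t urlLen = 0;
    const char *payload = nullptr;  // slice into the caller's buffer
    int64_t payloadLen = 0;
    int64_t captureTime = 0;
    int32_t ip = 0;                 // 0 if unknown
    int32_t ct = 0;                 // parsed content type (assumed encoding)
    bool hasHttpResponse = false;
};

class ArchiveParser {
public:
    bool init(ArchiveType type) {
        m_type = type;
        m_lastError = 0;
        m_record = ArchiveRecord();
        return true;
    }

    // Parses at most one record from buf[0..bufLen) and reports the
    // number of bytes the caller may discard. Declared only.
    ParseResult parseNext(const char *buf, int64_t bufLen, int64_t *consumed);

    ArchiveRecord getRecord() const { return m_record; }
    int32_t lastError() const { return m_lastError; }

private:
    ArchiveType m_type = ARCHIVE_WARC;
    ArchiveRecord m_record;
    int32_t m_lastError = 0;
};
```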

Integration Points

XmlDoc::indexWarcOrArc():
- Use ArchiveParser::parseNext() on m_fileBuf windows.
- When a record is ready, build InjectionRequest and inject.
- Maintain existing stream buffering but remove ad-hoc header parsing.
doInjectWarc() / doInjectArc() in src/main.cpp:
- Replace current scanning logic with ArchiveParser.
- Keep the CLI file-read loop but let the parser decide record boundaries.

Behavior Notes

The parser should accept:
- WARC/1.0 and WARC/1.1
- Header field names case-insensitively
- \r\n\r\n and \n\n header terminators
- Content-Type parameters (e.g., application/http; msgtype=response; charset=...)
Record rejection should return a structured error and the number of bytes to skip so callers can continue streaming safely.
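
Two of the acceptance rules above, the version line and the header terminator, are small enough to sketch directly. isSupportedWarcVersion() and findHeaderEnd() are hypothetical helpers, not existing functions in this codebase.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if `buf` starts with a WARC/1.0 or WARC/1.1 version token.
bool isSupportedWarcVersion(const std::string &buf) {
    return buf.compare(0, 8, "WARC/1.0") == 0 ||
           buf.compare(0, 8, "WARC/1.1") == 0;
}

// Locate the end of the header block, accepting either "\r\n\r\n" or
// "\n\n". Returns the offset of the first payload byte, or
// std::string::npos if no terminator has arrived yet.
size_t findHeaderEnd(const std::string &buf) {
    const size_t npos = std::string::npos;
    size_t crlf = buf.find("\r\n\r\n");
    size_t lf = buf.find("\n\n");
    size_t endCrlf = (crlf == npos) ? npos : crlf + 4;
    size_t endLf = (lf == npos) ? npos : lf + 2;
    // npos is the largest size_t, so the minimum picks whichever
    // terminator actually ends first.
    return endCrlf < endLf ? endCrlf : endLf;
}
```

A caller that gets npos back would treat it as "need more data" and keep reading from the stream before trying again.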

Implementation Status

Add shared ArchiveParser types and first-pass WARC/ARC parsing
Wire CLI inject (doInjectWarc/doInjectArc) to ArchiveParser
Migrate server-side XmlDoc::indexWarcOrArc() to ArchiveParser
Add parser-focused tests and fixtures for edge cases

Key Files

src/PageInject.cpp
src/PageInject.h
src/XmlDoc.cpp
src/XmlDoc.h
src/Url.cpp
src/Url.h
