warcat

Utility for handling WARC (Web ARChive) files, written in Rust.

Inspired by a Python utility of the same name: https://pypi.org/project/Warcat/

Usage

The list, extract, and concat commands accept a number of filtering options that restrict which records they operate on. Several different filters can be combined, and each filter can be given more than once. Filters of the same type are OR'd together, while filters of different types are AND'd together. For example, -w request -w response -u "\.gov" matches records whose type is request or response and whose target URI matches \.gov.
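The combination rule above can be sketched in a few lines of Rust. This is a minimal illustration, not warcat's actual code: the `Record` and `Filters` types and the `any_or_empty` helper are hypothetical, and the URI filter uses a plain substring match as a stand-in for the regex that warcat applies.

```rust
/// Hypothetical, simplified record: only the fields the filters look at.
struct Record<'a> {
    warc_type: &'a str,
    content_type: &'a str,
    uri: &'a str,
}

/// Hypothetical filter set; each Vec holds the values given for one option.
#[derive(Default)]
struct Filters<'a> {
    warc_types: Vec<&'a str>,     // values passed with -w
    content_types: Vec<&'a str>,  // values passed with -c
    uri_substrings: Vec<&'a str>, // values passed with -u (substring here, regex in warcat)
}

/// OR within one filter type. An empty list means "no filter of this type",
/// so it matches every record.
fn any_or_empty(values: &[&str], pred: impl Fn(&str) -> bool) -> bool {
    values.is_empty() || values.iter().any(|&v| pred(v))
}

impl Filters<'_> {
    /// AND across filter types: every type that was specified must match.
    fn matches(&self, r: &Record<'_>) -> bool {
        any_or_empty(&self.warc_types, |v| r.warc_type == v)
            && any_or_empty(&self.content_types, |v| r.content_type == v)
            && any_or_empty(&self.uri_substrings, |v| r.uri.contains(v))
    }
}
```

With this shape, adding a second value to one list can only widen the selection, while adding a value to a previously empty list can only narrow it.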

The concat command can also filter one or more WARC files and write a new WARC file containing only the matching records. Writing gzip-compressed WARC files is multi-threaded.

A command-line tool for working with WARC files.

Usage: warcat [OPTIONS] <COMMAND>

Commands:
  concat   Naively join archives into one
  extract  Extract files from an archive
  list     List contents of an archive
  verify   Verify digest and validate archive conformance
  help     Print this message or the help of the given subcommand(s)

Options:
  -c, --filter-by-content-type <FILTER_BY_CONTENT_TYPE>
          filter by content-type header value, e.g. "text/html"
  -w, --filter-by-warc-type <FILTER_BY_WARC_TYPE>
          filter by warc-type header value, e.g. "response"
  -u, --filter-by-uri <FILTER_BY_URI>
          filter warc-target-uri with regex, e.g. ".*\.gov"
  -r, --filter-by-recordid <FILTER_BY_RECORDID>
          filter by warc-record-id, e.g. "29d2a1bb-21a3-4074-a0a9-68c8bc301b85"
  -v, --verbose
          verbose output (does not effect all commands)
  -h, --help
          Print help
  -V, --version
          Print version

Verify

Verify the integrity of a WARC file by checking the WARC header and digest of each record. This command does not honor any filters.
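To give a feel for what checking a WARC header involves, here is a rough sketch that parses one record's header block: a `WARC/<version>` line followed by `Name: value` fields and a blank line. The function name `parse_warc_header` is hypothetical and this is not warcat's implementation. It ignores folded continuation lines, and it does not cover digest verification, which would need a hashing crate.

```rust
use std::collections::HashMap;

/// Parse one WARC record header block into (version, fields).
/// Field names are lowercased because WARC field names are case-insensitive.
/// Returns None if the version line or any field line is malformed.
fn parse_warc_header(block: &str) -> Option<(String, HashMap<String, String>)> {
    // `lines()` splits on "\n" and also strips a trailing "\r".
    let mut lines = block.lines();
    let version = lines.next()?.trim();
    if !version.starts_with("WARC/") {
        return None;
    }
    let mut fields = HashMap::new();
    for line in lines {
        if line.trim().is_empty() {
            break; // a blank line ends the header block
        }
        // Split on the first colon only, so values such as
        // "<urn:uuid:...>" keep their own colons.
        let (name, value) = line.split_once(':')?;
        fields.insert(name.trim().to_ascii_lowercase(), value.trim().to_string());
    }
    Some((version.to_string(), fields))
}
```

A conformance check could then confirm that required fields such as `warc-type`, `warc-record-id`, and `content-length` are present before looking at the record body.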

List

Show the contents of a WARC file. Outputs the following for each record in the WARC file that passes the filters:

Record: <urn:uuid:003ed87e-9577-4b60-a4c7-d74aad67de8a>
  URI: <https://www.moldybits.net/>
  Data: 2023-11-20T14:51:55Z
  Type: response
  Content Type: application/http;msgtype=response
  Content length: 790

Extract

Extract the contents of a WARC file. By default, the contents of each record are extracted to the current directory. The -o option specifies a different output directory, which will be created if it does not exist.

$ warcat extract -o ~/test/ ~/Downloads/CC-MAIN-20230921073711-20230921103711-00000.warc.gz

The files are placed into the output directory with a directory structure replicated from the URI of the record:

$ find ~/test/footballaustralia.info
~/test/footballaustralia.info
~/test/footballaustralia.info/Galileanaboil214156.html
~/test/footballaustralia.info/Galileanaboil214156.html/aboil214156.html
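One plausible way to derive a relative output path from a record URI is to use the host as the top-level directory and append each path segment. The sketch below illustrates only that idea. The helper `uri_to_rel_path` is hypothetical, and the exact names warcat writes may differ from what it produces.

```rust
use std::path::PathBuf;

/// Map a record URI to a relative path: host first, then each path segment.
/// Hypothetical helper. The query string and fragment are dropped, and "." and
/// ".." segments are skipped so the result cannot escape the output directory.
fn uri_to_rel_path(uri: &str) -> Option<PathBuf> {
    // Strip the scheme ("http://", "https://", ...).
    let (_, rest) = uri.split_once("://")?;
    // Keep only the part before any query string or fragment.
    let rest = rest.split(|c: char| c == '?' || c == '#').next()?;
    let mut segments = rest
        .split('/')
        .filter(|s| !s.is_empty() && *s != "." && *s != "..");
    let host = segments.next()?;
    let mut path = PathBuf::from(host);
    for seg in segments {
        path.push(seg);
    }
    Some(path)
}
```

Joining the result onto the directory given with -o yields the final location. A URI that ends in a slash maps to the host directory itself in this sketch, so a real implementation would need to pick a file name for that case.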

Concat

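Concatenate multiple WARC files into a single WARC file. The first path is the output file and the remaining paths are the input files. Filters are given before the command name: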

$ warcat -u footballaustralia --verbose concat ~/test/big.warc.gz /Users/zrowitsch/Downloads/CC-MAIN-20230921073711-20230921103711-00000.warc.gz /Users/zrowitsch/Downloads/CC-MAIN-20230921073711-20230921103711-00001.warc.gz
Concatenating 2 files to ~/test/big.warc.gz
Adding ~/Downloads/CC-MAIN-20230921073711-20230921103711-00000.warc.gz
Added 3 records from ~/Downloads/CC-MAIN-20230921073711-20230921103711-00000.warc.gz
Adding ~/Downloads/CC-MAIN-20230921073711-20230921103711-00001.warc.gz
Added 0 records from ~/Downloads/CC-MAIN-20230921073711-20230921103711-00001.warc.gz

The three records copied from the two WARC archives can be inspected with list:

$ warcat list ~/test/big.warc.gz
Record: <urn:uuid:55d8a2fd-5efa-4818-a42e-11ea68591b98>
  URI: http://footballaustralia.info/Galilean/aboil214156.html
  Data: 2023-09-21T09:47:15Z
  Type: request
  Content Type: application/http; msgtype=request
  Content length: 283
Record: <urn:uuid:9710e5d0-18d2-4123-9aad-9c0d64ed82c3>
  URI: http://footballaustralia.info/Galilean/aboil214156.html
  Data: 2023-09-21T09:47:15Z
  Type: response
  Content Type: application/http; msgtype=response
  Content length: 48147
Record: <urn:uuid:074efef8-420a-4b2d-ad4a-6bdd48112a15>
  URI: http://footballaustralia.info/Galilean/aboil214156.html
  Data: 2023-09-21T09:47:15Z
  Type: metadata
  Content Type: application/warc-fields
  Content length: 287

Building

Building from source is straightforward: clone the repo and build with cargo. The man page(s) and shell completions are generated automatically.

git clone <this repo>
cd warcat
cargo build --release

RPM

The RPM can be built with the build_rpm.sh script in the project directory, which produces an RPM in the target/generate-rpm/ directory. The script runs the following steps:

cargo install cargo-generate-rpm
cargo build --release
strip -s target/release/warcat
cargo generate-rpm

Installing

macOS (Homebrew)

brew tap twistdroach/warcat
brew install warcat

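Red Hat-Based Distros

The RPM has been tested on Fedora 39. Download it from the releases tab, or build it as described above, then install it with dnf (adjust the path to wherever the RPM file is located):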

sudo dnf install target/generate-rpm/warcat-*.x86_64.rpm

Debian-Based Distros

The .deb has been tested on Debian. Download it from the releases tab and run:

sudo dpkg -i warcat-*.amd64.deb