warcat
A utility for handling WARC (Web ARChive) files, written in Rust.
Inspired by a Python utility of the same name: https://pypi.org/project/Warcat/
Usage
A number of filtering options are available for the list, extract, and concat commands. Each filter restricts the records the command operates on. Several filters can be combined, and each filter can be given multiple times. Multiple filters of the same type are OR'd together, while filters of different types are AND'd together.
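The combination rule can be sketched in a few lines of Rust. This is only an illustration, not warcat's actual implementation: the Record and Filters types and the any_or_unset helper are hypothetical names, content types are assumed to be compared for equality, record IDs are assumed to match by substring, and a plain substring test stands in for the regex that -u really takes.

```rust
/// Header values of one record that the filters look at (hypothetical type).
struct Record<'a> {
    content_type: &'a str,
    warc_type: &'a str,
    uri: &'a str,
    record_id: &'a str,
}

/// Values collected from repeated -c / -w / -u / -r options (hypothetical type).
#[derive(Default)]
struct Filters {
    content_types: Vec<String>,
    warc_types: Vec<String>,
    uris: Vec<String>,
    record_ids: Vec<String>,
}

/// OR within one filter type. An empty list means no filter of that type
/// was given, so every record passes it.
fn any_or_unset(values: &[String], pred: impl Fn(&str) -> bool) -> bool {
    values.is_empty() || values.iter().any(|v| pred(v.as_str()))
}

impl Filters {
    /// AND across the different filter types.
    fn matches(&self, r: &Record) -> bool {
        // Assumption: exact comparison for the content-type and warc-type values.
        any_or_unset(&self.content_types, |f| r.content_type == f)
            && any_or_unset(&self.warc_types, |f| r.warc_type == f)
            // Assumption: substring test as a stand-in for the URI regex.
            && any_or_unset(&self.uris, |f| r.uri.contains(f))
            // Assumption: the record ID filter matches by substring.
            && any_or_unset(&self.record_ids, |f| r.record_id.contains(f))
    }
}

fn main() {
    // Roughly: warcat -w request -w response -u footballaustralia list ...
    let filters = Filters {
        warc_types: vec!["request".to_string(), "response".to_string()],
        uris: vec!["footballaustralia".to_string()],
        ..Default::default()
    };
    let record = Record {
        content_type: "application/http; msgtype=response",
        warc_type: "response",
        uri: "http://footballaustralia.info/Galilean/aboil214156.html",
        record_id: "<urn:uuid:9710e5d0-18d2-4123-9aad-9c0d64ed82c3>",
    };
    println!("record selected: {}", filters.matches(&record));
}
```

With these filters a record is selected when its type is request or response and its URI also contains footballaustralia. A metadata record for the same URI would be rejected, because the warc-type filters fail even though the URI filter passes.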
The concat command can be used to filter one or more WARC files and output a new WARC file containing only the matching records. Writing gzip-compressed WARC files is multi-threaded.
A command-line tool for working with WARC files.
Usage: warcat [OPTIONS] <COMMAND>
Commands:
  concat   Naively join archives into one
  extract  Extract files from an archive
  list     List contents of an archive
  verify   Verify digest and validate archive conformance
  help     Print this message or the help of the given subcommand(s)
Options:
  -c, --filter-by-content-type <FILTER_BY_CONTENT_TYPE>
          filter by content-type header value, e.g. "text/html"
  -w, --filter-by-warc-type <FILTER_BY_WARC_TYPE>
          filter by warc-type header value, e.g. "response"
  -u, --filter-by-uri <FILTER_BY_URI>
          filter warc-target-uri with regex, e.g. ".*\.gov"
  -r, --filter-by-recordid <FILTER_BY_RECORDID>
          filter by warc-record-id, e.g. "29d2a1bb-21a3-4074-a0a9-68c8bc301b85"
  -v, --verbose
          verbose output (does not effect all commands)
  -h, --help
          Print help
  -V, --version
          Print version
Verify
Verify the integrity of a WARC file. This checks the WARC header and the digest of each record in the file. This command does not honor any filters.
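To give a feel for what a header conformance check involves, here is a minimal sketch in Rust. It is not warcat's verification code: check_header is a hypothetical function, it covers only the version line and the four header fields the WARC format makes mandatory for every record (WARC-Record-ID, Content-Length, WARC-Date, WARC-Type), and it leaves out digest checking and folded header lines entirely.

```rust
use std::collections::HashMap;

/// Fields every WARC record must carry, lowercased because field names
/// are case-insensitive.
const MANDATORY: [&str; 4] = ["warc-record-id", "content-length", "warc-date", "warc-type"];

/// Check one record header block and return a list of problems found.
/// An empty list means this (partial) check passed. Hypothetical helper.
fn check_header(header: &str) -> Vec<String> {
    let mut problems = Vec::new();
    let mut lines = header.lines(); // lines() also strips the trailing '\r' of CRLF

    // The first line names the format version, e.g. "WARC/1.0".
    match lines.next() {
        Some(v) if v.starts_with("WARC/") => {}
        other => problems.push(format!("bad version line: {:?}", other)),
    }

    // The remaining lines are "Name: value" pairs, ended by an empty line.
    let mut fields: HashMap<String, String> = HashMap::new();
    for line in lines {
        if line.is_empty() {
            break;
        }
        match line.split_once(':') {
            Some((name, value)) => {
                fields.insert(name.trim().to_ascii_lowercase(), value.trim().to_string());
            }
            None => problems.push(format!("malformed field line: {:?}", line)),
        }
    }

    for name in MANDATORY {
        if !fields.contains_key(name) {
            problems.push(format!("missing mandatory field: {}", name));
        }
    }

    // Content-Length must be a non-negative integer.
    if let Some(len) = fields.get("content-length") {
        if len.parse::<u64>().is_err() {
            problems.push(format!("Content-Length is not a number: {:?}", len));
        }
    }

    problems
}

fn main() {
    let header = "WARC/1.0\r\n\
                  WARC-Type: response\r\n\
                  WARC-Date: 2023-11-20T14:51:55Z\r\n\
                  WARC-Record-ID: <urn:uuid:003ed87e-9577-4b60-a4c7-d74aad67de8a>\r\n\
                  Content-Length: 790\r\n\
                  \r\n";
    let problems = check_header(header);
    if problems.is_empty() {
        println!("header ok");
    } else {
        for p in &problems {
            println!("{}", p);
        }
    }
}
```

A digest check would additionally hash the record body and compare the result with the digest header, which needs a hashing crate and is left out here.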
List
Show the contents of a WARC file. Outputs the following for each record in the WARC file that passes the filters:
Record: <urn:uuid:003ed87e-9577-4b60-a4c7-d74aad67de8a>
URI: <https://www.moldybits.net/>
Data: 2023-11-20T14:51:55Z
Type: response
Content Type: application/http;msgtype=response
Content length: 790
Extract
Extract the contents of a WARC file. By default, the contents of each record are extracted to the current directory. The -o option specifies an output directory, which is created if it does not exist.
$ warcat extract -o ~/test/ ~/Downloads/CC-MAIN-20230921073711-20230921103711-00000.warc.gz
The files are placed into the output directory with a directory structure replicated from the URI of the record:
$ find ~/test/footballaustralia.info
~/test/footballaustralia.info
~/test/footballaustralia.info/Galileanaboil214156.html
~/test/footballaustralia.info/Galileanaboil214156.html/aboil214156.html
Concat
Concatenate multiple WARC files into a single WARC file.
$ warcat -u footballaustralia --verbose concat ~/test/big.warc.gz /Users/zrowitsch/Downloads/CC-MAIN-20230921073711-20230921103711-00000.warc.gz /Users/zrowitsch/Downloads/CC-MAIN-20230921073711-20230921103711-00001.warc.gz
Concatenating 2 files to ~/test/big.warc.gz
Adding ~/Downloads/CC-MAIN-20230921073711-20230921103711-00000.warc.gz
Added 3 records from ~/Downloads/CC-MAIN-20230921073711-20230921103711-00000.warc.gz
Adding ~/Downloads/CC-MAIN-20230921073711-20230921103711-00001.warc.gz
Added 0 records from ~/Downloads/CC-MAIN-20230921073711-20230921103711-00001.warc.gz
We can see the three records copied from the two WARC archives using list:
$ warcat list ~/test/big.warc.gz
Record: <urn:uuid:55d8a2fd-5efa-4818-a42e-11ea68591b98>
URI: http://footballaustralia.info/Galilean/aboil214156.html
Data: 2023-09-21T09:47:15Z
Type: request
Content Type: application/http; msgtype=request
Content length: 283
Record: <urn:uuid:9710e5d0-18d2-4123-9aad-9c0d64ed82c3>
URI: http://footballaustralia.info/Galilean/aboil214156.html
Data: 2023-09-21T09:47:15Z
Type: response
Content Type: application/http; msgtype=response
Content length: 48147
Record: <urn:uuid:074efef8-420a-4b2d-ad4a-6bdd48112a15>
URI: http://footballaustralia.info/Galilean/aboil214156.html
Data: 2023-09-21T09:47:15Z
Type: metadata
Content Type: application/warc-fields
Content length: 287
Building
Building from source is easy: clone the repo and build with cargo. Man pages and shell completions are generated automatically.
git clone <this repo>
cd warcat
cargo build --release
RPM
The RPM can be built with the build_rpm.sh script in the project directory. This will produce an RPM in the target/generate-rpm/ directory.
cargo install cargo-generate-rpm
cargo build --release
strip -s target/release/warcat
cargo generate-rpm
Installing
macOS (Homebrew)
brew tap twistdroach/warcat
brew install warcat
Red Hat-Based Distros
The RPM has been tested on Fedora 39. Download from the releases tab and run:
sudo dnf install target/generate-rpm/warcat-*.x86_64.rpm
Debian-Based Distros
The .deb has been tested with Debian. Download from the releases tab and run:
sudo dpkg -i warcat-*.amd64.deb