CLI reference
Every command and flag the meguri binary exposes.
The meguri binary is the front door to the frontier engine and its files. The crawl-loop commands (seed, run, serve) drive the engine; the file commands (inspect, schedule, stats, map, pack, compact) read and reshape .meguri partitions; bench projects the cost to fleet scale.
meguri
meguri [command] [--flags]
Run with no command for the help screen. Global flags:
| Flag | Meaning |
|---|---|
| -v, --version | Print the version, commit, and build date. |
| -h, --help | Print help for the binary or any subcommand. |
meguri seed
Build a .meguri checkpoint from a CDX JSONL list of URLs. seed reads Common Crawl CDX records (ccrawl search ... -o jsonl) from --input or stdin, inserts each URL into a fresh frontier, and writes the checkpoint to --out.
meguri seed -i urls.jsonl -o frontier.meguri
| Flag | Default | Meaning |
|---|---|---|
| -i, --input | stdin | CDX JSONL file to read. |
| -o, --out | | Path to write the .meguri checkpoint. |
| --priority | 0.5 | Initial priority for every seeded URL. |
| --crawl-delay | 10 | Default per-host crawl delay, in deciseconds. |
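Because --crawl-delay is given in deciseconds (tenths of a second), the default of 10 is a one-second delay. A minimal sketch of the conversion follows; the helper names are hypothetical and are not part of meguri.

```python
def deciseconds_to_seconds(ds: int) -> float:
    """Convert a crawl delay in deciseconds (tenths of a second) to seconds."""
    return ds / 10


def seconds_to_deciseconds(seconds: float) -> int:
    """Convert seconds to the nearest whole decisecond, the unit --crawl-delay takes."""
    return round(seconds * 10)
```

For a 2.5-second delay, seconds_to_deciseconds(2.5) gives the value to pass as --crawl-delay.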
meguri run
Drive the frontier engine over a checkpoint or seed list. run loads a partition from --input (a .meguri checkpoint) or --seed (a CDX JSONL list), drives the staged engine loop to drain it in priority-then-politeness order with the offline fetcher, and writes the result to --out. The production fetcher is ami, bound through the fetch.Fetcher interface.
meguri run -i frontier.meguri -o crawled.meguri
| Flag | Default | Meaning |
|---|---|---|
| -i, --input | | .meguri checkpoint to recover and run. |
| --seed | | CDX JSONL seed list to run a fresh frontier from. |
| -o, --out | | Path to write the post-run .meguri checkpoint. |
| --priority | 0.5 | Initial priority for seeded URLs. |
| --crawl-delay | 10 | Default per-host crawl delay, in deciseconds. |
| --workers | 0 | Polite-host fetch parallelism (0 = default). |
| --wall | off | Use a wall clock (real politeness waits) instead of the logical clock. |
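To make "priority-then-politeness order" and the logical clock concrete, here is a toy sketch. It is not meguri's staged engine: the Url type, the drain function, and the single fixed delay per host are all assumptions made for illustration. It picks the highest-priority URL whose host is past its crawl delay, and when every pending host is still waiting it advances a logical clock instead of sleeping.

```python
import heapq
from typing import Dict, List, NamedTuple, Tuple


class Url(NamedTuple):
    host: str
    path: str
    priority: float


def drain(urls: List[Url], crawl_delay_ds: int = 10) -> List[Tuple[int, str]]:
    """Return (logical_time_ds, url) pairs in the order this toy loop fetches them."""
    # heapq is a min-heap, so negate priority; the index breaks ties by input order.
    heap = [(-u.priority, i, u) for i, u in enumerate(urls)]
    heapq.heapify(heap)
    next_ok: Dict[str, int] = {}  # host -> earliest logical time of its next fetch
    now = 0
    order: List[Tuple[int, str]] = []
    while heap:
        deferred = []
        picked = None
        # Take the highest-priority URL whose host is past its crawl delay.
        while heap:
            item = heapq.heappop(heap)
            if next_ok.get(item[2].host, 0) <= now:
                picked = item[2]
                break
            deferred.append(item)
        for item in deferred:
            heapq.heappush(heap, item)
        if picked is None:
            # Every pending host is waiting: jump the logical clock forward, no sleep.
            now = min(next_ok[item[2].host] for item in heap)
            continue
        order.append((now, picked.host + picked.path))
        next_ok[picked.host] = now + crawl_delay_ds
    return order
```

With a wall clock, the jump in the "every host is waiting" branch would be a real wait of the same length, which is the difference --wall describes.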
meguri serve
Open a directory as a durable partition and drive its crawl loop. serve opens --dir as a log-structured partition store, recovers its frontier (seeding from --seed on a fresh directory), drives the staged engine loop with the offline fetcher, and checkpoints back on shutdown. --manifest reads a fleet catalog and reports where this partition's range routes.
meguri serve -d ./part-7 --seed urls.jsonl
| Flag | Default | Meaning |
|---|---|---|
| -d, --dir | (required) | Partition store directory to open or create. |
| --seed | | CDX JSONL seed list to load into a fresh partition. |
| --manifest | | Fleet manifest to report this partition's routing against. |
| --resident-budget | 0 | Maximum resident URL records (0 = unbounded). |
| --priority | 0.5 | Initial priority for seeded URLs. |
| --crawl-delay | 10 | Default per-host crawl delay, in deciseconds. |
| --workers | 0 | Polite-host fetch parallelism (0 = default). |
| --wall | off | Use a wall clock (real politeness waits) instead of the logical clock. |
meguri inspect
Print the structure and stats of a .meguri file: the header facts, the region layout, the column counts, the checksum and codec, and the at-a-glance stats. The summary is computed from the header and the footer, so the cost is two small reads regardless of file size.
meguri inspect crawled.meguri
inspect takes the file path as its only argument and no flags beyond -h. The output is documented field by field on the quick start page.
meguri schedule
Show what is due to be crawled, by due time. When --data is a directory, schedule recovers the live frontier and prints each due URL with its canonical string and due hour. When --data is a .meguri file, it reads cold through the durable schedule index (the timing wheel, when present), so it touches only the near buckets, not the whole frontier.
meguri schedule --data crawled.meguri --limit 20
| Flag | Default | Meaning |
|---|---|---|
| --data | (required) | Partition directory or .meguri file to read. |
| --before | 0 | Due-time horizon in epoch-hours (0 = now). |
| --host | | Filter to one host key (hex 0x... or decimal). |
| --limit | 50 | Maximum URLs to list (0 = all). |
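The --before and --host values are plain integers. The sketch below shows one way to produce them, assuming "epoch-hours" means whole hours since the Unix epoch (1970-01-01T00:00:00Z); the source does not spell that out, and both helper names are hypothetical.

```python
from datetime import datetime, timezone


def to_epoch_hours(when: datetime) -> int:
    """Whole hours since the Unix epoch (assumed meaning of epoch-hours).

    Pass a timezone-aware datetime; a naive one is read as local time.
    """
    return int(when.timestamp()) // 3600


def parse_host_key(text: str) -> int:
    """Parse a host key written as hex (0x...) or decimal."""
    text = text.strip()
    if text.lower().startswith("0x"):
        return int(text, 16)
    return int(text, 10)
```

The integer from to_epoch_hours can be passed as --before to list what is due up to that hour rather than up to now.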
meguri stats
Print the counters of a partition directory or a .meguri file. A directory recovers the live frontier and prints the full per-status distribution, the pending and due counts, and the seen-set occupancy; a single file prints the footer summary (URL and host counts, due range, region presence) without recovery.
meguri stats --data crawled.meguri
| Flag | Default | Meaning |
|---|---|---|
| --data | (required) | Partition directory or .meguri file to read. |
meguri map
Print the partition map from a fleet manifest. map reads a meguri.manifest catalog (--manifest) and prints each partition's host-key range, URL and host counts, bytes/url, and epoch, then whether the ranges tile the key space cleanly. With --host it routes that single host through the map and prints its owning partition.
meguri map --manifest meguri.manifest
meguri map --manifest meguri.manifest --host 0x3fffffffffffffff
| Flag | Default | Meaning |
|---|---|---|
| --manifest | (required) | meguri.manifest catalog to read the map from. |
| --host | | Resolve a single host key (hex 0x... or decimal) through the map. |
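The two checks map performs, whether the ranges tile the key space and which partition owns a host key, can be sketched as follows. This is an illustration under stated assumptions, not the manifest format or meguri's routing code: it assumes unsigned 64-bit host keys and half-open [lo, hi) ranges, and the function names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

KEY_SPACE = 1 << 64  # assumed: host keys are unsigned 64-bit integers

Range = Tuple[int, int]  # half-open [lo, hi), an assumption for this sketch


def tiles_cleanly(ranges: List[Range]) -> bool:
    """True if the ranges cover the whole key space with no gap and no overlap."""
    ordered = sorted(ranges)
    if not ordered or ordered[0][0] != 0 or ordered[-1][1] != KEY_SPACE:
        return False
    if any(lo >= hi for lo, hi in ordered):
        return False
    # Each range must start exactly where the previous one ends.
    return all(prev[1] == cur[0] for prev, cur in zip(ordered, ordered[1:]))


def route(ranges: List[Range], host_key: int) -> Optional[int]:
    """Return the index (in sorted order) of the range owning host_key, or None."""
    ordered = sorted(ranges)
    starts = [lo for lo, _ in ordered]
    i = bisect_right(starts, host_key) - 1
    if i >= 0 and ordered[i][0] <= host_key < ordered[i][1]:
        return i
    return None
```

The binary search keeps a single lookup logarithmic in the number of partitions, which matters once a fleet has thousands of ranges.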
meguri pack
Write a partition's live state to a fresh .meguri file. This is the explicit checkpoint command. --data is the partition directory to read; --out is where to write the file. The directory is opened read-only and dropped without a checkpoint, so packing never mutates the live partition.
meguri pack --data ./part-7 --out part-7.meguri
| Flag | Default | Meaning |
|---|---|---|
| --data | (required) | Partition directory to read. |
| --out | <data>/pack.meguri | Path to write the .meguri file. |
meguri compact
Merge .meguri files, re-run the cascade, GC tombstones. compact merges its inputs into one partition (consolidation, the file side of rebalancing), re-runs the columnar cascade so the file packs to tens of bytes per URL, and with --gc drops the Gone tombstones past their re-probe horizon and reclaims the string arena. Inputs must own disjoint, ordered host-key ranges; an overlap is reported rather than producing a file a reader would reject.
meguri compact part-7.meguri part-8.meguri --out merged.meguri --gc
| Flag | Default | Meaning |
|---|---|---|
| <file...> | (required) | One or more .meguri files to merge. |
| --out | compact.meguri | Path to write the compacted file (by the first input). |
| --gc | off | Garbage-collect Gone tombstones and reclaim the string arena. |
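The input requirement (disjoint, ordered host-key ranges) is a weaker condition than tiling, since gaps between inputs are allowed. A minimal sketch of such a pre-check, assuming half-open [lo, hi) ranges listed in command-line order; the function name is hypothetical and this is not compact's own validation code.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # half-open [lo, hi), an assumption for this sketch


def first_conflict(ranges: List[Range]) -> Optional[Tuple[Range, Range]]:
    """Return the first adjacent pair that overlaps or is out of order, else None."""
    for prev, cur in zip(ranges, ranges[1:]):
        # cur must begin at or after the end of prev; a gap is fine.
        if cur[0] < prev[1]:
            return prev, cur
    return None
```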
meguri bench
Measure per-partition costs on a corpus slice and project to 100B URLs. bench reads CDX records from --input or stdin, builds a real partition, measures the deterministic .meguri bytes/url and seen-set bits/url with its achieved false-positive rate, and prints the fleet projection as measured-times-count against the three named scaling walls.
meguri bench -i urls.jsonl
| Flag | Default | Meaning |
|---|---|---|
| -i, --input | stdin | CDX JSONL file to read. |
| --priority | 0.5 | Initial priority for every seeded URL. |
| --crawl-delay | 10 | Default per-host crawl delay, in deciseconds. |
| --total-urls | 1e11 | Fleet total URL count to project to. |
| --urls-per-partition | 3e7 | Per-partition capacity, the projection lever. |
| --rebalance-to | 16 | Partitions to grow the slice to for the rebalance-vs-bandwidth arm. |
| --rebalance-bw | 1200 | Device read bandwidth in MB/s the shipped bytes are divided by. |
| --scheduler-sel-rate | 1e6 | Measured scheduler selections/s to report the politeness ceiling against. |
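The "measured-times-count" projection is simple arithmetic, sketched below with the flag defaults. The function names and the 40 bytes/url figure in the usage note are hypothetical placeholders, not bench output, and 1 MB is assumed to be 1e6 bytes.

```python
import math
from typing import Dict


def project(bytes_per_url: float, total_urls: float = 1e11,
            urls_per_partition: float = 3e7) -> Dict[str, float]:
    """Scale a measured per-URL cost to fleet size (measured times count)."""
    return {
        # Partitions needed to hold the fleet at the given per-partition capacity.
        "partitions": math.ceil(total_urls / urls_per_partition),
        "fleet_bytes": bytes_per_url * total_urls,
        "bytes_per_partition": bytes_per_url * urls_per_partition,
    }


def transfer_seconds(shipped_bytes: float, bandwidth_mb_s: float = 1200) -> float:
    """Seconds to read shipped_bytes at a bandwidth in MB/s (1 MB = 1e6 bytes assumed)."""
    return shipped_bytes / (bandwidth_mb_s * 1e6)
```

With the defaults, 1e11 URLs at 3e7 per partition come to 3334 partitions. At a hypothetical 40 bytes/url, one full partition is 1.2e9 bytes, which reads in one second at 1200 MB/s.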
meguri completion
Generate a shell autocompletion script for bash, zsh, fish, or PowerShell.
meguri completion zsh > "${fpath[1]}/_meguri"