Skip to content
meguri

Seeding a frontier

Turn a Common Crawl URL list into a .meguri checkpoint, and choose the starting priority and crawl delay.

A crawl starts from a set of URLs. meguri reads them as Common Crawl CDX records, the JSONL a ccrawl search produces, and folds each one into a fresh frontier. The result is a .meguri checkpoint you can run, inspect, or hand to a fleet.

From a search to a checkpoint

The simplest path pipes a search straight into seed:

ccrawl search 'example.com/*' --limit 50000 -o jsonl | meguri seed -o frontier.meguri

If you already have the records on disk, read them with -i:

meguri seed -i urls.jsonl -o frontier.meguri

seed deduplicates as it goes: a URL that canonicalises to one already in the frontier is dropped, so the checkpoint holds distinct work even if the input repeats. When it finishes, meguri inspect frontier.meguri shows the URL and host counts and the host-key range the partition covers.

Setting the starting priority

Every seeded URL enters with the same starting importance, --priority (default 0.5, on a 0-to-1 scale):

meguri seed -i urls.jsonl -o frontier.meguri --priority 0.8

The priority is the importance signal the engine orders on, within the set of URLs politeness currently allows. Seed a high-value list higher so it drains ahead of a broad background crawl folded in later.

Setting the crawl delay

--crawl-delay is the default per-host spacing, in deciseconds (default 10, so one second):

meguri seed -i urls.jsonl -o frontier.meguri --crawl-delay 30

This is the floor the engine applies before it learns a host's own rate from robots and from how the host responds. Raise it to be gentler on a fragile site; the per-host value the engine derives at run time can only make a host slower, never faster than this floor.

What you have

A .meguri checkpoint is a complete, self-describing partition. You can:

  • drain it with meguri run,
  • read its shape with meguri inspect,
  • list what is due with meguri schedule --data frontier.meguri,
  • or measure its per-URL cost with meguri bench.

Next: run a crawl loop.