Fan out agents over one dataset

Run many agents on separate machines against one corpus: a work queue they claim from without double-processing, collision-free outputs, and a gather step.

One corpus, one big task, N agents on N machines. This guide turns a workspace into the coordination layer: a shared corpus, a work queue agents claim from without double-processing, and an output layout that never collides.

When you need this

  • The task is too big for one agent: thousands of documents to analyze, a large repo to migrate, a dataset to label.
  • Workers are ephemeral and may crash, and no work item may be processed twice or lost.
  • Results must land in one place for a final aggregation.

Skip this guide for a single agent on a single machine; plain directories already do everything you need.

The layout

/mnt/work/
  corpus/      the dataset, read by every worker
  queue/       one file per work item, written once up front
  claimed/     workers move items here to take ownership
  outputs/     one directory per worker

1. Seed the workspace once

From any machine with the workspace mounted:

# The corpus; for bulk imports see Import data.
cp -r ./documents /mnt/work/corpus/

# The queue: one file per work item. Here, batches of 50 filenames each.
mkdir -p /mnt/work/queue
ls /mnt/work/corpus | split -l 50 - /mnt/work/queue/batch-

2. Boot the workers

Every worker mounts the same workspace with the same command and loops:

WORKER=worker-$(hostname)
mkdir -p /mnt/work/claimed/$WORKER /mnt/work/outputs/$WORKER

for item in /mnt/work/queue/*; do
  # Claim by rename: exactly one worker wins each item.
  if mv "$item" /mnt/work/claimed/$WORKER/ 2>/dev/null; then
    name=$(basename "$item")
    run-agent --task /mnt/work/claimed/$WORKER/$name \
              --out  /mnt/work/outputs/$WORKER/$name.json
    mv /mnt/work/claimed/$WORKER/$name /mnt/work/claimed/$WORKER/$name.done
  fi
done

The mv out of queue/ is atomic across machines: if two workers grab the same item, one succeeds and the other's mv fails, so it moves on to the next item. No lock is held while the agent works, so a crashed worker never blocks the fleet.

3. Recover crashed workers

A worker that died mid-item leaves a claimed file without the .done suffix. Requeue those before the gather step:

find /mnt/work/claimed -type f ! -name '*.done' \
  -exec mv {} /mnt/work/queue/ \;

4. Gather

When queue/ is empty and every claimed item is .done, aggregate and publish:

jq -s '.' /mnt/work/outputs/*/*.json > /mnt/work/.tmp.report.json
mv /mnt/work/.tmp.report.json /mnt/work/report.json

The final rename publishes the report atomically: anything watching the workspace sees report.json appear complete or not at all.