Fan out agents over one dataset
Run many agents on separate machines against one corpus: a work queue they claim from without double-processing, collision-free outputs, and a gather step.
One corpus, one big task, N agents on N machines. This guide turns a workspace into the coordination layer: a shared corpus, a work queue agents claim from without double-processing, and an output layout that never collides.
When you need this
- The task is too big for one agent: thousands of documents to analyze, a large repo to migrate, a dataset to label.
- Workers are ephemeral and may crash, and no work item may be processed twice or lost.
- Results must land in one place for a final aggregation.
Skip this guide for a single agent on a single machine; plain directories already do everything you need.
The layout
/mnt/work/
  corpus/     the dataset, read by every worker
  queue/      one file per work item, written once up front
  claimed/    workers move items here to take ownership
  outputs/    one directory per worker
1. Seed the workspace once
From any machine with the workspace mounted:
# The corpus; for bulk imports see Import data.
# The trailing /. copies the contents of documents/ instead of nesting the directory.
mkdir -p /mnt/work/corpus
cp -r ./documents/. /mnt/work/corpus/
# The queue: one file per work item. Here, batches of 50 filenames each.
mkdir -p /mnt/work/queue
ls /mnt/work/corpus | split -l 50 - /mnt/work/queue/batch-
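To see what the seeding step produces, here is a small dry run in a throwaway directory. The 120 placeholder files and the `doc-NNN.txt` names are made up for illustration; only the `ls | split` line matches the command above.

```shell
# Dry run of the seeding step in a temporary directory (not the real workspace).
WORK=$(mktemp -d)
mkdir -p "$WORK/corpus" "$WORK/queue"

# Create 120 empty placeholder documents: doc-001.txt ... doc-120.txt.
for i in $(seq -w 1 120); do
  : > "$WORK/corpus/doc-$i.txt"
done

# Same technique as the real seeding command: 50 filenames per queue item.
ls "$WORK/corpus" | split -l 50 - "$WORK/queue/batch-"

# List the queue items and how many filenames each one holds.
wc -l "$WORK"/queue/batch-*
```

With 120 files and batches of 50, split writes three queue items (batch-aa, batch-ab, batch-ac), the last one holding the remaining 20 filenames.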
2. Boot the workers
Every worker mounts the same workspace with the same command and loops:
WORKER=worker-$(hostname)
mkdir -p "/mnt/work/claimed/$WORKER" "/mnt/work/outputs/$WORKER"
for item in /mnt/work/queue/*; do
  # Claim by rename: exactly one worker wins each item.
  if mv "$item" "/mnt/work/claimed/$WORKER/" 2>/dev/null; then
    name=$(basename "$item")
    # Mark the item done only on success (assumes run-agent exits non-zero on failure).
    run-agent --task "/mnt/work/claimed/$WORKER/$name" \
              --out "/mnt/work/outputs/$WORKER/$name.json" &&
      mv "/mnt/work/claimed/$WORKER/$name" "/mnt/work/claimed/$WORKER/$name.done"
  fi
done
The mv out of queue/ is atomic across machines: if two workers grab the same item, one succeeds and the other's mv fails, so it moves on to the next item. No lock is held while the agent works, so a crashed worker never blocks the fleet.
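The claim-by-rename behavior can be tried out locally. The sketch below races two background subshells for a single queue item in a temporary directory; the worker names and the `winners` file are hypothetical, and a local filesystem stands in for the shared workspace.

```shell
# Demo of claim-by-rename in a temporary local directory (hypothetical names).
WORK=$(mktemp -d)
mkdir -p "$WORK/queue" "$WORK/claimed/worker-a" "$WORK/claimed/worker-b"
echo "doc-001.txt" > "$WORK/queue/batch-aa"

# Two "workers" race for the same item in the background.
for w in worker-a worker-b; do
  (
    # Only the worker whose rename succeeds records a win.
    if mv "$WORK/queue/batch-aa" "$WORK/claimed/$w/" 2>/dev/null; then
      echo "$w" >> "$WORK/winners"
    fi
  ) &
done
wait

# Print the single winner; the loser's mv failed because the source was gone.
cat "$WORK/winners"
```

Which worker wins varies from run to run, but the item always ends up in exactly one claimed/ directory.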
3. Recover crashed workers
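A worker that died mid-item leaves a claimed file without the .done suffix. An item that a live worker is still processing looks exactly the same, so run this only after every worker has exited. Requeue the unfinished files before the gather step, then start a worker again to process them: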
find /mnt/work/claimed -type f ! -name '*.done' \
  -exec mv {} /mnt/work/queue/ \;
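If only some workers are known to be dead, a narrower variant may be safer than sweeping all of claimed/. The sketch below is one possible approach, not part of any CLI: `requeue_worker` is a hypothetical helper that requeues a single worker's unfinished items and also deletes any partial output that worker left behind, so a truncated JSON file never reaches the gather step. It assumes the outputs/ naming used by the worker loop (`<item>.json`).

```shell
# Hypothetical helper: requeue one dead worker's unfinished items and
# discard whatever partial output it left behind.
requeue_worker() {
  work=$1 worker=$2
  find "$work/claimed/$worker" -type f ! -name '*.done' |
    while IFS= read -r item; do
      name=$(basename "$item")
      rm -f "$work/outputs/$worker/$name.json"   # partial output, if any
      mv "$item" "$work/queue/"
      echo "requeued $name from $worker"
    done
}

# Demo on a throwaway directory rather than the real workspace.
WORK=$(mktemp -d)
mkdir -p "$WORK/queue" "$WORK/claimed/worker-a" "$WORK/outputs/worker-a"
echo "doc-001.txt" > "$WORK/claimed/worker-a/batch-aa"        # died mid-item
echo "doc-051.txt" > "$WORK/claimed/worker-a/batch-ab.done"   # finished
echo '{"trunc' > "$WORK/outputs/worker-a/batch-aa.json"       # partial output

requeue_worker "$WORK" worker-a
```

Against the real workspace the call would look like `requeue_worker /mnt/work worker-host7`, where `worker-host7` is a placeholder for a worker you have confirmed is gone.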
4. Gather
When queue/ is empty and every claimed item is .done, aggregate and publish:
jq -s '.' /mnt/work/outputs/*/*.json > /mnt/work/.tmp.report.json
mv /mnt/work/.tmp.report.json /mnt/work/report.json
The final rename publishes the report atomically: anything watching the workspace sees report.json appear complete or not at all.
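The precondition for gathering (queue/ empty, every claimed item .done) can be checked mechanically before running the aggregation. A minimal sketch, where `ready_to_gather` is a hypothetical helper name and the demo runs in a temporary directory:

```shell
# Hypothetical helper: succeeds only when no item is pending or in flight.
ready_to_gather() {
  work=$1
  [ -z "$(find "$work/queue" -type f | head -n 1)" ] &&
    [ -z "$(find "$work/claimed" -type f ! -name '*.done' | head -n 1)" ]
}

# Demo on a throwaway directory rather than the real workspace.
WORK=$(mktemp -d)
mkdir -p "$WORK/queue" "$WORK/claimed/worker-a"
: > "$WORK/claimed/worker-a/batch-aa"    # one item still in flight

ready_to_gather "$WORK" && echo "ready" || echo "not ready"   # prints "not ready"

# The worker finishes and marks the item done.
mv "$WORK/claimed/worker-a/batch-aa" "$WORK/claimed/worker-a/batch-aa.done"

ready_to_gather "$WORK" && echo "ready" || echo "not ready"   # prints "ready"
```

Guarding the gather commands with `if ready_to_gather /mnt/work; then ... fi` keeps a partial report from being published while items are still outstanding.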
Related
- Share files across machines: what each worker's mount sees and when
- File locks and coordination: claim-by-rename and when you'd want a real lock
- Import data: moving a large corpus in efficiently