File locks and coordination

Cross-machine mutual exclusion for agent fleets: flock that works between machines, claim-by-rename for work queues, and when you don't need a lock at all.

This page is for when two machines must not do the same thing at the same time: exactly one agent may claim a work item, exactly one process may rebuild an index, a cleanup job must not run twice. The tools are file locks and atomic renames, both of which work across machines.

If your writers each write their own files and nothing contends for a shared resource, skip this page; that design needs no coordination at all (see Concurrent writes).

Locks work across machines

flock and fcntl byte-range locks behave between machines the way they behave between two processes on one machine: a lock held on machine A blocks machine B until it's released. This is on by default; nothing to configure.

# On machine A: hold the lock for 60 seconds.
flock /mnt/work/jobs.lock -c "sleep 60"
# On machine B, meanwhile:
flock --nonblock /mnt/work/jobs.lock -c "process-jobs"
# Exits immediately with status 1 and does not run process-jobs: the lock is held elsewhere.

Standard pidfile-style tooling works unchanged.
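
Inside a script, the same non-blocking pattern can be wrapped in a small helper. This is a minimal sketch rather than part of any tooling described on this page: the function name run_exclusive and the exit code 75 for "lock busy" are illustrative choices, and it assumes bash and the util-linux flock command.

```shell
#!/usr/bin/env bash

# run_exclusive LOCKFILE COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND only if LOCKFILE can be locked right away.
# Returns COMMAND's exit status, or 75 if another holder has the lock.
run_exclusive() {
  local lockfile=$1
  shift
  (
    # Open the lock file on fd 9 inside a subshell. The lock is tied to
    # this descriptor, so it is released when the subshell exits,
    # whether COMMAND succeeds or fails.
    exec 9>"$lockfile" || exit 1
    # --nonblock: give up immediately instead of waiting for the holder.
    flock --nonblock 9 || exit 75
    "$@"
  )
}
```

With this helper, run_exclusive /mnt/work/jobs.lock process-jobs would run process-jobs on whichever machine gets the lock first and return 75 on the others. A command that itself exits with 75 is indistinguishable from a busy lock, so pick a code your commands do not use.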

What a network lock can and can't promise

The lock is held through the workspace, not inside one machine's kernel. If the holder disappears without releasing it (crash, kill, network cut), the lock is reclaimed when that machine's mount session is cleaned up, not instantly. Two consequences:

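  • Keep critical sections short, so a holder is less likely to vanish mid-section and leave everyone else waiting for cleanup.
  • Don't assume the holder can never lose the lock: a holder that was cut off from the network may still be running when its lock is reclaimed. If stale holders must be fenced out on a timeout you control, use a lock service you operate (Redis SET with NX and a TTL, a Postgres advisory lock) instead of a file lock.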
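
One way to keep critical sections short in the ordinary, non-crash case is to bound both the wait for the lock and the time spent holding it. The sketch below assumes the util-linux flock and coreutils timeout commands; the function name run_bounded and the exit code 75 are illustrative. It does not change how quickly a lock is reclaimed after a crash or network cut.

```shell
#!/usr/bin/env bash

# run_bounded LOCKFILE WAIT_SECS MAX_SECS COMMAND [ARGS...]
# Hypothetical helper: waits at most WAIT_SECS for the lock, then runs
# COMMAND under the lock for at most MAX_SECS.
# Exit status: 75 if the lock was not acquired in time,
#              124 if COMMAND was stopped by timeout,
#              otherwise COMMAND's own status.
run_bounded() {
  local lockfile=$1 wait_secs=$2 max_secs=$3
  shift 3
  # flock holds the lock while its child runs; timeout sends SIGTERM to
  # the child after MAX_SECS, which ends the critical section.
  flock --timeout "$wait_secs" --conflict-exit-code 75 "$lockfile" \
    timeout "$max_secs" "$@"
}
```

For example, run_bounded /mnt/work/jobs.lock 10 30 process-jobs would wait up to 10 seconds for the lock and stop process-jobs if it is still running after 30.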

Coordination without locks

Most agent workflows can make contention impossible instead of managing it:

  • One file per writer. Each writer claims its own path, e.g. outputs/agent-7/result.json. Nothing collides, nothing coordinates.

  • Claim by rename. To hand work items to exactly one worker each, let workers claim by moving the item into their own directory:

    mv /mnt/work/queue/item-42 /mnt/work/claimed/worker-3/item-42

    Rename is atomic across machines: when two workers grab the same item, one mv succeeds and the other fails with "No such file or directory", and the losing worker moves on to the next item. No lock is held while the work runs, so a crashed worker never blocks the queue. Its claimed-but-unfinished items stay visible under claimed/ for a janitor to requeue.

  • Publish by rename. Write results to a hidden path, then mv them into place as the last step; readers see complete results or nothing. The multi-file variant is in Concurrent writes.
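
The claim-by-rename and publish-by-rename patterns above can be sketched as two small bash functions. The names claim_next and publish_result, the directory arguments, and the temporary-file naming are all illustrative assumptions. The sketch also assumes the source and destination of each mv are on the same filesystem, as in the /mnt/work examples, since mv falls back to a non-atomic copy otherwise.

```shell
#!/usr/bin/env bash

# claim_next QUEUE_DIR CLAIM_DIR
# Hypothetical helper: tries to claim one item from QUEUE_DIR by renaming
# it into CLAIM_DIR. Prints the claimed path and returns 0, or returns 1
# if the queue is empty or every item was taken by another worker first.
claim_next() {
  local queue=$1 mine=$2 item name
  mkdir -p "$mine"
  for item in "$queue"/*; do
    [ -e "$item" ] || continue   # glob matched nothing, or item already gone
    name=${item##*/}
    # The rename is the claim. If another worker renamed the item first,
    # mv fails here and we try the next item. Errors are silenced, so
    # other failures (e.g. permissions) also look like a lost race.
    if mv -- "$item" "$mine/$name" 2>/dev/null; then
      printf '%s\n' "$mine/$name"
      return 0
    fi
  done
  return 1
}

# publish_result FINAL_PATH
# Hypothetical helper: writes stdin to a hidden file next to FINAL_PATH,
# then renames it into place so readers never see a partial result.
publish_result() {
  local final=$1 dir base tmp
  dir=$(dirname -- "$final")
  base=$(basename -- "$final")
  # Host name and PID keep temporary names distinct across machines.
  tmp="$dir/.$base.${HOSTNAME:-unknown}.$$.tmp"
  cat >"$tmp" && mv -- "$tmp" "$final"
}
```

A worker loop might then look like: while item=$(claim_next /mnt/work/queue /mnt/work/claimed/worker-3); do process "$item" | publish_result "/mnt/work/outputs/worker-3/${item##*/}.json"; done, where process stands in for the actual work and the output directory already exists.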