Import data
Move existing data into a Tonbo Artifacts workspace from a Mac folder, an S3 bucket, or local disk.
A native artifacts migrate command is on the v1 list; for v0, use the
existing battle-tested S3 / rsync tooling. Pick the recipe that
matches where your data lives now.
These recipes are backend-agnostic. /mnt/work is a FUSE mount,
so any tool that writes to a regular Linux filesystem works. The chunks
flow to wherever the workspace points: Tonbo Artifacts' managed bucket
by default, or your own (BYO) bucket. You don't have to think about chunk
storage at import time.
Any staging bucket in the recipes below is independent of the workspace's chunk storage. It's just where your data lives temporarily so a Linux host can pull from it. Any S3-compatible bucket you can read works: Tigris, R2, AWS S3, MinIO.
From a Mac folder
The v0 mount client is Linux-only, so Mac → Tonbo Artifacts takes two hops: stage to any S3-compatible bucket, then pull from a Linux host that has the workspace mounted.
- (if your data is an archive) extract it first
The next step expects a directory tree. If you have an archive, expand it locally:
mkdir -p ~/dataset
# tar.zst (zstd), fast, common from `tar --zstd`
zstd -dc dataset.tar.zst | tar -xf - -C ~/dataset
# tar.gz
tar -xzf dataset.tar.gz -C ~/dataset
# zip
unzip dataset.zip -d ~/dataset
- Pick (or create) a staging bucket and credentials
Any S3-compatible bucket the Linux mount host can read works: Tigris, R2, AWS S3, MinIO. Pick whatever you already have keys for. If you're starting from scratch, AWS S3 is one command away:
aws s3 mb s3://my-staging-bucket --region us-west-1
# Lock down public access on the new bucket:
aws s3api put-public-access-block --bucket my-staging-bucket \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
Mint an IAM user with read-write access to my-staging-bucket and copy its access key + secret. The same key pair is used by rclone on the Mac (next step) and by aws s3 cp on the Linux host (the step after).
- On the Mac: rclone stage to the bucket
brew install rclone
Configure rclone. The example uses AWS S3 us-west-1; for Tigris swap in provider=Tigris and set endpoint=https://fly.storage.tigris.dev:
rclone config create staging s3 \
  provider=AWS \
  access_key_id=<staging-access-key> \
  secret_access_key=<staging-secret-key> \
  region=us-west-1
Upload:
rclone copy ~/dataset staging:my-staging-bucket/ \
  --transfers 32 --progress
On home broadband the uplink is the bottleneck: expect at most uplink-Mbps ÷ 8 MB/s (about uplink-Mbps × 0.0075 GB/min), so a 5–6 GB / 40k-file dataset over a ~50 Mbps uplink lands in about 15 minutes.
- On the Linux mount host: pull from staging into /mnt/work
The Linux box needs read access to the staging bucket. Export the same access key you used for rclone:
export AWS_ACCESS_KEY_ID=<staging-access-key>
export AWS_SECRET_ACCESS_KEY=<staging-secret-key>
export AWS_REGION=us-west-1
aws s3 cp s3://my-staging-bucket/ /mnt/work/ \
  --recursive --quiet
--quiet avoids one log line per file (matters at 40k+ files). For non-AWS providers add --endpoint-url https://<provider-endpoint>.
Within the same AWS region this typically runs at ~10 MB/s per stream; a 5 GB pull is ~10 minutes wall-clock.
- Validate, then drop the staging bucket
Pick a representative file and verify cold + warm reads (see Validation below). Once you're satisfied, drop the staging bucket. The workspace now owns the data via its chunk backend (managed or BYO) plus Tonbo Artifacts' metadata service.
aws s3 rm s3://my-staging-bucket --recursive
aws s3 rb s3://my-staging-bucket
aws iam delete-access-key --user-name <staging-iam-user> --access-key-id <ak>
aws iam delete-user-policy --user-name <staging-iam-user> --policy-name staging-rw
aws iam delete-user --user-name <staging-iam-user>
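To budget time for the two hops above before you start, a rough line-rate estimate can help. The sketch below is illustrative only: estimate_minutes is a hypothetical helper, and it assumes decimal units (1 GB = 1000 MB), 8 bits per byte, and no per-file overhead, so treat its output as a lower bound rather than a measurement.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate wall-clock minutes to move SIZE_GB at
# RATE_MBPS (megabits per second). Assumes 1 GB = 1000 MB and 8 bits per
# byte; real transfers add per-file overhead, so this is a lower bound.
estimate_minutes() {
  local size_gb=$1 rate_mbps=$2
  awk -v gb="$size_gb" -v mbps="$rate_mbps" \
    'BEGIN { printf "%.1f\n", (gb * 1000) / (mbps / 8) / 60 }'
}

# Hop 1: 5.5 GB over a ~50 Mbps home uplink.
estimate_minutes 5.5 50   # prints 14.7
# Hop 2: 5 GB at ~10 MB/s per stream, i.e. 80 Mbps.
estimate_minutes 5 80     # prints 8.3
```

Both figures sit slightly under the wall-clock numbers quoted in the steps, which is what you would expect from an estimate that ignores request latency on 40k small files.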
From an S3 source
If your data already lives in another S3-compatible bucket (Tigris, R2, AWS S3, MinIO), it's a single hop. Make sure the Linux host has read credentials for the source bucket:
export AWS_ACCESS_KEY_ID=<source-key>
export AWS_SECRET_ACCESS_KEY=<source-secret>
export AWS_REGION=<source-region>
aws s3 cp s3://<source-bucket>/ /mnt/work/ \
  --recursive --quiet \
  --endpoint-url https://<source-endpoint>
Or, with rclone:
rclone copy <source-remote>:<bucket> /mnt/work/ \
  --transfers 32 --progress
Drop --endpoint-url if you're on AWS S3 (the SDK figures it out
from AWS_REGION).
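If the same script has to work for both AWS and non-AWS sources, the --endpoint-url rule can be made conditional. The following is a minimal sketch, not part of the Tonbo Artifacts tooling: SOURCE_ENDPOINT and endpoint_args are hypothetical names introduced here, and the commented usage adds the AWS CLI's --dryrun flag so you can preview the copy before writing into the mount.

```shell
#!/usr/bin/env bash
# Print the extra flag needed for non-AWS providers, one argument per line,
# or nothing at all when the source is AWS S3.
# SOURCE_ENDPOINT is a hypothetical variable name used only in this sketch.
endpoint_args() {
  if [ -n "${SOURCE_ENDPOINT:-}" ]; then
    printf '%s\n' "--endpoint-url" "$SOURCE_ENDPOINT"
  fi
}

# Usage (left commented so the sketch has no side effects; needs bash 4+):
#   mapfile -t extra < <(endpoint_args)
#   aws s3 cp "s3://<source-bucket>/" /mnt/work/ \
#     --recursive --dryrun "${extra[@]}"
# Remove --dryrun (and add --quiet) once the listed operations look right.
```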
Writes flow through /mnt/work (FUSE) to the workspace's chunk
backend; metadata (inode tree, chunk pointers) lands in Tonbo Artifacts'
metadata service.
You don't see the chunk-side bucket from inside the mount.
From a local disk / NFS / etc.
If your data already sits on the same Linux host as the mount (or
you can scp / rsync it there), this is the shortest path:
rsync -avP --info=progress2 /local/source/ /mnt/work/
For large trees with many files, run several rsync processes in parallel:
cd /local/source && find . -type f -print0 \
  | parallel -0 -j 32 rsync -aR --info=progress2 {} /mnt/work/
(Requires GNU parallel. Running find from inside the source directory
yields relative paths, and rsync's -R flag preserves them, so the
structure mirrors under /mnt/work.)
If your data is a Mac-side tarball and you don't want to set up a
staging bucket, scp it to the Linux host first, expand under
/tmp/..., and use this recipe. Same total cost as the Mac-folder
path, one fewer service to configure.
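After a local copy, a cheap completeness check is to compare file counts and total bytes on both sides. The sketch below assumes GNU find (for -printf) on the Linux host and a destination that holds nothing except the import; tree_summary and same_tree_summary are hypothetical helper names, not part of any Tonbo Artifacts CLI.

```shell
#!/usr/bin/env bash
# Print "<file-count> <total-bytes>" for a directory tree.
# Relies on GNU find's -printf, assumed to be available on the Linux host.
tree_summary() {
  find "$1" -type f -printf '%s\n' \
    | awk '{ n++; bytes += $1 } END { printf "%.0f %.0f\n", n, bytes }'
}

# Succeed only when both trees agree on file count and total bytes.
same_tree_summary() {
  [ "$(tree_summary "$1")" = "$(tree_summary "$2")" ]
}

# Usage (commented out; paths are the ones used in the recipes above):
#   same_tree_summary /local/source /mnt/work && echo "counts and bytes match"
```

This only compares metadata, so it catches missing or truncated files but not corrupted content; the cold-read check in Validation covers actual reads.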
Validation
Always sanity-check after a bulk import:
# Pick a file that's in your real workload (not synthetic).
TARGET=/mnt/work/<some-real-file>
ls -la "$TARGET"
# Cold read: first time pulls the chunk.
time cat "$TARGET" >/dev/null
# Warm read: should be single-digit ms.
time cat "$TARGET" >/dev/null
After 60 s of idle, confirm zero object-storage errors via the mount's stats file:
sleep 60
grep -E 'object_request_errors|staging_blocks' /mnt/work/.stats
# expected:
# juicefs_object_request_errors 0
# juicefs_staging_blocks 0
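For scripted imports, the same check can fail the job instead of relying on someone reading the grep output. This is a sketch under one assumption: each line of the stats file has the form "<metric-name> <value>", as in the expected output above. check_stats is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Exit non-zero if either counter named above is greater than zero.
# Assumes each stats line is "<metric-name> <value>".
check_stats() {
  awk '
    $1 ~ /(object_request_errors|staging_blocks)$/ && $2 + 0 > 0 {
      print "nonzero: " $0
      bad = 1
    }
    END { exit bad }
  ' "$1"
}

# Usage (commented out so the sketch has no side effects):
#   sleep 60
#   check_stats /mnt/work/.stats && echo "import is clean"
```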
Quick spot-check that the chunks landed in the workspace's bucket (managed example shown; for BYO substitute your bucket + creds):
VOL=$(artifacts --format json workspace show <name> | jq -r .volume_name)
aws s3 ls --recursive --summarize \
s3://tonbo-managed-storage/$VOL/ --region us-west-1 | tail -2
# Total objects + total bytes should roughly match what you imported.
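To make the "roughly match" comparison easier to eyeball, the two summary lines can be reduced to one line in decimal GB. The sketch assumes the plain --summarize output (no --human-readable), where the trailer lines read "Total Objects: N" and "Total Size: N" with the size in bytes; summarize_totals is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Read `aws s3 ls --recursive --summarize` output on stdin and print
# "<objects> objects, <size> GB" (decimal GB, 1 GB = 1e9 bytes).
summarize_totals() {
  awk -F': *' '
    /Total Objects:/ { objects = $2 }
    /Total Size:/    { bytes = $2 }
    END { printf "%s objects, %.2f GB\n", objects, bytes / 1e9 }
  '
}

# Usage (commented out; requires credentials for the bucket):
#   aws s3 ls --recursive --summarize \
#     "s3://tonbo-managed-storage/$VOL/" --region us-west-1 | summarize_totals
# Compare the GB figure with `du -sb` on the data you imported.
```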
What about an artifacts migrate command?
A first-class wrapper that handles all three patterns above with progress, validation, and resume in one CLI is on the v1 list. v0 intentionally leans on the existing tools because they're mature and your Linux host already has them.
If your migration ergonomics are blocking your benchmark or production cutover, ping us. We'll prioritise based on what's actually painful.