Replacing rsync with restic for production backups: lessons from a 4 TB office migration

For years our office file server was backed up the boring way: a cron job, rsync -aH --link-dest, and a rotation of hardlinked snapshots on a second disk. It worked. It was also showing its age — the dataset crossed 4 TB, the snapshot tree had millions of inodes, and a single du could take half an hour. We finally migrated to restic and learned a few things worth writing down.

Why we moved off rsync

rsync with --link-dest is elegant: each snapshot is a directory tree of hardlinks to unchanged files, and only changed files consume new space. The problems start when the dataset grows:

Inode pressure. Tens of snapshots times millions of files means tens of millions of hardlinks. find, du, and even ls -R become painfully slow. Filesystem checks on the backup volume turned into multi-hour events.
No content-level dedup. rsync sees files. Rename a 2 GB Outlook PST and you store it twice. Move a folder and the next snapshot is huge.
No encryption at rest. Fine for a backup disk in the same rack, not fine for offsite copies on rented S3.
Pruning is awkward. Deleting an old snapshot is just rm -rf, but if you want to prune a single file across snapshots — you can’t, really.
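
For context, one nightly run of the hardlink scheme described above might look roughly like the sketch below. The function name, the directory layout, and the date-named snapshot folders are assumptions for illustration; only the rsync flags come from our old job.

```shell
#!/bin/sh
# Sketch of one run of a hardlink snapshot scheme.
# All paths and the date-based naming are hypothetical.

snapshot_with_link_dest() {
  src="$1"        # e.g. /srv/office
  dest_root="$2"  # e.g. /mnt/backup/snapshots (use an absolute path)
  today="$3"      # e.g. 2024-05-02
  previous="$4"   # e.g. 2024-05-01

  # Files unchanged since the previous snapshot become hardlinks into it;
  # only new or modified files take up new space.
  rsync -aH \
    --link-dest="$dest_root/$previous" \
    "$src/" "$dest_root/$today/"
}
```

Every snapshot is a full directory tree, which is where the inode count comes from: each file gets a new directory entry per snapshot even when no data is copied.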

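restic addresses all four: chunks are bundled into pack files, so snapshots no longer multiply inodes; content-defined chunking deduplicates at the chunk level, so renames and moves cost almost nothing; everything is encrypted with AES-256; and retention is handled by a real forget/prune model. It also has an S3-compatible backend, which the offsite copy needed.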

The migration plan

We ran rsync and restic in parallel for two weeks before cutting over. The rough plan:

Provision a restic repo on object storage and a second one on a local NAS over SFTP (the 3-2-1 rule still applies).
Seed the first full backup from the file server directly, not from the rsync mirror — we wanted original mtimes and ACLs.
Schedule incremental backups in parallel with the existing rsync job for two weeks.
Run a sample restore test from each repo, end to end, before retiring rsync.
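
The restore test in the last step can be sketched as a small function. This is a minimal version under stated assumptions: RESTIC_PASSWORD_FILE is already exported, and the repository URL, subdirectory, and scratch path are placeholders supplied by the caller. The function name is made up for this example.

```shell
#!/bin/sh
# Sketch of a restore drill. Assumes RESTIC_PASSWORD_FILE is exported.
# The three arguments are placeholders chosen by the caller.

restore_drill() {
  repo="$1"     # e.g. s3:https://s3.example.net/office-backup
  subdir="$2"   # absolute path that was backed up
  scratch="$3"  # empty directory on a scratch volume

  # Restore only the chosen subdirectory from the most recent snapshot.
  restic -r "$repo" restore latest --target "$scratch" --include "$subdir" || return 1

  # restic recreates the original absolute path under the target directory.
  # Files changed since the snapshot will show up as differences here.
  diff -r "$subdir" "$scratch$subdir"
}
```

A non-zero return means either the restore failed or the restored tree differs from the live one. On a busy share some differences are expected, so the diff output needs a human look rather than a blind pass/fail.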

Initializing the repos

export RESTIC_PASSWORD_FILE=/etc/restic/passwd
export RESTIC_REPOSITORY=s3:https://s3.example.net/office-backup

restic init

Keep that password file readable only by root (chmod 600) and store the password somewhere outside the backup itself. A restic repo without its password is unrecoverable — that is the whole point of the encryption, and it is also the easiest way to lose your data.
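
One way to create that file with tight permissions from the start is sketched below. The function name and the 32-byte password length are arbitrary choices for this example, not something restic requires.

```shell
#!/bin/sh
# Sketch: generate a random repository password and store it with
# owner-only permissions. The 32-byte length is an arbitrary choice.

make_password_file() {
  path="$1"   # e.g. /etc/restic/passwd
  (
    umask 077   # create the file without group/other access
    head -c 32 /dev/urandom | base64 > "$path"
  )
  chmod 600 "$path"   # enforce the mode even if the file already existed
}
```

Run it once as root before restic init, then copy the generated password into whatever escrow you use. The file on the backup host is a convenience, not the only copy.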

The first backup

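Four terabytes over a 1 Gbps uplink to S3 is not fast. Even at full line rate the transfer needs about nine hours (4 × 10^12 bytes × 8 bits ÷ 10^9 bits/s = 32,000 s), and our first full took the better part of two days. Two flags mattered: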

restic backup /srv/office \
  --exclude-file=/etc/restic/excludes \
  --one-file-system \
  --tag full --tag initial

--one-file-system keeps you from accidentally walking into a mounted SMB share or a /proc bind mount. The exclude file pulled out the obvious noise: *.tmp, ~$* Office lock files, browser caches, Thunderbird ImapMail directories. Trim aggressively here — every excluded file is bytes you don’t pay to store and seconds you don’t wait to restore.
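
An exclude file along these lines could be generated as follows. The patterns are illustrative guesses at the categories named above, not our actual file; the Cache entry in particular is a placeholder that needs adjusting per browser. One detail worth knowing: restic expands environment variables in exclude files, so a literal dollar sign has to be written as $$.

```shell
#!/bin/sh
# Sketch: write an exclude file for restic backup --exclude-file.
# The patterns are examples only; adjust them to the dataset.

write_excludes() {
  path="$1"   # e.g. /etc/restic/excludes
  # The quoted 'EOF' stops the shell from expanding anything in the body.
  cat > "$path" <<'EOF'
# Temporary files
*.tmp
# Office lock files (~$name.docx); "$$" is a literal "$" for restic
~$$*
# Browser caches (placeholder name, varies by browser)
Cache
# Thunderbird IMAP mail, which can be re-synced from the server
ImapMail
EOF
}
```

A pattern without a leading slash matches at any depth, so ImapMail excludes every directory of that name under /srv/office.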

What surprised us

Dedup ratio was better than advertised

On an office dataset full of duplicated PDFs, copies of the same Excel template, and several years of email exports, restic’s logical-to-physical ratio settled around 2.3x after a month of daily snapshots. rsync’s hardlink approach gets nowhere near that, because it dedupes only across snapshots, not within them.
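
A ratio like this can be computed from two byte counts. The sketch below is one possible way to do it, assuming jq is installed and using restic stats in its restore-size and raw-data modes; the function name is hypothetical, and the exact definition of "logical" size depends on which snapshots are included.

```shell
#!/bin/sh
# Sketch: turn a logical and a physical byte count into a dedup ratio.

dedup_ratio() {
  logical="$1"    # bytes a restore would write
  physical="$2"   # bytes actually stored in the repo
  awk -v l="$logical" -v p="$physical" 'BEGIN { printf "%.1fx\n", l / p }'
}

# Possible way to obtain the inputs (assumes jq is available):
#   logical="$(restic stats --mode restore-size --json | jq .total_size)"
#   physical="$(restic stats --mode raw-data --json | jq .total_size)"
#   dedup_ratio "$logical" "$physical"
```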

Backups got faster after the first one

The initial scan was I/O bound. Subsequent runs re-read only files whose metadata changed since the previous snapshot (restic compares mtime, ctime, size, and inode number by default), then chunk just those. A typical nightly run on our dataset finishes in under 20 minutes, most of it spent walking the tree.

Repo locking is a real operational concern

restic uses lock files in the repo to coordinate writers. If a backup process is killed hard — OOM, network drop, someone yanking a cable — the lock can be left behind. Future runs refuse to start until you intervene:

restic unlock

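By default unlock removes only locks that restic considers stale; --remove-all clears every lock, and that is the dangerous one. Only use it when you are sure no other restic process is touching the repo. Clearing the lock of a live backup lets a prune run alongside it, and that combination can damage the repo.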

We wrap our cron job in a flock and a timeout, and alert if a run exits non-zero. Two failed runs in a row page someone.
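
A wrapper of that shape can be sketched as below. Every path, the six-hour limit, and the helper names are assumptions for illustration, and the paging command is left out because it depends entirely on your alerting stack.

```shell
#!/bin/sh
# Sketch of a cron wrapper. Paths, the 6h limit, and the threshold of 2
# are illustrative assumptions.

STATE_FILE="${STATE_FILE:-/var/lib/restic/consecutive_failures}"
LOCK_FILE="${LOCK_FILE:-/var/lock/restic-backup.lock}"

# Run a command under an exclusive, non-blocking lock with a time limit.
# flock -n fails immediately if a previous run still holds the lock.
run_locked() {
  flock -n "$LOCK_FILE" timeout 6h "$@"
}

# Update the consecutive-failure counter from an exit status and print it.
record_result() {
  status="$1"
  count=0
  [ -f "$STATE_FILE" ] && count="$(cat "$STATE_FILE")"
  if [ "$status" -eq 0 ]; then
    count=0
  else
    count=$((count + 1))
  fi
  echo "$count" > "$STATE_FILE"
  echo "$count"
}

# Succeed when the failure count has reached the paging threshold.
should_page() {
  [ "$1" -ge 2 ]
}

# Usage:
#   run_locked restic backup /srv/office
#   failures="$(record_result "$?")"
#   should_page "$failures" && echo "page the on-call person here"
```

The lock here only keeps two copies of the cron job from overlapping on one host. It is separate from restic's own repository locks, which still apply.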

Pruning is expensive — schedule it

restic forget is cheap: it only removes snapshot records and leaves the data they referenced in place. restic prune is the one that actually reclaims space, and on a multi-TB repo it rewrites pack files and can run for hours. We split them:

# nightly, after backup
restic forget \
  --keep-daily 14 \
  --keep-weekly 8 \
  --keep-monthly 12 \
  --keep-yearly 3 \
  --prune=false

# weekly, Sunday morning
restic prune --max-unused 10%

--max-unused 10% lets restic leave up to 10% of the repo as unreferenced data instead of repacking every pack file that contains a little garbage. It trades a little extra storage for a much shorter prune window.
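
Before trusting a retention policy with real snapshots, it can be previewed. The sketch below wraps the same keep rules in a dry run; the function name is made up for this example, and it assumes RESTIC_REPOSITORY and RESTIC_PASSWORD_FILE are exported.

```shell
#!/bin/sh
# Sketch: show what the retention policy would keep and remove,
# without changing the repository.

preview_retention() {
  restic forget --dry-run \
    --keep-daily 14 \
    --keep-weekly 8 \
    --keep-monthly 12 \
    --keep-yearly 3
}
```

With these rules a single host and path set ends up with at most 14 + 8 + 12 + 3 = 37 snapshots, usually fewer, because one snapshot can satisfy several rules at once.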

check is not optional

The whole value proposition of an encrypted, deduped repo evaporates if a single corrupted pack file silently breaks a chain of snapshots. Schedule integrity checks:

# fast: metadata only, nightly
restic check

# slow: re-reads a subset of pack data, monthly
restic check --read-data-subset=10%

Do a full --read-data at least once a quarter if your repo size allows it. We learned this after finding — during a restore drill, not a real incident — that a stale S3 multipart upload had left one pack unreadable. check would have caught it weeks earlier.
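
A variant of the subset check is to rotate through fixed slices, so that every pack gets read over a known number of runs. restic accepts --read-data-subset=n/t for this. The sketch below picks the slice from the ISO week number; the helper name and the choice of four slices are assumptions, not what we run.

```shell
#!/bin/sh
# Sketch: map a week number to an "n/t" slice for
# restic check --read-data-subset, cycling through all t slices.

subset_for_week() {
  week="${1#0}"   # strip a leading zero so "08" is not read as octal
  total="$2"      # number of slices the repo is divided into
  echo "$(( $week % $total + 1 ))/$total"
}

# Usage (weekly cron job, four slices):
#   restic check --read-data-subset="$(subset_for_week "$(date +%V)" 4)"
```

Unlike a random percentage, fixed slices guarantee full coverage after t runs, at the cost of a longer check each time.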

The runbook we wish we had

After the dust settled, we wrote a short operational checklist and pinned it in the wiki:

Two repos, always. One offsite (S3-compatible), one on-prem (SFTP or a local disk). Back up to both from the same host.
Password escrow. Repo passwords in the password manager and printed in the safe. Test recovery from the printed copy once a year.
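Monitor exit codes. Exit 0 means the snapshot is complete; restic backup exits 3 when it saved a snapshot but could not read some source files, and 1 on a fatal error with no snapshot. Treat 3 as a failure worth looking at, not as a warning. Parse --json output if you need finer-grained alerting.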
Test restores monthly. Pick a random snapshot, restore a random subdirectory to a scratch volume, diff a few files. If you have not restored from a backup, you do not have a backup.
Document the repo layout. Future-you, at 2 a.m., needs to know where the repo lives, what backend it uses, and how to authenticate without reading source code.
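
For the exit-code item, a small classifier keeps the alerting logic in one place. This is a sketch: the function name and the three labels are invented for illustration, and it relies on restic backup returning 0 for a complete snapshot and 3 for a snapshot with unreadable source files.

```shell
#!/bin/sh
# Sketch: classify the exit status of "restic backup" for alerting.

classify_backup_exit() {
  case "$1" in
    0) echo "ok" ;;          # snapshot complete
    3) echo "incomplete" ;;  # snapshot saved, some source files unreadable
    *) echo "failed" ;;      # fatal error, timeout, lock failure, and so on
  esac
}

# Usage:
#   restic backup /srv/office
#   result="$(classify_backup_exit "$?")"
#   [ "$result" = "ok" ] || echo "backup $result" >&2
```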

Would we do it again?

Yes, without hesitation — but we would budget more time for the initial seed and write the prune schedule on day one rather than discovering it on day thirty when the repo had grown larger than the source. rsync is still a fine tool for what it does. For a backup system that needs encryption, dedup, and a real retention policy, restic earns its keep.