Shasplit
Efficient backups by SHA-based data splitting

Shasplit takes a large data block, splits it into smaller parts, and puts those into an SHA-based content-addressed store. Reassembling those parts is a trivial cat invocation. Repeating parts (e.g. from previous split operations) are stored only once, which allows for efficient incremental backups of whole LVM snapshots via rsync. Shasplit shows its strengths on encrypted block devices, but might be useful for non-encrypted data, too.

Have fun!

Preparation

Installation for a single user (assuming that ~/bin is in PATH):

git clone https://github.com/vog/shasplit.git
ln -s "$(pwd)/shasplit/shasplit.py" ~/bin/shasplit

Shasplit stores everything in the ~/.shasplit directory.

By default, Shasplit splits the data into parts of 4 MiB each and hashes each part with SHA-256, but it works equally well with any other part size and any other cryptographically secure hash algorithm.
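To illustrate the idea, here is a minimal Python sketch of splitting a stream into fixed-size parts and storing each part under its SHA-256 digest. It is not Shasplit's actual code: the function names split_into_store and reassemble and the flat store directory are assumptions made for this example only.

```python
import hashlib
import io
import os
import tempfile

PART_SIZE = 4 * 1024 * 1024  # 4 MiB, the default part size named above


def split_into_store(stream, store_dir, part_size=PART_SIZE):
    """Split a binary stream into parts and store each under its SHA-256 name.

    Returns the ordered list of part hashes. The flat store layout used here
    is an assumption for illustration, not Shasplit's real layout.
    """
    os.makedirs(store_dir, exist_ok=True)
    hashes = []
    while True:
        part = stream.read(part_size)
        if not part:
            break
        digest = hashlib.sha256(part).hexdigest()
        path = os.path.join(store_dir, digest)
        if not os.path.exists(path):  # a repeated part is stored only once
            tmp_path = path + ".tmp"
            with open(tmp_path, "wb") as f:
                f.write(part)
            os.replace(tmp_path, path)  # rename so a part never appears half-written
        hashes.append(digest)
    return hashes


def reassemble(hashes, store_dir):
    """Concatenate the stored parts in order, like a cat invocation."""
    chunks = []
    for digest in hashes:
        with open(os.path.join(store_dir, digest), "rb") as f:
            chunks.append(f.read())
    return b"".join(chunks)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as store:
        data = b"A" * 8 + b"B" * 8 + b"A" * 8
        hashes = split_into_store(io.BytesIO(data), store, part_size=8)
        # Three parts were read, but the repeated "A" part is stored once.
        print(len(hashes), len(os.listdir(store)))
        assert reassemble(hashes, store) == data
```

The tiny part size in the demo only keeps the example readable. With the default of 4 MiB, the same logic applies to a block device read from standard input.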

Backup

Add a new backup from /dev/vg0/foobar with name foobar, keeping at most 7 completed backups:

shasplit add foobar 7 < /dev/vg0/foobar
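The retention limit in the command above can be pictured with the following Python sketch. It is only an illustration under assumptions: the function name instances_to_prune is hypothetical, instance names are assumed to be timestamps in one consistent format (so that sorting them as strings sorts them by time), and the distinction between completed and incomplete backups is ignored. How Shasplit actually selects old backups for removal may differ.

```python
def instances_to_prune(instance_names, max_keep):
    """Return the oldest instance names beyond the newest max_keep.

    Assumes names are ISO-8601-like timestamps in one consistent format, so
    lexicographic order equals chronological order.
    """
    if max_keep < 0:
        raise ValueError("max_keep must be non-negative")
    ordered = sorted(instance_names)
    keep_from = max(len(ordered) - max_keep, 0)
    return ordered[:keep_from]
```

For example, with three instances and a limit of 2, only the oldest name is returned. With a limit of 7, nothing is returned.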

If ~/.shasplit is located on a remote file system such as NFS or SSHFS, you are done. Otherwise, you'll have to sync the ~/.shasplit directory to the backup system.

When using rsync, you should use the options -W for improved performance and --delete-after to keep the old backups until the new backups are complete:

rsync -aW --delete-after ~/.shasplit/ backup@backupserver:.shasplit/

Status and Checks

Show status information for all instances:

shasplit status

(Not yet implemented) Perform a thorough integrity check and report all parts and instances that are incomplete or inconsistent:

shasplit check

(Not yet implemented) Run the internal tests:

shasplit test

Recovery

Recover the latest complete backup of foobar with:

shasplit recover foobar > /dev/vg0/foobar

Recover the backup of foobar at 2013-05-23T03:42:42:

shasplit recover foobar 2013-05-23T03:42:42 > /dev/vg0/foobar

Manual Recovery and Checking

If Shasplit is not available on the target system, it is very simple to recover your data manually, using standard Unix tools.

First, you have to decide which instance you want to look at:

cd ~/.shasplit/foobar/2013-05-23T034242

Then, recovery is a trivial cat invocation:

cat */* > /dev/vg0/foobar

Before recovery, you may want to run a fast check for completeness by hand:

wc -c */*; cat size

To be safe, you can also run an integrity check for that instance by hand:

cat */* | shasum -a 256; cat hash
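If Python is available on the target system, the two manual checks above can be combined into one script. This is a sketch under assumptions: the function name check_instance is hypothetical, the size file is assumed to contain a decimal byte count, and the hash file is assumed to start with a hex SHA-256 digest. Neither file format is confirmed here, so adjust the parsing if yours differ.

```python
import glob
import hashlib
import os


def check_instance(instance_dir):
    """Compare the concatenated parts with the recorded size and hash.

    Returns a pair (size_ok, hash_ok). Assumes `size` holds a decimal byte
    count and `hash` starts with a hex SHA-256 digest; both formats are
    guesses made for this sketch.
    """
    # sorted() approximates the shell's glob order (exact in the C locale).
    pattern = os.path.join(glob.escape(instance_dir), "*", "*")
    part_paths = sorted(glob.glob(pattern))
    sha = hashlib.sha256()
    total = 0
    for path in part_paths:
        with open(path, "rb") as f:
            # Read in 1 MiB blocks so large parts do not need to fit in memory.
            for block in iter(lambda: f.read(1024 * 1024), b""):
                sha.update(block)
                total += len(block)
    with open(os.path.join(instance_dir, "size")) as f:
        expected_size = int(f.read().split()[0])
    with open(os.path.join(instance_dir, "hash")) as f:
        expected_hash = f.read().split()[0].lower()
    return total == expected_size, sha.hexdigest() == expected_hash
```

Calling check_instance on the instance directory chosen earlier returns two booleans, one for the size comparison and one for the hash comparison.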

Debugging

You can always enable debug output via the SHASPLIT_DEBUG environment variable:

SHASPLIT_DEBUG=1 shasplit add foobar 7 < /dev/vg0/foobar

Directory Layout

Design goals:

  1. Trivial recovery by hand via cat
  2. Integrity check possible by hand
  3. Human-readable directory layout
  4. Support for large, possibly encrypted, block devices
  5. Efficient transfer by standard tools like rsync, NFS or SSHFS (i.e. reuse repeating parts from previous splits)
  6. Fast check for completeness (i.e. robust against interrupted transfers)
  7. Avoid creating more than 4096 entries per directory
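To make goals 1 and 7 concrete, the following Python sketch shows one way a part index could be mapped to a two-level path while staying within 4096 entries per directory. The naming scheme (three lowercase hex digits per level) and the function names are assumptions for illustration and are not taken from Shasplit's actual layout. Fixed-width names have the useful property that lexicographic order, as produced by a */* glob, matches part order.

```python
ENTRIES_PER_DIR = 4096  # limit from design goal 7; equals 16**3
PART_SIZE = 4 * 1024 * 1024  # default part size, 4 MiB


def part_path(index):
    """Map a part index to a hypothetical 'dir/file' name, 3 hex digits each."""
    if not 0 <= index < ENTRIES_PER_DIR ** 2:
        raise ValueError("index does not fit into two levels")
    high, low = divmod(index, ENTRIES_PER_DIR)
    return "%03x/%03x" % (high, low)


def max_instance_bytes():
    """Largest instance two such levels can address at the default part size."""
    return ENTRIES_PER_DIR ** 2 * PART_SIZE
```

Under these assumptions, two levels address 4096 * 4096 parts, which at 4 MiB per part corresponds to 64 TiB per instance.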

Base directory layout:

Directory layout of each instance:

Directory layout of .data:
