Cumulus: Efficient Filesystem Backup to the Cloud

How to Build
------------

Dependencies:
  - libuuid (sometimes part of e2fsprogs)
  - sqlite3

Building should be a simple matter of running "make".  This will produce
an executable called "cumulus".


Setting up Backups
------------------

Two directories are needed for backups: one for storing the backup
snapshots themselves, and one for storing bookkeeping information to go
with the backups.  In this example, the first will be "/cumulus", and
the second "/cumulus.db", but any directories will do.  Only the first
directory, /cumulus, needs to be stored somewhere safe.  The second is
only used when creating new snapshots, and is not needed when restoring.

  1. Create the snapshot directory and the local database directory:
        $ mkdir /cumulus /cumulus.db

  2. Initialize the local database using the provided script schema.sql
     from the source:
        $ sqlite3 /cumulus.db/localdb.sqlite
        sqlite> .read schema.sql
        sqlite> .exit

  3. If encrypting or signing backups with gpg, generate appropriate
     keypairs.  The keys can be kept in a user keyring or in a separate
     keyring just for backups; this example does the latter.
        $ mkdir /cumulus.db/gpg; chmod 700 /cumulus.db/gpg
        $ gpg --homedir /cumulus.db/gpg --gen-key
            (generate a keypair for encryption; enter a passphrase for
            the secret key)
        $ gpg --homedir /cumulus.db/gpg --gen-key
            (generate a second keypair for signing; for automatic
            signing do not use a passphrase to protect the secret key)
     Be sure to store the secret key needed for decryption somewhere
     safe, perhaps with the backup itself (the key protected with an
     appropriate passphrase).  The secret signing key need not be stored
     with the backups (since in the event of data loss, it probably
     isn't necessary to create future backups that are signed with the
     same key).

     To achieve better compression, the encryption key can be edited to
     alter the preferred compression algorithms to list bzip2 before
     zlib.  Run
        $ gpg --homedir /cumulus.db/gpg --edit-key <encryption key>
        Command> pref
            (prints a terse listing of preferences associated with the
            key)
        Command> setpref
            (allows preferences to be changed; copy the same preferences
            list printed out by the previous command, but change the
            order of the compression algorithms, which start with "Z",
            to be "Z3 Z2 Z1" which stands for "BZIP2, ZLIB, ZIP")
        Command> save

    Copy the provided encryption filter program, cumulus-filter-gpg,
    somewhere it may be run from.

  4. Create a script for launching the Cumulus backup process.  A simple
     version is:

        #!/bin/sh
        export LBS_GPG_HOME=/cumulus.db/gpg
        export LBS_GPG_ENC_KEY=<encryption key>
        export LBS_GPG_SIGN_KEY=<signing key>
        cumulus --dest=/cumulus --localdb=/cumulus.db --scheme=test \
            --filter="cumulus-filter-gpg --encrypt" --filter-extension=.gpg \
            --signature-filter="cumulus-filter-gpg --clearsign" \
            /etc /home /other/paths/to/store

    Make appropriate substitutions for the key IDs and any relevant
    paths.  Here "--scheme=test" gives a descriptive name ("test") to
    this collection of snapshots.  It is possible to store multiple sets
    of backups in the same directory, using different scheme names to
    distinguish them.  The --scheme option can also be left out
    entirely.


Backup Maintenance
------------------

Segment cleaning must periodically be done to identify backup segments
that are mostly unused, but are storing a small amount of useful data.
Data in these segments will be rewritten into new segments in future
backups to eliminate the dependence on the almost-empty old segments.

The provided cumulus-util tool can perform the necessary cleaning.  Run
it with
    $ cumulus-util --localdb=/cumulus.db clean
Cleaning is still under development, and so may be improved in the
future, but this version is intended to be functional.

Old backup snapshots can be pruned from the snapshot directory
(/cumulus) to recover space.  A snapshot which is still referenced by
the local database should not be deleted, however.  Deleting an old
backup snapshot is a simple matter of deleting the appropriate snapshot
descriptor file (snapshot-*.lbs) and any associated checksums
(snapshot-*.sha1sums).  Segments used by that snapshot, but not any
other snapshots, can be identified by running the clean-segments.pl
script from the /cumulus directory--this will perform a scan of the
current directory to identify unreferenced segments, and will print a
list to stdout.  Assuming the list looks reasonable, the segments can be
quickly deleted with
    $ rm `./clean-segments.pl`
A tool to make this easier will be implemented later.

The clean-segments.pl script will also print out a warning message if
any snapshots appear to depend upon segments which are not present; this
is a serious error which indicates that some of the data needed to
recover a snapshot appears to be lost.


Listing and Restoring Snapshots
-------------------------------

A listing of all currently-stored snapshots (and their sizes) can be
produced with
    $ cumulus-util --store=/cumulus list-snapshot-sizes

If data from a snapshot needs to be restored, this can be done with
    $ cumulus-util --store=/cumulus restore-snapshot \
        test-20080101T121500 /dest/dir <files...>
Here, "test-20080101T121500" is the name of the snapshot (consisting of
the scheme name and a timestamp; this can be found from the output of
list-snapshot-sizes) and "/dest/dir" is the path under which files
should be restored (this directory should initially be empty).
"<files...>" is a list of files or directories to restore.  If none are
specified, the entire snapshot is restored.


Remote Backups
--------------

The cumulus-util command can operate directly on remote backups.  The
--store parameter accepts, in addition to a raw disk path, a URL.
Supported URL forms are
    file:///path        Equivalent to /path
    s3://bucket/path    Storage in Amazon S3
        (Expects the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
        environment variables to be set appropriately)

To copy backup snapshots from one storage area to another, the
cumulus-sync command can be used, as in
    $ cumulus-sync file:///cumulus s3://my-bucket/cumulus

Support for directly writing backups to a remote location (without using
a local staging directory and cumulus-sync) is slightly more
experimental, but can be achieved by replacing
    --dest=/cumulus
with
    --upload-script="cumulus-store s3://my-bucket/cumulus"


Alternate Restore Tool
----------------------

The contrib/restore.pl script is a simple program for restoring the
contents of a Cumulus snapshot.  It is not as full-featured as the
restore functionality in cumulus-util, but it is far more compact.  It
could be stored with the backup files so a tool for restores is
available even if all other data is lost.

The restore.pl script does not know how to decompress segments, so this
step must be performed manually.  Create a temporary directory for
holding all decompressed objects.  Copy the snapshot descriptor file
(*.lbs) for the snapshot to be restored to this temporary directory.
The snapshot descriptor includes a list of all segments which are needed
for the snapshot.  For each of these snapshots, decompress the segment
file (with gpg or the appropriate program based on whatever filter was
used), then pipe the resulting data through "tar -xf -" to extract.  Do
this from the temporary directory; the temporary directory should be
filled with one directory for each segment decompressed.

Run restore.pl giving two arguments: the snapshot descriptor file
(*.lbs) in the temporary directory, and a directory where the restored
files should be written.