This document aims to describe the goals and constraints of the Cumulus design. ======================================================================== OVERALL GOALS: Efficient creation and storage of multiple backup snapshots of a filesystem tree. Logically, each snapshot is self-contained and stores the state of all files at a single point in time. However, it should be possible for snapshots to share the same underlying storage, so that data duplicated in many snapshots need not be stored multiple times. It should be possible to delete old snapshots, and recover (most of) the storage associated with them. It must be possible to delete old backups in any order; for example, it must be possible to delete intermediate backups before long-term backups. It should be possible to recover the files in a snapshot without transferring significantly more data than that stored in the files to be recovered. CONSTRAINTS: The system should not rely upon a smart server at the remote end where backups are stored. It should be possible to create new backups using a single primitive: StoreFile, which stores a string of bytes at the backup server using a specified filename. Thus, backups can be run over any file transfer protocol, without requiring special software be installed on the storage server. ======================================================================== DESIGN APPROACHES STORING INCREMENTAL BACKUPS One simple approach is to simply store a copy of every file one the remote end, and construct a listing which tells where each file in the source ends up on the remote server. For subsequent backups, if a file is unchanged, the listing can simply point to the location of the file from the previous backup. Deleting backups is simple: delete the listing file for a particular snapshot, then garbage collect all files which are no longer referenced. This approach does not as efficiently handle partial changes to large files. If a file is changed at all, it needs to be transferred in its entirety. One approach is to represent intra-file changes by storing patches. The original file is kept, and a smaller file is transferred that stores the differences between the original and the new. Some care is needed, however. A series of small changes could accumulate over many snapshots. If each snapshot refers to the original file, much data will be duplicated between the patches in different snapshots. If each patch can refer to previous patches as well, a long chain of patches can build up, which complicates removing old backups to reclaim storage. An alternative approach is to break files apart into smaller units (blocks) and to represent files in a snapshot as the concatenation of (possibly many) blocks. Small change to files can be represented by replacing a few of the blocks, but referring to most blocks used in the old file directly. Some care is needed with this approach as well--there is additional overhead needed to specify even the original file, since the entire list of blocks must be specified. If the block size is too small, this can lead to a large overhead, but if the block size is too large, then sharing of file data may not be achieved. In this scheme, data blocks do not depend on other data blocks, so chains of dependencies do not arise as in the incremental patching scheme. Each snapshot is independent, and so can easily be removed. One minor modification to this scheme is to permit the list of blocks to specify that only a portion of a block should be used to reconstruct a file; if, say, only the end of a block is changed, then the new backup can refer to most of the old block, and use a new block for the small changed part. Doing so does allow the possibility that a block might be kept around even though a portion of it is being used, leading to wasted space. DATA STORAGE The simplest data storage format would place each file, patch, or block in a separate file on the storage server. Doing so maximizes the ability to reclaim storage when deleting old snapshots, and minimizes the amount of extra data that must be transferred to recover a snapshot. Any other format which combines data from multiple files/patches/blocks together risks having needed data grouped with unwanted data. However, there are reasons to consider grouping, since there is overhead associated with storing many small files. In any transfer protocol which is not pipelined, transferring many small files may be slower than transferring the same quantity of data in larger files. Small files may also lead to more wasted storage space due to internal fragmentation. Grouping files together gives the chance for better compression, taking advantage of inter-file similarity. Grouping is even more important if the snapshot format breaks files apart into blocks for storage, since the number of blocks could be far larger than the number of files being backed up. ======================================================================== SELECTED DESIGN At a high level, the selected design stores snapshots by breaking files into blocks for storage, and does not use patches. These data blocks, along with the metadata fragments (collectively, the blocks and metadata are referred to as objects) are grouped together for storage purposes (each storage group is called a segment). TAR is chosen as the format for grouping objects together into segments rather than inventing a new format. Doing so makes it easy to manipulate the segments using other tools, if needed. Data blocks for files are stored as-is. Metadata is stored in a text format, to make it more transparent. (This should make debugging easier, and the hope is that this will make understanding the format simpler.)