Backups with duplicity and Amazon S3

Last month I accidentally deleted ~700 emails from my inbox. And these seven hundred emails weren’t spam, no, I had read some of them just minutes before. Even worse, I couldn’t figure out what had happened to them. During an offlineimap sync from my mail server to a local hard drive, offlineimap reported that it had deleted my mails, both on the mail server and on my hard drive. After some frustrating hours of trying to recover these mails, I decided to go and find a better way to back up my inbox (and other files).

So these are basically the requirements the new backup routine has to fulfill:

  • Full and incremental backups should be possible
  • Backups should be unattended
  • Backup archives should be compressed and, when stored in insecure locations, encrypted
  • Communications with the backup server should be encrypted
  • The complete archive as well as specific files from the archive should be restorable

There is lots of research going on at my university on cloud computing, and some friends from uni even wrote their final theses on topics in that area. This is how I got in touch with Amazon Web Services’ S3:

Amazon Simple Storage Service provides a fully redundant data storage infrastructure for storing and retrieving any amount of data, at any time, from anywhere on the Web.

S3 is great. You don’t have to keep in mind how much free webspace you have, how you can access your data or what happens when a hard disk fails. You pay only for the amount of storage you actually use (plus the traffic) and can access your data over a SOAP or REST API. That is why I chose S3 as the storage medium for my backups.

The next topic is the software that should actually move the archives to S3. I chose duplicity (thanks for the tip, Christoph) and the wrapper script duply to handle this. It provides everything I need:

Duplicity backs directories by producing encrypted tar-format volumes and uploading them to a remote or local file server. Because duplicity uses librsync, the incremental archives are space efficient and only record the parts of files that have changed since the last backup. Because duplicity uses GnuPG to encrypt and/or sign these archives, they will be safe from spying and/or modification by the server.

Okay, now we have the software and the technology, so let’s take a look at the setup. I’m using Debian Lenny, so to install duplicity we need to adjust the sources.list, because the duplicity package in the main repository is already two years old. To do that, open /etc/apt/sources.list and add this at the end:

deb http://www.backports.org/debian lenny-backports main contrib non-free

We will then update and install it with aptitude as usual:

aptitude update
aptitude -t lenny-backports install duplicity

To use the Amazon S3 service with duplicity, we have to install the python-boto library as well:

aptitude install python-boto
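
If you want to make sure the two pieces are actually in place before going any further, a quick optional sanity check is enough (these commands are just a check, nothing duply itself needs):

python -c "import boto"    # no output means the S3 library is importable
duplicity --version        # should report the version installed from backports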

That is great so far, but duplicity is a complex piece of software, which is why there are lots of wrapper scripts available on the internet. We want to use duply. Since duply is not in the Debian repositories, we have to install it by hand. Grab a copy from duply.net, unpack it and move it to a location you like (as long as it is available in your PATH):

# download the latest duply_*.tgz from http://sourceforge.net/projects/ftplicity/
tar -xvf duply_*.tgz
cp duply_*/duply ~/bin/duply
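
Since everything later relies on the script being found, a quick optional check doesn’t hurt:

which duply    # should print the path you copied it to, e.g. /root/bin/duply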

The next step is to configure the two. To create the folder structure and the configuration files for duply, simply run:

mkdir -p /etc/duply
duply s3backup create

This will create a directory either in ~/.duply/ (when run as a normal user) or in /etc/duply (when run as root) with a default conf file. We can now edit the configuration to fit our needs:

GPG_KEY='12345678'
GPG_PW='averylongpassword'
TARGET='s3+http://[user:password]@bucket_name[/prefix]'
SOURCE='/'
MAX_FULLBKP_AGE=1M
DUPL_PARAMS="$DUPL_PARAMS --full-if-older-than $MAX_FULLBKP_AGE"
VOLSIZE=100
DUPL_PARAMS="$DUPL_PARAMS --volsize $VOLSIZE"
DUPL_PARAMS="$DUPL_PARAMS --s3-use-new-style --s3-european-buckets"

With lines one and two we specify the GnuPG key and passphrase used to encrypt the archives. The passphrase is saved in plaintext, but the file can only be read with root privileges. If that is too insecure for you, the GnuPG FAQ describes another way to use your gpg key in an automated environment. Line 3 specifies that we want to use Amazon’s S3 service over HTTPS (despite the URL saying http). You get your username and password from your AWS account, where they are known as the Access Key ID and the Secret Access Key. The bucket name is a unique identifier throughout S3 and cannot be changed later, so choose wisely.

Then we have to select the source, which I set to ‘/’, meaning back up everything under root (more on that later). The next two variables tell duplicity to create a full backup every month and, by default, remove all prior backups except one (you can change that by setting MAX_AGE and MAX_FULL_BACKUPS, see the sketch below). With VOLSIZE you set the size of each chunk (the default is 25 MByte). This is important because S3 doesn’t allow files bigger than 5 GByte: a small value makes duplicity create a huge bunch of files, which can speed up restoring, while a larger value can save some time during the backup process. Finally, we have to add the parameters ‘--s3-use-new-style --s3-european-buckets’ to tell S3 that I want to store my files in the European region (Ireland).
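
If keeping only one prior full backup is too aggressive for you, the retention can be tuned in the same conf file. The values below are just an illustrative sketch, not what I use:

MAX_AGE=3M             # example: delete backup chains older than three months
MAX_FULL_BACKUPS=2     # example: keep at most two full backups (plus their increments)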

Now back to the data we want to back up. The file /etc/duply/s3backup/exclude (you have to create that file by hand) is the global exclude filelist in duplicity’s globbing syntax. You specify with - (minus) what to exclude and with + (plus) what to include; by default, everything under SOURCE is included. For more information on the syntax, read the manpage; a small example follows below. You can also put commands for preparation, like dumping a database to a file, into /etc/duply/s3backup/pre. Likewise, /etc/duply/s3backup/post is for doing some work after the backup, like sending a mail to the admin.
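
To make this concrete, here is a minimal exclude file; the paths are only an example, assuming you care about /etc, /home and your mail spool:

+ /etc
+ /home
+ /var/mail
- **

And a hypothetical pre script (the database and dump path are made up, drop this entirely if you have nothing to prepare):

#!/bin/sh
# dump all databases so the backup contains a consistent snapshot
mysqldump --all-databases > /var/backups/mysql-all.sql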

As I mentioned earlier, I want to do incremental backups every day. To do that, create a file duply-s3backup in /etc/cron.daily/ and make it executable (chmod +x):

#!/bin/sh
#
# duply s3backup cron daily

/root/bin/duply s3backup backup_verify_purge --force > /var/log/duplicity.log 2>&1
exit 0

If you need to save bandwidth you can remove “verify”, but if you want to know that you have a consistent backup, you shouldn’t. The script logs to /var/log/duplicity.log, and because we don’t want to waste disk space, logrotate should manage the duplicity logfiles. To do that, just create a file duplicity in /etc/logrotate.d/ with the following:

/var/log/duplicity.log {
       weekly
       rotate 14
       compress
}

This tells logrotate to rotate the logfile every week, compress the old ones and delete them after 14 rotations (about three months).
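
If you want to see what logrotate would do without actually rotating anything, you can run it in debug mode against just this snippet:

logrotate -d /etc/logrotate.d/duplicity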

You can now run the script for the first time; this will create a full backup. Depending on the size of your backup, your system and your internet connection, you can most certainly grab a coffee or two.

/etc/cron.daily/duply-s3backup

To verify that the script did what it was supposed to do, run:

duply s3backup status

You can also take a look at the AWS Management Console, where duplicity should have created a bucket containing your first backup.

We now have a great setup to back up our data. Duplicity will do an incremental backup every day, a full backup every month, and will remove the old and unused archives from the backup server every time cron runs the script. But we don’t do backups just to have backups; we back up our data so we can restore it in some way if the system crashes. And duplicity is really smart here. To restore the whole archive, simply run

duply s3backup restore ./restore-dir 7D

and you’ll get the complete archive from seven days ago, decrypted, in the folder restore-dir. If you want to restore just one specific file, use

duply s3backup fetch etc/passwd /etc/passwd-yesterday 1D

This will restore the file passwd from one day ago to /etc/passwd-yesterday. Duply and duplicity have a lot more powerful commands, which are documented very well in the manpages.
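
A few of the commands I find particularly handy, as a quick sketch (the exact behaviour is described in the duply usage text):

duply s3backup list             # list the files contained in the latest backup
duply s3backup verify           # compare the backup against the local files
duply s3backup purge --force    # delete backup sets older than MAX_AGE (omit --force to only list them)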

The only thing you have to do now is to copy the duply confdir (/etc/duply or ~/.duply) and your gpg key to a safe place. If the whole system crashes, you won’t be able to restore your backups unless you have the configuration and your private key.
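
A minimal sketch of how that could look, using the placeholder key id ‘12345678’ from the conf example above (the filenames are made up; store the result somewhere off this machine):

# export the key pair used to encrypt the backups
gpg --armor --export 12345678 > duply-backup-pub.asc
gpg --armor --export-secret-keys 12345678 > duply-backup-sec.asc
# bundle it with the duply configuration
tar -czf duply-recovery.tgz /etc/duply duply-backup-pub.asc duply-backup-sec.asc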
