Automating Backups in AWS
In Day 9’s post we looked at some ideas for doing proper backups when using AWS services.
In today’s post we’ll take a hands-on approach, automating the resource creation and the actions needed to achieve these kinds of backups, using some bash scripts and the Boto Python library for AWS.
Ephemeral Storage to EBS volumes with rsync
Since IO performance is key for many applications and services, it is common to use your EC2 instance’s ephemeral storage and Linux software RAID for your instance’s local data storage. While EBS volumes can have erratic performance, they are useful for providing backup storage that isn’t tied to your instance but is still accessible through a filesystem.
The approach we’re going to take is as follows:
- Create a software RAID1 from two EBS volumes and mount it as /backups
- Write a shell script to rsync your data directories to /backups
- Set the shell script up to run as a cron job
Making the EBS volumes
Adding the EBS volumes to your instance can be done with a simple Boto script:
add-volumes.py
```python
#!/usr/bin/env python
# creates two EBS volumes and attaches them to an instance
# assumes you already know the instance-id

import boto.ec2
import time

INSTANCE_ID = "i-0c7abe3e"
REGION = "us-west-2"
VOLUME_SIZE = "5"          # in gigabytes
VOLUME_AZ = "us-west-2a"   # should be same as instance AZ

# adjust these based on your instance types and number of disks
VOL1_DEVICE = "/dev/sdh"
VOL2_DEVICE = "/dev/sdi"

c = boto.ec2.connect_to_region(REGION)

# create your two volumes
VOLUME1 = c.create_volume(VOLUME_SIZE, VOLUME_AZ)
time.sleep(5)
print "created", VOLUME1.id

VOLUME2 = c.create_volume(VOLUME_SIZE, VOLUME_AZ)
time.sleep(5)
print "created", VOLUME2.id

# attach volumes to your instance
VOLUME1.attach(INSTANCE_ID, VOL1_DEVICE)
time.sleep(5)
print "attaching", VOLUME1.id, "to", INSTANCE_ID, "as", VOL1_DEVICE

VOLUME2.attach(INSTANCE_ID, VOL2_DEVICE)
time.sleep(5)
print "attaching", VOLUME2.id, "to", INSTANCE_ID, "as", VOL2_DEVICE
```
Once you’ve run this script you’ll have two new volumes attached as local devices on your EC2 instance.
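The fixed sleeps in add-volumes.py are a simplification. If you want something more robust, you can poll the volume’s state before attaching it. Here’s a minimal sketch, assuming the same connection and variables as add-volumes.py; the wait_for_available helper and its timeout are just illustrative.

```python
# sketch: poll a new volume's state instead of sleeping for a fixed interval,
# reusing c, VOLUME_SIZE, VOLUME_AZ, INSTANCE_ID and VOL1_DEVICE from add-volumes.py
import time

def wait_for_available(volume, timeout=120):
    # volume.update() refreshes volume.status ('creating' -> 'available')
    waited = 0
    while volume.status != 'available' and waited < timeout:
        time.sleep(2)
        waited += 2
        volume.update()
    return volume.status == 'available'

VOLUME1 = c.create_volume(VOLUME_SIZE, VOLUME_AZ)
if wait_for_available(VOLUME1):
    VOLUME1.attach(INSTANCE_ID, VOL1_DEVICE)
```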
Making the RAID1
Now you’ll want to create a RAID1 array from the two EBS volumes and put a filesystem on it.
The following shell script takes care of this for you:
make-raid1-format.sh
```bash
#!/bin/bash
VOL1="/dev/sdh"
VOL2="/dev/sdi"
MOUNTPOINT="/backups"
VOLUME_SIZE="5GB"

# label and partition each volume
parted $VOL1 mklabel gpt
parted $VOL1 mkpart primary 0GB $VOLUME_SIZE
parted $VOL2 mklabel gpt
parted $VOL2 mkpart primary 0GB $VOLUME_SIZE
partprobe $VOL1
partprobe $VOL2

# build the RAID1 from the first partition on each volume
mdadm --create /dev/md0 --level=1 --raid-devices=2 ${VOL1}1 ${VOL2}1

# create a filesystem and mount it
mkfs.ext4 /dev/md0
mkdir -p $MOUNTPOINT
mount /dev/md0 $MOUNTPOINT
```
Now you have a /backups mount point that you can rsync files and folders to as part of your backup process.
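Note that the array and the mount won’t come back on their own after a reboot. A minimal sketch of persisting them is below; the mdadm.conf path is an assumption that varies by distribution.

```bash
# sketch: persist the array and the mount across reboots
# (mdadm.conf lives at /etc/mdadm/mdadm.conf on Debian/Ubuntu,
#  /etc/mdadm.conf on RHEL-style systems)
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
echo "/dev/md0  /backups  ext4  defaults,nofail  0 2" >> /etc/fstab
```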
rsync shell script
rsync is the standard tool for syncing data on Linux servers, since it only transfers what has changed between runs.
The following shell script will use rsync to make backups for you.
rsync-backups.sh
```bash
#!/bin/bash
SOURCE_DIRS="/home/ /var/"
DESTINATION="/backups"
RSYNC_CMD="rsync -avP"

for ITEM in $SOURCE_DIRS; do
    if [ ! -d "$DESTINATION$ITEM" ]; then
        mkdir -p "$DESTINATION$ITEM"
    fi
    $RSYNC_CMD $ITEM "$DESTINATION$ITEM"
done
```
Making a cron job
To make this a cron job that runs once a day, you can add a file like the following, which assumes you put rsync-backups.sh in /usr/local/bin.
This cron job will run as root at 12:15 AM in the timezone of the instance.
/etc/cron.d/backups
```
MAILTO="me@me.biz"
15 00 * * * root /usr/bin/flock -w 10 /var/lock/backups /usr/local/bin/rsync-backups.sh > /dev/null 2>&1
```
Data Rotation, Retention, Etc
To improve on how your data is rotated and retained, you can explore a number of open source tools, or build a simple scheme yourself with rsync’s hard-linking support, as in the sketch below.
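As a simple illustration of the idea, here’s a minimal sketch of date-stamped snapshots using rsync’s --link-dest option, so unchanged files are hard-linked rather than copied. The paths and the seven-snapshot retention are assumptions, not part of the setup above.

```bash
#!/bin/bash
# sketch: keep date-stamped snapshots of /backups/home under /backups/snapshots,
# hard-linking unchanged files against the most recent snapshot
SOURCE="/backups/home/"
SNAPDIR="/backups/snapshots"
TODAY=$(date +%Y-%m-%d)
LATEST=$(ls -1d "$SNAPDIR"/*/ 2>/dev/null | tail -n 1)

mkdir -p "$SNAPDIR/$TODAY"
if [ -n "$LATEST" ]; then
    rsync -a --delete --link-dest="$LATEST" "$SOURCE" "$SNAPDIR/$TODAY/"
else
    rsync -a --delete "$SOURCE" "$SNAPDIR/$TODAY/"
fi

# keep only the 7 most recent snapshots
ls -1d "$SNAPDIR"/*/ | head -n -7 | xargs -r rm -rf
```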
EBS Volumes to S3 with boto-rsync
Now that you’ve got your data backed up to EBS volumes, or you’re using EBS volumes as your primary datastore, you’re going to want to ensure a copy of your data exists elsewhere. This is where S3 is a great fit.
As you’ve seen, rsync is often the key tool for moving data around on and between Linux filesystems, so it makes sense that we’d use an rsync-style utility that talks to S3.
For this we’ll look at how we can use boto-rsync.
boto-rsync is a rough adaptation of boto’s s3put script which has been reengineered to more closely mimic rsync. Its goal is to provide a familiar rsync-like wrapper for boto’s S3 and Google Storage interfaces.
By default, the script works recursively and differences between files are checked by comparing file sizes (e.g. rsync’s --recursive and --size-only options). If the file exists on the destination but its size differs from the source, then it will be overwritten (unless the -w option is used).
boto-rsync is simple to use, being as easy as boto-rsync [OPTIONS] /local/path/ s3://bucketname/remote/path/, which assumes you have your AWS keys in ~/.boto or set as environment variables.
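If you haven’t set one up before, a ~/.boto file is just an INI-style credentials file along these lines (the placeholder values are yours to fill in):

```
[Credentials]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```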
boto-rsync supports a number of options you’ll recognize from rsync; consult the README to get more familiar with them.
As you can see, you can easily couple boto-rsync with a cron job and some script to get backups going to S3.
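For instance, a cron.d entry along these lines would push the /backups mount to a bucket nightly; the bucket name, schedule, and install path of boto-rsync here are assumptions for the sake of the example.

```
# /etc/cron.d/s3-backups -- sketch only; bucket, schedule and boto-rsync path are assumptions
MAILTO="me@me.biz"
30 01 * * * root /usr/bin/flock -w 10 /var/lock/s3-backups /usr/local/bin/boto-rsync /backups/ s3://mybucket/backups/ > /dev/null 2>&1
```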
Lifecycle policies for S3 to Glacier
One of the recent features added to S3 is the ability to use lifecycle policies to archive your S3 objects to Glacier.
You can create a lifecycle policy that archives data in an S3 bucket to Glacier very easily with the following boto code.
s3-glacier.py
```python
#!/usr/bin/env python
# sets a lifecycle policy on a bucket that archives objects under the
# 'logs/' prefix to Glacier after 30 days

import boto.s3
from boto.s3.lifecycle import Lifecycle, Transition, Rule

REGION = "us-west-2"
BUCKET = "mybucket"

c = boto.s3.connect_to_region(REGION)
bucket = c.get_bucket(BUCKET)

to_glacier = Transition(days=30, storage_class='GLACIER')
rule = Rule('ruleid', 'logs/', 'Enabled', transition=to_glacier)

lifecycle = Lifecycle()
lifecycle.append(rule)
bucket.configure_lifecycle(lifecycle)

# read the configuration back to confirm it was applied
current = bucket.get_lifecycle_config()
print current[0].transition
```
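If you also want the archived objects to be deleted eventually, boto’s lifecycle rules accept an expiration as well. Here’s a brief sketch that builds on the bucket object above; the 365-day figure is an arbitrary assumption.

```python
# sketch: archive 'logs/' objects to Glacier after 30 days and delete them after
# 365 days (the 365-day retention is an assumption, not from the original post)
from boto.s3.lifecycle import Lifecycle, Transition, Rule, Expiration

to_glacier = Transition(days=30, storage_class='GLACIER')
expire = Expiration(days=365)
rule = Rule('logs-rule', 'logs/', 'Enabled', expiration=expire, transition=to_glacier)

lifecycle = Lifecycle()
lifecycle.append(rule)
bucket.configure_lifecycle(lifecycle)
```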
Conclusion
As you can see, there are many options for automating your backups on AWS in comprehensive and flexible ways, and this post is only the tip of the iceberg.