AWS Advent 2014 is a wrap! The response this year has had me so excited!
A huge thanks to everyone who has been involved this year.
For those who may have missed it, here is a recap of the content this year:
Ted Timmons is a long-term devops nerd and works for Stanson Health, a healthcare startup with a fully remote engineering team.
One key goal of a successful devops process – and successful usage of AWS – is to create automated, repeatable processes. It may be acceptable to spin up EC2 instances by hand in the early stage of a project, but it’s important to convert this from a manual experiment to a fully described system before the project reaches production.
There are several great tools to describe the configuration of a single instance (Ansible, Chef, Puppet, Salt), but these tools aren’t well-suited for describing the configuration of an entire system. This is where Amazon’s CloudFormation comes in.
CloudFormation was launched in 2011. It’s fairly daunting to get started with: errors in CloudFormation templates are typically not caught until late in the process, and since it is fed by JSON files, it’s easy to make mistakes. Proper JSON is unwieldy (stray commas, unmatched closing blocks), but it’s fairly easy to write YAML and convert it to JSON.
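The conversion itself is trivial. As a rough illustration (a minimal sketch, not the exact converter from this article’s repository), a YAML-to-JSON filter only needs a few lines of Python with PyYAML:

#!/usr/bin/env python
# Minimal YAML-to-JSON filter: reads YAML on stdin, writes JSON on stdout.
import json
import sys

import yaml  # PyYAML

json.dump(yaml.safe_load(sys.stdin), sys.stdout, indent=2)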
Let’s start with a simple CloudFormation template to create an EC2 instance. In this example many things are hardcoded, like the instance type and AMI. This cuts down on the complexity of the example. Still, it’s a nontrivial example that creates a VPC and other resources. The only prerequisite for this example is to create a keypair in the US-West-2 region called “advent2014”.
(See simple-ec2.json in the article’s repository: https://github.com/tedder/aws-advent-2014-yml-cloudformation/blob/master/simple-ec2.json)
As you look at this template, notice both the quirks of CloudFormation (especially “Ref” and “Fn::GetAtt”) and the quirks of JSON. Even with some indentation the brackets are complex, and correct comma placement is difficult while editing a template.
Next, let’s convert this JSON example to YAML. There’s a quick converter in this article’s repository; with Python and pip installed, the only other dependency is PyYAML, which can be installed with pip.
./json-to-yaml.py < simple-ec2.json > simple-ec2.yml
Since JSON doesn’t maintain position of hashes/dicts, the output order may vary. Here’s what it looks like immediately after conversion:
(See simple-ec2.yml in the article’s repository: https://github.com/tedder/aws-advent-2014-yml-cloudformation/blob/master/simple-ec2.yml)
Only a small amount of reformatting is needed to make this file pleasant: I removed unnecessary quotes, combined some lines, and moved the ‘Type’ line to the top of each resource.
(See simple-ec2-formatted.yml in the article’s repository: https://github.com/tedder/aws-advent-2014-yml-cloudformation/blob/master/simple-ec2-formatted.yml)
It’s fairly easy to see the advantages of YAML in this case: it has a massive reduction in brackets and quotes and no need for commas. However, we need to convert this back to JSON for CloudFormation to use. Again, the converter is in this article’s repository.
./yaml-to-json.py < simple-ec2-formatted.yml > simple-ec2-autogenerated.json
aws --region us-west-2 cloudformation create-stack --template-body file://simple-ec2-autogenerated.json --stack-name advent
That’s it!
If you would like to use Ansible to prepare and publish to CloudFormation, my company shared an Ansible module to compile YAML into a single JSON template. The shared version of the script is entirely undocumented, but it compiles a full directory structure of YAML template snippets into a template. This significantly increases readability. Just place cloudformation_assemble in your library/ folder and use it like any other module.
If there’s interest, I’ll help to document and polish this module so it can be submitted to Ansible. Just fork and send a pull request.
Today’s post on using Ansible to help you get the most out of CloudFormation comes to us from Soenke Ruempler, who’s helping keep things running smoothly at Jimdo.
No more outdated information, a single source of truth. Describing almost everything as code: isn’t this one of the DevOps dreams? Recent developments have brought this dream even closer. In the era of APIs, tools like Terraform and Ansible have evolved which are able to codify the creation and maintenance of entire “organizational ecosystems”.
This blog post is a brief description of the steps we have taken to come closer to this goal at my employer Jimdo. Before we begin looking at particular implementations, let’s take the helicopter view and have a look at the current state and the problems with it.
We began to move to AWS in 2011 and have been using CloudFormation from the beginning. While we currently describe almost everything in CloudFormation, there are some legacy pieces which were just “clicked” through the AWS console. In order to have some primitive auditing and documentation for those, we usually document all “clicked” settings with a Jenkins job, which runs Cucumber scenarios that do a live inspection of the settings (by querying the AWS APIs with a read-only user).
While this setup might not look that bad and has a basic level of codification, there are several drawbacks, especially with CloudFormation itself, which we are going to have a look at now.
Maybe you have experienced this same issue: You start off with some new technology or provider and initially use the UI to play around. And suddenly, those clicked spikes are in production. At least this is the story how we came to AWS at Jimdo 😉
So you might say: “OK, then let’s rebuild the clicked resources into a CloudFormation stack.” Well, the problem is that we didn’t describe basic components like VPC and Subnets as CloudFormation stacks in the first place, and as other production setups rely on those resources, we cannot change this as easily anymore.
Here is another issue: The usual AWS feature release process is that a component team releases a new feature (e.g. ElastiCache replica groups), but the CloudFormation part is missing (the CloudFormation team at AWS is a separate team with its own roadmap). And since CloudFormation isn’t open source, we cannot add the missing functionality by ourselves.
So, in order to use those “Non-CloudFormation” features, we used to click the setup as a workaround, and then again document the settings with Cucumber.
But the click-and-document-with-Cucumber approach has some drawbacks of its own.
So we need something that can extend a CloudFormation stack with resources that we couldn’t (yet) express in CloudFormation, and we need them to be grouped together semantically, as code.
Some resources require post-processing in order to be fully ready. Imagine the creation of an RDS MySQL database with CloudFormation: the physical database instance is created by CloudFormation, but what about the databases, users, and passwords inside it? This cannot be done with CloudFormation, so we need to work around it as well.
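To make the gap concrete, here is roughly what such a post-processing step looks like when done outside of CloudFormation. This is only a hedged sketch: it assumes pymysql, and the endpoint, credentials, and database/user names are hypothetical (in practice they would come from the stack outputs and a secrets store):

#!/usr/bin/env python
# Post-processing CloudFormation can't express: create a database and a user
# inside the RDS instance that the stack just launched. All values below are
# placeholders for illustration only.
import pymysql

ENDPOINT = "mydb.abc123.eu-west-1.rds.amazonaws.com"  # hypothetical stack output
MASTER_USER = "root"
MASTER_PASSWORD = "change-me"

conn = pymysql.connect(host=ENDPOINT, user=MASTER_USER, password=MASTER_PASSWORD)
try:
    with conn.cursor() as cur:
        cur.execute("CREATE DATABASE IF NOT EXISTS app")
        # On MySQL 5.x, GRANT creates the user implicitly if it doesn't exist.
        cur.execute("GRANT ALL PRIVILEGES ON app.* TO 'app'@'%' IDENTIFIED BY 'app-password'")
    conn.commit()
finally:
    conn.close()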
Our current approaches vary from manual steps documented in a wiki to a combination of Puppet and hiera-aws: Puppet, running on some admin node, retrieves RDS instance endpoints by tags, iterates over them, and executes shell scripts. This is a form of post-processing entirely decoupled from the CloudFormation stack, both in time (hourly Puppet runs) and in “location” (it lives in another repository). A very complicated way to go just for the sake of automation.
Currently we use the AWS CLI tools in a plain way. Some coworkers use the old tools, some use the new ones. And I guess there are even folks with their own wrappers / bash aliases.
A “good” example is the missing feature of changing tags of CloudFormation stacks after creation. So if you forgot to do this in the first place, you’d need to recreate the entire stack! The CLI tools do not automatically add tags to stacks, so this is easily forgotten and should be automated. As a result we need to think of a wrapper around CloudFormation which automates those situations.
The idea of “single source information” or “single source of truth” is to never have a representation of data saved in more than one location. In the database world, it’s called “database normalization”. This is a very common pattern which should be followed unless you have an excellent excuse.
But if you don’t know better, are under time pressure, or your tooling is still immature, it’s hard to keep the data single-sourced. This usually leads to copied-and-pasted, hardcoded data.
Examples regarding AWS are usually resource IDs like Subnet-IDs, Security Groups or, in our case, our main VPC ID.
While this may not be an issue at first, it will come back to you in the future, e.g. if you want to rollout your stacks in another AWS region, perform disaster recovery, or you have to grep for hardcoded data in several codebases when doing refactorings, etc.
So we needed something to access information of other CloudFormation stacks and/or otherwise created resources (from the so called “clicked infrastructure”) without ever referencing IDs, Security Groups, etc. directly.
Now we have a good picture of what our current problems are and we can actually look for solutions!
My research resulted in 3 possible tools: Ansible, Terraform and Salt.
As of this writing, Ansible seems to be the only available tool which can deal with existing CloudFormation stacks out of the box and also meets the other criteria at first glance, so I decided to move on with it.
One of the mentioned problems are the inconvenient CloudFormation CLI tools: To create/update a stack, you would have to synthesize at least the stack name, template file name, and parameters, which is no fun and error-prone. For example:
$ cfn-[create|update]-stack webpool-saturn-dev-eu-west-1 --capabilities CAPABILITY_IAM --parameters "VpcID=vpc-123456" --template-file webpool-saturn-dev-eu-west-1.json --tags "jimdo:role=webpool,jimdo:owner=independence-team,jimdo:environment=dev"
With Ansible, we can describe a new or existing CloudFormation stack in a few lines as an Ansible playbook; here’s one example:
---
- hosts: localhost
  connection: local
  gather_facts: no
  vars:
    jimdo_environment: dev
    aws_region: eu-west-1
    stack_name: "webpool-saturn-{{ jimdo_environment }}-{{ aws_region }}"
  tasks:
    - name: create CloudFormation stack
      cloudformation:
        stack_name: "{{ stack_name }}"
        state: "present"
        region: "{{ aws_region }}"
        template: "{{ stack_name }}.json"
        tags:
          "jimdo:role": "webpool"
          "jimdo:owner": "independence-team"
          "jimdo:environment": "{{ jimdo_environment }}"
Creating and updating (converging) the CloudFormation stack becomes as straightforward as:
$ ansible-playbook webpool-saturn-dev-eu-west-1.yml
Awesome! We finally have great tooling! The YAML syntax is machine and human readable and our single source of truth from now on.
As for added power, it should be easier to implement AWS functionality that’s currently missing from CloudFormation as an Ansible module than a CloudFormation external resource […] and performing other out of band tasks, letting your ticketing system know about a new stack for example, is a lot easier to integrate into Ansible than trying to wrap the cli tools manually.
The above example stack uses the AWS ElastiCache feature of Redis replica groups, which unfortunately isn’t currently supported by CloudFormation. We could only describe the main ElastiCache cluster in CloudFormation. As a workaround, we used to click this missing piece and documented it with Cucumber as explained above.
A short look at the Ansible documentation reveals that there is currently no support for ElastiCache replica groups in Ansible either. But a bit of research shows that we can extend Ansible with custom modules.
So I started spiking my own Ansible module to handle ElastiCache replica groups, inspired by the existing “elasticache” module. First, the output of the CloudFormation stack is registered so it can be referenced later:
---
tasks:
  - name: webpool saturn
    cloudformation:
      ...
    register: webpool_cfn
Then the new elasticache_replication_group task references the registered outputs of the cloudformation task:
- name: ElastiCache replica groups
  elasticache_replication_group:
    state: "present"
    name: "saturn-dev-01n1"
    primary_cluster_id: "{{ webpool_cfn['stack_outputs']['WebcacheNode1Name'] }}"
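For reference, a custom module is just a Python script that Ansible invokes with the task’s arguments and that reports back with JSON. A minimal skeleton might look roughly like this (a hedged sketch with hypothetical argument names, not the actual elasticache_replication_group implementation):

#!/usr/bin/python
# Hedged sketch of an old-style (2014-era) custom Ansible module skeleton.

def main():
    module = AnsibleModule(
        argument_spec=dict(
            state=dict(default='present', choices=['present', 'absent']),
            name=dict(required=True),
            primary_cluster_id=dict(required=True),
        )
    )
    # ... call the AWS API here (e.g. via boto) to converge the resource and
    # work out whether anything actually changed ...
    module.exit_json(changed=True, name=module.params['name'])

# Ansible's standard module boilerplate: this star import injects AnsibleModule.
from ansible.module_utils.basic import *
main()

Drop a file like this into the playbook’s library/ folder and it can be used like any built-in module.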
Pretty awesome: Ansible works as a glue language while staying very readable. You can actually read through the playbook and get an idea of what’s going on.
Another great thing is that we can even extend core functionality of Ansible without any friction (such as waiting for upstream to accept a commit, building/deploying new packages, etc.), which should increase the tool’s acceptance among coworkers even more.
This topic touches on another use case: the possibility to “chain” CloudFormation stacks with Ansible, reusing outputs from one stack as parameters for another. This is especially useful for splitting big monolithic stacks into smaller ones which can then be managed and reused independently (separation of concerns).
Last but not least, it’s now easy to extend the Ansible playbook with post processing tasks (remember the RDS/Database example above).
As mentioned above, one issue with CloudFormation is the lack of a way to import existing infrastructure into a stack. Luckily, Ansible supports most of the relevant AWS functionality, so we can create a playbook to express existing infrastructure as code.
To discover the possibilities, I converted a fraction of our current production VPC/subnet setup into an Ansible playbook:
---
- hosts: localhost
  connection: local
  gather_facts: no
  vars:
    aws_region: eu-west-1
  tasks:
    - name: Main shared Jimdo VPC
      ec2_vpc:
        state: present
        cidr_block: 10.5.0.0/16
        resource_tags: {"jimdo:environment": "prod", "jimdo:role": "shared_network", "jimdo:owner": "unassigned"}
        region: "{{ aws_region }}"
        dns_hostnames: no
        dns_support: yes
        instance_tenancy: default
        internet_gateway: yes
        subnets:
          - cidr: 10.5.151.96/27
            az: "{{ aws_region }}a"
            resource_tags: {"Name": "template-team private"}
          - cidr: 10.5.151.128/27
            az: "{{ aws_region }}b"
            resource_tags: {"Name": "template-team private"}
          - cidr: 10.5.151.160/27
            az: "{{ aws_region }}c"
            resource_tags: {"Name": "template-team private"}
As you can see, there is not even a hardcoded VPC ID! Ansible identifies the VPC by a Tag-CIDR tuple, which meets our initial requirement of “no hardcoded data”.
To put this to the test, I changed the aws_region variable to another AWS region, and it was possible to create the basic VPC setup in that region as well, which is another sign of a successful single source of truth.
Now we want to reuse the information of the VPC which we just brought “under control” in the last example. Why should we do this? Well, in order to be fully automated (which is our goal), we cannot afford any hardcoded information.
Let’s start with the VPC ID, which should be one of the most requested IDs. Getting it is relatively easy because we can just extract it from the ec2_vpc module output and assign it as a variable with the set_fact Ansible module:
- name: Assign main VPC ID
  set_fact:
    main_vpc_id: "{{ main_vpc['vpc_id'] }}"
OK, but we also need to reuse the subnet information – and to avoid hardcoding, we need to address them without using subnet IDs. As we tagged the subnets above, we could use the tuple (name-tag, Availability zone) to identify and group them.
With the awesome help of the folks in the #ansible IRC channel, I managed to extract a single subnet’s ID by tag and availability zone from the output:
- name: Find the Template team private network subnet id in AZ 1a
  local_action:
    module: set_fact
    template_team_private_subnet_a: "{{ item.id }}"
  when: item['resource_tags']['Name'] == 'template-team private' and item['az'] == 'eu-west-1a'
  with_items: main_vpc['subnets']
While this satisfies the single source requirement, it doesn’t seem to scale very well with a bunch of subnets. Imagine you’d have to do this for each subnet (we already have more than 50 at Jimdo).
After some research I found out that it’s possible to add custom filters to Ansible that allow you to manipulate data with Python code:
from ansible import errors, runner
import collections

def subnets(raw_subnets):
    subnets = {}
    for raw_subnet in raw_subnets:
        subnet_identifier = raw_subnet['resource_tags']['Name']
        subnets.setdefault(subnet_identifier, {})
        subnets[raw_subnet['resource_tags']['Name']][raw_subnet['az']] = raw_subnet['cidr']
    return subnets

class FilterModule(object):
    def filters(self):
        return {"subnets": subnets}
We can now assign the subnets for later usage like this in Ansible:
- name: Assign subnets for later usage
  set_fact:
    main_vpc_subnets: "{{ main_vpc['subnets']|subnets() }}"
This is a great way to prepare the subnets for later usage, e.g. in iterations, to create RDS or ElastiCache subnet groups. Actually, almost everything in a VPC needs subnet information.
Those examples should be enough for now to give us confidence that Ansible is a great tool which fits our needs.
Takeaways
As of this writing, Ansible and CloudFormation seem to be a perfect fit for me. The combination turns out to be a solid solution to the problems described above.
After spiking the solution, I can already imagine several next steps for us.
I hope this blog post has brought some new thoughts and inspirations to the readers. Happy holidays!
Today’s post on using Terraform to build infrastructure on AWS comes from Justin Downing.
Building interconnected resources on AWS can be challenging. A simple web application can have a load balancer, application servers, DNS records, and a security group. While a sysadmin can launch and manage these resources from the AWS web console or a CLI tool like fog, doing so can be time consuming and tedious considering all the metadata you have to shuffle amongst the other resources.
An elegant solution to this problem has been built by the fine folks at HashiCorp: Terraform. This tool aims to take the concept of “infrastructure as code” and add the pieces that other provisioning tools like fog miss, namely the glue to interconnect your resources. For anyone with a background in software configuration management (Chef, Puppet), using Terraform should be a natural fit for describing and configuring infrastructure resources.
Terraform can be used with several different providers including AWS, GCE, and Digital Ocean. We will be discussing provisioning resources on AWS. You can read more about the built-in AWS provider here.
Terraform is written in Go and distributed as a package of binaries. You can download the appropriate package from the website. If you are using OS X and Homebrew, you can simply brew install terraform to get everything installed and set up.
Now that you have Terraform installed, let’s build some infrastructure! Terraform configuration files are text files that resemble JSON but are more readable and can include comments. These files should end in .tf (more details on configuration are available here). Rather than invent an example to use Terraform with AWS, I’m going to step through the example published by HashiCorp.
NOTE: I am assuming here that you have AWS keys capable of creating/terminating resources. It would also help to have the AWS CLI installed and configured, as Terraform will use those credentials to interact with AWS. The example below uses AWS region us-west-2.
Let’s use the AWS Two-Tier example to build an ELB and EC2 instance:
$ mkdir /tmp/aws-tf
$ cd /tmp/aws-tf
$ terraform init github.com/hashicorp/terraform/examples/aws-two-tier
$ aws ec2 --region us-west-2 create-key-pair --key-name terraform | jq -r ".KeyMaterial" > terraform.pem
Here, we initialized a new directory with the example. Then, we created a new keypair and saved the private key to our directory. You will note the files with the .tf extension: these are the configuration files used to describe the resources we want to build. As the names indicate, one is the main configuration, one contains the variables used, and one describes the desired output. When you build this configuration, Terraform will combine all .tf files in the current directory to create the resource graph.
I encourage you to review the configuration details in main.tf, variables.tf, and outputs.tf. With the help of comments and descriptions, it’s very easy to learn how different resources are intended to work together. You can also run plan to see how Terraform intends to build the resources you declared.
$ terraform plan -var 'key_name=awsadvent' -var 'key_path=/tmp/aws-tf/awsadvent.pem'
Refreshing Terraform state prior to plan...

The Terraform execution plan has been generated and is shown below.
Resources are shown in alphabetical order for quick scanning. Green resources
will be created (or destroyed and then created if an existing resource exists),
yellow resources are being changed in-place, and red resources will be destroyed.

Note: You didn't specify an "-out" parameter to save this plan, so when
"apply" is called, Terraform can't guarantee this is what will execute.

+ aws_elb.web
    ...

+ aws_instance.web
    ami: "" => "ami-21f78e11"
    ...

+ aws_security_group.default
    description: "" => "Used in the terraform"
    ...
This also doubles as a linter by checking the validity of your configuration files. For example, if I comment out the instance_type in main.tf, we receive an error:
$ terraform plan -var 'key_name=awsadvent' -var 'key_path=/tmp/aws-tf/awsadvent.pem'
There are warnings and/or errors related to your configuration. Please
fix these before continuing.

Errors:

  * 'aws_instance.web' error: instance_type: required field is not set
You will note that some pieces of the configuration are parameterized. This is very useful when sharing your Terraform plans, committing them to source control, or protecting sensitive data like access keys. By using variables and setting defaults for some, you allow for better portability when you share your Terraform plan with other members of your team. If you define a variable that does not have a default value, Terraform will require that you provide a value before proceeding. You can either (a) provide the values on the command line or (b) write them to a terraform.tfvars file. This file acts like a “secrets” file with a key/value pair on each line. For example:
access_key = "ABC123"
secret_key = "789xyz"
key_name = "terraform"
key_path = "terraform.pem"
Due to the sensitive information included in this file, it is recommended that you include terraform.tfvars in your source control ignore list (eg: echo terraform.tfvars >> .gitignore) if you want to share your plan.
Now, we can build the resources using apply:
$ terraform apply -var 'key_name=terraform' -var 'key_path=/tmp/aws-tf/terraform.pem'
aws_security_group.default: Refreshing state... (ID: sg-74077d11)
aws_security_group.default: Creating...
  description:                 "" => "Used in the terraform"
  ingress.#:                   "" => "2"
  ingress.0.cidr_blocks.#:     "" => "1"
  ingress.0.cidr_blocks.0:     "" => "0.0.0.0/0"
  ingress.0.from_port:         "" => "22"
  ingress.0.protocol:          "" => "tcp"
  ingress.0.security_groups.#: "" => "0"
  ingress.0.self:              "" => "0"
  ingress.0.to_port:           "" => "22"
  ingress.1.cidr_blocks.#:     "" => "1"
  ingress.1.cidr_blocks.0:     "" => "0.0.0.0/0"
  ingress.1.from_port:         "" => "80"
  ingress.1.protocol:          "" => "tcp"
  ingress.1.security_groups.#: "" => "0"
  ingress.1.self:              "" => "0"
  ingress.1.to_port:           "" => "80"
  name:                        "" => "terraform_example"
  owner_id:                    "" => "<computed>"
  vpc_id:                      "" => "<computed>"
aws_security_group.default: Creation complete
aws_instance.web: Creating...
  ami:               "" => "ami-21f78e11"
  availability_zone: "" => "<computed>"
  instance_type:     "" => "m1.small"
  key_name:          "" => "terraform"
  private_dns:       "" => "<computed>"
  private_ip:        "" => "<computed>"
  public_dns:        "" => "<computed>"
  public_ip:         "" => "<computed>"
  security_groups.#: "" => "1"
  security_groups.0: "" => "terraform_example"
  subnet_id:         "" => "<computed>"
  tenancy:           "" => "<computed>"
aws_instance.web: Provisioning with 'remote-exec'...
aws_instance.web (remote-exec): Connecting to remote host via SSH...
aws_instance.web (remote-exec):   Host: 54.148.154.146
aws_instance.web (remote-exec):   User: ubuntu
aws_instance.web (remote-exec):   Password: false
aws_instance.web (remote-exec):   Private key: true
aws_instance.web (remote-exec): Connected! Executing scripts...
........................
... output truncated ...
........................
aws_instance.web (remote-exec): Starting nginx: nginx.
aws_instance.web: Creation complete
aws_elb.web: Creating...
  availability_zones.#:          "" => "1"
  availability_zones.0:          "" => "us-west-2a"
  dns_name:                      "" => "<computed>"
  health_check.#:                "" => "<computed>"
  instances.#:                   "" => "1"
  instances.0:                   "" => "i-3cddb836"
  internal:                      "" => "<computed>"
  listener.#:                    "" => "1"
  listener.0.instance_port:      "" => "80"
  listener.0.instance_protocol:  "" => "http"
  listener.0.lb_port:            "" => "80"
  listener.0.lb_protocol:        "" => "http"
  listener.0.ssl_certificate_id: "" => ""
  name:                          "" => "terraform-example-elb"
  security_groups.#:             "" => "<computed>"
  subnets.#:                     "" => "<computed>"
aws_elb.web: Creation complete

Apply complete! Resources: 3 added, 0 changed, 0 destroyed.

The state of your infrastructure has been saved to the path
below. This state is required to modify and destroy your
infrastructure, so keep it safe. To inspect the complete state
use the `terraform show` command.

State path: terraform.tfstate

Outputs:

  address = terraform-example-elb-419196096.us-west-2.elb.amazonaws.com
The output above is truncated, but Terraform did a few things for us here: it created the security group, launched the EC2 instance and provisioned it with nginx over SSH, created the ELB with the instance attached, and saved the resulting state to a terraform.tfstate file.
You should be able to open the ELB public address in a web browser and see “Welcome to Nginx!” (note: this may take a minute or two after initialization in order for the ELB health check to pass).
The terraform.tfstate file is very important as it tracks the status of your resources. As such, if you are sharing your configurations, it is recommended that you include this file in source control. This way, after initializing some resources, another member of your team will not try and re-initialize those same resources. In fact, she can see the status of the resources with terraform show. In the event the state has not been kept up-to-date, you can use terraform refresh to update the state file.
And…that’s it! With a few descriptive text files, Terraform is able to build cooperative resources on AWS in a matter of minutes. You no longer need complicated wrappers around existing AWS libraries/tools to orchestrate the creation or destruction of resources. When you are finished, you can simply run terraform destroy to remove all the resources described in your .tf configuration files.
With Terraform, building infrastructure resources is as simple as describing them in text. Of course, there is a lot more you can do with this tool, including managing DNS records and configuring Mailgun. You can even mix these providers together in a single plan (eg: EC2 instances, DNSimple records, Atlas metadata) and Terraform will manage it all! Check out the documentation and examples for the details.
Terraform Docs: https://terraform.io/docs/index.html
Terraform Examples: https://github.com/hashicorp/terraform/tree/master/examples
Today’s post comes to us from Mark Nunnikhoven, who is the VP of Cloud & Emerging Technologies @TrendMicro.
At this year’s re:Invent, AWS introduced a new service (currently in preview) called Lambda. Mitch Garnaat already introduced the service to the advent audience in the first post of the month.
Take a minute to read Mitch’s post if you haven’t already. He provides a great overview of the service and its goals, and he’s created a handy tool, Kappa, that simplifies using the new service.
Of course Mitch’s tool is only useful if you already understand what Lambda does and where best to use it. The goal of this post is to provide that understanding.
I think Mitch is understating things when he says that “there are some rough edges”. Like any AWS service, Lambda is starting out small. Thankfully, like other services, the documentation for Lambda is solid.
There is little point in creating another walkthrough of setting up a Lambda function. This tutorial from AWS does a great job of the step-by-step.
What we’re going to cover today are the current challenges, constraints, and where Lambda might be headed in the future.
During a Lambda workflow 2 IAM roles are used. This is the #1 area where people get caught up.
A role is an identity used in the permissions framework of AWS. Roles typically have policies attached that dictate what the role can do within AWS.
Roles are a great way to provide (and limit) access without passing access and secret keys around.
Lambda uses 2 IAM roles during its workflow: an invocation role and an execution role. While the terminology is consistent within computer science, it’s needlessly confusing for some people.
Here’s the layman’s version: the invocation role is what the event source (S3, Kinesis, etc.) uses for permission to trigger your Lambda function, and the execution role is what your function runs as, defining what it’s allowed to do once it’s running.
This is an important difference because while the execution role is consistent in the permissions it needs, the invocation role (the trigger) will need different permissions depending on where you’re using your Lambda function.
If you’re hooking your Lambda function to an S3 bucket, the invocation role will need the appropriate permissions to have S3 call your Lambda function. This typically includes the lambda:InvokeAsync permission and a trust policy that allows the bucket to assume the invocation role.
If you’re hooking your function into a Kinesis event stream, the same logic applies but in this case you’re going to have to allow the invocation role access to your Kinesis stream since it’s a pull model instead of the S3 push model.
The AWS docs sum this up with the following semi-helpful diagrams:
S3 push model for Lambda permissions
Kinesis pull model for Lambda permissions
Remember that your invocation role always needs to be able to assume a role (sts:AssumeRole) and access the event source (Kinesis stream, S3 bucket, etc.)
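As an illustration of the S3 push model, creating such an invocation role could look roughly like the following boto3 sketch. The role and policy names are hypothetical, and the exact permissions required may differ from this preview-era approximation, so treat it as a sketch rather than a recipe:

import json
import boto3

iam = boto3.client("iam")

# Trust policy: let S3 assume the invocation role (the "push" model).
trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Permission policy: allow the role to invoke the Lambda function.
invoke = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "lambda:InvokeAsync",
        "Resource": "*",
    }],
}

iam.create_role(RoleName="advent-invocation-role",          # hypothetical name
                AssumeRolePolicyDocument=json.dumps(trust))
iam.put_role_policy(RoleName="advent-invocation-role",
                    PolicyName="allow-invoke-lambda",        # hypothetical name
                    PolicyDocument=json.dumps(invoke))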
TL;DR: Thank Mitch for starting Kappa.
The longer explanation is that packaging up the dependencies of your code can be a bit of a pain. That’s because we have little to no visibility into what’s happening.
Until the service and associated tooling mature a bit, we’re back to the world of printf, or at least:
console.log("Did execution get this far?");
For Lambda a deployment package is your javascript code and any supporting libraries. These need to be bundled into a .zip file. If you’re just deploying a simple .js file, .zip it and you’re good to go.
If you have additional libraries that you’re providing, buckle up. This ride is about to get real bumpy.
The closest thing we have to a step-by-step guide on providing additional libraries is this step from one of the AWS tutorials.
The instructions here are to install a separate copy of node.js, create a subfolder, and then install the required modules via npm.
Now you’re going to .zip your code file and the modules from the subfolder but not the folder itself. From all appearances the .zip needs to be a flat file.
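In other words, the handler .js and the node_modules directory have to sit at the root of the archive, without a wrapping folder. A small helper script can prevent getting this wrong; here is a hedged sketch in Python (the staging directory and file names are hypothetical):

#!/usr/bin/env python
# Zip the *contents* of a Lambda staging directory, not the directory itself,
# so the handler .js and node_modules/ end up at the root of the archive.
import os
import zipfile

STAGING_DIR = "build"      # hypothetical staging folder with index.js + node_modules/
ARCHIVE = "function.zip"

with zipfile.ZipFile(ARCHIVE, "w", zipfile.ZIP_DEFLATED) as zf:
    for root, _dirs, files in os.walk(STAGING_DIR):
        for name in files:
            path = os.path.join(root, name)
            # arcname is relative to the staging dir, which keeps the zip "flat"
            zf.write(path, arcname=os.path.relpath(path, STAGING_DIR))

print("wrote %s" % ARCHIVE)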
I’m hopeful there will be more robust documentation on this soon but in the meantime please share your experiences in the AWS forums or on Twitter.
As Lambda is in preview there are additional constraints beyond what you can expect when it is launched into production.
These constraints also lead to some AWS recommendations that are worth reading and taking to heart; however, one stands out above all the others.
“Write your Lambda function code in a stateless style”, AWS Lambda docs.
This is by far the best piece of advice that one can offer when it comes to Lambda design patterns. Do not try to bolt state on using another service or data store. Treat Lambda as an opportunity to manipulate data mid-stream. Lambda functions execute concurrently. Thinking of it in functional terms will save you a lot of headaches down the road.
One of the most common reactions I’ve heard about AWS Lambda is, “So what?”. That’s understandable but if you look at AWS’ track record, they ship very simple but useful services and iterate very quickly on them.
While Lambda may feel limited today, expect things to change quickly. Kinesis, DynamoDB, and S3 are just the beginning. The “custom” route today provides a quick and easy way to offload some data processing to Lambda but that will become exponentially more useful as “events” start popping up in other AWS services.
Imagine triggering Lambda functions based on SNS messages, CloudWatch Log events, Directory Service events, and so forth.
Look to tagging in AWS as an example. It started very simply in EC2 and over the past 24 months has expanded to almost every service and resource in the environment. Events will most likely follow the same trajectory, and with every new event Lambda gets even more powerful.
Getting in on the ground floor of Lambda will allow you to shed more and more of your lower level infrastructure as more events are rolled out to production.
Here’s some holiday reading to ensure you’re up to speed:
watch “Getting Started with AWS Lambda”, MBL202 from this year’s re:Invent
read Jeff Barr’s post introducing Lambda, “AWS Lambda – Run Code in the Cloud”
read through the Lambda developer guide
re-read Mitch’s post from earlier this month
Today’s AWS Advent post comes to us from Mitch Garnaat, the creator of the AWS python library boto, who is currently herding clouds and devops over at Scopely. He’s gonna walk us through a quick look at AWS CodeDeploy.
Software deployment. It seems like such an easy concept. I wrote some new code and now I want to get it into the hands of my users. But there are few areas in the world of software development where you find a more diverse set of approaches to such a simple-sounding problem. In all my years in the software business, I don’t think I’ve ever seen two deployment processes that are the same. So many different tools. So many different approaches. It turns out it is a pretty complicated problem with a lot of moving parts.
But there is one over-arching trend in the world of software deployment that seems to have almost universal appeal. More. More deployments. More often.
Ten years ago it was common for a software deployment to happen a few times a year. Software changes would be batched up for weeks or months waiting for a release cycle and once the release process started, development stopped. All attention was focused on finding and fixing bugs and, eventually, releasing the code. It was very much a bimodal process: develop for a while and then release for a while.
Now the goal is to greatly shorten the time it takes to get a code change deployed, to make the software deployment process quick and easy. And the best way to get good at something is to do it a lot of times.
“Repetition is the mother of skill.” – Anthony Robbins
If we force ourselves to do software deployment frequently and repeatedly we will get better and better at it. The process I use to put up holiday lights is appallingly inefficient and cumbersome. But since I only do it once a year, I put up with it. If I had to put those lights up once a month or once a week or once a day, the process would get better in a hurry.
The ultimate goal is Continuous Deployment, a continuous pipeline where each change we commit to our VCS is pushed through a process of testing and then, if the tests succeed, is automatically released to production. This may be an aspirational goal for most people and there may be good reasons not to have a completely automated pipeline (e.g. dependencies on other systems) but the clear trend is towards frequent, repeatable software deployment without the bimodal nature of traditional deployment techniques.
Which brings us to the real topic of today’s post. AWS CodeDeploy is a new service from AWS specifically designed to automate code deployment and eliminate manual operations.
This post will not be a tutorial on how to use AWS CodeDeploy. There is an excellent hands-on sample deployment available from AWS. What this post will focus on is some of the specific features provided by AWS CodeDeploy that might help you achieve the goal of faster and more automated software deployments.
This may seem like a contradiction given that this is a new service from AWS, but the underlying technology in AWS CodeDeploy is not new at all. This is a productization of an internal system called Apollo that has been used for software deployments within Amazon and AWS for many years.
Anyone who has worked at Amazon will be familiar with Apollo and will probably rave about it. It’s rock solid and has been used to deploy thousands of changes a day across huge fleets of servers within Amazon.
You can control how AWS CodeDeploy will roll out the deployment to your fleet using a deployment configuration. There are three built-in configurations:
All At Once – Deploy the new revision to all instances in the deployment group at once. This is probably not a good idea unless you have a small fleet or you have very good acceptance tests for your new code.
Half At A Time – Deploy the new revision to half of the instances at once. If a certain number of those instances fail then fail the deployment.
One At A Time – Deploy the new revision to one instance at a time. If deployment to any instance fails, then fail the deployment.
You can also create custom deployment configurations if one of these models doesn’t fit your situation.
If you are deploying your code to more than one instance and you are not currently using Auto Scaling, you should stop reading this article right now and go figure out how to integrate it into your deployment strategy. In fact, even if you are only using one instance you should use Auto Scaling. It’s a great service that can save you money and allow you to scale with demand.
Assuming that you are using Auto Scaling, AWS CodeDeploy can integrate with your Auto Scaling groups. By using lifecycle hooks in Auto Scaling, AWS CodeDeploy can automatically deploy the specified revision of your software on any new instances that Auto Scaling creates in your group.
AWS CodeDeploy uses a YAML-format AppSpec file to drive the deployment process on each instance. This file allows you to map source files in the deployment package to their destination on the instance. It also allows a variety of hooks to be run at various times in the process, such as BeforeInstall, AfterInstall, ApplicationStart, and ValidateService.
These hooks can be arbitrary executables such as BASH scripts or Python scripts and can do pretty much anything you need them to do.
Below is an example AppSpec file.
os: linux
files:
  - source: Config/config.txt
    destination: webapps/Config
  - source: source
    destination: /webapps/myApp
hooks:
  BeforeInstall:
    - location: Scripts/UnzipResourceBundle.sh
    - location: Scripts/UnzipDataBundle.sh
  AfterInstall:
    - location: Scripts/RunResourceTests.sh
      timeout: 180
  ApplicationStart:
    - location: Scripts/RunFunctionalTests.sh
      timeout: 3600
  ValidateService:
    - location: Scripts/MonitorService.sh
      timeout: 3600
      runas: codedeployuser
AWS CodeDeploy can be driven either from the AWS Web Console or from the AWS CLI. In general, my feeling is that GUI interfaces are great for monitoring and other read-only functions, but for command and control I strongly prefer CLIs and scripts, so it’s great that you can control every aspect of AWS CodeDeploy via the AWS CLI or any of the AWS SDKs. I will say that the Web GUI for AWS CodeDeploy is quite well done and provides a really nice view of what is happening during a deployment.
There is no extra charge for using AWS CodeDeploy. You obviously pay for all of the EC2 instances you are using just as you do now but you don’t have to pay anything extra to use AWS CodeDeploy.
The previous section highlights some features of AWS CodeDeploy that I think could be particularly interesting to people considering a new deployment tool.
In this section, I want to mention a couple of caveats. These are not really problems but just things you want to be aware of in evaluating AWS CodeDeploy.
AWS CodeDeploy only supports deployments on EC2 instances at this time.
AWS CodeDeploy requires an agent to be installed on any EC2 instance that it will be deploying code to. Currently, they support Amazon Linux, Ubuntu, and Windows Server.
Because of the way AWS CodeDeploy works, there really isn’t a true rollback capability. You can’t deploy code to half of your fleet and then undeploy that latest revision. You can simulate a rollback by simply creating a new deployment of your previous version of software but there is no Green/Blue type rollback available.
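For example, re-deploying the previous known-good bundle via the SDK might look roughly like this hedged sketch (the application, deployment group, bucket, and key names are all hypothetical):

import boto3

codedeploy = boto3.client("codedeploy")

# "Roll back" by deploying the previously-known-good bundle again.
codedeploy.create_deployment(
    applicationName="my-app",                     # hypothetical
    deploymentGroupName="my-app-production",      # hypothetical
    revision={
        "revisionType": "S3",
        "s3Location": {
            "bucket": "my-deploy-bundles",
            "key": "my-app-1.2.2.zip",            # the previous version
            "bundleType": "zip",
        },
    },
)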
We just created a new deployment pipeline at work that implements a type of blue/green deployment and is based on golden AMIs. We are very happy with that and I don’t think we will be revisiting it anytime soon. However, if I were starting that project today, I would certainly give a lot of thought to using AWS CodeDeploy. It has a nice feature set, can be easily integrated into most environments and code bases, and is based on rock-solid, proven technology. And the price is right!
Today’s post on managing EC2 Security Groups with Puppet comes to us from Gareth Rushgrove, the awesome curator of DevOps Weekly, who is currently an engineer at Puppet Labs.
At Puppet Labs we recently shipped a module to make managing AWS easier. This tutorial shows how it can be used to manage your security groups. EC2 Security groups act as a virtual firewall and are used to isolate instances and other AWS resources from each other and the internet.
You can find the full details about installation and configuration for the module in the official README, but the basic version, assuming a working Puppet and Ruby setup, is:
gem install aws-sdk-core
puppet module install puppetlabs-aws
You’ll also want to have your AWS API credentials in environment variables (or use IAM if you’re running from within AWS).
export AWS_ACCESS_KEY_ID=your_access_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
First let’s create a simple security group called test-sg in the us-east-1 region. Save the following to a file called securitygroup.pp:
ec2_securitygroup { 'test-sg':
  region      => 'us-east-1',
  ensure      => present,
  description => 'Security group for aws advent',
  ingress     => [{
    security_group => 'test-sg',
  }],
  tags        => {
    reason => 'awsadvent',
  },
}
Now let’s run Puppet to create the group:
puppet apply securitygroup.pp --test
You should see something like the following output:
Info: Loading facts
Notice: Compiled catalog for pro.local in environment production in 0.05 seconds
Info: Applying configuration version '1418659587'
Info: Checking if security group test-sg exists in region us-east-1
Info: Creating security group test-sg in region us-east-1
Notice: /Stage[main]/Main/Ec2_securitygroup[test-sg]/ensure: created
Notice: Finished catalog run in 15.22 seconds
We’re running here with apply and the --test flag so we can easily see what’s happening, but if you have a Puppet master setup you can run with an agent too.
You will probably change your security groups over time as your infrastructure evolves. And managing that evolution is where Puppet’s declarative approach really shines. You can have confidence in the description of your infrastructure in code because Puppet can tell you about any changes when it runs.
Next let’s add a new ingress rule to our existing group. Modify the securitygroup.pp file like so:
ec2_securitygroup { 'test-sg':
  ensure      => present,
  region      => 'us-east-1',
  description => 'Security group for aws advent',
  ingress     => [{
    protocol => 'tcp',
    port     => 80,
    cidr     => '0.0.0.0/0',
  },{
    security_group => 'test-sg',
  }],
  tags        => {
    reason => 'awsadvent',
  },
}
And again let’s run Puppet to modify the group:
puppet apply securitygroup.pp --test
You should see something like the following output:
Info: Loading facts
Notice: Compiled catalog for pro.local in environment production in 0.04 seconds
Info: Applying configuration version '1418659692'
Info: Checking if security group test-sg exists in region us-east-1
Notice: /Stage[main]/Main/Ec2_securitygroup[test-sg]/ingress: ingress changed [{'security_group' => 'test-sg'}] to '{"protocol"=>"tcp", "port"=>"80", "cidr"=>"0.0.0.0/0"} {"security_group"=>"test-sg"}'
Notice: Finished catalog run in 13.59 seconds
Note the information about changes to the ingress rules as we expected. You can also check the changes in the AWS console.
The module also has full support for the Puppet resource command, so all of the functionality is available from the command line as well as the DSL. As an example, let’s clean up and delete the group created above.
puppet resource ec2_securitygroup test-sg ensure=absent region=us-east-1
Hopefully that’s given you an idea of what’s possible with the Puppet AWS module. You can see more examples of the module in action in the main repository.
Some of the advantages of using Puppet for managing AWS resources have hopefully come through in the examples above: a declarative description of your infrastructure, and visibility into any changes each time Puppet runs.
The current preview release of the module supports EC2 instances, security groups and ELB load balancers, with support for VPC, Route53 and Auto Scaling groups coming soon. We’re looking for as much feedback as possible at the moment, so feel free to report issues on GitHub, ask questions on the puppet-user mailing list or contact me on Twitter at @garethr.
Our first AWS Advent post comes to us from Mitch Garnaat, the creator of the AWS python library boto, who is currently herding clouds and devops over at Scopely. He’s gonna walk us through how we can discover more about our Amazon resources using the awesome tool he’s been building, called skew.
If you only have one account in AWS and you only use one service in one region, this article probably isn’t for you. However, if you are like me and manage resources in many accounts, across multiple regions, and in many different services, stick around.
There are a lot of great tools to help you manage your AWS resources. There is the AWS Web Console, the AWSCLI, various language SDK’s like boto, and a host of third-party tools. The biggest problem I have with most of these tools is that they limit your view of resources to a single region, a single account, and a single service at a time. For example, you have to login to the AWS Console with one set of credentials representing a single account. And once you are logged in, you have to select a single region. And then, finally, you drill into a particular service. The AWSCLI and the SDK’s follow this same basic model.
But what if you want to look at resources across regions? Across accounts? Across services? Well, that’s where skew comes in.
Skew is a Python library built on top of botocore. The main purpose of skew is to provide a flat, uniform address space for all of your AWS resources.
The name skew is a homophone of SKU (Stock Keeping Unit). SKU’s are the numbers that show up on the bar codes of just about everything you purchase, and that SKU number uniquely identifies the product in the vendor’s inventory. When you make a purchase they scan the barcode containing the SKU and can instantly find the pricing data for the item.
Similarly, skew uses a unique identifier for each one of your AWS resources and allows you to scan the SKU and quickly find the details for that resource. It also provides some powerful mechanisms to find sets of resources by allowing wildcarding and regular expressions within the SKU’s.
So, what do we use for a unique identifier for all of our AWS resources? Well, as it turns out, AWS has already solved that problem for us. Each resource in AWS can be identified by an Amazon Resource Name or ARN. The general forms for ARN’s are:
arn:aws:service:region:account:resource
arn:aws:service:region:account:resourcetype/resource
arn:aws:service:region:account:resourcetype:resource
So, the ARN for an EC2 instance might look like this:
arn:aws:ec2:us-west-2:123456789012:instance/i-12345678
This tells us the instance is in the us-west-2 region, running in the account identified by the account number 123456789012, and the instance has an instance ID of i-12345678.
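As a quick aside (plain Python, nothing skew-specific), splitting an example ARN on its colons makes the components easy to see:

# Plain string handling to pull apart an example ARN.
arn_string = 'arn:aws:ec2:us-west-2:123456789012:instance/i-12345678'
scheme, provider, service, region, account, resource = arn_string.split(':', 5)
print(service)   # ec2
print(region)    # us-west-2
print(account)   # 123456789012
print(resource)  # instance/i-12345678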
The easiest way to install skew is via pip.
% pip install skew
Because skew is based on botocore, as is AWSCLI, it will use the same credentials as those tools. You need to make a small addition to your ~/.aws/config file to help skew map AWS account ID’s to the profiles in the config file. Check the README for details on that.
Once we have skew installed and configured, we can use it to find resources based on their ARN’s. For example, using the example ARN above:
>>> import skew
>>> arn = skew.scan('arn:aws:ec2:us-west-2:123456789012:instance/i-12345678')
>>> arn
arn:aws:ec2:us-west-2:123456789012:instance/i-12345678
>>>
Ok, that wasn’t very exciting. How do I get at my actual resource in AWS? Well, the scan method returns an ARN object, and this object supports the iterator pattern in Python. This makes sense since, as we will see later, an ARN can actually return a lot of objects, not just one. So if we want to get our object we can:
>>> instance = list(arn)[0]
>>> instance.id
'i-12345678'
>>> instance.data
{u'AmiLaunchIndex': 1,
 u'Architecture': 'x86_64',
 u'BlockDeviceMappings': [{u'DeviceName': '/dev/sda1',
   u'Ebs': {u'AttachTime': datetime.datetime(2014, 12, 14, 13, 48, tzinfo=tzutc()),
    u'DeleteOnTermination': True,
    u'Status': 'attached',
    u'VolumeId': 'vol-63276b7b'}}],
 u'ClientToken': '425f1a07-2e61-4089-a7dc-7344b302731e_us-east-1d_2',
 u'EbsOptimized': False,
 u'Hypervisor': 'xen',
 u'ImageId': 'ami-6227460a',
 u'InstanceId': 'i-12345678',
 u'InstanceType': 'c3.2xlarge',
 ...
}
>>>
Iterating on an ARN returns a list of Resource objects, and each of these Resource objects represents one resource in AWS. Resource objects have a number of attributes like id, and they also have an attribute called data that contains all of the data about that resource. This is the same information that would be returned by the AWSCLI or an SDK.
Finding a single resource in AWS is okay but one of the nice things about skew is that it allows you to quickly find lots of resources in AWS. And you don’t have to worry about which region those resources are in or in which account they reside.
For example, let’s say we want to find all EC2 instances running in all regions and in all of my accounts:
arn = skew.scan('arn:aws:ec2:*:*:instance/*')
for instance in arn:
    print(instance)
In that one little line of Python code, a lot of stuff is happening. Skew will iterate through all of the regions supported by the EC2 service and, in each region, will authenticate with each of the account profiles listed in your AWS config file. It will then find all EC2 instances and finally return the complete list of those instances as Resource objects.
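Since each returned resource also carries its ARN (as the later examples show), you can build quick summaries on top of such a scan. Here is a rough sketch that tallies instances per region; it assumes the resource’s arn renders as the familiar arn:aws:service:region:account:resource string:

import skew
from collections import Counter

# Count EC2 instances per region across every configured account.
# The region is the fourth colon-separated field of the ARN.
counts = Counter()
for instance in skew.scan('arn:aws:ec2:*:*:instance/*'):
    region = str(instance.arn).split(':')[3]
    counts[region] += 1

for region, count in counts.most_common():
    print('%s: %d instances' % (region, count))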
In addition to wildcards, you can also use regular expressions as components in the ARN. For example:
arn = skew.scan('arn:aws:dynamodb:us-.*:*:table/*')
This will find all DynamoDB tables in all US regions for all accounts.
Here are some examples of things you can do quickly and easily with skew that would be difficult in most other tools.
Find all unattached EBS volumes across all regions and accounts and tally the size of wasted space.
import skew

total_size = 0
total_volumes = 0
for volume in skew.scan('arn:aws:ec2:*:*:volume/*'):
    if not volume.data['Attachments']:
        total_volumes += 1
        total_size += volume.data['Size']
        print('%s: %dGB' % (volume.arn, volume.data['Size']))
print('Total unattached volumes: %d' % total_volumes)
print('Total size (GB): %d' % total_size)
Audit all EC2 security groups to find CIDR rules that are not whitelisted.
import skew

# Add whitelisted CIDR blocks here, e.g. 192.168.1.1/32.
# Any addresses not in this list will be flagged.
whitelist = []

for secgrp in skew.scan('arn:aws:ec2:*:*:security-group/*'):
    for ipperms in secgrp.data['IpPermissions']:
        for ip in ipperms['IpRanges']:
            if ip['CidrIp'] not in whitelist:
                print('%s: %s is not whitelisted' % (secgrp.arn, ip['CidrIp']))
Find all EC2 instances that are not tagged in any way.
import skew

for instance in skew.scan('arn:aws:ec2:*:*:instance/*'):
    if not instance.tags:
        print('%s is untagged' % instance.arn)
The ARN provides a great way to uniquely identify AWS resources but it doesn’t exactly roll off the tongue. Skew provides some help for constructing ARN’s interactively.
First, start off with a new ARN object.
>>> from skew.arn import ARN
>>> arn = ARN()
>>> arn
arn:aws:*:*:*:*
>>>
Each ARN object contains six components: the scheme (always arn), the provider (currently always aws), the service, the region, the account, and the resource.
All of these are available as attributes of the ARN object.
>>> arn.scheme
arn
>>> arn.provider
aws
>>> arn.service
*
>>> arn.region
*
>>> arn.account
*
>>> arn.resource
*
>>>
If you want to build up the ARN interactively, you can ask each of the components what choices are available.
>>> arn.service.choices
['autoscaling',
 'cloudformation',
 'cloudfront',
 'cloudsearch',
 'cloudsearchdomain',
 'cloudtrail',
 'cloudwatch',
 'codedeploy',
 ...
 'storagegateway',
 'sts',
 'support',
 'swf']
>>>
You can also try out your regular expressions to make sure they return the results you expect.
>>> arn.service.match('cloud.*')
['cloudformation',
 'cloudfront',
 'cloudsearch',
 'cloudsearchdomain',
 'cloudtrail',
 'cloudwatch']
>>>
To set the value of a particular component, use the pattern attribute.
>>> arn.service.pattern = 'cloud.*'
>>> arn.service.matches
['cloudformation',
 'cloudfront',
 'cloudsearch',
 'cloudsearchdomain',
 'cloudtrail',
 'cloudwatch']
>>>
Once you have the ARN that you want, you can enumerate it like this:
>>> for resource in arn:
...     <do something amazing>
>>>
A recent feature of skew allows you to run queries against the resource data. This feature makes use of jmespath, which is a really nice JSON query engine. It was originally written in Python for use in the AWSCLI but is now available in a number of other languages. If you have ever used the --query option of the AWSCLI, then you have used jmespath.
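If you want to get a feel for jmespath on its own first, the Python package can be used directly. This is just a small sketch against a trimmed-down copy of the instance data shown earlier (it assumes jmespath is installed, e.g. via pip):

import jmespath

# A trimmed-down version of the instance data from the earlier example.
data = {
    'InstanceId': 'i-12345678',
    'InstanceType': 'c3.2xlarge',
    'BlockDeviceMappings': [
        {'DeviceName': '/dev/sda1', 'Ebs': {'VolumeId': 'vol-63276b7b'}},
    ],
}

# Simple key lookup.
print(jmespath.search('InstanceType', data))
# Project the volume IDs out of the block device mappings.
print(jmespath.search('BlockDeviceMappings[].Ebs.VolumeId', data))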
If you append a jmespath query to the end of the ARN (using a | as a separator), skew will send the data for each of the returned resources through the jmespath query and store the result in the filtered_data attribute of the resource object. The original data is still available as the data attribute. For example:
arn = skew.scan('arn:aws:ec2:*:*:instance/*|InstanceType')
Then each resource returned would have the instance type stored in the filtered_data attribute of the Resource object. This is obviously a very simple example, but jmespath is very powerful, and the interactive query tool available on http://jmespath.org/ allows you to try your queries out beforehand to get exactly what you want.
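As a slightly fuller (but still hypothetical) sketch, you could use that same query to tally instance types across all of your regions and accounts, reading each query result from filtered_data:

import skew
from collections import Counter

# The jmespath query after the | leaves each instance's type in filtered_data.
type_counts = Counter()
for instance in skew.scan('arn:aws:ec2:*:*:instance/*|InstanceType'):
    type_counts[instance.filtered_data] += 1

for instance_type, count in type_counts.most_common():
    print('%s: %d' % (instance_type, count))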
One other feature of skew is easy access to CloudWatch metrics for AWS resources. If we refer back to the very first interactive session in the post, we can show how you would access those CloudWatch metrics for the instance.
>>> instance.metric_names
['CPUUtilization',
 'NetworkOut',
 'StatusCheckFailed',
 'StatusCheckFailed_System',
 'NetworkIn',
 'DiskWriteOps',
 'DiskReadBytes',
 'DiskReadOps',
 'StatusCheckFailed_Instance',
 'DiskWriteBytes']
>>> instance.get_metric_data('CPUUtilization')
[{u'Average': 0.134, u'Timestamp': '2014-12-13T14:04:00Z', u'Unit': 'Percent'},
 {u'Average': 0.066, u'Timestamp': '2014-12-13T13:54:00Z', u'Unit': 'Percent'},
 {u'Average': 0.066, u'Timestamp': '2014-12-13T14:09:00Z', u'Unit': 'Percent'},
 {u'Average': 0.134, u'Timestamp': '2014-12-13T13:34:00Z', u'Unit': 'Percent'},
 {u'Average': 0.066, u'Timestamp': '2014-12-13T14:19:00Z', u'Unit': 'Percent'},
 {u'Average': 0.068, u'Timestamp': '2014-12-13T13:44:00Z', u'Unit': 'Percent'},
 {u'Average': 0.134, u'Timestamp': '2014-12-13T14:14:00Z', u'Unit': 'Percent'},
 {u'Average': 0.066, u'Timestamp': '2014-12-13T13:29:00Z', u'Unit': 'Percent'},
 {u'Average': 0.132, u'Timestamp': '2014-12-13T13:59:00Z', u'Unit': 'Percent'},
 {u'Average': 0.134, u'Timestamp': '2014-12-13T13:49:00Z', u'Unit': 'Percent'},
 {u'Average': 0.134, u'Timestamp': '2014-12-13T13:39:00Z', u'Unit': 'Percent'}]
>>>
We can find the available CloudWatch metrics with the metric_names attribute and then we can retrieve the desired metric using the get_metric_data method. The README for skew contains a bit more information about accessing CloudWatch metrics.
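Since get_metric_data returns ordinary dictionaries like the ones above, summarizing them is just a bit of Python. A small sketch, assuming the same instance object and datapoint shape shown in the session:

# Average the CPUUtilization datapoints returned above.
datapoints = instance.get_metric_data('CPUUtilization')
if datapoints:
    overall = sum(point['Average'] for point in datapoints) / len(datapoints)
    print('Average CPU over the period: %.3f%%' % overall)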
Skew is pretty new and is still changing a lot. It currently supports only a subset of available AWS resource types but more are being added all the time. If you manage a lot of AWS resources, I encourage you to give it a try. Feedback, as always, is very welcome as are pull requests!
Today’s post on how to achieve high availability in AWS with keepalived comes to us from Julian Dunn, who’s currently helping improve things at Chef.
By now, most everyone knows that running infrastructure in AWS is not the same as a traditional data center, which gives the lie to claims that you can just “lift and shift to the cloud”. In AWS, one normally achieves “high-availability” by scaling horizontally. For example, if you have a WordPress site, you could create several identical WordPress servers and put them all behind an Elastic Load Balancer (ELB), and connect them all to the same database. That way, if one of these servers fails, the ELB will stop directing traffic to it, but your site will still be available.
But about that database – isn’t it also a single-point-of-failure? You can’t very well pull the same horizontal-redundancy trick for services that explicitly have one writer (and potentially many readers). For a database, you could probably use Amazon Relational Database Service (RDS), but suppose Amazon doesn’t have a handy highly-available Platform-as-a-Service variant for the service you need?
In this post, I’ll show you how to use that old standby, keepalived, in conjunction with Virtual Private Cloud (VPC) features, to achieve real high-availability in AWS for systems that can’t be horizontally replicated.
To create high-availability out of two (or more) systems, you need a few components: a virtual IP (VIP) that can move between nodes, shared storage for the data, and a mechanism for detecting failure and performing the failover.
In AWS, we’ll use an ENI secondary private IP address as the VIP, an EBS volume for the shared storage, and keepalived for failure detection and failover.
There are a few limitations to this approach in AWS. Most important is that all instances and the block storage device must live in the same VPC subnet, which implies that they live in the same availability zone (AZ).
Keepalived for Linux has been around for over ten years, and while it is very robust and reliable, it can be very difficult to grasp because it is designed for a variety of use cases, some very distinct from the one we are going to implement. Software design diagrams like this one do not necessarily aid in understanding how it works.
For the purposes of building an HA system, you need only know a few things about keepalived:
In particular, don’t use ifconfig to examine whether the master’s interface has the VIP: ifconfig doesn’t use netlink system calls, so the VIP won’t show up. Use ip addr instead.
We’ll spin up two identical systems in the same VPC subnet for our master and backup nodes. To avoid passing AWS access and secret keys to the systems, I’ve created an IAM instance profile & role called awsadvent-ha with a policy document to let the systems manage ENI addresses and EBS volumes:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeVolumes",
        "ec2:AttachVolume",
        "ec2:DetachVolume",
        "ec2:AssignPrivateIpAddresses"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
For this exercise I used Fedora 21 AMIs, because Fedora has a recent-enough version of keepalived with VRRP-over-unicast support:
$ aws ec2 run-instances --image-id ami-164cd77e --key-name us-east1-jdunn --security-groups internal-icmp,ssh-only --instance-type t1.micro --subnet-id subnet-c0ffee11 --iam-instance-profile awsadvent-ha --count 2
You’ll notice that one of the security groups I’ve placed the machines into is entitled internal-icmp, which is a group I created to allow the instances to ping each other (send ICMP Echo Request and receive ICMP Echo Reply). This is what keepalived will use as a heartbeat mechanism between nodes.
We also need a separate EBS volume for the data, so let’s create one in the same AZ as the instances:
$ aws ec2 create-volume --size 10 --availability-zone us-east-1a --volume-type gp2
Note that the volume needs to be partitioned and formatted at some point; I don’t do that in this tutorial.
Once the two machines are up and reachable, it’s time to install and configure keepalived. SSH to them and type:
$ sudo yum -y install keepalived
I intend to write the external failover scripts called by keepalived in Ruby, so I’m going to install that, and the fog gem that will let me communicate with the AWS API:
$ sudo yum -y install ruby rubygem-fog
keepalived is configured using the /etc/keepalived/keepalived.conf file. Here’s the configuration I used for this demo:
global_defs {
  notification_email {
    jdunn@chef.io
  }
  notification_email_from keepalived@chef.io
  smtp_server 127.0.0.1
  smtp_connect_timeout 30
}

vrrp_sync_group VG_1 {
  group {
    VI_1
  }
}

vrrp_instance VI_1 {
  state MASTER
  ! nopreempt: allow lower priority machine to maintain master role
  nopreempt
  interface eth0
  virtual_router_id 1
  priority 100
  notify_backup "/etc/keepalived/awsha.rb backup"
  notify_master "/etc/keepalived/awsha.rb master"
  notify_fault "/etc/keepalived/awsha.rb fault"
  notify_stop "/etc/keepalived/awsha.rb backup"
  unicast_srcip 172.31.40.96
  unicast_peer {
    172.31.40.95
  }
  advert_int 1
  authentication {
    auth_type PASS
    auth_pass generate-a-real-password-here
  }
  virtual_ipaddress {
    172.31.36.57 dev eth0
  }
}
A couple of notes about this configuration:
The IP addresses of this node and its peer are hardcoded in the unicast_srcip and unicast_peer clauses, so make sure to change these for your own instances. (A configuration management system sure would help here…)
As previously mentioned, the external script is invoked whenever a master-to-backup or backup-to-master event occurs, via the notify_backup and notify_master directives in keepalived.conf. Upon receiving an event, it will associate and mount (or unmount and disassociate) the EBS volume from the instance, and attach or release the ENI secondary address.
The script is too long to reproduce inline here, so I’ve included it as a separate Gist.
Note: For brevity, I’ve eliminated a lot of error-handling from the script, so it may or may not work out-of-the-box. In a real implementation, you need to check for many error conditions like open files on a disk volume, poll for the EC2 API to attach/release the volume, etc.
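To give a sense of the shape of such a notify script without reproducing the Gist, here is a hypothetical Python sketch using boto (the article’s actual script is written in Ruby with fog). The volume ID, ENI ID, device name, and secondary IP below are placeholders, and error handling is omitted:

#!/usr/bin/env python
# Illustrative sketch only: the real script is Ruby with fog and includes the
# error handling described above. These identifiers are placeholders.
import subprocess
import sys

import boto.ec2
import boto.utils

VOLUME_ID = 'vol-xxxxxxxx'
ENI_ID = 'eni-xxxxxxxx'
SECONDARY_IP = '172.31.36.57'
DEVICE = '/dev/xvdf'

def become_master(conn, instance_id):
    # Claim the VIP on this instance's ENI, then attach and mount the volume.
    conn.assign_private_ip_addresses(network_interface_id=ENI_ID,
                                     private_ip_addresses=[SECONDARY_IP],
                                     allow_reassignment=True)
    conn.attach_volume(VOLUME_ID, instance_id, DEVICE)
    subprocess.call(['mount', DEVICE, '/mnt'])

def become_backup(conn, instance_id):
    # Release the volume; keepalived removes the VIP from the interface itself.
    subprocess.call(['umount', '/mnt'])
    conn.detach_volume(VOLUME_ID, instance_id=instance_id)

if __name__ == '__main__':
    metadata = boto.utils.get_instance_metadata()
    region = metadata['placement']['availability-zone'][:-1]
    conn = boto.ec2.connect_to_region(region)
    instance_id = metadata['instance-id']
    if sys.argv[1] == 'master':
        become_master(conn, instance_id)
    else:
        become_backup(conn, instance_id)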
Start keepalived on both servers:
$ sudo service keepalived start
One of them will elect itself the master, assign the ENI secondary IP to itself, and attach and mount the block device on /mnt. You can see which is which by checking the service status:
ip-172-31-40-96:~$ sudo systemctl status -l keepalived.service
...
Dec 09 21:14:44 ip-172-31-40-96.ec2.internal Keepalived_vrrp[12271]: VRRP_Instance(VI_1) Transition to MASTER STATE
Dec 09 21:14:45 ip-172-31-40-96.ec2.internal Keepalived_vrrp[12271]: VRRP_Instance(VI_1) Entering MASTER STATE
Dec 09 21:14:45 ip-172-31-40-96.ec2.internal Keepalived_vrrp[12271]: VRRP_Instance(VI_1) setting protocol VIPs.
Dec 09 21:14:45 ip-172-31-40-96.ec2.internal Keepalived_vrrp[12271]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 172.31.36.57
Dec 09 21:14:45 ip-172-31-40-96.ec2.internal Keepalived_vrrp[12271]: Opening script file /etc/keepalived/awsha.rb
Dec 09 21:14:45 ip-172-31-40-96.ec2.internal Keepalived_healthcheckers[12270]: Netlink reflector reports IP 172.31.36.57 added
Dec 09 21:14:45 ip-172-31-40-96.ec2.internal Keepalived_vrrp[12271]: VRRP_Group(VG_1) Syncing instances to MASTER state
The other machine will say that it’s transitioned to backup state:
Dec 09 21:14:46 ip-172-31-40-95.ec2.internal Keepalived_vrrp[1971]: VRRP_Instance(VI_1) Entering BACKUP STATE
To force a failover, stop keepalived on the current master. The backup system will detect that the master went away, and transition to primary:
ip-172-31-40-95:~$ sudo systemctl status -l keepalived.service
...
Dec 09 21:25:05 ip-172-31-40-95.ec2.internal Keepalived_vrrp[1971]: VRRP_Instance(VI_1) Transition to MASTER STATE
Dec 09 21:25:05 ip-172-31-40-95.ec2.internal Keepalived_vrrp[1971]: VRRP_Group(VG_1) Syncing instances to MASTER state
Dec 09 21:25:06 ip-172-31-40-95.ec2.internal Keepalived_vrrp[1971]: VRRP_Instance(VI_1) Entering MASTER STATE
Dec 09 21:25:06 ip-172-31-40-95.ec2.internal Keepalived_vrrp[1971]: VRRP_Instance(VI_1) setting protocol VIPs.
Dec 09 21:25:06 ip-172-31-40-95.ec2.internal Keepalived_healthcheckers[1970]: Netlink reflector reports IP 172.31.36.57 added
Dec 09 21:25:06 ip-172-31-40-95.ec2.internal Keepalived_vrrp[1971]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 172.31.36.57
Dec 09 21:25:06 ip-172-31-40-95.ec2.internal Keepalived_vrrp[1971]: Opening script file /etc/keepalived/awsha.rb
After a while, the backup should be reachable on the VIP and have the disk volume mounted under /mnt.
If you now start keepalived on the old master, it should come back online as the new backup.
As we’ve seen, it’s not always possible to architect systems in AWS for horizontal redundancy. Many pieces of software, particularly those involving one writer and many readers, cannot be set up this way.
In other situations, it’s not desirable to build horizontal redundancy. One real-life example is a highly-available large shared cache system (e.g. squid or varnish) where it would be costly to rebuild terabytes of cache on instance failure. At Chef Software, we use an expanded version of the tools shown here to implement our Chef Server High-Availability solution.
Finally, I also found this presentation by an AWS solutions architect in Japan very useful in identifying what L2 and L3 networking technologies are available in AWS: http://www.slideshare.net/kentayasukawa/ip-multicast-on-ec2