AWS Advent 2014: CloudFormation woes: Keep calm and use Ansible

Today’s post on using Ansible to help you get the most out of CloudFormation comes to use from Soenke Ruempler, who’s helping keep things running smoothly at Jimdo.

No more outdated information, a single source of truth. Describing almost everything as code, isn’t this one of the DevOps dreams? Recent developments have made this dream even closer. In the Era of APIs, tools like TerraForm and Ansible have evolved which are able to codify the creation and maintenance of entire “organizational ecosystems”.

This blog post is a brief description of the steps we have taken to come closer to this goal at my employer Jimdo. Before we begin looking at particular implementations, let’s take the helicopter view and have a look at the current state and the problems with it.

Current state

We began to move to AWS in 2011 and have been using CloudFormation from the beginning. While we currently describe almost everything in CloudFormation, there are some legacy pieces which were just “clicked” through the AWS console. In order to to have some primitive auditing and documentation for those, we usually document all “clicked” settings with a Jenkins job, which runs Cucumber scenarios that do a live inspection of the settings (by querying the AWS APIs with a read-only user).

While this setup might not look that bad and has a basic level of codification, there are several drawbacks, especially with CloudFormation itself, which we are going to have a look at now.

Problems with the current state

Existing AWS resources cannot be managed by CloudFormation

Maybe you have experienced this same issue: You start off with some new technology or provider and initially use the UI to play around. And suddenly, those clicked spikes are in production. At least this is the story how we came to AWS at Jimdo 😉

So you might say: “OK, then let’s rebuild the clicked resources into a CloudFormation stack.” Well, the problem is that we didn’t describe basic components like VPC and Subnets as CloudFormation stacks in the first place, and as other production setups rely on those resources, we cannot change this as easily anymore.

Not all AWS features are immediately available in CloudFormation

Here is another issue: The usual AWS feature release process is that a component team releases a new feature (e.g. ElastiCache replica groups), but the CloudFormation part is missing (the CloudFormation team at AWS is a separate team with its own roadmap). And since CloudFormation isn’t open source, we cannot add the missing functionality by ourselves.

So, in order to use those “Non-CloudFormation” features, we used to click the setup as a workaround, and then again document the settings with Cucumber.

But the click-and-document-with-cucumber approach seems to have some drawbacks:

  • It’s not an enforced policy to document, so colleagues might miss the documentation step or see no value in it
  • It might be incomplete as not all clicked settings are documented
  • It encourages a “clicking culture”, which is the exact opposite of what we want to achieve

So we need something which could be extended as a CloudFormation stack with resources that we couldn’t (yet) express in CloudFormation. And we need them to be grouped together semantically, as code.

Post processors for CloudFormation stacks

Some resources require post-processing in order to be fully ready. Imagine the creation of an RDS MySQL database with CloudFormation. The physical database was created by CloudFormation, but what about databases, users, and passwords? This cannot be done with CloudFormation, so we need to work around this as well.

Our current approaches vary from manual steps documented in a wiki to a combination of Puppet and hiera-aws: Puppet – running on some admin node – retrieves RDS instance endpoints by tags and then iterates over them and executes shell scripts. This is a form of post-processing entirely decoupled from the CloudFormation stack, actually in terms of time (hourly Puppet run) and in also in terms of “location” (it’s in another repository). A very complicated way just for the sake of automation.

Inconvenient toolset

Currently we use the AWS CLI tools in a plain way. Some coworkers use the old tools, some use the new ones. And I guess there are even folks with their own wrappers / bash aliases.

A “good” example is the missing feature of changing tags of CloudFormation stacks after creation. So if you forgot to do this in the first place, you’d need to recreate the entire stack! The CLI tools do not automatically add tags to stacks, so this is easily forgotten and should be automated. As a result we need to think of a wrapper around CloudFormation which automates those situations.

Hardcoded / copy and pasted data

The idea of “single source information” or “single source of truth” is to never have a representation of data saved in more than one location. In the database world, it’s called “database normalization”. This is a very common pattern which should be followed unless you have an excellent excuse.

But, if you may not know better, you are under time pressure, or your tooling is still immature, it’s hard to keep the data single-sourced. This usually leads to copying and pasting hardcoding data.

Examples regarding AWS are usually resource IDs like Subnet-IDs, Security Groups or – in our case- our main VPC ID.

While this may not be an issue at first, it will come back to you in the future, e.g. if you want to rollout your stacks in another AWS region, perform disaster recovery, or you have to grep for hardcoded data in several codebases when doing refactorings, etc.

So we needed something to access information of other CloudFormation stacks and/or otherwise created resources (from the so called “clicked infrastructure”) without ever referencing IDs, Security Groups, etc. directly.

Possible solutions

Now we have a good picture of what our current problems are and we can actually look for solutions!

My research resulted in 3 possible tools: AnsibleTerraForm and Salt.

As of writing this Ansible seems to be the only currently available tool which can deal with existing CloudFormation stacks out of the box and also seems to meet the other criteria at first glance, I decided to move on with it.

Spiking the solution with Ansible

Describing an existing CloudFormation stack as Ansible Playbook

One of the mentioned problems are the inconvenient CloudFormation CLI tools: To create/update a stack, you would have to synthesize at least the stack name, template file name, and parameters, which is no fun and error-prone. For example:

With Ansible, we can describe a new or existing CloudFormation stack with a few lines as an Ansible Playbook, here one example:

Creating and updating (converging) the CloudFormation stack becomes as straightforward as:

Awesome! We finally have great tooling! The YAML syntax is machine and human readable and our single source of truth from now on.

Extending an existing CloudFormation stack with Ansible

As for added power, it should be easier to implement AWS functionality that’s currently missing from CloudFormation as an Ansible module than a CloudFormation external resource […] and performing other out of band tasks, letting your ticketing system know about a new stack for example, is a lot easier to integrate into Ansible than trying to wrap the cli tools manually.

— Dean Wilson

The above example stack uses the AWS ElastiCache feature of Redis replica groups, which unfortunately isn’t currently supported by CloudFormation. We could only describe the main ElastiCache cluster in CloudFormation. As a workaround, we used to click this missing piece and documented it with Cucumber as explained above.

A short look at the Ansible documentation reveals there is currently no support for ElastiCache replica groups in Ansible as well. But a quick research shows we have the possibility to extend Ansible with custom modules.

So I started spiking my own Ansible module to handle ElastiCache replica groups, inspired by the existing “elasticache” module. This involved the following steps:

  1. Put the module under “library/”, e.g. (I published the unfinished skeleton as a Gist for reference)
  2. Add an output to the existing CloudFormation stack which is creating the ElastiCache cluster, in order to return the ID(s) of the cache cluster(s): We need them to create the read replica group(s). Register the output of the cloudformation Ansible task:
  1. Extend the playbook to create the ElastiCache replica group by reusing the output of thecloudformation task:

Pretty awesome: Ansible works as a glue language while staying very readable. Actually it’s possible to read through the playbook and have an idea what’s going on.

Another great thing is that we can even extend core functionality of Ansible without any friction (as waiting for upstream to accept a commit, build/deploy new packages, etc) which should increase the tool acceptance across coworkers even more.

This topic touches another use-case: The possibility to “chain” CloudFormation stacks with Ansible: Reusing Outputs from Stacks as parameters for other stacks. This is especially useful to split big monolithic stacks into smaller ones which as a result can be managed and reused independently (separation of concerns).

Last but not least, it’s now easy to extend the Ansible playbook with post processing tasks (remember the RDS/Database example above).

Describing existing AWS resources as a “Stack”

As mentioned above, one issue with CloudFormation is a a way to import existing infrastructure into a stack. Luckily, Ansible supports most of the AWS functionality so we can create a playbook to express existing infrastructure as code.

To discover the possibilities, I converted a fraction of our current production VPC/subnet setup into an Ansible playbook:

As you can see, there is not even a hardcoded VPC ID! Ansible identifies the VPC by a Tag-CIDR tuple, which meets our initial requirement of “no hardcoded data”.

To stress this, I changed the aws_region variable to another AWS region, and it was possible to create the basic VPC setup in another region, which is another sign for a successful single-source-of-truth.

Single source information

Now we want to reuse the information of the VPC which we just brought “under control” in the last example. Why should we do this? Well, in order to be fully automated (which is our goal), we cannot afford any hardcoded information.

Let’s start with the VPC ID, which should be one of the most requested IDs. Getting it is relatively easy because we can just extract it from the ec2_vpc module output and assign it as a variable with the set_fact Ansible module:

OK, but we also need to reuse the subnet information – and to avoid hardcoding, we need to address them without using subnet IDs. As we tagged the subnets above, we could use the tuple (name-tag, Availability zone) to identify and group them.

With the awesome help from the #ansible IRC channel folks, I could make it work to extract one subnet by ID and Tag from the output:

While this satisfies the single source requirement, it doesn’t seem to scale very well with a bunch of subnets. Imagine you’d have to do this for each subnet (we already have more than 50 at Jimdo).

After some research I found out that it’s possible to add custom filters to Ansible that allow to manipulate data with Python code:

We can now assign the subnets for later usage like this in Ansible:

This is a great way to prepare the subnets for later usage, e.g. in iterations, to create RDS or ElastiCache subnet groups. Actually, almost everything in a VPC needs subnet information.

Those examples should be enough for now to give us confidence that Ansible is a great tool which fits our needs. Takeaways

As of of writing this, Ansible and CloudFormation seem to be a perfect fit for me. The combination turns out to be a solid solution to the following problems:

  • Single source of information / no hardcoded data
  • Combining documentation and “Infrastructure as Code”
  • Powerful wrapper around basic AWS CLI tooling
  • Inception point for other orchestration software (e. g. CloudFormation)
  • Works with existing AWS resources
  • Easy to extend (Modules, Filters, etc: DSL weaknesses can be worked around by hooking in python code)

Next steps / Vision

After spiking the solution, I could imagine the following next steps for us:

  • Write playbooks for all existing stacks and generalize concepts by extracting common concepts (e.g. common tags)
  • Transform all the tests in Cucumber to Ansible playbooks in order to have a single source
  • Remove hardcoded IDs from existing CloudFormation stacks by parameterizing them via Ansible.
  • Remove AWS Console (write) access to our Production AWS account in order to enforce the “Infrastructure as Code” paradigm
  • Bring more clicked infrastructure / ecosystem under IaC-control by writing more Ansible modules (e.g. GitHub Teams and Users, Fastly services, Heroku Apps, Pingdom checks)
  • Spinning up the VPC including some services in another region in order to prove we are fully single-sourced (e. g. no hardcoded IDs) and automated.
  • Trying out Ansible Tower for:
    • Regular convergence runs in order to avoid configuration drift and maybe even revert clicked settings (similar to “Simian army” approach)
    • A “single source of Infrastructure updates”
  • Practices like Game Days to actually test Disaster recovery scenarios

I hope this blog post has brought some new thoughts and inspirations to the readers. Happy holidays!