Providing Static IPs for Non-Trivial Architectures

12 December 2016

Author: Oli Wood
Editors: Seth Thomas, Scott Francis

An interesting problem landed on my desk a month ago.  It seemed trivial to begin with, but once we started digging it turned out to be more complex than we thought.  A small set of our clients needed to restrict outgoing traffic from their network to a whitelist of IP addresses.  This meant providing a finite set of IPs which could act as a route into our data collection funnel.

Traditionally this has not been too difficult, but once you take into account the ephemeral nature of cloud infrastructure and the business requirements for high availability and horizontal scaling (within reason), it gets more complex.

We also needed to take into account that our backend system (api.example.com) is deployed in a blue/green manner (with traffic being switched by DNS), and that we didn’t want to incur any additional management overhead with the new system.  For more on Blue/Green see http://martinfowler.com/bliki/BlueGreenDeployment.html.

Where we ended up looks complex but is actually several small systems glued together.  Let’s describe the final setup and then dig into each section.

The Destination

A simplified version of the final solution.

The View from the Outside World

Our clients can address our system by two routes:

  • api.example.com – our previous public endpoint.  This is routed by Route 53 to either api-blue.example.com or api-green.example.com
  • static.example.com – our new address which will always resolve to a finite set of IP addresses (we chose 4).  This will eventually route through to the same blue or green backend.

The previous infrastructure

api-blue.example.com is an autoscaling group deployed (as part of a wider system) inside its own VPC.  When we blue/green deploy, an entire new VPC is created (this is something we’re considering revisiting).  It is fronted by an ELB.  Given the nature of ELBs, the IP addresses of that ELB will change over time, which is why we started down this road.

The proxying infrastructure

static.example.com is a completely separate VPC which houses 4 autoscaling groups, each set to a minimum size of 1 and a maximum size of 1.  The EC2 instances are assigned an EIP on boot (more on this later) and have HAProxy 1.6 installed.  HAProxy is set up to provide two things:

  • A TCP proxy endpoint on port 443
  • A healthcheck endpoint on port 9000

The DNS configuration

The new DNS entry for static.example.com is configured so that it only returns IP addresses for up to 4 of the EIPs, based on the results of their healthcheck (as provided by HAProxy).

How we got there

The DNS setup

static.example.com is built from a set of four Health Checks, which feed a Traffic Policy, which in turn creates the Policy Record (the equivalent of your normal DNS entry).

Steps to create Health Checks:

  1. Log into the AWS Console
  2. Head to Route 53
  3. Head to Health Checks
  4. Create new Health Check
    1. What to monitor => Endpoint
    2. Specify endpoint by => IP Address
    3. Protocol => HTTP
    4. IP Address => [Your EIP]
    5. Host name => Ignore
    6. Port => 9001
    7. Path => /health

Repeat until you have four Health Checks, one per EIP.  Watch until they all go green.
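
If you prefer the CLI to the console, each Health Check can be created with something along these lines (the caller reference and IP address are placeholders rather than our real values):

    aws route53 create-health-check \
      --caller-reference static-proxy-eip-1 \
      --health-check-config '{
        "IPAddress": "203.0.113.10",
        "Port": 9001,
        "Type": "HTTP",
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3
      }'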

Steps to create Traffic Policy:

  1. Route 53
  2. Traffic Policies
  3. Create Traffic Policy
    1. Policy name => something sensible
    2. Version description => something sensible

This opens up the GUI editor:

  1. Choose DNS type A: IP address
  2. Connect to => Weighted Rule
  3. Add 2 more Weights
  4. On each choose “Evaluate target health” and then one of your Health Checks
  5. Make sure the Weights are all set the same (I chose 10)
  6. For each click “Connect to” => New Endpoint
    1. Type => Value
    2. Value => EIP address

The traffic policy in the GUI

Adding the Policy record

  1. Route 53
  2. Policy Record
  3. Create new Policy Record
    1. Traffic policy => Your new policy created above
    2. Version => it’ll probably be version 1 because you just created it
    3. Hosted zone => choose the domain you’re already managing in AWS
    4. Policy record => add your equivalent of static.example.com
    5. TTL => we chose 60 seconds

And there you go: static.example.com will route traffic to your four EIPs, but only if they are available.

The Autoscaling groups

The big question you’re probably wondering here is “why did they create four separate Autoscaling groups?  Why not just use one?”  It’s a fair question, and our choice might not be right for you, but the reasoning is that we didn’t want to build something else to manage which EIPs were assigned to each of the 4 instances.  By using 4 separate Autoscaling groups we can use 4 separate Launch Configurations, and then use EC2 tags so that each instance knows which EIP to associate with itself at launch.

The key things here are (a rough CloudFormation sketch of one group comes after this list):

  • Each of the Autoscaling Groups is defined separately in our CloudFormation stack
  • Each of the Autoscaling Groups has its own Launch Configuration
  • We place two Autoscaling Groups in each of our Availability Zones
  • We place two Autoscaling Groups in each Public Subnet
  • Tags on the Autoscaling Group are set with “PropagateAtLaunch: true” so that the instances they launch end up with the EIP reference on them
  • Each of the four Launch Configurations includes the same UserData script (Base64 encoded in our CloudFormation template)
  • The LaunchConfiguration includes an IAM Role giving enough permissions to be able to tag the instance
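
Here’s roughly what one of the four groups looks like; the resource names and allocation ID are invented for illustration, not lifted from our real template:

    ProxyGroupA:
      Type: AWS::AutoScaling::AutoScalingGroup
      Properties:
        LaunchConfigurationName: !Ref ProxyLaunchConfigA
        MinSize: "1"
        MaxSize: "1"
        VPCZoneIdentifier:
          - !Ref PublicSubnetA
        Tags:
          # PropagateAtLaunch copies the tag onto the instance, which is
          # how it knows which EIP to claim at boot
          - Key: EipAllocationId
            Value: eipalloc-0a1b2c3d
            PropagateAtLaunch: true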

The UserData script
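
A minimal sketch of the sort of thing it does, assuming the EIP allocation ID arrives on a hypothetical EipAllocationId tag propagated from the Autoscaling Group (the tag name and exact CLI usage are illustrative, not our exact script):

    #!/bin/bash
    set -euo pipefail

    # Who am I, and where am I running?
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
    REGION=${AZ%?}   # strip the AZ letter to get the region

    # Read the EIP allocation ID from the tag propagated by the Autoscaling Group
    ALLOCATION_ID=$(aws ec2 describe-tags --region "$REGION" \
      --filters "Name=resource-id,Values=$INSTANCE_ID" "Name=key,Values=EipAllocationId" \
      --query 'Tags[0].Value' --output text)

    # Claim (or steal) the EIP for this instance
    aws ec2 associate-address --region "$REGION" \
      --instance-id "$INSTANCE_ID" \
      --allocation-id "$ALLOCATION_ID" \
      --allow-reassociation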

The IAM Role statement
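
Something along these lines is enough for the boot script above to read the instance’s tags and claim its EIP (a sketch rather than our exact statement):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "ec2:DescribeTags",
            "ec2:DescribeAddresses",
            "ec2:AssociateAddress",
            "ec2:CreateTags"
          ],
          "Resource": "*"
        }
      ]
    }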

The EC2 instances

We chose c4.xlarge instances to provide a good amount of network throughput.  Because HAProxy is running in TCP mode we struggle to monitor the traffic levels, so we’re using CloudWatch to alert on very high or low network output from the four instances.
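
As a rough illustration, an alarm on one instance’s outbound traffic might look like this (the threshold, instance ID and SNS topic are all placeholders):

    aws cloudwatch put-metric-alarm \
      --alarm-name proxy-a-network-out-high \
      --namespace AWS/EC2 \
      --metric-name NetworkOut \
      --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
      --statistic Average \
      --period 300 \
      --evaluation-periods 1 \
      --threshold 5000000000 \
      --comparison-operator GreaterThanThreshold \
      --alarm-actions arn:aws:sns:eu-west-1:123456789012:ops-alerts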

The EC2 instances themselves are launched from a custom AMI which includes very little except a version of HAProxy (thanks to ITV for https://github.com/ITV/rpm-haproxy).  We’re using this fork because it supplies the slightly newer HAProxy version 1.6.4.

Unusually for us, we’ve baked the HAProxy config into the AMI.  I suspect this is a decision we will revisit at a later date, and have the config pulled from S3 at boot time instead.

HAProxy is set to start on boot.  Something we shall probably add at a later date is to have the Autoscaling Groups use the same healthcheck endpoint that HAProxy provides to Route 53 to determine instance health.  That way we’ll launch a replacement if an instance comes up but does not provide a healthy HAProxy for some reason.

The HAProxy setup

HAProxy is a fabulously flexible beast and we had a lot of options on what to do here.  We did however wish to keep it as simple as possible.  With that in mind, we opted to not offload SSL at this point but to act as a passthrough proxy direct to our existing architecture.

Before we dive into the config, however, it’s worth mentioning our choice of backend URL.  We opted to route back to api.example.com because this means that when we blue/green deploy our existing setup we don’t need to make any changes to our HAProxy setup.  By using its own health check mechanism and “resolvers” entry we can make sure that the IP addresses that it is routing to (the new ELB) aren’t more than a few seconds out of date.  This loopback took us a while to figure out and is (again) something we might revisit in the future.

Here are the important bits of the config file:

The resolver

This makes use of AWS’s internal DNS service.  It has to be used in conjunction with a health check on the backend server.
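
A sketch of the resolvers section, assuming a VPC whose built-in DNS resolver sits at 10.0.0.2 (it lives at the VPC CIDR base address plus two, so adjust for your network):

    # AWS VPC DNS, re-queried at runtime rather than only at startup
    resolvers mydns
        nameserver awsdns 10.0.0.2:53
        resolve_retries 3
        timeout retry   1s
        hold valid      10s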

The front end listener

Super simple.  This would be more complex if you wanted to route traffic from different source addresses to different backends using SNI (see http://blog.haproxy.com/2012/04/13/enhanced-ssl-load-balancing-with-server-name-indication-sni-tls-extension/).
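
Roughly, the frontend amounts to no more than this (the frontend and backend names are illustrative):

    # Plain TCP passthrough on 443 -- no SSL offload here
    frontend https_in
        bind *:443
        mode tcp
        option tcplog
        default_backend api_out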

The backend listener

The key thing here is the inclusion of the resolver (mydns, as defined above) alongside the server’s health check.  It’s the combination of the two which causes HAProxy to re-evaluate the DNS entry.
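
A sketch of the backend, using the mydns resolver above and the backend name from the frontend sketch:

    backend api_out
        mode tcp
        # "check" plus "resolvers" is what makes HAProxy 1.6 re-run the
        # DNS lookup for api.example.com as the health checks tick over
        server api api.example.com:443 check resolvers mydns resolve-prefer ipv4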

The outwards facing health check

This will return a 200 if everything is OK, a 503 if the backend is down, and a connection failure if HAProxy itself is down.  This correctly informs the Route 53 health checks, and if needed Route 53 will stop including that IP address.
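
A sketch of that health endpoint, assuming it listens on port 9000 with a path of /health; whichever port you bind here needs to match the one your Route 53 Health Checks poll:

    listen health
        bind *:9000
        mode http
        # 200 while at least one backend server is up, 503 otherwise;
        # if HAProxy itself is down the connection simply fails
        monitor-uri /health
        acl api_down nbsrv(api_out) lt 1
        monitor fail if api_down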

What we did to test it

We ran through various scenarios to check how the system coped:

  • Deleting one of the proxy instances and seeing it vanish from the group returned from static.example.com
  • Doing a blue/green deployment and seeing HAProxy update its backend endpoint
  • Blocking access to one AZ with a tweak to the Security Group, to simulate the AZ becoming unavailable
  • Forcing 10 times our normal load through using Vegeta
  • Running a soak test at sensible traffic levels over several hours, also with Vegeta (a sample command follows this list)
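
A Vegeta run of that sort looks something like this (the target, rate and duration are illustrative):

    echo "GET https://static.example.com/" | \
      vegeta attack -duration=60s -rate=500 | \
      vegeta report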

The end result

While this is only providing 4 EC2 instances which proxy traffic, it’s a pattern which could be scaled out very easily, with each section bringing another piece of the resilience pie to the table.

  • Route 53 does a great job of only including EIPs that are associated with healthy instances
  • The Autoscaling Groups make sure that our proxy instances will bounce back if something nasty happens to them
  • UserData and Tags provide a neat way for the instances to self-manage the allocation of EIPs
  • HAProxy provides both transparent routing and health checks.
  • Route 53 works really well for Blue/Greening our traffic to our existing infrastructure.

It’s not perfect (I imagine we’ll have issues with some clients caching DNS records for far too long at some point), and I’ll wager we’ll end up tuning some of the timeouts and HAProxy config at some point in the future, but for now it’s out there, happily providing an endpoint for our customers (and not taking up any of our time).  We’ve also successfully tested how to deploy updates (deploy a new CloudFormation stack and let the new instances “steal” the EIPs).

About the Author:

Oli Wood has been deploying systems into AWS since 2010 in businesses ranging from two-person startups to multi-million dollar enterprises. Prior to that he mostly battled with deploying them onto other service providers, cutting his teeth in a version control and deployment team on a Large Government Project back in the mid-2000s.

Inside of work he spends time, train tickets and shoe leather helping teams across the business benefit from DevOps mentality.

Outside of work he can mostly be found writing about food on https://www.omnomfrickinnom.com/ and documenting the perils of poor posture at work at http://goodcoderbadposture.com/

Online he’s @coldclimate

About the Editors:

Scott Francis has been designing, building and operating Internet-scale infrastructures for the better part of 20 years. He likes BSD, Perl, AWS, security, cryptography and coffee. He’s a good guy to know in a zombie apocalypse. Find him online at  https://linkedin.com/in/darkuncle and https://twitter.com/darkuncle.