A Hybrid of One: Building a Private Cloud in AWS

12. December 2018

Introduction

Adoption of the cloud is becoming more and more popular for all types of businesses. When you’re starting from scratch, you have a blank canvas to work from – there’s no existing blueprint or guide. But what if you’re not in that position? What if you’ve already got an established security policy in place, or you’re working in a regulated industry that sets limits on what’s appropriate or acceptable for your company’s IT infrastructure?

Being able to leverage the elasticity of the public cloud is one of its biggest – if not the biggest – advantages over a traditional corporate IT environment. Building your own private cloud takes time, money and a significant amount of up-front investment – an investment that might not be acceptable to your organisation, and that may never generate a return.

But what if we can build a “private” cloud using public cloud services?

The Virtual Private Cloud

If you’ve looked at AWS, you’ll be familiar with the concept of a “VPC” – a “Virtual Private Cloud”, the first resource you’ll create in your AWS account (if you don’t use the default VPC created in every region when your account is opened, that is!). It’s private in the sense that it’s your own little bubble, to do with as you please. You control it, nurture it and manage it (hopefully with automation tools!). But private doesn’t mean isolated, and a VPC on its own does not fit the definition of a “private cloud.”

If you misconfigure your AWS environment, you can accidentally expose your environment to the public Internet, and an intruder may be able to use this as a stepping-stone into the rest of your network.

In this article, we’re going to look at the building blocks of your own “private” cloud in the AWS environment. We’ll cover isolating your VPC from the public internet, controlling what data enters and, crucially, leaves your cloud, as well as ensuring that your users can get the best out of their new shiny cloud.

Connecting to your “private” Cloud

AWS is most commonly accessed over the Internet. You publish ‘services’ to be consumed by your users. This is how many people think of AWS – a load balancer with a couple of web servers, some databases and perhaps a bit of email or workflow.

In the “private” world, it’s unlikely you’ll want to provide direct access to your services over the Internet. You need to guarantee the integrity and security of your data. To maximise your use of the new environment you want to make sure it’s as close to your users and the rest of your infrastructure as possible.

AWS has two private connectivity methods you can use for this: DirectConnect and AWS managed VPN.

Both technologies allow you to “extend” your network into AWS. When you create your VPC, you allocate an IP range (one that doesn’t clash with your internal network), and you can then establish a site-to-site connection to your new VPC. Any instance or service you spin up in your VPC is accessed directly from your internal network, using its private IP address – just as if a new datacenter had appeared on your network. Remember, you can still configure your VPC with an Internet Gateway and allocate public IP addresses (or Elastic IPs) to your instances, which would then give them both an Internet-facing IP and an IP on your internal network – you probably don’t want to do this!

The AWS managed VPN service allows you to establish a VPN over the Internet between your network(s) and AWS. You’re limited by the speed of your internet connection. Also, you’re accessing your cloud environment over the Internet, with all the variable performance and latency that entails.

The diagram below shows an example of how AWS Managed VPN connectivity interfaces with your network:

AWS DirectConnect allows you to establish a private circuit with AWS (like a traditional “leased line”). Your network traffic never touches the Internet or any other uncontrolled public network. You can connect directly to AWS’ routers at one of their shared facilities, or you can use a third-party service to provide the physical connectivity. The right option depends on your connectivity requirements: connecting directly to AWS means you own the service end-to-end, while using a third party gives you greater flexibility in how you design resiliency and in the connection speed you want to AWS (DirectConnect offers physical 1GbE or 10GbE connections, but you might want something in between, which is where a third party can really help).

The diagram below shows an example of how you can architect DirectConnect connectivity between your corporate datacenter and the AWS cloud. DirectConnect also allows you to connect directly to Amazon services over your private connection, if required. This ensures that no traffic traverses the public Internet when you’re accessing AWS hosted services (such as API endpoints, S3, etc.). DirectConnect also allows you to access services across different regions, so you could have your primary infrastructure in eu-west-1 and your DR infrastructure in eu-west-2, and use the same DirectConnect to access both regions.

[Diagram: DirectConnect overview]

Both connectivity options offer the same native approach to access control you’re familiar with. Network ACLs (NACLs) and Security Groups function exactly as before – you can reference your internal network IP addresses/CIDR ranges as normal and control service access by IP and port. There’s no NAT in place between your network and AWS; it’s just like another datacenter on your network.

Pro Tip: You probably want to delete your default VPCs. By default, AWS services will launch into the default VPC for a specific region, and this comes configured with the standard AWS template of ‘public/private’ subnets and internet gateways. Deleting the default VPCs and associated security groups makes it slightly harder for someone to spin up a service in the wrong place accidentally.
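As a rough starting point, a loop like this (standard AWS CLI calls; nothing here is specific to any particular environment) will show you which regions still have a default VPC – actually deleting one also means removing its subnets and internet gateway first:

    for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
      echo "--- ${region}"
      # A default VPC is flagged with isDefault=true; empty output means it is already gone
      aws ec2 describe-vpcs --region "$region" \
        --filters Name=isDefault,Values=true \
        --query 'Vpcs[].VpcId' --output text
    done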

Workload Segregation

You’re not restricted to a single AWS VPC (by default, you’re able to create 5 per region, but this limit can be increased by contacting AWS support). VPCs make it very easy to isolate services – services you might not want to be accessed directly from your corporate network. You can build a ‘DMZ-like’ structure in your “private” cloud environment.

One good example of this is in the diagram below – you have a “landing zone” VPC where you host services that should be accessible directly from your corporate network (allowing you to create a bastion host environment), and you run your workloads elsewhere – isolated from your internal corporate network. In the example below, we also show an ‘external’ VPC – allowing us to access Internet-based services, as well as providing a secure inbound zone where we can accept incoming connectivity if required (essentially, this is a DMZ network, and can be used for both inbound and outbound traffic).

Through the use of VPC peering, you can ensure that your workload VPCs can be reached from your inbound-gateway VPC; but because VPCs do not support transitive networking configurations, you cannot connect from the internal network directly to your workload VPC.

[Diagram: multi-VPC peering and PrivateLink]

Shared Services

Once your connectivity between your corporate network and AWS is established, you’ll want to deploy some services. Sure, spinning up an EC2 instance and connecting to it is easy, but what if you need to connect to an authentication service such as LDAP or Active Directory? Do you need to route your access via an on-premises web proxy server? Or what if you want to publish services to the rest of your AWS environment or your corporate network, but keep them isolated in your DMZ VPC?

Enter AWS PrivateLink: Launched at re:Invent in 2017, it allows you to “publish” a Network Load Balancer to other VPCs or other AWS Accounts without needing to establish VPC peering. It’s commonly used to expose specific services or to supply MarketPlace services (“SaaS” offerings) without needing to provide any more connectivity over and above precisely what your service requires.

We’re going to offer an example here of using PrivateLink to expose an AWS-hosted web proxy server to our isolated inbound and workload VPCs. This gives you the ability to keep sensitive services isolated from the rest of your network while still providing essential functionality. AWS does not allow transitive routing across VPCs (i.e., you cannot route from VPC A to VPC C via a shared VPC B), but PrivateLink allows you to work around this limitation for individual services (essentially, anything you can “hide” behind a Network Load Balancer).

Assuming we’ve created the network architecture as per the diagram above, we need to create our Network Load Balancer first. NLBs are the only load balancer type supported by PrivateLink at present.

[Screenshot: creating the Network Load Balancer]

Once this is complete, we can then create our ‘Endpoint Service,’ which is in the VPC section of the console:

[Screenshot: creating the Endpoint Service]

Once the Endpoint Service is created, take note of the Endpoint Service Name – you’ll need it to create the actual endpoints in your VPCs.

[Screenshot: Endpoint Service details]

The Endpoint Service Name is unique across all VPC endpoints in a specific region. This means you can share this with other accounts, which are then able to discover your endpoint service. By default, you need to accept all requests to your endpoint manually, but this can be disabled (you probably don’t want this, though!). You can also whitelist specific account IDs that are allowed to create a PrivateLink connection to your endpoint.

Once your Endpoint Service is created, you then need to expose it to your VPCs. This is done from the ‘Endpoints’ configuration screen under VPCs in the AWS console. Validate your endpoint service name and select the VPC required – simple!

[Screenshot: endpoint details]

 

You can then use this DNS name to reference your VPC endpoint. It will resolve to an IP address in your VPC (via an Elastic Network Interface), but traffic to this endpoint will be routed directly across the Amazon network to the Network Load Balancer.
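If you prefer the CLI to the console, the overall flow looks roughly like this – a sketch in which every ID, name and ARN is a placeholder:

    # 1. Create an internal Network Load Balancer in the shared-services VPC
    aws elbv2 create-load-balancer \
      --name proxy-nlb --type network --scheme internal \
      --subnets subnet-aaaa1111 subnet-bbbb2222

    # 2. Publish it as an endpoint service (the output includes the Endpoint Service Name)
    aws ec2 create-vpc-endpoint-service-configuration \
      --network-load-balancer-arns arn:aws:elasticloadbalancing:eu-west-1:111111111111:loadbalancer/net/proxy-nlb/abcdef1234567890 \
      --acceptance-required

    # 3. In each consuming VPC, create an interface endpoint pointing at that service name
    aws ec2 create-vpc-endpoint \
      --vpc-endpoint-type Interface \
      --vpc-id vpc-cccc3333 \
      --service-name com.amazonaws.vpce.eu-west-1.vpce-svc-0123456789abcdef0 \
      --subnet-ids subnet-dddd4444 \
      --security-group-ids sg-eeee5555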

What’s in a Name?

Typically, one of the biggest hurdles with connecting between your internal network and AWS is the ability to route DNS queries correctly. DNS is key to many Amazon services, and Amazon Provided DNS (now Route53 Resolver) contains a significant amount of behind-the-scenes intelligence, such as allowing you to reach the correct Availability Zone target for your ALB or EFS mount point.

Hot off the press is the launch of Route53 Resolver, which removes the need to build your own DNS infrastructure to route requests between your AWS network and your internal network, while still letting you leverage the intelligence built into the Amazon DNS service. Previously, you would need to run your own DNS forwarder on an EC2 instance to route queries to your corporate network. From the AWS perspective, that meant all of your DNS requests originated from a single server in a specific AZ (which might be different from the AZ of the client system), so you could be handed the endpoint for a different AZ than your own. With a service such as EFS, this could result in increased latency and a high cross-AZ data transfer bill.

Here’s an example of how the Route53 resolver automatically picks the correct mount point target based on the location of your client system:
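The illustration below is hypothetical (the file system ID and IP addresses are made up), but it shows the behaviour: the same EFS DNS name resolves to the mount target in whichever AZ the querying instance happens to be in.

    # From an instance in eu-west-1a
    $ dig +short fs-12345678.efs.eu-west-1.amazonaws.com
    10.10.1.23

    # The same query from an instance in eu-west-1b returns that AZ's mount target
    $ dig +short fs-12345678.efs.eu-west-1.amazonaws.com
    10.10.2.47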

Pro Tip: If you’re using a lot of standardised endpoint services (such as proxy servers), using a common DNS name across VPCs is a real time-saver. This requires you to create a Route53 internal zone for each VPC (such as workload.example.com, inbound.example.com) and update the VPC DHCP Option Set to hand out this domain name via DHCP to your instances. This then allows you to create a record in each zone with a CNAME to the endpoint service, for example:

From an instance in our workload VPC:
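A hypothetical result, assuming the per-VPC private zone is named privatelink and with made-up endpoint names and IPs:

    $ dig +short proxy.privatelink
    vpce-0aaa1111-workload.vpce-svc-0123456789abcdef0.eu-west-1.vpce.amazonaws.com.
    10.1.0.21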

And the same commands from an instance in our inbound VPC:
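Again hypothetically, the same name now resolves to the inbound VPC’s own endpoint:

    $ dig +short proxy.privatelink
    vpce-0bbb2222-inbound.vpce-svc-0123456789abcdef0.eu-west-1.vpce.amazonaws.com.
    10.2.0.15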

In the example above, we could use our configuration management system to set the http_proxy environment variable to ‘proxy.privatelink:3128’ everywhere, without needing any per-VPC logic. Neat!

Closing Notes

There are still AWS services that expect to have Internet access available from your VPC by default. One example of this is AWS Fargate – the Amazon-hosted and managed container deployment solution. However, Amazon is constantly migrating more and more services to PrivateLink, meaning this restriction is slowly going away.

A full list of currently available VPC endpoint services can be found in the VPC Endpoint documentation. AWS-provided VPC endpoints also give you the option of updating DNS so that resolving the relevant AWS service name returns the VPC endpoint IPs (i.e. ec2.eu-west-1.amazonaws.com -> vpce-123-abc.ec2.eu-west-1.amazonaws.com -> 10.10.0.123), so you do not have to make any changes to your applications in order to use the Amazon-provided endpoints.

About the Author

Jon is a freelance cloud devoperative buzzword-hater, currently governing the clouds for a financial investment company in London, helping them expand their research activities into “the cloud.”

Before branching out into the big bad world of corporate consulting, Jon spent five years at Red Hat, focusing on the financial services sector as a Technical Account Manager, and then as an on-site consultant.

When he’s not yelling at the cloud, Jon is a trustee of the charity Service By Emergency Rider Volunteers – Surrey & South London, the “Blood Runners,” who provide free out-of-hours transport services to the UK National Health Service. He is also guardian to two small dogs and a flock of chickens.

Feel free to shout at him on Twitter, or send an old-fashioned email.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Last Minute Naughty/Nice Updates to Santa’s List

11. December 2018


The other day I was having a drink with my friend Alabaster. You might have heard of him before, but if not, you’ve heard of his boss, Santa. Alabaster is a highly educated elf who has been the Administrator of the Naughty and Nice list for more than five decades now. It has been his responsibility to manage it since it was still a paper list, and he’s known for moving the list to a computer. Last year he moved it to AWS DynamoDB.

“It went great!” he told me with a wince that made me unsure of what he meant. “But then on the 23rd of December, we lost some kids.”

“What?! What do you mean, you lost some kids?”, I asked.

“Well. Let me explain. The process is a little complicated.

Migrating the list to AWS was more than just migrating data. We also had to change the way we manage the naughty and nice list. Before, with our own infrastructure, we didn’t care about resource utilization, as long as the infrastructure could handle the load. We were updating the list five times a minute per kid. At 1.8 billion kids, that was around 150 million requests per second – constant, and easy to manage.

Bushy, the elf that made the toy-making machine, is the kind of person who thinks a half-full glass is just twice the size it should be. Bushy pointed out that the information about whether a child was naughty or not, and their location for Christmas, was only needed on December 24th. He proposed that we didn’t need to update the information so frequently.

So we changed how we updated the data. It was a big relief, but it resulted in a spiky load. In December, we suddenly found ourselves with a lot of data to update – 1.8 billion records, to be exact. And it failed. The DynamoDB auto-scaling mostly worked, with some manual fiddling to keep increasing the write capacity fast enough. But on December 23rd we had our usual all-hands-on-deck meeting about last-minute changes in kids’ behaviour, and no one was reacting to the throttling alarms. We didn’t notice until the 25th. By then some records had been lost, and some gifts had been delivered to the wrong addresses.

Some kids stopped believing in Santa because someone else actually delivered their gifts late! It was the worst mistake of my career.”

“Oh, thank goodness you didn’t literally lose some kids! But, oh wow. Losing so many kids’ trust and belief must have really impacted morale at the North Pole! That sounds incredibly painful. So what did you learn, and how has it changed the process for this year?” I asked.

“Well, the main difference is that we decoupled the writes. DynamoDB likes regular writes and can scale in a reasonable way if the traffic is not all peaks or increasing really fast.

So we send all the information to SQS and then use Lambdas to process the writes. That gives us two ways of keeping control without risking a failed write: we can throttle the writes by changing the Lambda concurrency, and we can control the table’s write capacity either with auto-scaling or manually.”

“That looks like an interesting way of smoothing a spiky load. Can you share an example?” I asked.

“I can show you the lambda code; I’ve just been playing with it.” He turned his laptop towards me, showing me the code. It was nearly empty – just a process_event function that used boto3 to do the write.

“That’s it?” I asked.

“Yes, we use zappa for it, so it’s mostly configuration,” he replied.

We paired at the conference hotel bar, as you do when you find an interesting technical solution. First, Alabaster told me I had to create an SQS queue, so we visited the SQS console. The main issue was that AWS appears to have a completely different UI for the north-pole-1 region (which, to be honest, I didn’t know existed). I already had Python 3.6 set up, so I only needed to create a virtual environment with python -m venv sqs-test and activate it with . sqs-test/bin/activate.
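For those of us without access to north-pole-1, creating an equivalent queue from the CLI is a one-liner (queue name and region are placeholders):

    aws sqs create-queue --queue-name naughty-nice-updates --region eu-west-1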

Then, he asked me to install zappa with pip install zappa. We created a file zappa_settings.json starting with the following as a base (you can use zappa init but you’ll then need to customise it for a non-webapp use-case):
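A minimal sketch of that base (the profile, bucket, account ID and queue name are placeholders, and the exact keys can vary between Zappa versions):

    {
        "dev": {
            "project_name": "naughty-nice-writer",
            "runtime": "python3.6",
            "profile_name": "santa",
            "aws_region": "north-pole-1",
            "s3_bucket": "zappa-naughty-nice-deployments",
            "apigateway_enabled": false,
            "events": [{
                "function": "app.process_event",
                "event_source": {
                    "arn": "arn:aws:sqs:north-pole-1:111111111111:naughty-nice-updates",
                    "batch_size": 10,
                    "enabled": true
                }
            }]
        }
    }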

I changed the profile_name and aws_region to match my credentials configuration and also the s3_bucket and the event_source arn to match my newly created SQS queue (as I don’t have access to the north-pole-1 region).

We then just sorted out a baseline with a minimalistic app.py:
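Something like the following sketch does the job – it just logs the incoming event and a little of the Lambda context (the logging details are my own, not the original code):

    import json
    import logging

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)


    def process_event(event, context):
        # Dump the raw SQS event and some context metadata; both end up in CloudWatch Logs
        logger.info("Event: %s", json.dumps(event))
        logger.info("Function %s, request id %s, time remaining %sms",
                    context.function_name,
                    context.aws_request_id,
                    context.get_remaining_time_in_millis())
        return {"status": "ok"}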

This code dumps the event and context data to the CloudWatch logs; Alabaster explained that I could get quick access to them using zappa tail. From there, I can extend it to write to the naughty-nice list in DynamoDB, or to whatever other system whose activity I want to limit.

Alabaster showed me the North Pole’s working implementation, including how they had set up the throttling alarms in CloudWatch and the Lambda concurrency configuration in the Lambda console (choose a function, go to the “Concurrency” panel, click “Reserve concurrency” and set the number to 1 – then increase as needed). While a burst of a million updates was handled with some delay, there was no data loss. I could see the pride in his face.
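For reference, the same reserved concurrency can also be set from the CLI (the function name below is a placeholder):

    aws lambda put-function-concurrency \
      --function-name naughty-nice-writer-dev \
      --reserved-concurrent-executions 1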

Hoping everything goes well for him this season, and that all have a good Christmas, and a good night!

About the Author

João Miguel Neves is a Lead Developer at POP (https://www.wegotpop.com/), a company that manages people on movie productions. He also writes about Python and the cloud on his blog, https://silvaneves.org/.

About the Editors

Ed Anderson is the SRE Manager at RealSelf, organizer of ServerlessDays Seattle, and occasional public speaker. Find him on twitter at @edyesed.

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


AWS SAM – My Exciting First Open Source Experience

10. December 2018

My first acquaintance with AWS Cloud happened through a wonderful tool – SAM CLI. It was a lot of fun playing around with it – writing straightforward YAML-based resource templates, deploying them to AWS, or simply invoking Lambda functions locally. But after trying to debug my .NET Core Lambda function, I realized that this was not yet supported by the CLI. At this point, I started looking through the open issues and found this one, which described exactly what I was experiencing. It turned out to be a hot topic, so it got me interested instantly. As I reviewed the thread to get additional context about the issue, I learned that the project was looking for help.

The ability to debug .NET Lambda functions locally was critical for me to fully understand how exactly my function was going to be invoked in the cloud, and to troubleshoot future problems. I decided to answer the call for help and investigate the problem further. Feeling like Superman come to save the day, I posted a comment on the thread to say, “I’m looking into this too.” It took no more than 10 minutes to receive a welcoming reply from members of the core team. When you get a reply in a matter of minutes, you know the project is truly alive and the maintainers are open to collaboration – being hosted on GitHub does not necessarily mean that a project is intended for contribution and community support. Feeling inspired, I began my investigation.

The Problem

Unfortunately, the .NET Core host program doesn’t have any “debug-mode” switch (you can track the issue here), meaning there is no way to start a .NET program so that it immediately pauses after launching, waits for a debugger to attach, and only then proceeds (Java, using JDWP and its suspend flag, and Node.js are both capable of this), so SAM needed some way to work around this. After several days of investigation and prototyping, I came up with a working solution that used a remote debugger attached to the .NET Lambda runner inside the running Docker container (to execute an AWS Lambda function locally, SAM internally spins up a Docker container with a Lambda-like environment).

 

I joined the AWS Serverless Application Slack organization to collaborate with the maintainers and understand more about the current CLI design. Thanks to the very welcoming community on the Slack channel, I almost immediately started to feel like I was part of the SAM team! At this point, I was ready to write up a POC of .NET Core debugging with SAM. I have a README.md you can look through which explains my solution in greater detail.

 

My POC was reviewed by the maintainers of SAM CLI. Together we discussed and resolved all of the open questions, shaped up the design and agreed upon the approach. To streamline the process and break it down, one of the members of the core team suggested that I propose a solution and identify the set of tasks in a design document. I had no problem doing so, because the SAM CLI project has a well-defined and structured template for writing this kind of document. The core team reviewed and approved it. Everything was set up and ready to go, and I started implementing the feature.

Solution

Docker Lambda

My first target was the Docker Lambda repo – more specifically, the implementation of a “start in break mode” flag for the .NET Core 2.0 and 2.1 runners (implementation details here).

I must admit that this was my very first open source pull request, and it turned out to be overkill, so my apologies to the maintainer. The PR included many more changes than were required to solve the problem; the unnecessary ones were code-style refactoring and minor improvements. And don’t get me wrong – best-practice and refactoring changes are neither undesired nor unwelcome, but they should be made in a separate PR if you want to be a good citizen of the open source community. Okay, with that lesson learned, I opened a slim version of my original PR with only the required code changes, along with detailed explanations of them. SAM CLI is up next on our list.

 

 

AWS SAM CLI

Thanks to the awesome development guide in the SAM CLI repo, it was super easy to set up the environment and get going. Even though I am not a Python guy, development felt smooth and straightforward. The CLI codebase has a fine-grained modular architecture, and everything is clearly named and documented, so I had no problems finding my way around the project. The configured Flake8 and Pylint linters kept an eye on my changes, so following the project’s code-style guidelines was just a matter of fixing warnings as they appeared. And of course, the decent unit test coverage (97% at the time of writing) not only helped me rapidly understand how each component worked within the system, but also made me feel rock-solid confident as I introduced changes.

 

The core team and contributors all deserve great thumbs up 👍 for keeping the project in a such a wonderful and contribution-friendly state!

 

However, I did encounter some trouble running the unit and integration tests locally on my beloved Windows machine. I hit plenty of unexpected Python bugs, “windows-not-supported” limitations and the like along the way, and at one point I was in complete despair. But help comes if you seek it. After asking the community for guidance, we collectively came up with a Docker solution: running the tests inside an Ubuntu container. Finally, I was able to set up my local environment and run all of the required tests before submitting the pull request. Ensuring that you haven’t broken anything is critical when collaborating on any kind of project, and especially on open source!

In my initial approach to .NET Core debugging, I implemented a --container-name feature, which would have allowed users of SAM CLI to specify a name for the running Lambda container so they could identify it later and attach to it. During review, the core team found some corner cases where this approach introduced limitations and made the debugging experience inconsistent with the other runtimes, so I started to look for possible workarounds.

I looked at the problem from a different perspective and came up with a solution that enabled .NET Core debugging for AWS SAM CLI with almost no changes required 🎉 (more about this approach here and here). The team immediately loved it, because it closely aligned with the way the other runtimes currently handle debugging, providing a consistent experience for users. I’m happy that I got valuable feedback from the community, because that is exactly what got me thinking about how to improve my solution. Constructive criticism, when handled properly, makes perfection. The feature is now merged into develop and is waiting for the next release to come into play! Thanks to everyone who took part in this amazing open source journey towards better software!

Conclusion

Bottom line, I can say that I had a smooth and pleasant first open source experience. I’m happy that it was with SAM, as I had a super interesting time collaborating with other passionate SAM developers and contributors across the globe through GitHub threads, emails and Slack. I liked how rapid and responsive all of those interactions were – it was a great experience writing up design docs for various purposes and being part of a strong team collaborating to reach a common goal. It was both challenging and fun to quickly explore the SAM codebase and then follow its fully automated code-style rules. And, lastly, it was inspiring to experience how easy it is to contribute to a big and very well-known open source project like SAM.

 

Based on my experience, I came up with this battle-tested and bulletproof checklist:

  1. Before getting your hands dirty with code, take your time and engage in discussion with the core team to make sure you understand the issue and try to get as much context on it as possible. 🔍
  2. I suggest writing up a descriptive design doc, which explains the problem (or feature) and proposed solution (or implementation) in great detail, especially if it’s a common practice on the project you’re contributing to
  3. Don’t be afraid or embarrassed to ask for help from the community or the maintainers if you’re stuck. Everyone wants you to succeed 🎉
  4. Unit tests are a great source of truth in decent projects, so look through them to get a better understanding of all the moving parts. Try to cover your changes with required tests, as these tests also help future contributors (which could be you)!
  5. Always try your best to match project code-style – thanks to modern linters this one is a piece of 🍰
  6. Keep your code changes small and concise. Don’t try to pack every single code style and best practices change into a tiny PR intended to fix a minor bug
  7. The PR review process is a conversation between you and other collaborators about the changes you’ve made. Reading and reviewing every comment you receive carefully helps to build a common understanding of the problem and avoid confusion. Never take those comments personally, and try to be polite and respectful while working through them. Don’t be afraid to disagree or to have a constructive discussion with reviewers; try to come to a mutual agreement in the end and, if required, make the agreed changes
  8. Try not to force your pull request as the team may have some important tasks at hand or approaching release deadlines. Remember that good pull requests get pulled and not pushed.

 

It’s good to know that anyone from the community can have an impact just by leaving a comment or fixing a small bug. It’s cool that new features can be brought in by the community with the help of the core team. Together we can build better open source software!

 

Based on my experience, SAM is a good and welcoming project to contribute to thanks to the well-done design and awesome community. If you have any doubt, just give it a try. SAM could be a perfect start for your open source career 😉

 

When you start it is hard to stop: #795, #828, #829, #843

Contribute with caution

Thank You!

I want to give special thanks to @sanathkr, @jfuss, @TheSriram, @mikemorain and @mhart for supporting me continuously along this path.

 

About the Author

My name is Nikita Dobriansky, from sunny Odessa, Ukraine. Throughout my career in software development, I’ve been working on C# desktop applications. Currently, I’m a C# engineer at Lohika, building bleeding-edge UWP apps. I am super excited about .NET Core and container-based applications. For me, AWS is the key to a better serverless ☁️ future.

In my free time, I enjoy listening to great music and making some myself 🎸

@ndobryanskyy

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Using Infrastructure as Code as a Poor Man’s DR

09. December 2018

What is DR?

Let’s start by setting the context for what I mean by Disaster Recovery (DR). There are different interpretations of DR and High Availability, with a very thin and moving line between the two. Here I am specifically referring to the ability to recover your infrastructure from a disaster such as an AWS region becoming unavailable. I am not talking about situations where immediate failover is needed.

What is Infrastructure as Code

Infrastructure as Code (IaC) is something that has been around for a while now, but many people are just starting to fully embrace it and see its benefits. Like DR, it is a term that people have taken to mean many different things over the years. Some refer to Bash scripts that generate KVM VMs as IaC. While that technically fits the name, it is not what I mean here. I am specifically talking about tools such as Terraform that are designed to generate infrastructure from a configuration file.

Why IaC for DR?

Go to any company or organization that does not have a viable DR strategy and ask them why that is the case. Nine times out of ten, the answer you get will relate to cost. That makes sense: having a true DR environment can be very expensive, and for people who are not technical and have never experienced a true IT disaster, it can be tough to comprehend why it is all needed. These factors make it very difficult for IT to get approval to put a secondary environment in place. This is where IaC comes in.

If your IaC is properly set up, you can essentially get DR for free. How? If a disaster takes out your infrastructure, you just re-run your deployment against another region, and you have your infrastructure back.

But wait, my IaC tool deploys to the AWS region that is down

If your tool is improperly configured, you may not benefit from the DR capabilities of IaC. You need to make sure that you abstract the provider and region out of the actual infrastructure configuration. This allows you to quickly change the region you are pointing at and re-deploy. For example, in Terraform, you would want a separate provider.tf file that has a provider section with the region specified, like this: provider "aws" { region = "eu-west-1" }. This allows you to change a single line and re-deploy your exact infrastructure to another region, as opposed to having the region information embedded in individual .tf files – which, unfortunately, I see floating around pretty often.
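Going one step further, you can pull the region out into a variable so that a DR deployment is just a different variable value. A minimal sketch, assuming Terraform 0.12+ syntax and a hypothetical aws_region variable:

    # provider.tf
    variable "aws_region" {
      description = "Region to deploy into; override this during a DR event"
      default     = "eu-west-1"
    }

    provider "aws" {
      region = var.aws_region
    }

Running terraform apply -var aws_region=eu-west-2 then rebuilds the same infrastructure in a different region.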

What if all of AWS (or GCP, or Azure) completely goes down and not just a region?

A complete outage of a service provider is another concern that I hear from time to time. I have a couple of different thoughts about this scenario.

My first thought is that the chances of that happening are so vanishingly small that it is hardly worth thinking about. If all of AWS is down, you likely have something more serious to worry about, like worldwide thermonuclear war. However, as engineers, sysadmins, and other assorted IT professionals, we have a habit of being unable to stop thinking about these extreme cases.

For those situations, you can have standby code. What do I mean by this? I mean that you can develop code that deploys the equivalent of your current infrastructure in another environment. This is obviously time-consuming, and since none of us have a ton of spare time, that is a cost – one that I personally don’t think is worth it. But it is possible, and it is up to each reader to decide whether it is needed for their environment.

Ok, I have my infrastructure back, what about my data?

Well, you are still doing backups, right? I am making a case for replacing a dedicated DR environment with code; I am not making a case for throwing basic common sense out the window.

That being said, there are times where it would take an impractical amount of time to restore data from backups just because there was a 1-2 hour outage. Especially when you can re-deploy your infrastructure from code to another region in minutes.

This is where I advocate a hybrid approach between a complete IaC DR plan and a traditional DR setup. In this type of solution, you would have a replicated database (or other data source) running at all times in the region that you plan to use for DR purposes. Then, if disaster strikes, your data is sitting there just waiting for you to deploy the networking and compute resources to access it.

Since this does require keeping some infrastructure running at all times, it does cost some money. However, it will cost far less than having a whole second DR site sitting around waiting and may be an easier pill for the people that have to spend the money to swallow.

Conclusion

I hope that after reading this article you have an understanding of the feasibility of using IaC as a means of having a DR environment in places where a dedicated one would otherwise not be feasible. I further hope that you see the benefits of this solution in situations where a full disaster recovery environment is possible, but perhaps not needed – that money could be better spent elsewhere if you already have IaC in place to cover the worst-case scenario.

What’s next

This article introduces the topic to get you thinking about using Infrastructure as Code as a possible disaster recovery plan and solution. It only begins to scratch the surface of what is possible and the different considerations that need to be made.

Please visit my website at https://www.adair.tech over the next several weeks as I will publish follow up articles there that will delve further into details, tools, and specific plans for accomplishing this. You can also contact me via my website or Twitter to have a 1-1 conversation on the topic and explore your particular use case in more depth.

About the Author

Brad Adair is an experienced IT professional with over a decade of experience in systems engineering and administration, cloud engineering and architecture, and IT management. He is the President of Adair Technology, LLC., which is a Columbus based IT consulting firm specializing in AWS architecture and other IT infrastructure consulting. He is also an AWS Certified Solutions Architect. Outside of the office he enjoys sports, politics, Disney World, and spending time with his wife and kids.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Multi-region Serverless APIs: I’ve got a fever and the only cure is fewer servers

08. December 2018

Meet SAM

Let’s talk about the hottest thing in computers for the past few years. No, not Machine Learning. No, not Kubernetes. No, not big data. Fine, one of the hottest things in computers. Right, serverless!

It’s still an emerging and quickly changing field, but I’d like to take some time to demonstrate how easy it is to make scalable and reliable multi-region APIs using just a few serverless tools and services.

It’s actually deceptively simple. Well, for a “blog post”-level application, anyway.

We’re going to be managing this application using the wonderful AWS Serverless Application Model (SAM) and SAM CLI. Far and away the easiest way I have ever used for creating and deploying serverless applications. And, in keeping with contemporary practices, it even has a cute little animal mascot.

SAM is a feature of CloudFormation that provides a handful of short-hand resources that get expanded out to their equivalent long-hand CloudFormation resources upon ChangeSet calculation. You can also drop down into regular CloudFormation whenever you need to manage resources and configurations not covered by SAM.

The SAM CLI is a local CLI application for developing, testing, and deploying your SAM applications. It uses Docker under the hood to provide as close to a Lambda execution environment as possible and even allows you to run your APIs locally in an APIGateway-like environment. It’s pretty great, IMO.

So if you’re following along, go ahead and install Docker and the SAM CLI and we can get started.

The SAMple App

Once that’s installed, let’s generate a sample application so we can see what it’s all about. If you’re following along on the terminal, you can run sam init -n hello-sam -r nodejs8.10 to generate a sample node app called hello-sam. You can also see the output in the hello-sam-1 folder in the linked repo if you aren’t at a terminal and just want to read along.

The first thing to notice is the README.md that is full of a huge amount of information about the repo. For the sake of brevity, I’m going to leave learning the basics of SAM and the repo structure you’re looking at as a bit of an exercise for the reader. The README and linked documentation can tell you anything you need to know.

The important thing to know is that hello_world/ contains the code and template.yaml contains a special SAM-flavored CloudFormation template that controls the application. Take some time to familiarize yourself with it if you want to.

SAM Local

So what can SAM do, other than give us very short CFN templates? Well the SAM CLI can do a lot to help you in your local development process. Let’s try it out.

Step 0 is to install your npm dependencies so your function can execute:
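Assuming the default layout that sam init generates, that’s just:

    cd hello-sam/hello_world
    npm install
    cd ../..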

Alright, now let’s have some fun.

There are a few local invocation commands that I won’t cover here because we’re making an API. The real magic with the CLI is that you can run your API locally with sam local start-api. This will inspect your template, identify your API schema, start a local API Gateway, and mount your functions at the correct paths. It’s by no means a perfect replica of running in production, but it actually does a surprisingly great job.

When we start the API, it will mount our function at /hello, following the path specified in the Events attribute of the resource.
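In the generated template.yaml, that section looks something like this (trimmed to the relevant part):

    Events:
      HelloWorld:
        Type: Api
        Properties:
          Path: /hello
          Method: get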

Now you can go ahead and curl against the advertised port and path to execute your function.

Then on the backend, you’ll see the SAM CLI pull the Lambda Docker image and execute your function inside it, with the familiar START/END/REPORT lines in the log output.

You can change your code at-will and the next invocation will pick it up. You can also attach a debugger to the process if you aren’t a “debug statement developer.”

Want to try to deploy it? I guess we might as well – it’s easy enough. The only pre-requisite is that we need an S3 bucket for our uploaded code artifact. So go ahead and make that – call it whatever you like.

Now, we’ll run a sam package. This will bundle up the code for all of your functions and upload it to S3. It’ll spit out a rendered “deployment template” that has the local CodeUris swapped out for S3 URLs.
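With a bucket name of your choosing, the command looks like this:

    sam package \
      --template-file template.yaml \
      --s3-bucket your-sam-artifacts-bucket \
      --output-template-file deploy-template.yaml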

If you check out deploy-template.yaml, you should see…very few remarkable differences. Maybe some of the properties have been re-ordered or blank lines removed. But the only real difference you should see is that the relative CodeUrl of CodeUri: hello_world/ for your function has been resolved to an S3 URL for deployment.

Now let’s go ahead and deploy it!
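Something along these lines (stack name and region are up to you):

    sam deploy \
      --template-file deploy-template.yaml \
      --stack-name hello-sam \
      --capabilities CAPABILITY_IAM \
      --region eu-west-1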

Lastly, let’s find the URL for our API so we can try it out:
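The sample template exposes the endpoint as a stack output, so one way to find it is:

    aws cloudformation describe-stacks \
      --stack-name hello-sam \
      --query 'Stacks[0].Outputs' \
      --output table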

Cool, let’s try it:
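With a made-up API ID standing in for yours:

    $ curl https://abc123def4.execute-api.eu-west-1.amazonaws.com/Prod/hello/
    {"message":"hello world"}   # exact body depends on the template version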

Nice work! Now that you know how SAM works, let’s make it do some real work for us.

State and Data

We’re planning on taking this multi-region by the end of this post. Deploying a multi-region application with no state or data is both easy and boring. Let’s do something interesting and add some data to our application. For the purposes of this post, let’s do something simple like storing a per-IP hit counter in DynamoDB.

We’ll go through the steps below, but if you want to jump right to done, check out the hello-sam-2 folder in this repository.

SAM offers a SimpleTable resource that creates a very simple DynamoDB table. This is technically fine for our use-case now, but we’ll need to be able to enable Table Streams in the future to go multi-region. So we’ll need to use the regular DynamoDB::Table resource:
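A sketch of what that resource could look like – the table and attribute names (HitsTable, ip, hits) are assumptions of mine; note the fixed TableName and the stream, both of which Global Tables will need later:

    HitsTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: HitsTable
        AttributeDefinitions:
          - AttributeName: ip
            AttributeType: S
        KeySchema:
          - AttributeName: ip
            KeyType: HASH
        ProvisionedThroughput:
          ReadCapacityUnits: 1
          WriteCapacityUnits: 1
        StreamSpecification:
          StreamViewType: NEW_AND_OLD_IMAGES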

We can use environment variables to let our functions know what our table is named instead of hard-coding it in code. Let’s add an environment variable up in the Globals section. This ensures that any functions we may add in the future automatically have access to this as well.

Change your Globals section to look like:
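Something like this, assuming we call the variable TABLE_NAME (keep any globals you already have, such as Timeout):

    Globals:
      Function:
        Timeout: 3
        Environment:
          Variables:
            TABLE_NAME: !Ref HitsTable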

Lastly, we’ll need to give our existing function access to update and read items from the table. We’ll do that by setting the Policies attribute of the resource, which turns into the execution role. We’ll give the function UpdateItem and GetItem. When you’re done, the resource should look like:
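Roughly like this – a sketch rather than the exact template, using the same assumed names as above:

    HelloWorldFunction:
      Type: AWS::Serverless::Function
      Properties:
        CodeUri: hello_world/
        Handler: app.lambdaHandler
        Runtime: nodejs8.10
        Policies:
          - Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - dynamodb:UpdateItem
                  - dynamodb:GetItem
                Resource: !GetAtt HitsTable.Arn
        Events:
          HelloWorld:
            Type: Api
            Properties:
              Path: /hello
              Method: get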

Now let’s have our function start using the table. Crack open hello_world/app.js and replace the content with:
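A minimal sketch that does what the next paragraph describes (error handling omitted; the hits attribute name is an assumption):

    // app.js -- count hits per source IP in DynamoDB
    const AWS = require('aws-sdk');

    const dynamo = new AWS.DynamoDB.DocumentClient();
    const tableName = process.env.TABLE_NAME;

    exports.lambdaHandler = async (event) => {
        // API Gateway proxy events carry the caller's IP in the request context
        const ip = event.requestContext.identity.sourceIp;

        // Atomically increment this IP's counter and read back the new value
        const result = await dynamo.update({
            TableName: tableName,
            Key: { ip: ip },
            UpdateExpression: 'ADD hits :one',
            ExpressionAttributeValues: { ':one': 1 },
            ReturnValues: 'UPDATED_NEW'
        }).promise();

        return {
            statusCode: 200,
            body: JSON.stringify({ ip: ip, hits: result.Attributes.hits })
        };
    };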

On each request, our function will read the requester’s IP from the event, increment a counter for that IP in the DynamoDB table, and return the total number of hits for that IP to the user.

This’ll probably work in production, but we want to be diligent and test it because we’re responsible, right? Normally, I’d recommend you spin up a DynamoDB-local Docker container, but to keep things simple for the purposes of this post, let’s create a “local dev” table in our AWS account called HitsTableLocal.
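Creating it from the CLI, assuming the same ip hash key as the sketch above:

    aws dynamodb create-table \
      --table-name HitsTableLocal \
      --attribute-definitions AttributeName=ip,AttributeType=S \
      --key-schema AttributeName=ip,KeyType=HASH \
      --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1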

And now let’s update our function to use that table when we’re executing locally. We can use the AWS_SAM_LOCAL environment variable to determine if we’re running locally or not. Toss this at the top of your app.js to select that table when running locally:
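SAM CLI sets AWS_SAM_LOCAL to "true" for local invocations, so a check like this works (again a sketch, reusing the names from above):

    // Use the local dev table under `sam local`, the real one otherwise
    const tableName = process.env.AWS_SAM_LOCAL === 'true'
        ? 'HitsTableLocal'
        : process.env.TABLE_NAME;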

Now let’s give it a shot! Fire up the app with sam local start-api and let’s do some curls.

Nice! Now let’s deploy it and try it out for real.

Not bad. Not bad at all. Now let’s take this show on the road!

Going Global

Now we’ve got an application, and it even has data that our users expect to be present. Now let’s go multi-region! There are a couple of different features that will underpin our ability to do this.

First is the API Gateway Regional Custom Domain. We need to use the same custom domain name in multiple regions, so the edge-optimized custom domain won’t cut it for us, since it uses CloudFront. The regional endpoint will work for us, though.

Next, we’ll hook those regional endpoints up to Route53 Latency Records in order to do closest-region routing and automatic failover.

Lastly, we need a way to synchronize our DynamoDB tables between our regions so we can keep those counters up-to-date. That’s where DynamoDB Global Tables come in to do their magic. This will keep identically-named tables in multiple regions in sync with low latency and high accuracy. It uses DynamoDB Streams under the hood, with ‘last writer wins’ conflict resolution. Which probably isn’t perfect, but is good enough for most uses.

We’ve got a lot to get through here. I’m going to try to keep this as short and as clear as possible. If you want to jump right to the code, you can find it in the hello-sam-3 directory of the repo.

First things first, let’s add in our regional custom domain and map it to our API. Since we’re going to be using a custom domain name, we’ll need a Route53 Hosted Zone for a domain we control. I’m going to pass through the domain name and Hosted Zone Id via a CloudFormation parameter and use it below. When you deploy, you’ll need to supply your own values for these parameters.

Toss this at the top of template.yaml to define the parameters:
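For example (the parameter names here are my own):

    Parameters:
      DomainName:
        Type: String
        Description: Custom domain name for the API, e.g. api.example.com
      HostedZoneId:
        Type: AWS::Route53::HostedZone::Id
        Description: Route53 hosted zone that the domain lives in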

Now we can create our custom domain, provision a TLS certificate for it, and configure the base path mapping to add our API to the custom domain – put this in the Resources section:
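A sketch of those three resources (the logical names are assumptions):

    ApiCertificate:
      Type: AWS::CertificateManager::Certificate
      Properties:
        DomainName: !Ref DomainName
        ValidationMethod: DNS

    ApiCustomDomain:
      Type: AWS::ApiGateway::DomainName
      Properties:
        DomainName: !Ref DomainName
        RegionalCertificateArn: !Ref ApiCertificate
        EndpointConfiguration:
          Types:
            - REGIONAL

    ApiBasePathMapping:
      Type: AWS::ApiGateway::BasePathMapping
      Properties:
        DomainName: !Ref ApiCustomDomain
        RestApiId: !Ref ServerlessRestApi
        Stage: Prod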

That !Ref ServerlessRestApi references the implicit API Gateway that is created as part of the AWS::Serverless::Function Event object.

Next, we want to assign each regional custom domain to a specific Route53 record. This will allow us to perform latency-based routing and regional failover through the use of custom healthchecks. Let’s put in a few more resources:
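Again a sketch with assumed logical names – a latency-based alias record plus the health check it depends on:

    RegionalHealthCheck:
      Type: AWS::Route53::HealthCheck
      Properties:
        HealthCheckConfig:
          Type: HTTPS
          FullyQualifiedDomainName: !Sub "${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com"
          ResourcePath: /Prod/health
          RequestInterval: 30
          FailureThreshold: 3

    RegionalRecord:
      Type: AWS::Route53::RecordSet
      Properties:
        HostedZoneId: !Ref HostedZoneId
        Name: !Ref DomainName
        Type: A
        Region: !Ref AWS::Region
        SetIdentifier: !Sub "api-${AWS::Region}"
        HealthCheckId: !Ref RegionalHealthCheck
        AliasTarget:
          DNSName: !GetAtt ApiCustomDomain.RegionalDomainName
          HostedZoneId: !GetAtt ApiCustomDomain.RegionalHostedZoneId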

The AWS::Route53::RecordSet resource creates a DNS record and assigns it to a specific AWS region. When your users query for your record, they will get the value for the region closest to them. This record also has an AWS::Route53::HealthCheck attached to it. The healthcheck will check your regional endpoint every 30 seconds; if your regional endpoint has gone down, Route53 will stop considering that record when a user queries for your domain name.

Our Route53 Healthcheck is looking at /health on our API, so we’d better implement that if we want our service to stay up. Let’s just drop a stub healthcheck into app.js. For a real application you could perform dependency checks and stuff, but for this we’ll just return a 200:
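A stub like this is enough (the handler name is an assumption – remember to wire it up to GET /health with another Api event in template.yaml):

    // Also in app.js: a trivial healthcheck for Route53 to poll
    exports.healthHandler = async () => {
        return {
            statusCode: 200,
            body: JSON.stringify({ status: 'ok' })
        };
    };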

The last piece we unfortunately can’t control directly with CloudFormation; we’ll need to use regular AWS CLI commands. Since Global Tables span regions, that kind of makes sense. But before we can hook up the Global Table, each regional table needs to exist already.

Through the magic of Bash scripts, we can deploy to all of our regions and create the Global Table all in one go!
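A rough, non-idempotent sketch of what that script does – the regions, bucket names, stack name and parameter values are all placeholders:

    #!/usr/bin/env bash
    set -euo pipefail

    REGIONS=(eu-west-1 eu-west-2)
    STACK_NAME=hello-sam

    for region in "${REGIONS[@]}"; do
      sam package \
        --template-file template.yaml \
        --s3-bucket "your-sam-artifacts-${region}" \
        --output-template-file "deploy-${region}.yaml" \
        --region "${region}"

      sam deploy \
        --template-file "deploy-${region}.yaml" \
        --stack-name "${STACK_NAME}" \
        --capabilities CAPABILITY_IAM \
        --region "${region}" \
        --parameter-overrides DomainName=api.example.com HostedZoneId=Z1234567890ABC
    done

    # With a table now present in every region, join them into a single Global Table
    aws dynamodb create-global-table \
      --global-table-name HitsTable \
      --replication-group RegionName=eu-west-1 RegionName=eu-west-2 \
      --region eu-west-1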

For a more idempotent (but more verbose) version of this script, check out hello-sam-3/deploy.sh.

Note: if you’ve never provisioned an ACM Certificate for your domain before, you may need to check your CloudFormation output for the validation CNAMEs.

And…that’s all there is to it. You have your multi-region app!

Let’s try it out

So let’s test it! How about we do this:

  1. Get a few hits in on our home region
  2. Fail the healthcheck in our home region
  3. Send a few hits to the next region Route53 chooses for us
  4. Fail back to our home region
  5. Make sure the counter continues at the number we expect

Cool, we’ve got some data. Let’s failover! The easiest way to do this is just to tell Route53 that up actually means down. Find the healthcheck id for your region using aws route53 list-health-checks and run:
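The trick is the health check’s Inverted flag (the health check ID below is a placeholder):

    aws route53 update-health-check \
      --health-check-id abcdef01-2345-6789-abcd-ef0123456789 \
      --inverted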

Now let’s wait a minute for it to fail over and give it another shot.

Look at that, another region! And it started counting at 3. That’s awesome, our data was replicated. Okay, let’s fail back, you know the drill:
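Just flip the flag back:

    aws route53 update-health-check \
      --health-check-id abcdef01-2345-6789-abcd-ef0123456789 \
      --no-inverted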

Give it a minute for the healthcheck to become healthy again and fail back. And now let’s hit the service a few more times:

Amazing. Your users will now automatically get routed to not just the nearest region, but the nearest healthy region. And all of the data is automatically replicated between all active regions with very low latency. This grants you a huge amount of redundancy, availability, and resilience to service, network, regional, or application failures.

Now not only can your app scale effortlessly through the use of serverless technologies, it can also fail over automatically, so you don’t have to wake up in the middle of the night only to find there’s nothing you can do because a network issue is out of your control – traffic simply changes region and routes around it.

Further Reading

I don’t want to take up too much more of your time, but here’s some further reading if you wish to dive deeper into serverless:

  • A great Medium post by Paul Johnston on Serverless Best Practices
  • SAM has configurations for safe and reliable deployment and rollback using CodeDeploy!
  • AWS built-in tools for serverless monitoring are lackluster at best; you may wish to look into external services like Dashbird or Thundra once you hit production.
  • ServerlessByDesign is a really great web app that allows you to drag, drop, and connect various serverless components to visually design and architect your application. When you’re done, you can export it to a working SAM or Serverless repository!

About the Author

Norm recently joined Chewy.com as a Cloud Engineer to help them start on their Cloud transformation. Previously, he ran the Cloud Engineering team at Cimpress. Find him on twitter @nromdotcom.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Working with AWS Limits

07. December 2018

How rolling out EC2 Nitro system based instance types surfaced DNS query rate limit

Introduction

Amazon Web Services (AWS) is a great cloud platform enabling all kinds of businesses and organizations to innovate and build on a global scale and with great velocity. However, while it is a highly scalable infrastructure, AWS does not have infinite capacity. Each AWS service has a set of service limits to ensure a quality experience for all customers of AWS.

There are also some limits that customers might accidentally discover while deploying new applications or services, or while trying to scale up existing infrastructure.

Keeping Pace with AWS Offerings

We want to share our discovery and mitigation of one such limit. As prudent customers following the Well-Architected Framework practices, we track new AWS services and updates to existing ones to take advantage of new capabilities and potential cost savings. At re:Invent 2017, the Nitro system architecture was introduced. Consequently, as soon as the relevant EC2 instance types became generally available, we started updating our infrastructure to utilize the new M5 and C5 instance types. We updated the relevant CloudFormation templates and Launch Configurations, and built new AMIs with the new and improved Elastic Network Adapter (ENA) enabled, which these instance types require. We were now ready to start the upgrade process.

Preparing for Success in Production by Testing Infrastructure

We were eager to try out new instance types, so we launched a couple of test instances using our common configuration to start our testing. After some preliminary testing (mostly kicking proverbial tires) we started the update of our test environment.

Our test environment is very similar to the production environment; we use the same configuration with modified parameters to account for the lighter load on our test instances (e.g., smaller instances and Auto Scaling groups). We updated our stacks with the revised CloudFormation templates and rebuilt the Auto Scaling groups using the new Launch Configurations, with success. We did not observe any adverse effects on our infrastructure while running through some tests. The environment worked as expected, and developers continued to deploy and test their changes.

Deploying to Production

After testing in our test environment and letting the changes bake for a couple of weeks, we felt confident that we were ready to deploy the new instance types into production. We purchased reserved M5 and C5 instances, started saving money by utilizing them, and observed performance improvements as well. We began with some second-tier applications and services. These upgraded smoothly, which added to our confidence in the changes we were making to the environment. It was exciting to see, and we could not wait to tackle our core applications and services, including the infrastructure running our main site.

Everything was in place: we had new instances running in the test environment, and partially in the production environment; we notified our engineering team about the upgrade.

In our environment, we share on-call responsibilities with development. Every engineer is engaged in the success of the company through shared responsibility for the health of the site. We followed our process of notifying the on-call engineers about the changes to the environment. We pulled up our monitoring and “Golden Signal” dashboards to watch for anomalies. We were ready!

The update process went relatively smoothly, and we replaced the older M4 and C4 instances with new and shiny M5 and C5 ones. We saw some performance gains, e.g., somewhat faster page loads. Dashboards didn’t show any issues or anomalies. We started to check it off our to-do list so we could move on to the next project in our backlog.

It’s the Network…

We were paged. Some of the instances in an availability zone (AZ) were throwing errors that we initially attributed to network connectivity issues. We verified that presumed “network failures” were limited only to a single AZ, so we decided to divert traffic away from that AZ and wait for the network to stabilize. After all, this is why we run multi-zone deployment.

We wanted to do due-diligence to make sure we understood the underlying problem. We started digging into the logs and did not see anything abnormal. We chalked it off to transient network issues and continued to monitor our infrastructure.

Some time passed without additional alerts, and we decided to bring the AZ back into production. No issues were observed, so our initial assessment appeared to be correct.

Then, we got another alert. This time a second AZ was having issues. We thought that it must be a bad day for AWS networking. We knew how to mitigate it: take that AZ out of production and wait it out. While we were doing that, we got hit with another alert; it looked like yet another AZ was having issues. At this point, we were concerned that something wasn’t working as we expected and that maybe we needed to modify the behavior of our application. We dove deeper into our logs.

Except when it’s DNS…

This event was a great example of our SREs and developers coming together and working on the problem as one team. One engineer jumped on a support call with AWS, while our more experienced engineers started a close examination of logged messages and events. Right then we noticed that our application logs contained messages that we hadn’t focused on before: failures to perform DNS name resolution.

We use Route 53 for DNS and had never experienced these kinds of errors on the M4 or C4 instances. We jumped on the EC2 instances and confirmed that name resolution worked as we expected. We were really puzzled about the source of these errors. We checked to see if any production code deploys might have introduced them, and we did not find anything suspicious or relevant.


In the meantime, our customers were experiencing intermittent errors while accessing our site and that fact did not sit well with us.

Luckily, the AWS support team was working through the trouble with us. They checked and confirmed that they did not see any networking outages being reported by their internal tools; our original assumption was incorrect. AWS support suggested that we run packet capturing focused on DNS traffic between our hosts to get additional data. Coincidentally, one of our SREs was doing exactly that and analyzing the captured data. The analysis revealed a very strange pattern: while many of the name resolution queries were successful, some were failing, and there was no pattern as to which names would fail to resolve. We also observed that our instances were generating about 400 DNS queries per second.
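For reference, a capture along these lines (the interface name and output file are illustrative) is enough to see both the queries and which ones go unanswered:

```bash
# Capture only DNS traffic, without resolving names in the capture itself,
# and write it to a file for offline analysis (e.g., in Wireshark).
sudo tcpdump -i eth0 -nn -s 0 port 53 -w dns-capture.pcap
```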

We shared our findings with AWS support. They took our data and contacted us with a single question. “Have you recently upgraded your instances?”

“Oh yes, we upgraded earlier that day,” we responded.

AWS support then reminded us that each Amazon EC2 instance limits the number of packets sent to the Amazon-provided DNS server to a maximum of 1024 packets per second per network interface (https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html?shortFooter=true#vpc-dns-limits). The limit itself was not new; however, on the newer instance types AWS had eliminated internal retries, which made DNS resolution errors visible to those instances. To mitigate the impact, the service team recommended implementing DNS caching on the instances.

At first, we were skeptical about their suggestion. After all, we did not seem to breach the limit with our 400 or so requests per second. However, we did not have any better ideas, so we decided to pursue two courses of action. First, and most importantly, we needed to improve the experience of our customers, so we rolled back our changes and immediately stopped seeing DNS errors. Second, we started working on implementing local DNS caching on the affected EC2 instances.

AWS support recommended using nscd (https://linux.die.net/man/8/nscd). Based on our experiences with various DNS tools and implementations, we decided instead to use bind (https://www.isc.org/downloads/bind/) configured to act only as a caching server, with XML statistics-channels enabled. That last requirement came from our desire to understand the nature of the DNS queries performed by our services and applications, with the aim of improving how they interact with the DNS infrastructure.

Infrastructure as Code and the Magic DNS Address

A fair number of our EC2 instances run Ubuntu. We were hoping to use the main package repositories to install bind and apply our custom configuration to achieve our goal of reducing the number of DNS queries per host.

On the Ubuntu hosts, we used our configuration management system (Chef) to install the bind9 package and configure it so it would listen only on the localhost interface (127.0.0.1, ::1), forward queries to the AWS VPC DNS server, cache DNS query results, log statistics, and expose them via port 954. The VPC DNS server normally runs on a reserved IP address at the base of the VPC IPv4 network range, plus two. We started coding a solution to calculate that IP address from the network parameters of our instances when we noticed that there is also an often-overlooked “magic” DNS address available: 169.254.169.253. That made our effort easier, since we could hard-code that address in our configuration template file. We also needed to preserve the content of the original resolver configuration file and prepend the loopback address (127.0.0.1) to it. That way, our local caching server would be queried first, but if it was not running for some reason, clients would still have a fallback address to query.
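A minimal sketch of what such a bind9 configuration might look like (the values are illustrative, not our exact production config):

```
// /etc/bind/named.conf.options
options {
    listen-on port 53 { 127.0.0.1; };
    listen-on-v6 { ::1; };
    // Forward everything to the VPC resolver via the "magic" address
    forwarders { 169.254.169.253; };
    forward only;
    allow-query { localhost; };
};

// Expose runtime statistics as XML over HTTP on port 954
statistics-channels {
    inet 127.0.0.1 port 954 allow { localhost; };
};
```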

Prepending the loopback address was achieved by adding the following option to the dhclient configuration file (/etc/dhcp/dhclient.conf):
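In its simplest form, that is dhclient’s standard prepend statement:

```
# /etc/dhcp/dhclient.conf
prepend domain-name-servers 127.0.0.1;
```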

The original content of /etc/resolv.conf was preserved by creating /etc/default/resolvconf with the following statement:
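The setting in question is most likely resolvconf’s truncation option, which otherwise drops the remaining name servers as soon as a loopback address is listed first; as a sketch:

```
# /etc/default/resolvconf
# Keep the VPC resolver in /etc/resolv.conf as a fallback after 127.0.0.1
TRUNCATE_NAMESERVER_LIST_AFTER_LOOPBACK_ADDRESS=no
```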

To apply these changes, we needed to restart networking on our instances; unfortunately, the only way we found to get it done was not what we were hoping for (service networking restart).

We tested our changes using Chef’s excellent Test Kitchen and InSpec tools (TDD!) and were ready to roll out changes once more. This time we were extra cautious and performed a canary deploy with subsequent long-term validation before updating the rest of our EC2 Ubuntu fleet. The results were as expected; that is to say, we did not see any negative impact on our site. We observed better response times and saved money by utilizing our Reserved Instances.

Conclusion

We learned our lesson through this experience: respect the service limits imposed by AWS. If something isn’t working right, don’t just assume it’s the network (or DNS!); check the documentation.

We benefited from our culture of “one team,” with site reliability and software development engineers coming together and working towards the common goal of delighting our customers. Having everyone be part of the on-call rotation ensured that all engineers were aware of changes to the environment and their impact. Mutual respect and open communication allowed for quick debugging and resolution when problems arose, with everyone participating in the process to restore the customer experience.

By treating our infrastructure as code, we were able to target our fixes with high precision and roll back and forward efficiently. Because everything was automated, we could focus on the parts that needed to change (like the local DNS cache) and quickly test on both the old and new instances.

We became even better at practicing test-driven development. Good tests make our infrastructure more resilient and our on-call quieter, which improves our overall quality of life.

Our monitoring tools and dashboards are great; however, it is important to be ready to look beyond what is presently measured and viewable and to dive deep into the application and its behaviors. It’s also important to take time after an event to iterate on the tools and dashboards so they are more relevant next time.

We are also happy to know that AWS works hard on eliminating or increasing limits as services improve so that we will not need to go through these exercises too often.

We hope that sharing this story might be helpful to our fellow SREs out there!

About the Author

Bakha Nurzhanov is a Senior Site Reliability Engineer at RealSelf. Prior to joining RealSelf, he had been solving interesting data and infrastructure problems in healthcare IT, and worked on payments infrastructure engineering team at Amazon. His Twitter handle is @bnurzhanov.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Athena Savior of Adhoc Analytics

06. December 2018 2018 0

Introduction

Companies strive to attract customers by creating excellent products with many features. Going from product idea to reality used to take months or years; nowadays it can take a matter of weeks. Companies can fail fast, learn, and move ahead to make the product better. Data analytics often takes a back seat and becomes a bottleneck.

Some of the problems that cause bottlenecks are:

  • schema differences,
  • missing data,
  • security restrictions,
  • encryption

AWS Athena, an ad-hoc query tool, can alleviate these problems. Its most compelling characteristics include:

  • Serverless
  • Query Ease
  • Cost ($5 per TB of data scanned)
  • Availability
  • Durability
  • Performance
  • Security

Behind the scenes, Athena uses Hive and Presto to run analytical queries of any size against data stored in S3. It processes structured, semi-structured, and unstructured data sets, including CSV, JSON, ORC, Avro, and Parquet. Athena drivers are available for multiple languages, including Java and Python.

Let’s examine a few different use cases with Athena.

Use cases

Case 1: Storage Analysis

Let us say you have a service where you store user data such as documents, contacts, videos, and images. You have an accounting system in a relational database, while user resources live in S3, orchestrated through metadata housed in DynamoDB. How do we get ad-hoc storage statistics for individual users, as well as across the entire customer base, for various parameters and events?

Steps:

  • Create an AWS Data Pipeline to export the relational database data to S3
    • Data is persisted in S3 as CSV
  • Create an AWS Data Pipeline to export the DynamoDB data to S3
    • Data is persisted in S3 as JSON strings
  • Create a database in Athena
  • Create tables for the data sources
  • Run queries
  • Clean up the resources

Figure 1: Data Ingestion

Figure 2: Schema and Queries
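As a rough sketch of the kind of schema and query in Figure 2, using PyAthena (the bucket, database, table, and column names are made up for illustration):

```python
from pyathena import connect

# Connect to Athena; query results land in the staging bucket you choose.
cursor = connect(
    s3_staging_dir="s3://my-athena-results/",
    region_name="us-west-2",
).cursor()

cursor.execute("CREATE DATABASE IF NOT EXISTS storage")

# External table over the CSV export from the relational database.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS storage.accounts (
        account_id string,
        plan string,
        quota_gb int
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-export-bucket/accounts/'
""")

# Ad-hoc storage statistics across the whole customer base.
cursor.execute("""
    SELECT plan, count(*) AS accounts, sum(quota_gb) AS total_quota_gb
    FROM storage.accounts
    GROUP BY plan
""")
print(cursor.fetchall())
```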

Case 2: Bucket Inventory

Why is S3 usage growing out of sync with user base changes? Do you know how your S3 bucket is being used? How many objects does it store? How many files are duplicates? How many have been deleted?

S3 Inventory helps you manage your storage and provides audit reports on the replication and encryption status of the objects in the bucket. Let us create a bucket, enable Inventory, and perform the following steps.

Steps:

  • Go to the S3 console
  • Create the buckets vijay-yelanji-insights for objects and vijay-yelanji-inventory for inventory.
  • Enable inventory
    • AWS generates a report into the inventory bucket at regular intervals according to the configured schedule.
  • Upload files
  • Delete files
  • Upload the same files again to check for duplicates
  • Create an Athena table pointing to vijay-yelanji-inventory
  • Run queries as shown in Figure 5 to understand S3 usage and take action to reduce cost.

Figure 3: S3 Inventory

Figure 4: Bucket Insights


Figure 5: Bucket Insight Queries
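As a hedged sketch of the kind of query behind Figure 5 (the table name follows the inventory bucket above, and the column names depend on which inventory fields you enabled):

```python
from pyathena import connect

cursor = connect(
    s3_staging_dir="s3://my-athena-results/",
    region_name="us-west-2",
).cursor()

# Objects whose content (ETag) appears more than once, i.e. likely duplicates.
cursor.execute("""
    SELECT e_tag, count(*) AS copies, sum(size) AS duplicated_bytes
    FROM inventory.vijay_yelanji_inventory
    GROUP BY e_tag
    HAVING count(*) > 1
    ORDER BY duplicated_bytes DESC
""")
for row in cursor.fetchall():
    print(row)
```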

Case 3: Event comparison

Let’s say you are sending a stream of events to two different targets, each after very different pre-processing, and you are experiencing discrepancies in the data. How do you reconcile the event counts? What if events or data are missing? How do you resolve inconsistencies or quality issues?

If the data is stored in S3 and its format is supported by Athena, you can expose it as tables and identify the gaps, as shown in Figure 7.

Figure 6: Event Comparison

Steps:

  • Data is ingested into S3 in snappy-compressed or JSON format and forwarded to the legacy system of record
  • Data is ingested into S3 as CSV (columns separated by ‘|’) and forwarded to the new system of record
    • The event forwarder system consumes the source event and modifies the data before pushing it into the multiple targets.
  • Create an Athena table from the legacy source data and compare it with the problematic event forwarder data.


Figure 7: Comparison Inference

 

Case 4: API Call Analysis

If you have not enabled CloudWatch or set up your own ELK stack, but need to analyze service patterns such as total HTTP requests by type or 4XX and 5XX errors by call type, you can do so by enabling ELB access logs and reading them through Athena.

Figure 8: Calls Inference

Steps:

https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/access-log-collection.html

You can do the same with CloudTrail logs; more information is here:

https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html

 

Case 5: Python S3 Crawler

If you have tons of JSON data in S3 spread across directories and files and want to analyze the keys and their values, you can use Python libraries like PyAthena or JayDeBe to read the snappy-compressed files (after decompressing them with SnZip), collect the keys into a set data structure, and pass them as columns to Athena, as shown in Figure 10.

Figure 9: Event Crawling

Figure 10: Events to Athena
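A minimal sketch of such a crawler (the bucket, prefix, and the assumption of one JSON document per line are all illustrative):

```python
import json

import boto3

s3 = boto3.client("s3")
bucket = "my-event-bucket"          # hypothetical bucket
prefix = "events/decompressed/"     # assumes files were already decompressed (e.g., with snzip)

columns = set()

# Walk every object under the prefix and union the JSON keys found in each record.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        for line in body.splitlines():
            if line.strip():
                columns.update(json.loads(line).keys())

# The resulting set becomes the column list for the Athena table (all strings here).
ddl_columns = ",\n  ".join(f"`{name}` string" for name in sorted(columns))
print(f"CREATE EXTERNAL TABLE events (\n  {ddl_columns}\n) ...")
```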

Limitations

Athena has some limitations including:
  • Data must reside in S3.
  • To reduce query cost and improve performance, data should be compressed, partitioned, and converted to columnar formats.
  • User-defined functions, stored procedures, and many DDL statements are not supported.
  • If you are generating data continuously or have large data sets and want real-time or frequent insights, you should rely on analytical and visualization tools such as Redshift, Kinesis, EMR, Denodo, Spotfire, and Tableau.
  • Check the Athena FAQ to understand more about its benefits and limitations.

Summary

In this post, I shared how to leverage Athena to get analytics and minimize bottlenecks to product delivery. Be aware that some of the methods described were implemented when Athena was new, and newer tools may have changed how best to solve these use cases. Athena has since been integrated with Glue for building, maintaining, and running ETL jobs, and with QuickSight for visualization.

Reference

Athena documentation is at https://docs.aws.amazon.com/athena/latest/ug/what-is.html

About the Author

Vijay Yelanji (@VijayYelanji) is an architect at Asurion, based in San Mateo, CA. He has more than 20 years of experience across various domains, from cloud-enabled microservices supporting enterprise-level account, file, order, and subscription management systems to WebSphere integration servers and solutions, IBM enterprise storage solutions, Informix databases, and 4GL tools.

At Asurion, he was instrumental in designing and developing a multi-tenant, multi-carrier, highly scalable backup-and-restore mobile application using various AWS services.

You can download the Asurion Memories application for free at 

Recently, Vijay presented ‘Logging in AWS’ at the AWS Meetup in Mountain View, CA.

Many thanks to AnanthakrishnaChar, Kashyap and Cathy, Hui for their assistance in fine-tuning some of the use cases.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Scaling up An Existing Application with Lambda Functions

05. December 2018 2018 0

What is Serverless?

Serverless infrastructure is the latest buzzword in tech, but from the name itself, it’s not clear what it means. At its core, serverless is the next logical step in infrastructure service providers abstracting infrastructure away from developers so that when you deploy your code, “it just works.”

Serverless doesn’t mean that your code is not running on a server. Rather, it means that you don’t have to worry about what server it’s running on, whether that server has adequate resources to run your code, or if you have to add or remove servers to properly scale your implementation. In addition to abstracting away the specifics of infrastructure, serverless lets you pay only for the time your code is explicitly running (in 100ms increments).

AWS-specific Serverless: Lambda

Lambda, Amazon Web Services’ serverless offering, provides all these benefits, along with the ability to trigger a Lambda function from a wide variety of AWS services. Natively, Lambda functions can be written in:

  • Java
  • Go
  • PowerShell
  • Node.js
  • C#
  • Python

However, on November 29th, 2018, AWS announced the Runtime API, which makes it possible to add any language to that list. Adapters for Erlang, Elixir, COBOL, N|Solid, and PHP are in development as of this writing.

What does it cost?

The Free Tier (which does not expire after the 12-month window of some other free tiers) allows up to 1 million requests per month, with the price increasing to $0.20 per million requests thereafter. The free tier also includes 400,000 GB-seconds of compute time.

The rate at which this compute time is used up depends on how much memory you allocate to your Lambda function when you create it. For example, a function configured with 512 MB of memory consumes 0.5 GB-seconds per second of execution, so the free tier covers roughly 800,000 seconds (about 222 hours) of execution per month at that size. Any way you look at it, many workflows can operate in the free tier for the entire life of the application, which makes this a very attractive option for adding capacity and removing bottlenecks in existing software, as well as for quickly spinning up new offerings.

What does a Lambda workflow look like?

Using the various triggers that AWS provides, Lambda offers the promise of never having to provision a server again. That promise, while technically achievable, is only fulfilled for certain workflows, and often only by stringing together several services in complex ways.

Complex Workflow With No Provisioned Servers

However, if you have an existing application and don’t want to rebuild your entire infrastructure on AWS services, can you still integrate serverless into your infrastructure?

Yes.

In its simplest implementation, an AWS Lambda function can be the compute behind a single API endpoint. Rather than dealing with the complex dependency graph we see above, existing code can be abstracted into Lambda functions behind an API endpoint that is called from your existing code. This approach allows you to take code paths that are currently bottlenecks and move them onto infrastructure that can run in parallel and scale out as far as you need.

Creating an entire serverless infrastructure can seem daunting. Instead of exploring how you can create an entire serverless infrastructure all at once, we will look at how to eliminate bottlenecks in an existing application using the power that serverless provides.

Case Study: Resolving a Bottleneck With Serverless

In our hypothetical application, users can set up alerts to receive emails when a certain combination of API responses all return true. These various APIs are queried hourly to determine whether alerts need to be sent out. This application is built as a traditional monolithic application, with all the logic and execution happening on a single server.

Important to note is that the application itself doesn’t care about the results of the alerts. It simply needs to dispatch them, and if the various API responses match the specified conditions, the user needs to get an email alert.

What is the Bottleneck?

In this case, as we add more alerts, processing the entire collection of alerts starts to take more and more time. Eventually, we will reach a point where the checking of all the alerts will overrun into the next hour, thus ensuring the application is perpetually behind in checking a user’s alerts.

This is a prime example of a piece of functionality that can be moved to a serverless function. Because the application doesn’t care about the result of the serverless function, we can dispatch all of our alert calls asynchronously and take advantage of the auto-scaling and parallelization of AWS Lambda to ensure all of our events are processed in a fraction of the time.

Refactoring the checking of alerts into a Lambda function takes this bottleneck out of our codebase and turns it into an API call. However, there are now a few caveats that we have to resolve if we want to run this as efficiently as possible.

Caveat: Calling an API-invoked Lambda Function Asynchronously

If you’re building this application in a language or framework that’s synchronous by default, turning these checks into API calls will just result in each API call waiting for the previous one to finish. While it’s possible you’ll get a speed boost because your Lambda function is set up to be more powerful than your existing server, you’ll eventually run into a problem similar to the one we had before.

As we’ve already discussed, all our application needs to care about is that the call was received by API Gateway and Lambda, not whether it finished or what the result of the alert was. This means, if we can get API Gateway to return as soon as it receives the request from our application, we can run through all these requests in our application much more quickly.

In their documentation for integrating API Gateway and Lambda, AWS has provided documentation on how to do just this:

To support asynchronous invocation of the Lambda function, you must explicitly add the X-Amz-Invocation-Type:Event header to the integration request.

This will make the loop in our application that dispatches the requests run much faster and allow our alert checks to be parallelized as much as possible.
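A rough sketch of that dispatch loop from the application side, assuming the API Gateway method is configured to map this header through to the Lambda integration (the endpoint URL and payload fields are placeholders):

```python
import requests

API_URL = "https://api.example.com/check-alert"  # hypothetical API Gateway endpoint

def dispatch_alert_checks(alerts):
    """Fire-and-forget: API Gateway hands each event to Lambda asynchronously."""
    for alert in alerts:
        response = requests.post(
            API_URL,
            json={"alert_id": alert["id"]},
            # Ask for an asynchronous ("Event") invocation so the call returns
            # as soon as the request is accepted, not when the check finishes.
            headers={"X-Amz-Invocation-Type": "Event"},
            timeout=5,
        )
        # Each iteration only pays for the HTTP round trip to API Gateway.
        response.raise_for_status()
```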

Caveat: Monitoring

Now that your application is no longer responsible for ensuring these API calls complete successfully, failing APIs will no longer trigger any monitoring you have in place.

Out of the box, Lambda supports CloudWatch monitoring, where you can check whether there were any errors in the function execution. You can also set up CloudWatch to monitor API Gateway. If the existing metrics don’t fit your needs, you can always set up custom metrics in CloudWatch to ensure you’re monitoring everything that makes sense for your application.

By integrating Cloudwatch into your existing monitoring solution, you can ensure your serverless functions are firing properly and always available.

Tools for Getting Started

One of the most significant barriers to entry for serverless has traditionally been the lack of tools for local development and the fact that the infrastructure environment is a bit of a black box.

Luckily, AWS has built the SAM (Serverless Application Model) CLI, which uses Docker locally to give you a simulated serverless environment. Once you get the project installed and API Gateway running locally, you can hit the API endpoint from your application and see how your serverless function performs.

This allows you to test your new function’s integration with your application and iron out any bugs before you go through the process of setting up API Gateway on AWS and deploying your function.
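For a sense of the workflow, the basic SAM CLI loop looks something like this (the function name and event file are hypothetical and come from your own template.yaml):

```bash
sam build                                                       # build the function and its dependencies
sam local invoke CheckAlertFunction --event events/alert.json   # run a single invocation locally
sam local start-api                                             # emulate API Gateway on http://127.0.0.1:3000
```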

Serverless: Up and Running

Once you go through the process of creating a serverless function and getting it up and running, you’ll see just how quickly you can remove bottlenecks from your existing application by leaning on AWS infrastructure.

Serverless isn’t something you have to adopt across your entire stack or adopt all at once. By shifting to a serverless application model piece by piece, you can avoid overcomplicating your workflow while still taking advantage of everything this exciting new technology has to offer.

About the Author

Keanan Koppenhaver @kkoppenhaver, is CTO at Alpha Particle, a digital consultancy that helps plan and execute digital projects that serve anywhere from a few users a month to a few million. He enjoys helping clients build out their developer teams, modernize legacy tech stacks, and better position themselves as technology continues to move forward. He believes that more technology isn’t always the answer, but when it is, it’s important to get it right.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Unlearning for DynamoDB

04. December 2018 2018 0

Introduction

Relational Databases have been the standard way to store information since the 1980s. Relational Database usage techniques such as normalisation and index creation form part of many University courses around the world. In this article, I will set out how perhaps the hardest part of using DynamoDB is forgetting the years of relational database theory most of us have absorbed so fundamentally that it colours every aspect of how we think about storing and retrieving data.

Let’s start with a brief overview of DynamoDB.

Relational databases were conceived at a time when storage was the most expensive part of a system. Many of their design decisions revolve around this constraint – normalisation places a great deal of emphasis on keeping only a single copy of each item of data.

NoSQL databases are a more recent development, and have a different target for operation. They were conceived in a world where storage is cheap, and network latency is often the biggest problem, limiting scalability.

DynamoDB is Amazon’s answer to the problem of effectively storing and retrieving data at scale. The marketing would lead you to believe it offers:

  • Very high data durability
  • Very high data availability
  • Infinite capacity
  • Infinite scalability
  • High flexibility with a schemaless design
  • Fully managed and autoscaling with no operational overhead

In reality, whether it lives up to the hype depends on your viewpoint of what each of these concepts means. The durability, availability, and capacity points are the easiest to agree with – the chances of data loss are infinitesimally low, the only limit on capacity is the 10GB limit per partition, and the number of DynamoDB outages in the last eight years is tiny.

As we move down the list, though, things get a bit shakier. Scalability depends entirely on sharding data, which means you can run into issues with ‘hot’ partitions, where particular keys are used much more than others. While Amazon has managed to mitigate this to some extent with adaptive capacity, the problem is still very much something you need to design your data layout to avoid. Simply increasing the provisioned read and write capacity won’t necessarily give you extra performance.

It’s only partly true that DynamoDB is schemaless. Table structures are not uniform, and each row within a table can have different columns of differing types. However, you must define your primary key up front, and this can never change. The primary key consists of at minimum a partition key, and optionally a range key (also referred to as a sort key). Secondary indexes are limited as well – at the time of writing, a maximum of 5 local and 5 global secondary indexes per table – and local secondary indexes can only be defined at table creation, while global secondary indexes can be added later.

This all adds up to make the last point, that there is no operational overhead when using DynamoDB, obviously false. You won’t spend time upgrading database versions or operating systems, but there’s plenty of ops work to do designing the correct table structure, ensuring partition usage is evenly spread, and troubleshooting performance issues.

Unlearning

So how do you make your life using DynamoDB as easy as possible? Start by forgetting everything you know about relational databases because almost none of it is true here. Be careful – along the way you’ll find a lot of your instincts about how data should be structured are actually optimisations for RDBMS rather than incontrovertible facts.

Everything in a single table

In DynamoDB, the ‘right’ number of tables to power an application is one. There are, of course, exceptions, but start with the assumption that all data for your application will be in a single table, and move to multiple tables only if really necessary.

Know how you’re going to use your data up front

In a relational database, the structure of your data stems from the data itself. You group and normalise, and then stitch things back together at query time.

DynamoDB is different. Here you need to know the queries you’re most commonly going to run to structure your data appropriately and work backwards to come up with the table design.

It’s OK to duplicate data

Many times when using DynamoDB you will store the same data more than once. This might even be done with multiple copies of data in the same row, to allow it to be used by different indexes. Global Secondary Indexes are just copies of your data, with different keys.

Re-use the same column for different data

Imagine you have a table with a compound primary key – an account ID as the partition key, and a range key. You could use the range key to store different content about the account, for example, you might have a sort key settings for storing account configuration, then a set of timestamps for actions. You can get all timestamps by executing a query between the start of time and now, and the settings key by specifically looking up the partition key and a sort key named settings. This is probably the hardest part to get your head around at first, but once you get used to it, it becomes very powerful, particularly when used with secondary indexes.
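A minimal sketch of those two access patterns with boto3 (the table, key, and attribute names are illustrative):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("app-data")  # hypothetical single table

account_id = "account-42"

# Exact lookup: partition key plus the sort key named "settings".
settings = table.get_item(
    Key={"account_id": account_id, "sort_key": "settings"}
).get("Item")

# Range query: every timestamped action on the same partition,
# from the start of time until now.
actions = table.query(
    KeyConditionExpression=Key("account_id").eq(account_id)
    & Key("sort_key").between("2000-01-01T00:00:00Z", "2018-12-04T23:59:59Z")
)["Items"]
```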

This, of course, makes choosing a suitable name for your sort column very difficult.

Concatenate data into composite sort keys

If you have hierarchical data, you can use compound sort keys to allow you to refine data. For example, if you need to sort by location, you might have a sort key that looks like this:
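A hedged illustration of such a key (the delimiter and the levels are up to you):

```python
# Hypothetical composite sort key: country#city#suburb
sort_key = "uk#manchester#didsbury"

# A begins_with("uk#manchester") condition would then match every Manchester item.
```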

Using this sort key, you can find all items within a partition at any of the hierarchical levels by using a starts-with operator.

Use sparse indexes

Because not every row in DynamoDB must have the same columns, you can have secondary indexes that only contain a subset of your data. A row will only appear in a secondary index if that row has the index’s key attributes. Using this, you can make less specific queries – even scans – efficient enough for frequent use.

Don’t try and use your database for all types of queries

With relational databases, it’s common to spin up a read-only replica to interrogate data for analytics and trending. With DynamoDB, you’ll need to do this work somewhere else – perhaps even in a relational database. You can use DynamoDB Streams to have data sent to S3 for analysis with Athena, Redshift, or even something like MySQL. Doing this gives you a best-of-both-worlds approach: the high throughput and predictable scalability of DynamoDB, plus the ability to run ad-hoc queries provided by a relational engine.

Conclusions

  • Know what questions you need to ask of your data before designing the schema
  • Question what you think you know about how data should be stored
  • Don’t be fooled into thinking you can ‘set it and forget it.’

About the Author

Sam Bashton
Sam Bashton is a cloud computing expert who recently relocated with his family from Manchester, UK to Sydney, Australia. Sam has been working with AWS since 2006, providing consultancy and support for high traffic e-commerce websites. Recently Sam wrote and released bucketbridge.cloud, an AWS Marketplace solution providing an FTP and FTPS proxy for S3.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Previously, she was a principal site reliability engineer at RealSelf and developed cookbooks to simplify building and managing infrastructure at Chef. Jennifer is the coauthor of Effective DevOps and speaks about DevOps, tech culture, and monitoring. She also gives tutorials on a variety of technical topics. When she’s not working, she enjoys learning to make things and spending quality time with her family.


Vanquishing CORS with CloudFront and Lambda@Edge

03. December 2018 2018 0

If you’re deploying a traditional server-rendered web app, it makes sense to host static files on the same machine. The HTML, being server-rendered, will have to be served there, and it is simple and easy to serve css, javascript, and other assets from the same domain.

When you’re deploying a single-page web app (or SPA), the best choice is less obvious. A SPA consists of a collection of static files, as opposed to server-rendered files that might change depending on the requester’s identity, logged-in state, etc. The files may still change when a new version is deployed, but not for every request.

In a single-page web app, you might access several APIs on different domains, or a single API might serve multiple SPAs. Imagine you want to have the main site at mysite.com and some admin views at admin.mysite.com, both talking to api.mysite.com.

Problems with S3 as a static site host

S3 is a good option for serving the static files of a SPA, but it’s not perfect. Its static website hosting endpoints don’t support SSL—a requirement for any serious website in 2018. There are a couple of other deficiencies that you may encounter, namely client-side routing and CORS headaches.

Client-side routing

Most SPA frameworks rely on client-side routing. With client-side routing, every path should receive the content for index.html, and the specific “page” to show is determined on the client. It’s possible to configure this to use the fragment portion of the url, giving routes like /#!/login and /#!/users/253/profile. These “hashbang” routes are trivially supported by S3: the fragment portion of a URL is not interpreted as a filename. S3 just serves the content for /, or index.html, just like we wanted.

However, many developers prefer to use client-side routers in “history” mode (aka “push-state” or “HTML5” mode). In history mode, routes omit that #! portion and look like /login and /users/253/profile. This is usually done for SEO reasons, or just for aesthetics. Regardless, it doesn’t work with S3 at all. From S3’s perspective, those look like totally different files. It will fruitlessly search your bucket for files called /login or /users/253/profile. Your users will see 404 errors instead of lovingly crafted pages.

CORS headaches

Another potential problem, not unique to S3, is due to Cross-Origin Resource Sharing (CORS). CORS polices which routes and data are accessible from other origins. For example, a request from your SPA at mysite.com to api.mysite.com is considered cross-origin, so it’s subject to CORS rules. Browsers enforce that cross-origin requests are only permitted when the server at api.mysite.com sets headers explicitly allowing them.

Even when you have control of the server, CORS headers can be tricky to set up correctly. Some SPA tutorials recommend side-stepping the problem using webpack-dev-server’s proxy setting. In this configuration, webpack-dev-server accepts requests to /api/* (or some other prefix) and forwards them to a server (eg, http://localhost:5000). As far as the browser is concerned, your API is hosted on the same domain—not a cross-origin request at all.

Some browsers will also reject third-party cookies. If your API server is on a subdomain this can make it difficult to maintain a logged-in state, depending on your users’ browser settings. The same fix for CORS—proxying /api/* requests from mysite.com to api.mysite.com—would also make the browser see these as first-party cookies.

In production or staging environments, you wouldn’t be using webpack-dev-server, so you could see new issues due to CORS that didn’t happen on your local computer. We need a way to achieve similar proxy behavior that can stand up to a production load.

CloudFront enters, stage left

To solve these issues, I’ve found CloudFront to be an invaluable tool. CloudFront acts as a distributed cache and proxy layer. You make DNS records that resolve mysite.com to something.CloudFront.net. A CloudFront distribution accepts requests and forwards them to another origin you configure. It will cache the responses from the origin (unless you tell it not to). For a SPA, the origin is just your S3 bucket.

In addition to providing caching, SSL, and automatic gzipping, CloudFront is a programmable cache. It gives us the tools to implement push-state client-side routing and to set up a proxy for your API requests to avoid CORS problems.

Client-side routing

There are many suggestions to use CloudFront’s “Custom Error Response” feature in order to achieve pretty push-state-style URLs. When CloudFront receives a request to /login it will dutifully forward that request to your S3 origin. S3, remember, knows nothing about any file called login so it responds with a 404. With a Custom Error Response, CloudFront can be configured to transform that 404 NOT FOUND into a 200 OK where the content is from index.html. That’s exactly what we need for client-side routing!

The Custom Error Response method works well, but it has a drawback. It turns all 404s into 200s with index.html for the body. That isn’t a problem yet, but we’re about to set up our API so it is accessible at mysite.com/api/* (in the next section). It can cause some confusing bugs if your API’s 404 responses are being silently rewritten into 200s with an HTML body!

If you don’t need to talk to any APIs or don’t care to side-step the CORS issues by proxying /api/* to another server, the Custom Error Response method is simpler to set up. Otherwise, we can use Lambda@Edge to rewrite our URLs instead.

Lambda@Edge gives us hooks where we can step in and change the behavior of the CloudFront distribution. The one we’ll need is “Origin Request”, which fires when a request is about to be sent to the S3 origin.

We’ll make some assumptions about the routes in our SPA.

  1. Any request with an extension (eg, styles/app.css, vendor.js, or imgs/logo.png) is an asset and not a client-side route. That means it’s actually backed by a file in S3.
  2. A request without an extension is a SPA client-side route path. That means we should respond with the content from index.html.

If those assumptions aren’t true for your app, you’ll need to adjust the code in the Lambda accordingly. For the rest of us, we can write a lambda to say “If the request doesn’t have an extension, rewrite it to go to index.html instead”. Here it is in Node.js:
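A minimal sketch of that handler (the extension test is a simplifying assumption; adapt it to your asset layout):

```javascript
'use strict';

// Origin Request handler: rewrite extension-less paths to /index.html so
// client-side routes are served by the SPA entry point in the S3 origin.
exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;
    const hasExtension = /\.[a-zA-Z0-9]+$/.test(request.uri);

    if (!hasExtension) {
        request.uri = '/index.html';
    }

    callback(null, request);
};
```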

Make a new Node.js Lambda, and copy that code into it. At this time, in order to be used with CloudFront, your Lambda must be deployed to the us-east-1 region. Additionally, you’ll have to click “Publish a new version” on the Lambda page. An unpublished Lambda cannot be used with Lambda@Edge.

Copy the ARN at the top of the page and paste it into the “Lambda function associations” section of your S3 origin’s Behavior. This is what tells CloudFront to call your Lambda when an Origin Request occurs.

Et voilà! You now have pretty SPA URLs for client-side routing.

Sidestep CORS Headaches

A single CloudFront “distribution” (that’s the name for the cache rules for a domain) can forward requests to multiple servers, which CloudFront calls “Origins”. So far, we only have one: the S3 bucket. In order to have CloudFront forward our API requests, we’ll add another origin that points at our API server.

You probably want to set up this origin with minimal or no caching, and be sure to forward all headers and cookies. We’re not really using any of CloudFront’s caching capabilities for the API server; rather, we’re treating it as a reverse proxy.

At this point you have two origins set up: the original one for S3 and the new one for your API. Now we need to set up the “Behavior” for the distribution. This controls which origin responds to which path.

Choose /api/* as the Path Pattern to go to your API. All other requests will hit the S3 origin. If you need to communicate with multiple API servers, set up a different path prefix for each one.

CloudFront is now serving the same purpose as the webpack-dev-server proxy. Both frontend and API endpoints are available on the same mysite.com domain, so we’ll have zero issues with CORS.

Cache-busting on Deployment

The CloudFront cache makes our sites load faster, but it can cause problems too. When you deploy a new version of your site, the cache might continue to serve an old version for 10-20 minutes.

I like to include a step in my continuous integration deploy to bust the cache, ensuring that new versions of my asset files are picked up right away. Using the AWS CLI, this looks like:
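The distribution ID below is a placeholder for your own:

```bash
# Invalidate every cached path so the freshly deployed assets are served immediately.
aws cloudfront create-invalidation \
  --distribution-id E1234EXAMPLEID \
  --paths "/*"
```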

About the Author

Brian Schiller (@bgschiller) is a Senior Software Engineer at Devetry in Denver. He especially enjoys teaching, and leads Code Forward, a free coding bootcamp sponsored by Devetry. You can read more of his writing at brianschiller.com.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Previously, she was a principal site reliability engineer at RealSelf and developed cookbooks to simplify building and managing infrastructure at Chef. Jennifer is the coauthor of Effective DevOps and speaks about DevOps, tech culture, and monitoring. She also gives tutorials on a variety of technical topics. When she’s not working, she enjoys learning to make things and spending quality time with her family.