How Rolling Out EC2 Nitro-Based Instance Types Surfaced a DNS Query Rate Limit
Amazon Web Services (AWS) is a great cloud platform, enabling all kinds of businesses and organizations to innovate and build on a global scale with great velocity. However, while its infrastructure is highly scalable, AWS does not have infinite capacity. Each AWS service has a set of service limits to ensure a quality experience for all customers.
There are also some limits that customers might accidentally discover while deploying new applications or services, or while trying to scale up existing infrastructure.
Keeping Pace with AWS Offerings
We want to share our discovery and mitigation of one such limit. As prudent customers following the AWS Well-Architected Framework, we track new AWS services and updates to existing ones to take advantage of new capabilities and potential cost savings. At re:Invent 2017, the Nitro system architecture was introduced. As soon as the relevant EC2 instance types became generally available, we started updating our infrastructure to use the new M5 and C5 instance types. We updated the relevant CloudFormation templates and launch configurations, and built new AMIs with the new and improved Elastic Network Adapter (ENA) enabled. We were now ready to start the upgrade process.
Preparing for Success in Production by Testing Infrastructure
We were eager to try out new instance types, so we launched a couple of test instances using our common configuration to start our testing. After some preliminary testing (mostly kicking proverbial tires) we started the update of our test environment.
Our test environment is very similar to the production environment. We try to use the same configuration with modified parameters to account for the lighter load on our test instances (e.g., smaller instances and Auto Scaling groups). We updated our stacks with the revised CloudFormation templates and successfully rebuilt the Auto Scaling groups using new launch configurations. We did not observe any adverse effects on our infrastructure while running through some tests. The environment worked as expected, and developers continued to deploy and test their changes.
Deploying to Production
After testing in our test environment and letting the changes bake for a couple of weeks, we felt confident that we were ready to deploy the new instance types into production. We purchased reserved M5 and C5 instances, which immediately started saving us money, and we observed performance improvements as well. We started with some second-tier applications and services. These upgraded smoothly, which added to our confidence in the changes we were making to the environment. It was exciting to see, and we could not wait to tackle our core applications and services, including the infrastructure running our main site.
Everything was in place: we had new instances running in the test environment, and partially in production, and we had notified our engineering team about the upgrade.
In our environment, we share on-call responsibilities with development. Every engineer is engaged in the success of the company through shared responsibility for the health of the site. We followed our process of notifying the on-call engineers about the changes to the environment. We pulled up our monitoring and “Golden Signal” dashboards to watch for anomalies. We were ready!
The update process went relatively smoothly, and we replaced older M4 and C4 instances with new and shiny M5 and C5 ones. We saw some performance gains, e.g., somewhat faster page loads. Dashboards didn’t show any issues or anomalies. We started to check it off our to-do list so we could move on to the next project in our backlog.
It’s the Network…
We were paged. Some of the instances in an availability zone (AZ) were throwing errors that we initially attributed to network connectivity issues. We verified that the presumed “network failures” were limited to a single AZ, so we decided to divert traffic away from that AZ and wait for the network to stabilize. After all, this is why we run a multi-zone deployment.
We wanted to do our due diligence to make sure we understood the underlying problem. We started digging into the logs and did not see anything abnormal. We chalked it up to transient network issues and continued to monitor our infrastructure.
Some time passed without additional alerts, and we decided to bring the AZ back into production. No issues were observed; our initial assessment seemed to be correct.
Then, we got another alert. This time a second AZ was having issues. We thought that it must be a bad day for AWS networking. We knew how to mitigate it: take that AZ out of production and wait it out. While we were doing that, we got hit with another alert; it looked like yet another AZ was having issues. At this point, we were concerned that something wasn’t working as we expected and that maybe we needed to modify the behavior of our application. We dove deeper into our logs.
Except when it’s DNS…
This event was a great example of our SREs and developers coming together and working on the problem as one team. One engineer jumped on a support call with AWS, while our more experienced engineers started close examination of logged messages and events. Right then we noticed that our application logs contained messages that we hadn’t focused on before: failures to perform DNS name resolution.
We use Route 53 for DNS and had never experienced these kinds of errors on the M4 or C4 instances. We jumped onto the EC2 instances and confirmed that name resolution worked as we expected. We were really puzzled about the source of these errors. We checked whether any production code deploys might have introduced them, and we did not find anything suspicious or relevant.
In the meantime, our customers were experiencing intermittent errors while accessing our site, and that fact did not sit well with us.
Luckily, the AWS support team was working through the trouble with us. They checked and confirmed that they did not see any networking outages being reported by their internal tools. Our original assumption was incorrect. AWS support suggested that we run packet capturing focused on DNS traffic between our hosts to get additional data. Coincidentally, one of our SREs was doing exactly that and analyzing the captured data. Analysis revealed a very strange pattern: while many of the name resolution queries were successful, some were failing, and there was no pattern to which names failed to resolve. We also observed that our instances were generating about 400 DNS queries per second.
We shared our findings with AWS support. They took our data and contacted us with a single question. “Have you recently upgraded your instances?”
“Oh yes, we upgraded earlier that day,” we responded.
AWS support then reminded us that each Amazon EC2 instance limits the number of packets sent to the Amazon-provided DNS server to a maximum of 1024 packets per second per network interface (https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#vpc-dns-limits). The limit was not new; however, on the newer instance types AWS had eliminated internal retries, which made DNS resolution errors visible to the instances. To mitigate the impact, the service team recommended implementing DNS caching on the instances.
At first, we were skeptical about their suggestion. After all, we did not seem to be breaching the limit with our 400 or so requests per second. However, we did not have any better ideas, so we decided to pursue two solutions. Most importantly, we needed to improve the experience of our customers by rolling back our changes; we did that and immediately stopped seeing DNS errors. Second, we started implementing local DNS caching on the affected EC2 instances.
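In hindsight, the averaged metric was hiding bursts. The sketch below, with purely hypothetical numbers, shows how bucketing captured query timestamps into one-second windows can reveal that a fleet averaging around 400 queries per second still blows past a 1024 packet-per-second ceiling during a burst:

```python
from collections import Counter

def per_second_rates(timestamps):
    """Bucket query timestamps (in seconds) into whole-second windows."""
    return Counter(int(ts) for ts in timestamps)

# Hypothetical capture: 4,000 queries over 10 seconds (400 qps average),
# but half of them arrive within a single one-second burst.
timestamps = [9 + i / 2000 for i in range(2000)]            # 2,000 queries in second 9
timestamps += [i % 9 + (i / 250) % 1 for i in range(2000)]  # the rest spread over seconds 0-8

rates = per_second_rates(timestamps)
average = len(timestamps) / 10

print(average)              # 400.0 -- looks safely under the limit
print(max(rates.values()))  # 2000  -- but one second bursts past 1024
```

The average alone never hints at the burst; only the per-second view does.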
AWS support recommended using nscd (https://linux.die.net/man/8/nscd). Based on our personal experiences with various DNS tools and implementations, we decided instead to use BIND (https://www.isc.org/downloads/bind/), configured to act only as a caching server with its XML statistics channels enabled. The reason for that requirement was our desire to understand the nature of the DNS queries performed by our services and applications, with the aim of improving how they interacted with our DNS infrastructure.
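A caching-only configuration of this sort can be sketched roughly as follows; this is an illustration rather than our production configuration, and the forwarder address and statistics port are example values:

```
// Illustrative caching-only named.conf fragment (example values only).
options {
    listen-on port 53 { 127.0.0.1; };      // serve local clients only
    listen-on-v6 port 53 { ::1; };
    recursion yes;                         // resolve and cache answers
    allow-query { localhost; };
    forwarders { 10.0.0.2; };              // VPC-provided DNS server (illustrative)
    forward only;                          // never contact upstream servers directly
};

statistics-channels {
    // Expose XML statistics for analyzing query patterns.
    inet 127.0.0.1 port 954 allow { 127.0.0.1; };
};
```

With `forward only`, every cache miss goes to the VPC resolver, while cache hits are answered locally and never count against the per-ENI packet limit.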
Infrastructure as Code and the Magic DNS Address
A fair number of our EC2 instances run Ubuntu. We were hoping to use the main package repositories to install BIND and apply our custom configuration to achieve our goal of reducing the number of DNS queries per host.
On the Ubuntu hosts, we used our configuration management system (Chef) to install the bind9 package and configure it to listen only on the localhost interfaces (127.0.0.1, ::1), query the AWS VPC DNS server, cache DNS query results, log statistics, and expose them via port 954. AWS VPC provides a DNS server running on a reserved IP address at the base of the VPC IPv4 network range, plus two. We had started coding a solution to calculate that IP address from the network parameters of our instances when we noticed that there is also an often overlooked “magic” DNS address available to use: 169.254.169.253. That made our effort easier, since we could hard-code that address in our configuration template file. We also needed to preserve the content of the original resolver configuration file and prepend the loopback address (127.0.0.1) to it. That way, our local caching server would be queried first, but if it was not running for some reason, clients would have a fallback address to query.
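The base-plus-two calculation we had started to automate is straightforward with Python's ipaddress module; the CIDR blocks below are examples, not our actual VPC ranges:

```python
import ipaddress

def vpc_dns_address(vpc_cidr: str) -> str:
    """Return the Amazon-provided DNS address for a VPC: the base of the
    VPC IPv4 network range, plus two."""
    network = ipaddress.ip_network(vpc_cidr)
    return str(network.network_address + 2)

print(vpc_dns_address("10.0.0.0/16"))    # 10.0.0.2
print(vpc_dns_address("172.31.0.0/16"))  # 172.31.0.2
```

The link-local “magic” address 169.254.169.253 reaches the same resolver from any VPC, which is what let us skip this calculation entirely.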
Prepending the loopback address was achieved by adding the following option to the dhclient configuration file (/etc/dhcp/dhclient.conf):
prepend domain-name-servers 127.0.0.1;
Preserving the original content of /etc/resolv.conf was done by creating /etc/default/resolvconf with the following statement:
TRUNCATE_NAMESERVER_LIST_AFTER_LOOPBACK_ADDRESS=no
To apply these changes, we needed to restart networking on our instances. Unfortunately, the only way we found to get it done was not what we were hoping for (service networking restart did not do the trick):
ifdown --exclude=lo -a && ifup --exclude=lo -a
We tested our changes using Chef's excellent Test Kitchen and InSpec tools (TDD!) and were ready to roll out the changes once more. This time we were extra cautious and performed a canary deploy with subsequent long-term validation before updating the rest of our EC2 Ubuntu fleet. The results were as expected; that is to say, we did not see any negative impact on our site. We observed better response times and were saving money by utilizing our reserved instances.
We learned our lesson through this experience: respect the service limits imposed by AWS. If something isn't working right, don't just assume it's the network (or DNS!); check the documentation.
We benefited from our culture of “one team” with site reliability and software development engineers coming together and working towards the common goal of delighting our customers. Having everyone being part of the on-call rotation ensured that all engineers were aware of changes and their impact in the environment. Mutual respect and open communication allowed for quick debugging and resolution when problems arose with everyone participating in the process to restore customer experience.
By treating our infrastructure as code, we were able to target our fixes with high precision and roll back and forward efficiently. Because everything was automated, we could focus on the parts that needed to change (like the local DNS cache) and quickly test on both the old and new instances.
We became even better at practicing test-driven development. Good tests make our infrastructure more resilient and our on-call quieter, which improves our overall quality of life.
Our monitoring tools and dashboards are great; however, it is important to be ready to look beyond what is presently measured and viewable, and to dive deep into the application and its behaviors. It's also important to take time after an event to iterate on tools and dashboards to make them more relevant.
We are also happy to know that AWS works hard to increase or eliminate limits as services improve, so we should not need to go through these exercises too often.
We hope that sharing this story might be helpful to our fellow SREs out there!
About the Author
Bakha Nurzhanov is a Senior Site Reliability Engineer at RealSelf. Prior to joining RealSelf, he had been solving interesting data and infrastructure problems in healthcare IT, and he worked on the payments infrastructure engineering team at Amazon. His Twitter handle is @bnurzhanov.
About the Editor
Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.