Last Minute Naughty/Nice Updates to Santa’s List

11. December 2018


The other day I was having a drink with my friend Alabaster. You might have heard of him before, but if not, you’ve heard of his boss, Santa. Alabaster is a highly educated elf who has been the Administrator of the Naughty and Nice list for more than five decades now. It’s been his responsibility to manage it since it was still a paper list. He’s known for moving the list to a computer. Last year he moved it to AWS DynamoDB.

“It went great!” he told me with a wince that made me unsure of what he meant. “But then on the 23rd of December, we lost some kids.”

“What?! What do you mean, you lost some kids?” I asked.

“Well. Let me explain. The process is a little complicated.

Migrating the list to AWS was more than just migrating data. We also had to change the way we manage the naughty and nice list. Before, with our own infrastructure, we didn’t care about resource utilization, as long as the infrastructure could handle it. We were updating the list five times a minute per kid. At 1.8 billion kids that was about 150 million requests per second, constant, easy to manage.

Bushy, the elf who made the toy-making machine, is the kind of person who thinks a half-full glass is just twice the size it should be. Bushy pointed out that information about whether a child was naughty or nice, and their location for Christmas, was only needed on December 24th. He proposed that we stop updating the information so frequently.

So we changed how we updated the data. It was a big relief, but it resulted in a spiky load. In December, we suddenly found ourselves with a lot of data to update. 1.8 billion records, to be exact. And it failed. The autoscaling of DynamoDB mostly worked, with some manual fiddling to keep increasing the number of writers fast enough. But on December 23rd we had our usual all-hands-on-deck meeting on last-minute changes in kids’ behaviour, and no one was reacting to the throttling alarms. We didn’t notice until the 25th. By then some records had been lost and some gifts had been delivered to the wrong addresses.

Some kids stopped believing in Santa because someone else actually delivered their gifts late! It was the worst mistake of my career.”

“Oh, thank goodness you didn’t literally lose some kids! But, oh wow. Losing so many kids’ trust and belief must have really impacted morale at the North Pole! That sounds incredibly painful. So what did you learn and how has it changed the process for this year?” I asked.

“Well, the main difference is that we decoupled the writes. DynamoDB likes regular writes and can scale in a reasonable way as long as the traffic isn’t all peaks and doesn’t increase too quickly.

So we send all the information to SQS and then use lambdas to process the writes. That gives us two ways of keeping control without risking a failed write: we can limit the rate of writes by changing the lambda concurrency, and we can control the number of writers either with auto-scaling or manually.”

“That looks like an interesting way of smoothing a spiky load. Can you share an example?” I asked.

“I can show you the lambda code; I’ve just been playing with it.” He turned his laptop towards me, showing me the code. It was almost empty: just a process_event function that did a write using boto3.

“That’s it?” I asked.

“Yes, we use zappa for it, so it’s mostly configuration,” he replied.

We paired at the conference hotel bar, as you do when you find an interesting technical solution. First, Alabaster told me I had to create an SQS queue. We visited the SQS console. The main issue was that it looked like AWS had a completely different UI for the north-pole-1 region (which, to be honest, I didn’t know existed). I already had Python 3.6 set up, so I only needed to create a virtual environment with python -m venv sqs-test and activate it with . sqs-test/bin/activate.

Then, he asked me to install zappa with pip install zappa. We created a file zappa_settings.json starting with the following as a base (you can use zappa init but you’ll then need to customise it for a non-webapp use-case):
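A sketch of what such a base file could contain, assuming an SQS-triggered function with no API Gateway in front of it. The keys are settings documented by Zappa; the stage name dev, the project name, region, bucket name, queue ARN and the handler path app.process_messages are placeholder assumptions, not the North Pole's values. To keep the examples in Python, the sketch builds the settings as a dict and prints them as JSON, which can be redirected into zappa_settings.json.

```python
import json

# Placeholder ARN (assumption): use the ARN of your own SQS queue.
QUEUE_ARN = "arn:aws:sqs:eu-west-2:123456789012:sqs-test"


def build_settings(queue_arn=QUEUE_ARN):
    """Return a Zappa settings dict for a non-webapp, SQS-triggered function."""
    return {
        "dev": {  # stage name (assumption)
            "apigateway_enabled": False,  # no web endpoint is needed
            "aws_region": "eu-west-2",
            "profile_name": "default",
            "project_name": "sqs-test",
            "runtime": "python3.6",
            "s3_bucket": "zappa-sqs-test",
            "events": [
                {
                    # module.function that receives the SQS batches
                    "function": "app.process_messages",
                    "event_source": {
                        "arn": queue_arn,
                        "batch_size": 10,  # messages per invocation
                        "enabled": True,
                    },
                }
            ],
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_settings(), indent=4))
```

Running it as python make_settings.py > zappa_settings.json (the script name is hypothetical) would produce the file Zappa reads.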

I changed the profile_name and aws_region to match my credentials configuration and also the s3_bucket and the event_source arn to match my newly created SQS queue (as I don’t have access to the north-pole-1 region).

We then just sorted out a baseline with a minimalistic app.py:
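A minimal sketch of what such a baseline could look like, assuming the handler is called process_messages (a name chosen here for illustration) and is triggered by SQS, which delivers its messages under the event's "Records" key:

```python
def process_messages(event, context):
    """Log the incoming SQS event and Lambda context, then report the batch size."""
    # Anything printed by a Lambda function ends up in its CloudWatch logs.
    print(event)
    print(context)
    # SQS-triggered invocations carry their messages in event["Records"].
    return len(event.get("Records", []))
```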

This code writes the contents of the context and the event to CloudWatch logs. Alabaster explained that I could get quick access to them using zappa tail. From there, I could extend it to write to the naughty-nice list on DynamoDB, or to any other system whose activity I wanted to limit.
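As a rough idea of that next step, here is a sketch of a handler that writes each queued message to DynamoDB. It is not the North Pole's code: the table name naughty-nice, the assumption that each message body is a JSON object matching the table's schema, and the helper names are all made up for illustration.

```python
import json
from decimal import Decimal

TABLE_NAME = "naughty-nice"  # hypothetical table name


def get_table(name=TABLE_NAME):
    """Return a boto3 DynamoDB Table resource (boto3 is imported lazily)."""
    import boto3

    return boto3.resource("dynamodb").Table(name)


def write_records(event, table):
    """Write each SQS message body (assumed to be JSON) as one DynamoDB item."""
    written = 0
    for record in event.get("Records", []):
        # boto3's DynamoDB resource rejects floats, so parse them as Decimal.
        item = json.loads(record["body"], parse_float=Decimal)
        table.put_item(Item=item)
        written += 1
    return written


def process_messages(event, context):
    """Lambda entry point: persist the whole batch and return its size."""
    return write_records(event, get_table())
```

If put_item raises, for example because the table is throttling, the invocation fails and the batch becomes visible on the queue again after its visibility timeout, so the write is retried rather than lost.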

Alabaster showed me the North Pole’s working implementation, including how they had the throttling alarms set up in CloudWatch and how they configured Lambda concurrency in the Lambda console (choose a function, go to the “Concurrency” panel, click “Reserve concurrency” and set the number to 1, then increase as needed). While a burst of a million updates was handled with some delay, there was no data loss. I could see the pride in his face.
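The same concurrency limit can also be set from code rather than the console. The sketch below wraps boto3's put_function_concurrency call on the Lambda client; the helper name and the function name sqs-test-dev in the usage comment are assumptions for illustration.

```python
def set_reserved_concurrency(lambda_client, function_name, limit):
    """Cap how many instances of a Lambda function may run at the same time."""
    if limit < 0:
        raise ValueError("limit must be zero or a positive integer")
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   set_reserved_concurrency(boto3.client("lambda"), "sqs-test-dev", 1)
```

Starting at 1 and raising the limit step by step mirrors the console procedure: the queue absorbs the burst while the number of concurrent writers stays under control.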

Hoping everything goes well for him this season, and that all have a good Christmas, and a good night!

About the Author

João Miguel Neves is a Lead Developer at POP https://www.wegotpop.com/, a company that manages people on movie productions. He also writes about python and cloud on his blog https://silvaneves.org/

About the Editors

Ed Anderson is the SRE Manager at RealSelf, organizer of ServerlessDays Seattle, and occasional public speaker. Find him on twitter at @edyesed.

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.