Hacking together an Alexa skill

24. December 2018

Alexa is an Amazon technology that allows users to build voice-driven applications. Amazon takes care of converting voice input to text and vice versa, provisioning the software to devices, and calling into your business logic. You use the Amazon interface to build your model and provide the logic that executes based on the text input. The combination of the model and the business logic is an Alexa skill. Alexa skills run on a variety of devices, most prominently the Amazon Echo.

I built an Alexa skill as a proof of concept for a hackathon; I had approximately six hours to build something to demo. My goals for the proof of concept were to:

  • Run without error in the simulator
  • Pull data from a remote service using an AWS lambda function
  • Take input from a user
  • Respond to that input

I was working with horse racing data because it is timely and because I had access to an API that provided interesting data. Horse races happen at a specific track on a specific date and time. Each race has a number of horses that compete.

The flow of my Alexa skill was:

  • Notify Alexa to call into the custom skill using a phrase.
  • Prompt the user to choose a track from one of N provided by me.
  • Store the track name in the session for future operations.
  • Prompt the user to choose between two sets of information that might be of use: the number of races today or the date of the next featured race.
  • Return the requested information.
  • Exit the skill, which meant that Alexa was no longer listening for voice input.

The barriers to entry for creating a proof of concept are low. If you can write a Python script and navigate the AWS console, you can write an Alexa skill. There are, however, other considerations that I didn’t have to work through because this was a hackathon project: UX, testing, and deployment to a device would all be crucial to any production project.

Jargon and Terminology

Like any other technology, Alexa has its own terminology. And there’s a lot of it.

A skill is a package of a model to convert voice to text and vice versa as well as business logic to execute against the text input. A skill is accessed by a phrase the user says, like “listen to NPR” or “talk horse racing.” This phrase is an “invocation.” The business logic is written by you, but the Alexa service handles the voice to text and text to voice conversion.

A skill is made up of one or more intents. An intent is one branch of logic and is also identified by a phrase, called an utterance. While invocations need to be globally unique, utterances only trigger after a skill is invoked, so the phrasing can overlap between different skills. Example intent utterances would be “please play ‘Fresh Air’” or “my favorite track is Arapahoe Park.” You can also use a prepackaged intent, such as one that returns search results, and tie that to an utterance. Utterances are also called samples.

Slots are placeholders within utterances. If the intent phrase is “please play ‘Fresh Air’”, you can parameterize the words ‘Fresh Air’ and have them converted to text and delivered to your business logic. A slot is basically a multiple choice selection: you provide N values, and the spoken value arrives as text. Each slot has a defined data type. It was unclear to me what happens when a slot is filled with a value that is not one of your N values. (I didn’t get a chance to test that.)

A session is a place for your business logic to store temporary data between invocations of intents. A session is tied both to an application and a user (more info here). Sessions stay around for about the length of time a user is interacting with your application. Depending on application settings it will be about 30 seconds. If you need to store data for longer, connect your business logic to a durable storage solution like S3 or DynamoDB.
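To make the session mechanics concrete, here is a minimal sketch of how a skill response carries session attributes (the `favoriteTrack` attribute name and the speech text are illustrative, not from the original skill):

```python
# A skill response is plain JSON. Anything placed in 'sessionAttributes'
# is echoed back by Alexa on the next request in the same session.
session_attributes = {'favoriteTrack': 'Arapahoe Park'}

response = {
    'version': '1.0',
    'sessionAttributes': session_attributes,
    'response': {
        'outputSpeech': {'type': 'PlainText', 'text': 'Got it.'},
        'shouldEndSession': False,  # keep listening, keep the session alive
    },
}

# On a later intent, the stored value comes back in the request's session:
favorite = response['sessionAttributes'].get('favoriteTrack')
```

Once `shouldEndSession` is true or the session times out, anything stored this way is gone, which is why durable data belongs in S3 or DynamoDB.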

Getting started

To get started, I’d suggest using this tutorial as a foundation. If you are looking for a different flow, look at this set of tutorials and see if any of them match your problem space. All of them walk you through creating an Alexa skill using a python lambda function. It’s worth noting that you’ll have to sign up with an Amazon account for access to the Alexa console (it’s a different authentication system than AWS IAM). I’d also start out using a lambda function to eliminate a possible complication. If you use lambda, you don’t have to worry about making sure Alexa can access your https endpoint (the other place your business logic can reside).

Once you have the tutorial working in the Alexa console, you can start extending the main components of the Alexa skill: the model or the business logic.


You configure the model using the Alexa console and the Alexa Skills Kit, or via the CLI or the Skills Kit API. In either case, you’re going to end up with a JSON configuration file containing information about the invocation phrase, the intents, and the slots. When using the console, you can also trigger a model build and test your model in a browser, as long as you have a microphone.

Here are selected portions of the JSON configuration file for the Alexa skill I created. You can see this was a proof of concept as I didn’t delete the color scheme from the tutorial and only added two tracks that the user can select as their favorite.

    "invocationName": "talk horse racing",
    ...
    "samples": [
        "my favorite track is {TrackName}"
    ],
    ...
    "samples": [
        "how many races"
    ],
    ...
    "samples": [
        "when is the stakes race",
        "when is the next stakes race"
    ],
    ...
    "values": [
        { "name": { "value": "Arapahoe Park" } },
        { "name": { "value": "Tampa Bay Downs" } }
    ]

The other component of the system is the business logic. This can either be an AWS Lambda function, written in any language supported by that service, or a service that responds to HTTPS requests. The latter can be useful for leveraging existing code or data that doesn’t live in AWS. If you use Lambda, you can deploy the skill just like any other Lambda function, which means you can leverage whatever lifecycle, frameworks, or testing solutions you use for other Lambda functions. Using a non-Lambda solution requires a bit more work when processing a request, but it can be done.

The business logic I wrote for this was basically hacked tutorial code. The first section is the lambda handler. Below is a relevant snippet where we examine the event passed to the lambda function by the Alexa system and call the appropriate business method.

def lambda_handler(event, context):
    if event['session']['new']:
        on_session_started({'requestId': event['request']['requestId']},
                           event['session'])

    if event['request']['type'] == "LaunchRequest":
        return on_launch(event['request'], event['session'])
    elif event['request']['type'] == "IntentRequest":
        return on_intent(event['request'], event['session'])
    elif event['request']['type'] == "SessionEndedRequest":
        return on_session_ended(event['request'], event['session'])

on_intent is the logic dispatcher which retrieves the intent name and then calls the appropriate internal function.

def on_intent(intent_request, session):
    """ Called when the user specifies an intent for this skill """
    print("on_intent requestId=" + intent_request['requestId'] +
          ", sessionId=" + session['sessionId'])

    intent = intent_request['intent']
    intent_name = intent_request['intent']['name']

    if intent_name == "MyColorIsIntent":
        return set_color_in_session(intent, session)
    elif intent_name == "HowManyRaces":
        return get_how_many_races(intent, session)


Each business logic function can be independent and could call into different services if need be.

def get_how_many_races(intent, session):
    session_attributes = {}
    reprompt_text = None
    # Setting reprompt_text to None signifies that we do not want to reprompt
    # the user. If the user does not respond or says something that is not
    # understood, the session will end.

    if session.get('attributes', {}) and "favoriteColor" in session.get('attributes', {}):
        favorite_track = session['attributes']['favoriteColor']
        speech_output = "There are " + get_number_races(favorite_track) + \
            " races at " + favorite_track + " today. Thank you, good bye."
        should_end_session = True
    else:
        speech_output = "Please tell me your favorite track by saying, " \
                        "my favorite track is Arapahoe Park"
        should_end_session = False

    return build_response(session_attributes, build_speechlet_response(
        intent['name'], speech_output, reprompt_text, should_end_session))

build_response is directly from the sample code and creates a correctly formatted string response. This response will be interpreted by Alexa and converted into speech.

def build_response(session_attributes, speechlet_response):
    return {
        'version': '1.0',
        'sessionAttributes': session_attributes,
        'response': speechlet_response
    }


Based on the firm foundation of the tutorial, you can easily add more slots and intents, and change the invocation phrase. You can also build out additional business logic to respond to the additional voice input.


I tested my skill manually using the built-in simulator in the Alexa console. I tried other simulators, but they were not effective. At the bottom of the python tutorial mentioned above, there is a reference to echosim.io, an online Alexa skill simulator; I couldn’t make it work.

After each model change (new or modified utterances, intents or invocations) you will need to rebuild the model (approximately 30-90 seconds, depending on the complexity of your model). Changing the business logic does not require rebuilding the model, and you can use that functionality to iterate more quickly.

I did not investigate automated testing. If I were building a production Alexa skill, I’d add a small layer of indirection so that the business logic could be easily unit tested, apart from any dependencies on Alexa objects. I’d also plan to build a CI/CD pipeline so that changes to the model or the lambda function could be deployed automatically, something like what is outlined here.
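As a sketch of that layer of indirection (the function name is mine, not from the skill), the speech text can be computed by a pure function that is trivial to unit test, with the Alexa response plumbing kept separate:

```python
# Pure business logic: no Alexa request/response objects, so it can be
# unit tested without any Alexa fixtures.
def race_count_speech(track, race_count):
    return "There are %d races at %s today. Thank you, good bye." % (
        race_count, track)

# A thin Alexa-facing wrapper would call race_count_speech() and wrap the
# returned string in the speechlet/response dicts shown earlier.
```

The intent handlers then shrink to glue code, and the interesting logic can be exercised with ordinary unit tests.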

User Experience (UX)

Voice UX is very different from the UX of a desktop or mobile device. Because information transmission is slow, it’s even more important to think about voice UX for an Alexa skill than it would be if you were building a more traditional web-based app. If you are building a skill for any purpose other than exploration or a proof of concept, make sure to devote some time to learning about voice UX. This webinar appears useful.

Some lessons I learned:

  • Don’t go too deep in navigation level. With Alexa, you can provide choice after choice for the user, but remember the last time you dealt with an interactive phone voice recognition system. Did you like it? Keep interactions short.
  • Repeat back what Alexa “heard” as this gives the user another chance to correct course.
  • Offer a help option. If I were building a production app, I’d want to get some kind of statistics on how often the help option was invoked to see if there was an oversight on my part.
  • Think about error handling using reprompts. If the skill hasn’t received input, it can reprompt and possibly get more user input.

After the simulator

A lot of testing and development can take place in the Amazon Alexa simulator. However, at some point, you’ll need to deploy to a device. Since this was a proof of concept, I didn’t do that, but there is documentation on that process here.


This custom Alexa skill was the result of a six-hour exploration during a company hackfest. At the end of the day, I had a demo I could run on the Alexa Simulator. Alexa is mainstream enough that it makes sense for anyone who works with timely, textual information to evaluate building a skill, especially since a prototype can be built relatively quickly. For instance, it seems to me that a newspaper should have an Alexa skill, but it doesn’t make as much sense for an e-commerce store (unless you have certain timely information and a broad audience) because complex navigation is problematic. Given the low barrier to entry, Alexa skills are worth exploring as this new UX for interacting with computers becomes more prevalent.

About the Author

Dan Moore is director of engineering at Culture Foundry. He is a developer with two decades of experience, former AWS trainer, and author of “Introduction to Amazon Machine Learning,” a video course from O’Reilly. He blogs at http://www.mooreds.com/wordpress/ . You can find him on Twitter at @mooreds.

About the Editors

Ed Anderson is the SRE Manager at RealSelf, organizer of ServerlessDays Seattle, and occasional public speaker. Find him on Twitter at @edyesed.

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.

Realtime Streaming with AWS Services

22. December 2018

In today’s data-driven world, most organizations generate vast amounts of data (Big Data) and need to instrument this data to gain value from it. Data is generated from various sources in an organization, such as mobile applications, user activity, user purchases, e-commerce applications, social media, and so on. Streamed data is often generated as continuous trickle feeds, such as log files or small bytes of data at rapid intervals, and this data needs to be instrumented and processed on the fly so that analytical decisions can be made very quickly, helping us identify what is happening right now.

A few examples of streaming data include:

    1. A company wants to instrument search and clickstream analytics in an analytical layer quickly, to understand search and click patterns.
    2. Social media trends, such as Twitter mentions and Facebook likes and shares, need to be instrumented and analyzed to know the product trends.
    3. Analyzing user activity in mobile applications, to understand how users interact with the application and to customize and offer different experiences in the mobile app.
    4. Analyzing event data generated from different hardware, such as sensors and IoT devices, to make timely decisions such as device replacement on hardware failures, weather forecasts, etc.
    5. Various use cases of real-time alerting and response during a specific event, e.g., taking necessary action when inventory runs low during a Black Friday sale.



Real time streaming with AWS services

AWS offers several managed real-time streaming services that allow us to stream, analyze, and load data into analytics platforms. The streaming services that AWS offers include:

  1. Amazon Kinesis Data Streams
  2. Amazon Kinesis Data Firehose
  3. Amazon Kinesis Data Analytics
  4. Amazon Kinesis Video Streams

Kinesis Data Streams enables us to capture large amounts of data from different data producers and stream it into custom applications for data processing and analytics. By default, data is available for 24 hours, but it can be retained for up to 7 days (168 hours).

Kinesis Data Firehose is a data ingestion product that is used for capturing and streaming data into storage services such as S3, Redshift, Elasticsearch, and Splunk. Data can be ingested into Firehose directly using the Firehose APIs, or Firehose can be configured to read from Kinesis Data Streams. We discuss Kinesis Data Firehose in more detail in this article.

Kinesis Data Analytics enables us to analyze streaming data by building SQL queries, using built-in templates and functions that allow us to transform and analyze streaming data. The data source for Kinesis Data Analytics can be either Kinesis Data Streams or Kinesis Data Firehose.

Kinesis Video Streams is one of the recent offerings that enables us to securely stream video from different media devices for analytics and processing.

As all of these Kinesis services are managed services, they elastically provision, deploy and scale the infrastructure needed to process the data.

Deep Dive into Kinesis Data Firehose


With Kinesis Data Firehose, it is very easy to configure data streaming and start instrumenting data in a couple of minutes. Let’s take a look at the Firehose UI and start configuring our first streaming application.

Step 1: Define stream and configure Source:

We define a delivery stream name and the source from which the stream gets data.

Kinesis Data Firehose can be configured to read data from Kinesis Data Stream or leverage Firehose APIs to write to the Firehose stream directly.

Leveraging the Firehose APIs to write to the stream directly is called Direct PUT. When this option is chosen, we can write to the Firehose stream directly, or we can configure it to stream data from other AWS services such as AWS IoT, CloudWatch Logs, and CloudWatch Events.

When Kinesis Data Stream is chosen as the source, the Kinesis stream must be created before the Firehose delivery stream. It is also important to know that the source of a Firehose delivery stream cannot be changed once it is created.

Firehose can handle 5,000 records per second when configured with the Direct PUT method. Records can also be batched together using the PutRecordBatch method, which lets us batch up to 500 records together and send them to Firehose in a single call.
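The 500-record limit suggests chunking client-side before calling PutRecordBatch; a minimal sketch (the helper name is mine, not a Firehose API):

```python
def chunk_records(records, batch_size=500):
    """Yield lists of at most batch_size records, since PutRecordBatch
    accepts up to 500 records per call."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]
```

Each yielded chunk would then be passed as the Records parameter of a PutRecordBatch call.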

Step 2a: Record transformation

With Kinesis Data Firehose, we can transform records using custom AWS Lambda functions and can also convert records to open source formats that are very efficient to query.

While transforming Kinesis records with Lambda, we can write custom logic to process the records. Each record is identified by a unique recordId; we apply the custom logic to every record and return whether the record is valid for further processing or should be rejected.

This is indicated in the result field of the payload; the record is valid if the result is ‘Ok’ and invalid otherwise.

Here is a nodejs based sample AWS Lambda code that is used for validation:
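The Node.js screenshot is not reproduced here; an equivalent sketch in Python follows the transform contract described above (the validation rule itself, rejecting empty payloads, is purely illustrative): each input record arrives base64-encoded, and each output record must echo the recordId with a result of ‘Ok’ or ‘ProcessingFailed’.

```python
import base64

def lambda_handler(event, context):
    output = []
    for record in event['records']:
        payload = base64.b64decode(record['data']).decode('utf-8')
        # Illustrative validation: reject empty payloads.
        result = 'Ok' if payload.strip() else 'ProcessingFailed'
        output.append({
            'recordId': record['recordId'],
            'result': result,
            # Data must be returned base64-encoded, even if unchanged.
            'data': base64.b64encode(payload.encode('utf-8')).decode('utf-8'),
        })
    return {'records': output}
```

Records marked ‘ProcessingFailed’ are not delivered to the destination and end up in the error output configured for the stream.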


Step 2b: Convert record formats

With record format conversion enabled, it is possible to convert records to open source columnar formats such as Apache Parquet or Apache ORC, which are far more efficient to query than JSON.

In order to enable record format conversion, it is mandatory to predefine an AWS Glue database and Glue tables with the table structure.

It is important to note that once record format conversion is enabled, we can only deliver records to S3. Other destinations such as Redshift, Elasticsearch, and Splunk are not currently supported.

Step 3a – Delivering data to Amazon S3:

Once record transformation and record processing are defined, we can deliver records to a destination. When choosing S3 as the target, we need to choose an S3 bucket, and we can optionally choose a prefix to which data will be delivered.

Note that records will be delivered into time-partitioned folders in S3.

In our example, records will be delivered under the prefix myfirststreambucket/test/year/mon/day/hour/myfirst-firehose-stream-xxxx


The partitions generated by Firehose are not compatible with Hive partitions, so if Firehose-generated records are to be queried with Athena, Hive, or Glue crawlers, we need to convert the Firehose partition layout to one that is compatible with Hive. This can be done by triggering an AWS Lambda function that converts Firehose partitions to Hive partitions.
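Such a Lambda mostly amounts to renaming S3 keys; a minimal sketch of the key rewrite (the function name is mine, and the prefix layout is assumed from the example above):

```python
import re

def to_hive_key(key):
    """Rewrite Firehose's YYYY/MM/DD/HH prefix into Hive-style
    year=YYYY/month=MM/day=DD/hour=HH so Athena/Hive/Glue can use it."""
    return re.sub(
        r'(\d{4})/(\d{2})/(\d{2})/(\d{2})/',
        r'year=\1/month=\2/day=\3/hour=\4/',
        key,
        count=1,
    )
```

The actual Lambda would be triggered on object creation and copy each newly delivered object to the rewritten key (optionally deleting the original).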

Step 3b – Delivering data to Amazon Redshift

In order for Firehose to deliver data to Amazon Redshift, an intermediate S3 bucket is required to stage data, in addition to the Redshift cluster details.


Firehose generates manifest files in a folder within the configured S3 bucket for every file that is written to S3. The manifest file is referenced in the copy command for loading data into Redshift.

Additional COPY parameters can be specified in the COPY options field; they are appended to the COPY command before it is executed.

When a file fails to load into Redshift, Firehose retries loading the record for the specified retry duration, after which it skips the record. Error details for the failure can be found in the STL_LOAD_ERRORS table in Redshift.

Step 3c – Delivering data to Amazon Elasticsearch service and Splunk

Similar to delivering data to Amazon Redshift, data can be loaded to Elasticsearch and Splunk by configuring the Elasticsearch / Splunk endpoints.

It is also required to specify the S3 bucket details used for intermediate backup.



Step 4 – Buffering, Logging and Permissions

Once the source, processing, and destination are configured, it is necessary to configure the buffer conditions, logging, and the permissions Firehose needs to load data into the target destination.

We can specify the buffer size and intervals, and Firehose will buffer data and send it to the target when either of the conditions are met.


Firehose buffers data until the buffer size or the buffer interval conditions are met, once either of the conditions are satisfied, data is delivered to the destination.

In this case, Firehose buffers up to 5 MB of data or 300 seconds; once either condition is met, the data is delivered to the destination.

Firehose offers compression in GZip, Snappy or Zip formats, and data can also be encrypted using a KMS master key.

When error logging is enabled, Firehose creates a separate CloudWatch log group and logs the state of the delivery stream for each execution.

For delivering data from a source to a target, the necessary permissions need to be granted to Firehose. This is handled via IAM roles.

Step 5 – Review and send data.

Once the above configuration is defined correctly, we can now start to send data into Firehose using Firehose APIs.

We will see how we can use Firehose API commands to ingest data.

Kinesis Put Record Example:
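The screenshot of the call is not reproduced here; a boto3-based sketch of PutRecord (the stream name follows the example above, and the delimiter helper is mine) might look like this:

```python
def make_record(payload):
    """Firehose does not add delimiters between records, so embed a
    newline in each record before sending it."""
    return {'Data': (payload + '\n').encode('utf-8')}

def send_to_firehose(stream_name, payload):
    # boto3 is imported here so the helper above can be used without it.
    import boto3
    client = boto3.client('firehose')
    response = client.put_record(
        DeliveryStreamName=stream_name,
        Record=make_record(payload),
    )
    return response['RecordId']
```

For example, send_to_firehose('myfirst-firehose-stream', 'hello') would write one newline-terminated record to the delivery stream.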

On successful execution, the API returns a RecordId that is similar to the one below.

By default, Firehose does not add a record delimiter when multiple PutRecord calls are made; it delivers the records appended to each other. It is therefore necessary to embed any required delimiter within the record itself.


This delivers the following output to S3 with the appropriate delimiters.

When Kinesis Data Stream is chosen as a source, Firehose scales elastically based on the number of shards defined in the kinesis stream.

We saw how Kinesis Data Firehose can be used to effectively ingest data. By combining the three Kinesis services (Kinesis Data Streams, Kinesis Data Analytics, and Kinesis Data Firehose), it is possible to create an end-to-end ETL pipeline for streaming data.

About the Author:
Srivignesh KN (@srivigneshkn) is a Senior Data Engineer at Chegg, where he builds and architects data platforms in the cloud. Sri is an avid AWS user; he recently presented at the AWS Community Day on real-time analytics, and the presentation can be found here.

Code assistance for boto3, always up to date and in any IDE

21. December 2018

If you’re like me and work with the boto3 sdk to automate your Ops, then you probably are familiar with this sight:


No code completion! It’s almost as useful as coding in Notepad, isn’t it? This is one of the major quirks of the boto3 SDK. Due to its dynamic nature, we don’t get the code completion we are used to with other libraries.

I used to deal with this by going back and forth with the boto3 docs. However, this impacted my productivity by interrupting my flow all the time. I had recently adopted Python as my primary language and had second thoughts on whether it was the right tool to automate my AWS stuff. Eventually, I even became sick of all the back-and-forth.

A couple of weeks ago, I thought enough was enough. I decided to solve the code completion problem so that I never have to worry about it anymore.

But before starting it, a few naysaying questions cropped up in my head:

  1. How would I find time to support all the APIs that the community and I want?
  2. Will this work be beneficial to people not using the X IDE?
  3. With 12 releases of boto3 in the last 15 days, will this become a full time job to continuously update my solution?

Thankfully, I found a lazy programmer’s solution that I could conceive in a weekend. I put up an open source package and released it on PyPI. I announced this on reddit and within a few hours, I saw this:


Looks like a few people are going to find this useful! 🙂

In this post I will describe botostubs, a package that gives you code completion for boto3, all methods in all APIs. It even automatically supports any new boto3 releases.

Read on to learn a couple of less-used facilities in boto3 that made this project possible. You will also learn how I automated myself out of the job of maintaining botostubs by leveraging a simple deployment pipeline on AWS that costs about $0.05 per month to run.

What’s botostubs?

botostubs is a PyPI package which you can install to get code completion for any AWS service. You install it in your Python runtime using pip, add “import botostubs” to your scripts along with a type hint for your boto3 clients, and you’re good to go:


Now, instead of “no suggestions”, your IDE can offer you something more useful like:


The parameters in boto3 are dynamic too, so what about them?

With botostubs, you can now get to know which parameters are supported and also which are required or optional:


Much more useful, right? No more back-and-forth with the boto3 docs, yay!

The above is for Intellij/PyCharm but will this work in other IDEs?

Here are a couple of screenshots of botostubs running in Visual Studio Code:



Looks like it works! You should be able to use botostubs in any IDE that supports code completion from python packages.

Why is this an issue in the first place?

As I mentioned before, the boto3 SDK is dynamic, i.e. the methods and APIs don’t exist as code. As it says in the guide,

It uses a data-driven approach to generate classes at runtime from JSON description files …

The SDK maintainers do it to be able to enhance the SDK reliably and faster. This is great for the maintainers but terrible for us, the end users of the SDK.

Therefore, we need statically defined classes and methods. Since boto3 doesn’t work that way, we need a separate solution.

How botostubs works

At a high level, we need a way to discover all the available APIs, find out about the method signatures and package them up as classes in a module.

  1. Get a boto3 session
  2. Loop over its available clients
  3. Find out about each client’s operations
  4. Generate class signatures
  5. Dump them in a Python module
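Steps 4 and 5 boil down to string generation; a simplified sketch (the helper name is mine, and the real botostubs emits much richer signatures):

```python
def render_stub(service_name, operations):
    """Render a class with one stub method per operation (step 4); the
    resulting source string can be dumped into a .py module (step 5)."""
    lines = ['class %s:' % service_name.capitalize()]
    for op in operations:
        lines.append('    def %s(self, **kwargs):' % op)
        lines.append('        pass')
    return '\n'.join(lines)
```

For example, render_stub('s3', ['list_buckets', 'put_object']) yields a minimal S3 class that an IDE can index for completion.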

I didn’t know much about boto3 internals before so I had to do some digging on how to accomplish that. You can use what I’ve learnt here if you’re interested in building tools on top of boto3.

First, about the clients. It’s easy when you already know which API you need, e.g with S3, you write:

client = boto3.client('s3')

But for our situation, we don’t know which ones are there in advance. I could have hardcoded them, but I needed a scalable and foolproof way. I found out that the way to do that is with a session’s get_available_services() facility.


Tip: Much of what I’ve learnt has been through Intellij’s debugger. Very handy, especially when dealing with dynamic code.


For example, to learn what tricks are involved to get the dynamic code to convert to actual API calls to AWS, you can place a breakpoint in _make_api_call found in boto3’s client.py:


Steps 1 and 2 solved. Next, I had to find out which operations are possible in a scalable fashion. For example, the S3 API supports about 98 operations for listing objects, uploading and downloading them. Coding 98 operations is no fun, so I’m forced to get creative.

Digging deeper, I found out that clients have an internal botocore service model that had everything I was looking for. Through the service model you can find the service documentation, API version, etc.

Side note: botocore is a factored-out library that is shared with the AWS CLI. Much of what boto3 is capable of is actually powered by botocore.

In particular, we can read the available operation names. E.g the service model for the ACM api returns:


Step 3 was therefore solved with:


Next, we need to know what parameters are available for each operation. In boto parlance, they are called “input shapes.” (Similarly, you can get the output shape if needed.) Digging some more in the service model source, I found out that we can get the input shape with the operation model:


This told me the required and optional parameters. The missing part of generating the method signatures was then solved. (I don’t need the method body since I’m generating stubs)

Then it was a matter of generating classes based on the clients and operations above and package them in a Python module.

For any version of boto3, I had to run my script and then the twine PyPI utility, and out came a PyPI package that’s up to date with upstream boto3. All of that took about 100 lines of Python code.

Another problem remained to be solved, though: with a new boto3 release every time you change your t-shirt, I would need to run the script and re-upload to PyPI several times a week. Wouldn’t this become a maintenance hassle for me?

The deployment pipeline

To solve this problem, I looked to AWS itself. The simplest way I found was to use their build tool and invoke it on a schedule. What I wanted was a way to get the latest boto3 version, run the script, and upload the artefact to PyPI, all without my intervention.

The relevant AWS services to achieve this are CloudWatch Events (to trigger other services on a schedule), CodeBuild (a managed build service in the cloud), and SNS (for email notifications). This is what the architecture looks like on AWS:


Image generated with viz-cfn

The image above was generated from the CloudFormation template used for deployment, which is on GitHub along with the code.

The AWS Codebuild Project looks like this:



To keep my credentials outside of source control, I also attached a service role to give CodeBuild permissions to write logs and read my PyPI username and password from the Systems Manager parameter store.

I also enabled the build badge feature so that I can show the build status on Github:




For intricate details, check out the buildspec.yml and the project definition.

I wanted this project to be invoked on a schedule (I chose every 3 days), which I accomplished with a CloudWatch Event rule:


When the rule gets triggered, I see that my CodeBuild project does what it needs to do: clone the git repo, generate the stubs, and upload to PyPI:


This whole process is done in about 25 seconds. Since this is entirely hands-off, I needed some way to be kept in the loop. After the build has run, another CloudWatch Event gets triggered for build events on the project. It sends a notification to SNS, which in turn sends me an email to let me know if everything went OK:


The build event and notification.


The SNS Topic with an email subscription.

That’s it! But what about my AWS bill? My estimate is that it should be around $0.05 every month. It will definitely not break the bank, so I’m pretty satisfied with it! Imagine how much it would cost to maintain your own build server to accomplish all of that.

What’s with the weird versioning?

You will notice botostubs versions look like this:


It currently follows boto3 releases in the format 0.4.x.y.z. Therefore, if botostubs is currently at 0.4.1.9.61, it means that it offers whatever is available in boto3 version 1.9.61. I included the boto3 version in mine to make it more obvious what version of boto3 botostubs was generated from, but also because PyPI does not allow uploads at the same version number.

Are people using it?

According to pypistats.org, botostubs was downloaded about 600 times in its first week, after I showed it to the Reddit community. So it seems it was a much-needed tool:


Your turn

If this sounds like something you'll need, get started by running:

pip install botostubs

Run it and let me know if you have any advice on how to make this better.


Huge thanks goes to another project called pyboto3 for the original idea. The issues that I had with it were that it was unmaintained and supported legacy Python only. I wouldn't have known that this would be possible were it not for pyboto3.

Open for contribution

botostubs is an open source project, so feel free to send your pull requests.

A couple of areas where I’ll need some help:

  • Support Python < 3.5
  • Support boto3 high-level resources (as opposed to just low-level clients)


In this article, I've shared my process for developing botostubs by examining the internals of boto3 and automating its maintenance with a deployment pipeline that handles all the grunt work. If you like it, I would appreciate it if you shared it with a fellow Python DevOps engineer.


I hope you are inspired to find solutions for AWS challenges that are not straightforward and share them with the community.

If you used what you've learnt above to build something new, let me know; I'd love to take a look! Tweet me @jeshan25.

About the Author

Jeshan Babooa is an independent software developer from Mauritius. He is passionate about all things infra automation on AWS, especially with tools like CloudFormation and Lambda. He is the guy behind LambdaTV, a YouTube channel dedicated to teaching serverless on AWS. You can reach him on Twitter @jeshan25.

About the Editors

Ed Anderson is the SRE Manager at RealSelf, organizer of ServerlessDays Seattle, and occasional public speaker. Find him on Twitter at @edyesed.

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.

Thinking About EFS? Ten Reasons It May Not Be a Fit

20. December 2018 2018 0

Is “Elastic” Enough?

Before we talk about why Amazon’s Elastic File System (EFS) might not be a fit in some environments, let’s discuss why it might. EFS is very good at what it was designed to do. It was designed to be a simple, scalable, elastic file store for Linux clients running NFS 4. AWS describes several use cases for this functionality in their EFS overview, including big data analytics, web serving and content management, application test and development, media and entertainment, and database backup. If your target application is Linux-only and fits somewhere in the list of use cases, EFS may well be what you’re looking for.

On the other hand, what EFS was not designed to be, as many EFS users have discovered, is a fully-featured, multi-protocol enterprise file system. Also, it’s not cheap.

So, what are the key feature gaps in EFS? Here’s my “Top 10 (plus a few)” list:

  1. Support for native Windows file systems (SMB/CIFS): Amazon is quite clear that EFS support is limited to Linux clients and NFS 4. Windows clients are not supported, even if they're running the NFS client (this isn't Amazon's fault; the Windows NFS client is, uh… "problematic" running NFS v4).
  2. Support for Active Directory (AD): This is related-but-separate from SMB/CIFS support and is particularly problematic for organizations that deploy AD for enterprise-wide authentication. Since EFS doesn’t support AD, adopting EFS means that (at least) a subset of the permissions structure needs to be duplicated into NFS-style for use with EFS. And then (and here’s the kicker), the duplicated permissions need to be kept current with any changes to the underlying AD structure.
  3. File system quotas: Independent of client OS-specifics, many organizations use file system quotas to manage unstructured data “sprawl” in their file systems, and EFS does not support file system quotas. So, if you deploy EFS, you’ll need another tool.
  4. Flexible local data replication: Making application-consistent and/or file system consistent copies of data to enable backup, test & dev, data analytics, and a host of other uses has been a mainstay of enterprise storage for something like twenty years. EFS does not offer any native snapshot or clone capability.
  5. Remote replication: Similarly, the capability to replicate data between geographically distributed data centers in order to provide availability in case of regional disasters (e.g., hurricanes, earthquakes, tornados, floods, etc.) has also been a critical component of enterprise business continuity planning for decades, but this functionality isn’t available in EFS either. And that’s pretty ironic, considering what ubiquitous cloud computing has done for the affordability of effective disaster recovery planning.
  6. User-managed encryption keys: I won't belabor how important data security is in this piece, but one key (if you'll pardon the pun) to security is to "trust no one." Everyone agrees that both in-flight and at-rest data needs to be encrypted but, following the "trust no one" adage, when the user controls the encryption keys, even "bad actors" with physical access to storage infrastructure can't access secure data. Most storage solutions, including EFS, only support vendor-managed encryption keys.
  7. Non-disruptive, dynamic volume migrations: Think of this as “Whoops!” insurance. E.g., what if you realize that you need to change volumes after they’re already in use? You’d really like not to have to take the affected volume(s) down to make corrections but, like most cloud file systems, the only solution that EFS provides, in this case, requires provisioning new storage and running a user-managed migration. Whoops.
  8. Predictable performance immune from "noisy neighbors": Amazon goes to considerable lengths to minimize the effects that multiple tenants running multiple workloads on shared infrastructure inevitably have on each other. But the simple fact is that environments based on shared resources, like EFS, are potentially subject to "noisy neighbors" scenarios.
  9. Hybrid Cloud: This really comes down to support for simultaneous file system access from applications running in EC2 instances and applications running on-premises. EFS does support on-premises access to EFS via Direct Connect, but the use cases Amazon discusses are focused on copying data back-and-forth between on-premises and AWS. That’s not simultaneous access; processing data on-premises OR in the cloud is not the same as processing data on-premises AND in the cloud.
  10. Multi-Cloud: The more experience that organizations gain with public cloud computing, the more important avoiding cloud provider lock-in and minimizing "wasted cloud spend" becomes. Given that EFS doesn't even support simultaneous access between on-premises and AWS instances, you can feel pretty confident that it doesn't (and never will) support data sharing between AWS and, say, Azure or Google Cloud Platform. So, even if EFS meets all your other requirements, if you ever want to simultaneously present your file system(s) to multiple public cloud environments, EFS isn't the solution for you.
  11. Data availability across Virtual Private Clouds: AWS Virtual Private Clouds (VPCs) are exactly what they sound like: multiple private clouds belonging to a single organization, hosted by AWS. There are some reasons that enterprises choose to operate multiple VPCs, analogous to operating multiple on-premises private clouds, including (but in no way limited to) supporting multiple functional groups, business units, subsidiaries, etc. There are also technical requirements in some deployments, e.g., VMware Cloud on AWS, that may require multiple VPCs. If your organization deploys, or might in the future deploy, multiple VPCs and needs to share file system data between them, understand that EFS doesn't support that.
  12. Flexible Storage Media: If you’re only going to support one storage medium, EFS is certainly correct to support only flash. But I already mentioned, “not cheap,” right? On the other hand, however, if we can match application requirements to storage media, we can optimize price/performance to application requirements. And if we have Non-disruptive, dynamic volume migrations (see #7, above), we can always reconfigure the volumes to different media if the requirements change.

My final point is more speculative. With the announcement of FSx for Windows and AWS’s embrace of Samba to enable Linux connectivity to FSx for Windows, one could be forgiven for wondering how Amazon views EFS’s long-term prospects.

The bottom line is that EFS was designed for the “80” part of the old 80/20 rule; it is fit-for-purpose for recommended applications. The top-level mismatch between EFS and an industry understanding of enterprise file systems is that enterprise file systems are expected to cover ninety-nine-plus percent of enterprise use cases.

And that’s just not what EFS was designed for.

About the Author

Marc Leavitt, Senior Director of Product Marketing at Zadara (www.zadara.com), has more than twenty years' experience developing, architecting, selling, and marketing enterprise storage solutions for companies like EMC, Brocade, Western Digital, and QLogic.

Marc is a graduate of the University of California, Berkeley and currently resides in Irvine, California.

Serverless: From Azure to AWS

19. December 2018 2018 0

Over the last six months, I've had the opportunity to learn about the serverless ecosystem in AWS from the standpoint of someone who is quite familiar with the serverless ecosystem in Azure. I don't feel that I'm a cloud novice; at the same time, switching context from something very familiar to a whole new model had me feeling like a complete novice very early on. In some ways, I was probably hindered more than someone learning cloud and AWS completely fresh. I was also, subconsciously, trying to do things the Azure way instead of just following how AWS intends for things to be done. Early on I created a spreadsheet that mapped the equivalent component offerings between the two, which was a good first step toward not feeling so lost and frustrated.

The purpose of this advent day is not to go into the pros or cons of AWS or Azure. The hope is to provide a bridge between the two for developers who might be migrating from one to the other or needing to skill up reasonably quickly. I initially did this with a spreadsheet and a simple mapping (which got messy when I started looking into the networking… but you'll see that later). For this article, I've opted to stay within my wheelhouse: serverless offerings.

Developer Accounts and Dashboard Features

A lot of what I've learned about AWS and Azure came from my free time, using non-work, personal accounts. So, I'm going to highlight one of the most significant differences between the two cloud providers in how they handle their developer accounts.

AWS has a “free tier” which offers the lowest performance offerings of most of their proprietary services for free use, within limits for one month.  Sorry, no free Kubernetes clusters. I know some developers that create a new free tier account every month. Azure, on the other hand, offers every account $150.00 free usage per month indefinitely.  Both Azure and AWS allow you to set up warning thresholds on cost and usage limits to help protect you from a runaway experiment. Two very different models for letting developers get hands-on experience with their cloud offerings.

Amazon AWS and Microsoft Azure are similar yet different beasts at the same time. To me, it's pretty much summed up by the differences between operating in a Windows world vs. a Unix world. With AWS, you are in control of everything, and there's a lot of configuration that can and does happen that you are in charge of from the onset. With Azure, you have similar controls, but they are mostly hidden and abstracted behind a UI. You can dive behind the scenes with Azure and have as fine-grained control as you do in AWS with direct Azure API calls, but it's not the default experience and can sometimes take a frustrating amount of searching to find what is easy to find in AWS. This is highlighted quite nicely by a basic comparison between the Azure Portal Dashboard and the AWS Management Console, the two entry points to the providers.



Azure Portal


AWS Management Console

In my opinion, the Azure Portal is the equivalent of a code editing IDE (like Eclipse or Visual Studio), and the AWS Management Console is text editor (like Vim or Atom).

I bring up these basic differences between the two service providers because it's these philosophies and mindsets that permeate this overview. You can do the same things with either provider; they both work, just differently. In Azure, a lot of things are done for you "behind the scenes" and only available via API for manual configuration. In AWS, you are expected to wire up a lot of it yourself. In this article, I'll try to bridge the gap and provide a quick and rough translation between the two for one workflow.

The scope of this comparison is roughly around the domain of creating a hosted code API offering that would use an eventing model to trigger more hosted code to do some business logic on data and persist it in a relatively secure way.  To do this, you need an external IP address for the API, a message system, some code to listen to that message system and a data store. You’ll also need to have a smattering of networking infrastructure to secure and front things (Internet Gateway, v-net, maybe sub-nets, resource groups, instrumentation, logging…etc.).  This is in no way a comprehensive guide, but it should have the basics for you to have enough information to start asking harder questions of better-informed people.

Base Services

At the core, from my developer’s perspective, is code.  Both AWS and Azure have similar offerings with Lambda and Function Apps that support a wide range of common languages and platforms (C#, Java, JavaScript and Python to name a few).  Both of these offerings are pretty similar with the same feeling I outlined above. Function Apps can integrate with different Azure Eventing services and datastores through dropdowns, and Azure can wire things up for you quickly and easily.  Lambda has Blueprints and 3rd Party Templates that offer some similar functionality. Additionally, with Lambda you can construct everything and do the wiring yourself.

I’m really excited about hosted code in the form of Lambda and Function Apps.  I’ve been using Function Apps since they were in Beta. I loved the concept. I started from a new project in Visual Studio to have a service that did something within 5 minutes. My service did something real to the database within 15 minutes.  I will not lie, Lambda took me a lot longer. However, in the weekend I had devoted to playing with Lambda, I was able to create an Alexa skill for my partner to use and interact with at home… which is very different from having an API that does some serverless business logic.

If you are creating a message-based eventing model, you need some way to send and receive events. AWS and Azure offer everything from proprietary services to common consumer options, depending on how much you want to spend. As I'm experimenting on my own dime, I tend to go with free options and scale up once I have income to justify spending. Azure offers the proprietary options Event Hub, Event Grid and Service Bus. AWS offers the proprietary Simple Queue Service and Simple Notification Service. In the end, what type of eventing model you use will depend on a lot of different factors outside the scope of this article. Suffice to say, both have offerings for a simple, cheap prototype.

Most applications need to store data, so we need to be able to do some simple reading and writing with serverless offerings. I picked the two most common types: a SQL option and a non-SQL option (a document database). Amazon has RDS for its SQL offering and DynamoDB as a document datastore. Azure has SQL Server and the lesser-known document database Azure Cosmos DB. For the uses that I had in mind, all four of them performed similarly; I really couldn't tell them apart.

The fantastic part is that the three systems (hosted code, eventing, and data store) are all serverless. You can now use code to create, set up, and connect your systems. That code can be stored in your version control and can create your infrastructure at will, whenever and wherever you want.

The heart of a general event-based microservice pattern serverless implementation is to receive messages either through API or the event system, do something (get, save or manipulate data) and push another message.  That’s the simple part; the nuances and complexity start when you consider hardening, monitoring and reporting from your system.

Hardening and Supporting Services

As soon as I started looking at how to make my internal services invisible to the outside world, I started to see how fundamentally different AWS and Azure are.  This continued as I layered on monitoring and reporting, too. Both services give you ways to do it that make sense. They provide different routes and building blocks to accomplish the same thing.  This is a stark contrast to the data storage example, where it was pretty much a 1:1 Service to Service comparison. They both have offerings to make a robust and secure system that you can instrument well, it just starts to look very different.



For example, AWS promotes using Amazon CloudFront layered over an Amazon API Gateway that abstracts your AWS Lambda function. Note, the AWS section was a raw personal prototype; it might not have passed security review and would probably require more layers of AWS services to meet the security requirements that the Azure implementation did. Whereas in Azure, if you don't want to make your HTTP endpoint publicly visible, you have to use a Network Virtual Appliance, V-Nets, Subnets, an API Management Service, and probably a few other things that I've forgotten since setting it up a few years ago.

They both require User Management. Azure uses Active Directory whereas AWS uses Identity and Access Management. Both are similar functionally but slightly different in implementations and nuance. I want to stress that both allow you to do what you need to do, but how to accomplish this differs more between AWS and Azure because the foundational cloud philosophies and infrastructure are different.  E.g., if you build your app the way Azure or AWS wants you to think, the implementation makes complete sense, and then the other service implementation looks a little wonky.


All told, there aren't a lot of major differences between the AWS and Azure serverless offerings, based on my use cases. They are both great sets of services that give a developer a way to prototype something and get feedback and value very quickly. There are absolutely differences between them that stem from their origins and design choices/philosophies.

Sometimes those differences can be incredibly frustrating because you just want to do one simple thing that you could do in the other. AWS is the current industry standard, with the lion's share of the market, and Azure has a significant presence too. As someone who lives in Seattle, equidistant from both the Microsoft and Amazon campuses, it behooves me to know both providers: their strengths and their weaknesses. Over the last five years, it has also been nice to see them both evolving and learning from each other.

About the Author

Steve Kuo is a speaker, mentor, and developer who is driven to raise awareness that writing code is a craft that carries responsibility, both locally in Seattle and at conventions. He is active in the Seattle area facilitating CodeRetreats, running the Seattle and Eastside Code Crafter meetups, speaking about high-quality code techniques, talking about creating cultures of learning, and sharing stories with other technology industry professionals.

About the Editors

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.

Ed Anderson is the SRE Manager at RealSelf, organizer of ServerlessDays Seattle, and occasional public speaker. Find him on twitter at @edyesed.

Quick and easy BeyondCorp BackOffice access with ALBs, Cognito and GSuite

18. December 2018 2018 0

For some values of quick and easy.


LoveCrafts has several services which are currently hosted behind several different VPNs. VPN access is managed via LDAP which is managed by Engineering/DevOps.

Historically, we have not been notified of company leavers in a timely fashion, which is an obvious security hole, as VPN access (can) permit access to privileged resources within our hosting environment.

This includes but is not limited to:

  • Grafana
  • Kibana
  • Jenkins

For a while, we had been discussing a Single Sign-On (SSO) system to manage access to all these disparate systems. We use Google GSuite for corporate mail. Our Human Resources Team manually add and remove people as they join and leave. So it seemed obvious to treat Google as our single source of truth (at least for now).

In June 2018, AWS announced the integration of Cognito and JWT Authorisation within their Application Load Balancers (ALBs). [1]

This would allow any Web based back office services to be put behind a public facing ALB with Cognito Authorisation via GSuite.

This probably equates to 90% of our corporate VPN traffic. Theoretically, we should then be able to reduce the remaining VPN usage to emergency SSH/RDP only. We could limit SSH access as much as possible with other tools, such as SSM Manager Console.

Integrating with GSuite gets LoveCrafts significantly closer to a full SSO.

Caveat Developer: Google GSuite is being used here, but Cognito supports multiple OAuth2 sources, including Amazon, Facebook, OpenID or indeed any OAuth2/SAML provider.

The following code has been reverse engineered from our Puppet managed configuration. I have modified these to work without Puppet so there may be some inconsistencies to the following examples.

Initial Proof of Concept

To test feasibility, I used a test AWS account and created the following:

  • Cognito User Pool
  • Cognito App Client
  • Application Load Balancer(ALB)
  • Google OAuth2 Client Credentials

The ALB was configured with a separate CNAME to an existing service.

The Google OAuth2 Client credentials were configured and added to the Cognito User Pool in the testing account.

With authentication enabled, all HTTPS access to the ALB was redirected to a Google auth page and then redirected back to the ALB once sign-in was complete.

Transparent access worked fine, and a user was added to the Cognito Pool.

Once authenticated, access was allowed to the protected resource; otherwise, the user was repeatedly presented with a Google authentication page.

The Good

Works transparently without having to write any app-specific code. Zero to up and running in ~5mins.

AWS ALB passes the user profile data in an X-Amzn-Oidc-Data HTTP header that the app/nginx etc. can access (although it is base64 encoded JSON).
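For example, the profile segment of that header can be inspected with nothing but the standard library. This is a sketch only: it deliberately skips signature validation, which the rest of this article is about, so use it for inspection, never authorisation.

```python
import base64
import json

def oidc_profile(oidc_data: str) -> dict:
    # X-Amzn-Oidc-Data is a JWT: <b64 header>.<b64 payload>.<signature>.
    # Decode the payload segment WITHOUT verifying the signature.
    payload = oidc_data.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore any stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))

# Build a fake token to demonstrate (real tokens come from the ALB):
claims = {"email": "bob@mycorp.com", "given_name": "Bob"}
body = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode().rstrip("=")
print(oidc_profile("header." + body + ".signature")["email"])  # -> bob@mycorp.com
```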

The Bad

Any Google account permits access. (This service is designed to allow app developers to pass off user management via Google, Twitter, Facebook or any OAuth2/OpenID platform and store in Cognito.)

The app needs to validate the JWT token to prove the authenticity of the X-Amzn-Oidc-Data HTTP header, which is great because we're already using an nginx JWT auth library…

The Ugly

Initially, it was relatively trivial to get Nginx to decode the X-Amzn-Oidc-Data Header, extract Username/email/firstname/lastname and pass as separate headers to the downstream app.

However, you need to validate the signature of the JWT token to ensure it's genuine, unexpired (i.e., the session is still valid), and not spoofed.

Amazon chose to use ES256 signatures for the JWT, which the nginx lua library we've been using doesn't support, and I couldn't find one that supported any Elliptic Curve crypto signatures. Well, there was a Kong version of nginx, but I didn't want to attempt backporting.

What follows is an explanation of the solution I ended up writing: a python sidecar to handle the JWT validation and user data extraction, with the functionality encapsulated in a new lua module for nginx.

Once/If the nginx lua JWT module improves to support ES crypto, this could be deprecated in favour of a fully lua based module.

For speed of development, I chose to write a python sidecar app to validate the JWT token and return HTTP headers back to nginx. The HTTP status code indicates whether the JWT token validated correctly.

The python app runs under gunicorn. It needs to be run under python3, as again, python2 doesn’t have support for the crypto libraries in use.

Using a python app also allows you to expand the features, for example adding group memberships from an LDAP service as extra headers.

I finally settled on the PyJWT library as it compiled and performed several orders of magnitude faster than a userland version (typically less than 1ms, compared to 150ms+). Speed is critical here, as the JWT token needs to be validated for every single request crossing the ALB.

Basic Implementation

To follow along you will need:

  • A Google GSuite account and developer access
  • An AWS account with an ALB and a Cognito Pool
  • nginx with lua support
  • python3

We’re going to run a python3 sidecar AuthService that validates the JWT token and passes the validated headers back to nginx. Nginx will then forward those headers to your own application behind the ALB. The application does not need to know anything about how the authentication is done and could even be a static site.

Applications such as Grafana and Jenkins can use the Proxy Headers as a trusted identity.

The AuthService sidecar runs locally alongside the nginx instance and has strictly controlled timeouts. If JWT authorisation is required and the service is down, nginx will serve a 503: Service Unavailable. If the user is authenticated but not in the list of approved domains, nginx will serve a 401: Access denied.
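That decision logic can be sketched as follows (the function and parameter names are my own, not the real module's API):

```python
def auth_status(sidecar_up: bool, jwt_valid: bool, email: str, valid_domains) -> int:
    # Sketch of the behaviour described above:
    # sidecar down -> 503, bad token or wrong domain -> 401, else allow.
    if not sidecar_up:
        return 503  # Service Unavailable
    if not jwt_valid or email.split("@")[-1] not in valid_domains:
        return 401  # Access denied
    return 200

print(auth_status(True, True, "bob@mycorp.com", {"mycorp.com"}))  # -> 200
```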

Below shows the standard request path for an initial login to a Cognito ALB.

Data flow diagram showing the interaction between the browser and components

Nginx and AuthServices are the two components we need to build to validate the JWT token.

Keyserver is a publicly accessible location to retrieve the public key of the server that signed the JWT token. The key id is embedded in the X-Amzn-Oidc-Data header. The python app caches the public keys in memory.
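As a sketch of that step, the key id ("kid") can be pulled from the header segment of X-Amzn-Oidc-Data to build the key URL. The URL format below follows AWS's documented per-region ALB public-key endpoint; verify it against current docs before relying on it:

```python
import base64
import json

def signer_key_url(oidc_data: str, region: str) -> str:
    # The "kid" lives in the first (header) segment of the JWT; ALB public
    # keys are served from a regional endpoint keyed by that id.
    seg = oidc_data.split(".")[0]
    seg += "=" * (-len(seg) % 4)  # restore any stripped base64 padding
    kid = json.loads(base64.urlsafe_b64decode(seg))["kid"]
    return "https://public-keys.auth.elb." + region + ".amazonaws.com/" + kid

# Fake header segment to demonstrate (real ones come from the ALB):
hdr = base64.urlsafe_b64encode(json.dumps({"kid": "abc-123"}).encode()).decode().rstrip("=")
print(signer_key_url(hdr + ".payload.sig", "eu-west-1"))
```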

Creating GSuite OAuth2 Credentials

Login into the Google Developers Console and create an app to use for authentication.

Create OAuth Client Credentials for your app.

Create OAuth Client Credentials

Create a set of web application credentials.

Create set of web application credentials

Copy your Client ID and Secret

Copy your Client ID and Secret

Configure Cognito

If you don’t already have a Cognito User Pool create one.

Choose the domain name that Cognito will reserve for you. This is where your users will get directed to log in. (You can use your own domain, but that is beyond the scope of this tutorial.)

Pick your domain prefix.

N.B. The full domain needs to be added to the Google Developer Console as a permitted callback location for your OAuth Web Client app.

Configure Google as your identity provider. Paste in your Client ID and Secret from Google here.

Configure the ALB Endpoints for the Cognito App Client.

If, for example, your test application is being hosted on testapp.mycorp.com:

  • Your Callback URLs will be https://testapp.mycorp.com,https://testapp.mycorp.com/oauth2/idpresponse
  • The /oauth2/idpresponse url is handled by the ALB internally, and your app will not see these requests.[2]
  • Your Sign out URL will be https://testapp.mycorp.com

You can keep appending more ALBs and endpoints to this config later, comma separated.

Configure ALB

Now we can configure the ALB to force authentication when accessing all or part of our Web app.

On your ALB, select the listeners tab and edit the rules for the HTTPS listener (you can only configure this on an HTTPS listener).

Add the Cognito pool and app client to the ALB authenticate config

The Cognito user pool is from our previous step, and the App client is the client configured within the Cognito User Pool.

I reduced the session timeout to approximately 12 hours, as the default is 7 days.

From this point on, the ALB only ensures that there is a valid session with any Google account, even a personal one. There is no way to restrict which email domains to permit in Cognito.

Configure Nginx

You will need nginx running with lua support and the resty.http lua package available as well as this custom lua script:


Our code is configured and managed by Puppet, so you will need to substitute appropriate values for your environment (timeouts, valid_domains, etc.).

Inside your nginx http block:

 lua_package_path "<>/?.lua;;";

Then inside your server block add the following access_by_lua code to your location block:

location / {
    access_by_lua '
        local jwt = require("nginx-aws-jwt")
        jwt.auth{auth_req=false}
    ';
}

auth_req defaults to true. If true, this will issue a 401: Access denied unless a valid AWS JWT token exists and the user’s email address is in the list of valid_domains,e.g. (mycorp.com, myparentcorp.com)

The false setting, as shown, enables a soft launch and will instrument the backend request with extra headers if a valid JWT token is present and otherwise permit access as normal.

The only other parameter currently supported is valid_domains, which is used as follows:

location / {
    access_by_lua '
        local jwt = require("nginx-aws-jwt")
        -- example domains; substitute your own
        jwt.auth{valid_domains="mycorp.com,myparentcorp.com,myothercorp.com"}
    ';
}

The above example would permit any users from the three defined GSuite domains access.

Starting the sidecar JWT validator

The python app is tested on python3.6 with the following pip packages


gunicorn was launched using the following gunicorn.ini file and commands:


ARGS="--config /etc/gunicorn/gunicorn.ini --env LOG_LEVEL=debug --env REGION=eu-west-1 --env LOGFILE=/var/log/lovecrafts/awsjwtauth/app.log ${APP}"

${DAEMON} --pid ${PID_FILE} ${ARGS}

Confirming it all works

Well, the obvious thing first, hitting the ALBs DNS name, should get you redirected to authenticate with Google and then redirect you back to your test application.

In our setup nginx is proxypassing to our test app so we can inspect the headers that the app sees post authentication by running the following on the instance behind the ALB:

$ ngrep -d any -qW byline '' dst port 3000

T -> [AP]
GET /favicon.ico HTTP/1.1.
Host: testapp.example.com.
X-Forwarded-Host: testapp.example.com.
X-Forwarded-Port: 443.
X-Forwarded-Proto: https.
X-Amzn-Trace-Id: Root=1-12345678-1234567890123456789012345.
X-Amzn-Oidc-Data: <Base64 encoded json key/signer details>.<Base64 encoded json profile data>.<signature>.
X-Amzn-Oidc-Identity: 55cf11c1-1234-1234-1234-68eaaa646dbb.
X-Amzn-Oidc-Accesstoken: <Base64 JWT Token Redacted>
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0) Gecko/20100101 Firefox/63.0.
accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8.
accept-language: en-US,en;q=0.5.
accept-encoding: gzip, deflate, br.
X-LC-Sid: 123412345123457114928da7eab8a01eda6ca38.
X-LC-Rid: 1234123451234582900d3fc4554225bd338edc4.
X-Auth-Family-name: Brockhurst.
X-Auth-Email: bob@mycorp.com.
X-Auth-Given-name: Bob.
X-Auth-Picture: https://lh5.googleusercontent.com/-12345678901/AAAAAAAAAAA/AAAAAAAAAAA/123-123434556/123-1/photo.jpg.

If validation fails or no token is present, the X-Auth-* headers will not be added. This assumes you've set auth_req=false, making authentication optional.

If auth_req=true, then on a validation failure or a missing X-Amzn-Oidc-Data header nginx returns a 401, and no request is made to the proxied backend.

And a quick look at the python app.log:

{"@timestamp":"2018-12-04 16:06:04,809", "level":"WARNING", "message":"Unauthorised access by: unknown_user@gmail.com", "lc-rid":"bf8794defb2885e48eb37552e96545b2cfedec98", "lc-sid":"c4dc0ae13deb4696baa4c3920ffcbbdbf25c71df"}
# When a user not in the valid_domains list attempts to access.

{"@timestamp":"2018-12-04 16:07:06,342", "level":"ERROR", "message":"Error Validating JWT Header: Invalid crypto padding", "lc-rid":"d23ecd55123abb4efff0091b37f8f6161b98218c", "lc-sid":"c4dc0ae13deb4696baa4c3920ffcbbdbf25c71df"}
# Several variants on the above, based on signature failures, corrupted/tampered headers etc.

{"@timestamp":"2018-12-04 16:08:02,738", "level":"INFO", "message":"No JWT Header present", "lc-rid":"895d1055e8f821a75e691ea1f25c29f182131030", "lc-sid":"c4dc0ae13deb4696baa4c3920ffcbbdbf25c71df"}
# INFO messaging only in dev, useful for debugging.

Further debug information can be output in the nginx error log by setting the log level to info.


Apart from normal nginx monitoring, the authentication sidecar app generates statsd metrics, published to the local statsd collector and prefixed with awsjwtauth.

These include counts of error conditions and successes, app restarts, etc.

It also sends timing information for its only downstream dependency, the AWS ALB keyserver service.

Example Grafana Dashboard

This dashboard shows that we typically handle the authentication step in the python application in under 1 ms. The spikes to approximately 100 ms occur when the ALB has switched the key signer, so we had to fetch the public key from the signer again. The python app caches the public key in memory (see the cache hit/miss graph).

Other Notes

ALB authentication only works on HTTPS connections, so if you also have an HTTP listener, it should redirect to HTTPS. This can be configured on the ALB's HTTP listener.

Taking it further

In this article, I've described how to achieve a minimal-overhead OAuth2 SSO implementation for securing services that organisations would typically put on an internal network or behind a VPN.

Other features that could be added include using the JWT token in headers to enumerate the Google Groups the user is a member of to restrict access further, looking up group memberships in your own internal systems such as LDAP or Active Directory, validating that the device/browser is secure and up to date before allowing access[3], and probably many more things I haven’t thought of yet.


[1] https://aws.amazon.com/blogs/aws/built-in-authentication-in-alb/

[2] https://docs.aws.amazon.com/elasticloadbalancing/latest/application/listener-authenticate-users.html

[3] https://github.com/Netflix-Skunkworks/stethoscope-app

About the Author

Andy ‘Bob’ Brockhurst (@b3cft) is the Head of Infrastructure Architecture and Security at LoveCrafts Collective, a combination of social network, digital marketplace, online media, and e-commerce site to deliver everything makers need to celebrate, share advice and buy supplies for their craft.

Bob has worked with computers for more than 25 years including many years at The BBC and Yahoo!, and is finding it increasingly difficult to explain to his family what he actually does for a living.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps.

Time Series Anomaly Detection with LSTM and MXNet

As software engineers, we try our best to make sure that the solutions we build are reliable and robust. Monitoring the production environment, with reasonable alerts and timely actions to mitigate and resolve issues, is part of what it takes to keep our customers happy. Monitoring can produce a lot of data – CPU, memory, and disk IO are the most commonly collected metrics for hardware infrastructure. However, in most cases you never know what the anomaly is, so the data is not labeled.

We decided to take a common problem – anomaly detection within time series data of CPU utilization – and explore how to identify it using unsupervised learning. The dataset we use is the Numenta Anomaly Benchmark (NAB). It is labeled, and we use the labels for calculating scores and for the validation set. There are plenty of well-known algorithms that can be applied for anomaly detection – K-nearest neighbor, one-class SVM, and Kalman filters, to name a few. However, most of them do not shine in the time series domain. According to many studies [1] [2], a long short-term memory (LSTM) neural network should work well for these types of problems.

TensorFlow is currently the trend leader in deep learning; however, at Lohika we have had good experience with another solid deep-learning framework, Apache MXNet. We like it because it is light, scalable, portable, and well-documented, and it is also Amazon's preferred deep-learning framework at AWS.

The neural network that we are going to implement is called an autoencoder. An autoencoder is a type of neural network that approximates its own input: it compresses the input data into an intermediate state and then reconstructs the input from that state. When training autoencoders, the idea is to minimize some metric of the difference between input and output values; we use mean squared error (MSE). To compare the performance of different network designs and hyperparameters, we use the F1 score, which conveys the balance between precision and recall and is commonly used for binary classification.

The goal for this task is to detect all known anomalies on the test set and get the maximum F1 score.

For the implementation, we use Python and a few libraries that are very handy – pandas for dataset manipulation, scikit-learn for data pre-processing and metrics calculations, and matplotlib for visualization.

So let’s get to it…

Dataset Overview

The NAB dataset contains a lot of labeled real and artificial data that can be used for anomaly detection algorithm evaluation. We used actual CPU utilization data of some AWS RDS instances for our study. The dataset contains two files of records with values taken every 5 minutes over a period of 14 days – 4,032 entries per file. We used one file for training and the other for test purposes.

Deep learning typically requires large amounts of data for real-world applications, but smaller datasets are acceptable for a basic study, especially since model training doesn't take much time.

Let’s describe all paths to datasets and labels:

Anomaly labels are stored separately from the data values. Let’s load the train and test datasets and label the values with pandas:
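The original code listing did not survive extraction. As a rough sketch (the file paths, function name, and label format here are our assumptions; NAB series use "timestamp" and "value" columns, with anomaly labels kept as separate timestamp lists), loading a series and flagging its labelled anomalies with pandas might look like this:

```python
import pandas as pd

# Hypothetical paths -- substitute the actual NAB files you downloaded
TRAIN_PATH = "realAWSCloudwatch/rds_cpu_utilization_e47b3b.csv"
TEST_PATH = "realAWSCloudwatch/rds_cpu_utilization_cc0c53.csv"

def load_nab_series(csv_source, anomaly_timestamps):
    """Load one NAB series (columns: timestamp, value) and flag labelled anomalies."""
    df = pd.read_csv(csv_source, parse_dates=["timestamp"])
    df["anomaly"] = df["timestamp"].isin(pd.to_datetime(anomaly_timestamps))
    return df
```

The returned frame carries a boolean "anomaly" column alongside each value, which the later masking and scoring steps can use directly.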

Check the dataset head:


As we can see, it contains a timestamp, a CPU utilization value, and labels noting if this value is an anomaly.

The next step is a visualization of the dataset with pyplot, which requires converting timestamps to time epochs:

When plotting the data, we mark anomalies with green dots:
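The plotting code is likewise missing from the extracted text. A minimal sketch with pyplot (column names and figure size are our assumptions) that converts timestamps to epoch seconds and marks anomalies with green dots:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script also runs without a display
import matplotlib.pyplot as plt

def plot_series(df, out_path):
    """Plot the series over epoch time and mark labelled anomalies with green dots."""
    epochs = df["timestamp"].astype("int64") // 10**9  # datetime64[ns] -> epoch seconds
    plt.figure(figsize=(12, 4))
    plt.plot(epochs, df["value"])
    anomalies = df[df["anomaly"]]
    plt.plot(anomalies["timestamp"].astype("int64") // 10**9,
             anomalies["value"], "g.", markersize=12)  # green dots for anomalies
    plt.xlabel("time (epoch seconds)")
    plt.ylabel("CPU utilization")
    plt.savefig(out_path)
    plt.close()
```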

The visualizations of the training and test datasets look like this:



Preparing the Dataset

There is one thing that played an important role in dataset preparation – masking of labeled anomalies.

We started by training our LSTM neural network on the initial dataset. This resulted in a model able to predict, at best, one of the 2 anomalies labeled in the test dataset. Then, given that we have a small dataset with limited anomalies, we decided to test whether training on a dataset that contains no anomalies would improve results.

We took the approach of masking the anomalies in the original dataset – simply substituting the previous non-anomalous value for each anomalous one. After training the model on the dataset with masked anomalies, we got both anomalies predicted, though at the expense of an additional false positive prediction.

Nevertheless, the F1 score is higher for the second case, so the final implementation should reflect the preferred trade-off. In practice, this can have a significant impact – depending on your case, either missing an anomaly or raising a false positive can be quite expensive.

Let's prepare the data for machine learning processing. In the training set, anomalies are replaced with non-anomalous values as described above. The simplest way is to use the pandas 'fillna' method with the 'ffill' param to replace anomalous values with their predecessors:
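As a sketch (the function name is ours; ".ffill()" is the modern equivalent of the "fillna" call with the 'ffill' param that the text describes), masking the labelled anomalies might look like this:

```python
import numpy as np
import pandas as pd

def mask_anomalies(df):
    """Replace labelled anomalous values with the previous normal value."""
    masked = df.copy()
    masked.loc[masked["anomaly"], "value"] = np.nan  # blank out the anomalies...
    masked["value"] = masked["value"].ffill()        # ...then forward-fill from neighbours
    return masked
```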

The next step is scaling the dataset values, which is highly important. We use scikit-learn's StandardScaler to scale the input data and pandas to select features from the dataset.

The only feature we use is the CPU utilization value. We tried extracting some additional time-based features to increase the output performance – for example, a weekday or a day/night feature – however, we didn't find any useful patterns this way.

Let’s prepare our training and validation datasets:
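A minimal sketch of this step, assuming the article's described setup (one 'value' feature, one file for training and one for testing; the function name is ours). Note the scaler is fit on the training set only, so test statistics don't leak into preprocessing:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def prepare_values(train_df, test_df):
    """Scale the single 'value' feature; the scaler is fit on the training set only."""
    scaler = StandardScaler()
    x_train = scaler.fit_transform(train_df[["value"]]).astype(np.float32)
    x_test = scaler.transform(test_df[["value"]]).astype(np.float32)
    return x_train, x_test, scaler
```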

Choosing a Model

Let’s define the neural network model for our autoencoder:

There is a lot happening in this small piece of code. Let’s review it line-by-line:

  • gluon.nn.Sequential() stacks blocks sequentially
  • model.add – adds a block to the top of the stack
  • gluon.rnn.LSTM(n) – an LSTM layer with n-dimensional output. In our case, we used an LSTM layer without dropout at the layer output. Dropout layers are commonly used to prevent overfitting; dropout simply zeroes layer inputs with a given probability
  • gluon.nn.Dense(n, activation='tanh') – densely-connected NN layer with n-output dimensionality and hyperbolic tangent activation function

We did a few experiments with the neural network architecture and hyperparameters, and the LSTM layer followed by one Dense layer with ‘tanh’ activation function worked best in our case. You can check the comparison table with corresponding F1 scores at the end of the article.

Training & Evaluation

The next step is to choose a loss function:

'L2Loss' is chosen as the loss function because the network we train is an autoencoder, so we need to calculate the difference between the input and the output values. We will calculate MSE after each training epoch of our model to visualize this process. You can find the whole list of other available loss functions here.

Let’s use CPU for training the model. It’s possible to use GPU if you have an NVIDIA graphics card and it supports CUDA. With MXNet this requires just a context preparation:

For the training process, we should load data in batches. MXNet Gluon DataLoader is a good helper in this process:

The batch size is important. Small batches increase training time. By experimenting with batch size, we found that 48 works well: training completes quickly, and the batch is not so large that it decreases the F1 score. Since values are sampled every 5 minutes, a 48-value batch corresponds to a period of 4 hours.

The next step is to define hyperparameters initializer and model training algorithm:

We use the Xavier weights initializer, as it is designed to keep the scale of the gradients roughly the same in all layers. We use the 'sgd' optimizer with a learning rate of 0.01. These values look optimal for this case: the steps are not so small that optimization takes too long, and not so big that SGD overshoots the minimum of the loss function.

Let’s run the training loop and plot MSEs:

The results of the training process:

As you can see from this plot, 15 training epochs are spot-on for the training process as we don’t want to get an undertrained or overtrained neural network.


When using an autoencoder, we have a reconstruction error for each pair of (input value, predicted output value). We can compute the reconstruction errors for the training dataset and then flag an input value as anomalous when its reconstruction error deviates far from the errors seen on the training data. The 3-sigma approach works fine for this case: a value is marked as an anomaly if its reconstruction error is more than three standard deviations from the mean training error.

Let’s see the results for the model and visualize predicted anomalies.

The threshold calculation:
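A sketch of the 3-sigma threshold calculation described above (function names are ours):

```python
import numpy as np

def reconstruction_errors(values, reconstructions):
    """Per-point squared reconstruction error."""
    return (np.asarray(values).flatten() - np.asarray(reconstructions).flatten()) ** 2

def three_sigma_threshold(train_errors):
    """Anomaly threshold: mean training error plus three standard deviations."""
    return train_errors.mean() + 3 * train_errors.std()
```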

Let’s check the predictions on the test dataset:

Filtering anomalies from predictions using the 3-sigma threshold:
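A sketch of the filtering step (the function name is ours; "test_errors" would be the reconstruction errors computed on the test set, and the threshold comes from the training errors as above):

```python
import numpy as np

def predict_anomalies(test_errors, threshold):
    """Indices of test points whose reconstruction error exceeds the threshold."""
    return np.where(np.asarray(test_errors) > threshold)[0]
```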

Plotting the result:

The labeled anomalies from the NAB dataset are marked with green, and the predicted anomalies are marked with red:


As you can see from the plot, this simple model predicted 3 anomalies against the 2 that are labeled. The confusion matrix for the results looks like the following:


Taking into account the size of the available dataset and the time spent on coding, this is a pretty good result. Under the NAB whitepaper's dataset scoring, you get +1 point for a TP and -0.22 points for an FP, and the final score is normalized into the range of 0 to 100. Our simple solution, with 2 TPs and 1 FP, gets 1.78 points out of 2, which scales to a score of 89. That looks strong next to the NAB scoreboard, but it is not an apples-to-apples comparison: NAB evaluates an algorithm's performance across a large number of datasets without taking the nature of the data into account, whereas we built a model for just one anomaly-prediction case.
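The scoring arithmetic can be checked directly (the weights come from the NAB dataset scoring quoted in the text):

```python
# NAB-style raw scoring for our result: +1 per true positive, -0.22 per false positive
true_positives, false_positives = 2, 1
raw_score = true_positives * 1.0 + false_positives * -0.22  # 1.78 points out of 2
normalized_score = raw_score / 2.0 * 100                    # scaled to 0-100 -> 89.0
```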

The F1-score calculation:
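A sketch with scikit-learn's f1_score (the toy label arrays are ours; they reproduce the article's outcome of 2 true positives, 1 false positive, and no false negatives):

```python
from sklearn.metrics import f1_score

# Per-point anomaly flags matching our outcome: 2 TPs, 1 FP, 0 FNs
y_true = [0, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 1, 0, 0]

score = f1_score(y_true, y_pred)  # precision = 2/3, recall = 1
```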

The final F1 score is 0.8. That's not an ideal value, but it's good enough as a starting point for predicting and analyzing possible anomalous cases in the system.


In this article, we have discussed a simple solution for anomaly detection in time series data. We have passed through the standard steps of a data science process – preparing the dataset, choosing a model, training, evaluation, hyperparameter tuning, and prediction. In our case, training the model on a pre-processed dataset with no anomalies had a great impact on the F1 score. As a result, we trained a model that works quite well given the amount of input data and effort spent. As you can see, MXNet provides an easy-to-use and well-documented API for working with such a complex neural network. The code from this article is available on GitHub.


  1. MXNet as simple as possible
  2. Pizza type recognition using MXNet and TensorFlow
  3. Experiments with MxNet and Handwritten Digits recognition

Appendix: Experiments with network architecture and hyperparameters tuning

In this section, we have collected the results of the experiments we performed during network design and hyperparameter tuning. The experiments are listed in chronological order, and in each experiment we changed just a single parameter at a time, so possible correlations between parameters are not taken into account. The parameters that worked best are marked in green.


About the Authors

Denys is a senior software engineer at Lohika. Passionate about web development, well-designed architecture, data collecting, and processing.

Serhiy is a solutions architect and technical consultant at Lohika. He likes sharing his experience in the development of distributed applications, microservices architectures, and large-scale data processing solutions in different domains.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.

Offloading K8s

16. December 2018 2018 0

Container Origins

Containers are by no means a new concept. While Docker deserves, and gets, credit for popularizing them, the concept dates all the way back to the 1970s with the birth of the chroot system call. chroot is a way to isolate a process (and its child processes) from the rest of the system, so that a program can run in a modified, confined (sandboxed) environment. A few decades later, in the 2000s, BSD jails, Solaris Zones, and Linux Containers (LXC) marked the next steps in this progression. Containers provide virtualization at the operating-system level: since containers share the host kernel, they start rapidly, in seconds, with a small footprint, in megabytes. Virtual machines (VMs), an alternative virtualization construct, each require a full operating system. This contrast is exactly what makes containers so lightweight compared to VMs, and it is one of the fundamental reasons why containers became the evolutionary successor to VMs.

Container Fascination

When an application is packaged into a container, it includes the application and its dependencies. Therefore, such a containerized application essentially becomes portable and resource efficient. This also makes the sandboxed application consistent regardless of where it is deployed, one of the hallmark benefits. With monoliths giving way to microservices in the cloud era, this transition played neatly into the hands of containers where developers could package small microservices based cloud native applications.

Container Orchestration

As containers became popular, the next question was how to manage them efficiently. Managing containers at scale became a challenge that different groups strived to solve. Until last year, several products were vying to become the coveted industry darling, mainly Docker Swarm, Apache Mesos, and Kubernetes. Even though separation was emerging in adoption numbers, the debate had not really been won. 2017 finally settled the debate, with Kubernetes (K8s) winning the Container Orchestration Battle. One of the salient reasons for the win was the community support behind Kubernetes, triggered by Google relinquishing control and donating [2] it to the open source community's Cloud Native Computing Foundation (CNCF). The evidence of victory lies in the announcements made by various cloud vendors last year.

  1. Azure Kubernetes Service (AKS) became GA in October 2017.
  2. Pivotal Container Service (PKS) announced at 2017 VMWorld.
  3. Amazon Elastic Container Service for Kubernetes (EKS) announced at 2017 ReInvent.

Amazon EKS

After much anticipation, Amazon Elastic Container Service for Kubernetes (EKS) was announced at re:Invent 2017. While it could be argued that this managed Kubernetes service will end up cannibalizing AWS's native Elastic Container Service (ECS), it speaks to Amazon's commitment to customer obsession [3].

EKS is currently available in the following regions with Ohio being the fourth and latest addition as of last month [4].

  1. North Virginia
  2. Oregon
  3. Ohio
  4. Ireland

The value proposition of EKS is to offload management of the Kubernetes cluster. With the Kubernetes masters deployed across a multi-AZ environment, it provides the requisite resiliency, and as the masters' underlying infrastructure is managed by AWS, version upgrades and patching are handled automatically. Secondly, it allows integration of Kubernetes clusters with the rich AWS ecosystem.

EKS pricing model has two components.

  1. Cost of EC2 servers by the minute.
  2. Cost of Control Plane by the hour.

EKS introduces the concept of a platform version, meant to easily identify the enabled features of a cluster. While EKS currently supports Kubernetes 1.10, note that 1.12 is the latest Kubernetes version and 1.13 is expected to be released [5] around this year's KubeCon (Dec 10-13).

AWS Fargate

While the EKS announcement at last year's re:Invent wasn't particularly a surprise, another service came out of the blue. The unexpected announcement was Fargate, a managed container service married to the serverless paradigm. In this service, AWS abstracts away the infrastructure, entirely relieving customers from having to manage the underlying servers.

Interestingly, Fargate only supports ECS at this point, but it is a given that running Kubernetes without managing infrastructure is on the roadmap. This will allow AWS to offer a complete Container as a Service (CaaS) offering.

Container as a Service

CaaS, through Fargate, serves another huge goal of AWS that ties back to Lambda. Back at re:Invent 2014, the announcement of Lambda generated fervor around a radical, though poorly named, paradigm: Serverless. This paradigm not only abstracts away the infrastructure; compute is provisioned only when the code is invoked.

Since the term Serverless is a misnomer, Function as a Service (FaaS) is a better alternative name. I recall a panelist at one conference sharing a far more eloquent description of Serverless:

  1. Invisible infrastructure
  2. Micro-billing
  3. Event based programming

With AWS focused on taking on more and more of the undifferentiated heavy lifting, this is the logical progression: after managing the control plane, the next iteration takes over management of the servers as well.

2018 re:Invent

Unlike last year, this year's re:Invent wasn't heavy on container technology. Following are a couple of high-level enhancements:

  1. AWS CodeDeploy supports blue/green deployments to AWS ECS and Fargate.
  2. AWS CodePipeline supports Elastic Container Registry (ECR) as a source provider.


[1] http://www.linuxandubuntu.com/home/virtualbox-vs-container

[2] https://techcrunch.com/2015/07/21/as-kubernetes-hits-1-0-google-donates-technology-to-newly-formed-cloud-native-computing-foundation-with-ibm-intel-twitter-and-others/

[3] https://www.amazon.jobs/en/principles

[4] https://aws.amazon.com/about-aws/whats-new/2018/11/amazon-eks-available-in-ohio-region/

[5] https://github.com/kubernetes/sig-release/tree/master/releases/release-1.13

About the Author

Atif Siddiqui is a returning author having participated in 2016 AWS Advent. He is passionate about Cloud technologies and is a Cloud Infrastructure Architect at Citigroup. He currently holds all three AWS Associate certifications and remains committed to expanding his knowledge in Cloud Solutions.

About the Editor

John Varghese is a Cloud Steward at Intuit responsible for the AWS infrastructure of Intuit’s Futures Group. He runs the AWS Bay Area meetup in the San Francisco Peninsula Area for both beginners and intermediate AWS users. He has also organized multiple AWS Community Day events in the Bay Area. He runs a Slack channel just for AWS users. You can contact him there directly via Slack. He has a deep understanding of AWS solutions from both strategic and tactical perspectives. An avid AWS user since 2012, he evangelizes AWS and DevOps every chance he gets.

AWS and the New Enterprise WAN

15. December 2018 2018 0

The public cloud's wholesale transformation of IT includes a shift in enterprise IT requirements for the wide-area network (WAN). The viability of traditional network architectures for interconnecting hundreds or even thousands of remote offices, or branches, is rapidly decreasing as enterprises consume IT as a utility. A more agile, secure, and dynamic WAN is needed. As an industry, we have a name for this emerging networking trend: Software-Defined WAN (SD-WAN). In this article, we explore SD-WAN with a focus on integration with AWS VPC infrastructure.

Importance of Multi-Cloud

Enterprises may have applications in on-premise data centers, colocation facilities, and infrastructure-as-a-service (IaaS) provider platforms, and they need to access the information in these locations in a flexible manner. Also, Software as a Service (SaaS) is now the standard option for a wide range of enterprise applications. Yet the growth in SaaS hasn't always been met by growth in the infrastructure needed to cope with the resulting increase in network utilization. Older WAN technologies deployed at corporate branches are no longer sufficient for the modern SaaS-enabled workforce. As data stops flowing to and from the data center and starts flowing over the internet, congestion, packet loss, and high latencies are all too common.

The notion that companies can spread a single application across multiple public clouds has had its detractors. Some argue that the only “multi-cloud” approach will be a distribution of applications in a way that caters to the perceived strengths of a given cloud provider. For example, a company might consume AWS’s Lambda for functions-as-a-service while looking to Google Cloud Platform (GCP) for machine learning services. This approach is valid, and we see it regularly in our work; however, we also observe how the rise of Kubernetes is changing the IT roadmaps within the enterprise. We have no doubts: the future is multi-cloud.

Let's not forget that most enterprises, unlike companies "born in the cloud", must continue to operate infrastructure on-premise and in third-party colocation facilities. Why will private cloud deployments persist? Isn't this passé? To answer this, consider a telling quote from Amazon's Anu Sharma, product manager for the new AWS Outposts hybrid cloud service. She acknowledges that "…there are some applications that [customers] cannot move to AWS largely because of physical constraints…" and highlights latency and its effect on moving data in and out of the cloud. Whether the private cloud is implemented as OpenStack, AWS Outposts, or Azure Stack, private cloud will remain in the picture.

Today’s WAN is Inadequate

Therefore, in an environment in which application placement is diverse, enterprises must figure out how to connect the employees in many physical locations to the tools they need to perform their job functions. The complexity involved in moving bits around these highly heterogeneous environments can be overwhelming.

Traditionally, enterprises paid telecommunication companies premium prices for Multiprotocol Label Switching (MPLS) links or private point-to-point links connecting remote branches to centralized corporate data centers. Traffic from branch locations was carried over this private connectivity regardless of whether the bits were intended for an internal app, a SaaS application, or an internet search engine. This added latency to network connections, since all traffic was, to use networking parlance, "backhauled" to a small number of corporate locations. Within these corporate data centers, the network was the proverbial long pole in the tent when deploying new applications for a geographically dispersed workforce.

Figure 1 depicts the connection of multiple remote branches to a centralized data center. Note that all traffic exiting the branches traverses the expensive MPLS network to reach all destinations–including Internet ones.


Figure 1: Traditional Enterprise WAN

As mentioned earlier, as workloads spread across on-premise and public cloud infrastructure, enterprises need more flexible, secure, and agile means for connecting branch offices, namely SD-WAN. The projects around hybrid cloud connectivity and the modernization of the WAN infrastructure will run in parallel for many years to come. Any effort to move traffic between on-premise and off-premise needs to keep the requirements of the new WAN in mind.

Before defining SD-WAN, let's examine the underlying access mechanisms at our disposal. We are no longer limited to T1 and other leased-line services for business-grade connectivity. Access might consist of fiber-based Ethernet, business cable, fixed 4G/5G, or satellite. These access types might coexist with MPLS links or serve as a means to replace them. A given branch might have more than one path for exiting the branch. Considering internet connectivity by itself, not all types of access are alike: a larger office might have a 1 Gb/s fiber link while a kiosk in the mall might have spotty Wi-Fi.

Does this heterogeneity of WAN access and multiple entry/exit paths sound like a management nightmare? This could very well be the case if we designed and operated the new WAN in the manner of the previous generation WAN.

Enter The Software-Defined WAN

What SD-WAN provides is an abstraction layer for the WAN that simplifies the management and cost of wide-area connectivity while realizing application performance improvements.

Let's compare and contrast the WAN abstraction with the VPC as a data center abstraction. VPC constructs such as subnets, load balancers, and virtual gateways are purely ephemeral, able to appear and later vanish with the stroke of an API call. But how do we abstract a WAN? Fiber and copper are tangible. We want to break the coupling of network capabilities from how the packets are delivered over the various access mechanisms. SD-WAN accomplishes this through the introduction of an overlay network that extends over the various connectivity methods. The overlay network is implemented using SD-WAN appliances on either side of the connection.



Figure 2: SD-WAN Overlay

Is it possible to cut through the hype and describe what SD-WAN can deliver for the enterprise in terms of efficiency, cost reduction, and strategic enablement of new services? Yes, although doing so can be challenging.

To start, SD-WAN may be deployed in many different models. For example, service providers can add SD-WAN to their existing MPLS offering as a "first mile/last mile" technology. On the other hand, enterprises might want to deploy SD-WAN in a full overlay model where they assert full control over the SD-WAN solution and appliances; even that model may be deployed in-house or as a managed service. In this post, we focus on the latter: the full overlay model.

In addition, there is much confusion as to what features an SD-WAN solution should contain at a minimum. The SD-WAN space, like any other "hot" technology area, is crowded. Engineers may find it very hard to distinguish between a true SD-WAN solution and an old WAN-optimization appliance wrapped in new marketing jargon. To make things even more confusing, each enterprise approaches SD-WAN from a different problem space: some might consider application performance enhancement their primary goal, while others consider cost reduction the primary driver.

We believe the following should be present in any chosen SD-WAN solution:

Provides Agility

Network agility–not lower costs or better performance–is the main factor for enterprises adopting SD-WAN infrastructure, according to findings in a survey conducted by Cato Networks.

A good SD-WAN solution enables rapid branch deployments with self-provisioning. Bringing a new branch or remote location online should be easy and completed within minutes. The branch appliance, physical or virtual, should just need to be connected to the LAN and WAN links serving the branch, plugged in, and turned on. No specialized IT expertise should be required on premise at the branch.

Offers Flexibility

Flexibility can be evaluated in many different contexts. Any SD-WAN solution should be end-to-end and should not put restrictions on where the data resides. The reason it has to be end-to-end is that your users are in many places: at your branches, on your campuses, or road warriors connecting to your resources using client VPN software over the public Internet.

Similarly, your data and your applications are everywhere. They’re on-prem, and they are in the public cloud. The hubs for the SD-WAN deployment should be able to be a physical or virtual device in a traditional data center, a virtual device in the corporate private cloud, or a virtual device in the public cloud of choice.

In addition, a true SD-WAN solution should support different topologies. Many enterprises use a hub and spoke or a full mesh topology. Most SD-WAN solutions support these basic topologies. But one could think of many other hybrid topologies, and SD-WAN solutions should not restrict enterprises to one end of the topology spectrum. SD-WAN should provide insertion of network services whether on the branch customer premise equipment (CPE), in the public cloud, or in regional and enterprise data centers, deployed in a wide range of topologies.

In addition, these SD-WAN solutions should provide automation and business-policy abstraction to simplify complex configurations and provide flexibility in traffic routing and policy definitions.

Includes Integrated Security

One could imagine a day in which a traditional firewall device isn’t needed per branch. It is no wonder that on the long list of SD-WAN vendors we find many familiar names from the traditional firewall vendor space. We believe integrating advanced security features into SD-WAN services allows a cleaner, simpler deployment model for the branches.

Even though basic firewalling capabilities in some SD-WAN appliances might be sufficient for some enterprises, given today’s threat landscape most enterprises need and demand advanced firewall capabilities if the only appliance deployed at the branch is an SD-WAN device. Enterprises need to find a solution that delivers advanced security features without compromising desired SD-WAN functionality such as application optimization or fast fail-over.

Optimizes Application Performance

One of SD-WAN’s most desirable features is the ability to have applications choose the best connectivity based not only on performance metrics but also on business metrics such as cost. Let’s take an example of ensuring users are proximal to their applications. In our example, an enterprise has existing servers deployed in both the on-premises data center and the AWS public cloud. Let’s say you have an end user in Arlington, VA. Your on-premises data center is located in Pittsburgh, while the AWS deployment region is us-east-1 in Ashburn, VA.

For applications that are housed in the Pittsburgh data center, branches have two possible paths to get to these resources. For applications that require large bandwidth and guaranteed SLAs, the MPLS path should be used. On the other hand, if cost is the primary factor for an application such as bulk file transfers, then perhaps the path through the Internet is the best option.

Similar choices exist for data housed in the AWS cloud. For the branch to reach the AWS VPC, we can pick between a Direct Connect (DX) connection to the VPC and pure Internet access. We should be able to optimize based on business requirements. A sensitive application might only be allowed to use the DX connection provided through the data center, while for the rest, Internet-based access might suffice. An acceptable SD-WAN solution provides IPsec-based encryption between the branches and the centralized hub location (cloud or data center) when the connection is through the Internet. The configuration of these IPsec tunnels and the routing through them should be performed by the SD-WAN controller, not manually.



Figure 3: Branch with multiple paths to reach applications

As described, an SD-WAN hub can reside within an AWS VPC, in effect turning the VPC into another aggregation hub for the remote sites. In AWS, an SD-WAN termination point is an appliance from the Marketplace. For an interesting in-depth look at SD-WAN appliances, check out the AWS-commissioned report by ESG Labs entitled SD-WAN Integration with Amazon Web Services.


There may be different architectures for homing an SD-WAN appliance within an AWS architecture, but we would like to explore one we call the “edge services VPC.” For the sake of simplicity, we show a single-region deployment with three VPCs.

Figure 4: The Edge Services VPC Design

There are many possible ways to terminate the edge SD-WAN connections into a public cloud. At the most basic level, SD-WAN solutions rely on an SD-WAN gateway, which performs the hub functionality. This appliance, which will be a VM when deployed in the AWS VPC, aggregates all the connections from the SD-WAN branches.

We like the idea of a separate “edge services VPC” dedicated to all the edge connectivity terminations. This VPC would terminate the SD-WAN connections. The connectivity between this VPC and other VPCs can be provided through simple VPC peering if the AWS deployment is small, or through a Transit Gateway (TGW) if a larger number of VPCs need access to the edge services VPC. Even though it is outside the scope of this paper, one could imagine a yet larger deployment with an edge services VPC per region, each connected to other VPCs within that region through a TGW.
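As a rough sketch of the TGW variant in Terraform (resource labels and the two variables are hypothetical; a real deployment would also need route table configuration and attachments for each application VPC):

```hcl
# Hypothetical sketch: a Transit Gateway as the regional hub, with the
# edge services VPC attached so other VPCs can reach the SD-WAN terminations.
resource "aws_ec2_transit_gateway" "hub" {
  description = "Regional hub connecting the edge services VPC to app VPCs"
}

resource "aws_ec2_transit_gateway_vpc_attachment" "edge_services" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  vpc_id             = var.edge_services_vpc_id     # assumed defined elsewhere
  subnet_ids         = var.edge_services_subnet_ids # one subnet per AZ
}
```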


In this article, we’ve described how multi-cloud and the diversity of WAN connectivity options for enterprise branches have given rise to a flexible, agile, and secure SD-WAN. We believe that enterprise public cloud migrations–while not necessarily dependent on SD-WAN–will occur in the same timelines as the move to SD-WAN as enterprise IT architects recognize that more intelligence is needed in the network path selection process. The details about SD-WAN vendor selection and design will vary. One thing is certain: the enterprise WAN is evolving toward a more software-centric approach to meet the needs of enterprise applications.


About the Authors

Amir Tabdili and Jeff Loughridge have been designing, operating, and engineering large-scale IP infrastructures since the late-1990s. In their current roles as Chief Architect and CTO of Konekti Systems, the two help clients with public cloud networking, SD-WAN, and hybrid IT architectures. You can learn more about Konekti at https://konekti.us.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.

Deploy a Secure Static Site with AWS & Terraform

14. December 2018 2018 0

There are many uses for static websites. A static site is the simplest form of website, though every website consists of delivering HTML, CSS and other resources to a browser. With a static website, initial page content is delivered the same to every user, regardless of how they’ve interacted with your site previously. There’s no database, authentication or anything else associated with sending the site to the user – just a straight HTTPS connection and some text content. This content can benefit from caching on servers closer to its users for faster delivery; it will generally also be lower cost, as the servers delivering this content do not themselves need to interpret scripting languages or make database connections on behalf of the application.

The static website now has another use, as there are more tools to provide highly interactive in-browser applications based on JavaScript frameworks (such as React, Vue or Angular) which manage client interaction, maintain local data and interact with the web service via small but often frequent API calls. These systems decouple front-end applications from back-end services and allow those back-ends to be written in multiple languages or as small siloed applications, often called microservices. Microservices may take advantage of modern back-end technologies such as containers (via Docker and/or Kubernetes) and “serverless” providers like AWS Lambda.

People deploying static sites fall into these two very different categories – for one, the site is the whole of their business; for the other, the static site is a very minor part supporting the API. However, each category of static site use still shares similar requirements. In this article we explore deploying a static site with the following attributes:

  • Must work at the root domain of a business, e.g., example.com
  • Must redirect from the common (but unnecessary) www. subdomain to the root domain
  • Must be served via HTTPS (and upgrade HTTP to HTTPS)
  • Must support “pretty” canonical URLs – e.g., example.com/about-us rather than example.com/about-us.html
  • Must not cost anything when not being accessed (except for domain name costs)

AWS Service Offerings

We achieve these requirements through use of the following AWS services:

  • S3
  • CloudFront
  • ACM (Amazon Certificate Manager)
  • Route53
  • Lambda

This may seem like quite a lot of services to host a simple static website; let’s review and summarise why each item is being used:

  • S3 – object storage; allows you to put files in the cloud. Other AWS users or AWS services may be permitted access to these files. They can be made public. S3 supports website hosting, but only via HTTP. For HTTPS you need…
  • CloudFront – content delivery system; can sit in front of an S3 bucket or a website served via any other domain (doesn’t need to be on AWS) and deliver files from servers close to users, caching them if allowed. Allows you to import HTTPS certificates managed by…
  • ACM – generates and stores certificates (you can also upload your own). Will automatically renew certificates which it generates. For generating certificates, your domain must be validated via adding custom CNAME records. This can be done automatically in…
  • Route53 – AWS nameservers and DNS service. R53 replaces your domain provider’s nameservers (at the cost of $0.50 per month per domain) and allows both traditional DNS records (A, CNAME, MX, TXT, etc.) and “alias” records which map to a specific other AWS service – such as S3 websites or CloudFront distributions. Thus an A record on your root domain can link directly to Cloudfront, and your CNAMEs to validate your ACM certificate can also be automatically provisioned
  • Lambda – functions as a service. Lambda lets you run custom code on events, which can come directly or from a variety of other AWS services. Crucially you can put a Lambda function into Cloudfront, manipulating requests or responses as they’re received from or sent to your users. This is how we’ll make our URLs look nice
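To make the Lambda role concrete, here is a minimal sketch of the kind of origin-request handler that rewrites “pretty” URLs; the function name and exact rewrite rules are illustrative, not necessarily the code in the repository (Lambda@Edge also supports Node.js):

```python
def handler(event, context):
    """CloudFront origin-request handler: rewrite pretty URLs to .html objects."""
    request = event["Records"][0]["cf"]["request"]
    uri = request["uri"]
    if uri.endswith("/"):
        # Directory-style request: serve the index document
        request["uri"] = uri + "index.html"
    elif "." not in uri.split("/")[-1]:
        # No file extension: assume an .html page, e.g. /about-us -> /about-us.html
        request["uri"] = uri + ".html"
    # Anything with an extension (CSS, JS, images) passes through unchanged
    return request
```

CloudFront invokes this before forwarding the request to S3, so the bucket can store `about-us.html` while users see `example.com/about-us`.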

Hopefully, that gives you some understanding of the services – you could cut out CloudFront and ACM if you didn’t care about HTTPS, but there’s a worldwide push for HTTPS adoption to provide improved security for users, and browsers including Chrome now mark pages not served via HTTPS as “insecure”.

All this is well and good, but whilst AWS is powerful, its console leaves much to be desired, and setting up one site can take some time – replicating it for multiple sites is as much an exercise in memory and box ticking as it is in technical prowess. What we need is a way to do this once, or even better have somebody else do this once, and then replicate it as many times as we need.

Enter Terraform from HashiCorp

One of the most powerful parts of AWS isn’t clear when you first start using the console to manage your resources. AWS has a super powerful API that drives pretty much everything. It’s key to much of their automation, to the entirety of their security model, and to tools like Terraform.

Terraform from HashiCorp is “Infrastructure-as-Code” or IaC. It lets you define resources on a variety of cloud providers and then run commands to:

  • Check the current state of your environment
  • Make required changes such that your actual environment matches the code you’ve written

In code form, Terraform uses blocks of code called resources:

resource "aws_s3_bucket" "some-internal-reference" {
  bucket = "my-bucket-name"
}

Each resource can include variables (documented on the provider’s website), and these can be text, numbers, true/false, lists (of the above) or maps (basically like subresources with their variables).
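For instance, a resource mixing a plain string variable with a map might look like this (the bucket name and tags here are illustrative only):

```hcl
resource "aws_s3_bucket" "site" {
  bucket = "example-com-site" # string
  tags = {                    # map
    Project     = "static-site"
    Environment = "production"
  }
}
```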

Terraform is distributed as pre-built binaries (it’s also open source, written in Go so you can build it yourself) that you can run simply by downloading, making them executable and then executing them. To work with AWS, you need to define a “provider” which is formatted similarly to a resource:

provider "aws" {
}

To call any AWS API (via the command line, Terraform or a language of your choice) you’ll need to generate an access key and secret key for the account you’d like to use. That’s beyond the scope of this article, but given that you should avoid hardcoding those credentials into Terraform, and that you’d be well served by having the CLI available anyway, skip over to the AWS CLI setup instructions and set this up with the correct keys before continuing.

(NB: in this step you’re best provisioning an account with admin rights, or at least full access to IAM, S3, Route53, CloudFront, ACM & Lambda. However, don’t be tempted to create access keys for your root account – AWS recommends against this.)

Now that you’ve got your system set up to use AWS programmatically, installed Terraform and been introduced to the basics of its syntax it’s a good time to look at our code on GitHub.

Clone the repository above; you’ll see we have one file in the root (main.tf.example) and then a directory called modules. One of the best parts of Terraform is modules and how they behave. Modules allow one user to define a specific set of infrastructure that may either relate directly to each other or interact by being on the same account. These modules can define variables allowing some aspects (names, domains, tags) to be customised, whilst other items that may be necessary for the module to function (like a certain configuration of a CloudFront distribution) are fixed.

To start off run bash ./setup which will copy the example file to main.tf and also ensure your local Terraform installation has the correct providers (AWS and file archiving) as well as set up the modules. In main.tf then you’ll see a suggested set up using three modules. Of course, you’d be free to just remove main.tf entirely and use each module in its own right, but for this tutorial, it helps to have a complete picture.

At the top of the main.tf file are defined three variables which you’ll need to fill in correctly:

  1. The first is the domain you wish to use – it can be your root domain (example.com) or any sort of subdomain (my-site.example.com).
  2. Second, you’ll need the Zone ID associated with your domain on Route 53. Each Route 53 domain gets a zone ID which relates to AWS’ internal domain mapping system. To find your Zone ID visit the Route53 Hosted Zones page whilst signed in to your AWS account and check the right-hand column next to the root domain you’re interested in using for your static site.
  3. Finally choose a region; if you already use AWS you may have a preferred region, otherwise, choose one from the AWS list nearest to you. As a note, it’s generally best to avoid us-east-1 where possible, as on balance this tends to have more issues arise due to its centrality in various AWS services.

Now for the fun part. Run terraform plan – if your AWS CLI environment is set up the plan should execute and show the creation of a whole list of resources – S3 Buckets, CloudFront distributions, a number of DNS records and even some new IAM roles & policies. If this bit fails entirely, check that the provider entity in main.tf is using the right profile name based on your ~/.aws/credentials file.
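For reference, a provider block with a named profile looks something like the following (the profile name and region are placeholders; use whatever matches your ~/.aws/credentials and chosen region):

```hcl
provider "aws" {
  region  = "eu-west-1"   # your chosen region
  profile = "static-site" # must match a profile name in ~/.aws/credentials
}
```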

Once the plan has run and told you it’s creating resources (it shouldn’t say updating or destroying at this point), you’re ready to go. Run terraform apply – this basically does another plan, but at the end, you can type yes to have Terraform create the resources. This can take a while as Terraform has to call various AWS APIs and some are quicker than others – DNS records can be slightly slower, and ACM generation may wait until it’s verified DNS before returning a positive response. Be patient and eventually it will inform you that it’s finished, or tell you if there have been problems applying.

If the plan or apply options have problems you may need to change some of your variables based on the following possible issues:

  • Names of S3 buckets should be globally unique – so if anyone in the world has a bucket with the name you want, you can’t have it. A good system is to prefix buckets with your company name or suffix them with random characters. By default, the system names your buckets for you, but you can override this.
  • You shouldn’t have an A record for your root or www. domain already in Route53.
  • You shouldn’t have an ACM certificate for your root domain already.

It’s safe (in the case of this code at least) to re-run Terraform if problems have occurred and you’ve tried to fix them – it will only modify or remove resources it has already created, so other resources on the account are safe.

Go into the AWS console and browse S3, CloudFront and Route53, and you should see your various resources created. You can also view the Lambda function and ACM certificate, but be aware that for the former you’ll need to be in the specific region you chose to run in, and for the latter you must select us-east-1 (N. Virginia).

What now?

It’s time to deploy a website. This is the easy part – you can use the S3 console to drag and drop files (remember to use the website bucket and not the logs or www redirect buckets), use awscli to upload yourself (via aws s3 cp or aws s3 sync) or run the example bash script provided in the repo which takes one argument, a directory of all files you want to upload. Be aware – any files uploaded to your bucket will immediately be public on the internet if somebody knows the URL!
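The awscli variants mentioned above look like the following (the bucket name is a placeholder; substitute the website bucket Terraform created for you):

```shell
# Copy a single file into the website bucket
aws s3 cp ./index.html s3://example-com-site/index.html

# Or sync an entire build directory, removing remote files
# that no longer exist locally
aws s3 sync ./public s3://example-com-site --delete
```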

If you don’t have a website, check the “example-website” directory – running the bash script above without any arguments will deploy this for you. Once you’ve deployed something, visit your domain and all being well you should see your site. Cloudfront distributions have a variable time to set up so in some cases it might be 15ish minutes before the site works as expected.

Note also that CloudFront is set to cache files for 5 minutes; even a hard refresh won’t reload resource files like CSS or JavaScript, as CloudFront won’t fetch them again from your bucket for 5 minutes after first fetching them. During development you may wish to turn this off – you can do this in the CloudFront console by setting the TTL values to 0. Once you’re ready to go live, run terraform apply again and it will reconfigure CloudFront to the recommended settings.


With a minimal amount of work we now have a framework that can deploy a secure static site to any domain we choose in a matter of minutes. We could use this to deploy websites for marketing clients rapidly, publish a blog generated with a static site builder like Jekyll, or use it as the basis for a serverless web application using ReactJS delivered to the client and a back-end provided by AWS Lambda accessed via AWS API Gateway or (newly released) an AWS Application Load Balancer.

About the Author

Mike has been working in web application development for 10 years, including 3 years managing a development team for a property tech startup and before that 4 years building a real time application for managing operations at skydiving centres, as well as some time freelancing. He uses Terraform to manage all the AWS infrastructure for his current work and has dabbled in other custom AWS tools such as an improvement to the CloudWatch logging agent and a deployment tool for S3. You can find him on Twitter @m1ke and GitHub.
