Unlearning for DynamoDB

04. December 2018

Introduction

Relational databases have been the standard way to store information since the 1980s. Techniques for using them, such as normalisation and index creation, form part of many university courses around the world. In this article, I will set out how perhaps the hardest part of using DynamoDB is forgetting the years of relational database theory most of us have absorbed so fundamentally that it colours every aspect of how we think about storing and retrieving data.

Let’s start with a brief overview of DynamoDB.

Relational databases were conceived at a time when storage was the most expensive part of a system. Many of their design decisions revolve around this constraint – normalisation places a great deal of emphasis on keeping only a single copy of each item of data.

NoSQL databases are a more recent development, and have a different target for operation. They were conceived in a world where storage is cheap, and network latency is often the biggest problem, limiting scalability.

DynamoDB is Amazon’s answer to the problem of effectively storing and retrieving data at scale. The marketing would lead you to believe it offers:

  • Very high data durability
  • Very high data availability
  • Infinite capacity
  • Infinite scalability
  • High flexibility with a schemaless design
  • Fully managed and autoscaling with no operational overhead

In reality, whether it lives up to the hype depends on your view of what each of these concepts means. The durability, availability, and capacity points are the easiest to agree with – the chances of data loss are infinitesimally low, the only limit on capacity is the 10GB limit per partition, and the number of DynamoDB outages in the last eight years is tiny.

As we move down the list, though, things get a bit shakier. Scalability depends entirely on sharding data, which means you can run into issues with ‘hot’ partitions, where particular keys are used much more than others. While Amazon has mitigated this to some extent with adaptive capacity, it is still very much something you need to design your data layout to avoid. Simply increasing the provisioned read and write capacity won’t necessarily give you extra performance.

It’s kind of true that DynamoDB is schemaless, in that table structures are not uniform and each row within a table can have different columns of differing types. However, you must define your primary key up front, and it can never change. The primary key consists of, at minimum, a partition key, plus an optional range key (also referred to as a sort key). Global secondary indexes can be added after table creation, but local secondary indexes must be defined when the table is created, and both are limited to five per table.

All of this makes the last point – that there is no operational overhead when using DynamoDB – obviously false. You won’t spend time upgrading database versions or operating systems, but there’s plenty of ops work to do in designing the correct table structure, ensuring partition usage is evenly spread, and troubleshooting performance issues.

Unlearning

So how do you make your life using DynamoDB as easy as possible? Start by forgetting everything you know about relational databases because almost none of it is true here. Be careful – along the way you’ll find a lot of your instincts about how data should be structured are actually optimisations for RDBMS rather than incontrovertible facts.

Everything in a single table

In DynamoDB, the ‘right’ number of tables to power an application is one. There are of course exceptions but start with the assumption that all data for your application will be in a single table, and move to multiple tables only if really necessary.

Know how you’re going to use your data up front

In a relational database, the structure of your data stems from the data itself. You group and normalise, and then stitch things back together at query time.

DynamoDB is different. Here you need to know the queries you’re most commonly going to run, then work backwards from them to structure your data and come up with the table design.

It’s OK to duplicate data

Many times when using DynamoDB you will store the same data more than once. This might even be done with multiple copies of data in the same row, to allow it to be used by different indexes. Global Secondary Indexes are just copies of your data, with different keys.

Re-use the same column for different data

Imagine you have a table with a compound primary key – an account ID as the partition key, and a range (sort) key. You could use the sort key to store different content about the account. For example, you might have a sort key value of settings for storing account configuration, plus a set of timestamps for actions taken. You can get all of the timestamps by querying for sort keys between the start of time and now, and the settings item by looking up the partition key with the sort key settings. This is probably the hardest part to get your head around at first, but once you get used to it, it becomes very powerful, particularly when combined with secondary indexes.

This, of course, makes choosing a suitable name for your sort column very difficult.
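As a rough sketch of those two access patterns with boto3 (the table name, key names, and values are all hypothetical):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("accounts")  # hypothetical table

    # The single settings item for an account.
    settings = table.get_item(
        Key={"account_id": "acct-123", "sk": "settings"}
    ).get("Item")

    # Every timestamped action for the same account, by querying a range of
    # sort key values (ISO 8601 timestamps sort lexicographically).
    actions = table.query(
        KeyConditionExpression=Key("account_id").eq("acct-123")
        & Key("sk").between("0000-01-01T00:00:00Z", "2018-12-04T23:59:59Z")
    )["Items"]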

Concatenate data into composite sort keys

If you have hierarchical data, you can use composite sort keys to let you progressively refine your queries. For example, if you need to filter by location, you might build the sort key by concatenating the levels of the hierarchy – country, then region, then city – giving values such as UK#Yorkshire#Leeds.

Using a sort key like this, you can find all items within a partition at any level of the hierarchy by using the begins_with operator.
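A short boto3 sketch, again with hypothetical table and key names:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("accounts")  # hypothetical table

    # All items for the account in Leeds specifically...
    leeds = table.query(
        KeyConditionExpression=Key("account_id").eq("acct-123")
        & Key("sk").begins_with("UK#Yorkshire#Leeds")
    )["Items"]

    # ...or everything in the UK, one level up the hierarchy.
    uk = table.query(
        KeyConditionExpression=Key("account_id").eq("acct-123")
        & Key("sk").begins_with("UK#")
    )["Items"]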

Use sparse indexes

Because not every row in DynamoDB must have the same columns, you can have secondary indexes that only contain a subset of the data. A row will only appear in a secondary index if that index’s key attributes are present on the row. Using this, you can make less specific queries, and even scans, efficient enough for frequent use.
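For example, querying a hypothetical global secondary index keyed on an attribute named pending_export that only a few rows carry:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("accounts")  # hypothetical table

    # Only rows that have a pending_export attribute exist in this index,
    # so the query touches a small, relevant subset of the table.
    pending = table.query(
        IndexName="pending_export-index",                 # assumed index name
        KeyConditionExpression=Key("pending_export").eq("true"),
    )["Items"]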

Don’t try and use your database for all types of queries

With relational databases, it’s common to spin up a read-only replica to interrogate data for analytics and trending. With DynamoDB, you’ll need to do this work somewhere else – perhaps even in a relational database. You can use DynamoDB Streams to have data sent to S3 for analysis with Athena, Redshift, or even something like MySQL. Doing this gives you a best-of-both-worlds approach: the high throughput and predictable scalability of DynamoDB, plus the ability to run ad-hoc queries in a relational engine.

Conclusions

  • Know what questions you need to ask of your data before designing the schema
  • Question what you think you know about how data should be stored
  • Don’t be fooled into thinking you can ‘set it and forget it.’

About the Author

Sam Bashton
Sam Bashton is a cloud computing expert who recently relocated with his family from Manchester, UK to Sydney, Australia. Sam has been working with AWS since 2006, providing consultancy and support for high traffic e-commerce websites. Recently Sam wrote and released bucketbridge.cloud, an AWS Marketplace solution providing an FTP and FTPS proxy for S3.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Previously, she was a principal site reliability engineer at RealSelf and developed cookbooks to simplify building and managing infrastructure at Chef. Jennifer is the coauthor of Effective DevOps and speaks about DevOps, tech culture, and monitoring. She also gives tutorials on a variety of technical topics. When she’s not working, she enjoys learning to make things and spending quality time with her family.


Vanquishing CORS with Cloudfront and Lambda@Edge

03. December 2018

If you’re deploying a traditional server-rendered web app, it makes sense to host static files on the same machine. The HTML, being server-rendered, will have to be served there, and it is simple and easy to serve css, javascript, and other assets from the same domain.

When you’re deploying a single-page web app (or SPA), the best choice is less obvious. A SPA consists of a collection of static files, as opposed to server-rendered files that might change depending on the requester’s identity, logged-in state, etc. The files may still change when a new version is deployed, but not for every request.

In a single-page web app, you might access several APIs on different domains, or a single API might serve multiple SPAs. Imagine you want to have the main site at mysite.com and some admin views at admin.mysite.com, both talking to api.mysite.com.

Problems with S3 as a static site host

S3 is a good option for serving the static files of a SPA, but it’s not perfect. On its own, it doesn’t support SSL on a custom domain—a requirement for any serious website in 2018. There are a couple of other deficiencies that you may encounter, namely client-side routing and CORS headaches.

Client-side routing

Most SPA frameworks rely on client-side routing. With client-side routing, every path should receive the content for index.html, and the specific “page” to show is determined on the client. It’s possible to configure this to use the fragment portion of the url, giving routes like /#!/login and /#!/users/253/profile. These “hashbang” routes are trivially supported by S3: the fragment portion of a URL is not interpreted as a filename. S3 just serves the content for /, or index.html, just like we wanted.

However, many developers prefer to use client-side routers in “history” mode (aka “push-state” or “HTML5” mode). In history mode, routes omit that #! portion and look like /login and /users/253/profile. This is usually done for SEO reasons, or just for aesthetics. Regardless, it doesn’t work with S3 at all. From S3’s perspective, those look like totally different files. It will fruitlessly search your bucket for files called /login or /users/253/profile. Your users will see 404 errors instead of lovingly crafted pages.

CORS headaches

Another potential problem, not unique to S3, is due to Cross-Origin Resource Sharing (CORS). CORS polices which routes and data are accessible from other origins. For example, a request from your SPA mysite.com to api.mysite.com is considered cross-origin, so it’s subject to CORS rules. Browsers enforce that cross-origin requests are only permitted when the server at api.mysite.com sets headers explicitly allowing them.

Even when you have control of the server, CORS headers can be tricky to set up correctly. Some SPA tutorials recommend side-stepping the problem using webpack-dev-server’s proxy setting. In this configuration, webpack-dev-server accepts requests to /api/* (or some other prefix) and forwards them to a server (eg, http://localhost:5000). As far as the browser is concerned, your API is hosted on the same domain—not a cross-origin request at all.

Some browsers will also reject third-party cookies. If your API server is on a subdomain this can make it difficult to maintain a logged-in state, depending on your users’ browser settings. The same fix for CORS—proxying /api/* requests from mysite.com to api.mysite.com—would also make the browser see these as first-party cookies.

In production or staging environments, you wouldn’t be using webpack-dev-server, so you could see new issues due to CORS that didn’t happen on your local computer. We need a way to achieve similar proxy behavior that can stand up to a production load.

CloudFront enters, stage left

To solve these issues, I’ve found CloudFront to be an invaluable tool. CloudFront acts as a distributed cache and proxy layer. You make DNS records that resolve mysite.com to something.cloudfront.net. A CloudFront distribution accepts requests and forwards them to another origin you configure. It will cache the responses from the origin (unless you tell it not to). For a SPA, the origin is just your S3 bucket.

In addition to providing caching, SSL, and automatic gzipping, CloudFront is a programmable cache. It gives us the tools to implement push-state client-side routing and to set up a proxy for your API requests to avoid CORS problems.

Client-side routing

There are many suggestions to use CloudFront’s “Custom Error Response” feature in order to achieve pretty push-state-style URLs. When CloudFront receives a request to /login it will dutifully forward that request to your S3 origin. S3, remember, knows nothing about any file called login so it responds with a 404. With a Custom Error Response, CloudFront can be configured to transform that 404 NOT FOUND into a 200 OK where the content is from index.html. That’s exactly what we need for client-side routing!

The Custom Error Response method works well, but it has a drawback. It turns all 404s into 200s with index.html for the body. That isn’t a problem yet, but we’re about to set up our API so it is accessible at mysite.com/api/* (in the next section). It can cause some confusing bugs if your API’s 404 responses are being silently rewritten into 200s with an HTML body!

If you don’t need to talk to any APIs or don’t care to side-step the CORS issues by proxying /api/* to another server, the Custom Error Response method is simpler to set up. Otherwise, we can use Lambda@Edge to rewrite our URLs instead.

Lambda@Edge gives us hooks where we can step in and change the behavior of the CloudFront distribution. The one we’ll need is “Origin Request”, which fires when a request is about to be sent to the S3 origin.

We’ll make some assumptions about the routes in our SPA.

  1. Any request with an extension (eg, styles/app.css, vendor.js, or imgs/logo.png) is an asset and not a client-side route. That means it’s actually backed by a file in S3.
  2. A request without an extension is a SPA client-side route path. That means we should respond with the content from index.html.

If those assumptions aren’t true for your app, you’ll need to adjust the code in the Lambda accordingly. For the rest of us, we can write a Lambda that says “if the request doesn’t have an extension, rewrite it to go to index.html instead”.
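A minimal Node.js sketch of that Lambda, assuming the standard CloudFront origin-request event shape:

    'use strict';

    exports.handler = (event, context, callback) => {
      const request = event.Records[0].cf.request;

      // No file extension? Treat it as a client-side route and serve the SPA shell.
      if (!/\.[a-zA-Z0-9]+$/.test(request.uri)) {
        request.uri = '/index.html';
      }

      callback(null, request);
    };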

Make a new Node.js Lambda, and copy that code into it. At this time, in order to be used with CloudFront, your Lambda must be deployed to the us-east-1 region. Additionally, you’ll have to click “Publish a new version” on the Lambda page. An unpublished Lambda cannot be used with Lambda@Edge.

Copy the ARN at the top of the page and paste it into the “Lambda function associations” section of your S3 origin’s Behavior. This is what tells CloudFront to call your Lambda when an Origin Request occurs.

Et voilà! You now have pretty SPA URLs for client-side routing.

Sidestep CORS Headaches

A single CloudFront “distribution” (that’s the name for the cache rules for a domain) can forward requests to multiple servers, which CloudFront calls “Origins”. So far, we only have one: the S3 bucket. In order to have CloudFront forward our API requests, we’ll add another origin that points at our API server.

Probably, you want to set up this origin with minimal or no caching. Be sure to forward all headers and cookies as well. We’re not really using any of CloudFront’s caching capabilities for the API server. Rather, we’re treating it like a reverse proxy.

At this point you have two origins set up: the original one for S3 and the new one for your API. Now we need to set up the “Behavior” for the distribution. This controls which origin responds to which path.

Choose /api/* as the Path Pattern to go to your API. All other requests will hit the S3 origin. If you need to communicate with multiple API servers, set up a different path prefix for each one.

CloudFront is now serving the same purpose as the webpack-dev-server proxy. Both frontend and API endpoints are available on the same mysite.com domain, so we’ll have zero issues with CORS.

Cache-busting on Deployment

The CloudFront cache makes our sites load faster, but it can cause problems too. When you deploy a new version of your site, the cache might continue to serve an old version for 10-20 minutes.

I like to include a step in my continuous integration deploy to bust the cache, ensuring that new versions of my asset files are picked up right away. Using the AWS CLI (with a placeholder distribution ID), this looks like:
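    aws cloudfront create-invalidation \
      --distribution-id E1EXAMPLE2DIST \
      --paths "/*"

Invalidating with the "/*" wildcard counts as a single invalidation path for billing purposes and is the simplest way to make sure a new deploy is visible immediately.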

About the Author

Brian Schiller (@bgschiller) is a Senior Software Engineer at Devetry in Denver. He especially enjoys teaching, and leads Code Forward, a free coding bootcamp sponsored by Devetry. You can read more of his writing at brianschiller.com.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Previously, she was a principal site reliability engineer at RealSelf and developed cookbooks to simplify building and managing infrastructure at Chef. Jennifer is the coauthor of Effective DevOps and speaks about DevOps, tech culture, and monitoring. She also gives tutorials on a variety of technical topics. When she’s not working, she enjoys learning to make things and spending quality time with her family.


Auditing Bitbucket Server Data for Credentials in AWS

02. December 2018

This article was originally published on the Sourced blog.

Introduction

Secrets management in public cloud environments continues to be a challenge for many organisations as they embrace the power of programmable infrastructure and the consumption of API-based services. All too often reputable companies will feature in the news, having fallen victim to security breaches or costly cloud resource provisioning through the accidental disclosure of passwords, API tokens or private keys.

Whilst the introduction of cloud-native services such as AWS Secrets Manager or third-party solutions like HashiCorp Vault provide more effective handling for this type of data, the nature of version control systems such as git provides a unique challenge in that the contents of old commits may contain valid secrets that could still be discovered and abused.

We’ve been engaged at a customer who has a large whole-of-business ‘self-service’ cloud platform in AWS, where deployments are driven by an infrastructure-as-code pipeline with code stored in git repositories hosted on an Atlassian Bitbucket server. Part of my work included identifying common, unencrypted secrets in the organisation’s git repositories and providing the business units responsible with a way to easily identify and remediate these exposures.

Due to client constraints in time and resourcing, we developed a solution that leveraged our existing tooling, as well as appropriate community-developed utilities, to quickly and efficiently meet our customer’s requirements whilst minimising operational overhead.

In this blog post, we’ll walk through the components involved in allowing us to visualise these particular security issues and work to drive them towards zero exposure across the organisation.

Understanding Bitbucket Server

As mentioned above, our client leverages an AWS deployed instance of Atlassian Bitbucket Server to store and manage their git repositories across the group.

From the application side, the Bitbucket platform contains the following data, which continues to grow every day:

  • 100GB+ of git repository data
  • 1300+ repositories with more than 9000 branches
  • 200,000+ commits in just the master branches alone

As part of this deployment, EBS and RDS snapshots are created on a schedule to ensure that a point-in-time backup of the application is available ensuring that the service can be redeployed in the event of a failure, or to test software upgrades to the software against production-grade data.

When these snapshots are created, a tag containing a timestamp is created that allows quick identification of the most recent backup of the service by both humans and automated processes.

Auditing git repositories

When it comes to inspecting git repositories for unwanted data, one of the challenges is that it involves inspecting every commit in every branch of every repository.

Even though the Bitbucket platform provides a search capability in the web interface, it is limited: it can only search for certain patterns in the master branch of a repository, in files smaller than a set size. In addition, the search API is private and is not advocated for non-GUI use, further emphasised by the fact that it returns its results in HTML format.

Another challenge that we encountered was that heavy use of the code search API resulted in impaired performance on the Bitbucket server itself.

As such, we looked to the community to see what other tools might exist to help us sift through the data and identify any issues. During our search, we identified a number of different tools, each with their own capabilities and limitations. Each of these is worthy of a mention and is detailed below:

After trying each of these tools out and understanding their capabilities, we ended up selecting gitleaks for our use.

The primary reasons for its selection include:

  • It is an open source security scanner for git repositories, actively maintained by Zach Rice;
  • Written in Go, gitleaks provides a fast individual-repository scanning capability and comes with a set of pre-defined secrets identification patterns; and
  • It functions by parsing the contents of a cloned copy of a git repository on the local machine, which it then uses to examine all files and commits, returning the results in a JSON output file for later use.

The below example shows the output of gitleaks being run against a sample repository called “secretsdummy” that contains an unencrypted RSA private key file.

As you can see, gitleaks detects it in a number of commits and returns the results in JSON format to the output file /tmp/secretsdummy_leaks.json for later use.

Read the rest of this article on the Sourced blog.


Amazon Machine Learning: Super Simple Supervised Learning

01. December 2018

Introduction

Machine learning is a big topic. It’s full of math, white papers, open source libraries, and algorithms. And worse, PhDs. If you simply want to predict an outcome based on your historical data, it can feel overwhelming.

What if you want to predict customer churn (when a customer will stop using your service) so that you can reach out to them before they decide to leave? Or what if you want to predict when one of hundreds or thousands of remote devices will fail? You need some kind of mathematical construct, called a “model,” which you will feed data and in return, receive predictions.

You could break out the statistics textbook and start thinking about what algorithm to use. Or you can choose a technology that lets you quickly apply machine learning to a broad set of scenarios: Amazon Machine Learning (AML).

AML

Amazon Web Services (AWS) offers Amazon Machine Learning, which lets you build a simplified machine learning (ML) system. AML makes it very easy to create multiple models, evaluate the models, and make predictions. AML is a PaaS solution and is a building block of an application, rather than an entire application itself. It should be incorporated into an existing application, using AML predictions to make the application “smarter”.

ML systems perform either supervised or unsupervised learning. With supervised learning, correct answers are provided as part of the input data for the model. With unsupervised learning, the algorithm used teases out innate structure in the input data without any help as to what is the correct answer.

AML is supervised machine learning. To build a model, AML needs input data with both the values that will help predict the outcome and values of that outcome. The outcome variable is called the “target variable”. AML needs both so the machine learning algorithm can tease out the relationships and learn how to predict the target variable. Such data is called training data.

For example, if you are trying to predict the winner of a baseball game, you might provide input data such as who was playing each position, the weather, the location of the game and other information. The target variable would be a boolean value–true for a home team win, false for a visiting team win. To use AML to solve this problem, you’d have to provide a data set with all of the input variables and also the results of previous games. Then, once the model was built, you provide all of the input values except the target variable (called an “observation”) and get a predicted value for the winner.

In addition, AML has the following features:

  • It works with structured text data. It supports only CSV at present.
  • Input data can be strings, numbers, booleans or categorical (one of N) values.
  • Target variable types can be numbers, booleans, or categorical values.
  • There’s little to no coding needed to experiment with AML.
  • You don’t need machine learning experience to use AML and get useful predictions.
  • AML is a pay as you go service; you only pay for what you use.
  • It is a hosted service. You don’t have to run any servers to use AML.

In order to make machine learning simple to use, AML limits the configurability of the system. It also has other limits, as mentioned below. AML is a great solution when you have CSV data that you want to make predictions against. Examples of problems for which  AML would be a good solution include:

  • Is this customer about to churn/leave?
  • Does this machine need service?
  • Should I send this customer a special offer?

AML is not a general purpose machine learning toolkit. Some of the constraints on any system built on AML include:

  • AML is a “medium” data solution, rather than big data. If you have hundreds of gigs of data (or less), AML will work.
  • The model that is created is housed completely within the AML system. While you can access it to generate predictions, you can’t examine the mathematical makeup of the model (for example, the weights of the features). It is also not possible to export the model to run on any other system (for example in a different cloud or on premise).
  • AML only supports the four input types mentioned above: strings, numbers, booleans or categorical (one of N values). Target variables can only be a number, boolean, or categorical value–the data type of the target variable determines the type of model (regression models for numeric target variables, binary classification models for boolean target variables, and multi-class classification models for categorical target variables).
  • AML is currently only available in two AWS regions: northern Virginia and Ireland.
  • While you can tweak some settings, there is only one algorithm for each predicted value data type. The only optimization technique available is stochastic gradient descent.
  • It can only be used for supervised prediction, not for clustering, recommendations or other kinds of machine learning.

Examples of problems for which AML will not be a good fit include:

  • Is this a picture of a dog or a cat?
  • What are the multi dimensional clusters of this data?
  • Given this user’s purchase history, what other products would they like?

Ethics

Before diving into making predictions, it’s worth discussing the ethics of machine learning. Models make predictions that have real-world consequences. When you are involved in building such systems, you must think about the ramifications. In particular, think about the bias in your training data. If you are working on a project that will be rolled out across a broad population, make sure your training data is representative of that population.

In addition, it’s worth thinking about how your model will be used. (This framework is pulled from the excellent “Weapons of Math Destruction” by Cathy O’Neil). Consider:

  • Opacity: How often is it updated? Is the data source available to all people affected by the model?
  • Scale: How many people will this system affect, now or in the future?
  • Damage: What kind of decisions are being made with this model? Deciding whether to show someone an ad has far fewer ramifications than deciding whether someone is a good credit risk or not.

Even more than software developers, people developing ML models need to consider the ethics of the systems they build. Software developers build tools that humans use, whereas ML models affect human beings, often without their knowledge.

Think about what you are building.

The Data Pipeline

An AML process can be thought of like a pipeline. You push data in on one end, build certain constructs that the AML system leverages, and eventually, you get predictions out on the other end. The first steps for most ML problems are to determine the question you are trying to answer and to locate the requisite data. This article won’t discuss these efforts, other than to note that garbage in, garbage out applies to ML just as much as it does to other types of data processing. Make sure you have good data, plenty of it, and know what kind of predictions you want to make before building an AML system.

All these AML operations can either be done via the AWS console or the AWS API. For initial exploration, the console is preferable; it’s easier to understand and requires no coding. For production use, you should use the API to build a repeatable system. All the data and scripts mentioned below are freely available on Github (https://github.com/mooreds/amazonmachinelearning-anintroduction) and can serve as a base for your own data processing scripts.

The Data Pipeline: Load the data

When you are starting out with AML, you need to make your data available to the AML system in CSV format. It also must be in a location accessible to AML.

For this post, I’m going to use data provided by UCI. In particular, I’m going to use census data that was collected in the 1990s and includes information like the age of the person, their marital status, and their educational level. This is called the ‘adult’ data set. The target variable will be whether or not the user makes more or less than $50,000 per year. This data set has about 20k records. Here is some sample data:

Note that this dataset is a bit atypical in that it has only training data. There are no observations (input data without the target variable) available to me. So I can train a model, but won’t have any data to make predictions. In order to fully show the power of AML, I’m going to split the dataset into two parts as part of the prep:

  • training data which includes the target variable and which will be used to build the model.
  • observations, which will be passed to the model to obtain predictions. These observations will not include the target variable.

For real world problems you’ll want to make sure you have a steady stream of observations on which to make predictions, and your prep script won’t need to split the initial dataset.

I also need to transform this dataset into an AML compatible format and load it up to S3. A script will help with the first task. This script will turn the <=50K and >50K values into boolean values that AML can process. It will also prepend the header row for easier variable identification later in the process. Full code is available here: https://github.com/mooreds/amazonmachinelearning-anintroduction/blob/master/dataprep/adult.py
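A stripped-down sketch of what that prep step does (the real script linked above is the reference; the file names and the 90/10 split here are assumptions):

    import csv

    HEADER = ["age", "workclass", "fnlwgt", "education", "education-num",
              "marital-status", "occupation", "relationship", "race", "sex",
              "capital-gain", "capital-loss", "hours-per-week",
              "native-country", "income"]

    with open("adult.data") as src, \
         open("training.csv", "w", newline="") as train, \
         open("observations.csv", "w", newline="") as observe:
        train_out = csv.writer(train)
        observe_out = csv.writer(observe)
        train_out.writerow(HEADER)
        observe_out.writerow(HEADER[:-1])
        for i, row in enumerate(csv.reader(src)):
            if not row:
                continue
            row = [field.strip() for field in row]
            # Turn the <=50K / >50K label into a boolean value AML can use.
            row[-1] = "1" if row[-1].startswith(">50K") else "0"
            # Hold some rows back as observations (no target variable).
            if i % 10 == 0:
                observe_out.writerow(row[:-1])
            else:
                train_out.writerow(row)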

Running that script yields the following training data (the last value is the target variable, which the model will predict):

It also provides the following observation data, with the target variable removed:

This prep script is a key part of the process and can execute anywhere and in any language. Other kinds of transformations that are best done in a prep script:

  • Converting non-CSV format (JSON, XML) data to CSV.
  • Turning date strings into offsets from a canonical date.
  • Removing personally identifiable information.

The example prep script is python that runs synchronously, but only processes thousands of records. Depending on the scale of your data, you may need to consider other solutions to transform your source data into CSV, such as Hadoop or Spark.

After I have the data in CSV format, I can upload it to S3. AML can also read from AWS RDS and Redshift, using a SQL query as the prep script. (Note that you can’t use AWS RDS as a data source via the console, only via the API.)

The Data Pipeline: Create the Datasource

Once the CSV file is on S3, you need to build AML specific objects. These all have their own identity and are independent of the CSV data. First you need to create the AML data source.

You point the AML data source at the data on S3, and specify a schema that maps each field to one of the four supported data types. You also select a target variable (if the data source contains the variable you want to predict) and a row identifier (if each row has a unique ID that should be carried through the process). If you are doing this often, or you want a repeatable process, you can store the schema as JSON and provide it via the API.
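Creating the data source through the API looks roughly like this with boto3 (the IDs, bucket, and schema file name are placeholders):

    import boto3

    ml = boto3.client("machinelearning")

    with open("adult.csv.schema") as f:           # schema JSON, stored with the code
        schema = f.read()

    ml.create_data_source_from_s3(
        DataSourceId="ds-income-training",        # hypothetical ID
        DataSourceName="income training data",
        DataSpec={
            "DataLocationS3": "s3://my-aml-bucket/adult/training.csv",
            "DataSchema": schema,
        },
        ComputeStatistics=True,                   # required if a model will train from it
    )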

The schema file for the income prediction model is a JSON document along these lines (abridged, with only a handful of the attributes shown):
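    {
      "version": "1.0",
      "targetAttributeName": "income",
      "dataFormat": "CSV",
      "dataFileContainsHeader": true,
      "attributes": [
        { "attributeName": "age", "attributeType": "NUMERIC" },
        { "attributeName": "workclass", "attributeType": "CATEGORICAL" },
        { "attributeName": "education-num", "attributeType": "NUMERIC" },
        { "attributeName": "sex", "attributeType": "CATEGORICAL" },
        { "attributeName": "income", "attributeType": "BINARY" }
      ]
    }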

You can see that I specify the target attribute, the data file format, and a list of attributes with a name and a data type. (Full schema file here.)

You can create multiple different data sources off of the same data, and you only need read access to the S3 location. You can also add arbitrary string tags to the data source; for example, date_created or author. The first ten tags you add to a data source will be inherited by other AML entities, like models or evaluations, that are created from it. As your models proliferate, tags are a good way to organize them.

Finally, when the data source is created, you’ll receive statistics about the data set, including histograms of the various variables, counts of missing values, and the distribution of your target variable. Here’s an example of target variable distribution for the adult data set I am using:

Data insights can be useful in determining if your data is incomplete or nonsensical. For example, if I had 15,000 records but only five of them had an income greater than $50,000, trying to predict that value wouldn’t make much sense. There simply isn’t a valid distribution of the target variable, and my model would be skewed heavily toward whatever attributes those five records had. This type of data intuition is only gained through working with your dataset.

The Data Pipeline: Create the Model

Once you have the AML data source created, you can create a model.

An AML model is an opaque, non-exportable representation of your data, which is built using the stochastic gradient descent optimization technique. There are configuration parameters you can tweak, but AML provides sensible defaults based on your data. These parameters are an area for experimentation.

Also, a “recipe” is required to build a model. Using a recipe, you can transform your data before the model building algorithm accesses it, without modifying the source data. Recipes can also create intermediate variables which can be fed into the model, group variables together for easy transformation and exclude source variables. There are many transformations that you can transparently perform on the data, including:

  • Lowercasing strings
  • Removing punctuation
  • Normalizing numeric values
  • Binning numeric values
  • And more

Note that if you need to perform a different type of transformation (such as converting a boolean value to an AML compatible format), you’ll have to do it as part of the prep script. There is no way to transform data in a recipe other than using the provided transformations.

If you are using the API, the recipe is a JSON file that you can store outside of the AML pipeline and provide when creating a model.
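With boto3, creating the model looks something like the following (the IDs and file name are placeholders; the recipe file is the JSON shown next):

    import boto3

    ml = boto3.client("machinelearning")

    with open("income.recipe.json") as f:       # recipe JSON, kept in version control
        recipe = f.read()

    ml.create_ml_model(
        MLModelId="ml-income-v1",               # hypothetical IDs
        MLModelName="income prediction v1",
        MLModelType="BINARY",                   # boolean target variable
        Parameters={"sgd.maxPasses": "10"},     # optional tuning knob
        TrainingDataSourceId="ds-income-training",
        Recipe=recipe,
    )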

A recipe for this income prediction dataset looks something like this (abridged):
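    {
      "groups": {
        "NUMERIC_VARS_QB_10": "group('age','education-num','capital-gain','capital-loss','hours-per-week')"
      },
      "assignments": {},
      "outputs": [
        "ALL_CATEGORICAL",
        "ALL_BINARY",
        "quantile_bin(NUMERIC_VARS_QB_10, 10)"
      ]
    }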

Groups are a way of grouping different variables (defined in the schema) together so that operations can be applied to them en masse. For example, NUMERIC_VARS_QB_10 is a group of continuous numeric variables that are binned into 10 separate bins (turning the numeric variables into categorical variables).

Assignments let you create intermediate variables. I didn’t use that capability here.

Outputs are the list of variables that the model will see and operate on. In this case, ALL_CATEGORICAL and ALL_BINARY are shortcuts referring to all of those types of input variables. If you remove a variable from the outputs clause, the model will ignore the variable.

In the same way that you have multiple different data sources from the same data, you can create multiple models based on the same data source. You can tweak the parameters and the recipe to build different models. You can then compare those models and test to see which is most accurate.

But how do you test for accuracy?

The Data Pipeline: Evaluate and Use the Model

When you have an AML model, there are three operations you can perform.

The first is model evaluation. When you are training a model, you can optionally hold back some of the training data (which has the real-world target variable values). This is called evaluation data. After you build the model, you can run this data through it, stripping off the target variable, and get the model’s predictions. The system then compares the predicted values with the correct answers across all the evaluation data, which gives an indication of the accuracy.
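If you’re driving AML through the API, an evaluation is one more call, pointed at the held-back data source (IDs are placeholders):

    import boto3

    ml = boto3.client("machinelearning")

    ml.create_evaluation(
        EvaluationId="ev-income-v1",                   # hypothetical IDs
        EvaluationName="income model evaluation",
        MLModelId="ml-income-v1",
        EvaluationDataSourceId="ds-income-heldback",   # the held-back evaluation data
    )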

Here’s an example of an evaluation for the income prediction model that I built using the census data:

Depending on your model’s target variable, you will get different representations of this value, but fundamentally, you are asking how often the model was correct. There are two things to be aware of:

  • You won’t get 100% accuracy. If you see that, your model exactly matches the evaluation data, which means that it’s unlikely to match real world data. This is called overfitting.
  • Evaluation scores differ based on both the model and the data. If the data isn’t representative of the observations you’re going to be making, the model won’t be accurate.

For the adult dataset, which is a binary prediction model, we get something called the area under the curve (AUC). The closer the AUC is to 1, the better our model matched reality. Other types of target variables get other measures of accuracy.

You can also, with a model that has a boolean target variable, determine a cutoff point, called the scoreThreshold. The model will give a prediction between 0 and 1, and you can then determine where you want the results to be split between 1 (true) or 0 (false). Is it 0.5 (the default)? Or 0.9 (which will give you fewer false positives, where the model predicts the value is true, but reality says it’s not)? Or 0.1 (which will give you fewer false negatives, where the model predicts the value is false, but reality says it’s true)?  What you set this value to depends on the actions you’re going to take. If the action is inexpensive (showing someone an advertisement they may like), you may want to err on the side of fewer false negatives. If the opposite is true, and the action is expensive (having a maintenance tech visit a factory for proactive maintenance) you will want to set this higher.

Other target variable types don’t have the concept of a score threshold and may return different values. Below, you’ll see sample predictions for different types of target variables.

Evaluations are entirely optional. If you have some other means of determining the accuracy of your model, you don’t have to use AML’s evaluation process. But using the built-in evaluation settings lets you easily compare models and gives you a way to experiment when tweaking configuration and recipes.

Now that you’ve built the model and have a handle on its accuracy, you can make predictions. There are two types of predictions: batch and real time.

Batch Predictions

Batch predictions take place asynchronously. You create an AML data source pointing to your observations. The data format must be identical, and the target variable must be absent, but you can use any of the supported data source options (S3, RDS, Redshift). You can process millions of records at a time. You submit a job via the console or API. The job request includes the data source of the observations, the model ID and the prediction output location.

After you start the job, you need to poll AML until it is done. Some SDKs, including the python SDK, include code that will poll for you: https://github.com/mooreds/amazonmachinelearning-anintroduction/blob/master/prediction/batchpredict.py has some sample code.
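A bare-bones version of that flow in boto3, with placeholder IDs and bucket, and a simple polling loop instead of the linked helper:

    import time
    import boto3

    ml = boto3.client("machinelearning")

    ml.create_batch_prediction(
        BatchPredictionId="bp-income-2018-12",     # hypothetical IDs throughout
        BatchPredictionName="income batch prediction",
        MLModelId="ml-income-v1",
        BatchPredictionDataSourceId="ds-income-observations",
        OutputUri="s3://my-aml-bucket/batch-output/",
    )

    # Poll until the asynchronous job finishes.
    while True:
        status = ml.get_batch_prediction(
            BatchPredictionId="bp-income-2018-12"
        )["Status"]
        if status in ("COMPLETED", "FAILED"):
            break
        time.sleep(30)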

At job completion, the results will be placed in the specified S3 output bucket. If your observation data has a row identifier, that will be in the output file as well. Otherwise each input row will correspond to an output row based on line number (the first row of input will correspond to the first row of output, the second row of input to the second row of output, and so on).

Here’s sample output from a batch prediction job of the income prediction model:

You are given the bestAnswer, which is based on scoreThreshold above. But you’re also given the values calculated by the model.

For a multi-class classification, I am given all the values. Below, there were seven different classes (using a different data set based on wine characteristics, if you must know). The model predicts for line 1 that the value is mostly likely to be “6” with an 84% likelihood ( 8.404026E-1 is approximately 0.84 == 84%).

And for a numeric target variable (based on yet another dataset), I just get back the value predicted:

Batch predictions work well as part of a data pipeline when you don’t care about when you get your answers, just that you get them. (Any batch job that takes more than a week will be killed, so there is a time limit.) An example of a problem for which a batch job would be appropriate is scoring thousands of customers to see if any are likely to churn this month.

Real Time Predictions

Real time predictions are, as advertised, predictions that are synchronous. AML generally returns predictions within 100 milliseconds. You can set up a real time endpoint on any model in the console or with an API call. It can take a few minutes for the endpoint server to be ready. But you don’t have to maintain the endpoint in any way–it’s entirely managed by the AML service.

You provide one observation to the real time endpoint, and it will return you a prediction based on that observation. Here’s a sample observation for the income prediction model that we built above:

Predictions are made using an AWS API and return a data structure. Here’s the JSON that is returned when I call the income prediction model with the above observation:

The predictedLabel and predictedScores are the predicted values for this observation, and are what I am really interested in. The predictedLabel is calculated using the score threshold, but I still get the calculated value if that is useful to me.
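A sketch of the whole round trip with boto3 (the model ID and observation fields are assumptions; wait for the endpoint to report that it is ready before calling predict):

    import boto3

    ml = boto3.client("machinelearning")

    # One-off: create the real time endpoint (it takes a few minutes to become ready).
    endpoint = ml.create_realtime_endpoint(MLModelId="ml-income-v1")
    endpoint_url = endpoint["RealtimeEndpointInfo"]["EndpointUrl"]

    # A single observation; every value is passed as a string.
    record = {
        "age": "39",
        "education-num": "13",
        "hours-per-week": "40",
        "workclass": "State-gov",
        "sex": "Male",
    }

    result = ml.predict(
        MLModelId="ml-income-v1",
        Record=record,
        PredictEndpointUrl=endpoint_url,
    )

    prediction = result["Prediction"]
    print(prediction["predictedLabel"], prediction["predictedScores"])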

Real time predictions are the right choice when you have observations that require a prediction immediately. An example would be to choose what kind of ad to display to a user right now, based on existing data and their current behavior.

Now that you’ve seen the major constructs of the AML data pipeline, as well as some predictions that were made using an AML model, let’s cover some operational concerns.

Operational Concerns

Pricing

AML is a totally managed service. You pay for the data storage solutions (both for input and results), but you don’t pay for storage of any of the AML managed artifacts (like the model or data source). You also pay for the compute time to build your data sources, models, and evaluations.

For predictions, you pay per prediction. If you are running real time endpoints, you also pay per hour that the endpoint is up. For the model that I built using the census data, it was about $0.50 to process all 20k records and to make a few thousand predictions.

Full pricing information is available here: https://aws.amazon.com/aml/pricing/

The Model Creation Pipeline

AML Models are immutable, as are data sources. If you need to incorporate ongoing data into your model, which is generally a good idea, you need to automate your datasource and model building process so they are repeatable. Then, when you have new data, you can rebuild your model, test it out, and then change which model is “in production” and making predictions.

You can use tags to control which model is used for a given prediction, so you can end up building a CI pipeline and having a ‘production’ model and test models.

Here’s a simple example of a model update pipeline in one script: https://github.com/mooreds/amazonmachinelearning-anintroduction/blob/master/updatemodel/updatemodel.py

Permissions

Like any other AWS service, AML leverages the Identity and Access Management service (IAM). You can control access to data sources, models, and all other AML constructs with IAM. The full list of permissions is here: https://docs.aws.amazon.com/IAM/latest/UserGuide/list_amazonmachinelearning.html

It’s important to note that if you are using the AWS console to test drive AML, the console will set up the permissions correctly, but if you are using the API to construct a data pipeline, you will need to ensure that IAM access is set up correctly. I’ve found it useful to use the console first and then examine the permissions it sets up and leverage that for the scripts that use the API.

Monitoring

You can monitor AML processes via Amazon CloudWatch. Published metrics include the number of predictions and the number of failed predictions, per model. You can set up alarms in the typical CloudWatch fashion to take action on the metrics (for example, emailing someone if a new model is rolled to production but a large number of failed predictions ensues).

AWS ML Alternatives

There are many services within AWS that are complements to AML. These focus on a particular aspect of ML (computer vision, speech recognition) and include Rekognition and Lex.

AWS Sagemaker is a more general purpose machine learning service with many of the benefits of AML. It lets you use standard machine learning software like Jupyter notebooks, supports multiple algorithms, and lets you run your models locally.

If you are looking for even more control (with corresponding responsibility), there is a Deep Learning AMI available. This AMI comes preinstalled with a number of open source machine learning frameworks. You can use this AMI to boot up an EC2 instance and have full configuration and control.

Conclusion

Amazon Machine Learning makes it super simple to make predictions by creating a model to predict outcomes based on structured text data. AML can be used at all scales, from a few hundred records to millions—all without running any infrastructure. It is the perfect way to bring ML predictions into an existing system easily and inexpensively.

AML is a great way to gain experience with machine learning. There is little to no coding required, depending on what your source data looks like. It has configuration options but is really set up to “just work” with sane defaults.

AML helps you explore the world of machine learning while providing a robust production ready system to help make your applications smarter.

About the Author

Dan Moore is director of engineering at Culture Foundry. He is a developer with two decades of experience, former AWS trainer, and author of “Introduction to Amazon Machine Learning,” a video course from O’Reilly. He blogs on AML and other topics at http://www.mooreds.com/wordpress/ . You can find him on Twitter at @mooreds.

 

About the Editors

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Previously, she was a principal site reliability engineer at RealSelf and developed cookbooks to simplify building and managing infrastructure at Chef. Jennifer is the coauthor of Effective DevOps and speaks about DevOps, tech culture, and monitoring. She also gives tutorials on a variety of technical topics. When she’s not working, she enjoys learning to make things and spending quality time with her family.

John Varghese is a Cloud Steward at Intuit responsible for the AWS infrastructure of Intuit’s Futures Group. He runs the AWS Bay Area meetup in the San Francisco Peninsula Area for both beginners and intermediate AWS users. He has also organized multiple AWS Community Day events in the Bay Area. He runs a Slack channel just for AWS users. You can contact him there directly via Slack. He has a deep understanding of AWS solutions from both strategic and tactical perspectives. An avid AWS user since 2012, he evangelizes AWS and DevOps every chance he gets.


Welcome to AWS Advent 2018

22. November 2018

AWS Advent is returning shortly!

What is the AWS Advent event? Many technology platforms have started a yearly tradition for the month of December revealing an article per day written and edited by volunteers in the style of an advent calendar, a special calendar used to count the days in anticipation of Christmas starting on December 1. The AWS Advent event explores everything around the Amazon Web Services platform.

Examples of past AWS articles:

Please explore the rest of this site for more examples of past topics.

There are a large number of AWS services, and many that have never been covered on AWS advent in previous years. We’re looking for articles that range in audience level from beginners to experts in AWS. Introductory, security, architecture, and design patterns with any of the AWS services are welcome topics.

Interested in being part of AWS Advent 2018?

Important Dates

  • Authors rolling acceptance – November 1, 2018
  • Submissions are accepted until the advent calendar is complete. Submissions are still being accepted!
  • Final drafts due – 12:00am November 30, 2018
  • Final article due 3 days prior to publishing to the site.

Thank you, and we look forward to a great AWS Advent in 2018!

Jennifer Davis, @sigje

John Varghese, @jvusa


When the Angry CFO Comes Calling: AWS Cost Control

24. December 2016

Author: Corey Quinn
Editors: Jesse Davis

Controlling costs in AWS is a deceptively complex topic — as anyone who’s ever gone over an AWS billing statement is sadly aware. Individual cost items in Amazon’s cloud environments seem so trivial – 13¢ an hour for an EC2 instance, 5¢ a month for a few files in an S3 bucket – until, before you realize it, you’re potentially spending tens of thousands of dollars on your AWS infrastructure, and your CFO is turning fascinating shades of purple. It’s hard to concentrate on your work over the screaming, so let’s take a look at fixing that.

There are three tiers of cost control to consider with respect to AWS.

First Tier

The first and simplest tier is to look at your utilization. Intelligent use of Reserved Instances, ensuring that you’re sizing your instances appropriately, validating that you’re aware of what’s running in your environment – all of these can unlock significant savings at scale, and there are a number of good ways to expose this data. Cloudability, CloudDyn, CloudCheckr, and other services expose this information, as does Amazon’s own Trusted Advisor – if you’ve opted to pay for either AWS’s Business or Enterprise support tiers. Along this axis, Amazon also offers significant discounting once you’re in a position where signing an Enterprise Agreement makes sense.

Beware: here be dragons! Reserved Instances come in both 1 and 3 year variants – and the latter is almost always inappropriate. By locking in pricing for specific instance types, you’re opting out of three years of AWS price reductions – as well as generational improvements in instances. If Amazon releases an instance class that’s more appropriate for your workload eight months after your purchase of a 3 year RI, you get twenty-eight months of “sunk cost” before a wholesale migration to the new class becomes viable. As a rule of thumb, unless your accounting practices force you into a three year RI model, it’s best to pass them up; the opportunity cost doesn’t justify the (marginal) savings you get over one year reservations.

Second Tier

This is all well and good, but it only takes you so far. The second tier of cost control includes taking a deeper dive into how you’re using AWS’s services, while controlling for your business case. If you have a development environment that’s only used during the day, programmatically stopping it at night and starting it again the following morning can cut your costs almost in half– without upsetting the engineers, testers, and business units who rely on that environment.
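A minimal sketch of that idea with boto3, assuming the development instances carry an environment=dev tag (in practice you’d trigger this from a scheduled Lambda or cron job, with a matching start script for the morning):

    import boto3

    ec2 = boto3.client("ec2")

    # Find running instances tagged as the development environment.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:environment", "Values": ["dev"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    instance_ids = [i["InstanceId"]
                    for r in reservations
                    for i in r["Instances"]]

    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)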

Another example of this is intelligent use of Spot Instances or Spot Fleets. This requires a bit of a deep dive into your environment to determine a few things, including what your workload requirements are, how you’ve structured your applications to respond to instances joining or leaving your environment at uncontrolled times, and the amount of engineering effort required to get into a place where this approach will work for you. That said, if you’re able to leverage Spot fleets, it unlocks the potential for massive cost savings– north of 70% is not uncommon.

Third Tier

The third tier of cost control requires digging into the nature of how your application interacts with AWS resources. This is highly site-specific, and requires an in-depth awareness of both your application and how AWS works. Choosing Aurora because it “looks awesome for this use case” without paying attention to your IOPS can result in a surprise bill for tens of thousands of dollars per month – a most unwelcome surprise for most companies! Understanding not only how AWS works on the fringes, but also what your application is doing, becomes important.

Depending upon where you’re starting from, reducing your annual AWS bill by more than half is feasible. Amazon offers many opportunities to save money; your application architecture invariably offers many more. By tweaking these together, you can realize the kind of savings that both you and your CFO’s rising blood pressure can enjoy.

About the Author

Principal at The Quinn Advisory Group, Corey Quinn has a history as an engineering manager, public speaker, and advocate for cloud strategies which speak to company culture. He specializes in helping companies control and optimize their AWS cloud footprint without disrupting the engineers using it. He lives in San Francisco with his wife, two dogs, and as of May 2017 his first child.

 


Securing Machine access in AWS

23. December 2016

Authors: Richard Ortenberg and Aren Sandersen

Hosting your infrastructure in AWS can provide numerous operational benefits, but can also result in weakened security if you’re not careful. AWS uses a shared responsibility model in which Amazon and its customers are jointly responsible for securing their cloud infrastructure. Even with Amazon’s protections, the number of attack vectors in a poorly secured cloud system is practically too high to count: password lists get dumped, private SSH keys get checked in to GitHub, former employees reuse old credentials, current employees fall victim to spear-phishing, and so on. The most critical first steps that an organization can take towards better security in AWS are putting its infrastructure behind a VPN or bastion host and improving its user host access system.

The Edge

A bastion host (or jump box) is a specific host that provides the only means of access to the rest of your hosts. A VPN, on the other hand, lets your computer into the remote network, allowing direct access to hosts. Both a VPN and bastion host have their strengths and weaknesses, but the main value they provide is funnelling all access through a single point. Using this point of entry (or “edge”) to gain access to your production systems is an important security measure. If your endpoints are not behind an edge and are directly accessible on the internet, you’ll have multiple systems to patch in case of a zero-day and each server must be individually “hardened” against common attacks. With a VPN or bastion, your main concern is only hardening and securing the edge.

If you prefer to use a bastion host, Amazon provides an example of how to set one up: https://aws.amazon.com/blogs/security/how-to-record-ssh-sessions-established-through-a-bastion-host/

 

If you’d rather run a VPN, here are just a few of the more popular options:

  • Run the open-source version of OpenVPN which is available in many Linux distributions.
  • Use a prebuilt OpenVPN Access Server (AS) in the AWS Marketplace. This requires a small license fee but set up and configuration are much easier.
  • Use the Foxpass VPN in AWS Marketplace.

Two Factor Authentication

One of the most critical security measures you can implement next is to configure two-factor authentication (2FA) on your VPN or bastion host. Two-factor authentication requires that users enter a code or click a button on a device in their possession to verify a login attempt, making unauthorized access difficult.

Many two-factor systems use a smartphone-based service like Duo or Google Authenticator. Third-party devices like RSA keys and Yubikeys are also quite common. Even if a user’s password or SSH keys are compromised, it is much harder for an attacker to also gain access to the user’s physical device or phone. Additionally, these physical devices cannot be stolen remotely, shrinking that attack surface by orders of magnitude.

For 2FA, bastion hosts use a PAM plugin which both Duo and Google Authenticator provide. If you’re using a VPN, most have built-in support for two-factor authentication.

User Host Access

Finally, you need to make sure that your servers are correctly secured behind the edge. A newly-instantiated EC2 server is configured with a single user (usually ‘ec2-user’ or ‘ubuntu’) and a single public SSH key. If multiple people need to access the server, however, then you need a better solution than sharing the private key amongst the team. Sharing a private key is akin to sharing a password to the most important parts of your infrastructure.

Instead, each user should generate their own SSH key pair, keeping the private half on their machine and installing the public half on servers which they need access to.

From easy to more complex here are three mechanisms to improve user access:

  • Add everyone’s public keys to the /home/ec2-user/.ssh/authorized_keys file. Now each person’s access can be revoked independently of the other users.
  • Create several role accounts (e.g. ‘rwuser’ and ‘rouser’, with read/write and read-only permissions, respectively) and install users’ public keys into each role’s authorized_keys file as appropriate.
  • Create individual user accounts on each host. Now you have the ability to manage permissions separately for each user.

Best practice is to use either infrastructure automation tools (e.g. Chef, Puppet, Ansible, Salt) or an LDAP-based system (e.g. Foxpass) to create and manage the above-mentioned accounts, keys, and permissions.

Summary

There are many benefits to hosting your infrastructure in AWS. Don’t just depend on Amazon or other third parties to protect your infrastructure. Set up a VPN or bastion, patch your vulnerable systems as soon as possible, turn on 2FA, and implement a user access strategy that is more complex than just sharing a password or an SSH key.

About the Authors:

Richard Ortenberg is currently a software developer at Foxpass. He is a former member of the CloudFormation team at AWS.

Aren Sandersen is the founder of Foxpass. Previously he has run engineering, operations and IT teams at Pinterest, Bebo, and Oodle.

 


Getting Started with CodeDeploy

22. December 2016 2016 0

Author: Craig Bruce
Editors: Alfredo Cambera, Chris Castle

Introduction

When running a website, you need a way to deploy your application code in a repeatable, reliable, and scalable fashion.

CodeDeploy is part of the AWS Developer Tools family, which includes CodeCommit, CodePipeline, and the AWS Command Line Interface. CodeCommit provides managed Git repositories, CodePipeline is a service to help you build, test, and automate releases, and the AWS Command Line Interface is your best friend for accessing the API interactively. They do integrate with each other; more on that later.

Concepts

Let’s start with the CodeDeploy specific terminology:

  • Application – A unique identifier to tie your deployment group, revisions and deployments together.
  • Deployment Group – The instances you wish to deploy to.
  • Revision – Your application code that you wish to deploy.
  • Deployment – Deploy a specific revision to a specific deployment group.
  • CodeDeploy service – The managed service from AWS which oversees everything.
  • CodeDeploy agent – The agent you install on your instances for them to check in with the CodeDeploy service.

Getting up and running is straightforward. Install the CodeDeploy agent onto your EC2 instances and then head to the AWS Management Console to create an application and a deployment group. You associate your deployment group with an Auto Scaling group or EC2 tag(s). One of the newer features is that on-premises instances are now supported as well. As these resources are outside of AWS’s view, you need to register them with the CodeDeploy service before it can deploy to them, and they must have access to the public AWS API endpoints to communicate with the CodeDeploy service. This offers some really interesting options for hybrid deployment (deploying to both your EC2 and on-premises resources); not many AWS services support non-AWS resources at all. CodeDeploy is now aware of the instances which belong to a deployment group, and whenever you request a deployment it will update them all.

Your revision is essentially a compressed file of your source code with one extra file, appspec.yml, which the CodeDeploy agent uses to unpack your files and optionally run any lifecycle event hooks you may have specified. Let’s say you need to tidy up some files before a deployment. For a Python web application, you may want to remove those pesky *.pyc files: define a lifecycle event hook to delete them before you unpack your new code.
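As a rough sketch (not taken from the CodeDeploy docs), the cleanup for that example could be a small script referenced from the hooks section of your appspec.yml, for instance under the BeforeInstall event; the application path below is a placeholder you would replace with wherever your revision gets installed.

```python
#!/usr/bin/env python
"""Example BeforeInstall hook: remove stale *.pyc files before the new revision lands."""
import os

APP_ROOT = '/var/www/myapp'  # placeholder: the directory your revision deploys into

for dirpath, _dirnames, filenames in os.walk(APP_ROOT):
    for name in filenames:
        if name.endswith('.pyc'):
            os.remove(os.path.join(dirpath, name))
```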

Upload your revision to S3 (or provide a commit ID from a GitHub repository, although not CodeCommit – more on this later), provide a deployment group, and CodeDeploy is away. Great job, your web application code has now been deployed to your instances.

Managing deployments

As is becoming increasingly common, AWS services are best when used with other AWS services; in this case, CloudWatch offers some new options via CloudWatch Alarms and CloudWatch Events. CloudWatch Alarms can be used to stop deployments. Say the CPU utilization on your instances goes over 75%: this can trigger an alarm, and CodeDeploy will stop any deployments on those instances. The deployment status will update to Stopped. This prevents deployments when there is an increased chance of a deployment problem.

Also new is the ability to add triggers to your deployment groups, powered by CloudWatch Events. An event could be “deployment succeeds”, at which point a message is sent to an SNS topic. A Lambda function subscribed to that topic could then post a Success! message to your deployment channel in Slack/HipChat. There are various events you can use: deployment start, stop, and failure, as well as individual instance states like start, failure, or success. Be aware of noisy notifications though: you probably don’t want to know about every instance in every deployment. Plus, just like AWS, Slack/HipChat can throttle you for posting too many messages in a short period.
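As a sketch of that notification path, the Lambda function subscribed to the SNS topic might look something like the snippet below; the webhook URL is a placeholder, and the fields read out of the notification are assumptions to verify against a real CodeDeploy message.

```python
"""Sketch: relay CodeDeploy trigger notifications (delivered via SNS) to a chat webhook."""
import json
import urllib.request

WEBHOOK_URL = 'https://hooks.slack.com/services/REPLACE/ME'  # placeholder


def handler(event, context):
    for record in event.get('Records', []):
        # SNS hands the CodeDeploy notification to Lambda as a JSON string.
        notification = json.loads(record['Sns']['Message'])
        # Field names here are assumptions; inspect a real notification to confirm them.
        text = 'CodeDeploy deployment {} is now {}'.format(
            notification.get('deploymentId', 'unknown'),
            notification.get('status', 'unknown'),
        )
        payload = json.dumps({'text': text}).encode('utf-8')
        request = urllib.request.Request(
            WEBHOOK_URL,
            data=payload,
            headers={'Content-Type': 'application/json'},
        )
        urllib.request.urlopen(request)
```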

Deployments do not always go smoothly, and if there is a problem the quickest way to restore service is to revert to the last known good revision, typically the previous one. Rollbacks have now been added and can be triggered in two ways: first, by rolling back if the new deployment fails; second, by rolling back if a CloudWatch Alarm is triggered. For example, if CPU usage is over 90% for 5 minutes after a deployment, automatically roll the deployment back. In either case you want to know a rollback action was performed – handy that your deployment groups now support notifications.

Integration with CodePipeline

Currently CodeCommit is not a supported entry point for CodeDeploy: you provide your revision via an object in S3 or a commit ID in a GitHub repository. You can, however, use CodeCommit as the source action for CodePipeline; behind the scenes it drops the revision into S3 for you before passing it on to CodeDeploy. So you can build a pipeline in CodePipeline that uses CodeCommit and CodeDeploy actions. Once you have a pipeline, you can add further actions as well, such as integration with your CI/CD system.

Conclusion

CodeDeploy is a straightforward service to set up and a valuable tool in your DevOps toolbox. Recent updates make it easier to get notified about the status of your deployments, to avoid deploying while alarms are triggered, and to roll back automatically if there is an issue with a new deployment. Best of all, use of CodeDeploy for EC2 instances is free; you only pay for storing your revisions in S3 (so not very much at all). If you were undecided about CodeDeploy, try it today!

Notes

Read the excellent CodeDeploy documentation to learn about all the fine details. Three features were highlighted in this post (deployment triggers, CloudWatch Alarm integration, and automatic rollbacks), and the documentation covers each of them in more depth.

If you are new to CodeDeploy then follow the Getting Started guide to setup your IAM access and issue your first deployment.

About the Author

Dr. Craig Bruce is a Scientific Software Development Manager at OpenEye Scientific. He is responsible for the DevOps group working on Orion, a cloud-native platform for early-stage drug discovery which is enabling chemists to design new medicines. Orion is optimized for AWS and Craig is an AWS Certified DevOps Engineer.

About the Editors

Alfredo Cambera is a Venezuelan outdoorsman, passionate about DevOps, AWS, automation, Data Visualization, Python and open source technologies. He works as Senior Operations Engineer for a company that offers Mobile Engagement Solutions around the globe.

Chris Castle is a Delivery Manager within Accenture’s Technology Architecture practice. During his tenure, he has spent time with major financial services and media companies. He is currently involved in the creation of a compute request and deployment platform to enable migration of his client’s internal applications to AWS.


Paginating AWS API Results using the Boto3 Python SDK

21. December 2016 2016 0

Author: Doug Ireton

Boto3 is Amazon’s officially supported AWS SDK for Python. It’s the de facto way to interact with AWS via Python.

If you’ve used Boto3 to query AWS resources, you may have run into limits on how many resources a query to the specified AWS API will return, generally 50 or 100 results, although S3 will return up to 1000 results. The AWS APIs return “pages” of results. If you are trying to retrieve more than one “page” of results you will need to use a paginator to issue multiple API requests on your behalf.

Introduction

Boto3 provides Paginators to automatically issue multiple API requests to retrieve all the results (e.g. on an API call to EC2.DescribeInstances). Paginators are straightforward to use, but not all Boto3 services provide paginator support. For those services you’ll need to write your own paginator in Python.

In this post, I’ll show you how to retrieve all query results for Boto3 services which provide Pagination support, and I’ll show you how to write a custom paginator for services which don’t provide built-in pagination support.

Built-In Paginators

Most services in the Boto3 SDK provide Paginators. See S3 Paginators for example.

Once you determine you need to paginate your results, you’ll need to call the get_paginator() method.
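For example, a minimal sketch using the S3 client (the bucket name is a placeholder):

```python
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# The paginator issues as many API calls as needed, handling the tokens for you.
for page in paginator.paginate(Bucket='my-example-bucket'):  # placeholder bucket
    for obj in page.get('Contents', []):
        print(obj['Key'])
```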

How do I know I need a Paginator?

If you suspect you aren’t getting all the results from your Boto3 API call, there are a couple of ways to check. You can look in the AWS console (e.g. number of Running Instances), or run a query via the aws command-line interface.

Here’s an example of querying an S3 bucket via the AWS command-line. Boto3 will return the first 1000 S3 objects from the bucket, but since there are a total of 1002 objects, you’ll need to paginate.

Counting results using the AWS CLI

Here’s a boto3 example which, by default, will return the first 1000 objects from a given S3 bucket.
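A minimal version of that call, with a placeholder bucket name, might look like this:

```python
import boto3

s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='my-example-bucket')  # placeholder bucket

print(resp['KeyCount'])     # number of keys in this single response (at most 1000)
print(resp['IsTruncated'])  # True when more results are available
```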

Determining if the results are truncated

The S3 response dictionary provides some helpful properties, like IsTruncated, KeyCount, and MaxKeys which tell you if the results were truncated. If resp['IsTruncated'] is True, you know you’ll need to use a Paginator to return all the results.

Using Boto3’s Built-In Paginators

The Boto3 documentation provides a good overview of how to use the built-in paginators, so I won’t repeat it here.

If a given service has Paginators built in, they are documented in the Paginators section of the service docs, e.g. AutoScaling and EC2.

Determine if a method can be paginated

You can also verify if the boto3 service provides Paginators via the client.can_paginate() method.
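For instance, a quick check against a couple of clients might look like this:

```python
import boto3

s3 = boto3.client('s3')
ec2 = boto3.client('ec2')

print(s3.can_paginate('list_objects_v2'))      # True: a built-in paginator exists
print(ec2.can_paginate('describe_instances'))  # True: a built-in paginator exists
```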


So, that’s it for built-in paginators. In this section I showed you how to determine if your API results are being truncated, pointed you to Boto3’s excellent documentation on Paginators, and showed you how to use the can_paginate() method to verify if a given service method supports pagination.

If the Boto3 service you are using provides paginators, you should use them. They are tested and well documented. In the next section, I’ll show you how to write your own paginator.

How to Write Your Own Paginator

Some Boto3 services, such as AWS Config, don’t provide paginators. For these services, you will have to write your own paginator code in Python to retrieve all the query results. In this section, I’ll show you how.

You Might Need To Write Your Own Paginator If…

Some Boto3 SDK services aren’t as built-out as S3 or EC2. For example, the AWS Config service doesn’t provide paginators. The first clue is that the Boto3 AWS ConfigService docs don’t have a “Paginators” section.

The can_paginate Method

You can also ask the individual service client’s can_paginate method if it supports paginating. For example, here’s how to do that for the AWS config client. In the example below, we determine that the config service doesn’t support paginating for the get_compliance_details_by_config_rule method.
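A sketch of that check is below; note that the result reflects the SDK releases available when this was written, and newer Boto3 versions may have since added a paginator for this operation.

```python
import boto3

config = boto3.client('config')

# Prints False when no built-in paginator exists for this operation
# (the case for the SDK releases this post was written against).
print(config.can_paginate('get_compliance_details_by_config_rule'))
```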

Operation Not Pageable Error

If you try to paginate a method without a built-in paginator, you will get an error similar to this:
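Roughly, with an SDK release that lacks the paginator, the attempt and the resulting exception look like this (the exact wording may differ by version):

```python
import boto3

config = boto3.client('config')
config.get_paginator('get_compliance_details_by_config_rule')
# Raises something like:
#   botocore.exceptions.OperationNotPageableError:
#       Operation cannot be paginated: get_compliance_details_by_config_rule
```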

If you get an error like this, it’s time to roll up your sleeves and write your own paginator.

Writing a Paginator

Writing a paginator is fairly straightforward. When you call the AWS service API, it will return up to the maximum number of results, plus a long hex string token, next_token, if there are more results.

Approach

To create a paginator for this, you make calls to the service API in a loop until next_token is empty, collecting the results from each loop iteration in a list. At the end of the loop, you will have all the results in the list.

In the example code below, I’m calling the AWS Config service to get a list of resources (e.g. EC2 instances), which are not compliant with the required-tags Config rule.

As you read the example code below, it might help to read the Boto3 SDK docs for the get_compliance_details_by_config_rule method, especially the “Response Syntax” section.

Example Paginator
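The listing below is a sketch of such a paginator, following the walkthrough in the next two subsections. The required-tags rule name comes from the example scenario; the response parsing (EvaluationResults and the nested resource identifiers) follows the documented response shape but should be treated as an assumption to verify against your own account.

```python
"""Custom paginator sketch for AWS Config's get_compliance_details_by_config_rule."""
import boto3


def get_resources_from(compliance_details):
    """Return the current batch of resource IDs plus the next_token 'claim check'."""
    results = compliance_details.get('EvaluationResults', [])
    resources = [
        r['EvaluationResultIdentifier']['EvaluationResultQualifier']['ResourceId']
        for r in results
    ]
    next_token = compliance_details.get('NextToken', '')
    return resources, next_token


def main():
    config = boto3.client('config')
    next_token = ''   # empty on the first call; the API returns a new token per page
    resources = []    # accumulates every page of results

    while True:
        params = {
            'ConfigRuleName': 'required-tags',
            'ComplianceTypes': ['NON_COMPLIANT'],
            'Limit': 100,
        }
        if next_token:
            params['NextToken'] = next_token
        compliance_details = config.get_compliance_details_by_config_rule(**params)
        current_batch, next_token = get_resources_from(compliance_details)
        resources += current_batch
        if not next_token:
            break

    print(resources)


if __name__ == '__main__':
    main()
```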

Example Paginator – main() Method

In the example above, the main() method creates the config client and initializes the next_token variable. The resources list will hold the final results set.

The while loop is the heart of the paginating code. In each loop iteration, we call the get_compliance_details_by_config_rule method, passing next_token as a parameter. Again, next_token is a long hex string returned by the given AWS service API method. It’s our “claim check” for the next set of results.

Next, we extract the current_batch of AWS resources and the next_token string from the compliance_details dictionary returned by our API call.

Example Paginator – get_resources_from() Helper Method

The get_resources_from(compliance_details) helper method parses the compliance_details dictionary. It returns the current batch (up to 100 results) of resources and our next_token “claim check” so we can get the next page of results from config.get_compliance_details_by_config_rule().

I hope the example is helpful in writing your own custom paginator.


In this section on writing your own paginators I showed you a Boto3 documentation example of a service without built-in Paginator support. I discussed the can_paginate method and showed you the error you get if you call it on a method which doesn’t support pagination. Finally, I discussed an approach for writing a custom paginator in Python and showed a concrete example of a custom paginator which passes the NextToken “claim check” string to fetch the next page of results.

Summary

In this post, I covered paginating AWS API responses with the Boto3 SDK. Like most APIs (Twitter, GitHub, Atlassian, etc.), AWS paginates API responses over a set limit, generally 50 or 100 resources. Knowing how to paginate results is crucial when dealing with large AWS accounts which may contain thousands of resources.

I hope this post has taught you a bit about paginators and how to get all your results from the AWS APIs.

About the Author

Doug Ireton is a Sr. DevOps engineer at 1Strategy, an AWS Consulting Partner specializing in Amazon Web Services (AWS). He has 23 years’ experience in IT, working at Microsoft, Washington Mutual Bank, and Nordstrom in diverse roles spanning testing, Windows Server engineering, development, and Chef engineering, helping app and platform teams manage thousands of servers via automation.


Alexa is checking your list

20. December 2016 2016 0

Author: Matthew Williams
Editors: Benjamin Marsteau, Scott Francis

Recently I made a kitchen upgrade: I bought an Amazon Dot. Alexa, the voice assistant inside the intelligent puck, now plays a key role in the preparation of meals every day. With both hands full, I can say “Alexa, start a 40-minute timer” and not have to worry about burning the casserole. However, there is a bigger problem coming up that I feel it might also help me out with. It is the gift-giving season, and I have been known to get the wrong things. Wouldn’t it be great if I could have Alexa remind me what I need to get for each person on my list? Well, that simple idea took me down a path that has consumed me for a little too long. And since I built it, I figured I would share it with you.

Architecting a Solution

Now it is important to remember that I am a technologist and therefore I am going to go way beyond what’s necessary. [ “anything worth doing is worth overdoing.” — anon. ] Rather than just building the Alexa side of things, I decided to create the entire ecosystem. My wife and I are the first in our families to add Alexa to our household, so that means I need a website for my friends and family to add what they want. And of course, that website needs to talk to a backend server with a REST API to collect the lists into a database. And then Alexa needs to use that same API to read off my lists.

OK, so spin up an EC2 instance and build away, right? I did say I am a technologist, right? That means I have to use the shiniest tools to get the job done. Otherwise, it would just be too easy.

My plan is to use a combination of AWS Lambda to serve the logic of the application, the API Gateway to host the REST endpoints, DynamoDB for saving the data, and another Lambda to respond to Alexa’s queries.

The Plan of Attack

Based on my needs, I think I came up with the ideal plan of attack. I would tackle the problems in the following order:

  1. Build the Backend – The backend includes the logic, API, and database.
    1. Build a Database to Store the Items
    2. Lambda Function to Add an Item
    3. Lambda Function to Delete an Item
    4. Lambda Function to List All Items
    5. Configure the API Gateway
  2. Build the User Interface – The frontend can be simple: show a list, and let folks add and remove from that list.
  3. Get Alexa Talking to the Service – That is why we are here, right?

There are some technologies used that you should understand before beginning. You do not have to know everything about Lambda or the API Gateway or DynamoDB, but let’s go over a few of the essentials.

Lambda Essentials

The purpose of Lambda is to run the functions you write. Configuration is pretty minimal, and you only get charged for the time your functions run (you get a lot of free time). You can do everything from the web console, but after setting up a few functions, you will want another way. See this page for more about AWS Lambda.

API Gateway Essentials

The API Gateway is a service to make it easier to maintain and secure your APIs. Even if I get super popular, I probably won’t get charged much here as it is $3.50 per million API calls. See this page for more about the Amazon API Gateway.

DynamoDB Essentials

DynamoDB is a simple (and super fast) NoSQL database. My application has simple needs, and I am going to need a lot more friends before I reach the 25 GB and 200 million requests per month that are on the free plan. See this page for more about Amazon DynamoDB.

Serverless Framework

Sure I can go to each service’s console page and configure them, but I find it a lot easier to have it automated and in source control. There are many choices in this category including the Serverless framework, Apex, Node Lambda, and many others. They all share similar features so you should review them to see which fits your needs best. I used the Serverless framework for my implementation.

Alexa Skills

When you get your Amazon Echo or Dot home, you interact with Alexa, the voice assistant. The things that she does are Alexa Skills. To build a skill you need to define a list of phrases to recognize, what actions they correspond to, and write the code that performs those actions.

Let’s Start Building

There are three main components that need to be built here: API, Web, and Skill. I chose a different workflow for each of them. The API uses the Serverless framework to define the CloudFormation template, Lambda Functions, IAM Roles, and API Gateway configuration. The Webpage uses a Gulp workflow to compile and preview the site. And the Alexa skill uses a Yeoman generator. Each workflow has its benefits and it was exciting to use each.

If you would like to follow along, you can clone the GitHub repo: https://github.com/DataDog/AWS-Advent-Alexa-Skill-on-Lambda.

Building the Server

The process I went through was:

  1. Install Serverless Framework (npm i -g serverless)
  2. Create the first function (sls create -n <service name> -t aws-nodejs). The top-level concept in Serverless is that of a service. You create a service, then all the Lambda functions, CloudFormation templates, and IAM roles defined in the serverless.yaml file support that service.
  3. Add the resources needed to a CloudFormation template in the serverless.yaml file. For example:
    alexa_1
    Refer to the CloudFormation docs and the Serverless Resources docs for more about this section.
  4. Add the IAM Role statements to allow your Lambda access to everything needed. For example:
    alexa_2
  5. Add the Lambda functions you want to use in this service. For example:
    alexa_3
    The events section lists the triggers that can kick off this function. http means to use the API Gateway. I spent a little time in the API Gateway console and got confused, but these four lines in the serverless.yaml file were all I needed.
  6. Install serverless-webpack npm and add it to the YAML file:
    alexa_4
    This configuration tells Serverless to use webpack to bundle all your npm modules together in the right way. And if you want to use ECMAScript 2015, it will run Babel to convert your code back down to a JavaScript version that Lambda can use. You will have to set up your webpack.config.js and .babelrc files to get everything working.
  7. Write the functions. For the function I mentioned earlier, I added the following to my items.js file:
    alexa_5
    This function sets the table name in my DynamoDB and then grabs all the rows. No matter what the result is, a response is formatted using this createResponse function:
    alexa_6
    Notice the header. Without this, Cross-Origin Resource Sharing (CORS) will not work. You will get nothing but 502 errors when you try to consume the API.
  8. Deploy the Service:

    Now I use 99designs’ aws-vault to store my AWS access keys rather than adding them to an rc file that could accidentally find its way up to GitHub. So the command I use is:

    If everything works, it creates the DynamoDB table, configures the API Gateway APIs, and sets up the Lambdas. All I have to do is try them out from a new application or with a tool like Paw or Postman. Then rinse and repeat until everything works.

Building the Frontend

alexa_7

Remember, I am a technologist, not an artist. It works, but I will not be winning any design awards. It is a webpage with a simple table on it that loads up some JavaScript to show my DynamoDB table:

alexa_8

Have I raised the technologist card enough times yet? Well, because of that I need to keep to the new stuff, even with the JavaScript features I am using. That means I am writing the code in ECMAScript 2015, so I need to use Babel to convert it to something usable in most browsers. I used Gulp for this stage to keep rebuilding the files and reloading my browser with each change.

Building the Alexa Skill

Now that we have everything else working, it is time to build the Alexa Skill. Again, Amazon has a console for this which I used for the initial configuration on the Lambda that backs the skill. But then I switched over to using Matt Kruse’s Alexa App framework. What I found especially cool about his framework was that it works with his alexa-app-server so I can test out the skill locally without having to deploy to Amazon.

For this one I went back to the pre-ECMAScript 2015 syntax but I hope that doesn’t mean I lose technologist status in your eyes.

Here is a quick look at a simple Alexa response to read out the gift list:

alexa_9

Summary

And now we have an end-to-end solution for working with gift lists. We built the beginnings of an API to manage gift lists. Then we added a web frontend to allow users to add to the list. And then we added an Alexa skill to read the list while both hands are on a hot pan. Is this overkill? Maybe. Could I have stuck with a pen and scrap of paper? Well, I guess one could do that. But what kind of technologist would I be then?

About the Author

Matt Williams is the DevOps Evangelist at Datadog. He is passionate about the power of monitoring and metrics to make large-scale systems stable and manageable. So he tours the country speaking and writing about monitoring with Datadog. When he’s not on the road, he’s coding. You can find Matt on Twitter at @Technovangelist.

About the Editors

Benjamin Marsteau is a System administrator | Ops | Dad | and tries to give back to the community as much as it gives him.