Vanquishing CORS with Cloudfront and Lambda@Edge

03. December 2018

If you’re deploying a traditional server-rendered web app, it makes sense to host static files on the same machine. The HTML, being server-rendered, has to be served there anyway, and it’s simple to serve CSS, JavaScript, and other assets from the same domain.

When you’re deploying a single-page web app (or SPA), the best choice is less obvious. A SPA consists of a collection of static files, as opposed to server-rendered files that might change depending on the requester’s identity, logged-in state, etc. The files may still change when a new version is deployed, but not for every request.

In a single-page web app, you might access several APIs on different domains, or a single API might serve multiple SPAs. Imagine you want to have the main site at mysite.com and some admin views at admin.mysite.com, both talking to api.mysite.com.

Problems with S3 as a static site host

S3 is a good option for serving the static files of a SPA, but it’s not perfect. Its website endpoints don’t support SSL—a requirement for any serious website in 2018. There are a couple of other deficiencies you may encounter as well, namely client-side routing and CORS headaches.

Client-side routing

Most SPA frameworks rely on client-side routing. With client-side routing, every path should receive the content of index.html, and the specific “page” to show is determined on the client. It’s possible to configure this to use the fragment portion of the URL, giving routes like /#!/login and /#!/users/253/profile. These “hashbang” routes are trivially supported by S3: the fragment portion of a URL is never sent to the server, so it isn’t interpreted as a filename. S3 simply serves the content for /, which is index.html, just as we wanted.

However, many developers prefer to use client-side routers in “history” mode (aka “push-state” or “HTML5” mode). In history mode, routes omit that #! portion and look like /login and /users/253/profile. This is usually done for SEO reasons, or just for aesthetics. Regardless, it doesn’t work with S3 at all. From S3’s perspective, those look like totally different files. It will fruitlessly search your bucket for files called /login or /users/253/profile. Your users will see 404 errors instead of lovingly crafted pages.

CORS headaches

Another potential problem, not unique to S3, stems from Cross-Origin Resource Sharing (CORS). CORS polices which routes and data are accessible from other origins. For example, a request from your SPA at mysite.com to api.mysite.com is considered cross-origin, so it’s subject to CORS rules. Browsers only permit cross-origin requests when the server at api.mysite.com sets headers explicitly allowing them.

Even when you have control of the server, CORS headers can be tricky to set up correctly. Some SPA tutorials recommend side-stepping the problem using webpack-dev-server’s proxy setting. In this configuration, webpack-dev-server accepts requests to /api/* (or some other prefix) and forwards them to a server (e.g., http://localhost:5000). As far as the browser is concerned, your API is hosted on the same domain—not a cross-origin request at all.
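
For reference, a minimal webpack-dev-server proxy configuration might look something like the following sketch; the /api prefix and the local port are assumptions, so adjust them to your setup.

    // webpack.config.js (sketch)
    module.exports = {
      // ...the rest of your webpack configuration...
      devServer: {
        proxy: {
          // Forward any request starting with /api to the API server,
          // so the browser only ever talks to one origin.
          '/api': {
            target: 'http://localhost:5000',
            changeOrigin: true,
          },
        },
      },
    };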

Some browsers will also reject third-party cookies. If your API server is on a subdomain this can make it difficult to maintain a logged-in state, depending on your users’ browser settings. The same fix for CORS—proxying /api/* requests from mysite.com to api.mysite.com—would also make the browser see these as first-party cookies.

In production or staging environments, you wouldn’t be using webpack-dev-server, so you could see new issues due to CORS that didn’t happen on your local computer. We need a way to achieve similar proxy behavior that can stand up to a production load.

CloudFront enters, stage left

To solve these issues, I’ve found CloudFront to be an invaluable tool. CloudFront acts as a distributed cache and proxy layer. You create DNS records that resolve mysite.com to something.cloudfront.net. A CloudFront distribution accepts requests and forwards them to another origin you configure, caching the responses from the origin (unless you tell it not to). For a SPA, the origin is just your S3 bucket.

In addition to providing caching, SSL, and automatic gzipping, CloudFront is a programmable cache. It gives us the tools to implement push-state client-side routing and to set up a proxy for your API requests to avoid CORS problems.

Client-side routing

There are many suggestions to use CloudFront’s “Custom Error Response” feature in order to achieve pretty push-state-style URLs. When CloudFront receives a request to /login it will dutifully forward that request to your S3 origin. S3, remember, knows nothing about any file called login so it responds with a 404. With a Custom Error Response, CloudFront can be configured to transform that 404 NOT FOUND into a 200 OK where the content is from index.html. That’s exactly what we need for client-side routing!

The Custom Error Response method works well, but it has a drawback. It turns all 404s into 200s with index.html for the body. That isn’t a problem yet, but we’re about to set up our API so it is accessible at mysite.com/api/* (in the next section). It can cause some confusing bugs if your API’s 404 responses are being silently rewritten into 200s with an HTML body!

If you don’t need to talk to any APIs or don’t care to side-step the CORS issues by proxying /api/* to another server, the Custom Error Response method is simpler to set up. Otherwise, we can use Lambda@Edge to rewrite our URLs instead.

Lambda@Edge gives us hooks where we can step in and change the behavior of the CloudFront distribution. The one we’ll need is “Origin Request”, which fires when a request is about to be sent to the S3 origin.

We’ll make some assumptions about the routes in our SPA.

  1. Any request with an extension (e.g., styles/app.css, vendor.js, or imgs/logo.png) is an asset and not a client-side route. That means it’s actually backed by a file in S3.
  2. A request without an extension is a SPA client-side route path. That means we should respond with the content from index.html.

If those assumptions aren’t true for your app, you’ll need to adjust the code in the Lambda accordingly. For the rest of us, we can write a lambda to say “If the request doesn’t have an extension, rewrite it to go to index.html instead”. Here it is in Node.js:
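
A minimal version of that handler might look like the sketch below; treat it as a starting point and adapt the extension check to your own asset layout.

    'use strict';

    // CloudFront Origin Request handler: rewrite extension-less paths to /index.html
    // so client-side routes are served the SPA shell from S3.
    exports.handler = (event, context, callback) => {
      const request = event.Records[0].cf.request;

      // If the last path segment has no file extension, treat it as a client-side route.
      const hasExtension = /\.[a-zA-Z0-9]+$/.test(request.uri);
      if (!hasExtension) {
        request.uri = '/index.html';
      }

      // Hand the (possibly rewritten) request back to CloudFront.
      callback(null, request);
    };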

Make a new Node.js Lambda, and copy that code into it. At this time, in order to be used with CloudFront, your Lambda must be deployed to the us-east-1 region. Additionally, you’ll have to click “Publish a new version” on the Lambda page. An unpublished Lambda cannot be used with Lambda@Edge.

Copy the ARN at the top of the page and paste it into the “Lambda function associations” section of your S3 origin’s Behavior. This is what tells CloudFront to call your Lambda when an Origin Request occurs.

Et voilà! You now have pretty SPA URLs for client-side routing.

Sidestep CORS Headaches

A single CloudFront “distribution” (that’s the name for the cache rules for a domain) can forward requests to multiple servers, which CloudFront calls “Origins”. So far, we only have one: the S3 bucket. In order to have CloudFront forward our API requests, we’ll add another origin that points at our API server.

You’ll probably want to set up this origin with minimal or no caching. Be sure to forward all headers and cookies as well. We’re not really using CloudFront’s caching capabilities for the API server; rather, we’re treating it as a reverse proxy.

At this point you have two origins set up: the original one for S3 and the new one for your API. Now we need to set up the “Behavior” for the distribution. This controls which origin responds to which path.

Choose /api/* as the Path Pattern to go to your API. All other requests will hit the S3 origin. If you need to communicate with multiple API servers, set up a different path prefix for each one.

CloudFront is now serving the same purpose as the webpack-dev-server proxy. Both frontend and API endpoints are available on the same mysite.com domain, so we’ll have zero issues with CORS.

Cache-busting on Deployment

The CloudFront cache makes our sites load faster, but it can cause problems too. When you deploy a new version of your site, the cache might continue to serve an old version for 10-20 minutes.

I like to include a step in my continuous integration deploy to bust the cache, ensuring that new versions of my asset files are picked up right away. Using the AWS CLI, this looks like:
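
The distribution ID below is a placeholder; substitute your own.

    # Invalidate every cached path so the next request fetches fresh objects from S3.
    aws cloudfront create-invalidation \
      --distribution-id E1234EXAMPLE \
      --paths "/*"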

About the Author

Brian Schiller (@bgschiller) is a Senior Software Engineer at Devetry in Denver. He especially enjoys teaching, and leads Code Forward, a free coding bootcamp sponsored by Devetry. You can read more of his writing at brianschiller.com.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Previously, she was a principal site reliability engineer at RealSelf and developed cookbooks to simplify building and managing infrastructure at Chef. Jennifer is the coauthor of Effective DevOps and speaks about DevOps, tech culture, and monitoring. She also gives tutorials on a variety of technical topics. When she’s not working, she enjoys learning to make things and spending quality time with her family.


Auditing Bitbucket Server Data for Credentials in AWS

02. December 2018

This article was originally published on the Sourced blog.

Introduction

Secrets management in public cloud environments continues to be a challenge for many organisations as they embrace the power of programmable infrastructure and the consumption of API-based services. All too often reputable companies will feature in the news, having fallen victim to security breaches or costly cloud resource provisioning through the accidental disclosure of passwords, API tokens or private keys.

Whilst the introduction of cloud-native services such as AWS Secrets Manager or third-party solutions like HashiCorp Vault provide more effective handling for this type of data, the nature of version control systems such as git provides a unique challenge in that the contents of old commits may contain valid secrets that could still be discovered and abused.

We’ve been engaged at a customer who has a large whole-of-business ‘self-service’ cloud platform in AWS, where deployments are driven by an infrastructure-as-code pipeline with code stored in git repositories hosted on an Atlassian Bitbucket server. Part of my work included identifying common, unencrypted secrets in the organisation’s git repositories and providing the responsible business units with a way to easily identify and remediate these exposures.

Due to client constraints on time and resourcing, we developed a solution that leveraged our existing tooling as well as appropriate community-developed utilities to quickly and efficiently meet our customer’s requirements whilst minimising operational overhead.

In this blog post, we’ll walk through the components involved in allowing us to visualise these particular security issues and work to drive them towards zero exposure across the organisation.

Understanding Bitbucket Server

As mentioned above, our client leverages an AWS-deployed instance of Atlassian Bitbucket Server to store and manage their git repositories across the group.

From the application side, the Bitbucket platform contains the following data, which continues to grow every day:

  • 100 GB+ of git repository data
  • 1300+ repositories with more than 9000 branches
  • 200,000+ commits in the master branches alone

As part of this deployment, EBS and RDS snapshots are created on a schedule so that a point-in-time backup of the application is always available, ensuring the service can be redeployed in the event of a failure or used to test software upgrades against production-grade data.

When these snapshots are created, they are tagged with a timestamp, allowing both humans and automated processes to quickly identify the most recent backup of the service.

Auditing git repositories

When it comes to inspecting git repositories for unwanted data, one of the challenges is that it involves inspecting every commit in every branch of every repository.

Even though the Bitbucket platform provides a search capability in the web interface, it is limited in that it can only search for certain patterns in the master branch of a repository, in files smaller than a set size. In addition, the search API is private and is not advocated for non-GUI use, further emphasised by the fact that it returns its results in HTML format.

Another challenge that we encountered was that heavy use of the code search API resulted in impaired performance on the Bitbucket server itself.

As such, we looked to the community to see what other tools might exist to help us sift through the data and identify any issues. During our search, we identified a number of different tools, each with their own capabilities and limitations. Each of these is worthy of a mention and is detailed below:

After trying each of these tools out and understanding their capabilities, we ended up selecting gitleaks for our use.

The primary reasons for its selection include:

  • It is an open source security scanner for git repositories, actively maintained by Zach Rice;
  • Written in Go, gitleaks provides a fast individual repository scanning capability that comes with a set of pre-defined secrets identification patterns; and
  • It functions by parsing the contents of a cloned copy of a git repository on the local machine, which it then uses to examine all files and commits, returning the results in a JSON output file for later use.

The below example shows the output of gitleaks being run against a sample repository called “secretsdummy” that contains an unencrypted RSA private key file.

As you can see, gitleaks detects it in a number of commits and returns the results in JSON format to the output file /tmp/secretsdummy_leaks.json for later use.
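
As a rough illustration only, a scan like that can be kicked off from the command line along these lines; the flag names here are assumptions that vary between gitleaks releases, so check gitleaks --help for your version.

    # Clone the repository locally, then scan it and write the findings as JSON.
    # Flag names are illustrative and differ between gitleaks versions.
    git clone https://bitbucket.example.com/scm/proj/secretsdummy.git
    gitleaks --repo-path=secretsdummy --report=/tmp/secretsdummy_leaks.json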

Read the rest of this article on the Sourced blog.


Amazon Machine Learning: Super Simple Supervised Learning

01. December 2018

Introduction

Machine learning is a big topic. It’s full of math, white papers, open source libraries, and algorithms. And worse, PhDs. If you simply want to predict an outcome based on your historical data, it can feel overwhelming.

What if you want to predict customer churn (when a customer will stop using your service) so that you can reach out to them before they decide to leave? Or what if you want to predict when one of hundreds or thousands of remote devices will fail? You need some kind of mathematical construct, called a “model,” into which you feed data and, in return, receive predictions.

You could break out the statistics textbook and start thinking about what algorithm to use. Or you can choose a technology that lets you quickly apply machine learning to a broad set of scenarios: Amazon Machine Learning (AML).

AML

Amazon Web Services (AWS) offers Amazon Machine Learning, which lets you build a simplified machine learning (ML) system. AML makes it very easy to create multiple models, evaluate the models, and make predictions. AML is a PaaS solution and is a building block of an application, rather than an entire application itself. It should be incorporated into an existing application, using AML predictions to make the application “smarter”.

ML systems perform either supervised or unsupervised learning. With supervised learning, correct answers are provided as part of the input data for the model. With unsupervised learning, the algorithm used teases out innate structure in the input data without any help as to what is the correct answer.

AML is supervised machine learning. To build a model, AML needs input data with both the values that will help predict the outcome and values of that outcome. The outcome variable is called the “target variable”. AML needs both so the machine learning algorithm can tease out the relationships and learn how to predict the target variable. Such data is called training data.

For example, if you are trying to predict the winner of a baseball game, you might provide input data such as who was playing each position, the weather, the location of the game and other information. The target variable would be a boolean value–true for a home team win, false for a visiting team win. To use AML to solve this problem, you’d have to provide a data set with all of the input variables and also the results of previous games. Then, once the model was built, you provide all of the input values except the target variable (called an “observation”) and get a predicted value for the winner.

In addition, AML has the following features:

  • It works with structured text data. It supports only CSV at present.
  • Input data can be strings, numbers, booleans or categorical (one of N) values.
  • Target variable types can be numbers, booleans, or categorical values.
  • There’s little to no coding needed to experiment with AML.
  • You don’t need machine learning experience to use AML and get useful predictions.
  • AML is a pay as you go service; you only pay for what you use.
  • It is a hosted service. You don’t have to run any servers to use AML.

In order to make machine learning simple to use, AML limits the configurability of the system. It also has other limits, as mentioned below. AML is a great solution when you have CSV data that you want to make predictions against. Examples of problems for which  AML would be a good solution include:

  • Is this customer about to churn/leave?
  • Does this machine need service?
  • Should I send this customer a special offer?

AML is not a general purpose machine learning toolkit. Some of the constraints on any system built on AML include:

  • AML is a “medium” data solution, rather than big data. If you have hundreds of gigs of data (or less), AML will work.
  • The model that is created is housed completely within the AML system. While you can access it to generate predictions, you can’t examine the mathematical makeup of the model (for example, the weights of the features). It is also not possible to export the model to run on any other system (for example in a different cloud or on premise).
  • AML only supports the four input types mentioned above: strings, numbers, booleans or categorical (one of N values). Target variables can only be a number, boolean, or categorical value–the data type of the target variable determines the type of model (regression models for numeric target variables, binary classification models for boolean target variables, and multi-class classification models for categorical target variables).
  • AML is currently available in only two AWS regions: Northern Virginia and Ireland.
  • While you can tweak some settings, there is only one algorithm for each predicted value data type. The only optimization technique available is stochastic gradient descent.
  • It can only be used for supervised prediction, not for clustering, recommendations or other kinds of machine learning.

Examples of problems for which AML will not be a good fit include:

  • Is this a picture of a dog or a cat?
  • What are the multi-dimensional clusters of this data?
  • Given this user’s purchase history, what other products would they like?

Ethics

Before diving into making predictions, it’s worth discussing the ethics of machine learning. Models make predictions that have real-world consequences. When you are involved in building such systems, you must think about the ramifications. In particular, think about the bias in your training data. If you are working on a project that will be rolled out across a broad population, make sure your training data is evenly distributed.

In addition, it’s worth thinking about how your model will be used. (This framework is pulled from the excellent “Weapons of Math Destruction” by Cathy O’Neil). Consider:

  • Opacity: How often is it updated? Is the data source available to all people affected by the model?
  • Scale: How many people will this system affect, now or in the future?
  • Damage: What kind of decisions are being made with this model? Deciding whether to show someone an ad has far fewer ramifications than deciding whether someone is a good credit risk or not.

Even more than software developers, people developing ML models need to consider the ethics of the systems they build. Software developers build tools that humans use, whereas ML models affect human beings, often without their knowledge.

Think about what you are building.

The Data Pipeline

An AML process can be thought of like a pipeline. You push data in on one end, build certain constructs that the AML system leverages, and eventually, you get predictions out on the other end. The first steps for most ML problems are to determine the question you are trying to answer and to locate the requisite data. This article won’t discuss these efforts, other than to note that garbage in, garbage out applies to ML just as much as it does to other types of data processing. Make sure you have good data, plenty of it, and know what kind of predictions you want to make before building an AML system.

All these AML operations can either be done via the AWS console or the AWS API. For initial exploration, the console is preferable; it’s easier to understand and requires no coding. For production use, you should use the API to build a repeatable system. All the data and scripts mentioned below are freely available on Github (https://github.com/mooreds/amazonmachinelearning-anintroduction) and can serve as a base for your own data processing scripts.

The Data Pipeline: Load the data

When you are starting out with AML, you need to make your data available to the AML system in CSV format. It also must be in a location accessible to AML.

For this post, I’m going to use data provided by UCI. In particular, I’m going to use census data that was collected in the 1990s and includes information like the age of the person, their marital status, and their educational level. This is called the ‘adult’ data set. The target variable will be whether or not the user makes more or less than $50,000 per year. This data set has about 20k records. Here is some sample data:

Note that this dataset is a bit atypical in that it has only training data. There are no observations (input data without the target variable) available to me. So I can train a model, but won’t have any data to make predictions. In order to fully show the power of AML, I’m going to split the dataset into two parts as part of the prep:

  • training data which includes the target variable and which will be used to build the model.
  • observations, which will be passed to the model to obtain predictions. These observations will not include the target variable.

For real world problems you’ll want to make sure you have a steady stream of observations on which to make predictions, and your prep script won’t need to split the initial dataset.

I also need to transform this dataset into an AML compatible format and load it up to S3. A script will help with the first task. This script will turn the <=50K and >50K values into boolean values that AML can process. It will also prepend the header row for easier variable identification later in the process. Full code is available here: https://github.com/mooreds/amazonmachinelearning-anintroduction/blob/master/dataprep/adult.py
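
The linked script does the real work; as an illustration only, the label conversion and header step boil down to something like this (column names assumed from the adult dataset).

    import csv

    # Assumed header for the adult dataset; adjust to match the actual columns.
    HEADER = ["age", "workclass", "fnlwgt", "education", "education-num",
              "marital-status", "occupation", "relationship", "race", "sex",
              "capital-gain", "capital-loss", "hours-per-week", "native-country",
              "income"]

    with open("adult.data") as infile, open("adult-training.csv", "w", newline="") as outfile:
        reader = csv.reader(infile, skipinitialspace=True)
        writer = csv.writer(outfile)
        writer.writerow(HEADER)  # prepend the header row for easier variable identification
        for row in reader:
            if not row:
                continue
            # Turn the '<=50K' / '>50K' labels into booleans that AML can process.
            row[-1] = "0" if row[-1].startswith("<=50K") else "1"
            writer.writerow(row)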

Running that script yields the following training data (the last value is the target variable, which the model will predict):

It also provides the following observation data, with the target variable removed:

This prep script is a key part of the process and can execute anywhere and in any language. Other kinds of transformations that are best done in a prep script:

  • Converting non-CSV format (JSON, XML) data to CSV.
  • Turning date strings into offsets from a canonical date.
  • Removing personally identifiable information.

The example prep script is Python and runs synchronously, which is fine because it only processes thousands of records. Depending on the scale of your data, you may need to consider other solutions, such as Hadoop or Spark, to transform your source data into CSV.

After I have the data in CSV format, I can upload it to S3. AML can also read from AWS RDS and Redshift, using a SQL query as the prep script. (Note that you can’t use AWS RDS as a data source via the console, only via the API.)

The Data Pipeline: Create the Datasource

Once the CSV file is on S3, you need to build AML specific objects. These all have their own identity and are independent of the CSV data. First you need to create the AML data source.

You point the AML data source at the data on S3. You need to specify a schema, which includes mapping fields to one of the four supported data types. You also select a target variable (if the data source has the variable you want to predict) and a row identifier (if each row has a unique ID that should be carried through the process). If you are doing this often or you want a repeatable process, you can store the schema as JSON and provide it via an API.
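
For the repeatable, API-driven path, creating a data source with boto3 looks roughly like the sketch below; the IDs, bucket name, and schema file name are placeholders.

    import boto3

    ml = boto3.client("machinelearning", region_name="us-east-1")

    with open("adult.csv.schema") as f:
        schema = f.read()  # JSON schema, as excerpted below

    ml.create_data_source_from_s3(
        DataSourceId="ds-adult-training-001",       # any unique ID you choose
        DataSourceName="adult census training data",
        DataSpec={
            "DataLocationS3": "s3://my-aml-bucket/adult-training.csv",
            "DataSchema": schema,
        },
        ComputeStatistics=True,  # required when the data source will be used for training
    )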

Here’s an excerpt of the schema file I am using for the income prediction model:
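
In outline, an Amazon ML schema of that shape looks like the following; the attribute list is abbreviated and the field names are taken from the adult dataset, so the actual file differs in detail.

    {
      "version": "1.0",
      "targetFieldName": "income",
      "dataFormat": "CSV",
      "dataFileContainsHeader": true,
      "attributes": [
        { "fieldName": "age", "fieldType": "NUMERIC" },
        { "fieldName": "workclass", "fieldType": "CATEGORICAL" },
        { "fieldName": "education", "fieldType": "CATEGORICAL" },
        { "fieldName": "sex", "fieldType": "CATEGORICAL" },
        { "fieldName": "hours-per-week", "fieldType": "NUMERIC" },
        { "fieldName": "income", "fieldType": "BINARY" }
      ]
    }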

You can see that I specify the target attribute, the data file format, and a list of attributes with a name and a data type. (Full schema file here.)

You can create multiple different data sources off of the same data, and you only need read access to the S3 location. You can also add arbitrary string tags to the data source; for example, date_created or author. The first ten tags you add to a datasource will be inherited by other AML entities like models or evaluations that are created from the data source. As your models proliferate, tags are a good way to organize them.

Finally, when the data source is created, you’ll receive statistics about the data set, including histograms of the various variables, counts of missing values, and the distribution of your target variable. Here’s an example of target variable distribution for the adult data set I am using:

Data insights can be useful in determining if your data is incomplete or nonsensical. For example, if I had 15,000 records but only five of them had an income greater than $50,000, trying to predict that value wouldn’t make much sense. There simply isn’t a valid distribution of the target variable, and my model would be skewed heavily toward whatever attributes those five records had. This type of data intuition is only gained through working with your dataset.

The Data Pipeline: Create the Model

Once you have the AML data source created, you can create a model.

An AML model is an opaque, non-exportable representation of your data, which is built using the stochastic gradient descent optimization technique. There are configuration parameters you can tweak, but AML provides sensible defaults based on your data. These parameters are an area for experimentation.
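
Via the API, creating a model with tweaked training parameters looks roughly like the sketch below; the IDs and parameter values are illustrative, not recommendations.

    import boto3

    ml = boto3.client("machinelearning", region_name="us-east-1")

    with open("recipe.json") as f:
        recipe = f.read()  # the recipe format is discussed below

    ml.create_ml_model(
        MLModelId="ml-adult-income-001",
        MLModelName="adult income prediction",
        MLModelType="BINARY",                        # boolean target variable
        TrainingDataSourceId="ds-adult-training-001",
        Recipe=recipe,
        Parameters={
            "sgd.maxPasses": "10",                   # passes over the training data
            "sgd.shuffleType": "auto",
            "sgd.l2RegularizationAmount": "1e-6",    # mild regularization to curb overfitting
        },
    )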

Also, a “recipe” is required to build a model. Using a recipe, you can transform your data before the model building algorithm accesses it, without modifying the source data. Recipes can also create intermediate variables which can be fed into the model, group variables together for easy transformation and exclude source variables. There are many transformations that you can transparently perform on the data, including:

  • Lowercasing strings
  • Removing punctuation
  • Normalizing numeric values
  • Binning numeric values
  • And more

Note that if you need to perform a different type of transformation (such as converting a boolean value to an AML compatible format), you’ll have to do it as part of the prep script. There is no way to transform data in a recipe other than using the provided transformations.

If you are using the API, the recipe is a JSON file that you can store outside of the AML pipeline and provide when creating a model.

Here’s an example of a recipe that I used on this income prediction dataset:
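
In outline, it looks like this; which columns belong in the numeric group is an assumption on my part, but the structure matches what the next few paragraphs describe.

    {
      "groups": {
        "NUMERIC_VARS_QB_10": "group('age','fnlwgt','education-num','capital-gain','capital-loss','hours-per-week')"
      },
      "assignments": {},
      "outputs": [
        "ALL_CATEGORICAL",
        "ALL_BINARY",
        "quantile_bin(NUMERIC_VARS_QB_10,10)"
      ]
    }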

Groups are a way of grouping different variables (defined in the schema) together so that operations can be applied to them en masse. For example, NUMERIC_VARS_QB_10 is a group of continuous numeric variables that are binned into 10 separate bins (turning the numeric variables into categorical variables).

Assignments let you create intermediate variables. I didn’t use that capability here.

Outputs are the list of variables that the model will see and operate on. In this case, ALL_CATEGORICAL and ALL_BINARY are shortcuts referring to all of those types of input variables. If you remove a variable from the outputs clause, the model will ignore the variable.

In the same way that you have multiple different data sources from the same data, you can create multiple models based on the same data source. You can tweak the parameters and the recipe to build different models. You can then compare those models and test to see which is most accurate.

But how do you test for accuracy?

The Data Pipeline: Evaluate and Use the Model

When you have an AML model, there are three operations you can perform.

The first is model evaluation. When you are training a model, you can optionally hold back some of the training data (which has the real-world target variable values). This is called evaluation data. After you build the model, you can run this data through, stripping off the target variable, and get the model’s prediction. Then the system can compare the predicted value with the correct answer across all the evaluation data. This gives an indication of the accuracy.
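
Through the API, an evaluation is just another object you create and later inspect. A sketch, assuming a held-back evaluation data source already exists and using placeholder IDs:

    import boto3

    ml = boto3.client("machinelearning", region_name="us-east-1")

    # Evaluate the trained model against data that was held back from training.
    ml.create_evaluation(
        EvaluationId="ev-adult-income-001",
        EvaluationName="adult income evaluation",
        MLModelId="ml-adult-income-001",
        EvaluationDataSourceId="ds-adult-evaluation-001",
    )

    # Once the evaluation has finished, read back its metrics
    # (for a binary model the properties include the AUC).
    result = ml.get_evaluation(EvaluationId="ev-adult-income-001")
    print(result["PerformanceMetrics"]["Properties"])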

Here’s an example of an evaluation for the income prediction model that I built using the census data:

Depending on your model’s target variable, you will get different representations of this value, but fundamentally, you are asking how often the model was correct. There are two things to be aware of:

  • You won’t get 100% accuracy. If you see that, your model exactly matches the evaluation data, which means that it’s unlikely to match real world data. This is called overfitting.
  • Evaluation scores differ based on both the model and the data. If the data isn’t representative of the observations you’re going to be making, the model won’t be accurate.

For the adult dataset, which yields a binary classification model, we get something called the area under the curve (AUC). The closer the AUC is to 1, the better our model matched reality. Other types of target variables get other measures of accuracy.

You can also, with a model that has a boolean target variable, determine a cutoff point, called the scoreThreshold. The model will give a prediction between 0 and 1, and you can then determine where you want the results to be split between 1 (true) or 0 (false). Is it 0.5 (the default)? Or 0.9 (which will give you fewer false positives, where the model predicts the value is true, but reality says it’s not)? Or 0.1 (which will give you fewer false negatives, where the model predicts the value is false, but reality says it’s true)?  What you set this value to depends on the actions you’re going to take. If the action is inexpensive (showing someone an advertisement they may like), you may want to err on the side of fewer false negatives. If the opposite is true, and the action is expensive (having a maintenance tech visit a factory for proactive maintenance) you will want to set this higher.
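
The threshold can be changed on an existing model without retraining; with boto3 that is roughly:

    import boto3

    ml = boto3.client("machinelearning", region_name="us-east-1")

    # Raise the cutoff so fewer observations are labeled true (fewer false positives).
    ml.update_ml_model(
        MLModelId="ml-adult-income-001",
        ScoreThreshold=0.9,
    )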

Other target variable types don’t have the concept of a score threshold and may return different values. Below, you’ll see sample predictions for different types of target variables.

Evaluations are entirely optional. If you have some other means of determining the accuracy of your model, you don’t have to use AML’s evaluation process. But using the built-in evaluation settings lets you easily compare models and gives you a way to experiment when tweaking configuration and recipes.

Now that you’ve built the model and have a handle on its accuracy, you can make predictions. There are two types of predictions: batch and real time.

Batch Predictions

Batch predictions take place asynchronously. You create an AML data source pointing to your observations. The data format must be identical, and the target variable must be absent, but you can use any of the supported data source options (S3, RDS, Redshift). You can process millions of records at a time. You submit a job via the console or API. The job request includes the data source of the observations, the model ID and the prediction output location.

After you start the job, you need to poll AML until it is done. Some SDKs, including the python SDK, include code that will poll for you: https://github.com/mooreds/amazonmachinelearning-anintroduction/blob/master/prediction/batchpredict.py has some sample code.
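
Stripped down, that flow looks something like the sketch below; the IDs and S3 output location are placeholders, and the linked sample code handles polling more robustly.

    import time

    import boto3

    ml = boto3.client("machinelearning", region_name="us-east-1")

    # Submit the batch job against a data source that contains the observations.
    ml.create_batch_prediction(
        BatchPredictionId="bp-adult-income-001",
        BatchPredictionName="adult income batch prediction",
        MLModelId="ml-adult-income-001",
        BatchPredictionDataSourceId="ds-adult-observations-001",
        OutputUri="s3://my-aml-bucket/predictions/",
    )

    # Poll until the job finishes; results land in the OutputUri bucket.
    while True:
        status = ml.get_batch_prediction(BatchPredictionId="bp-adult-income-001")["Status"]
        if status in ("COMPLETED", "FAILED"):
            break
        time.sleep(30)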

At job completion, the results will be placed in the specified S3 output bucket. If your observation data has a row identifier, that will be in the output file as well. Otherwise each input row will correspond to an output row based on line number (the first row of input will correspond to the first row of output, the second row of input to the second row of output, and so on).

Here’s sample output from a batch prediction job of the income prediction model:

You are given the bestAnswer, which is based on scoreThreshold above. But you’re also given the values calculated by the model.

For a multi-class classification, I am given all the values. Below, there were seven different classes (using a different data set based on wine characteristics, if you must know). The model predicts for line 1 that the value is most likely to be “6” with an 84% likelihood (8.404026E-1 is approximately 0.84 == 84%).

And for a numeric target variable (based on yet another dataset), I just get back the value predicted:

Batch predictions work well as part of a data pipeline when you don’t care about when you get your answers, just that you get them. (Any batch job that takes more than a week will be killed, so there is a time limit.) An example of a problem for which a batch job would be appropriate is scoring thousands of customers to see if any are likely to churn this month.

Real Time Predictions

Real time predictions are, as advertised, predictions that are synchronous. AML generally returns predictions within 100 milliseconds. You can set up a real time endpoint on any model in the console or with an API call. It can take a few minutes for the endpoint server to be ready. But you don’t have to maintain the endpoint in any way–it’s entirely managed by the AML service.

You provide one observation to the real time endpoint, and it will return you a prediction based on that observation. Here’s a sample observation for the income prediction model that we built above:

Predictions are made using an AWS API and return a data structure. Here’s the JSON that is returned when I call the income prediction model with the above observation:
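
In code, the call and the general shape of the response look like the following sketch; the record values and scores are illustrative rather than the actual output.

    import boto3

    ml = boto3.client("machinelearning", region_name="us-east-1")

    # The real-time endpoint URL is attached to the model once the endpoint is created.
    endpoint = ml.get_ml_model(MLModelId="ml-adult-income-001")["EndpointInfo"]["EndpointUrl"]

    response = ml.predict(
        MLModelId="ml-adult-income-001",
        Record={  # abbreviated; a real observation supplies every input field as a string
            "age": "39",
            "workclass": "State-gov",
            "education": "Bachelors",
            "hours-per-week": "40",
        },
        PredictEndpoint=endpoint,
    )

    # response["Prediction"] has a shape along the lines of:
    # {
    #     "predictedLabel": "0",
    #     "predictedScores": {"0": 0.27},
    #     "details": {"PredictiveModelType": "BINARY", "Algorithm": "SGD"}
    # }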

The predictedLabel and predictedScores are the predicted values for this observation, and are what I am really interested in. The predictedLabel is calculated using the score threshold, but I still get the calculated value if that is useful to me.

Real time predictions are the right choice when you have observations that require a prediction immediately. An example would be to choose what kind of ad to display to a user right now, based on existing data and their current behavior.

Now that you’ve seen the major constructs of the AML data pipeline, as well as some predictions that were made using an AML model, let’s cover some operational concerns.

Operational Concerns

Pricing

AML is a totally managed service. You pay for the data storage solutions (both for input and results), but you don’t pay for storage of any of the AML managed artifacts (like the model or data source). You also pay for the compute time to build your data sources, models, and evaluations.

For predictions, you pay per prediction. If you are running real time endpoints, you also pay per hour that the endpoint is up. For the model that I built using the census data, it cost about $0.50 to process all 20k records and make a few thousand predictions.

Full pricing information is available here: https://aws.amazon.com/aml/pricing/

The Model Creation Pipeline

AML Models are immutable, as are data sources. If you need to incorporate ongoing data into your model, which is generally a good idea, you need to automate your datasource and model building process so they are repeatable. Then, when you have new data, you can rebuild your model, test it out, and then change which model is “in production” and making predictions.

You can use tags to control which model is used for a given prediction, so you can end up building a CI pipeline and having a ‘production’ model and test models.

Here’s a simple example of a model update pipeline in one script: https://github.com/mooreds/amazonmachinelearning-anintroduction/blob/master/updatemodel/updatemodel.py

Permissions

Like any other AWS service, AML leverages the Identity and Access Management service (IAM). You can control access to data sources, models, and all other AML constructs with IAM. The full list of permissions is here: https://docs.aws.amazon.com/IAM/latest/UserGuide/list_amazonmachinelearning.html

It’s important to note that if you are using the AWS console to test drive AML, the console will set up the permissions correctly, but if you are using the API to construct a data pipeline, you will need to ensure that IAM access is set up correctly. I’ve found it useful to use the console first and then examine the permissions it sets up and leverage that for the scripts that use the API.

Monitoring

You can monitor AML processes via Amazon CloudWatch. Published metrics include the number of predictions and the number of failed predictions, per model. You can set up alarms in the typical CloudWatch fashion to take action on the metrics (for example, emailing someone if a new model is rolled to production but a large number of failed predictions ensues).

AWS ML Alternatives

There are many services within AWS that are complements to AML. These focus on a particular aspect of ML (computer vision, speech recognition) and include Rekognition and Lex.

AWS SageMaker is a more general-purpose machine learning service with many of the benefits of AML. It lets you use standard machine learning software like Jupyter notebooks, supports multiple algorithms, and lets you run your models locally.

If you are looking for even more control (with corresponding responsibility), there is a Deep Learning AMI available. This AMI comes preinstalled with a number of open source machine learning frameworks. You can use this AMI to boot up an EC2 instance and have full configuration and control.

Conclusion

Amazon Machine Learning makes it super simple to make predictions by creating a model to predict outcomes based on structured text data. AML can be used at all scales, from a few hundred records to millions—all without running any infrastructure. It is the perfect way to bring ML predictions into an existing system easily and inexpensively.

AML is a great way to gain experience with machine learning. There is little to no coding required, depending on what your source data looks like. It has configuration options but is really set up to “just work” with sane defaults.

AML helps you explore the world of machine learning while providing a robust production ready system to help make your applications smarter.

About the Author

Dan Moore is director of engineering at Culture Foundry. He is a developer with two decades of experience, former AWS trainer, and author of “Introduction to Amazon Machine Learning,” a video course from O’Reilly. He blogs on AML and other topics at http://www.mooreds.com/wordpress/ . You can find him on Twitter at @mooreds.

 

About the Editors

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Previously, she was a principal site reliability engineer at RealSelf and developed cookbooks to simplify building and managing infrastructure at Chef. Jennifer is the coauthor of Effective DevOps and speaks about DevOps, tech culture, and monitoring. She also gives tutorials on a variety of technical topics. When she’s not working, she enjoys learning to make things and spending quality time with her family.

John Varghese is a Cloud Steward at Intuit responsible for the AWS infrastructure of Intuit’s Futures Group. He runs the AWS Bay Area meetup in the San Francisco Peninsula Area for both beginners and intermediate AWS users. He has also organized multiple AWS Community Day events in the Bay Area. He runs a Slack channel just for AWS users. You can contact him there directly via Slack. He has a deep understanding of AWS solutions from both strategic and tactical perspectives. An avid AWS user since 2012, he evangelizes AWS and DevOps every chance he gets.


Welcome to AWS Advent 2018

22. November 2018

AWS Advent is returning shortly!

What is the AWS Advent event? Many technology platforms have started a yearly tradition for the month of December: revealing an article per day, written and edited by volunteers, in the style of an advent calendar, a special calendar used to count the days in anticipation of Christmas starting on December 1. The AWS Advent event explores everything around the Amazon Web Services platform.

Examples of past AWS articles:

Please explore the rest of this site for more examples of past topics.

There are a large number of AWS services, and many that have never been covered on AWS advent in previous years. We’re looking for articles that range in audience level from beginners to experts in AWS. Introductory, security, architecture, and design patterns with any of the AWS services are welcome topics.

Interested in being part of AWS Advent 2018?

Important Dates

  • Authors rolling acceptance – November 1, 2018
  • Submissions accepted until the advent calendar is complete. Submissions are still being accepted!
  • Final drafts due – 12:00am November 30, 2018
  • Final article due 3 days prior to publishing to the site.

Thank you, and we look forward to a great AWS Advent in 2018!

Jennifer Davis, @sigje

John Varghese, @jvusa