Machine learning is a big topic. It’s full of math, white papers, open source libraries, and algorithms. And worse, PhDs. If you simply want to predict an outcome based on your historical data, it can feel overwhelming.
What if you want to predict customer churn (when a customer will stop using your service) so that you can reach out to them before they decide to leave? Or what if you want to predict when one of hundreds or thousands of remote devices will fail? You need some kind of mathematical construct, called a “model,” which you will feed data and in return, receive predictions.
You could break out the statistics textbook and start thinking about what algorithm to use. Or you can choose a technology that lets you quickly apply machine learning to a broad set of scenarios: Amazon Machine Learning (AML).
Amazon Web Services (AWS) offers Amazon Machine Learning, which lets you build a simplified machine learning (ML) system. AML makes it very easy to create multiple models, evaluate the models, and make predictions. AML is a PaaS solution and is a building block of an application, rather than an entire application itself. It should be incorporated into an existing application, using AML predictions to make the application “smarter”.
ML systems perform either supervised or unsupervised learning. With supervised learning, correct answers are provided as part of the input data for the model. With unsupervised learning, the algorithm used teases out innate structure in the input data without any help as to what is the correct answer.
AML is supervised machine learning. To build a model, AML needs input data with both the values that will help predict the outcome and values of that outcome. The outcome variable is called the “target variable”. AML needs both so the machine learning algorithm can tease out the relationships and learn how to predict the target variable. Such data is called training data.
For example, if you are trying to predict the winner of a baseball game, you might provide input data such as who was playing each position, the weather, the location of the game and other information. The target variable would be a boolean value–true for a home team win, false for a visiting team win. To use AML to solve this problem, you’d have to provide a data set with all of the input variables and also the results of previous games. Then, once the model was built, you provide all of the input values except the target variable (called an “observation”) and get a predicted value for the winner.
In addition, AML has the following features:
- It works with structured text data. It supports only CSV at present.
- Input data can be strings, numbers, booleans or categorical (one of N) values.
- Target variable types can be numbers, booleans, or categorical values.
- There’s little to no coding needed to experiment with AML.
- You don’t need machine learning experience to use AML and get useful predictions.
- AML is a pay as you go service; you only pay for what you use.
- It is a hosted service. You don’t have to run any servers to use AML.
In order to make machine learning simple to use, AML limits the configurability of the system. It also has other limits, as mentioned below. AML is a great solution when you have CSV data that you want to make predictions against. Examples of problems for which AML would be a good solution include:
- Is this customer about to churn/leave?
- Does this machine need service?
- Should I send this customer a special offer?
AML is not a general purpose machine learning toolkit. Some of the constraints on any system built on AML include:
- AML is a “medium” data solution, rather than big data. If you have hundreds of gigs of data (or less), AML will work.
- The model that is created is housed completely within the AML system. While you can access it to generate predictions, you can’t examine the mathematical makeup of the model (for example, the weights of the features). It is also not possible to export the model to run on any other system (for example in a different cloud or on premise).
- AML only supports the four input types mentioned above: strings, numbers, booleans or categorical (one of N values). Target variables can only be a number, boolean, or categorical value–the data type of the target variable determines the type of model (regression models for numeric target variables, binary classification models for boolean target variables, and multi-class classification models for categorical target variables).
- AML is currently only available in two AWS regions: northern Virginia and Ireland.
- While you can tweak some settings, there is only one algorithm for each predicted value data type. The only optimization technique available is stochastic gradient descent.
- It can only be used for supervised prediction, not for clustering, recommendations or other kinds of machine learning.
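The link between the target variable's data type and the kind of model you get can be written down as a small lookup. This is only an illustrative sketch: the function name is hypothetical, and the type names (NUMERIC, BINARY, CATEGORICAL, and REGRESSION, BINARY, MULTICLASS) are, to the best of my knowledge, the identifiers the AML schema and API use.

```python
# Target variable data type -> the model type AML will build for it.
MODEL_TYPE_BY_TARGET = {
    "NUMERIC": "REGRESSION",      # numeric target: regression model
    "BINARY": "BINARY",           # boolean target: binary classification
    "CATEGORICAL": "MULTICLASS",  # one-of-N target: multi-class classification
}


def model_type_for_target(target_attribute_type):
    """Return the model type implied by the target variable's data type."""
    try:
        return MODEL_TYPE_BY_TARGET[target_attribute_type]
    except KeyError:
        # Strings (free text) can be inputs but not target variables.
        raise ValueError(f"unsupported target type: {target_attribute_type!r}")
```

For the income example used later in this article, the target is a boolean, so model_type_for_target("BINARY") gives a binary classification model.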
Examples of problems for which AML will not be a good fit include:
- Is this a picture of a dog or a cat?
- What are the multi dimensional clusters of this data?
- Given this user’s purchase history, what other products would they like?
Before diving into building models and making predictions, it’s worth discussing the ethics of machine learning. Models make predictions which have real-world consequences. When you are involved in building such systems, you must think about the ramifications. In particular, think about the bias in your training data. If you are working on a project that will be rolled out across a broad population, make sure your training data is representative of that population.
In addition, it’s worth thinking about how your model will be used. (This framework is pulled from the excellent “Weapons of Math Destruction” by Cathy O’Neil). Consider:
- Opacity: How often is it updated? Is the data source available to all people affected by the model?
- Scale: How many people will this system affect, now or in the future?
- Damage: What kind of decisions are being made with this model? Deciding whether to show someone an ad has far fewer ramifications than deciding whether someone is a good credit risk or not.
Even more than software developers, people developing ML models need to consider the ethics of the systems they build. Software developers build tools that humans use, whereas ML models affect human beings, often without their knowledge.
Think about what you are building.
The Data Pipeline
An AML process can be thought of like a pipeline. You push data in on one end, build certain constructs that the AML system leverages, and eventually, you get predictions out on the other end. The first steps for most ML problems are to determine the question you are trying to answer and to locate the requisite data. This article won’t discuss these efforts, other than to note that garbage in, garbage out applies to ML just as much as it does to other types of data processing. Make sure you have good data, plenty of it, and know what kind of predictions you want to make before building an AML system.
All of these AML operations can be done via either the AWS console or the AWS API. For initial exploration, the console is preferable; it’s easier to understand and requires no coding. For production use, you should use the API to build a repeatable system. All the data and scripts mentioned below are freely available on GitHub (https://github.com/mooreds/amazonmachinelearning-anintroduction) and can serve as a base for your own data processing scripts.
The Data Pipeline: Load the data
When you are starting out with AML, you need to make your data available to the AML system in CSV format. It also must be in a location accessible to AML.
For this post, I’m going to use data provided by UCI. In particular, I’m going to use census data that was collected in the 1990s and includes information like the age of the person, their marital status, and their educational level. This is called the ‘adult’ data set. The target variable will be whether or not the person makes more than $50,000 per year. This data set has about 20k records. Here is some sample data:
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
Note that this dataset is a bit atypical in that it has only training data. There are no observations (input data without the target variable) available to me. So I can train a model, but won’t have any data to make predictions. In order to fully show the power of AML, I’m going to split the dataset into two parts as part of the prep:
- training data which includes the target variable and which will be used to build the model.
- observations, which will be passed to the model to obtain predictions. These observations will not include the target variable.
For real world problems you’ll want to make sure you have a steady stream of observations on which to make predictions, and your prep script won’t need to split the initial dataset.
I also need to transform this dataset into an AML compatible format and load it up to S3. A script will help with the first task. This script will turn the <=50K and >50K values into boolean values that AML can process. It will also prepend the header row for easier variable identification later in the process. Full code is available here: https://github.com/mooreds/amazonmachinelearning-anintroduction/blob/master/dataprep/adult.py
Running that script yields the following training data (the last value is the target variable, which the model will predict):
49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, false
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, true
It also provides the following observation data, with the target variable removed:
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States
This prep script is a key part of the process and can execute anywhere and in any language. Other kinds of transformations that are best done in a prep script:
- Converting non-CSV format (JSON, XML) data to CSV.
- Turning date strings into offsets from a canonical date.
- Removing personally identifiable information.
The example prep script is written in Python and runs synchronously, which is fine because it only processes thousands of records. Depending on the scale of your data, you may need to consider other solutions to transform your source data into CSV, such as Hadoop or Spark.
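To make the prep step concrete, here is a minimal sketch of the transformation described above: convert the income label to a boolean, prepend a header row, and split the rows into training data and observations. It is not the linked adult.py script. The column names, the function names, and the 20% split are all assumptions made for this illustration.

```python
import csv
import io
import random

# Hypothetical column names; the raw UCI file has no header row.
HEADER = ["age", "workclass", "fnlwgt", "education", "education_num",
          "marital_status", "occupation", "relationship", "race", "sex",
          "capital_gain", "capital_loss", "hours_per_week", "native_country",
          "income_over_50k"]


def to_boolean(label):
    """Map the raw income label to the true/false strings used in the CSV."""
    label = label.strip().rstrip(".")  # tolerate a trailing period on the label
    if label == ">50K":
        return "true"
    if label == "<=50K":
        return "false"
    raise ValueError(f"unexpected income label: {label!r}")


def prepare(raw_lines, observation_fraction=0.2, seed=0):
    """Return (training_rows, observation_rows), each starting with a header."""
    rng = random.Random(seed)  # seeded so the split is repeatable
    training = [HEADER]
    observations = [HEADER[:-1]]  # observations carry no target column
    for row in csv.reader(raw_lines, skipinitialspace=True):
        if len(row) != len(HEADER):
            continue  # skip blank or malformed lines
        features, label = row[:-1], row[-1]
        if rng.random() < observation_fraction:
            observations.append(features)
        else:
            training.append(features + [to_boolean(label)])
    return training, observations


def to_csv(rows):
    """Serialize rows to CSV text, ready to be written to a file."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Calling prepare(open("adult.data")) and writing each result with to_csv would give two files ready for upload. The file name here is only an example.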
After I have the data in CSV format, I can upload it to S3. AML can also read from AWS RDS and Redshift via a query, using a SQL query as the prep script. (Note that you can’t use AWS RDS as a data source via the console, only via the API.)
The Data Pipeline: Create the Datasource
Once the CSV file is on S3, you need to build AML specific objects. These all have their own identity and are independent of the CSV data. First you need to create the AML data source.
You point AML at the data on S3. You need to specify a schema, which includes mapping fields to one of the four supported data types. You also select a target variable (if the data source has the variable you want to predict) and a row identifier (if each row has a unique ID that should be carried through the process). If you are doing this often or you want a repeatable process, you can store the schema as JSON and provide it via the API.
Here’s an excerpt of the schema file I am using for the income prediction model:
You can see that I specify the target attribute, the data file format, and a list of attributes with a name and a data type. (Full schema file here.)
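For readers who want to see the shape of such a schema, here is a hedged sketch that builds one in Python and serializes it to JSON. It is not the schema file from the repository: the attribute names are hypothetical and only a handful of columns are listed. The key names (version, targetAttributeName, dataFormat, dataFileContainsHeader, attributes) follow the AML schema format as I understand it.

```python
import json

VALID_TYPES = {"BINARY", "CATEGORICAL", "NUMERIC", "TEXT"}


def build_schema(attributes, target=None, contains_header=True):
    """Build an AML-style schema dict from (name, type) pairs."""
    for name, attr_type in attributes:
        if attr_type not in VALID_TYPES:
            raise ValueError(f"{name}: unsupported type {attr_type!r}")
    schema = {
        "version": "1.0",
        "dataFormat": "CSV",
        "dataFileContainsHeader": contains_header,
        "attributes": [
            {"attributeName": name, "attributeType": attr_type}
            for name, attr_type in attributes
        ],
    }
    if target is not None:
        # Observation data sources have no target, so this key is optional.
        schema["targetAttributeName"] = target
    return schema


# Abbreviated, hypothetical column list for the income data.
ATTRIBUTES = [
    ("age", "NUMERIC"),
    ("workclass", "CATEGORICAL"),
    ("education", "CATEGORICAL"),
    ("hours_per_week", "NUMERIC"),
    ("income_over_50k", "BINARY"),
]

schema_json = json.dumps(build_schema(ATTRIBUTES, target="income_over_50k"), indent=2)
```

A real schema has to list every column in the CSV file, in order.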
Note that you can create multiple data sources from the same data, and that AML only needs read access to the S3 location. You can also add arbitrary string tags to the data source; for example, date_created or author. The first ten tags you add to a data source will be inherited by other AML entities like models or evaluations that are created from the data source. As your models proliferate, tags are a good way to organize them.
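Creating and tagging a data source through the API might look roughly like this with the Python SDK (boto3). Treat it as a sketch rather than a reference: the IDs, names, and S3 path are placeholders, and the client is passed in (for example, boto3.client("machinelearning")) so the request-building logic can be inspected without touching AWS.

```python
def data_source_request(data_source_id, name, s3_data_uri, schema_json,
                        compute_statistics=True):
    """Build the keyword arguments for create_data_source_from_s3."""
    return {
        "DataSourceId": data_source_id,
        "DataSourceName": name,
        "DataSpec": {
            "DataLocationS3": s3_data_uri,  # AML only needs read access here
            "DataSchema": schema_json,      # the schema as a JSON string
        },
        # Statistics are required if the data source will train a model.
        "ComputeStatistics": compute_statistics,
    }


def create_tagged_data_source(client, request, tags):
    """Create the data source, then attach string tags to it."""
    client.create_data_source_from_s3(**request)
    client.add_tags(
        ResourceId=request["DataSourceId"],
        ResourceType="DataSource",
        Tags=[{"Key": key, "Value": value} for key, value in sorted(tags.items())],
    )
```

A call might be create_tagged_data_source(client, data_source_request("ds-income-train", "income training", "s3://my-bucket/adult/training.csv", schema_json), {"author": "me"}), where every value is made up for the example.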
Finally, when the data source is created, you’ll receive statistics about the data set, including histograms of the various variables, counts of missing values, and the distribution of your target variable. Here’s an example of target variable distribution for the adult data set I am using:
Data insights can be useful in determining if your data is incomplete or nonsensical. For example, if I had 15,000 records but only five of them had an income greater than $50,000, trying to predict that value wouldn’t make much sense. There simply isn’t a valid distribution of the target variable, and my model would be skewed heavily toward whatever attributes those five records had. This type of data intuition is only gained through working with your dataset.
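You can run the same sanity check yourself before uploading anything. The sketch below, with hypothetical function names, computes the share of each target value so that a badly skewed target stands out. It reuses the numbers from the example above: 15,000 records of which only five are positive.

```python
from collections import Counter


def target_distribution(rows, target_index=-1):
    """Return {target value: share of rows} for the given column."""
    counts = Counter(row[target_index] for row in rows)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no rows to summarize")
    return {value: count / total for value, count in counts.items()}


def rarest_class_share(rows, target_index=-1):
    """Share of the least common target value; tiny values signal skew."""
    return min(target_distribution(rows, target_index).values())


# The skewed scenario from the text: 5 positives out of 15,000 records.
skewed = [["...", "true"]] * 5 + [["...", "false"]] * 14995
share = rarest_class_share(skewed)  # 5 / 15000, about 0.03% of the rows
```

What counts as "too skewed" depends on the problem; this only surfaces the number so you can make that call.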
The Data Pipeline: Create the Model
Once you have the AML data source created, you can create a model.
An AML model is an opaque, non-exportable representation of your data, which is built using the stochastic gradient descent optimization technique. There are configuration parameters you can tweak, but AML provides sensible defaults based on your data. These parameters are an area for experimentation.
Also, a “recipe” is required to build a model. Using a recipe, you can transform your data before the model building algorithm accesses it, without modifying the source data. Recipes can also create intermediate variables which can be fed into the model, group variables together for easy transformation and exclude source variables. There are many transformations that you can transparently perform on the data, including:
- Lowercasing strings
- Removing punctuation
- Normalizing numeric values
- Binning numeric values
- And more
Note that if you need to perform a different type of transformation (such as converting a boolean value to an AML compatible format), you’ll have to do it as part of the prep script. There is no way to transform data in a recipe other than using the provided transformations.
If you are using the API, the recipe is a JSON file that you can store outside of the AML pipeline and provide when creating a model.
Here’s an example of a recipe that I used on this income prediction dataset:
Groups are a way of grouping different variables (defined in the schema) together so that operations can be applied to them en masse. For example, NUMERIC_VARS_QB_10 is a group of continuous numeric variables that are binned into 10 separate bins (turning the numeric variables into categorical variables).
Assignments let you create intermediate variables. I didn’t use that capability here.
Outputs are the list of variables that the model will see and operate on. In this case, ALL_CATEGORICAL and ALL_BINARY are shortcuts referring to all of those types of input variables. If you remove a variable from the outputs clause, the model will ignore the variable.
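To show how the three sections described above fit together, here is an illustrative recipe built in Python and serialized to JSON. It is not the recipe I used for the model. The variable names inside the group are hypothetical, while the group name NUMERIC_VARS_QB_10, the ALL_CATEGORICAL and ALL_BINARY shortcuts, and the quantile_bin transformation come from the AML recipe format.

```python
import json

recipe = {
    # Groups: name a set of schema variables so they can be transformed together.
    "groups": {
        "NUMERIC_VARS_QB_10": "group('age','hours_per_week')",
    },
    # Assignments: intermediate variables; left empty, as in the article.
    "assignments": {},
    # Outputs: the only variables the model building algorithm will see.
    "outputs": [
        "ALL_CATEGORICAL",
        "ALL_BINARY",
        # Bin each numeric variable in the group into 10 quantile bins.
        "quantile_bin(NUMERIC_VARS_QB_10,10)",
    ],
}

recipe_json = json.dumps(recipe, indent=2)
```

Dropping an entry from the outputs list is how you hide a variable from the model without touching the source data.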
In the same way that you have multiple different data sources from the same data, you can create multiple models based on the same data source. You can tweak the parameters and the recipe to build different models. You can then compare those models and test to see which is most accurate.
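One way to experiment is to generate several model requests from the same data source, varying a single training parameter. The sketch below builds the keyword arguments for the Python SDK's create_ml_model call. The IDs and helper names are hypothetical, and I am assuming the parameter keys sgd.maxPasses and sgd.shuffleType, which take string values.

```python
def ml_model_request(model_id, name, model_type, training_data_source_id,
                     recipe_json, max_passes=10):
    """Build the keyword arguments for create_ml_model."""
    return {
        "MLModelId": model_id,
        "MLModelName": name,
        "MLModelType": model_type,  # REGRESSION, BINARY, or MULTICLASS
        "TrainingDataSourceId": training_data_source_id,
        "Recipe": recipe_json,
        "Parameters": {
            # Parameter values are passed as strings.
            "sgd.maxPasses": str(max_passes),
            "sgd.shuffleType": "auto",
        },
    }


def model_variants(base_id, training_data_source_id, recipe_json,
                   max_passes_options=(10, 50, 100)):
    """One request per candidate setting, all from the same data source."""
    return [
        ml_model_request(
            model_id=f"{base_id}-passes-{passes}",
            name=f"{base_id} (maxPasses={passes})",
            model_type="BINARY",
            training_data_source_id=training_data_source_id,
            recipe_json=recipe_json,
            max_passes=passes,
        )
        for passes in max_passes_options
    ]
```

Each request can then be sent with client.create_ml_model(**request), where client is boto3.client("machinelearning").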
But how do you test for accuracy?
The Data Pipeline: Evaluate and Use the Model
When you have an AML model, there are three operations you can perform.
The first is model evaluation. When you are training a model, you can optionally hold back some of the training data (which has the real world target variable values). This is called evaluation data. After you build the model, you can run this data through it, stripping off the target variable, and get the model’s predictions. Then the system can compare each predicted value with the correct answer across all the evaluation data. This gives an indication of the model’s accuracy.
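If you drive this through the API, one way to hold data back is to create two data sources from the same file, each with a different "splitting" rearrangement. The following is a sketch under that assumption: the 70/30 split and the helper names are my own, and the rearrangement strings are meant to be passed as the DataRearrangement entry of a data source's DataSpec.

```python
import json


def holdout_rearrangements(train_percent=70):
    """Return (training, evaluation) DataRearrangement JSON strings."""
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be between 0 and 100")
    training = {"splitting": {"percentBegin": 0, "percentEnd": train_percent}}
    evaluation = {"splitting": {"percentBegin": train_percent, "percentEnd": 100}}
    return json.dumps(training), json.dumps(evaluation)


def evaluation_request(evaluation_id, model_id, evaluation_data_source_id):
    """Build the keyword arguments for create_evaluation."""
    return {
        "EvaluationId": evaluation_id,
        "EvaluationName": evaluation_id,
        "MLModelId": model_id,
        # Must be the held-back data source, which still has the target values.
        "EvaluationDataSourceId": evaluation_data_source_id,
    }
```

The model is trained on the first data source and evaluated against the second, so it is scored on rows it has never seen.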
Here’s an example of an evaluation for the income prediction model that I built using the census data:
Depending on your model’s target variable, you will get different representations of this value, but fundamentally, you are asking how often the model was correct. There are two things to be aware of:
- You won’t get 100% accuracy. If you see that, your model exactly matches the evaluation data, which means that it’s unlikely to match real world data. This is called overfitting.
- Evaluation scores differ based on both the model and the data. If the data isn’t representative of the observations you’re going to be making, the model won’t be accurate.
For the adult dataset, which yields a binary classification model, we get something called the area under the curve (AUC). The closer the AUC is to 1, the better our model matched reality. Other types of target variables get other measures of accuracy.
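AML computes the AUC for you, but a tiny worked example may make the number less mysterious. One standard way to read the AUC is as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it that way by brute force. The labels and scores are made up, and this is not how AML implements it internally.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive correctly scored above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Four made-up examples: two positives (1) and two negatives (0).
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(auc(labels, scores))  # 3 of the 4 pairs are ordered correctly: 0.75
```

A model that ranks every positive above every negative scores 1.0, and random guessing lands around 0.5.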
You can also, with a model that has a boolean target variable, determine a cutoff point, called the scoreThreshold. The model will give a prediction between 0 and 1, and you can then determine where you want the results to be split between 1 (true) or 0 (false). Is it 0.5 (the default)? Or 0.9 (which will give you fewer false positives, where the model predicts the value is true, but reality says it’s not)? Or 0.1 (which will give you fewer false negatives, where the model predicts the value is false, but reality says it’s true)? What you set this value to depends on the actions you’re going to take. If the action is inexpensive (showing someone an advertisement they may like), you may want to err on the side of fewer false negatives. If the opposite is true, and the action is expensive (having a maintenance tech visit a factory for proactive maintenance) you will want to set this higher.
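The trade-off is easy to see with a few made-up scores. This sketch counts false positives and false negatives at the three thresholds mentioned above. Treating a score equal to the threshold as true is a choice made for this illustration, not a statement about how AML breaks ties.

```python
def confusion_counts(labels, scores, threshold=0.5):
    """Count true/false positives and negatives at a given score threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, score in zip(labels, scores):
        predicted = score >= threshold  # tie handling is an assumption here
        if predicted and actual:
            counts["tp"] += 1
        elif predicted and not actual:
            counts["fp"] += 1  # model says true, reality says false
        elif not predicted and actual:
            counts["fn"] += 1  # model says false, reality says true
        else:
            counts["tn"] += 1
    return counts


# Made-up evaluation data: actual values and the model's scores.
labels = [True, True, False, False, True]
scores = [0.95, 0.60, 0.70, 0.20, 0.30]

for threshold in (0.1, 0.5, 0.9):
    print(threshold, confusion_counts(labels, scores, threshold))
# 0.1 -> no false negatives, two false positives
# 0.5 -> one false negative, one false positive
# 0.9 -> two false negatives, no false positives
```

Raising the threshold trades false positives for false negatives, which is why the right value depends on the cost of acting on a wrong prediction.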
Other target variable types don’t have the concept of a score threshold and may return different values. Below, you’ll see sample predictions for different types of target variables.
Evaluations are entirely optional. If you have some other means of determining the accuracy of your model, you don’t have to use AML’s evaluation process. But using the built-in evaluation settings lets you easily compare models and gives you a way to experiment when tweaking configuration and recipes.
Now that you’ve built the model and have a handle on its accuracy, you can make predictions. There are two types of predictions: batch and real time.
Batch predictions take place asynchronously. You create an AML data source pointing to your observations. The data format must be identical, and the target variable must be absent, but you can use any of the supported data source options (S3, RDS, Redshift). You can process millions of records at a time. You submit a job via the console or API. The job request includes the data source of the observations, the model ID and the prediction output location.
After you start the job, you need to poll AML until it is done. Some SDKs, including the Python SDK, include code that will poll for you; see https://github.com/mooreds/amazonmachinelearning-anintroduction/blob/master/prediction/batchpredict.py for some sample code.
At job completion, the results will be placed in the specified S3 output bucket. If your observation data has a row identifier, that will be in the output file as well. Otherwise each input row will correspond to an output row based on line number (the first row of input will correspond to the first row of output, the second row of input to the second row of output, and so on).
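The submit-then-poll flow can be sketched with the Python SDK as follows. This is a hand-rolled polling loop, not the linked batchpredict.py script. The IDs and output location are placeholders, and the client argument is expected to be boto3.client("machinelearning"). I am assuming the job status values PENDING, INPROGRESS, COMPLETED, FAILED, and DELETED.

```python
import time

TERMINAL_STATUSES = {"COMPLETED", "FAILED", "DELETED"}


def start_batch_prediction(client, prediction_id, model_id, data_source_id,
                           output_uri):
    """Submit a batch job: observations in, predictions written to S3."""
    client.create_batch_prediction(
        BatchPredictionId=prediction_id,
        BatchPredictionName=prediction_id,
        MLModelId=model_id,
        BatchPredictionDataSourceId=data_source_id,  # the observations
        OutputUri=output_uri,                        # e.g. an s3:// prefix
    )


def wait_for_batch_prediction(client, prediction_id, poll_seconds=30,
                              max_polls=120, sleep=time.sleep):
    """Poll until the job reaches a terminal status, then return that status."""
    for _ in range(max_polls):
        response = client.get_batch_prediction(BatchPredictionId=prediction_id)
        status = response["Status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"batch prediction {prediction_id} did not finish")
```

The sleep function is injectable so the loop can be exercised in a test without actually waiting.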
Here’s sample output from a batch prediction job of the income prediction model:
You are given the bestAnswer, which is based on scoreThreshold above. But you’re also given the values calculated by the model.
For a multi-class classification, I am given all the values. Below, there were seven different classes (using a different data set based on wine characteristics, if you must know). The model predicts for line 1 that the value is most likely to be “6” with an 84% likelihood (8.404026E-1 is approximately 0.84, or 84%).
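Picking the winning class out of a set of per-class scores is a one-liner. In the sketch below, only the score for class "6" (8.404026E-1) comes from the output discussed above. The other six class labels and scores are invented so the example has seven classes.

```python
def best_class(predicted_scores):
    """Return (label, score) for the class with the highest score."""
    return max(predicted_scores.items(), key=lambda item: item[1])


# Seven classes; every value except the one for "6" is made up.
scores = {
    "3": 0.002,
    "4": 0.01,
    "5": 0.12,
    "6": float("8.404026E-1"),
    "7": 0.02,
    "8": 0.005,
    "9": 0.0025974,
}

label, score = best_class(scores)
print(f"{label}: {score:.0%}")  # 6: 84%
```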
And for a numeric target variable (based on yet another dataset), I just get back the value predicted:
Batch predictions work well as part of a data pipeline when you don’t care about when you get your answers, just that you get them. (Any batch job that takes more than a week will be killed, so there is a time limit.) An example of a problem for which a batch job would be appropriate is scoring thousands of customers to see if any are likely to churn this month.
Real Time Predictions
Real time predictions are, as advertised, predictions that are synchronous. AML generally returns predictions within 100 milliseconds. You can set up a real time endpoint on any model in the console or with an API call. It can take a few minutes for the endpoint server to be ready. But you don’t have to maintain the endpoint in any way–it’s entirely managed by the AML service.
You provide one observation to the real time endpoint, and it will return you a prediction based on that observation. Here’s a sample observation for the income prediction model that we built above:
51, Local-gov, 108435, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 80, United-States
Predictions are made using an AWS API and return a data structure. Here’s the JSON that is returned when I call the income prediction model with the above observation:
"date":"Sun, 30 Apr 2017 04:13:51 GMT",
The predictedLabel and predictedScores are the predicted values for this observation, and are what I am really interested in. The predictedLabel is calculated using the score threshold, but I still get the calculated value if that is useful to me.
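Calling the endpoint from the Python SDK could look something like the sketch below. The feature names are hypothetical and need to match whatever names your schema uses. The client is expected to be boto3.client("machinelearning"), and the endpoint URL is the one AML reports once the real time endpoint is ready.

```python
import csv

# Hypothetical feature names; they must match the data source schema.
FEATURES = ["age", "workclass", "fnlwgt", "education", "education_num",
            "marital_status", "occupation", "relationship", "race", "sex",
            "capital_gain", "capital_loss", "hours_per_week", "native_country"]


def record_from_csv_line(line, feature_names=FEATURES):
    """Turn one observation line into the name -> string mapping predict expects."""
    values = next(csv.reader([line], skipinitialspace=True))
    if len(values) != len(feature_names):
        raise ValueError(f"expected {len(feature_names)} values, got {len(values)}")
    return dict(zip(feature_names, values))


def predict_income(client, model_id, endpoint_url, line):
    """Request one real time prediction; return (predictedLabel, predictedScores)."""
    response = client.predict(
        MLModelId=model_id,
        Record=record_from_csv_line(line),  # all values are sent as strings
        PredictEndpoint=endpoint_url,
    )
    prediction = response["Prediction"]
    return prediction["predictedLabel"], prediction["predictedScores"]
```

Keeping the record-building step separate makes it easy to check that an observation lines up with the schema before paying for a prediction.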
Real time predictions are the right choice when you have observations that require a prediction immediately. An example would be to choose what kind of ad to display to a user right now, based on existing data and their current behavior.
Now that you’ve seen the major constructs of the AML data pipeline, as well as some predictions that were made using an AML model, let’s cover some operational concerns.
AML is a totally managed service. You pay for the data storage solutions (both for input and results), but you don’t pay for storage of any of the AML managed artifacts (like the model or data source). You also pay for the compute time to build your data sources, models, and evaluations.
For predictions, you pay per prediction. If you are running real time endpoints, you also pay per hour you have the endpoint up. For the model that I built using the census data, it cost about $0.50 to process all 20k records and to make a few thousand predictions.
Full pricing information is available here: https://aws.amazon.com/aml/pricing/
The Model Creation Pipeline
AML Models are immutable, as are data sources. If you need to incorporate ongoing data into your model, which is generally a good idea, you need to automate your datasource and model building process so they are repeatable. Then, when you have new data, you can rebuild your model, test it out, and then change which model is “in production” and making predictions.
You can use tags to control which model is used for a given prediction, so you can end up building a CI pipeline and having a ‘production’ model and test models.
Here’s a simple example of a model update pipeline in one script: https://github.com/mooreds/amazonmachinelearning-anintroduction/blob/master/updatemodel/updatemodel.py
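The tag-based approach can be sketched like this. It is not the linked updatemodel.py script. The tag key and value ("environment" and "production") and the function names are conventions invented for this example, and the client is expected to be boto3.client("machinelearning").

```python
def production_model_id(client, candidate_model_ids,
                        tag_key="environment", tag_value="production"):
    """Return the first model carrying the production tag, or None."""
    for model_id in candidate_model_ids:
        response = client.describe_tags(ResourceId=model_id, ResourceType="MLModel")
        for tag in response["Tags"]:
            if tag["Key"] == tag_key and tag["Value"] == tag_value:
                return model_id
    return None


def promote(client, new_model_id, old_model_id=None,
            tag_key="environment", tag_value="production"):
    """Move the production tag from the old model to the newly built one."""
    if old_model_id is not None:
        client.delete_tags(ResourceId=old_model_id, ResourceType="MLModel",
                           TagKeys=[tag_key])
    client.add_tags(ResourceId=new_model_id, ResourceType="MLModel",
                    Tags=[{"Key": tag_key, "Value": tag_value}])
```

Code that makes predictions looks up the tagged model at run time, so rolling out a rebuilt model is a matter of moving the tag once the new model has been evaluated.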
Like any other AWS service, AML leverages the Identity and Access Management service (IAM). You can control access to data sources, models, and all other AML constructs with IAM. The full list of permissions is here: https://docs.aws.amazon.com/IAM/latest/UserGuide/list_amazonmachinelearning.html
It’s important to note that if you are using the AWS console to test drive AML, the console will set up the permissions correctly, but if you are using the API to construct a data pipeline, you will need to ensure that IAM access is set up correctly. I’ve found it useful to use the console first and then examine the permissions it sets up and leverage that for the scripts that use the API.
You can monitor AML processes via Amazon CloudWatch. Published metrics include the number of predictions and the number of failed predictions, per model. You can set up alarms in the typical CloudWatch fashion to take action on the metrics (for example, emailing someone if a new model is rolled to production and a large number of failed predictions ensues).
AWS ML Alternatives
There are many services within AWS that are complements to AML. These focus on a particular aspect of ML (computer vision, speech recognition) and include Rekognition and Lex.
Amazon SageMaker is a more general purpose machine learning service with many of the benefits of AML. It lets you use standard machine learning software like Jupyter notebooks, supports multiple algorithms, and lets you run your models locally.
If you are looking for even more control (with corresponding responsibility), there is a Deep Learning AMI available. This AMI comes preinstalled with a number of open source machine learning frameworks. You can use this AMI to boot up an EC2 instance and have full configuration and control.
Amazon Machine Learning makes it super simple to make predictions by creating a model to predict outcomes based on structured text data. AML can be used at all scales, from a few hundred records to millions—all without running any infrastructure. It is the perfect way to bring ML predictions into an existing system easily and inexpensively.
AML is a great way to gain experience with machine learning. There is little to no coding required, depending on what your source data looks like. It has configuration options but is really set up to “just work” with sane defaults.
AML helps you explore the world of machine learning while providing a robust production ready system to help make your applications smarter.
About the Author
Dan Moore is director of engineering at Culture Foundry. He is a developer with two decades of experience, former AWS trainer, and author of “Introduction to Amazon Machine Learning,” a video course from O’Reilly. He blogs on AML and other topics at http://www.mooreds.com/wordpress/ . You can find him on Twitter at @mooreds.
About the Editors
Jennifer Davis is a Senior Cloud Advocate at Microsoft. Previously, she was a principal site reliability engineer at RealSelf and developed cookbooks to simplify building and managing infrastructure at Chef. Jennifer is the coauthor of Effective DevOps and speaks about DevOps, tech culture, and monitoring. She also gives tutorials on a variety of technical topics. When she’s not working, she enjoys learning to make things and spending quality time with her family.
John Varghese is a Cloud Steward at Intuit responsible for the AWS infrastructure of Intuit’s Futures Group. He runs the AWS Bay Area meetup in the San Francisco Peninsula Area for both beginners and intermediate AWS users. He has also organized multiple AWS Community Day events in the Bay Area. He runs a Slack channel just for AWS users. You can contact him there directly via Slack. He has a deep understanding of AWS solutions from both strategic and tactical perspectives. An avid AWS user since 2012, he evangelizes AWS and DevOps every chance he gets.