Code assistance for boto3, always up to date and in any IDE

21. December 2018

If you're like me and work with the boto3 SDK to automate your Ops, then you're probably familiar with this sight:

image1

No code completion! It's almost as useful as coding in Notepad, isn't it? This is one of the major quirks of the boto3 SDK. Due to its dynamic nature, we don't get the code completion we're used to with other libraries.

I used to deal with this by going back and forth with the boto3 docs. However, this impacted my productivity by interrupting my flow all the time. I had recently adopted Python as my primary language and had second thoughts on whether it was the right tool to automate my AWS stuff. Eventually, I even became sick of all the back-and-forth.

A couple of weeks ago, I thought enough was enough. I decided to solve the code completion problem so that I never have to worry about it anymore.

But before starting it, a few naysaying questions cropped up in my head:

  1. How would I find time to support all the APIs that the community and I want?
  2. Will this work be beneficial to people not using the X IDE?
  3. With 12 releases of boto3 in the last 15 days, will this become a full time job to continuously update my solution?

Thankfully, I found a lazy programmer’s solution that I could conceive in a weekend. I put up an open source package and released it on PyPI. I announced this on reddit and within a few hours, I saw this:

image4

Looks like a few people are going to find this useful! 🙂

In this post I will describe botostubs, a package that gives you code completion for boto3, all methods in all APIs. It even automatically supports any new boto3 releases.

Read on to learn a couple of less-used facilities in boto3 that made this project possible. You will also learn how I automated myself out of the job of maintaining botostubs by leveraging a simple deployment pipeline on AWS that costs about $0.05 per month to run.

What’s botostubs?

botostubs is a PyPI package which you can install to give you code completion for any AWS service. You install it in your Python runtime using pip, add "import botostubs" to your scripts along with a type hint for your boto3 clients, and you're good to go:

image18
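
For instance, a minimal script might look like this (assuming the stub class for S3 is exposed as botostubs.S3):

import boto3
import botostubs

# The annotation only informs the IDE; at runtime s3 is a regular boto3 client.
s3: botostubs.S3 = boto3.client('s3')
s3.list_buckets()  # the IDE can now suggest client methods and their parameters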

Now, instead of “no suggestions”, your IDE can offer you something more useful like:

image17

The parameters in boto3 are dynamic too, so what about them?

With botostubs, you can now get to know which parameters are supported and also which are required or optional:

image10

Much more useful, right? No more back-and-forth with the boto3 docs, yay!

The above is for IntelliJ/PyCharm, but will this work in other IDEs?

Here are a couple of screenshots of botostubs running in Visual Studio Code:

image2

image3

Looks like it works! You should be able to use botostubs in any IDE that supports code completion from Python packages.

Why is this an issue in the first place?

As I mentioned before, the boto3 SDK is dynamic, i.e. the methods and APIs don't exist as code. As it says in the guide,

It uses a data-driven approach to generate classes at runtime from JSON description files …

The SDK maintainers do it to be able to enhance the SDK reliably and faster. This is great for the maintainers but terrible for us, the end users of the SDK.

Therefore, we need statically defined classes and methods. Since boto3 doesn’t work that way, we need a separate solution.

How botostubs works

At a high level, we need a way to discover all the available APIs, find out about the method signatures and package them up as classes in a module.

  1. Get a boto3 session
  2. Loop over its available clients
  3. Find out about each client’s operations
  4. Generate class signatures
  5. Dump them in a Python module

I didn’t know much about boto3 internals before so I had to do some digging on how to accomplish that. You can use what I’ve learnt here if you’re interested in building tools on top of boto3.

First, about the clients. It’s easy when you already know which API you need, e.g with S3, you write:

client = boto3.client('s3')

But for our situation, we don't know which ones are there in advance. I could have hardcoded them, but I needed a scalable and foolproof way. I found out that the way to do that is with a session's get_available_services() facility.

image19
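
In code, that looks roughly like this (the region name is just an arbitrary default so clients can be created without extra configuration):

import boto3

session = boto3.Session()
services = session.get_available_services()
print(len(services), services[:5])  # a long list of service names such as 'acm', 's3', ...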

Tip: Much of what I've learnt has been through IntelliJ's debugger. Very handy, especially when dealing with dynamic code.

image13

For example, to learn how the dynamic code gets converted into actual API calls to AWS, you can place a breakpoint in _make_api_call, found in boto3's client.py:

image14

Steps 1 and 2 solved. Next, I had to find out which operations are possible, in a scalable fashion. For example, the S3 API supports about 98 operations for listing objects, uploading and downloading them. Coding 98 operations by hand is no fun, so I had to get creative.

Digging deeper, I found out that clients have an internal botocore service model that had everything I was looking for. Through the service model you can find the service documentation, API version, etc.

Side note: botocore is a factored-out library that is shared with the AWS CLI. Much of what boto3 is capable of is actually powered by botocore.

In particular, we can read the available operation names. For example, the service model for the ACM API returns:

image6

Step 3 was therefore solved with:

image23
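
A sketch of that lookup (the operation names in the comment are illustrative):

acm = session.client('acm', region_name='us-east-1')
service_model = acm.meta.service_model
print(service_model.operation_names)
# e.g. ['AddTagsToCertificate', 'DeleteCertificate', 'DescribeCertificate', ...]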

Next, we need to know what parameters are available for each operation. In boto parlance, they are called "input shapes". (Similarly, you can get the output shape if needed.) Digging some more in the service model source, I found out that we can get the input shape with the operation model:

image12
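
Roughly, and again using ACM's RequestCertificate operation as an illustrative example:

op_model = service_model.operation_model('RequestCertificate')
input_shape = op_model.input_shape
print(input_shape.required_members)   # required parameters, e.g. ['DomainName']
print(list(input_shape.members))      # all parameters, required and optional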

This told me the required and optional parameters, which solved the missing part of generating the method signatures. (I don't need the method body since I'm generating stubs.)

Then it was a matter of generating classes based on the clients and operations above and packaging them in a Python module.

For any version of boto3, I just had to run my script and then the twine PyPI utility, and out came a PyPI package that's up to date with upstream boto3. All of that took about 100 lines of Python code.
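
The generation step boils down to something like the following very rough sketch (names and formatting are illustrative; the real package also converts operation names into their snake_case method equivalents):

with open('botostubs.py', 'w') as module:
    for service in session.get_available_services():
        model = session.client(service, region_name='us-east-1').meta.service_model
        module.write('class %s:\n' % model.service_name.replace('-', '_').upper())
        for operation in model.operation_names:
            shape = model.operation_model(operation).input_shape
            params = list(shape.members) if shape else []
            module.write('    def %s(self%s): ...\n'
                         % (operation, ''.join(', ' + p for p in params)))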

Another problem remained to be solved though: with a new boto3 release every time you change your t-shirt, I would need to run this and re-upload to PyPI several times a week. Wouldn't that become a maintenance hassle for me?

The deployment pipeline

To solve this problem, I looked to AWS itself. The simplest way I found was to use their build tool and invoke it on a schedule. What I wanted was a way to get the latest boto3 version, run the script and upload the artefact to PyPI, all without my intervention.

The relevant AWS services to achieve this are CloudWatch Events (to trigger other services on a schedule), CodeBuild (a managed build service in the cloud) and SNS (for email notifications). This is what the architecture looks like on AWS:

image20

Image generated with viz-cfn

The image above describes the CloudFormation template used for the deployment; the template is available on GitHub, as well as the code.

The AWS CodeBuild project looks like this:

image5

image15

To keep my credentials outside of source control, I also attached a service role to give CodeBuild permissions to write logs and read my PyPI username and password from the Systems Manager parameter store.

I also enabled the build badge feature so that I can show the build status on Github:

image7


For intricate details, check out the buildspec.yml and the project definition.

I want this project to be invoked on a schedule (I chose every 3 days) and I can accomplish that with a CloudWatch Event rule:

image11

When the rule gets triggered, I see that my CodeBuild project does what it needs to do: clone the git repo, generate the stubs and upload to PyPI:

image21

This whole process is done in about 25 seconds. Since this is entirely hands-off, I needed some way to be kept in the loop. After the build has run, there's another CloudWatch Event which gets triggered for build events on the project. It sends a notification to SNS, which in turn sends me an email to let me know if everything went OK:

image9

The build event and notification.

image8

The SNS Topic with an email subscription.

That's it! But what about my AWS bill? My estimate is that it should be around $0.05 every month, so it definitely won't break the bank, and I'm pretty satisfied with that! Imagine how much it would cost to maintain a build server on your own to accomplish all of that.

What’s with the weird versioning?

You will notice botostubs versions look like this:

image22

It currently follows boto3 releases in the format 0.4.x.y.z. Therefore, if botostubs is currently at 0.4.1.9.61, then it offers whatever is available in boto3 version 1.9.61. I included the boto3 version in mine to make it more obvious which version of boto3 botostubs was generated from, but also because PyPI does not allow uploads at the same version number.

Are people using it?

According to pypistats.org, botostubs was downloaded about 600 times in its first week, after I showed it to the Reddit community. So it seems it was a much-needed tool:

image16

Your turn

If this sounds like something you'll need, get started by running:

pip install botostubs

Run it and let me know if you have any advice on how to make this better.

Credit

Huge thanks go to another project called pyboto3 for the original idea. The issues I had with it were that it was unmaintained and supported legacy Python only. I wouldn't have known that this would be possible were it not for pyboto3.

Open for contribution

botostubs is an open source project, so feel free to send your pull requests.

A couple of areas where I’ll need some help:

  • Support Python < 3.5
  • Support boto3 high-level resources (as opposed to just low-level clients)

Summary

In this article, I've shared my process for developing botostubs: examining the internals of boto3 and automating its maintenance with a deployment pipeline that handles all the grunt work. If you like it, I would appreciate it if you shared it with a fellow Python DevOps engineer:

https://pypi.org/project/botostubs/.

I hope you are inspired to find solutions for AWS challenges that are not straightforward and share them with the community.

If you used what you’ve learnt above to build something new, let me know, I’d love to take a look! Tweet me @jeshan25.

About the Author

Jeshan Babooa is an independent software developer from Mauritius. He is passionate about all things infra automation on AWS, especially with tools like CloudFormation and Lambda. He is the guy behind LambdaTV, a YouTube channel dedicated to teaching serverless on AWS. You can reach him on Twitter @jeshan25.

About the Editors

Ed Anderson is the SRE Manager at RealSelf, organizer of ServerlessDays Seattle, and occasional public speaker. Find him on Twitter at @edyesed.

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Time Series Anomaly Detection with LSTM and MXNet

As software engineers, we try our best to make sure that the solution we build is reliable and robust. Monitoring the production environment with reasonable alerts and timely actions to mitigate and resolve any issues are pieces of the puzzle required to make our customers happy. Monitoring can produce a lot of data – CPU, memory, and disk IO are the most commonly collected for hardware infrastructure, however, in most cases, you never know what the anomaly is, so the data is not labeled.

We decided to take a common problem – anomaly detection within time series data of CPU utilization – and explore how to identify it using unsupervised learning. The dataset we use is the Numenta Anomaly Benchmark (NAB). It is labeled, and we will use the labels for calculating scores and for the validation set. There are plenty of well-known algorithms that can be applied for anomaly detection – K-nearest neighbors, one-class SVM, and Kalman filters to name a few. However, most of them do not shine in the time series domain. According to many studies [1] [2], a long short-term memory (LSTM) neural network should work well for these types of problems.

TensorFlow is currently the trend leader in deep learning, however, at Lohika we have pretty good experience with another solid deep-learning framework, Apache MXNet. We like it because it is light, scalable, portable and well-documented, and it is also Amazon’s preferred choice for their deep-learning framework at AWS.

The neural network that we are going to implement is called an autoencoder. An autoencoder is a type of neural network that learns an approximation of its input: it compresses the input data into an intermediate representation and then reconstructs the input from it. When training autoencoders, the idea is to minimize some metric that depends on the input and output values; we use MSE in our case. To compare the performance of different network designs and hyperparameters we use the F1 score. The F1 score conveys the balance between precision and recall and is commonly used for binary classification.

The goal for this task is to detect all known anomalies on the test set and get the maximum F1 score.

For the implementation, we use Python and a few libraries that are very handy – pandas for dataset manipulation, scikit-learn for data pre-processing and metrics calculations, and matplotlib for visualization.

So let’s get to it…

Dataset Overview

The NAB dataset contains a lot of labeled real and artificial data that can be used for anomaly detection algorithm evaluation. We used actual CPU utilization data of some AWS RDS instances for our study. The dataset contains 2 files of records with values taken every 5 minutes over a period of 14 days, 4,032 entries per file. We used one file for training and the other for testing.

Deep learning requires large amounts of data for real-world applications. But smaller datasets are acceptable for basic study, especially since model training doesn’t take much time.

Let’s describe all paths to datasets and labels:

Anomaly labels are stored separately from the data values. Let’s load the train and test datasets and label the values with pandas:
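
A sketch of the loading step (file paths and the label-file layout follow the public NAB repository, so adjust them to your local copy):

import json
import pandas as pd

train_path = 'nab/data/realAWSCloudwatch/rds_cpu_utilization_e47b3b.csv'
test_path = 'nab/data/realAWSCloudwatch/rds_cpu_utilization_cc0c53.csv'
labels_path = 'nab/labels/combined_labels.json'

train_df = pd.read_csv(train_path, parse_dates=['timestamp'])
test_df = pd.read_csv(test_path, parse_dates=['timestamp'])

with open(labels_path) as f:
    labels = json.load(f)  # maps a data file to its list of anomaly timestamps

def mark_anomalies(df, key):
    anomalies = pd.to_datetime(labels[key])
    df['anomaly'] = df['timestamp'].isin(anomalies).astype(int)
    return df

train_df = mark_anomalies(train_df, 'realAWSCloudwatch/rds_cpu_utilization_e47b3b.csv')
test_df = mark_anomalies(test_df, 'realAWSCloudwatch/rds_cpu_utilization_cc0c53.csv')

train_df.head()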

Check the dataset head:

1

As we can see, it contains a timestamp, a CPU utilization value, and a label noting whether the value is an anomaly.

The next step is a visualization of the dataset with pyplot, which requires converting timestamps to time epochs:

When plotting the data, we mark anomalies with green dots:
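
For example (a minimal version; the styling of the original plots may differ):

import matplotlib.pyplot as plt

def plot_dataset(df, title):
    # pyplot works with numbers, so convert the timestamps to epochs first
    epochs = df['timestamp'].astype('int64')
    plt.figure(figsize=(14, 4))
    plt.plot(epochs, df['value'], label='CPU utilization')
    anomalies = df[df['anomaly'] == 1]
    plt.scatter(anomalies['timestamp'].astype('int64'), anomalies['value'],
                color='green', label='labeled anomaly')
    plt.title(title)
    plt.legend()
    plt.show()

plot_dataset(train_df, 'Training dataset')
plot_dataset(test_df, 'Test dataset')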

The visualizations of the training and test datasets look like this:

2

3

Preparing the Dataset

There is one thing that played an important role in dataset preparation – masking of labeled anomalies.

We started by training our LSTM neural network on the initial dataset. This resulted in a model able to predict, at best, one anomaly out of the 2 labeled in the test dataset. Then, taking into account that we have a small dataset with limited anomalies, we decided to test whether training on a dataset that contains no anomalies would improve results.

We took the approach of masking the anomalies in the original dataset – simply putting the previous non-anomalous value in place of each anomaly. After training the model on the dataset with masked anomalies, we were able to get both anomalies predicted. However, this came at the expense of an additional false positive prediction.

Nevertheless, the F1 score is higher for the second case, so the final implementation should take into account the preferred trade-off. In practice this can have a significant impact – depending on your case, either missing an anomaly or raising a false positive can be quite expensive.

Let's prepare the data for machine learning processing. The training set contains anomalies, as described above, replaced with non-anomalous values. The simplest way is to use the pandas 'fillna' method with the 'ffill' param to replace anomalous values with their neighbors:
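
Along these lines (column names follow the loading sketch above):

# blank out the labeled anomalies, then forward-fill from the previous value
train_df.loc[train_df['anomaly'] == 1, 'value'] = None
train_df['value'] = train_df['value'].fillna(method='ffill')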

The next step is scaling the dataset values, which is highly important. We use the scikit-learn StandardScaler to scale the input data and pandas to select features from the dataset.

The only feature we use is the CPU utilization value. We tried extracting some additional time-based features to increase the output performance – for example, a weekday or a day/night feature – however, we didn't find any useful patterns this way.

Let’s prepare our training and validation datasets:
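
A minimal version of that preparation (the 90/10 split ratio is an assumption for illustration):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
training_scaled = scaler.fit_transform(train_df[['value']].values)
test_scaled = scaler.transform(test_df[['value']].values)

# hold out part of the training data for validation
split = int(len(training_scaled) * 0.9)
train_set, validation_set = training_scaled[:split], training_scaled[split:]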

Choosing a Model

Let’s define the neural network model for our autoencoder:
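
A minimal sketch of such a model in MXNet Gluon (the layer size is illustrative, not necessarily the one from the final experiments):

import mxnet as mx
from mxnet import gluon

model = gluon.nn.Sequential()
with model.name_scope():
    model.add(gluon.rnn.LSTM(64))                    # LSTM layer with 64 output units
    model.add(gluon.nn.Dense(1, activation='tanh'))  # reconstruct the single input feature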

There is a lot happening in this small piece of code. Let’s review it line-by-line:

  • gluon.nn.Sequential() stacks blocks sequentially
  • model.add – adds a block to the top of the stack
  • gluon.rnn.LSTM(n) – LSTM layer with n-output dimensionality. In our situation, we used an LSTM layer without dropout at the layer output. Commonly, dropout layers are used for preventing overfitting of the model; dropout simply zeroes the layer inputs with a given probability
  • gluon.nn.Dense(n, activation='tanh') – densely-connected NN layer with n-output dimensionality and hyperbolic tangent activation function

We did a few experiments with the neural network architecture and hyperparameters, and the LSTM layer followed by one Dense layer with ‘tanh’ activation function worked best in our case. You can check the comparison table with corresponding F1 scores at the end of the article.

Training & Evaluation

The next step is to choose the loss function:

'L2Loss' is chosen as the loss function because the trained neural network is an autoencoder, so we need to calculate the difference between the input and the output values. We will calculate MSE after each training epoch of our model to visualize this process. You can find the whole list of other available loss functions here.
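
In Gluon this is a one-liner:

loss_function = gluon.loss.L2Loss()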

Let’s use CPU for training the model. It’s possible to use GPU if you have an NVIDIA graphics card and it supports CUDA. With MXNet this requires just a context preparation:
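
For example:

ctx = mx.cpu()
# ctx = mx.gpu()  # if an NVIDIA card with CUDA is available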

For the training process, we should load data in batches. MXNet Gluon DataLoader is a good helper in this process:
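
Something along these lines (using the arrays prepared earlier):

batch_size = 48
train_data = gluon.data.DataLoader(
    gluon.data.ArrayDataset(train_set.astype('float32')),
    batch_size=batch_size, shuffle=False)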

The batch size value is important. Small batches increase training time. By experimenting with batch size, we found that 48 works well: the training process lasts a short amount of time, and the batch is not so big that it decreases the F1 score. Since the sample rate of the data is 5 minutes, a batch of 48 values corresponds to a period of 4 hours.

The next step is to define the parameter initializer and the model training algorithm:

We use the Xavier weights initializer, as it is designed to keep the scale of gradients roughly the same in all layers. We use the 'sgd' optimizer, and the learning rate is 0.01. These values look optimal in this case – the steps aren't too small, so optimization doesn't take too long, and they aren't too big, so SGD doesn't overshoot the minimum of the loss function.
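
Which in code might look like:

model.collect_params().initialize(mx.init.Xavier(), ctx=ctx)
trainer = gluon.Trainer(model.collect_params(), 'sgd', {'learning_rate': 0.01})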

Let’s run the training loop and plot MSEs:
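
A simplified version of the loop (treating each 48-value batch as one sequence of shape (T, N, C) = (48, 1, 1), which is an assumption about the data layout):

epochs = 15
mse_per_epoch = []
for epoch in range(epochs):
    epoch_loss = 0.0
    batches = 0
    for batch in train_data:
        data = batch.as_in_context(ctx).reshape((-1, 1, 1))
        with mx.autograd.record():
            output = model(data)                               # shape (T, 1)
            loss = loss_function(output, data.reshape((-1, 1)))
        loss.backward()
        trainer.step(data.shape[0])
        epoch_loss += loss.mean().asscalar()
        batches += 1
    mse_per_epoch.append(epoch_loss / batches)
    print('epoch %d, MSE %.6f' % (epoch + 1, mse_per_epoch[-1]))

plt.plot(mse_per_epoch)
plt.xlabel('epoch')
plt.ylabel('MSE')
plt.show()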

The results of the training process:
4

As you can see from this plot, 15 training epochs are spot-on for the training process as we don’t want to get an undertrained or overtrained neural network.

Prediction

When using an autoencoder, for each pair of input value and predicted output value we have a reconstruction error. We can find the reconstruction errors for the training dataset. We then say an input value is anomalous if its reconstruction error deviates quite far from the errors seen over the whole training dataset. The 3-sigma approach works fine for this case – if the reconstruction error is higher than three standard deviations above the mean, it is marked as an anomaly.

Let’s see the results for the model and visualize predicted anomalies.

The threshold calculation:
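
Continuing the sketches above, the threshold could be derived like this:

from mxnet import nd

train_output = model(nd.array(train_set, ctx=ctx).reshape((-1, 1, 1)))
train_errors = (train_output.asnumpy().reshape(-1) - train_set.reshape(-1)) ** 2
threshold = train_errors.mean() + 3 * train_errors.std()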

Let’s check the predictions on the test dataset:
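
And the reconstruction errors on the test set:

test_output = model(nd.array(test_scaled, ctx=ctx).reshape((-1, 1, 1)))
test_errors = (test_output.asnumpy().reshape(-1) - test_scaled.reshape(-1)) ** 2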

Filtering anomalies from predictions using the 3-sigma threshold:
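
For example:

test_df['predicted_anomaly'] = (test_errors > threshold).astype(int)
predicted = test_df[test_df['predicted_anomaly'] == 1]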

Plotting the result:

The labeled anomalies from the NAB dataset are marked with green, and the predicted anomalies are marked with red:

5

As you can see from the plot, this simple model predicted 3 anomalies where 2 were labeled. The confusion matrix for the results looks like the following:

6

Taking into account the size of the available dataset and the time spent on coding, this is a pretty good result. According to the NAB whitepaper dataset scoring, you get +1 point for a TP and -0.22 points for an FP. The final score is then normalized into the range of 0 to 100. Taking our simple solution that predicted 2 TPs and 1 FP, we get 1.78 points out of 2. When scaled, this corresponds to a score of 89. Compared to the NAB scoreboard, this looks like a pretty strong result. However, we cannot really compare our score to it: the key difference is that NAB evaluates algorithm performance on a large number of datasets and does not take into account the nature of the data, while we have built a model for only one anomaly prediction case. So this is not an apples-to-apples comparison.

The F1-score calculation:
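
Using scikit-learn:

from sklearn.metrics import f1_score

f1 = f1_score(test_df['anomaly'], test_df['predicted_anomaly'])
print(f1)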

The final F1 score is 0.8. That's not an ideal value, but it's good enough as a starting point for prediction and analysis of possible anomalous cases in the system.

Conclusion

In this article, we have discussed a simple solution for handling anomaly detection in time series data. We have passed through the standard steps of a data science process – preparing the dataset, choosing a model, training, evaluation, hyperparameter tuning and prediction. In our case, training the model on a pre-processed dataset that has no anomalies made a great impact on the F1 score. As a result, we trained a model which works quite well, given the amount of input data and effort spent. As you can see, MXNet provides an easy-to-use and well-documented API to work with such a complex neural network. The code from this article is available at GitHub.

References

  1. MXNet as simple as possible
  2. Pizza type recognition using MXNet and TensorFlow
  3. Experiments with MxNet and Handwritten Digits recognition

Appendix: Experiments with network architecture and hyperparameters tuning

In this section, we have collected the results of the experiments we performed during network design and hyperparameter tuning. The experiments are listed in chronological order, and in every experiment we changed just a single parameter at a time, so any possible correlation between parameters is not taken into account. The parameters that worked best are marked in green.

7

About the Authors

Denys is a senior software engineer at Lohika. Passionate about web development, well-designed architecture, data collecting, and processing.

Serhiy is a solutions architect and technical consultant at Lohika. He likes sharing his experience in the development of distributed applications, microservices architectures, and large-scale data processing solutions in different domains.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Athena Savior of Adhoc Analytics

06. December 2018

Introduction

Companies strive to attract customers by creating an excellent product with many features. Previously, taking a product from idea to reality took months to years; nowadays, it can take a matter of weeks. Companies can fail fast, learn, and move ahead to make it better. Data analytics often takes a back seat, becoming a bottleneck.

Some of the problems that cause bottlenecks are:

  • schema differences,
  • missing data,
  • security restrictions,
  • encryption

AWS Athena, an ad-hoc query tool, can alleviate these problems. Its main compelling characteristics include:

  • Serverless
  • Query Ease
  • Cost ($5 per TB of data scanned)
  • Availability
  • Durability
  • Performance
  • Security

Behind the scenes, Athena uses Hive and Presto for analytical queries of any size over data stored in S3. Athena processes structured, semi-structured and unstructured data sets including CSV, JSON, ORC, Avro, and Parquet. Athena drivers are available in multiple languages, including Java and Python, for querying it programmatically.

Let’s examine a few different use cases with Athena.

Use cases

Case 1: Storage Analysis

Let us say you have a service where you store user data such as documents, contacts, videos, and images. You have an accounting system in a relational database, whereas user resources live in S3, orchestrated through metadata housed in DynamoDB. How do we get ad-hoc storage statistics for individual customers as well as across the entire customer base, over various parameters and events?

Steps:

  • Create an AWS Data Pipeline to export the relational database data to S3
    • Data persisted in S3 as CSV
  • Create an AWS Data Pipeline to export the DynamoDB data to S3
    • Data persisted in S3 as JSON strings
  • Create a database in Athena
  • Create tables for the data sources
  • Run queries (see the sketch below)
  • Clean up the resources
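
A rough sketch of the Athena part using boto3 (the region, database, bucket, table, and column names are illustrative assumptions, not the author's actual schema):

import boto3

athena = boto3.client('athena', region_name='us-east-1')

def run(query, database='default'):
    # results land in the S3 output location; poll get_query_execution to wait for completion
    return athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={'OutputLocation': 's3://my-athena-results/'})

run('CREATE DATABASE IF NOT EXISTS storage_analysis')
run("""
    CREATE EXTERNAL TABLE IF NOT EXISTS accounts (
        account_id string,
        plan string,
        quota_gb int
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-export-bucket/accounts/'
""", database='storage_analysis')
run('SELECT plan, count(*) AS users FROM accounts GROUP BY plan', database='storage_analysis')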

Figure 1: Data Ingestion

Figure 2: Schema and Queries

Case 2: Bucket Inventory

Why is S3 usage growing out of sync with user base changes? Do you know how your S3 bucket is being used? How many objects does it store? How many are duplicate files? How many were deleted?

S3 Bucket Inventory helps to manage the storage and provides audit reports on the replication and encryption status of the objects in the bucket. Let us create a bucket, enable Inventory, and perform the following steps.

Steps:

  • Go to S3 bucket
  • Create buckets vijay-yelanji-insights for objects and vijay-yelanji-inventory for inventory.
  • Enable inventory
    • AWS generates a report into the inventory bucket at regular intervals, as per the scheduled job.
  • Upload files
  • Delete files
  • Upload same files to check duplicates
  • Create Athena table pointing to vijay-yelanji-inventory
  • Run queries, as shown in Figure 5 and the sketch below, to understand S3 usage and take the necessary actions to reduce cost.
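
For example, a duplicate-file check can follow the same pattern as in Case 1, with a query over the inventory table (the table and column names below are assumptions based on the standard S3 Inventory report fields):

run("""
    SELECT e_tag, count(*) AS copies
    FROM s3_inventory
    GROUP BY e_tag
    HAVING count(*) > 1
""", database='storage_analysis')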

Figure 3: S3 Inventory

Figure 4: Bucket Insights


Figure 5: Bucket Insight Queries

Case 3: Event comparison

Let's say you are sending a stream of events to two different targets after pre-processing the events very differently, and you are experiencing discrepancies in the data. How do you fix the event counts? What if events or data are missing? How do you resolve inconsistencies and quality issues?

If the data is stored in S3 and the data format is supported by Athena, you can expose it as tables and identify the gaps, as shown in Figure 7.

Figure 6: Event Comparison

Steps:

  • Data ingested into S3 as snappy or JSON and forwarded to the legacy system of record
  • Data ingested into S3 as CSV (columns separated by '|') and forwarded to a new system of record
    • The Event Forwarder system consumes the source event and modifies the data before pushing it into the multiple targets.
  • Create Athena tables from the legacy source data and compare them with the problematic event forwarder data.


Figure 7: Comparison Inference

 

Case 4: API Call Analysis

If you have not enabled CloudWatch or set up your own ELK stack, but need to analyze service patterns like total HTTP requests by type or 4XX and 5XX errors by call type, you can do so by enabling ELB access logs and reading them through Athena.

Figure 8: Calls Inference

Steps:

https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/access-log-collection.html

You can do the same on CloudTrail Logs with more information here:

https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html

 

Case 5: Python S3 Crawler

If you have tons of JSON data in S3 spread across directories and files and want to analyze the keys and their values, all you need to do is use Python libraries like PyAthena or JayDeBe: read the compressed snappy files (after unzipping them with SnZip), collect the keys into a Set data structure, and pass them as columns to Athena, as shown in Figure 10.

Figure 9: Event Crawling

Figure 10: Events to Athena
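
A minimal sketch of such a crawler (bucket, prefix, and table names are illustrative; this version assumes the objects have already been decompressed to plain JSON lines):

import json
import boto3
from pyathena import connect

s3 = boto3.client('s3')
columns = set()

# walk the bucket and collect every top-level key seen in the JSON documents
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-events-bucket', Prefix='events/'):
    for obj in page.get('Contents', []):
        body = s3.get_object(Bucket='my-events-bucket', Key=obj['Key'])['Body'].read()
        for line in body.splitlines():
            columns.update(json.loads(line).keys())

# the collected keys become the columns to query through Athena via PyAthena
cursor = connect(s3_staging_dir='s3://my-athena-results/', region_name='us-east-1').cursor()
cursor.execute('SELECT %s FROM events LIMIT 10' % ', '.join(sorted(columns)))
print(cursor.fetchall())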

Limitations

Athena has some limitations including:
  • Data must reside in S3.
  • To reduce the cost of the query and improve performance, data must be compressed, partitioned and converted to columnar formats.
  • User-defined functions, stored procedure, and many DDL are not supported.
  • If you are generating data continuously or have large data sets and want to get insights in real time or frequently, you should rely on analytical and visualization tools such as Redshift, Kinesis, EMR, Denodo, Spotfire and Tableau.
  • Check Athena FAQ to understand more about its benefits and limitations.

Summary

In this post, I shared how to leverage Athena to get analytics and minimize bottlenecks to product delivery. Be aware that some of the methods used were implemented when Athena was new; new tools may have changed how best to solve these use cases. Lately, Athena has been integrated with Glue for building, maintaining, and running ETL jobs and with QuickSight for visualization.

Reference

Athena documentation is at https://docs.aws.amazon.com/athena/latest/ug/what-is.html

About the Author

Vijay Yelanji (@VijayYelanji) is an architect at Asurion, working in San Mateo, CA. He has more than 20 years of experience across various domains, from cloud-enabled microservices supporting enterprise-level account, file, order, and subscription management systems, to WebSphere Integration Servers and Solutions, IBM Enterprise Storage Solutions, Informix databases, and 4GL tools.

At Asurion, he was instrumental in designing and developing a multi-tenant, multi-carrier, highly scalable backup and restore mobile application using various AWS services.

You can download the Asurion Memories application for free at 

Recently, Vijay presented the topic 'Logging in AWS' at an AWS Meetup in Mountain View, CA.

Many thanks to AnanthakrishnaChar, Kashyap and Cathy, Hui for their assistance in fine-tuning some of the use cases.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Exploring Concurrency in Python & AWS

04. December 2016

From Threads to Lambdas (and lambdas with threads)

Author: Mohit Chawla

Editors: Jesse Davis, Neil Millard

The scope of this article is to demonstrate multiple approaches to solve a seemingly simple problem of intra-S3 file transfers – using pure Python and a hybrid approach of Python and cloud-based constructs, specifically AWS Lambda – with a comparison of the two concurrency approaches.

Problem Background

The problem was to transfer 250 objects daily, each of size 600-800 MB, from one S3 bucket to another. In addition, an initial bulk backup of 1500 objects (6 months of data) had to be taken, totaling 1 TB.

Attempt 1

The easiest way to do this appears to be to loop over all the objects and transfer them one by one:
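
Something along these lines (bucket names are illustrative):

import boto3

s3 = boto3.resource('s3')
SOURCE_BUCKET = 'source-bucket'
DESTINATION_BUCKET = 'destination-bucket'

for obj in s3.Bucket(SOURCE_BUCKET).objects.all():
    s3.Object(DESTINATION_BUCKET, obj.key).copy_from(
        CopySource={'Bucket': SOURCE_BUCKET, 'Key': obj.key})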

This had a runtime of 1 hour 45 minutes. Oops.

Attempt 2

Let's use some threads!

Python offers multiple concurrency methods:

  • asyncio, based on event loops and asynchronous I/O.
  • concurrent.futures, which provides high level abstractions like ThreadPoolExecutor and ProcessPoolExecutor.
  • threading, which provides low level abstractions to build your own solution using threads, semaphores and locks.
  • multiprocessing, which is similar to threading, but for processes.

I used the concurrent.futures module, specifically the ThreadPoolExecutor, which seems to be a good fit for I/O tasks.

Note about the GIL:

Python implements a GIL (Global Interpreter Lock) which limits only a single thread to run at a time, inside a single Python interpreter. This is not a limitation for an I/O intensive task, such as the one being discussed in this article. For more details about how it works, see http://www.dabeaz.com/GIL/.

Here’s the code when using the ThreadPoolExecutor:
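
A sketch of that version, reusing the same copy logic (the number of workers is an arbitrary choice for illustration):

from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.resource('s3')
SOURCE_BUCKET = 'source-bucket'
DESTINATION_BUCKET = 'destination-bucket'

def copy(key):
    s3.Object(DESTINATION_BUCKET, key).copy_from(
        CopySource={'Bucket': SOURCE_BUCKET, 'Key': key})

keys = [obj.key for obj in s3.Bucket(SOURCE_BUCKET).objects.all()]
with ThreadPoolExecutor(max_workers=20) as executor:
    executor.map(copy, keys)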

This code took 1 minute 40 seconds to execute, woo !

Concurrency with Lambda

I was happy with this implementation, until, at an AWS meetup, there was a discussion about using AWS Lambda and SNS for the same thing, and I thought of trying that out.

AWS Lambda is a compute service that lets you run code without provisioning or managing servers. It can be combined with AWS SNS, a push notification service that can deliver and fan out messages to several endpoints, including email, HTTP and Lambda, which allows the decoupling of components.

To use Lambda and SNS for this problem, a simple pipeline was devised: One Lambda function publishes object names as messages to SNS and another Lambda function is subscribed to SNS for copying the objects.
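
The copying side could be a Lambda handler along these lines (bucket names and the handler name are illustrative):

import boto3

s3 = boto3.resource('s3')
SOURCE_BUCKET = 'source-bucket'
DESTINATION_BUCKET = 'destination-bucket'

def handler(event, context):
    # each SNS record carries one object key to copy
    for record in event['Records']:
        key = record['Sns']['Message']
        s3.Object(DESTINATION_BUCKET, key).copy_from(
            CopySource={'Bucket': SOURCE_BUCKET, 'Key': key})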

The following piece of code publishes names of objects to copy to an SNS topic. Note the use of threads to make this faster.
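
Roughly (the topic ARN is a placeholder):

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.resource('s3')
sns = boto3.client('sns')
TOPIC_ARN = 'arn:aws:sns:eu-west-1:123456789012:copy-objects'
SOURCE_BUCKET = 'source-bucket'

def publish(key):
    sns.publish(TopicArn=TOPIC_ARN, Message=key)

def handler(event, context):
    keys = [obj.key for obj in s3.Bucket(SOURCE_BUCKET).objects.all()]
    with ThreadPoolExecutor(max_workers=20) as executor:
        executor.map(publish, keys)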

Yep, that’s all the code.

Now, you may be asking yourself: how is the copy operation actually concurrent? The unit of concurrency in AWS Lambda is the function invocation. For each published message, the Lambda function is invoked, which means that for multiple messages published in parallel, an equivalent number of invocations will be made of the Lambda function. According to AWS, that number for stream-based sources is given by:
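
concurrent executions ≈ events (or requests) per second × average function duration in seconds

(Approximate formula as quoted in the AWS documentation at the time.)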

By default, this is limited to 100 concurrent executions, but can be raised on request.

The execution time for the above code was 2 minutes 40 seconds. This is higher than the pure Python approach, partly because the invocations were throttled by AWS.

I hope you enjoyed reading this article, and if you are an AWS or Python user, hopefully this example will be useful for your own projects.

Note – I gave this as a talk at PyUnconf ’16 in Hamburg, you can see the slides at https://speakerdeck.com/alcy/exploring-concurrency-in-python-and-aws.

About the Author:

Mohit Chawla is a systems engineer, living in Hamburg. He has contributed to open source projects over the last seven years, and has a few projects of his own. Apart from systems engineering, he has a strong interest in data visualization.