Athena: Savior of Ad-hoc Analytics

06. December 2018

Introduction

Companies strive to attract customers by creating an excellent product with many features. Taking a product from idea to reality used to take months or years; nowadays it can take a matter of weeks. Companies can fail fast, learn, and move ahead to make the product better. Data analytics often takes a back seat and becomes a bottleneck.

Some of the problems that cause these bottlenecks are:

  • schema differences
  • missing data
  • security restrictions
  • encryption

AWS Athena, an ad-hoc query service, can alleviate these problems. Its most compelling characteristics include:

  • Serverless
  • Query Ease
  • Cost ($5 per TB of data scanned)
  • Availability
  • Durability
  • Performance
  • Security

Behind the scenes, Athena uses Hive and Presto to run analytical queries of any size over data stored in S3. Athena processes structured, semi-structured, and unstructured data sets, including CSV, JSON, ORC, Avro, and Parquet. Athena drivers are available for multiple languages, including Java and Python.
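For example, a minimal sketch of connecting from Python with the PyAthena driver; the staging bucket and region below are placeholders rather than values from this post:

    from pyathena import connect

    # Athena writes query results to a staging location in S3; the bucket name
    # and region here are placeholders for your own setup.
    cursor = connect(
        s3_staging_dir="s3://my-athena-query-results/",
        region_name="us-east-1",
    ).cursor()

    cursor.execute("SHOW DATABASES")
    print(cursor.fetchall())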

Let’s examine a few different use cases with Athena.

Use cases

Case 1: Storage Analysis

Let us say you have a service where you store user data such as documents, contacts, videos, and images. You have an accounting system in a relational database, whereas user resources live in S3, orchestrated through metadata housed in DynamoDB. How do you get ad-hoc storage statistics for individual users as well as for the entire customer base, across various parameters and events?

Steps:

  • Create an AWS Data Pipeline to export the relational database data to S3
    • Data is persisted in S3 as CSV
  • Create an AWS Data Pipeline to export the DynamoDB data to S3
    • Data is persisted in S3 as JSON strings
  • Create a database in Athena
  • Create tables for the data sources
  • Run queries (see the sketch after this list)
  • Clean up the resources
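The actual schema and queries are shown in Figure 2. As a hedged sketch of what the Athena side of these steps could look like, with table names, column names, and bucket locations that are illustrative only:

    from pyathena import connect

    cursor = connect(
        s3_staging_dir="s3://my-athena-query-results/",  # placeholder staging bucket
        region_name="us-east-1",
    ).cursor()

    # Create a database to hold the ad-hoc tables.
    cursor.execute("CREATE DATABASE IF NOT EXISTS storage_analysis")

    # Table over the relational-database export (CSV files in S3).
    cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS storage_analysis.accounts (
        account_id string,
        plan string,
        created_at string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-rdbms-export/accounts/'
    """)

    # Table over the DynamoDB export (JSON strings in S3).
    cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS storage_analysis.resources (
        account_id string,
        resource_type string,
        size_bytes bigint
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://my-dynamodb-export/resources/'
    """)

    # Example ad-hoc query: storage per account and resource type.
    cursor.execute("""
    SELECT account_id, resource_type, SUM(size_bytes) AS total_bytes
    FROM storage_analysis.resources
    GROUP BY account_id, resource_type
    ORDER BY total_bytes DESC
    """)
    for row in cursor.fetchall():
        print(row)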

Figure 1: Data Ingestion

Figure 2: Schema and Queries

Case 2: Bucket Inventory

Why is S3 usage growing out of sync with user-base changes? Do you know how your S3 bucket is being used? How many objects does it store? How many are duplicates? How many have been deleted?

S3 Inventory helps manage storage and provides audit reports on the replication and encryption status of the objects in a bucket. Let us create a bucket, enable Inventory, and perform the following steps.

Steps:

  • Go to the S3 console
  • Create the buckets vijay-yelanji-insights for objects and vijay-yelanji-inventory for the inventory reports.
  • Enable inventory
    • AWS generates a report into the inventory bucket at regular intervals, per the configured schedule.
  • Upload files
  • Delete files
  • Upload the same files again to check for duplicates
  • Create an Athena table pointing to vijay-yelanji-inventory (see the sketch after this list)
  • Run queries as shown in Figure 5 to understand S3 usage and take actions to reduce cost.
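A hedged sketch of the inventory table and a duplicate-detection query, assuming a CSV-formatted inventory report with only a few of the optional fields; the column list and the LOCATION path must be adjusted to match your own inventory configuration (the actual queries are in Figure 5):

    from pyathena import connect

    cursor = connect(
        s3_staging_dir="s3://my-athena-query-results/",  # placeholder staging bucket
        region_name="us-east-1",
    ).cursor()

    # Point the table at the data/ prefix of the inventory report; this path
    # is illustrative.
    cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS bucket_inventory (
        bucket string,
        key string,
        size bigint,
        last_modified_date string,
        e_tag string,
        storage_class string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://vijay-yelanji-inventory/vijay-yelanji-insights/inventory-report/data/'
    """)

    # Possible duplicates: the same ETag appearing under more than one key.
    cursor.execute("""
    SELECT e_tag, COUNT(*) AS copies, SUM(size) AS total_bytes
    FROM bucket_inventory
    GROUP BY e_tag
    HAVING COUNT(*) > 1
    ORDER BY total_bytes DESC
    """)
    print(cursor.fetchall())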

Figure 3: S3 Inventory

Figure 4: Bucket Insights


Figure 5: Bucket Insight Queries

Case 3: Event Comparison

Let’s say you are sending a stream of events to two different targets after pre-processing the events very differently, and you are experiencing discrepancies in the data. How do you reconcile the event counts? What if events or data are missing? How do you resolve inconsistencies and quality issues?

If the data is stored in S3 and its format is supported by Athena, you can expose both data sets as tables and identify the gaps, as shown in Figure 7.

Figure 6: Event Comparison

Steps:

  • Data is ingested into S3 in Snappy or JSON format and forwarded to the legacy system of record
  • Data is ingested into S3 as CSV (columns separated by ‘|’) and forwarded to the new system of record
    • The event-forwarder system consumes the source event and modifies the data before pushing it to multiple targets.
  • Create Athena tables from the legacy source data and the problematic event-forwarder data, then compare them (see the sketch after this list)
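A hedged sketch of the comparison itself, assuming both feeds have already been exposed as Athena tables joined on an event identifier; the table and column names are illustrative:

    from pyathena import connect

    cursor = connect(
        s3_staging_dir="s3://my-athena-query-results/",  # placeholder staging bucket
        region_name="us-east-1",
    ).cursor()

    # Events present in the legacy system of record but missing from the new one.
    cursor.execute("""
    SELECT l.event_id, l.event_type, l.event_time
    FROM legacy_events l
    LEFT JOIN forwarded_events f ON l.event_id = f.event_id
    WHERE f.event_id IS NULL
    """)
    missing = cursor.fetchall()
    print(f"{len(missing)} events are missing from the new system of record")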


Figure 7: Comparison Inference

 

Case 4: API Call Analysis

If you have not enabled CloudWatch or set up your own ELK stack, but need to analyze service patterns such as total HTTP requests by type or 4XX and 5XX errors by call type, you can do so by enabling ELB access logs and reading them through Athena.

Figure 8: Calls Inference

Steps:

https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/access-log-collection.html
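Once the ELB access logs are exposed as an Athena table (the Athena user guide shows how for Classic Load Balancer logs), queries like the following can break traffic down by method and response class. The table and column names here (elb_logs, request_verb, elb_response_code) follow that guide but are assumptions, so adjust them to your own table:

    from pyathena import connect

    cursor = connect(
        s3_staging_dir="s3://my-athena-query-results/",  # placeholder staging bucket
        region_name="us-east-1",
    ).cursor()

    # HTTP requests by method and response class (2XX, 4XX, 5XX, ...).
    cursor.execute("""
    SELECT request_verb,
           substr(elb_response_code, 1, 1) || 'XX' AS response_class,
           COUNT(*) AS requests
    FROM elb_logs
    GROUP BY 1, 2
    ORDER BY requests DESC
    """)
    for row in cursor.fetchall():
        print(row)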

You can do the same with CloudTrail logs; more information is available here:

https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html

 

Case 5: Python S3 Crawler

If you have tons of JSON data in S3 spread across directories and files and want to analyze the keys and their values, you can crawl it with Python: read the Snappy-compressed files (decompressing them with a tool such as snzip), collect the distinct keys into a set, and then use a driver such as PyAthena or JayDeBeApi to create an Athena table with those keys as columns, as shown in Figure 10.
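A minimal sketch of the crawling idea, assuming the objects have already been decompressed to plain JSON documents; the bucket and prefix are placeholders and error handling is omitted:

    import json

    import boto3

    s3 = boto3.client("s3")
    keys_seen = set()

    # Walk every object under the prefix and collect the distinct JSON keys.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="my-event-bucket", Prefix="events/"):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket="my-event-bucket", Key=obj["Key"])["Body"].read()
            document = json.loads(body)
            keys_seen.update(document.keys())

    # The collected keys can then become the column list of a CREATE EXTERNAL TABLE
    # statement issued through PyAthena, as in the earlier sketches.
    print(sorted(keys_seen))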

Figure 9: Event Crawling

Figure 10: Events to Athena

Limitations

Athena has some limitations including:
  • Data must reside in S3.
  • To reduce query cost and improve performance, data should be compressed, partitioned, and converted to columnar formats (see the sketch after this list).
  • User-defined functions, stored procedures, and many DDL statements are not supported.
  • If you are generating data continuously or have large data sets and want insights in real time or at high frequency, you should rely on analytical and visualization tools such as Redshift, Kinesis, EMR, Denodo, Spotfire, and Tableau.
  • Check the Athena FAQ to understand more about its benefits and limitations.
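As a sketch of the compression/partitioning/columnar point, one option (where your Athena version supports CREATE TABLE AS SELECT) is to rewrite a raw table as partitioned Parquet; the table, column, and bucket names below are illustrative:

    from pyathena import connect

    cursor = connect(
        s3_staging_dir="s3://my-athena-query-results/",  # placeholder staging bucket
        region_name="us-east-1",
    ).cursor()

    # Rewrite a raw table as partitioned Parquet so future queries scan less data.
    cursor.execute("""
    CREATE TABLE events_parquet
    WITH (
        format = 'PARQUET',
        external_location = 's3://my-curated-bucket/events-parquet/',
        partitioned_by = ARRAY['dt']
    ) AS
    SELECT event_id, event_type, size_bytes, dt
    FROM events_raw
    """)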

Summary

In this post, I shared how to leverage Athena to get analytics and minimize bottlenecks to product delivery. Be aware that some of the methods used were implemented when Athena was new, and newer tools may have changed how best to solve these use cases. More recently, Athena has been integrated with AWS Glue for building, maintaining, and running ETL jobs, and with QuickSight for visualization.

Reference

Athena documentation is at https://docs.aws.amazon.com/athena/latest/ug/what-is.html

About the Author

Vijay Yelanji (@VijayYelanji) is an architect at Asurion, based in San Mateo, CA, with more than 20 years of experience across various domains: cloud-enabled microservices supporting enterprise-level account, file, order, and subscription management systems; WebSphere integration servers and solutions; IBM enterprise storage solutions; Informix databases; and 4GL tools.

At Asurion, he was instrumental in designing and developing a multi-tenant, multi-carrier, highly scalable backup-and-restore mobile application using various AWS services.

You can download the Asurion Memories application for free at 

Recently, Vijay presented the topic ‘Logging in AWS’ at an AWS Meetup in Mountain View, CA.

Many thanks to AnanthakrishnaChar, Kashyap and Cathy, Hui for their assistance in fine-tuning some of the use cases.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Exploring Concurrency in Python & AWS

04. December 2016

From Threads to Lambdas (and lambdas with threads)

Author: Mohit Chawla

Editors: Jesse Davis, Neil Millard

The scope of this article is to demonstrate multiple approaches to a seemingly simple problem, intra-S3 file transfers: a pure Python approach and a hybrid approach of Python and cloud-based constructs, specifically AWS Lambda, with a comparison of the two concurrency approaches.

Problem Background

The problem was to transfer 250 objects daily, each of size 600-800 MB, from one S3 bucket to another. In addition, an initial bulk backup of 1500 objects (6 months of data) had to be taken, totaling 1 TB.

Attempt 1

The easiest way to do this appears to be to loop over all the objects and transfer them one by one.
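A minimal sketch of that sequential approach, assuming boto3 and placeholder bucket names:

    import boto3

    s3 = boto3.client("s3")
    SOURCE_BUCKET = "source-bucket"            # placeholder
    DESTINATION_BUCKET = "destination-bucket"  # placeholder

    # List every object in the source bucket and copy it, one at a time.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET):
        for obj in page.get("Contents", []):
            s3.copy_object(
                Bucket=DESTINATION_BUCKET,
                Key=obj["Key"],
                CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
            )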

This had a runtime of 1 hour 45 minutes. Oops.

Attempt 2

Let's use some threads!

Python offers multiple concurrency methods:

  • asyncio, based on event loops and asynchronous I/O.
  • concurrent.futures, which provides high level abstractions like ThreadPoolExecutor and ProcessPoolExecutor.
  • threading, which provides low level abstractions to build your own solution using threads, semaphores and locks.
  • multiprocessing, which is similar to threading, but for processes.

I used the concurrent.futures module, specifically the ThreadPoolExecutor, which seems to be a good fit for I/O tasks.

Note about the GIL:

Python implements a GIL (Global Interpreter Lock), which allows only a single thread to run at a time inside a single Python interpreter. This is not a limitation for an I/O-intensive task such as the one discussed in this article. For more details about how it works, see http://www.dabeaz.com/GIL/.

Here’s the code when using the ThreadPoolExecutor:
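A hedged sketch of that approach, assuming boto3, the same placeholder bucket names as above, and an illustrative worker count:

    from concurrent.futures import ThreadPoolExecutor

    import boto3

    s3 = boto3.client("s3")  # the low-level client is safe to share across threads
    SOURCE_BUCKET = "source-bucket"            # placeholder
    DESTINATION_BUCKET = "destination-bucket"  # placeholder


    def copy_object(key):
        # Server-side copy of a single object between buckets.
        s3.copy_object(
            Bucket=DESTINATION_BUCKET,
            Key=key,
            CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
        )


    # Collect the keys, then fan the copies out to a pool of threads.
    paginator = s3.get_paginator("list_objects_v2")
    keys = [
        obj["Key"]
        for page in paginator.paginate(Bucket=SOURCE_BUCKET)
        for obj in page.get("Contents", [])
    ]

    with ThreadPoolExecutor(max_workers=20) as executor:
        list(executor.map(copy_object, keys))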

This code took 1 minute 40 seconds to execute. Woo!

Concurrency with Lambda

I was happy with this implementation until, at an AWS meetup, there was a discussion about using AWS Lambda and SNS for the same task, and I thought of trying that out.

AWS Lambda is a compute service that lets you run code without provisioning or managing servers. It can be combined with AWS SNS, a push-based messaging service that can deliver and fan out messages to several targets, including email, HTTP endpoints, and Lambda, which allows the decoupling of components.

To use Lambda and SNS for this problem, a simple pipeline was devised: One Lambda function publishes object names as messages to SNS and another Lambda function is subscribed to SNS for copying the objects.

The following piece of code publishes names of objects to copy to an SNS topic. Note the use of threads to make this faster.
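A hedged sketch of such a publisher, assuming boto3, a pre-created SNS topic (the ARN below is a placeholder), and the same thread-pool idea as before:

    from concurrent.futures import ThreadPoolExecutor

    import boto3

    s3 = boto3.client("s3")
    sns = boto3.client("sns")
    SOURCE_BUCKET = "source-bucket"                           # placeholder
    TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:s3-copy"  # placeholder


    def publish_key(key):
        # One message per object key; each message triggers one Lambda invocation.
        sns.publish(TopicArn=TOPIC_ARN, Message=key)


    paginator = s3.get_paginator("list_objects_v2")
    keys = [
        obj["Key"]
        for page in paginator.paginate(Bucket=SOURCE_BUCKET)
        for obj in page.get("Contents", [])
    ]

    # Publishing is itself I/O bound, so a thread pool speeds it up too.
    with ThreadPoolExecutor(max_workers=20) as executor:
        list(executor.map(publish_key, keys))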

Yep, that’s all the code.
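For completeness, a hedged sketch of the other half, the Lambda function subscribed to the topic that performs the actual copy; the handler follows the standard SNS event shape and the bucket names are placeholders:

    import boto3

    s3 = boto3.client("s3")
    SOURCE_BUCKET = "source-bucket"            # placeholder
    DESTINATION_BUCKET = "destination-bucket"  # placeholder


    def handler(event, context):
        # Each SNS record carries one object key published by the code above.
        for record in event["Records"]:
            key = record["Sns"]["Message"]
            s3.copy_object(
                Bucket=DESTINATION_BUCKET,
                Key=key,
                CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
            )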

Now, you may be asking yourself: how is the copy operation actually concurrent?
The unit of concurrency in AWS Lambda is the function invocation. For each published message, the Lambda function is invoked, which means that for multiple messages published in parallel, an equivalent number of invocations are made. For event sources that are not stream-based, such as SNS, AWS estimates the number of concurrent executions as roughly the incoming event rate multiplied by the average function duration.

By default, this is limited to 100 concurrent executions, but can be raised on request.

The execution time for the above code was 2 minutes 40 seconds. This is higher than the pure Python approach, partly because the invocations were throttled by AWS.

I hope you enjoyed reading this article, and if you are an AWS or Python user, hopefully this example will be useful for your own projects.

Note – I gave this as a talk at PyUnconf ’16 in Hamburg, you can see the slides at https://speakerdeck.com/alcy/exploring-concurrency-in-python-and-aws.

About the Author:

Mohit Chawla is a systems engineer, living in Hamburg. He has contributed to open source projects over the last seven years, and has a few projects of his own. Apart from systems engineering, he has a strong interest in data visualization.