data pipeline – aws advent

Athena Savior of Adhoc Analytics

by Vijay Yelanji 06. December 2018 2018 0

Introduction

Companies strive to attract customers by creating an excellent product with many features. Previously, product to reality took months to years. Nowadays, product to reality can take a matter of weeks. Companies can fail-fast, learn and move ahead to make it better. Data analytics often takes a back seat becoming a bottleneck.

Some of the problems that cause bottlenecks are

schema differences,
missing data,
security restrictions,
encryption

AWS Athena, an ad-hoc query tool can alleviate these problems. The main compelling characteristics include :

Serverless
Query Ease
Cost ($5 per TB of data scanned)
Availability
Durability
Performance
Security

Athena behind the scene uses Hive and Presto for analytical queries of any size, stored in S3. Athena processes structured, semi-structured and unstructured data sets including CSV, JSON, ORC, Avro, and Parquet. There are multiple languages supported for Athena drivers to query datastores including java, python, and other languages.

Let’s examine a few different use cases with Athena.

Use cases

Case 1: Storage Analysis

Let us say you have a service where you store user data such as documents, contacts, videos, and images. You have an accounting system in the relational database whereas user resources in S3 orchestrated through metadata housed in DynamoDB. How do we get ad-hoc storage statistics individually as well as the entire customer base across various parameters and events?

Steps :

Create AWS data pipeline to export Relational Database data to S3
- Data persisted in S3 in CSV
Create AWS data pipeline to export DynamoDB data to S3
- Data persisted in S3 in JSON string
Create Database in Athena
Create tables for data sources
Run queries
Clean the resources

Figure 1: Data Ingestion

Figure 2: Schema and Queries

Case 2: Bucket Inventory

Why is S3 usage growing out of sync from user base changes? Do you know how your S3 bucket is being used? How many objects did it store? How many duplicate files? How many deleted?

AWS Bucket Inventory helps to manage the storage and provides audit and report on the replication and encryption status the objects in the bucket. Let us create a bucket and enable Inventory and perform the following steps.

Steps :

Go to S3 bucket
Create buckets vijay-yelanji-insights for objects and vijay-yelanji-inventory for inventory.
Enable inventory
- AWS generates report into the inventory bucket at regular intervals as per schedule job.
Upload files
Delete files
Upload same files to check duplicates
Create Athena table pointing to vijay-yelanji-inventory
Run queries as shown in Figure 5 to get S3 usage to take necessary actions to reduce the cost.

Figure 3: S3 Inventory

Figure 4: Bucket Insights

Figure 5: Bucket Insight Queries

Case 3: Event comparison

Let’s say you are sending a stream of events to two different targets after pre-processing the events very differently and experiencing discrepancy in the data. How do you fix the events counts? What if event and or data are missing? How do you resolve inconsistencies and or quality issues?

If data is stored in S3, and the data format is supported by Athena, you expose it as tables and identify the gaps as shown in figure 7

Figure 6: Event Comparison

Steps:

Data ingested in S3 in snappy or JSON and forwarded to the legacy system of records
Data ingested in S3 in CSV (column separated by ‘|’ ) and forwarded to a new system of records
- Event Forwarder system consumes the source event, modifies the data before pushing into the multiple targets.
Create Athena table from legacy source data and compare it problematic event forwarder data.

Figure 7: Comparison Inference

Case 4: API Call Analysis

If you have not enabled CloudWatch or set up your own ELK stack, but need to analyze service patterns like total HTTP requests by type, 4XX and 5XX errors by call types, this is possible by enabling ELB access logs and reading through Athena.

Figure 8: Calls Inference

Steps :

https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/access-log-collection.html

You can do the same on CloudTrail Logs with more information here:

https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html

Case 5: Python S3 Crawler

If you have tons of JSON data in S3 spread across directories and files, want to analyze keys and its values, all you need to do is use python libraries like PyAthena or JayDeBe to read compressed snappy files after unzipping through SnZip and set these keys into Set data structure before passing as columns to the Athena as shown in Figure 10

Figure 9: Event Crawling

Figure 10: Events to Athena

Limitations

Athena has some limitations including:

Data must reside in S3.
To reduce the cost of the query and improve performance, data must be compressed, partitioned and converted to columnar formats.
User-defined functions, stored procedure, and many DDL are not supported.
If you are generating data continuously or has large data sets, want to get insights into real-time or frequently you should rely on analytical and visualization tools such as RedShift, Kinesis, EMR, Denodo, Spotfire and Tableau.
Check Athena FAQ to understand more about its benefits and limitations.

Summary

In this post, I shared how to leverage Athena to get analytics and minimize bottlenecks to product delivery. Be aware that some of the methods used were implemented when Athena was new. New tools may have changed how best to solve these use cases. Lately, it has been integrated with Glue for building, maintaining, and running ETL jobs and then QuickSight for visualization.

Reference

Athena documentation is at https://docs.aws.amazon.com/athena/latest/ug/what-is.html

About the Author

Vijay Yelanji (@VijayYelanji) is an architect at Asurion working at San Mateo, CA. has more than 20+ years of experience across various domains like Cloud enabled Micro Services to support enterprise level Account, File, Order, and Subscription Management Systems, Websphere Integration Servers and Solutions, IBM Enterprise Storage Solutions, Informix Databases, and 4GL tools.

In Asurion, he was Instrumental in designing and developing multi-tenant, multi-carrier, highly scalable Backup and Restore Mobile Application using various AWS services.

You can download the Asurion Memories application for free at

Recently Vijay presented a topic ‘Logging in AWS’ at AWS Meetup, Mountain View, CA.

Many thanks to AnanthakrishnaChar, Kashyap and Cathy, Hui for their assistance in fine-tuning some of the use cases.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.

Tag Archives: data pipeline

Athena Savior of Adhoc Analytics

Introduction

Use cases

Case 1: Storage Analysis

Steps :

Case 2: Bucket Inventory

Steps :

Case 3: Event comparison

Steps:

Case 4: API Call Analysis

Steps :

Case 5: Python S3 Crawler

Limitations

Summary

Reference

About the Author

About the Editor

Amazon Machine Learning: Super Simple Supervised Learning