Companies strive to attract customers by creating an excellent product with many features. Previously, product to reality took months to years. Nowadays, product to reality can take a matter of weeks. Companies can fail-fast, learn and move ahead to make it better. Data analytics often takes a back seat becoming a bottleneck.
Some of the problems that cause bottlenecks are
- schema differences,
- missing data,
- security restrictions,
AWS Athena, an ad-hoc query tool can alleviate these problems. The main compelling characteristics include :
- Query Ease
- Cost ($5 per TB of data scanned)
Athena behind the scene uses Hive and Presto for analytical queries of any size, stored in S3. Athena processes structured, semi-structured and unstructured data sets including CSV, JSON, ORC, Avro, and Parquet. There are multiple languages supported for Athena drivers to query datastores including java, python, and other languages.
Let’s examine a few different use cases with Athena.
Case 1: Storage Analysis
Let us say you have a service where you store user data such as documents, contacts, videos, and images. You have an accounting system in the relational database whereas user resources in S3 orchestrated through metadata housed in DynamoDB. How do we get ad-hoc storage statistics individually as well as the entire customer base across various parameters and events?
- Create AWS data pipeline to export Relational Database data to S3
- Data persisted in S3 in CSV
- Create AWS data pipeline to export DynamoDB data to S3
- Data persisted in S3 in JSON string
- Create Database in Athena
- Create tables for data sources
- Run queries
- Clean the resources
Figure 1: Data Ingestion
Figure 2: Schema and Queries
Case 2: Bucket Inventory
Why is S3 usage growing out of sync from user base changes? Do you know how your S3 bucket is being used? How many objects did it store? How many duplicate files? How many deleted?
AWS Bucket Inventory helps to manage the storage and provides audit and report on the replication and encryption status the objects in the bucket. Let us create a bucket and enable Inventory and perform the following steps.
- Go to S3 bucket
- Create buckets vijay-yelanji-insights for objects and vijay-yelanji-inventory for inventory.
- Enable inventory
- AWS generates report into the inventory bucket at regular intervals as per schedule job.
- Upload files
- Delete files
- Upload same files to check duplicates
- Create Athena table pointing to vijay-yelanji-inventory
- Run queries as shown in Figure 5 to get S3 usage to take necessary actions to reduce the cost.
Figure 3: S3 Inventory
Figure 4: Bucket Insights
Figure 5: Bucket Insight Queries
Case 3: Event comparison
Let’s say you are sending a stream of events to two different targets after pre-processing the events very differently and experiencing discrepancy in the data. How do you fix the events counts? What if event and or data are missing? How do you resolve inconsistencies and or quality issues?
If data is stored in S3, and the data format is supported by Athena, you expose it as tables and identify the gaps as shown in figure 7
Figure 6: Event Comparison
- Data ingested in S3 in snappy or JSON and forwarded to the legacy system of records
- Data ingested in S3 in CSV (column separated by ‘|’ ) and forwarded to a new system of records
- Event Forwarder system consumes the source event, modifies the data before pushing into the multiple targets.
- Create Athena table from legacy source data and compare it problematic event forwarder data.
Figure 7: Comparison Inference
Case 4: API Call Analysis
If you have not enabled CloudWatch or set up your own ELK stack, but need to analyze service patterns like total HTTP requests by type, 4XX and 5XX errors by call types, this is possible by enabling ELB access logs and reading through Athena.
Figure 8: Calls Inference
You can do the same on CloudTrail Logs with more information here:
Case 5: Python S3 Crawler
If you have tons of JSON data in S3 spread across directories and files, want to analyze keys and its values, all you need to do is use python libraries like PyAthena or JayDeBe to read compressed snappy files after unzipping through SnZip and set these keys into Set data structure before passing as columns to the Athena as shown in Figure 10
Figure 9: Event Crawling
Figure 10: Events to Athena
- Data must reside in S3.
- To reduce the cost of the query and improve performance, data must be compressed, partitioned and converted to columnar formats.
- User-defined functions, stored procedure, and many DDL are not supported.
- If you are generating data continuously or has large data sets, want to get insights into real-time or frequently you should rely on analytical and visualization tools such as RedShift, Kinesis, EMR, Denodo, Spotfire and Tableau.
- Check Athena FAQ to understand more about its benefits and limitations.
In this post, I shared how to leverage Athena to get analytics and minimize bottlenecks to product delivery. Be aware that some of the methods used were implemented when Athena was new. New tools may have changed how best to solve these use cases. Lately, it has been integrated with Glue for building, maintaining, and running ETL jobs and then QuickSight for visualization.
Athena documentation is at https://docs.aws.amazon.com/athena/latest/ug/what-is.html
About the Author
Vijay Yelanji (@VijayYelanji) is an architect at Asurion working at San Mateo, CA. has more than 20+ years of experience across various domains like Cloud enabled Micro Services to support enterprise level Account, File, Order, and Subscription Management Systems, Websphere Integration Servers and Solutions, IBM Enterprise Storage Solutions, Informix Databases, and 4GL tools.
In Asurion, he was Instrumental in designing and developing multi-tenant, multi-carrier, highly scalable Backup and Restore Mobile Application using various AWS services.
Recently Vijay presented a topic ‘Logging in AWS’ at AWS Meetup, Mountain View, CA.
Many thanks to AnanthakrishnaChar, Kashyap and Cathy, Hui for their assistance in fine-tuning some of the use cases.
About the Editor
Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.