Athena Savior of Adhoc Analytics

06. December 2018 2018 0

Introduction

Companies strive to attract customers by creating an excellent product with many features. Previously, product to reality took months to years. Nowadays, product to reality can take a matter of weeks. Companies can fail-fast, learn and move ahead to make it better. Data analytics often takes a back seat becoming a bottleneck.

Some of the problems that cause bottlenecks are

  • schema differences,
  • missing data,
  • security restrictions,
  • encryption

AWS Athena, an ad-hoc query tool can alleviate these problems. The main compelling characteristics include :

  • Serverless
  • Query Ease
  • Cost ($5 per TB of data scanned)
  • Availability
  • Durability
  • Performance
  • Security

Athena behind the scene uses Hive and Presto for analytical queries of any size, stored in S3. Athena processes structured, semi-structured and unstructured data sets including CSV, JSON, ORC, Avro, and Parquet. There are multiple languages supported for Athena drivers to query datastores including java, python, and other languages.

Let’s examine a few different use cases with Athena.

Use cases

Case 1: Storage Analysis

Let us say you have a service where you store user data such as documents, contacts, videos, and images. You have an accounting system in the relational database whereas user resources in S3 orchestrated through metadata housed in DynamoDB.  How do we get ad-hoc storage statistics individually as well as the entire customer base across various parameters and events?

Steps :

  • Create AWS data pipeline to export  Relational Database data to S3
    • Data persisted in S3 in CSV
  • Create AWS data pipeline to export  DynamoDB data to S3
    • Data persisted in S3 in JSON string
  • Create Database in Athena
  • Create tables for data sources
  • Run queries
  • Clean the resources

Figure 1: Data Ingestion

Figure 2: Schema and Queries

Case 2: Bucket Inventory

Why is S3 usage growing out of sync from user base changes? Do you know how your S3 bucket is being used? How many objects did it store? How many duplicate files? How many deleted?

AWS Bucket Inventory helps to manage the storage and provides audit and report on the replication and encryption status the objects in the bucket. Let us create a bucket and enable Inventory and perform the following steps.

Steps :

  • Go to S3 bucket
  • Create buckets vijay-yelanji-insights for objects and vijay-yelanji-inventory for inventory.
  • Enable inventory
    • AWS generates report into the inventory bucket at regular intervals as per schedule job.
  • Upload files
  • Delete files
  • Upload same files to check duplicates
  • Create Athena table pointing to vijay-yelanji-inventory
  • Run queries as shown in Figure 5 to get S3 usage to take necessary actions to reduce the cost.

Figure 3: S3 Inventory

Figure 4: Bucket Insights


Figure 5: Bucket Insight Queries

Case 3: Event comparison

Let’s say you are sending a stream of events to two different targets after pre-processing the events very differently and experiencing discrepancy in the data. How do you fix the events counts? What if event and or data are missing? How do you resolve inconsistencies and or quality issues?

If data is stored in S3, and the data format is supported by Athena, you expose it as tables and identify the gaps as shown in figure 7

Figure 6: Event Comparison

Steps:

  • Data ingested in S3 in snappy or JSON and forwarded to the legacy system of records
  • Data ingested in S3 in CSV (column separated by ‘|’ ) and forwarded to a new system of records
    • Event Forwarder system consumes the source event, modifies the data before pushing into the multiple targets.
  • Create Athena table from legacy source data and compare it problematic event forwarder data.


Figure 7: Comparison Inference

 

Case 4: API Call Analysis

If you have not enabled CloudWatch or set up your own ELK stack, but need to analyze service patterns like total HTTP requests by type, 4XX and 5XX errors by call types, this is possible by enabling  ELB access logs and reading through Athena.

Figure 8: Calls Inference

Steps :

https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/access-log-collection.html

You can do the same on CloudTrail Logs with more information here:

https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html

 

Case 5: Python S3 Crawler

If you have  tons of JSON data in S3 spread across directories and files, want to analyze keys and its values, all you need to do is use python libraries like PyAthena or JayDeBe to read compressed snappy files after unzipping through SnZip and set these keys into Set data structure before passing as columns to the Athena as shown in Figure 10

Figure 9: Event Crawling

Figure 10: Events to Athena

Limitations

Athena has some limitations including:
  • Data must reside in S3.
  • To reduce the cost of the query and improve performance, data must be compressed, partitioned and converted to columnar formats.
  • User-defined functions, stored procedure, and many DDL are not supported.
  • If you are generating data continuously or has large data sets, want to get insights into real-time or frequently you should rely on analytical and visualization tools such as RedShift, Kinesis, EMR, Denodo, Spotfire and Tableau.
  • Check Athena FAQ to understand more about its benefits and limitations.

Summary

In this post, I shared how to leverage Athena to get analytics and minimize bottlenecks to product delivery. Be aware that some of the methods used were implemented when Athena was new. New tools may have changed how best to solve these use cases. Lately, it has been integrated with Glue for building, maintaining, and running ETL jobs and then QuickSight for visualization.

Reference

Athena documentation is at https://docs.aws.amazon.com/athena/latest/ug/what-is.html

About the Author

Vijay Yelanji (@VijayYelanji) is an architect at Asurion working at San Mateo, CA. has more than 20+ years of experience across various domains like Cloud enabled Micro Services to support enterprise level Account, File, Order, and Subscription Management Systems, Websphere Integration Servers and Solutions, IBM Enterprise Storage Solutions, Informix Databases, and 4GL tools.

In Asurion, he was Instrumental in designing and developing multi-tenant, multi-carrier, highly scalable Backup and Restore Mobile Application using various AWS services.

You can download the Asurion Memories application for free at 

Recently Vijay presented a topic  ‘Logging in AWS’ at AWS Meetup, Mountain View, CA.

Many thanks to AnanthakrishnaChar, Kashyap and Cathy, Hui for their assistance in fine-tuning some of the use cases.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Vanquishing CORS with Cloudfront and [email protected]

03. December 2018 2018 0

If you’re deploying a traditional server-rendered web app, it makes sense to host static files on the same machine. The HTML, being server-rendered, will have to be served there, and it is simple and easy to serve css, javascript, and other assets from the same domain.

When you’re deploying a single-page web app (or SPA), the best choice is less obvious. A SPA consists of a collection of static files, as opposed to server-rendered files that might change depending on the requester’s identity, logged-in state, etc. The files may still change when a new version is deployed, but not for every request.

In a single-page web app, you might access several APIs on different domains, or a single API might serve multiple SPAs. Imagine you want to have the main site at mysite.com and some admin views at admin.mysite.com, both talking to api.mysite.com.

Problems with S3 as a static site host

S3 is a good option for serving the static files of a SPA, but it’s not perfect. It doesn’t support SSL—a requirement for any serious website in 2018. There are a couple other deficiencies that you may encounter, namely client-side routing and CORS headaches.

Client-side routing

Most SPA frameworks rely on client-side routing. With client-side routing, every path should receive the content for index.html, and the specific “page” to show is determined on the client. It’s possible to configure this to use the fragment portion of the url, giving routes like /#!/login and /#!/users/253/profile. These “hashbang” routes are trivially supported by S3: the fragment portion of a URL is not interpreted as a filename. S3 just serves the content for /, or index.html, just like we wanted.

However, many developers prefer to use client-side routers in “history” mode (aka “push-state” or “HTML5” mode). In history mode, routes omit that #! portion and look like /login and /users/253/profile. This is usually done for SEO reasons, or just for aesthetics. Regardless, it doesn’t work with S3 at all. From S3’s perspective, those look like totally different files. It will fruitlessly search your bucket for files called /login or /users/253/profile. Your users will see 404 errors instead of lovingly crafted pages.

CORS headaches

Another potential problem, not unique to S3, is due to Cross-Origin Resource Sharing (CORS). CORS polices which routes and data are accesible from other origins. For example, a request from your SPA mysite.com to api.mysite.com is considered cross-origin, so it’s subject to CORS rules. Browsers enforce that cross-origin requests are only permitted when the server at api.mysite.com sets headers explicitly allowing them.

Even when you have control of the server, CORS headers can be tricky to set up correctly. Some SPA tutorials recommend side-stepping the problem using webpack-dev-server’s proxy setting. In this configuration, webpack-dev-server accepts requests to /api/* (or some other prefix) and forwards them to a server (eg, http://localhost:5000). As far as the browser is concerned, your API is hosted on the same domain—not a cross-origin request at all.

Some browsers will also reject third-party cookies. If your API server is on a subdomain this can make it difficult to maintain a logged-in state, depending on your users’ browser settings. The same fix for CORS—proxying /api/* requests from mysite.com to api.mysite.com—would also make the browser see these as first-party cookies.

In production or staging environments, you wouldn’t be using webpack-dev-server, so you could see new issues due to CORS that didn’t happen on your local computer. We need a way to achieve similar proxy behavior that can stand up to a production load.

CloudFront enters, stage left

To solve these issues, I’ve found CloudFront to be an invaluable tool. CloudFront acts as a distributed cache and proxy layer. You make DNS records that resolve mysite.com to something.CloudFront.net. A CloudFront distribution accepts requests and forwards them to another origin you configure. It will cache the responses from the origin (unless you tell it not to). For a SPA, the origin is just your S3 bucket.

In addition to providing caching, SSL, and automatic gzipping, CloudFront is a programmable cache. It gives us the tools to implement push-state client-side routing and to set up a proxy for your API requests to avoid CORS problems.

Client-side routing

There are many suggestions to use CloudFront’s “Custom Error Response” feature in order to achieve pretty push-state-style URLs. When CloudFront receives a request to /login it will dutifully forward that request to your S3 origin. S3, remember, knows nothing about any file called login so it responds with a 404. With a Custom Error Response, CloudFront can be configured to transform that 404 NOT FOUND into a 200 OK where the content is from index.html. That’s exactly what we need for client-side routing!

The Custom Error Response method works well, but it has a drawback. It turns all 404s into 200s with index.html for the body. That isn’t a problem yet, but we’re about to set up our API so it is accessible at mysite.com/api/* (in the next section). It can cause some confusing bugs if your API’s 404 responses are being silently rewritten into 200s with an HTML body!

If you don’t need to talk to any APIs or don’t care to side-step the CORS issues by proxying /api/* to another server, the Custom Error Response method is simpler to set up. Otherwise, we can use [email protected] to rewrite our URLs instead.

[email protected] gives us hooks where we can step in and change the behavior of the CloudFront distribution. The one we’ll need is “Origin Request”, which fires when a request is about to be sent to the S3 origin.

We’ll make some assumptions about the routes in our SPA.

  1. Any request with an extension (eg, styles/app.css, vendor.js, or imgs/logo.png) is an asset and not a client-side route. That means it’s actually backed by a file in S3.
  2. A request without an extension is a SPA client-side route path. That means we should respond with the content from index.html.

If those assumptions aren’t true for your app, you’ll need to adjust the code in the Lambda accordingly. For the rest of us, we can write a lambda to say “If the request doesn’t have an extension, rewrite it to go to index.html instead”. Here it is in Node.js:

Make a new Node.js Lambda, and copy that code into it. At this time, in order to be used with CloudFront, your Lambda must be deployed to the us-east-1 region. Additionally, you’ll have to click “Publish a new version” on the Lambda page. An unpublished Lambda cannot be used with [email protected]

Copy the ARN at the top of the page and past it in the “Lambda function associations” section of your S3 origin’s Behavior. This is what tells CloudFront to call your Lambda when an Origin Request occurs.

Et Voila! You now have pretty SPA URLs for client-side routing.

Sidestep CORS Headaches

A single CloudFront “distribution” (that’s the name for the cache rules for a domain) can forward requests to multiple servers, which CloudFront calls “Origins”. So far, we only have one: the S3 bucket. In order to have CloudFront forward our API requests, we’ll add another origin that points at our API server.

Probably, you want to set up this origin with minimal or no caching. Be sure to forward all headers and cookies as well. We’re not really using any of CloudFront’s caching capabilities for the API server. Rather, we’re treating it like a reverse proxy.

At this point you have two origins set up: the original one one for S3 and the new one for your API. Now we need to set up the “Behavior” for the distribution. This controls which origin responds to which path.

Choose /api/* as the Path Pattern to go to your API. All other requests will hit the S3 origin. If you need to communicate with multiple API servers, set up a different path prefix for each one.

CloudFront is now serving the same purpose as the webpack-dev-server proxy. Both frontend and API endpoints are available on the same mysite.com domain, so we’ll have zero issues with CORS.

Cache-busting on Deployment

The CloudFront cache makes our sites load faster, but it can cause problems too. When you deploy a new version of your site, the cache might continue to serve an old version for 10-20 minutes.

I like to include a step in my continuous integration deploy to bust the cache, ensuring that new versions of my asset files are picked up right away. Using the AWS CLI, this looks like

About the Author

Brian Schiller (@bgschiller) is a Senior Software Engineer at Devetry in Denver. He especially enjoys teaching, and leads Code Forward, a free coding bootcamp sponsored by Devetry. You can read more of his writing at brianschiller.com.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Previously, she was a principal site reliability engineer at RealSelf and developed cookbooks to simplify building and managing infrastructure at Chef. Jennifer is the coauthor of Effective DevOps and speaks about DevOps, tech culture, and monitoring. She also gives tutorials on a variety of technical topics. When she’s not working, she enjoys learning to make things and spending quality time with her family.