Deploy a Secure Static Site with AWS & Terraform

14. December 2018

Introduction

There are many uses for static websites. A static site is the simplest form of website: every website ultimately delivers HTML, CSS and other resources to a browser, but with a static website the initial page content is delivered the same to every user, regardless of how they’ve interacted with your site previously. There’s no database, authentication or anything else involved in sending the site to the user – just a straight HTTPS connection and some text content. This content can be cached on servers closer to its users for faster delivery; it is generally also cheaper to serve, as the servers delivering it do not themselves need to interpret scripting languages or make database connections on behalf of the application.

The static website also has a newer use: a growing set of tools builds highly interactive in-browser applications on JavaScript frameworks (such as React, Vue or Angular) which manage client interaction, maintain local data and talk to the web service via small but often frequent API calls. These systems decouple front-end applications from back-end services and allow those back ends to be written in multiple languages or as small siloed applications, often called microservices. Microservices may take advantage of modern back-end technologies such as containers (via Docker and/or Kubernetes) and “serverless” providers like AWS Lambda.

People deploying static sites fall into these two very different categories – for one the site is the whole of their business, for the other the static site is a very minor part supporting the API. However, each category of static site use still shares similar requirements. In this article we explore deploying a static site with the following attributes:

  • Must work at the root domain of a business, e.g., example.com
  • Must redirect from the common (but unnecessary) www. subdomain to the root domain
  • Must be served via HTTPS (and upgrade HTTP to HTTPS)
  • Must support “pretty” canonical URLs – e.g., example.com/about-us rather than example.com/about-us.html
  • Must not cost anything when not being accessed (except for domain name costs)

AWS Service Offerings

We achieve these requirements through use of the following AWS services:

  • S3
  • CloudFront
  • ACM (Amazon Certificate Manager)
  • Route53
  • Lambda

This may seem like quite a lot of services to host a simple static website; let’s review and summarise why each item is being used:

  • S3 – object storage; allows you to put files in the cloud. Other AWS users or AWS services may be permitted access to these files. They can be made public. S3 supports website hosting, but only via HTTP. For HTTPS you need…
  • CloudFront – content delivery system; can sit in front of an S3 bucket or a website served via any other domain (doesn’t need to be on AWS) and deliver files from servers close to users, caching them if allowed. Allows you to import HTTPS certificates managed by…
  • ACM – generates and stores certificates (you can also upload your own). Will automatically renew certificates which it generates. For generating certificates, your domain must be validated via adding custom CNAME records. This can be done automatically in…
  • Route53 – AWS nameservers and DNS service. R53 replaces your domain provider’s nameservers (at the cost of $0.50 per month per domain) and allows both traditional DNS records (A, CNAME, MX, TXT, etc.) and “alias” records which map to a specific other AWS service – such as S3 websites or CloudFront distributions. Thus an A record on your root domain can link directly to Cloudfront, and your CNAMEs to validate your ACM certificate can also be automatically provisioned
  • Lambda – functions as a service. Lambda lets you run custom code on events, which can come directly or from a variety of other AWS services. Crucially you can put a Lambda function into Cloudfront, manipulating requests or responses as they’re received from or sent to your users. This is how we’ll make our URLs look nice

Hopefully, that gives you some understanding of the services – you could cut out CloudFront and ACM if you didn’t care about HTTPS, but there is a worldwide push for HTTPS adoption to improve security for users, and browsers such as Chrome now mark pages not served via HTTPS as “insecure”.

All this is well and good, but whilst AWS is powerful their console leaves much to be desired, and setting up one site can take some time – replicating it for multiple sites is as much an exercise in memory and box ticking as it is in technical prowess. What we need is a way to do this once, or even better have somebody else do this once, and then replicate it as many times as we need.

Enter Terraform from HashiCorp

One of the most powerful parts of AWS isn’t obvious when you first start using the console to manage your resources: AWS has a comprehensive API that drives pretty much everything. It underpins much of AWS’s own automation, the entirety of its security model, and tools like Terraform.

Terraform from HashiCorp is an “Infrastructure-as-Code” (IaC) tool. It lets you define resources on a variety of cloud providers and then run commands to:

  • Check the current state of your environment
  • Make required changes such that your actual environment matches the code you’ve written

In code form, Terraform uses blocks of code called resources:

resource "aws_s3_bucket" "some-internal-reference" {
  bucket = "my-bucket-name"
}

Each resource takes a set of variables (documented on the provider’s website); these can be text, numbers, true/false values, lists (of the above) or maps (which behave like sub-resources with their own variables).
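As an illustration – a hedged sketch only, with a made-up bucket name and tags, and arguments that depend on your AWS provider version – a resource mixing several of these value types might look like:

resource "aws_s3_bucket" "example" {
  bucket        = "my-example-bucket-name"   # text
  force_destroy = false                      # true/false

  tags = {                                   # map
    Environment = "production"
  }

  website {                                  # a nested block with its own variables
    index_document = "index.html"
  }
}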

Terraform is distributed as pre-built binaries (it’s also open source and written in Go, so you can build it yourself) that you can run simply by downloading them and making them executable. To work with AWS, you need to define a “provider”, which is formatted similarly to a resource:

provider "aws" {
}

To call any AWS API (via the command line, Terraform or a language of your choice) you’ll need to generate an access key and secret key for the account you’d like to use. Generating those keys is beyond the scope of this article, but since you should avoid hardcoding credentials into Terraform – and since the AWS CLI is well worth having anyway – follow the AWS CLI setup instructions and configure it with the correct keys before continuing.

(NB: for this step you’re best off provisioning a user with admin rights, or at least full access to IAM, S3, Route53, CloudFront, ACM & Lambda. However, don’t be tempted to create access keys for your root account – AWS recommends against this.)
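For reference, a provider block that picks up a named profile from ~/.aws/credentials might look like the following hedged sketch (the region and profile name are placeholders):

provider "aws" {
  region  = "eu-west-1"   # placeholder region
  profile = "default"     # must match a profile name in ~/.aws/credentials
}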

Now that you’ve got your system set up to use AWS programmatically, installed Terraform and been introduced to the basics of its syntax, it’s a good time to look at our code on GitHub.

Clone the repository above; you’ll see we have one file in the root (main.tf.example) and a directory called modules. Modules are one of the best parts of Terraform. A module lets one user define a specific set of infrastructure whose resources either relate directly to each other or interact by being on the same account. Modules can expose variables so that some aspects (names, domains, tags) can be customised, whilst other items necessary for the module to function (like a particular configuration of a CloudFront distribution) remain fixed.

To start off, run bash ./setup, which copies the example file to main.tf, ensures your local Terraform installation has the correct providers (AWS and file archiving) and sets up the modules. In main.tf you’ll then see a suggested set-up using three modules. You’re free to remove main.tf entirely and use each module in its own right, but for this tutorial it helps to have a complete picture.

Three variables are defined at the top of the main.tf file which you’ll need to fill in correctly (a hedged sketch of how they might look follows the list):

  1. The first is the domain you wish to use – it can be your root domain (example.com) or any sort of subdomain (my-site.example.com).
  2. Second, you’ll need the Zone ID associated with your domain on Route 53. Each Route 53 domain gets a zone ID which relates to AWS’ internal domain mapping system. To find your Zone ID visit the Route53 Hosted Zones page whilst signed in to your AWS account and check the right-hand column next to the root domain you’re interested in using for your static site.
  3. Finally, choose a region; if you already use AWS you may have a preferred region, otherwise choose the one from the AWS list nearest to you. As a note, it’s generally best to avoid us-east-1 where possible, as on balance it tends to see more issues due to its centrality in various AWS services.
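As a hedged sketch only – the actual variable names in main.tf may differ – the three values might be declared like this:

variable "domain" {
  default = "example.com"       # or a subdomain such as my-site.example.com
}

variable "zone_id" {
  default = "Z1234567890ABC"    # the Route 53 Hosted Zone ID (placeholder value)
}

variable "region" {
  default = "eu-west-2"         # whichever region you chose above
}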

Now for the fun part. Run terraform plan – if your AWS CLI environment is set up the plan should execute and show the creation of a whole list of resources – S3 Buckets, CloudFront distributions, a number of DNS records and even some new IAM roles & policies. If this bit fails entirely, check that the provider entity in main.tf is using the right profile name based on your ~/.aws/credentials file.

Once the plan has run and told you it’s creating resources (it shouldn’t say updating or destroying at this point), you’re ready to go. Run terraform apply – this basically does another plan, but at the end, you can type yes to have Terraform create the resources. This can take a while as Terraform has to call various AWS APIs and some are quicker than others – DNS records can be slightly slower, and ACM generation may wait until it’s verified DNS before returning a positive response. Be patient and eventually it will inform you that it’s finished, or tell you if there have been problems applying.

If the plan or apply options have problems you may need to change some of your variables based on the following possible issues:

  • S3 bucket names must be globally unique – so if anyone in the world already has a bucket with the name you want, you can’t have it. A good system is to prefix buckets with your company name or suffix them with random characters. By default, the system names your buckets for you, but you can override this.
  • You shouldn’t have an A record for your root or www. domain already in Route53.
  • You shouldn’t have an ACM certificate for your root domain already.

It’s safe (in the case of this code at least) to re-run Terraform if problems have occurred and you’ve tried to fix them – it will only modify or remove resources it has already created, so other resources on the account are safe.

Go into the AWS console and browse S3, CloudFront, Route53 and you should see your various resources created. You can also view the Lambda function and ACM, but be aware that for the former you’ll need to be in the specific region you chose to run in, and for the latter you must select us-east-1 (N. Virginia), as certificates used by CloudFront are issued in that region.

What now?

It’s time to deploy a website. This is the easy part – you can use the S3 console to drag and drop files (remember to use the website bucket and not the logs or www redirect buckets), use awscli to upload yourself (via aws s3 cp or aws s3 sync) or run the example bash script provided in the repo which takes one argument, a directory of all files you want to upload. Be aware – any files uploaded to your bucket will immediately be public on the internet if somebody knows the URL!
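For example, a sync with the AWS CLI might look like this (a hedged sketch – substitute the name of the website bucket Terraform created and your own content directory):

# Upload the contents of ./my-site to the website bucket, removing remote files
# that no longer exist locally
aws s3 sync ./my-site/ s3://example-com-website/ --delete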

If you don’t have a website, check the “example-website” directory – running the bash script above without any arguments will deploy this for you. Once you’ve deployed something, visit your domain and all being well you should see your site. Cloudfront distributions have a variable time to set up so in some cases it might be 15ish minutes before the site works as expected.

Note also that CloudFront is set to cache files for 5 minutes; even a hard refresh won’t reload resource files like CSS or JavaScript, as CloudFront won’t fetch them again from your bucket until 5 minutes after it first fetched them. During development you may wish to turn this off – you can do so in the CloudFront console by setting the TTL values to 0. Once you’re ready to go live, run terraform apply again and it will reconfigure CloudFront to the recommended settings.

Summary

With a minimal amount of work we now have a framework that can deploy a secure static site to any domain we choose in a matter of minutes. We could use this to deploy websites for marketing clients rapidly, publish a blog generated with a static site builder like Jekyll, or use it as the basis for a serverless web application using ReactJS delivered to the client and a back-end provided by AWS Lambda accessed via AWS API Gateway or (newly released) an AWS Application Load Balancer.

About the Author

Mike has been working in web application development for 10 years, including 3 years managing a development team for a property tech startup and before that 4 years building a real time application for managing operations at skydiving centres, as well as some time freelancing. He uses Terraform to manage all the AWS infrastructure for his current work and has dabbled in other custom AWS tools such as an improvement to the CloudWatch logging agent and a deployment tool for S3. You can find him on Twitter @m1ke and GitHub.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


I GOTS TO GET ORGANIZIZED

13. December 2018


AWS Organizations are an amazing way to do a few existentially important things:

  • Consolidate Payment for multiple AWS Accounts.
  • Group AWS Accounts
  • Provide policies for a Group of AWS Accounts.
  • Control access to AWS Services, by Group or Individual Account
  • Centralize CloudTrail Logging ( released at re:Invent 2018 )

Sooner or later, every business grows to need policies. You may have some policies that should trickle down to your whole organization. You may wish to declare and enforce concepts like S3 buckets should never be deleted, IAM Users should not be able to generate access keys, or perhaps CloudTrail logging should never be stopped. Whether or not these concepts resonate with you, they are the types of ideas that Organizations can declare and enforce.

Some policies may end up being domain specific. PCI-DSS doesn’t apply to non-financial business domains. Rights related to data retention may only apply in certain groups of countries, but not others. AWS Organizations can be leveraged to manifest these ideas.

AWS Organizations brings you technical controls for declaring, enforcing, and ( when paired with AWS Config ) reporting on compliance directives.

Step Zero: Get your plan together.


For many organizations, having their … AWS Organization … set up as a tree structure is a great option.

The Organizational Concept

At the root of the tree, you have a single account ( the same AWS account from which we will begin working ). This single account runs no code. This account is exclusively for payment and policy management.

In this article, we’re going to:

  • Create a new account that will be the root of our organization
  • Create a Service Control Policy (SCP) that declares CloudHSM should not be used
  • Create an SCP that declares cloudtrail:StopLogging cannot be called
  • Attach those SCPs to our new Organization.
  • Create an OU inside the Organization
  • Bring another account into that OU

The Policy Layout

As you move out from the root and to the first layer of subordinate accounts ( children of the root account ), one policy may apply. Say this policy is “You can run anything except HSM.”

One of those child accounts may have its own children who process credit card payments and are subject to PCI-DSS. These grandchildren accounts may be restricted to using an explicit whitelist of services. They may be required to use 2FA when logging in. Maybe they can’t use S3, because srsbzns can’t happen via S3?

But what about existing accounts?

You can invite accounts into your organization in one of two ways:

  1. You invite by AWS Account ID
  2. You invite by email, which uses the AWS Console’s root user login’s email address.

I’m going to assume you’ve already got at least one AWS account; if not, create one and invite it via email. If you do already have an account, find its account ID.

The Organizational Layout

The image below illustrates the organizational structure that we’re going to be creating as a part of this writeup. Service Control Policies ( called SCPs hereafter ) will be used to enforce our policies. If you follow the link to the SCPs page, you’ll notice a few important caveats to how SCPs work. In short

  1. SCPs only Deny access
  2. SCPs don’t apply to service-linked roles
  3. SCPs only apply to principals inside your organization
  4. If you disable the SCP policy type in a root, you can expect to spend the next several days re-enabling it with much tender loving care.

a directed graph of the organization layout

Step One: Prep your soon-to-be-root account

AWS requires you to verify your email address before you can begin creating subordinate accounts. Choose an account that will be your new root account, and verify the email address on it by logging into the AWS Console, and visiting https://console.aws.amazon.com/organizations/home

Build a fresh account to be the new root account

If you have an existing AWS Account that is not itself the root of an organization, you may want to create a new account for this purpose. This writeup’s screencaps will continue with a fresh account that will be our designated root.

AWS Organizations console for a fresh account: the default page.

The Accounts tab, showing the “You must verify your email to use AWS Accounts” dialogue.

The Accounts tab after verifying the email address.

Create an admin user

We will need to use the aws CLI just a bit, because there isn’t super amazing CloudFormation support to generate organizational children.

This user can be created via CloudFormation, however, and that’s doable via the console.

  1. As your root user, navigate to Services -> CloudFormation
  2. Create Stack
  3. In the Choose a Template section, select the Specify an Amazon S3 URL radio button
  4. In that URL place https://s3-us-west-2.amazonaws.com/awsadvent-2018-organizations/StepOneCFNs/phase_0_user_and_accesskey.yml
  5. Next
  6. Stack Name AdventAdmin
  7. Next
  8. Nothing needed here on the Options screen
  9. Next
  10. Check the box acknowledging that AWS CloudFormation might create IAM resources with custom names.
  11. Create
  12. Wait for the stack to reach the CREATE_COMPLETE state.

CloudFormation console showing the stack in the CREATE_COMPLETE state

Step Two: Create Some SCPs

The Organizations API has great support via the CLI and the SDKs, but it’s not really present in CloudFormation. We’re going to use the aws cli to interact with the AWS Organizations APIs. If you don’t already have it, here is the guide to installing the aws cli.

Get your credentials together

We created an IAM User in Step One that has full Administrator privilege. For this guide, I’m going to assume that’s the user that you’ll be using.

To get the credentials to use the cli as this user

  1. Services -> CloudFormation
  2. Stacks
  3. AdventAdmin
  4. Outputs – the outputs are in a slightly strange place on the CloudFormation stack (see the CloudFormation console screenshot showing the stack’s outputs). You can cut and paste those two values into your CLI to build something like the sketch below.
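The exact output keys depend on the template, but as a hedged sketch the two values can be used roughly like this (the key values shown are placeholders):

# Paste the AccessKeyId and SecretAccessKey values from the stack outputs
export AWS_ACCESS_KEY_ID=AKIAEXAMPLEKEYID
export AWS_SECRET_ACCESS_KEY=example-secret-access-key

# or store them as a named profile instead
aws configure set aws_access_key_id AKIAEXAMPLEKEYID --profile advent-admin
aws configure set aws_secret_access_key example-secret-access-key --profile advent-admin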

Make your first SCP

We’re going to first make a Service Control Policy for our entire organization that states “No one anywhere can run CloudHSM.” CloudHSM was chosen for this example because it tends not to be used, and it’s relatively expensive. There’s nothing wrong with CloudHSM! If your business needs CloudHSM, use it! These SCPs should be considered for demonstration purposes only. A hedged sketch of the CLI commands for the steps below follows the list.

  1. Check to be sure that your organization is functional:
  2. Check your existing SCPs. Amazon created one for you when you built your organization. It says “All Accounts in this organization can use all services.” Check it with this command:
  3. Now we’ll create our new SCP, which will state Deny all use of cloudhsm. Notice how the SCP language is almost an IAM policy?
  4. List the policies again, to notice that there are two
  5. Let’s make another SCP, which will state CloudTrail cannot be disabled.
  6. Let’s make another SCP, which will state AWS Config Rules Cannot Be Disabled.
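The original commands aren’t reproduced here, but a hedged sketch of what they might look like with the AWS CLI is below (policy names and descriptions are placeholders; adapt the Action list for the CloudTrail and Config variants):

# 1. Confirm the organization exists and is functional
aws organizations describe-organization

# 2. List the SCPs that already exist (FullAWSAccess is created for you)
aws organizations list-policies --filter SERVICE_CONTROL_POLICY

# 3. Create an SCP that denies all use of CloudHSM - note the IAM-like policy language
aws organizations create-policy \
  --name DenyCloudHSM \
  --type SERVICE_CONTROL_POLICY \
  --description "Deny all use of CloudHSM" \
  --content '{"Version":"2012-10-17","Statement":[{"Effect":"Deny","Action":"cloudhsm:*","Resource":"*"}]}'

# 5./6. The CloudTrail and Config SCPs are the same shape, denying actions such as
# cloudtrail:StopLogging or config:StopConfigurationRecorder instead
aws organizations create-policy \
  --name KeepCloudTrailEnabled \
  --type SERVICE_CONTROL_POLICY \
  --description "CloudTrail cannot be disabled" \
  --content '{"Version":"2012-10-17","Statement":[{"Effect":"Deny","Action":["cloudtrail:StopLogging","cloudtrail:DeleteTrail"],"Resource":"*"}]}'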

Enable SCPs for your organization and attach them

At this point, we have an Organization and some SCPs, but they aren’t attached.

Our organization does not yet have any structure, and it is not in a state where the SCPs that we created can be attached anywhere.

SCPs have to be explicitly enabled for your Organization. Let’s go ahead and do that – a hedged sketch of the corresponding CLI commands follows the numbered list.

  1. First, we’re going to go ahead and run a command that will be denied by an SCP later.
  2. Now, we need to gather the organization’s Root ID to enable SCPs

    We will now enable SCPs in our Organization’s Root. Take the Id in the output above

    DO NOT PANIC BECAUSE POLICYTYPES IS EMPTY

  3. List the roots of the organization again, and (hopefully) notice that SCPs are enabled

  4. List out the SCP Policies. You’ll need these Ids in the coming commands

  5. Attach the Deny cloudhsm:* SCP to the root. Doing this will trickle through the whole organization.

  6. Attach the Keep CloudTrail Enabled SCP to the root. Doing this will trickle through the whole organization.

  7. Attach the Keep Config enabled SCP to the root. Doing this will trickle through the whole organization.

    Show which policies are attached to our root object.

  8. OK, now let’s see what we’ve disabled.

    BUT, BUT, we just disabled that!! WHAT?!

    SCPs don’t apply to the master (management) account at the root of your organization. The Aristocrats!

    Presumably, this means that if you have an admin/root user in the root of your organization, you can recover. Maybe. With the help of support. ( don’t try this for funsies, folks! )
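A hedged sketch of the CLI commands for these steps (the root and policy IDs are placeholders – use the ones returned by your own list commands):

# 1. A call that should succeed now but be denied once the SCP applies
aws cloudhsmv2 describe-clusters

# 2./3. Find the organization root and enable the SCP policy type on it
aws organizations list-roots
aws organizations enable-policy-type --root-id r-examplerootid111 --policy-type SERVICE_CONTROL_POLICY

# 4. List the SCPs to collect their policy IDs
aws organizations list-policies --filter SERVICE_CONTROL_POLICY

# 5.-7. Attach each SCP to the root so it trickles through the whole organization
aws organizations attach-policy --policy-id p-examplepolicyid1 --target-id r-examplerootid111

# Show which policies are attached to the root
aws organizations list-policies-for-target --target-id r-examplerootid111 --filter SERVICE_CONTROL_POLICY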

Step Three: Build out an Organization CloudTrail

Now that we’ve laid the groundwork let’s build out this amazing organization that we’re so excited to try out!

Make an S3 bucket to drop the CloudTrail logs into

We’re now ready to create an S3 bucket into which we’ll stash our CloudTrail Logs for our entire organization, automatically, as the org grows or shrinks.

Pretty cool, right? Before we can create the S3 bucket that we’re gonna drop our CloudTrail logs into, we need to get the organization’s OrgId.

  1. Use the CLI to grab your OrgId
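A hedged sketch of the command:

# Returns the organization's Id (something like o-exampleorgid)
aws organizations describe-organization --query Organization.Id --output text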

Now move over to add the CloudFormation Stack that is going to build out the S3 bucket with the right bucket policy for our org to log cloudtrail data into it.

  1. As your root user, navigate to Services -> CloudFormation
  2. Create Stack
  3. In the Choose a Template section, select the Specify an Amazon S3 URL radio button
  4. In that URL place https://s3-us-west-2.amazonaws.com/awsadvent-2018-organizations/StepThreeCFNs/phase_3_s3_bucket.yml
  5. Next
  6. Stack Name CloudTrailS3Bucket
  7. OrgId your-org-id-from-the-cli-command-above
  8. Next
  9. Nothing needed here on the Options screen
  10. Next
  11. Check the box acknowledging that AWS CloudFormation might create IAM resources with custom names.
  12. Create
  13. Wait for the stack to reach the CREATE_COMPLETE state.

Image of the S3 bucket’s stack reaching the CREATE_COMPLETE state

Create an Organizational CloudTrail

Normally, I’d have dropped the CloudTrail creation into CloudFormation, because I’m not a monster… BUT… If you aren’t already aware, you can consider this my heads up to you that new features frequently get CLI/API support well before they manifest in CloudFormation.

CloudFormation does not yet support organizational CloudTrails. TO THE CLI! A hedged sketch of the commands appears after the numbered steps below.

  1. Gather the S3 bucket name that we created earlier.
  2. Now we have to enable all organizational features

    Even though this is an error, it’s the one that we want. Features are enabled 👍👍

  3. Now we have to enable service access for cloudtrail. SCPs don’t impact service access.

  4. Finally, we create the actual trail
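A hedged sketch of these CLI steps (the stack, bucket and trail names are placeholders):

# 1. Gather the bucket name from the CloudFormation stack outputs
aws cloudformation describe-stacks --stack-name CloudTrailS3Bucket --query "Stacks[0].Outputs"

# 2. Enable all features for the organization (errors if already enabled - that's the error we want)
aws organizations enable-all-features

# 3. Allow CloudTrail to operate across the organization
aws organizations enable-aws-service-access --service-principal cloudtrail.amazonaws.com

# 4. Create the organization-wide, multi-region trail and start logging
aws cloudtrail create-trail \
  --name org-trail \
  --s3-bucket-name your-cloudtrail-bucket-name \
  --is-organization-trail \
  --is-multi-region-trail
aws cloudtrail start-logging --name org-trail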

OK, you’ve done a lot so far. And you’re going to be happy that you laid out all this prep work once you have to start answering questions like “Who in account X built out as many EC2 instances as their account would allow?”

Or, “Did Frank from Accounting actually delete the RDS Database in their AWS Account?”

Step Four: Finally Build Out Some Organization

Let’s get to the purpose you’re here for: building out some Organizations!

a directed graph of the organization layout with SCPs

Make An Organizational Unit ( OU ) for developers

You work in a progressive organization that wants individual developers to have their own AWS Accounts, huzzah!

We’re going to build an OU to stuff our developer accounts in, and then we’ll invite some developers into our org. A hedged sketch of the CLI commands appears after the list below.

  1. Gather the existing organization’s root. Since we don’t yet have any OUs, all things are rooted from the root.
  2. Now let’s create a subordinate OU
  3. Invite an email address
  4. Or invite an account id
  5. Now go sign in as that account that you just invited. Accept the invitation. This is what the AWS Organizations console should look like. We have one OU named Developer Accounts. Orgs Console organization view
  6. At this point, our newly invited account needs to be moved to our developers OU. When the account joins our organization, it’s parented by the root of the org.
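A hedged sketch of the CLI commands for these steps (IDs and the email address are placeholders):

# 1. Gather the organization root ID
aws organizations list-roots

# 2. Create the Developer Accounts OU under the root
aws organizations create-organizational-unit --parent-id r-examplerootid111 --name "Developer Accounts"

# 3. Invite an account by email address...
aws organizations invite-account-to-organization --target '{"Type": "EMAIL", "Id": "developer@example.com"}'

# 4. ...or by AWS account ID
aws organizations invite-account-to-organization --target '{"Type": "ACCOUNT", "Id": "111111111111"}'

# 6. Move the newly joined account from the root into the Developer Accounts OU
aws organizations move-account \
  --account-id 111111111111 \
  --source-parent-id r-examplerootid111 \
  --destination-parent-id ou-exampleouid111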

Deep Breath, We’ve done it! 🎉🎊

  1. Check your permissions. I have some credentials stashed away in ~/.aws/credentials for this account (488887740717). When I try to run aws cloudhsmv2 describe-clusters, I now expect to get an AccessDeniedException.
  2. Here’s a final look at the Organizations console with our account placed in the OU. At this point, we can craft additional SCPs or what have you at the OU level, and those SCPs would only apply to the Accounts in the OU. Orgs Console organization view

About the Author

Ed Anderson is the SRE Manager at RealSelf, organizer of ServerlessDays Seattle, and occasional public speaker. Find him on twitter at @edyesed.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


A Hybrid of One: Building a Private Cloud in AWS

12. December 2018

Introduction

Cloud adoption is becoming more and more popular across all types of businesses. When you’re starting out, you have a blank canvas to work from – there’s no existing blueprint or guide. But what if you’re not in that position? What if you’ve already got an established security policy in place, or you’re working in a regulated industry that sets limits of what’s appropriate or acceptable for your company’s IT infrastructure?

Being able to leverage the elasticity of the public cloud is one of its biggest – if not the biggest – advantages over a traditional corporate IT environment. Building a private cloud takes time, money and a significant amount of up-front investment. This investment might not be acceptable to your organisation, or might never generate returns…

But what if we can build a “private” cloud using public cloud services?

The Virtual Private Cloud

If you’ve looked at AWS, you’ll be familiar with the concept of a “VPC” – a “Virtual Private Cloud” – the first resource you’ll create in your AWS account (if you don’t use the default VPC created in every region when your account is created, that is!). It’s private in the sense that it’s your little bubble, to do with as you please. You control it, nurture it and manage it (hopefully with automation tools!). But private doesn’t mean isolated, and this does not fit the definition of a “private cloud.”

If you misconfigure your AWS environment, you can accidentally expose your environment to the public Internet, and an intruder may be able to use this as a stepping-stone into the rest of your network.

In this article, we’re going to look at the building blocks of your own “private” cloud in the AWS environment. We’ll cover isolating your VPC from the public internet, controlling what data enters and, crucially, leaves your cloud, as well as ensuring that your users can get the best out of their new shiny cloud.

Connecting to your “private” Cloud

AWS is most commonly accessed over the Internet. You publish ‘services’ to be consumed by your users. This is how many people think of AWS – a Load balancer with a couple of web servers, some databases and perhaps a bit of email or workflow.

In the “private” world, it’s unlikely you’ll want to provide direct access to your services over the Internet. You need to guarantee the integrity and security of your data. To maximise your use of the new environment you want to make sure it’s as close to your users and the rest of your infrastructure as possible.

AWS has two private connectivity methods you can use for this: DirectConnect and AWS managed VPN.

Both technologies allow you to “extend” your network into AWS. When you create your VPC, you allocate an IP range (that doesn’t clash with your internal network), and you can then establish a site-to-site connection to your new VPC. Any instance or service you spin up in your VPC is accessed directly from your internal network, using its private IP address. It’s just as if a new datacenter appeared on your network. Remember, you can still configure your VPC with an Internet Gateway and allocate Public IP addresses (or Elastic IPs) to your instances, which would then give them both an Internet IP and an IP on your internal network – you probably don’t want to do this!

The AWS managed VPN service allows you to establish a VPN over the Internet between your network(s) and AWS. You’re limited by the speed of your internet connection. Also, you’re accessing your cloud environment over the Internet, with all the variable performance and latency that entails.

The diagram below shows an example of how AWS Managed VPN connectivity interfaces with your network:

AWS DirectConnect allows you to establish a private circuit with AWS (like a traditional “leased line”). Your network traffic never touches the Internet or any other uncontrolled public network. You can directly connect to AWS’ routers at one of their shared facilities, or you can use a third-party service to provide the physical connectivity. The right option depends on your connectivity requirements: directly connecting to AWS means you can own the service end-to-end, but using a third party allows you greater flexibility in how you design the resiliency and the connection speed you want to AWS (DirectConnect offers physical 1GbE or 10GbE connectivity options, but you might want something in between, which is where a third party can really help here).

The diagram below shows an example of how you can architect DirectConnect connectivity between your corporate datacenter and the AWS cloud. DirectConnect also allows you to connect directly to Amazon services over your private connection, if required. This ensures that no traffic traverses the public Internet when you’re accessing AWS hosted services (such as API endpoints, S3, etc.). DirectConnect also allows you to access services across different regions, so you could have your primary infrastructure in eu-west-1 and your DR infrastructure in eu-west-2, and use the same DirectConnect to access both regions.

An overview of DirectConnect connectivity

Both connectivity options offer the same native approach to access control you’re familiar with. Network ACLs (NACLs) and Security Groups function exactly as before – you can reference your internal network IP addresses/CIDR ranges as normal and control service access by IP and port. There’s no NAT in place between your network and AWS; it’s just like another datacenter on your network.

Pro Tip: You probably want to delete your default VPCs. By default, AWS services will launch into the default VPC for a specific region, and this comes configured with the standard AWS template of ‘public/private’ subnets and internet gateways. Deleting the default VPCs and associated security groups makes it slightly harder for someone to spin up a service in the wrong place accidentally.

Workload Segregation

You’re not restricted to a single AWS VPC (by default, you’re able to create 5 per region, but this limit can be increased by contacting AWS support). VPCs make it very easy to isolate services – services you might not want to be accessed directly from your corporate network. You can build a ‘DMZ-like’ structure in your “private” cloud environment.

One good example of this is in the diagram below – you have a “landing zone” VPC where you host services that should be accessible directly from your corporate network (allowing you to create a bastion host environment), and you run your workloads elsewhere – isolated from your internal corporate network. In the example below, we also show an ‘external’ VPC – allowing us to access Internet-based services, as well as providing a secure inbound zone where we can accept incoming connectivity if required (essentially, this is a DMZ network, and can be used for both inbound and outbound traffic).

Through the use of VPC Peering, you can ensure that your workload VPCs can be reached from your inbound-gateway VPC, but as VPCs do not support transitive networking configurations by default, you cannot connect from the internal network directly to your workload VPC.

Multi-VPC peering and PrivateLink

Shared Services

Once your connectivity between your corporate network and AWS is established, you’ll want to deploy some services. Sure, spinning up an EC2 instance and connecting to it is easy, but what if you need to connect to an authentication service such as LDAP or Active Directory? Do you need to route your access via an on-premise web proxy server? Or, what if you want to publish services to the rest of your AWS environment or your corporate network but keep them isolated in your DMZ VPC?

Enter AWS PrivateLink: Launched at re:Invent in 2017, it allows you to “publish” a Network Load Balancer to other VPCs or other AWS Accounts without needing to establish VPC peering. It’s commonly used to expose specific services or to supply MarketPlace services (“SaaS” offerings) without needing to provide any more connectivity over and above precisely what your service requires.

We’re going to offer an example here of using PrivateLink to expose access to an AWS hosted web proxy server to our isolated inbound and workload VPCs. This gives you the ability to keep sensitive services isolated from the rest of your network but still provide essential functionality. AWS prohibit transitive VPCs for network traffic (i.e., you cannot route from VPC A to VPC C via a shared VPC B) but PrivateLink allows you to work around this limitation for individual services (basically, anything you can “hide” behind a Network Load Balancer).

Assuming we’ve created the network architecture as per the diagram above, we need to create our Network Load Balancer first. NLBs are the only load balancer type supported by PrivateLink at present.

Creating the Network Load Balancer

Once this is complete, we can then create our ‘Endpoint Service,’ which is in the VPC section of the console:

Creating the Endpoint Service

Once the Endpoint Service is created, take note of the Endpoint Service Name; you’ll need this to create the actual endpoints in your VPCs.

Endpoint Service details

The Endpoint Service Name is unique across all VPC endpoints in a specific region. This means you can share this with other accounts, which are then able to discover your endpoint service. By default, you need to accept all requests to your endpoint manually, but this can be disabled (you probably don’t want this, though!). You can also whitelist specific account IDs that are allowed to create a PrivateLink connection to your endpoint.

Once your Endpoint Service is created, you then need to expose this into your VPCs. This is done from the ‘Endpoints’ configuration screen under VPCs in the AWS console. Validate your endpoint service name and select the VPC required – simple!

Endpoint details

 

You can then use the DNS name shown in the endpoint details to reference your VPC endpoint. It will resolve to an IP address in your VPC (via an Elastic Network Interface), but traffic to this endpoint will be routed directly across the Amazon network to the Network Load Balancer.
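For reference, the same steps can be scripted; here is a hedged sketch with the AWS CLI (all ARNs, IDs and service names are placeholders):

# Publish the NLB as an endpoint service (acceptance required by default)
aws ec2 create-vpc-endpoint-service-configuration \
  --network-load-balancer-arns arn:aws:elasticloadbalancing:eu-west-1:111111111111:loadbalancer/net/proxy-nlb/1234567890abcdef \
  --acceptance-required

# Consume it from another VPC as an interface endpoint
aws ec2 create-vpc-endpoint \
  --vpc-endpoint-type Interface \
  --vpc-id vpc-0123456789abcdef0 \
  --service-name com.amazonaws.vpce.eu-west-1.vpce-svc-0123456789abcdef0 \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0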

What’s in a Name?

Typically, one of the biggest hurdles with connecting between your internal network and AWS is the ability to route DNS queries correctly. DNS is key to many Amazon services, and Amazon Provided DNS (now Route53 Resolver) contains a significant amount of behind-the-scenes intelligence, such as allowing you to reach the correct Availability Zone target for your ALB or EFS mount point.

Hot off the press is the launch of Route53 Resolver, which removes the need to create your own DNS infrastructure to route requests between your AWS network and your internal network, while allowing you to continue to leverage the intelligence built into the Amazon DNS service. Previously, you would need to build your own DNS forwarder on an EC2 instance to route queries to your corporate network. This meant that, from the AWS perspective, all your DNS requests originated from a single server in a specific AZ (which might be different from the AZ of the client system), and so you’d end up getting the endpoint in a different Availability Zone for your service. With a service such as EFS, this could result in increased latency and a high cross-AZ data transfer bill.

Here’s an example of how the Route53 resolver automatically picks the correct mount point target based on the location of your client system:

Pro Tip: If you’re using a lot of standardised endpoint services (such as proxy servers), using a common DNS name which can be used across VPCs is a real time-saver. This requires you to create a Route53 internal zone for each VPC (such as workload.example.com, inbound.example.com) and update the VPC DHCP Option Set to hand out this domain name via DHCP to your instances. This then allows you to create a record in each zone with a CNAME to the endpoint service, for example:

From an instance in our workload VPC:

And the same commands from an instance in our inbound VPC:

In the example above, we could use our configuration management system to set the http_proxy environment variable to ‘proxy.privatelink:3128’ and not have to configure any per-VPC-specific logic. Neat!
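As a rough illustration of the pro tip, here is a hedged Terraform sketch (the zone name, VPC ID and endpoint DNS name are placeholders, the DHCP Option Set change is not shown, and the exact arguments depend on your provider version):

resource "aws_route53_zone" "privatelink" {
  name   = "privatelink"
  vpc_id = "vpc-0123456789abcdef0"   # one private zone per VPC
}

resource "aws_route53_record" "proxy" {
  zone_id = "${aws_route53_zone.privatelink.zone_id}"
  name    = "proxy.privatelink"
  type    = "CNAME"
  ttl     = 300
  records = ["vpce-0123-abcd.vpce-svc-0123456789abcdef0.eu-west-1.vpce.amazonaws.com"]
}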

Closing Notes

There are still AWS services that expect to have Internet access available from your VPC by default. One example of this is AWS Fargate – the Amazon-hosted and managed container deployment solution. However, Amazon is constantly migrating more and more services to PrivateLink, meaning this restriction is slowly going away.

A full list of currently available VPC endpoint services is available in the VPC Endpoint documentation. AWS provided VPC Endpoints also give you the option to update DNS to return the VPC endpoint IPs when you resolve the relevant AWS endpoint service name (i.e. ec2.eu-west-1.amazonaws.com -> vpce-123-abc.ec2.eu-west-1.amazonaws.com -> 10.10.0.123) so you do not have to make any changes to your applications in order to use the Amazon provided endpoints.

About the Author

Jon is a freelance cloud devoperative buzzword-hater, currently governing the clouds for a financial investment company in London, helping them expand their research activities into “the cloud.”

Before branching out into the big bad world of corporate consulting, Jon spent five years at Red Hat, focusing on the financial services sector as a Technical Account Manager, and then as an on-site consultant.

When he’s not yelling at the cloud, Jon is a trustee of the charity Service By Emergency Rider Volunteers – Surrey & South London, the “Blood Runners,” who provide free out-of-hours transport services to the UK National Health Service. He is also guardian to two small dogs and a flock of chickens.

Feel free to shout at him on Twitter, or send an old-fashioned email.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Last Minute Naughty/Nice Updates to Santa’s List

11. December 2018


The other day I was having a drink with my friend Alabaster. You might have heard of him before, but if not, you’ve heard of his boss, Santa. Alabaster is a highly educated elf who has been the Administrator of the Naughty and Nice list for more than five decades now. It’s been his responsibility to manage it since it was still a paper list. He’s known for moving the list to a computer. Last year he moved it to AWS DynamoDB.

“It went great!” he told me with a wince that made me unsure of what he meant. “But then on the 23rd of December, we lost some kids.”

“What?! What do you mean, you lost some kids?”, I asked.

“Well. Let me explain. The process is a little complicated.

Migrating the list to AWS was more than just migrating data. We also had to change the way we manage the naughty and nice list. Before, with our own infrastructure, we didn’t care about how much resource we used, as long as the infrastructure could handle it. We were updating the list five times a minute per kid. At 1.8 billion kids that was around 150 million requests per second – constant, and easy to manage.

Bushy, the elf that made the toy-making machine, is the kind of person that thinks a half-full glass is just twice the size it should be. Bushy pointed out that information about whether a child was naughty or not and their location for Christmas was only needed on December 24th. He proposed that we didn’t need to be updating the information as frequently.

So we made changes in how we updated the data. It was a big relief but it resulted in a spiky load. In December, we suddenly found ourselves with a lot of data to update. 1.8 billion records to be exact. And it failed. The autoscaling of DynamoDB mostly worked with some manual fiddling to keep increasing the number of writers fast enough. But on December 23rd we had our usual all hands on deck meeting on last-minute changes of behaviour for kids and no one was reacting to the throttling alarms. We didn’t notice until the 25th. By then some records had been lost, some gifts had been delivered to the wrong addresses.

Some kids stopped believing in Santa because someone else actually delivered their gifts late! It was the worst mistake of my career.”

“Oh, thank goodness you didn’t literally lose some kids! But, oh wow. Losing so many kid’s trust and belief must have really impacted morale at the North Pole! That sounds incredibly painful. So what did you learn and how has it changed the process for this year?” I asked.

“Well, the main difference is that we decoupled the writes. DynamoDB likes regular writes and can scale in a reasonable way if the traffic is not all peaks or increasing really fast.

So we send all the information to SQS and then use Lambdas to process the writes. That gives us two ways of keeping control without risking a failed write: we can throttle the writes by changing the Lambda concurrency, and we can control the write capacity needed either with auto-scaling or manually.”

“That looks like an interesting way of smoothing a spiky load. Can you share an example?” I asked.

“I can show you the Lambda code; I’ve just been playing with it.” He turned his laptop towards me, showing me the code. It was nearly empty: just a process_event function that did a write via boto3.

“That’s it?” I asked.

“Yes, we use zappa for it, so it’s mostly configuration,” he replied.

We paired at the conference hotel bar, as you do when you find an interesting technical solution. First, Alabaster told me I had to create an SQS queue. We visited the SQS console. The main issue was that it looks like AWS has a completely different UI for the north-pole-1 region (which, to be honest, I didn’t know existed). I already had Python 3.6 set up, so I only needed to create a virtual environment with python -m venv sqs-test and activate it with . sqs-test/bin/activate.

Then, he asked me to install zappa with pip install zappa. We created a file zappa_settings.json starting with the following as a base (you can use zappa init but you’ll then need to customise it for a non-webapp use-case):
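The base file isn’t reproduced here, but a hedged sketch of what it might contain (the project name, bucket, queue ARN and region are placeholders, and the exact set of settings depends on your Zappa version) is:

{
    "dev": {
        "project_name": "santa-queue",
        "runtime": "python3.6",
        "profile_name": "default",
        "aws_region": "eu-west-1",
        "s3_bucket": "zappa-santa-queue-deploys",
        "events": [
            {
                "function": "app.process_event",
                "event_source": {
                    "arn": "arn:aws:sqs:eu-west-1:111111111111:santa-queue",
                    "batch_size": 1,
                    "enabled": true
                }
            }
        ]
    }
}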

I changed the profile_name and aws_region to match my credentials configuration and also the s3_bucket and the event_source arn to match my newly created SQS queue (as I don’t have access to the north-pole-1 region).

We then just sorted out a baseline with a minimalistic app.py:
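Something along these lines – a hedged sketch of what the minimal handler might have looked like:

def process_event(event, context):
    # Print the incoming SQS event and the Lambda context object so that both
    # show up in CloudWatch Logs; the DynamoDB write via boto3 would go here later.
    print(event)
    print(context)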

This code shows the event and context data in CloudWatch Logs. Alabaster explained that I could get quick access to those logs using zappa tail. Then I could extend it to write to the naughty-nice list in DynamoDB, or to whatever system whose activity I want to limit.

Alabaster showed me the North Pole’s working implementation, including how they had set up the throttling alarms in CloudWatch and the Lambda concurrency configuration in the Lambda console (choose a function, go to the “Concurrency” panel, click “Reserve concurrency” and set the number to 1 – then increase as needed). While a burst of a million updates was handled with some delay, there was no data loss. I could see the pride in his face.

Hoping everything goes well for him this season, and that all have a good Christmas, and a good night!

About the Author

João Miguel Neves is a Lead Developer at POP https://www.wegotpop.com/, a company that manages people on movie productions. He also writes about python and cloud on his blog https://silvaneves.org/

About the Editors

Ed Anderson is the SRE Manager at RealSelf, organizer of ServerlessDays Seattle, and occasional public speaker. Find him on twitter at @edyesed.

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


AWS SAM – My Exciting First Open Source Experience

10. December 2018

My first acquaintance with AWS Cloud happened through a wonderful tool – SAM CLI. It was a lot of fun playing around with it – writing straightforward YAML-based resource templates, deploying them to AWS Cloud or simply invoking Lambda functions locally. But after trying to debug my .NET Core Lambda function, I realized that it was not yet supported by the CLI. At this point, I started looking through the open issues and found this one that described exactly what I was experiencing. It turned out to be a hot topic, so it got me interested instantly. As I reviewed the thread to get additional context about the issue, I learned that the project was looking for help.

The ability to debug Lambda on .NET locally was critical for me to fully understand how exactly my function was going to be invoked in the Cloud, and to troubleshoot future problems. I decided to answer the call to help and investigate the problem further. Feeling like Superman come to save the day, I posted a comment on the thread to say that, “I’m looking into this too.” It took no more than 10 minutes to receive a welcoming reply from members of the core team. When you receive a reply in a matter of minutes, you definitely understand that the project is truly alive and the maintainers are open to collaboration – especially since being hosted on GitHub does not necessarily mean that a project is intended for contribution and community support. Feeling inspired, I began my investigation.

The Problem

Unfortunately, the .NET Core host program doesn’t have a “debug mode” switch (you can track the issue here), meaning that there is no way to start a .NET program so that it immediately pauses after launching, waits for a debugger to attach, and only then proceeds (Java, using JDWP and its suspend flag, and Node.js are both capable of this), so SAM needed some way to work around it. After several days of investigation and prototyping, I came up with a working solution that attached a remote debugger to the .NET Lambda runner inside the running Docker container (to execute an AWS Lambda function locally, SAM internally spins up a Docker container with a Lambda-like environment).

 

I joined the AWS Serverless Application Slack organization to collaborate with the maintainers and understand more about the current CLI design. Thanks to the very welcoming community on the Slack channel, I almost immediately started to feel like I was part of the SAM team! At this point, I was ready to write up a POC of .NET Core debugging with SAM. I have a README.md you can look through which explains my solution in greater detail.

 

My POC was reviewed by the maintainers of SAM CLI. Together we discussed and resolved all open questions, shaped up the design to perfection and agreed upon the approach. To streamline the process and break it down, one of the members of the core team suggested that I propose a solution and identify the set of tasks through a design document. I had no problems doing so because the SAM CLI project has a well-defined and structured template for writing this kind of document. The core team reviewed and approved it. Everything was set up and ready to go, and I started implementing the feature.

Solution

Docker Lambda

My first target was the Docker Lambda repo. More specifically, it was the implementation of a “start in break mode” flag for the .NET Core 2.0 and 2.1 runners (implementation details here).

I must admit that this was my very first open source pull request, and it turned out to be overkill, so my apologies to the maintainer. The PR included many more changes than were required to solve the problem; the unnecessary ones were code-style refactoring and minor improvements. Don’t get me wrong – best-practice and refactoring changes are neither undesired nor unwelcome, but they should go in a separate PR if you want to be a good citizen in the open source community. With that lesson learned, I opened a slim version of my original PR with only the required code changes and detailed explanations of them. SAM CLI is up next on our list.

 

 

AWS SAM CLI

Thanks to the awesome development guide in the SAM CLI repo, it was super easy to set up the environment and get going. Even though I am not a Python guy, the development felt smooth and straightforward. The CLI codebase has a fine-grained modular architecture, and everything was clearly named and documented, so I faced no problems getting around the project. The configured Flake8 and Pylint linters kept an eye on my changes, so following the project’s code-style guidelines was just a matter of fixing warnings as they appeared. And of course, decent unit-test coverage (97% at the time of writing) not only helped me rapidly understand how each component worked within the system, but also made me feel rock-solid confident as I introduced changes.

 

The core team and contributors all deserve a great thumbs up 👍 for keeping the project in such a wonderful and contribution-friendly state!

 

However, I did encounter some trouble running unit and integration tests locally on my beloved Windows machine. I got plenty of unexpected Python bugs, “windows-not-supported” limitations and stuff like that along the way; at one point I was in complete despair. But help comes if you seek it. After asking the community for guidance, we collectively came up with a Docker solution: running the tests inside an Ubuntu container. Finally, I was able to set up my local environment and run all the required tests before submitting the pull request. Ensuring that you haven’t broken anything is critical when collaborating on any kind of project, especially open source!

Following my initial approach to .NET Core debugging, I implemented a --container-name feature, which would have allowed SAM CLI users to specify a name for the running Lambda container so they could identify it later and attach to it. During review, the core team found some corner cases where this approach introduced limitations and made the debugging experience inconsistent with the other runtimes, so I started to look for possible workarounds.

I looked at the problem from a different perspective and came up with a solution that enabled .NET Core debugging in AWS SAM CLI with almost no changes 🎉 (more about this approach here and here). The team immediately loved it, because it closely aligned with the way the other runtimes currently handle debugging, providing a consistent experience for users. I am happy that I got valuable feedback from the community, because exactly that got me thinking about ways to improve my solution. Constructive criticism, when handled properly, makes perfection. Now the feature is merged into develop and is waiting for the next release to come into play! Thanks to everyone who took part in this amazing open source journey to better software!

Conclusion

Bottom line, I can say that I had a smooth and pleasant first open source experience. I’m happy that it was with SAM, as I had a super interesting time collaborating with other passionate SAM developers and contributors across the globe through GitHub threads, emails and Slack. I liked how rapid and reactive all of those interactions were – it was a great experience writing up design docs for various purposes and being part of a strong team collaborating to reach a common goal. It was both challenging and fun to quickly explore the SAM codebase and then follow its fully automated code-style rules. And, lastly, it was inspiring to experience how easy it is to contribute to a big and very well-known open source project like the SAM framework.

 

Based on my experience, I came up with this battle-tested and bulletproof checklist:

  1. Before getting your hands dirty with code, take your time to engage in discussion with the core team, make sure you understand the issue, and gather as much context on it as possible 🔍
  2. Write up a descriptive design doc that explains the problem (or feature) and the proposed solution (or implementation) in detail, especially if that’s common practice on the project you’re contributing to
  3. Don’t be afraid or embarrassed to ask for help from the community or the maintainers if you’re stuck. Everyone wants you to succeed 🎉
  4. Unit tests are a great source of truth in decent projects, so look through them to get a better understanding of all the moving parts. Cover your changes with the required tests, as these tests also help future contributors (which could be you)!
  5. Always try your best to match the project’s code style – thanks to modern linters this one is a piece of 🍰
  6. Keep your code changes small and concise. Don’t try to pack every code-style and best-practice change into a tiny PR intended to fix a minor bug
  7. The PR review process is a conversation between you and the other collaborators about the changes you’ve made. Reading every comment you receive carefully helps build a common understanding of the problem and avoid confusion. Never take those comments personally, and stay polite and respectful while working through them. Don’t be afraid to disagree or have a constructive discussion with reviewers. Try to come to a mutual agreement in the end and, if required, make the requested changes
  8. Try not to force your pull request through, as the team may have important tasks at hand or approaching release deadlines. Remember that good pull requests get pulled and not pushed.

 

It’s good to know that anyone from the community can have an impact just by leaving a comment or fixing a small bug. It’s cool that new features can be brought in by the community with the help of the core team. Together we can build better open source software!

 

Based on my experience, SAM is a good and welcoming project to contribute to, thanks to its well-done design and awesome community. If you have any doubts, just give it a try. SAM could be a perfect start for your open source career 😉

 

When you start it is hard to stop: #795, #828, #829, #843

Contribute with caution

Thank You!

I want to give special thanks to @sanathkr, @jfuss, @TheSriram, @mikemorain and @mhart for supporting me continuously along this path.

 

About the Author

My name is Nikita Dobriansky, from sunny Odessa, Ukraine. Throughout my career in software development, I’ve been working with C# desktop applications. Currently, I’m a C# engineer at Lohika, building bleeding-edge UWP apps. I am super excited about .NET Core and container-based applications. For me, AWS is the key to a better serverless ☁️ future.

In my free time, I enjoy listening to great music and making some myself 🎸

@ndobryanskyy

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Using Infrastructure as Code as a Poor Man’s DR

09. December 2018 2018 0

What is DR?

Let’s start by setting the context for what I mean by Disaster Recovery (DR). There are different interpretations of DR and High Availability, with a very thin and moving line between the two. Here I am specifically referring to the ability to recover your infrastructure from a disaster such as an AWS region becoming unavailable. I am not talking about situations where immediate failover is needed.

What is Infrastructure as Code

Infrastructure as Code (IaC) has been around for a while now, but many people are just starting to fully embrace it and see its benefits. Like DR, it is something people have taken to mean many different things over the years. Some refer to BASH scripts that generate KVM VMs as IaC; while that is technically code creating infrastructure, it is not what I mean here. I am explicitly talking about tools such as Terraform that are designed to generate infrastructure from a configuration file.

Why IaC for DR?

Go to any company or organization that does not have a viable DR strategy and ask them why that is the case. Nine times out of ten, the answer will relate to cost. That makes sense: a true DR environment can be very expensive, and for people who are not technical and have never experienced a true IT disaster, it can be tough to comprehend why it is all needed. These factors make it very difficult for IT to get approval to put a secondary environment in place. This is where IaC comes in.

If your IaC is properly set up, you can essentially get DR for free. How? If a disaster takes out your infrastructure, you just re-run your IaC tooling against a healthy region, and you have your infrastructure back.

But wait, my IaC tool deploys to the AWS region that is down

If your tool is improperly configured, you may not get the DR benefits of IaC. You need to make sure that you abstract the provider and region out of the actual infrastructure configuration. This allows you to quickly change the region that you are pointing at and re-deploy. For example, in Terraform you would want a separate provider.tf file containing a provider block with the region specified, like this: provider "aws" { region = "eu-west-1" }. This lets you change a single line and re-deploy your exact infrastructure to another region, as opposed to having the region information embedded in individual .tf files, which unfortunately I see floating around pretty often.
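A minimal sketch of what that separation might look like (the file layout and variable name here are illustrative, not prescriptive):

    # provider.tf -- the only place the deployment region is named
    variable "region" {
      default = "eu-west-1"
    }

    provider "aws" {
      region = "${var.region}"
    }

With the region held in a single variable, failing over to another region is a matter of overriding one value (for example, terraform apply -var 'region=eu-central-1') rather than editing every .tf file.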

What if all of AWS (or GCP, or Azure) completely goes down and not just a region?

A complete outage of a service provider is another concern that I hear from time to time. I have a couple of different thoughts about this scenario.

My first thought is that the chances of that happening are so vanishingly small that it is hardly worth thinking about. If all of AWS is down, you likely have something more serious to worry about, like worldwide thermonuclear war. However, as engineers, sysadmins, and other assorted IT professionals, we have a habit of being unable to stop thinking about these extreme cases.

For those situations, you can have standby code. What do I mean by this? I mean that you can develop code that deploys the equivalent of your current infrastructure in another environment. This is obviously time-consuming, and since none of us have a ton of spare time, that is a cost, and one that I personally don’t think is worth it. But it’s possible, and it’s up to each reader to decide whether it is needed for their environment.

Ok, I have my infrastructure back, what about my data?

Well, you are still doing backups, right? I am making a case for replacing a dedicated DR environment with code; I am not making a case for throwing basic common sense out the window.

That being said, there are times where it would take an impractical amount of time to restore data from backups just because there was a 1-2 hour outage. Especially when you can re-deploy your infrastructure from code to another region in minutes.

This is where I advocate for a hybrid approach between a complete IaC DR plan and a traditional DR setup. In this type of solution, you keep a replicated database (or other data source) running at all times in the region you plan to use for DR. Then, if disaster strikes, your data is sitting there waiting for you to deploy the networking and compute resources to access it.

Since this does require keeping some infrastructure running at all times, it does cost some money. However, it will cost far less than having a whole second DR site sitting around waiting and may be an easier pill for the people that have to spend the money to swallow.

Conclusion

I hope that after reading this article you have a sense of how IaC can serve as a DR environment in places where a dedicated one would otherwise be unaffordable. I also hope you see the benefits of this approach in situations where a full disaster recovery environment is possible but probably not needed; perhaps that money could be better spent elsewhere if you already have IaC in place to cover the worst-case scenario.

What’s next

This article introduces the DR topic to get those that read it thinking about using Infrastructure as Code as a possible disaster recovery plan and solution. It just begins to scratch the surface of what is possible and the different considerations that need to be made.

Please visit my website at https://www.adair.tech over the next several weeks as I will publish follow up articles there that will delve further into details, tools, and specific plans for accomplishing this. You can also contact me via my website or Twitter to have a 1-1 conversation on the topic and explore your particular use case in more depth.

About the Author

Brad Adair is an experienced IT professional with over a decade of experience in systems engineering and administration, cloud engineering and architecture, and IT management. He is the President of Adair Technology, LLC., which is a Columbus based IT consulting firm specializing in AWS architecture and other IT infrastructure consulting. He is also an AWS Certified Solutions Architect. Outside of the office he enjoys sports, politics, Disney World, and spending time with his wife and kids.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Multi-region Serverless APIs: I’ve got a fever and the only cure is fewer servers

08. December 2018 2018 0

Meet SAM

Let’s talk about the hottest thing in computers for the past few years. No, not Machine Learning. No, not Kubernetes. No, not big data. Fine, one of the hottest things in computers. Right, serverless!

It’s still an emerging and quickly changing field, but I’d like to take some time to demonstrate how easy it is to make scalable and reliable multi-region APIs using just a few serverless tools and services.

It’s actually deceptively simple. Well, for a “blog post”-level application, anyway.

We’re going to be managing this application using the wonderful AWS Serverless Application Model (SAM) and SAM CLI. Far and away the easiest way I have ever used for creating and deploying serverless applications. And, in keeping with contemporary practices, it even has a cute little animal mascot.

SAM is a feature of CloudFormation that provides a handful of shorthand resources that get expanded out to their equivalent longhand CloudFormation resources upon ChangeSet calculation. You can also drop down into regular CloudFormation whenever you need to manage resources and configurations not covered by SAM.

The SAM CLI is a local CLI application for developing, testing, and deploying your SAM applications. It uses Docker under the hood to provide as close to a Lambda execution environment as possible and even allows you to run your APIs locally in an APIGateway-like environment. It’s pretty great, IMO.

So if you’re following along, go ahead and install Docker and the SAM CLI and we can get started.

The SAMple App

Once that’s installed, let’s generate a sample application so we can see what it’s all about. If you’re following along on the terminal, you can run sam init -n hello-sam -r nodejs8.10 to generate a sample node app called hello-sam. You can also see the output in the hello-sam-1 folder in the linked repo if you aren’t at a terminal and just want to read along.

The first thing to notice is the README.md that is full of a huge amount of information about the repo. For the sake of brevity, I’m going to leave learning the basics of SAM and the repo structure you’re looking at as a bit of an exercise for the reader. The README and linked documentation can tell you anything you need to know.

The important thing to know is that hello_world/ contains the code and template.yaml contains a special SAM-flavored CloudFormation template that controls the application. Take some time to familiarize yourself with it if you want to.

SAM Local

So what can SAM do, other than give us very short CFN templates? Well the SAM CLI can do a lot to help you in your local development process. Let’s try it out.

Step 0 is to install your npm dependencies so your function can execute:
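Assuming the default layout that sam init generates, that probably looks like:

    cd hello-sam/hello_world
    npm install
    cd ..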

Alright, now let’s have some fun.

There are a few local invocation commands that I won’t cover here because we’re making an API. The real magic with the CLI is that you can run your API locally with sam local start-api. This will inspect your template, identify your API schema, start a local API Gateway, and mount your functions at the correct paths. It’s by no means a perfect replica of running in production, but it actually does a surprisingly great job.

When we start the API, it will mount our function at /hello, following the path specified in the Events attribute of the resource.

Now you can go ahead and curl against the advertised port and path to execute your function.
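By default, sam local start-api listens on port 3000, so the call looks something like:

    curl http://127.0.0.1:3000/hello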

Then on the backend, you’ll see it executing your function in Docker:

You can change your code at-will and the next invocation will pick it up. You can also attach a debugger to the process if you aren’t a “debug statement developer.”

Want to try to deploy it? I guess we might as well – it’s easy enough. The only pre-requisite is that we need an S3 bucket for our uploaded code artifact. So go ahead and make that – call it whatever you like.

Now, we’ll run a sam package. This will bundle up the code for all of your functions and upload it to S3. It’ll spit out a rendered “deployment template” that has the local CodeUris swapped out for S3 URLs.
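The packaging step, with the bucket name standing in for whatever you created above:

    sam package \
      --template-file template.yaml \
      --s3-bucket your-artifact-bucket \
      --output-template-file deploy-template.yaml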

If you check out deploy-template.yaml, you should see…very few remarkable differences. Maybe some of the properties have been re-ordered or blank lines removed. But the only real difference you should see is that the relative CodeUri: hello_world/ for your function has been resolved to an S3 URL for deployment.

Now let’s go ahead and deploy it!
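Deployment is one more command; the stack name is whatever you like:

    sam deploy \
      --template-file deploy-template.yaml \
      --stack-name hello-sam \
      --capabilities CAPABILITY_IAM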

Lastly, let’s find the URL for our API so we can try it out:
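The generated sample template exposes the endpoint as a stack output, so one way to find it (assuming those default outputs are still in place) is:

    aws cloudformation describe-stacks \
      --stack-name hello-sam \
      --query 'Stacks[0].Outputs'

The implicit API is deployed to a stage named Prod, so the URL ends in /Prod/hello.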

Cool, let’s try it:

Nice work! Now that you know how SAM works, let’s make it do some real work for us.

State and Data

We’re planning on taking this multi-region by the end of this post. Deploying a multi-region application with no state or data is both easy and boring. Let’s do something interesting and add some data to our application. For the purposes of this post, let’s do something simple like storing a per-IP hit counter in DynamoDB.

We’ll go through the steps below, but if you want to jump right to done, check out the hello-sam-2 folder in this repository.

SAM offers a SimpleTable resource that creates a very simple DynamoDB table. This is technically fine for our use-case now, but we’ll need to be able to enable Table Streams in the future to go multi-region. So we’ll need to use the regular DynamoDB::Table resource:
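A sketch of that resource; the table name, key attribute, and throughput values are my assumptions (Global Tables require DynamoDB Streams and an identical table name in every region):

    HitsTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: HitsTable
        AttributeDefinitions:
          - AttributeName: ip
            AttributeType: S
        KeySchema:
          - AttributeName: ip
            KeyType: HASH
        ProvisionedThroughput:
          ReadCapacityUnits: 1
          WriteCapacityUnits: 1
        StreamSpecification:
          StreamViewType: NEW_AND_OLD_IMAGES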

We can use environment variables to let our functions know what our table is named instead of hard-coding it in code. Let’s add an environment variable up in the Globals section. This ensures that any functions we may add in the future automatically have access to this as well.

Change your Globals section to look like:
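Something along these lines; TABLE_NAME and the HitsTable logical ID are naming choices of mine, and Timeout: 3 comes from the generated sample:

    Globals:
      Function:
        Timeout: 3
        Environment:
          Variables:
            TABLE_NAME: !Ref HitsTable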

Lastly, we’ll need to give our existing function access to update and read items from the table. We’ll do that by setting the Policies attribute of the resource, which turns into the execution role. We’ll give the function UpdateItem and GetItem. When you’re done, the resource should look like:
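A sketch of the function resource with an inline policy statement granting just those two actions (handler, path, and method mirror the generated sample; adjust to match your template):

    HelloWorldFunction:
      Type: AWS::Serverless::Function
      Properties:
        CodeUri: hello_world/
        Handler: app.lambdaHandler
        Runtime: nodejs8.10
        Policies:
          - Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - dynamodb:UpdateItem
                  - dynamodb:GetItem
                Resource: !GetAtt HitsTable.Arn
        Events:
          HelloWorld:
            Type: Api
            Properties:
              Path: /hello
              Method: get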

Now let’s have our function start using the table. Crack open hello_world/app.js and replace the content with:
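A sketch of what the handler might look like (the hits/ip attribute names and the message format are assumptions, not the author’s exact code):

    // hello_world/app.js
    const AWS = require('aws-sdk');
    const dynamodb = new AWS.DynamoDB.DocumentClient();

    exports.lambdaHandler = async (event) => {
        // The caller's IP arrives in the API Gateway proxy event
        const ip = event.requestContext.identity.sourceIp;

        // Atomically increment this IP's counter and read back the new total
        const result = await dynamodb.update({
            TableName: process.env.TABLE_NAME,
            Key: { ip },
            UpdateExpression: 'ADD hits :incr',
            ExpressionAttributeValues: { ':incr': 1 },
            ReturnValues: 'UPDATED_NEW'
        }).promise();

        return {
            statusCode: 200,
            body: JSON.stringify({
                message: `Hello ${ip}, you have hit this API ${result.Attributes.hits} times.`
            })
        };
    };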

On each request, our function will read the requester’s IP from the event, increment a counter for that IP in the DynamoDB table, and return the total number of hits for that IP to the user.

This’ll probably work in production, but we want to be diligent and test it because we’re responsible, right? Normally, I’d recommend you spin up a DynamoDB-local Docker container, but to keep things simple for the purposes of this post, let’s create a “local dev” table in our AWS account called HitsTableLocal.

And now let’s update our function to use that table when we’re executing locally. We can use the AWS_SAM_LOCAL environment variable to determine if we’re running locally or not. Toss this at the top of your app.js to select that table when running locally:
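Something like the following, with the handler then reading tableName instead of process.env.TABLE_NAME directly (the variable name is mine):

    // sam local sets AWS_SAM_LOCAL=true inside the Docker container
    const tableName = process.env.AWS_SAM_LOCAL === 'true'
        ? 'HitsTableLocal'
        : process.env.TABLE_NAME;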

Now let’s give it a shot! Fire up the app with sam local start-api and let’s do some curls.

Nice! Now let’s deploy it and try it out for real.

Not bad. Not bad at all. Now let’s take this show on the road!

Going Global

Now we’ve got an application, and it even has data that our users expect to be present. Now let’s go multi-region! There are a couple of different features that will underpin our ability to do this.

First is the API Gateway Regional Custom Domain. We need to use the same custom domain name in multiple regions, so the edge-optimized custom domain won’t cut it for us since it uses CloudFront. The regional endpoint will work for us, though.

Next, we’ll hook those regional endpoints up to Route53 Latency Records in order to do closest-region routing and automatic failover.

Lastly, we need a way to synchronize our DynamoDB tables between our regions so we can keep those counters up-to-date. That’s where DynamoDB Global Tables come in to do their magic. They keep identically-named tables in multiple regions in sync with low latency and high accuracy, using DynamoDB Streams under the hood and ‘last writer wins’ conflict resolution. That probably isn’t perfect, but it’s good enough for most uses.

We’ve got a lot to get through here. I’m going to try to keep this as short and as clear as possible. If you want to jump right to the code, you can find it in the hello-sam-3 directory of the repo.

First things first, let’s add in our regional custom domain and map it to our API. Since we’re going to be using a custom domain name, we’ll need a Route53 Hosted Zone for a domain we control. I’m going to pass through the domain name and Hosted Zone Id via a CloudFormation parameter and use it below. When you deploy, you’ll need to supply your own values for these parameters.

Toss this at the top of template.yaml to define the parameters:
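For example (parameter names are mine; pick whatever you prefer and reference them consistently below):

    Parameters:
      DomainName:
        Type: String
        Description: Custom domain for the API, e.g. api.example.com
      HostedZoneId:
        Type: AWS::Route53::HostedZone::Id
        Description: Route53 hosted zone the domain lives in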

Now we can create our custom domain, provision a TLS certificate for it, and configure the base path mapping to add our API to the custom domain – put this in the Resources section:
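A sketch of those three resources; the logical IDs are mine, and the mapping points at the implicit API’s default Prod stage:

    ApiCertificate:
      Type: AWS::CertificateManager::Certificate
      Properties:
        DomainName: !Ref DomainName
        ValidationMethod: DNS

    ApiCustomDomain:
      Type: AWS::ApiGateway::DomainName
      Properties:
        DomainName: !Ref DomainName
        RegionalCertificateArn: !Ref ApiCertificate
        EndpointConfiguration:
          Types:
            - REGIONAL

    ApiBasePathMapping:
      Type: AWS::ApiGateway::BasePathMapping
      Properties:
        DomainName: !Ref ApiCustomDomain
        RestApiId: !Ref ServerlessRestApi
        Stage: Prod

Note that the certificate will sit in CREATE_IN_PROGRESS until its DNS validation record exists, which is what the note about validation CNAMEs further down refers to.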

That !Ref ServerlessRestApi references the implicit API Gateway that is created as part of the AWS::Serverless::Function Event object.

Next, we want to assign each regional custom domain to a specific Route53 record. This will allow us to perform latency-based routing and regional failover through the use of custom healthchecks. Let’s put in a few more resources:
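Roughly like this: a health check that polls the regional API endpoint directly, and a latency record per region that aliases the regional custom domain (logical IDs and the health check path are assumptions):

    ApiHealthCheck:
      Type: AWS::Route53::HealthCheck
      Properties:
        HealthCheckConfig:
          Type: HTTPS
          FullyQualifiedDomainName: !Sub "${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com"
          ResourcePath: /Prod/health
          RequestInterval: 30
          FailureThreshold: 3

    ApiRegionalRecord:
      Type: AWS::Route53::RecordSet
      Properties:
        HostedZoneId: !Ref HostedZoneId
        Name: !Ref DomainName
        Type: A
        # Region plus SetIdentifier turns this into a latency-based record
        Region: !Ref AWS::Region
        SetIdentifier: !Sub "api-${AWS::Region}"
        HealthCheckId: !Ref ApiHealthCheck
        AliasTarget:
          DNSName: !GetAtt ApiCustomDomain.RegionalDomainName
          HostedZoneId: !GetAtt ApiCustomDomain.RegionalHostedZoneId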

The AWS::Route53::RecordSet resource creates a DNS record and assigns it to a specific AWS region. When your users query for your record, they will get the value for the region closest to them. This record also has an AWS::Route53::HealthCheck attached to it. The healthcheck will check your regional endpoint every 30 seconds; if the endpoint has gone down, Route53 will stop considering that record when a user queries for your domain name.

Our Route53 Healthcheck is looking at /health on our API, so we’d better implement that if we want our service to stay up. Let’s just drop a stub healthcheck into app.js. For a real application you could perform dependency checks and stuff, but for this we’ll just return a 200:
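A stub along these lines, with the function and the /health event wired into template.yaml the same way as /hello (the handler name is mine):

    // hello_world/app.js
    exports.healthHandler = async () => {
        // Always healthy for now; real dependency checks could go here
        return {
            statusCode: 200,
            body: JSON.stringify({ status: 'ok' })
        };
    };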

The last piece we unfortunately can’t control directly with CloudFormation; we’ll need to use regular AWS CLI commands. Since Global Tables span regions, that kind of makes sense. But before we can hook up the Global Table, the table needs to exist in each region already.

Through the magic of Bash scripts, we can deploy to all of our regions and create the Global Table all in one go!
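A simplified, non-idempotent sketch of that script (regions, bucket names, stack name, and parameter values are placeholders; the repo’s hello-sam-3/deploy.sh is the real reference):

    #!/bin/bash
    set -e

    REGIONS="us-east-1 us-west-2"
    STACK_NAME=hello-sam

    for region in $REGIONS; do
      sam package \
        --region "$region" \
        --template-file template.yaml \
        --s3-bucket "my-sam-artifacts-$region" \
        --output-template-file "deploy-$region.yaml"

      sam deploy \
        --region "$region" \
        --template-file "deploy-$region.yaml" \
        --stack-name "$STACK_NAME" \
        --capabilities CAPABILITY_IAM \
        --parameter-overrides DomainName=api.example.com HostedZoneId=ZXXXXXXXXXXXXX
    done

    # Once the table exists in every region, stitch the copies into a Global Table
    aws dynamodb create-global-table \
      --global-table-name HitsTable \
      --replication-group RegionName=us-east-1 RegionName=us-west-2 \
      --region us-east-1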

For a more idempotent (but more verbose) version of this script, check out hello-sam-3/deploy.sh.

Note: if you’ve never provisioned an ACM Certificate for your domain before, you may need to check your CloudFormation output for the validation CNAMEs.

And…that’s all there is to it. You have your multi-region app!

Let’s try it out

So let’s test it! How about we do this:

  1. Get a few hits in on our home region
  2. Fail the healthcheck in our home region
  3. Send a few hits to the next region Route53 chooses for us
  4. Fail back to our home region
  5. Make sure the counter continues at the number we expect

Cool, we’ve got some data. Let’s failover! The easiest way to do this is just to tell Route53 that up actually means down. Find the healthcheck id for your region using aws route53 list-health-checks and run:
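Something like the following, where the ID is a placeholder taken from the list-health-checks output; inverting the check makes Route53 treat a healthy endpoint as unhealthy:

    aws route53 update-health-check \
      --health-check-id 11111111-2222-3333-4444-555555555555 \
      --inverted

Failing back (the “you know the drill” step below) is the same command with --no-inverted.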

Now let’s wait a minute for it to fail over and give it another shot.

Look at that, another region! And it started counting at 3. That’s awesome, our data was replicated. Okay, let’s fail back, you know the drill:

Give it a minute for the healthcheck to become healthy again and fail back. And now let’s hit the service a few more times:

Amazing. Your users will now automatically get routed to not just the nearest region, but the nearest healthy region. And all of the data is automatically replicated between all active regions with very low latency. This grants you a huge amount of redundancy, availability, and resilience to service, network, regional, or application failures.

Now not only can your app scale effortlessly through the use of serverless technologies, it can failover automatically so you don’t have to wake up in the middle of the night and find there’s nothing you can do because there’s a network issue that is out of your control – change your region and route around it.

Further Reading

I don’t want to take up too much more of your time, but here’s some further reading if you wish to dive deeper into serverless:

  • A great Medium post by Paul Johnston on Serverless Best Practices
  • SAM has configurations for safe and reliable deployment and rollback using CodeDeploy!
  • AWS built-in tools for serverless monitoring are lackluster at best; you may wish to look into external services like Dashbird or Thundra once you hit production.
  • ServerlessByDesign is a really great web app that allows you to drag, drop, and connect various serverless components to visually design and architect your application. When you’re done, you can export it to a working SAM or Serverless repository!

About the Author

Norm recently joined Chewy.com as a Cloud Engineer to help them start on their Cloud transformation. Previously, he ran the Cloud Engineering team at Cimpress. Find him on twitter @nromdotcom.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Working with AWS Limits

07. December 2018 2018 0

How rolling out EC2 Nitro-based instance types surfaced a DNS query rate limit

Introduction

Amazon Web Services (AWS) is a great cloud platform enabling all kinds of businesses and organizations to innovate and build on a global scale and with great velocity. However, while it is a highly scalable infrastructure, AWS does not have infinite capacity. Each AWS service has a set of service limits to ensure a quality experience for all customers of AWS.

There are also some limits that customers might accidentally discover while deploying new applications or services, or while trying to scale up existing infrastructure.

Keeping Pace with AWS Offerings

We want to share our discovery and mitigation of one such limit. As prudent customers following the Well-Architected Framework, we track new AWS services and updates to existing ones to take advantage of new capabilities and potential cost savings. At re:Invent 2017, the Nitro system architecture was introduced. Consequently, as soon as the relevant EC2 instance types became generally available, we started updating our infrastructure to use the new M5 and C5 instance types. We updated the relevant CloudFormation templates and Launch Configurations and built new AMIs to enable the new and improved elastic network interface (ENI). We were now ready to start the upgrade process.

Preparing for Success in Production by Testing Infrastructure

We were eager to try out new instance types, so we launched a couple of test instances using our common configuration to start our testing. After some preliminary testing (mostly kicking proverbial tires) we started the update of our test environment.

Our test environment is very similar to the production environment. We try to use the same configuration with modified parameters to account for a lighter load on our test instances (e.g., smaller instances and Auto Scaling groups). We updated our stacks with revised CloudFormation templates, rebuilt Auto Scaling groups using new Launch configurations with success. We did not observe any adverse effects on our infrastructure while running through some tests. The environment worked as expected and developers continued to deploy and test their changes.

Deploying to Production

After testing in our test environment and letting things bake for a couple of weeks, we felt confident that we were ready to deploy the new instance types into production. We purchased reserved M5 and C5 instances, started saving money by utilizing them, and observed performance improvements as well. We began with some second-tier applications and services. These upgraded smoothly, which added to our confidence in the changes we were making to the environment. It was exciting to see, and we could not wait to tackle our core applications and services, including the infrastructure running our main site.

Everything was in place: we had new instances running in the test environment, and partially in the production environment; we notified our engineering team about the upgrade.

In our environment, we share on-call responsibilities with development. Every engineer is engaged in the success of the company through shared responsibility for the health of the site. We followed our process of notifying the on-call engineers about the changes to the environment. We pulled up our monitoring and “Golden Signal” dashboards to watch for anomalies. We were ready!

The update process went relatively smoothly, and we replaced the older M4 and C4 instances with new and shiny M5 and C5 ones. We saw some performance gains, e.g., somewhat faster page loads. Dashboards didn’t show any issues or anomalies. We checked it off our to-do list and prepared to move on to the next project in our backlog.

It’s the Network…

We were paged. Some of the instances in an availability zone (AZ) were throwing errors that we initially attributed to network connectivity issues. We verified that presumed “network failures” were limited only to a single AZ, so we decided to divert traffic away from that AZ and wait for the network to stabilize. After all, this is why we run multi-zone deployment.

We wanted to do due-diligence to make sure we understood the underlying problem. We started digging into the logs and did not see anything abnormal. We chalked it off to transient network issues and continued to monitor our infrastructure.

Some time passed without additional alerts, and we decided to bring the AZ back into production. No issues observed; our initial assessment must be correct.

Then, we got another alert. This time a second AZ was having issues. We thought that it must be a bad day for AWS networking. We knew how to mitigate it: take that AZ out of production and wait it out. While we were doing that, we got hit with another alert; it looked like yet another AZ was having issues. At this point, we were concerned that something wasn’t working as we expected and that maybe we needed to modify the behavior of our application. We dove deeper into our logs.

Except when it’s DNS…

This event was a great example of our SREs and developers coming together and working on the problem as one team. One engineer jumped on a support call with AWS, while our more experienced engineers started close examination of logged messages and events. Right then we noticed that our application logs contained messages that we hadn’t focused on before: failures to perform DNS name resolution.

We use Route 53 for DNS, so we had not experienced these kinds of errors before on the M4 or C4 instances. We jumped on the EC2 instances and confirmed that name resolution worked as we expected. We were really puzzled about the source of these errors. We checked to see if we had any production code deploys that might have introduced them, and we did not find anything suspicious or relevant.


In the meantime, our customers were experiencing intermittent errors while accessing our site and that fact did not sit well with us.

Luckily, the AWS support team was working through the trouble with us. They checked and confirmed that they did not see any networking outages being reported by their internal tools. Our original assumption was incorrect. AWS support suggested that we run packet capturing focused on DNS traffic between our hosts to get additional data. Coincidentally, one of our SREs was doing exactly that and analyzing captured data. Analysis revealed a very strange pattern: while many of the name resolution queries were successful, some were failing. Moreover, there was no pattern as to which names would fail to resolve. We also observed that we triggered about 400 DNS queries per second.

We shared our findings with AWS support. They took our data and contacted us with a single question. “Have you recently upgraded your instances?”

“Oh yes, we upgraded earlier that day,” we responded.

AWS support then reminded us that each Amazon EC2 instance limits the number of packets sent to the Amazon-provided DNS server to a maximum of 1024 packets per second per network interface (https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html?shortFooter=true#vpc-dns-limits). The limit was not new; however, on the newer instance types AWS had eliminated internal retries, making DNS resolution errors more visible. To mitigate the impact, the service team recommended implementing DNS caching on the instances.

At first, we were skeptical about their suggestion. After all, we did not seem to breach the limit with our 400 or so requests per second. However, we did not have any better ideas, so we decided to pursue two solutions. Most importantly, we needed to improve the experience of our customers by rolling back our changes; we did that and immediately stopped seeing DNS errors. Second, we started working on implementing local DNS caching on the affected EC2 instances.

AWS support recommended using nscd (https://linux.die.net/man/8/nscd). Based on our personal experiences with various DNS tools and implementations, we decided to use bind (https://www.isc.org/downloads/bind/) configured to act only as a caching server with XML statistics-channels enabled. The reason for that requirement was our desire to understand the nature of DNS queries performed by our services and applications with the aim of improving how they interacted with DNS infrastructure.

Infrastructure as Code and the Magic DNS Address

A fair number of our EC2 instances run Ubuntu. We were hoping to use the main package repositories to install bind and apply our custom configuration to achieve our goal of reducing the number of DNS queries per host.

On Ubuntu hosts, we used our configuration management system (Chef) to install the bind9 package and configure it so it would listen only on the localhost (127.0.0.1, ::1) interface, query the AWS VPC DNS server, cache DNS query results, log statistics, and expose them via port 954. Typically, AWS VPC provides a DNS server running on a reserved IP address at the base of the VPC IPv4 network range, plus two. We started coding a solution to calculate that IP address based on the network parameters of our instances when we noticed that there is also an often-overlooked “magic” DNS address available to use: 169.254.169.253. That made our effort easier, since we could hard-code that address in our configuration template file. We also needed to preserve the content of the original resolver configuration file and prepend the loopback address (127.0.0.1) to it. That way, our local caching server would be queried first, but if it was not running for some reason, clients had a fallback address to query.

Prepending of loopback address was achieved by adding the following option to the dhclient configuration file (/etc/dhcp/dhclient.conf):
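The dhclient directive in question is the standard option for prepending a resolver and looks like this:

    # /etc/dhcp/dhclient.conf
    prepend domain-name-servers 127.0.0.1;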

Preservation of the original content of /etc/resolv.conf was done by creating /etc/default/resolvconf with the following statement:

And to apply these changes we needed to restart networking on our instances and unfortunately, the only way that we found to get it done was not what we were hoping for (service networking restart):

We tested our changes using Chef’s excellent Test Kitchen and InSpec tools (TDD!) and were ready to roll out the changes once more. This time we were extra cautious and performed a canary deploy with subsequent long-term validation before updating the rest of our EC2 Ubuntu fleet. The results were as expected, that is to say, we did not see any negative impact on our site. We observed better response times and were saving money by utilizing our reserved instances.

Conclusion

We learned our lesson through this experience: respect the service limits imposed by AWS. If something isn’t working right, don’t just assume it’s the network (or DNS!); check the documentation.

We benefited from our culture of “one team” with site reliability and software development engineers coming together and working towards the common goal of delighting our customers. Having everyone being part of the on-call rotation ensured that all engineers were aware of changes and their impact in the environment. Mutual respect and open communication allowed for quick debugging and resolution when problems arose with everyone participating in the process to restore customer experience.

By treating our infrastructure as code, we were able to target our fixes with high precision and roll back and forward efficiently. Because everything was automated, we could focus on the parts that needed to change (like the local dns cache) and quickly test on both the old and new instances.

We became even better at practicing test driven development. Good tests make our infrastructure more resilient and on-call quieter which improves our overall quality of life.

Our monitoring tools and dashboards are great, however, it is important to be ready to look beyond what is presently measured and viewable. We must be ready to dive deep into the application and its behaviors. It’s also important to take time to iterate on improving tools and dashboards to be more relevant after an event.

We are also happy to know that AWS works hard on eliminating or increasing limits as services improve so that we will not need to go through these exercises too often.

We hope that sharing this story might be helpful to our fellow SREs out there!

About the Author

Bakha Nurzhanov is a Senior Site Reliability Engineer at RealSelf. Prior to joining RealSelf, he had been solving interesting data and infrastructure problems in healthcare IT, and worked on payments infrastructure engineering team at Amazon. His Twitter handle is @bnurzhanov.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Athena Savior of Adhoc Analytics

06. December 2018 2018 0

Introduction

Companies strive to attract customers by creating an excellent product with many features. Taking a product from idea to reality used to take months or years; nowadays it can take a matter of weeks. Companies can fail fast, learn, and move ahead to make the product better. Data analytics often takes a back seat, becoming a bottleneck.

Some of the problems that cause bottlenecks are:

  • schema differences,
  • missing data,
  • security restrictions,
  • encryption

AWS Athena, an ad-hoc query tool, can alleviate these problems. Its most compelling characteristics include:

  • Serverless
  • Query Ease
  • Cost ($5 per TB of data scanned)
  • Availability
  • Durability
  • Performance
  • Security

Behind the scenes, Athena uses Hive and Presto to run analytical queries of any size against data stored in S3. Athena processes structured, semi-structured and unstructured data sets, including CSV, JSON, ORC, Avro, and Parquet. Athena drivers are available for multiple languages, including Java and Python.

Let’s examine a few different use cases with Athena.

Use cases

Case 1: Storage Analysis

Let us say you have a service where you store user data such as documents, contacts, videos, and images. You have an accounting system in a relational database, whereas user resources live in S3, orchestrated through metadata housed in DynamoDB. How do we get ad-hoc storage statistics for individual users, as well as across the entire customer base, by various parameters and events?

Steps :

  • Create an AWS Data Pipeline to export the relational database data to S3
    • Data persisted in S3 as CSV
  • Create an AWS Data Pipeline to export the DynamoDB data to S3
    • Data persisted in S3 as JSON strings
  • Create Database in Athena
  • Create tables for data sources
  • Run queries
  • Clean the resources

Figure 1: Data Ingestion

Figure 2: Schema and Queries
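As a rough illustration of the kind of schema and queries Figure 2 refers to (database, table, and column names here are invented for the example, not the actual schema):

    -- Table over the DynamoDB export (JSON strings in S3)
    CREATE EXTERNAL TABLE IF NOT EXISTS user_files (
      user_id    string,
      file_name  string,
      file_type  string,
      size_bytes bigint
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://my-export-bucket/dynamodb/';

    -- Ad-hoc storage statistics per user
    SELECT user_id,
           count(*)        AS objects,
           sum(size_bytes) AS total_bytes
    FROM user_files
    GROUP BY user_id
    ORDER BY total_bytes DESC
    LIMIT 20;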

Case 2: Bucket Inventory

Why is S3 usage growing out of sync from user base changes? Do you know how your S3 bucket is being used? How many objects did it store? How many duplicate files? How many deleted?

S3 Bucket Inventory helps manage storage and provides audit reports on the replication and encryption status of the objects in a bucket. Let us create a bucket, enable Inventory, and perform the following steps.

Steps :

  • Go to S3 bucket
  • Create buckets vijay-yelanji-insights for objects and vijay-yelanji-inventory for inventory.
  • Enable inventory
    • AWS generates a report into the inventory bucket at regular intervals via a scheduled job.
  • Upload files
  • Delete files
  • Upload same files to check duplicates
  • Create Athena table pointing to vijay-yelanji-inventory
  • Run queries as shown in Figure 5 to get S3 usage to take necessary actions to reduce the cost.

Figure 3: S3 Inventory

Figure 4: Bucket Insights


Figure 5: Bucket Insight Queries
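A hedged example of the kind of query Figure 5 shows, assuming the inventory report includes the optional Size and ETag fields and has been exposed as an Athena table named s3_inventory:

    -- Objects whose content (same ETag) is stored under more than one key
    SELECT e_tag,
           count(*)  AS copies,
           sum(size) AS total_bytes
    FROM s3_inventory
    GROUP BY e_tag
    HAVING count(*) > 1
    ORDER BY total_bytes DESC;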

Case 3: Event comparison

Let’s say you are sending a stream of events to two different targets after pre-processing the events very differently, and you are experiencing discrepancies in the data. How do you reconcile the event counts? What if events or data are missing? How do you resolve inconsistencies or quality issues?

If the data is stored in S3 in a format supported by Athena, you can expose it as tables and identify the gaps, as shown in Figure 7.

Figure 6: Event Comparison

Steps:

  • Data ingested in S3 in snappy or JSON and forwarded to the legacy system of records
  • Data ingested in S3 in CSV (column separated by ‘|’ ) and forwarded to a new system of records
    • Event Forwarder system consumes the source event, modifies the data before pushing into the multiple targets.
  • Create an Athena table from the legacy source data and compare it with the problematic event forwarder data.


Figure 7: Comparison Inference

 

Case 4: API Call Analysis

If you have not enabled CloudWatch or set up your own ELK stack, but need to analyze service patterns like total HTTP requests by type or 4XX and 5XX errors by call type, you can do so by enabling ELB access logs and reading them through Athena.

Figure 8: Calls Inference

Steps :

https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/access-log-collection.html
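Once a table exists per the DDL in that guide (commonly named elb_logs; the column names below follow that guide and may need adjusting), a query like this gives request and error counts by HTTP method:

    SELECT request_verb,
           count(*) AS total_requests,
           count_if(substr(backend_response_code, 1, 1) = '4') AS errors_4xx,
           count_if(substr(backend_response_code, 1, 1) = '5') AS errors_5xx
    FROM elb_logs
    GROUP BY request_verb
    ORDER BY total_requests DESC;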

You can do the same on CloudTrail Logs with more information here:

https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html

 

Case 5: Python S3 Crawler

If you have tons of JSON data in S3 spread across directories and files and want to analyze its keys and values, you can use Python libraries like PyAthena or JayDeBeApi to read the snappy-compressed files (after unzipping them with snzip), collect the keys into a set data structure, and pass them as columns to Athena, as shown in Figure 10.
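A minimal PyAthena sketch of that flow (bucket, region, table, and column names are placeholders):

    from pyathena import connect

    # Athena needs an S3 location to write query results to
    cursor = connect(
        s3_staging_dir="s3://my-athena-results/",
        region_name="us-west-2",
    ).cursor()

    cursor.execute("SELECT event_key, count(*) AS occurrences "
                   "FROM crawled_events GROUP BY event_key")
    for event_key, occurrences in cursor.fetchall():
        print(event_key, occurrences)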

Figure 9: Event Crawling

Figure 10: Events to Athena

Limitations

Athena has some limitations including:
  • Data must reside in S3.
  • To reduce query cost and improve performance, data should be compressed, partitioned, and converted to columnar formats.
  • User-defined functions, stored procedures, and many DDL statements are not supported.
  • If you are generating data continuously, have large data sets, or need real-time or frequent insights, you should rely on analytical and visualization tools such as Redshift, Kinesis, EMR, Denodo, Spotfire and Tableau.
  • Check Athena FAQ to understand more about its benefits and limitations.

Summary

In this post, I shared how to leverage Athena to get analytics and minimize bottlenecks to product delivery. Be aware that some of the methods described were implemented when Athena was new, and newer tools may have changed how best to solve these use cases. Lately, Athena has been integrated with Glue for building, maintaining, and running ETL jobs, and with QuickSight for visualization.

Reference

Athena documentation is at https://docs.aws.amazon.com/athena/latest/ug/what-is.html

About the Author

Vijay Yelanji (@VijayYelanji) is an architect at Asurion, working in San Mateo, CA, with more than 20 years of experience across various domains such as cloud-enabled microservices supporting enterprise-level account, file, order, and subscription management systems, WebSphere Integration Servers and Solutions, IBM Enterprise Storage Solutions, Informix databases, and 4GL tools.

At Asurion, he was instrumental in designing and developing a multi-tenant, multi-carrier, highly scalable backup and restore mobile application using various AWS services.

You can download the Asurion Memories application for free at 

Recently, Vijay presented ‘Logging in AWS’ at an AWS Meetup in Mountain View, CA.

Many thanks to AnanthakrishnaChar, Kashyap and Cathy, Hui for their assistance in fine-tuning some of the use cases.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.


Scaling up An Existing Application with Lambda Functions

05. December 2018 2018 0

What is Serverless?

Serverless infrastructure is the latest buzzword in tech, but from the name itself, it’s not clear what it means. At its core, serverless is the next logical step in infrastructure service providers abstracting infrastructure away from developers so that when you deploy your code, “it just works.”

Serverless doesn’t mean that your code is not running on a server. Rather, it means that you don’t have to worry about what server it’s running on, whether that server has adequate resources to run your code, or if you have to add or remove servers to properly scale your implementation. In addition to abstracting away the specifics of infrastructure, serverless lets you pay only for the time your code is explicitly running (in 100ms increments).

AWS-specific Serverless: Lambda

Lambda, Amazon Web Services’ serverless offering, provides all these benefits, along with the ability to trigger a Lambda function from a wide variety of AWS services. Natively, Lambda functions can be written in:

  • Java
  • Go
  • PowerShell
  • Node.js
  • C#
  • Python

However, as of November 29th, AWS announced that it’s now possible to use the Runtime API to add any language to that list. Adapters for Erlang, Elixir, Cobol, N|Solid, and PHP are currently in development as of this writing.

What does it cost?

The Free Tier (which does not expire after the 12-month window of some other free tiers) allows up to 1 million requests per month, with the price increasing to $0.20 per million requests thereafter. The free tier also includes 400,000 GB-seconds of compute time.

The rate at which this compute time will be used up depends on how much memory you allocate to your Lambda function on its creation. Any way you look at it, many workflows can operate in the free tier for the entire life of the application, which makes this a very attractive option to add capacity and remove bottlenecks in existing software, as well as quickly spin up new offerings.

What does a Lambda workflow look like?

Using the various triggers that AWS provides, Lambda offers the promise of never having to provision a server again. While technically possible, this promise is only fulfilled for certain workflows, and often requires stringing together a complex set of services.

Complex Workflow With No Provisioned Servers

However, if you have an existing application and don’t want to rebuild your entire infrastructure on AWS services, can you still integrate serverless into your infrastructure?

Yes.

At its simplest, an AWS Lambda function can be the compute behind a single API endpoint. This means that rather than dealing with the complex dependency graph we see above, existing code can be abstracted into Lambda functions behind an API endpoint, which can be called from your existing code. This approach allows you to take code paths that are currently bottlenecks and put them on infrastructure that can run in parallel and scale on demand.

Creating an entire serverless infrastructure can seem daunting. Instead of exploring how you can create an entire serverless infrastructure all at once, we will look at how to eliminate bottlenecks in an existing application using the power that serverless provides.

Case Study: Resolving a Bottleneck With Serverless

In our hypothetical application, users can set up alerts to receive emails when a certain combination of API responses all return true. These various APIs are queried hourly to determine whether alerts need to be sent out. This application is built as a traditional monolithic application, with all the logic and execution happening on a single server.

Important to note is that the application itself doesn’t care about the results of the alerts. It simply needs to dispatch them, and if the various API responses match the specified conditions, the user needs to get an email alert.

What is the Bottleneck?

In this case, as we add more alerts, processing the entire collection of alerts starts to take more and more time. Eventually, we will reach a point where the checking of all the alerts will overrun into the next hour, thus ensuring the application is perpetually behind in checking a user’s alerts.

This is a prime example of a piece of functionality that can be moved to a serverless function. Because the application doesn’t care about the result of the serverless function, we can dispatch all of our alert calls asynchronously and take advantage of the auto-scaling and parallelization of AWS Lambda to ensure all of our events are processed in a fraction of the time.

Refactoring the checking of alerts into a Lambda function takes this bottleneck out of our codebase and turns it into an API call. However, there are now a few caveats that we have to resolve if we want to run this as efficiently as possible.

Caveat: Calling an API-invoked Lambda Function Asynchronously

If you’re building this application in a language or library that’s synchronous by default, turning these checks into API calls will just result in each call waiting for the previous one to finish. While it’s possible you’ll see a speed boost because your Lambda function is more powerful than your existing server, you’ll eventually run into a similar problem as before.

As we’ve already discussed, all our application needs to care about is that the call was received by API Gateway and Lambda, not whether it finished or what the result of the alert was. This means, if we can get API Gateway to return as soon as it receives the request from our application, we can run through all these requests in our application much more quickly.

In their documentation for integrating API Gateway and Lambda, AWS has provided documentation on how to do just this:

To support asynchronous invocation of the Lambda function, you must explicitly add the X-Amz-Invocation-Type:Event header to the integration request.

This will make the loop in our application that dispatches the requests run much faster and allow our alert checks to be parallelized as much as possible.
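For example, if the integration is configured to pass that header through, the dispatch from the application is just a fire-and-forget request (URL and payload are placeholders):

    # Returns as soon as API Gateway hands the event to Lambda asynchronously
    curl -X POST \
      -H "X-Amz-Invocation-Type: Event" \
      -H "Content-Type: application/json" \
      -d '{"alert_id": 42}' \
      https://abc123.execute-api.us-east-1.amazonaws.com/prod/check-alert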

Caveat: Monitoring

Now that your application is no longer responsible for ensuring these API calls complete successfully, failing APIs will no longer trigger any monitoring you have in place.

Out of the box, Lambda supports CloudWatch monitoring, where you can check whether there were any errors in the function execution. You can set up CloudWatch to monitor API Gateway as well. If the existing metrics don’t fit your needs, you can always set up custom metrics in CloudWatch to ensure you’re monitoring everything that makes sense for your application.

By integrating Cloudwatch into your existing monitoring solution, you can ensure your serverless functions are firing properly and always available.

Tools for Getting Started

One of the most significant barriers to entry for serverless has traditionally been the lack of tools for local development and the fact that the infrastructure environment is a bit of a black box.

Luckily, AWS has built the SAM (Serverless Application Model) CLI, which uses Docker locally to give you a simulated serverless environment. Once you get the project installed and API Gateway running locally, you can hit the API endpoint from your application and see how your serverless function performs.

This allows you to test your new function’s integration with your application and iron out any bugs before you go through the process of setting up API Gateway on AWS and deploying your function.

Serverless: Up and Running

Once you go through the process of creating a serverless function and getting it up and running, you’ll see just how quickly you can remove bottlenecks from your existing application by leaning on AWS infrastructure.

Serverless isn’t something you have to adopt across your entire stack or adopt all at once. By shifting to a serverless application model piece by piece, you can avoid overcomplicating your workflow while still taking advantage of everything this exciting new technology has to offer.

About the Author

Keanan Koppenhaver @kkoppenhaver, is CTO at Alpha Particle, a digital consultancy that helps plan and execute digital projects that serve anywhere from a few users a month to a few million. He enjoys helping clients build out their developer teams, modernize legacy tech stacks, and better position themselves as technology continues to move forward. He believes that more technology isn’t always the answer, but when it is, it’s important to get it right.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.