AWS Advent 2014 – Exploring AWS Lambda

Today’s post comes to us from Mark Nunnikhoven, who is the VP of Cloud & Emerging Technologies.

At this year’s re:Invent, AWS introduced a new service (currently in preview) called Lambda. Mitch Garnaat already introduced the service to the advent audience in the first post of the month.

Take a minute to read Mitch’s post if you haven’t already. He provides a great overview of the service and its goals, and he’s created a handy tool, Kappa, that simplifies using the new service.

Going Deeper

Of course Mitch’s tool is only useful if you already understand what Lambda does and where best to use it. The goal of this post is to provide that understanding.

I think Mitch is understating things when he says that “there are some rough edges”. Like any AWS service, Lambda is starting out small. Thankfully, like other services, the documentation for Lambda is solid.

There is little point in creating another walkthrough of setting up a Lambda function. This tutorial from AWS does a great job of the step-by-step.

What we’re going to cover today are the current challenges, constraints, and where Lambda might be headed in the future.

Challenges

1. Invocation vs Execution

During a Lambda workflow, two IAM roles are used. This is the number one area where people get caught up.

A role is an identity used in the permissions framework of AWS. Roles typically have policies attached that dictate what the role can do within AWS.

Roles are a great way to provide (and limit) access without passing access and secret keys around.

Lambda uses two IAM roles during its workflow: an invocation role and an execution role. While the terminology is consistent with computer science usage, it’s needlessly confusing for some people.

Here’s the layman’s version:

  • invocation role => the trigger
  • execution role => the one that does stuff

This is an important difference because while the execution role is consistent in the permissions it needs, the invocation role (the trigger) will need different permissions depending on where you’re using your Lambda function.

If you’re hooking your Lambda function to an S3 bucket, the invocation role will need the appropriate permissions to have S3 call your Lambda function. This typically includes the lambda:InvokeAsync permission and a trust policy that allows the bucket to assume the invocation role.

If you’re hooking your function into a Kinesis event stream, the same logic applies but in this case you’re going to have to allow the invocation role access to your Kinesis stream since it’s a pull model instead of the S3 push model.

The AWS docs sum this up with the following semi-helpful diagrams:

[Diagram: S3 push model for Lambda permissions]
[Diagram: Kinesis pull model for Lambda permissions]

Remember that your invocation role always needs to be able to assume a role (sts:AssumeRole) and access the event source (Kinesis stream, S3 bucket, etc.)
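
For the S3 push case, the two policy documents on the invocation role might look roughly like the following sketch, shown here as Python dictionaries. The function ARN is a placeholder, and a real trust policy would normally also be scoped down (for example with a Condition) to your specific bucket or account.

```python
import json

# Access policy attached to the invocation role: lets the role invoke
# your Lambda function asynchronously.
invocation_access_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "lambda:InvokeAsync",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:my-function"
    }]
}

# Trust policy on the invocation role: lets S3 assume the role on
# behalf of your bucket.
invocation_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

print(json.dumps(invocation_access_policy, indent=2))
print(json.dumps(invocation_trust_policy, indent=2))
```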

2. Deployment of libraries

TL;DR: Thank Mitch for starting Kappa.

The longer explanation is that packaging up the dependencies of your code can be a bit of a pain. That’s because we have little to no visibility into what’s happening.

Until the service and associated tooling mature a bit, we’re back to the world of printf (or at least console.log) debugging.

For Lambda, a deployment package is your JavaScript code and any supporting libraries. These need to be bundled into a .zip file. If you’re just deploying a simple .js file, zip it up and you’re good to go.

If you have additional libraries that you’re providing, buckle up. This ride is about to get real bumpy.

The closest thing we have to a step-by-step on providing additional libraries is this step from one of the AWS tutorials.

The instructions there are to install a separate copy of Node.js, create a subfolder, and then install the required modules via npm.

Now you’re going to zip your code file and the modules from the subfolder, but not the folder itself. From all appearances, the archive needs to be flat: your .js file and node_modules at the top level, with no containing project folder.
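
As a rough illustration, here is one way to build that flat archive with Python’s zipfile module; the file and folder names are assumptions, so adjust them to match your project.

```python
# Build a Lambda deployment package with the handler file and
# node_modules at the root of the archive (no containing folder).
import os
import zipfile

def build_package(code_file, modules_dir, output_zip):
    with zipfile.ZipFile(output_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        # The handler file goes at the top level of the archive.
        zf.write(code_file, arcname=os.path.basename(code_file))
        # Walk node_modules and store paths relative to its parent, so
        # the archive contains node_modules/... rather than
        # my-project/node_modules/...
        parent = os.path.dirname(os.path.abspath(modules_dir))
        for root, _dirs, files in os.walk(modules_dir):
            for name in files:
                path = os.path.join(root, name)
                zf.write(path, arcname=os.path.relpath(path, parent))

build_package("index.js", "node_modules", "lambda-package.zip")
```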

I’m hopeful there will be more robust documentation on this soon but in the meantime please share your experiences in the AWS forums or on Twitter.

Constraints

As Lambda is in preview there are additional constraints beyond what you can expect when it is launched into production.

Current constraints:

  1. functions must execute in <= 1 GB of memory
  2. functions must complete execution in <= 60 seconds
  3. functions must be written in JavaScript (run on Node.js)
  4. functions can only access 512 MB of temp disk space
  5. functions can only open 1,024 file descriptors
  6. functions can only use 1,024 combined threads and processes

These constraints also lead to some AWS recommendations that are worth reading and taking to heart; however, one stands out above all the others.

“Write your Lambda function code in a stateless style” – AWS Lambda docs.

This is by far the best piece of advice that one can offer when it comes to Lambda design patterns. Do not try to bolt state on using another service or data store. Treat Lambda as an opportunity to manipulate data mid-stream. Lambda functions execute concurrently. Thinking of them in functional terms will save you a lot of headaches down the road.

The Future?

One of the most common reactions I’ve heard about AWS Lambda is, “So what?” That’s understandable, but if you look at AWS’ track record, they ship very simple but useful services and iterate very quickly on them.

While Lambda may feel limited today, expect things to change quickly. Kinesis, DynamoDB, and S3 are just the beginning. The “custom” route today provides a quick and easy way to offload some data processing to Lambda but that will become exponentially more useful as “events” start popping up in other AWS services.

Imagine triggering Lambda functions based on SNS messages, CloudWatch Logs events, Directory Service events, and so forth.

Look to tagging in AWS as an example. It started out very simple in EC2 and over the past 24 months has expanded to almost every service and resource in the environment. Events will most likely follow the same trajectory, and with every new event source Lambda gets even more powerful.

Additional Reading

Getting in on the ground floor of Lambda will allow you to shed more and more of your lower level infrastructure as more events are rolled out to production.

Here’s some holiday reading to ensure you’re up to speed:


AWS Advent 2014 – A Quick Look at AWS CodeDeploy

Today’s AWS Advent post comes to us from Mitch Garnaat, the creator of the AWS Python library boto, who is currently herding clouds and devops over at Scopely. He’s gonna walk us through a quick look at AWS CodeDeploy.

Software deployment. It seems like such an easy concept. I wrote some new code and now I want to get it into the hands of my users. But there are few areas in the world of software development where you find a more diverse set of approaches to such a simple-sounding problem. In all my years in the software business, I don’t think I’ve ever seen two deployment processes that are the same. So many different tools. So many different approaches. It turns out it is a pretty complicated problem with a lot of moving parts.

But there is one over-arching trend in the world of software deployment that seems to have almost universal appeal. More. More deployments. More often.

Ten years ago it was common for a software deployment to happen a few times a year. Software changes would be batched up for weeks or months waiting for a release cycle and once the release process started, development stopped. All attention was focused on finding and fixing bugs and, eventually, releasing the code. It was very much a bimodal process: develop for a while and then release for a while.

Now the goal is to greatly shorten the time it takes to get a code change deployed, to make the software deployment process quick and easy. And the best way to get good at something is to do it a lot of times.

“Repetition is the mother of skill.” – Anthony Robbins

If we force ourselves to do software deployment frequently and repeatedly we will get better and better at it. The process I use to put up holiday lights is appallingly inefficient and cumbersome. But since I only do it once a year, I put up with it. If I had to put those lights up once a month or once a week or once a day, the process would get better in a hurry.

The ultimate goal is Continuous Deployment, a continuous pipeline where each change we commit to our VCS is pushed through a process of testing and then, if the tests succeed, is automatically released to production. This may be an aspirational goal for most people and there may be good reasons not to have a completely automated pipeline (e.g. dependencies on other systems) but the clear trend is towards frequent, repeatable software deployment without the bimodal nature of traditional deployment techniques.

Why AWS CodeDeploy Might Help Your Deployment Process

Which brings us to the real topic of today’s post. AWS CodeDeploy is a new service from AWS specifically designed to automate code deployment and eliminate manual operations.

This post will not be a tutorial on how to use AWS CodeDeploy. There is an excellent hands-on sample deployment available from AWS. What this post will focus on is some of the specific features provided by AWS CodeDeploy that might help you achieve the goal of faster and more automated software deployments.

Proven Track Record

This may seem like a contradiction given that this is a new service from AWS, but the underlying technology in AWS CodeDeploy is not new at all. This is a productization of an internal system called Apollo that has been used for software deployments within Amazon and AWS for many years.

Anyone who has worked at Amazon will be familiar with Apollo and will probably rave about it. It’s rock solid and has been used to deploy thousands of changes a day across huge fleets of servers within Amazon.

Customizable Deployment Configurations

You can control how AWS CodeDeploy will roll out the deployment to your fleet using a deployment configuration. There are three built-in configurations:

  • All At Once – Deploy the new revision to all instances in the deployment group at once. This is probably not a good idea unless you have a small fleet or you have very good acceptance tests for your new code.

  • Half At A Time – Deploy the new revision to half of the instances at once. If a certain number of those instances fail then fail the deployment.

  • One At A Time – Deploy the new revision to one instance at a time. If deployment to any instance fails, then fail the deployment.

You can also create custom deployment configurations if one of these models doesn’t fit your situation.
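
If you script your deployments, creating such a configuration is a single API call. For example, a rough sketch using the boto3 SDK; the name and threshold here are arbitrary, not taken from the article.

```python
# Sketch: a custom deployment configuration that requires at least
# 75% of the fleet to stay healthy during a deployment.
import boto3

codedeploy = boto3.client("codedeploy", region_name="us-east-1")

codedeploy.create_deployment_config(
    deploymentConfigName="ThreeQuartersHealthy",
    minimumHealthyHosts={
        "type": "FLEET_PERCENT",  # or HOST_COUNT for an absolute number
        "value": 75,
    },
)
```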

Auto Scaling Integration

If you are deploying your code to more than one instance and you are not currently using Auto Scaling, you should stop reading this article right now and go figure out how to integrate it into your deployment strategy. In fact, even if you are only using one instance you should use Auto Scaling. It’s a great service that can save you money and allow you to scale with demand.

Assuming that you are using Auto Scaling, AWS CodeDeploy can integrate with your Auto Scaling groups. By using lifecycle hooks in Auto Scaling AWS CodeDeploy can automatically deploy the specified revision of your software on any new instances that Auto Scaling creates in your group.

Should Work With Most Apps

AWS CodeDeploy uses a YAML-format AppSpec file to drive the deployment process on each instance. This file allows you to map source files in the deployment package to their destination on the instance. It also allows a variety of hooks to be run at various times in the process such as:

  • Before Installation
  • After Installation
  • When the previous version of your Application Stops
  • When the Application Starts
  • After the service has been Validated

These hooks can be arbitrary executables such as BASH scripts or Python scripts and can do pretty much anything you need them to do.

Below is an example AppSpec file.

GUI and CLI

AWS CodeDeploy can be driven either from the AWS web console or from the AWS CLI. In general, my feeling is that GUI interfaces are great for monitoring and other read-only functions, but for command and control I strongly prefer CLIs and scripts, so it’s great that you can control every aspect of AWS CodeDeploy via the AWS CLI or any of the AWS SDKs. I will say that the web GUI for AWS CodeDeploy is quite well done and provides a really nice view of what is happening during a deployment.

Free (like beer)

There is no extra charge for using AWS CodeDeploy. You obviously pay for all of the EC2 instances you are using just as you do now but you don’t have to pay anything extra to use AWS CodeDeploy.

Other Things You Should Know About AWS CodeDeploy

The previous section highlights some features of AWS CodeDeploy that I think could be particularly interesting to people considering a new deployment tool.

In this section, I want to mention a couple of caveats. These are not really problems but just things you want to be aware of in evaluating AWS CodeDeploy.

EC2 Deployments Only

AWS CodeDeploy only supports deployments on EC2 instances at this time.

Agent-Based

AWS CodeDeploy requires an agent to be installed on any EC2 instance that it will be deploying code to. Currently, the agent supports Amazon Linux, Ubuntu, and Windows Server.

No Real Rollback Capability

Because of the way AWS CodeDeploy works, there really isn’t a true rollback capability. You can’t deploy code to half of your fleet and then undeploy that latest revision. You can simulate a rollback by simply creating a new deployment of your previous version of software but there is no Green/Blue type rollback available.

Summary

We just created a new deployment pipeline at work that implements a type of Blue/Green deployment and is based on golden AMIs. We are very happy with that and I don’t think we will be revisiting it anytime soon. However, if I was starting that project today, I would certainly give a lot of thought to using AWS CodeDeploy. It has a nice feature set, can be easily integrated into most environments and code bases, and is based on rock-solid, proven technology. And the price is right!

Links:


AWS Advent 2014 – Managing EC2 Security Groups using Puppet

Today’s post on managing EC2 Security Groups with Puppet comes to us from Gareth Rushgrove, the awesome curator of DevOps Weekly, who is currently an engineer at Puppet Labs.

At Puppet Labs we recently shipped a module to make managing AWS easier. This tutorial shows how it can be used to manage your security groups. EC2 Security groups act as a virtual firewall and are used to isolate instances and other AWS resources from each other and the internet.

An example

You can find the full details about installation and configuration for the module in the official README, but the basic version, assuming a working Puppet and Ruby setup, is:

You’ll also want to have your AWS API credentials in environment variables (or use IAM if you’re running from within AWS).

First, let’s create a simple security group called test-sg in the us-east-1 region. Save the following to a file called securitygroup.pp:

Now let’s run Puppet to create the group:

You should see something like the following output:

We’re running here with apply and the --test flag so we can easily see what’s happening, but if you have a Puppet master setup you can run with an agent too.

You will probably change your security groups over time as your infrastructure evolves. And managing that evolution is where Puppet’s declarative approach really shines. You can have confidence in the description of your infrastructure in code because Puppet can tell you about any changes when it runs.

Next, let’s add a new ingress rule to our existing group. Modify the securitygroup.pp file like so:

And again, let’s run Puppet to modify the group:

You should see something like the following output:

Note the information about changes to the ingress rules as we expected. You can also check the changes in the AWS console.

The module also has full support for the Puppet resource command, so all of the functionality is available from the command line as well as the DSL. As an example, let’s clean up and delete the group created above.

Hopefully that’s given you an idea of what’s possible with the Puppet AWS module. You can see more examples of the module in action in the main repository.

Advantages

Some of the advantages of using Puppet for managing AWS resources are:

  • The familiar DSL – if you’re already using Puppet the syntax will already be familiar, if you’re not already using Puppet you’ll find lots of good references and documentation
  • Puppet is a declarative tool – Puppet is used to declare the desired state of the world, this means it’s useful for maintaining state and changing resources over time, as well as creating new groups
  • Existing tool support – whether it’s the Geppetto IDE, testing tools like rspec-puppet, or syntax highlighting for your favourite editor, lots of supporting tooling already exists

The future

The current preview release of the module supports EC2 instances, security groups and ELB load balancers, with support for VPC, Route53 and Auto Scaling groups coming soon. We’re looking for as much feedback as possible at the moment, so feel free to report issues on GitHub, ask questions on the puppet-user mailing list or contact me on Twitter at @garethr.


AWS Advent 2014 – Finding AWS Resources Across Regions, Services, and Accounts with skew

Our first AWS Advent post comes to us from Mitch Garnaat, the creator of the AWS Python library boto, who is currently herding clouds and devops over at Scopely. He’s gonna walk us through how we can discover more about our Amazon resources using the awesome tool he’s been building, called skew.

If you only have one account in AWS and you only use one service in one region, this article probably isn’t for you. However, if you are like me and manage resources in many accounts, across multiple regions, and in many different services, stick around.

There are a lot of great tools to help you manage your AWS resources. There is the AWS web console, the AWS CLI, various language SDKs like boto, and a host of third-party tools. The biggest problem I have with most of these tools is that they limit your view of resources to a single region, a single account, and a single service at a time. For example, you have to log in to the AWS console with one set of credentials representing a single account. And once you are logged in, you have to select a single region. And then, finally, you drill into a particular service. The AWS CLI and the SDKs follow this same basic model.

But what if you want to look at resources across regions? Across accounts? Across services? Well, that’s where skew comes in.

Skew

Skew is a Python library built on top of botocore. The main purpose of skew is to provide a flat, uniform address space for all of your AWS resources.

The name skew is a homonym for SKU (Stock Keeping Unit). SKUs are the numbers that show up on the barcodes of just about everything you purchase, and the SKU uniquely identifies the product in the vendor’s inventory. When you make a purchase they scan the barcode containing the SKU and can instantly find the pricing data for the item.

Similarly, skew uses a unique identifier for each one of your AWS resources and allows you to scan the SKU and quickly find the details for that resource. It also provides some powerful mechanisms to find sets of resources by allowing wildcarding and regular expressions within the SKUs.

ARN’t You Glad You Are Reading This?

So, what do we use for a unique identifier for all of our AWS resources? Well, as it turns out, AWS has already solved that problem for us. Each resource in AWS can be identified by an Amazon Resource Name, or ARN. The general form of an ARN is (the bracketed parts are placeholders, and the resource portion varies by service):
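
```
arn:aws:<service>:<region>:<account-id>:<resource>
```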

So, the ARN for an EC2 instance might look like this (the identifiers here are just examples):
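
```
arn:aws:ec2:us-west-2:123456789012:instance/i-12345678
```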

This tells us the instance is in the us-west-2 region, running in the account identified by the account number 123456789012 and the instance has an instance ID of i-12345678.

Getting Started With Skew

The easiest way to install skew is via pip.

Because skew is based on botocore, as is the AWS CLI, it will use the same credentials as those tools. You need to make a small addition to your ~/.aws/config file to help skew map AWS account IDs to the profiles in the config file. Check the README for details on that.

Let’s Find Some Stuff

Once we have skew installed and configured, we can use it to find resources based on their ARN’s. For example, using the example ARN above:
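
Something along these lines (a sketch; see the skew README for the authoritative version):

```python
import skew

# scan() returns an ARN object rather than the resource data itself.
arn = skew.scan('arn:aws:ec2:us-west-2:123456789012:instance/i-12345678')
print(arn)
```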

OK, that wasn’t very exciting. How do I get at my actual resource in AWS? Well, the scan method returns an ARN object and this object supports the iterator pattern in Python. This makes sense since, as we will see later, an ARN can actually return a lot of objects, not just one. So if we want to get our object we can:
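
Roughly like so:

```python
# Iterating over the ARN yields Resource objects; each one carries its
# identifier in .id and the raw AWS data in .data.
for resource in arn:
    print(resource.id)
    print(resource.data)
```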

Iterating on an ARN returns a list of Resource objects and each of these Resource objects represents one resource in AWS. Resource objects have a number of attributes like id and they also have an attribute called data that contains all of the data about that resource. This is the same information that would be returned by the AWSCLI or an SDK.

Wildcards And Regular Expressions

Finding a single resource in AWS is okay but one of the nice things about skew is that it allows you to quickly find lots of resources in AWS. And you don’t have to worry about which region those resources are in or in which account they reside.

For example, let’s say we want to find all EC2 instances running in all regions and in all of my accounts:
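
Something like the following sketch:

```python
import skew

# The * wildcard matches every region and every account configured in
# ~/.aws/config, and instance/* matches every instance ID.
instances = list(skew.scan('arn:aws:ec2:*:*:instance/*'))
print(len(instances))
```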

In that one little line of Python code, a lot of stuff is happening. Skew will iterate through all of the regions supported by the EC2 service and, in each region, will authenticate with each of the account profiles listed in your AWS config file. It will then find all EC2 instances and finally return the complete list of those instances as Resource objects.

In addition to wildcards, you can also use regular expressions as components in the ARN. For example:
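
A sketch that restricts the region component with a regular expression:

```python
import skew

# us-.* matches us-east-1, us-west-1, us-west-2, and so on.
for table in skew.scan('arn:aws:dynamodb:us-.*:*:table/*'):
    print(table.id)
```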

This will find all DynamoDB tables in all US regions for all accounts.

Some Useful Examples

Here are some examples of things you can do quickly and easily with skew that would be difficult in most other tools.

Find all unattached EBS volumes across all regions and accounts and tally the size of wasted space.
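
A rough sketch, assuming skew exposes EBS volumes under the ec2 service and using field names from the EC2 DescribeVolumes response:

```python
import skew

# Tally the size (in GiB) of every EBS volume that has no attachments,
# across all regions and accounts.
wasted_gb = 0
for volume in skew.scan('arn:aws:ec2:*:*:volume/*'):
    if not volume.data.get('Attachments'):
        wasted_gb += volume.data.get('Size', 0)
print('Unattached EBS storage: %d GiB' % wasted_gb)
```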

Audit all EC2 security groups to find CIDR rules that are not whitelisted.

Find all EC2 instances that are not tagged in any way.
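
A similar sketch, keying off the Tags field in the instance data:

```python
import skew

# Print the ID of every EC2 instance that has no tags at all.
for instance in skew.scan('arn:aws:ec2:*:*:instance/*'):
    if not instance.data.get('Tags'):
        print(instance.id)
```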

Building ARN’s Interactively

The ARN provides a great way to uniquely identify AWS resources but it doesn’t exactly roll off the tongue. Skew provides some help for constructing ARN’s interactively.

First, start off with a new ARN object.

Each ARN object contains 6 components:

  • scheme – for now this will always be arn
  • provider – again, for now always aws
  • service – the Amazon Web Service
  • region – the AWS region
  • account – the ID of the AWS account
  • resource – the resource type and resource ID

All of these are available as attributes of the ARN object.

If you want to build up the ARN interactively, you can ask each of the components what choices are available.

You can also try out your regular expressions to make sure they return the results you expect.

To set the value of a particular component, use the pattern attribute.

Once you have the ARN that you want, you can enumerate it like this:
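
Putting those steps together, a session might look roughly like this (consult the skew README for the exact interface):

```python
from skew.arn import ARN

arn = ARN()

# Ask a component what values are available...
print(arn.service.choices())

# ...then set components via the pattern attribute. Regular
# expressions and wildcards work here too.
arn.service.pattern = 'ec2'
arn.region.pattern = 'us-.*'
arn.resource.pattern = 'instance/*'

# Finally, enumerate the resources the finished ARN matches.
for resource in arn:
    print(resource.id)
```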

Running Queries Against Returned Data

A recent feature of skew allows you to run queries against the resource data. This feature makes use of jmespath, which is a really nice JSON query engine. It was originally written in Python for use in the AWS CLI but is now available in a number of other languages. If you have ever used the --query option of the AWS CLI, then you have used jmespath.

If you append a jmespath query to the end of the ARN (using a | as a separator), skew will send the data for each of the returned resources through the jmespath query and store the result in the filtered_data attribute of the resource object. The original data is still available as the data attribute. For example:
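
A sketch of a query that pulls out just the instance type:

```python
import skew

# Everything after the | is a jmespath expression applied to each
# resource's data.
for instance in skew.scan('arn:aws:ec2:*:*:instance/*|InstanceType'):
    print(instance.filtered_data)        # e.g. 't2.micro'
    print(instance.data['InstanceId'])   # the full data is still there
```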

Then each resource returned would have the instance type stored in the filtered_data attribute of the Resource object. This is obviously a very simple example, but jmespath is very powerful and the interactive query tool available at http://jmespath.org/ allows you to try your queries out beforehand to get exactly what you want.

CloudWatch Metrics

One other feature of skew is easy access to CloudWatch metrics for AWS resources. If we refer back to the very first interactive session in the post, we can show how you would access those CloudWatch metrics for the instance.

We can find the available CloudWatch metrics with the metric_names attribute and then we can retrieve the desired metric using the get_metric_data method. The README for skew contains a bit more information about accessing CloudWatch metrics.
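
Roughly, something like this; the exact get_metric_data signature may differ, so check the README:

```python
import skew

arn = skew.scan('arn:aws:ec2:us-west-2:123456789012:instance/i-12345678')
for instance in arn:
    # Which CloudWatch metrics exist for this resource?
    print(instance.metric_names)
    # Fetch the data points for one of them.
    print(instance.get_metric_data('CPUUtilization'))
```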

Wrap Up

Skew is pretty new and is still changing a lot. It currently supports only a subset of available AWS resource types but more are being added all the time. If you manage a lot of AWS resources, I encourage you to give it a try. Feedback, as always, is very welcome as are pull requests!


AWS Advent 2014 – High-Availability in AWS with keepalived, EBS and Elastic Network Interfaces

Today’s post on how to achieve high availability in AWS with keepalived comes to us from Julian Dunn, who’s currently helping improve things at Chef.

Introduction

By now, most everyone knows that running infrastructure in AWS is not the same as a traditional data center, giving the lie to claims that you can just “lift and shift to the cloud”. In AWS, one normally achieves “high-availability” by scaling horizontally. For example, if you have a WordPress site, you could create several identical WordPress servers and put them all behind an Elastic Load Balancer (ELB), and connect them all to the same database. That way, if one of these servers fails, the ELB will stop directing traffic to it, but your site will still be available.

But about that database – isn’t it also a single-point-of-failure? You can’t very well pull the same horizontal-redundancy trick for services that explicitly have one writer (and potentially many readers). For a database, you could probably use Amazon Relational Database Server (RDS), but suppose Amazon doesn’t have a handy highly-available Platform-as-a-Service variant for the service you need?

In this post, I’ll show you how to use that old standby, keepalived, in conjunction with Virtual Private Cloud (VPC) features, to achieve real high-availability in AWS for systems that can’t be horizontally replicated.

Kit of Parts

To create high-availability out of two (or more) systems, you need the following components:

  • A service IP (commonly referred to as a VIP, for virtual IP) that can be moved between the systems to which client systems will communicate
  • A block device containing data served by the currently-active system that can be detached and reattached to others, should the active one fail
  • Some kind of cluster coordination system to handle master/backup election, as well as doing all the housekeeping to move the service IP and block device to the active node.

In AWS, we’ll use:

  • Private secondary addresses on an Elastic Network Interface (ENI) as the service IP.
  • A separate Elastic Block Storage (EBS) volume as the block device
  • keepalived as the cluster coordination system.

There are a few limitations to this approach in AWS. Most important is that all instances and the block storage device must live in the same VPC subnet, which implies that they live in the same availability zone (AZ).

Just Enough keepalived for HA

Keepalived for Linux has been around for over ten years, and while it is very robust and reliable, it can be very difficult to grasp because it is designed for a variety of use cases, some very distinct from the one we are going to implement. Software design diagrams like this one do not necessarily aid in understanding how it works.

For the purposes of building an HA system, you need only know a few things about keepalived:

  • As previously mentioned, keepalived serves as a cluster coordination system between two or more peers.
  • Keepalived uses the Virtual Router Redundancy Protocol (VRRP) for assigning the service IP to the active instance. It does this by talking to the Linux netlink layer directly. Thus, don’t try to use ifconfig to examine whether the master’s interface has the VIP, as ifconfig doesn’t use netlink system calls and the VIP won’t show up! Use ip addr instead.
  • VRRP is normally run over multicast in a closed network segment. However, in a cloud environment where multicast is not permitted, we must use unicast, which implies that we need to list all peers participating in the cluster.
  • Keepalived has the ability to invoke external scripts whenever a cluster member transitions from backup to master (or vice-versa). We will use this functionality to associate and mount the EBS block device (or the inverse, when transitioning from master to backup).

Building the HA System

We’ll spin up two identical systems in the same VPC subnet for our master and backup nodes. To avoid passing AWS access and secret keys to the systems, I’ve created an IAM instance profile & role called awsadvent-ha with a policy document to let the systems manage ENI addresses and EBS volumes:
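
A minimal version of such a policy might look something like this, expressed here as a Python dictionary; tighten the actions and resources for production use.

```python
import json

# One plausible minimal policy for the awsadvent-ha role: enough to
# move a secondary private IP between ENIs and re-attach the shared
# EBS volume.
ha_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "ec2:DescribeInstances",
            "ec2:DescribeNetworkInterfaces",
            "ec2:DescribeVolumes",
            "ec2:AssignPrivateIpAddresses",
            "ec2:UnassignPrivateIpAddresses",
            "ec2:AttachVolume",
            "ec2:DetachVolume"
        ],
        "Resource": "*"
    }]
}

print(json.dumps(ha_policy, indent=2))
```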

For this exercise I used Fedora 21 AMIs, because Fedora has a recent-enough version of keepalived with VRRP-over-unicast support:

You’ll notice that one of the security groups I’ve placed the machines into is entitled internal-icmp, which is a group I created to allow the instances to ping each other (send ICMP Echo Request and receive ICMP Echo Reply). This is what keepalived will use as a heartbeat mechanism between nodes.

We also need a separate EBS volume for the data, so let’s create one in the same AZ as the instances:

Note that the volume needs to be partitioned and formatted at some point; I don’t do that in this tutorial.

Installing and configuring keepalived

Once the two machines are up and reachable, it’s time to install and configure keepalived. SSH to them and type:

I intend to write the external failover scripts called by keepalived in Ruby, so I’m going to install that, and the fog gem that will let me communicate with the AWS API:

keepalived is configured using the /etc/keepalived/keepalived.conf file. Here’s the configuration I used for this demo:

A couple of notes about this configuration:

  • 172.31.40.96 is the current machine; 172.31.40.95 is its peer. The peer has the IPs reversed in the unicast_srcip and unicast_peer clauses, so make sure to change this. (A configuration management system sure would help here…)
  • 172.31.36.57 is the virtual IP address which will be bound as a secondary IP address to the active master’s Elastic Network Interface. You can pick anything unused in your subnet.

The notify script, awsha.rb

As previously mentioned, the external script is invoked whenever a master-to-backup or backup-to-master event occurs, via the notify_backup and notify_master directives in keepalived.conf. Upon receiving an event, it will associate and mount (or unmount and disassociate) the EBS volume from the instance, and attach or release the ENI secondary address.

The script is too long to reproduce inline here, so I’ve included it as a separate Gist.

Note: For brevity, I’ve eliminated a lot of error-handling from the script, so it may or may not work out-of-the-box. In a real implementation, you need to check for many error conditions like open files on a disk volume, poll for the EC2 API to attach/release the volume, etc.
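
To give a feel for what the master-transition branch does, here is a much-simplified sketch in Python with boto3 rather than the Ruby/fog of the real script; the region, ENI, volume and instance IDs below are placeholders, and all of the error handling and polling mentioned above is omitted.

```python
# Minimal "become master" steps: claim the VIP as a secondary private
# IP on this instance's ENI, then attach and mount the shared volume.
import subprocess
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

VIP = "172.31.36.57"
ENI_ID = "eni-0123456789abcdef0"      # this instance's network interface
VOLUME_ID = "vol-0123456789abcdef0"   # the shared data volume
INSTANCE_ID = "i-0123456789abcdef0"

def become_master():
    # Move the service IP to this node (steals it from the old master).
    ec2.assign_private_ip_addresses(
        NetworkInterfaceId=ENI_ID,
        PrivateIpAddresses=[VIP],
        AllowReassignment=True,
    )
    # Attach the shared data volume and mount it.
    ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=INSTANCE_ID,
                      Device="/dev/xvdf")
    subprocess.check_call(["mount", "/dev/xvdf", "/mnt"])

become_master()
```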

Putting it all together

Start keepalived on both servers:

One of them will elect itself the master, assign the ENI secondary IP to itself, and attach and mount the block device on /mnt. You can see which is which by checking the service status:

The other machine will say that it’s transitioned to backup state:

To force a failover, stop keepalived on the current master. The backup system will detect that the master went away, and transition to primary:

After a while, the backup should be reachable on the VIP, and have the disk volume mounted under /mnt.

If you now start keepalived on the old master, it should come back online as the new backup.

Wrapping Up

As we’ve seen, it’s not always possible to architect systems in AWS for horizontal redundancy. Many pieces of software, particularly those involving one writer and many readers, cannot be set up this way.

In other situations, it’s not desirable to build horizontal redundancy. One real-life example is a highly-available large shared cache system (e.g. squid or varnish) where it would be costly to rebuild terabytes of cache on instance failure. At Chef Software, we use an expanded version of the tools shown here to implement our Chef Server High-Availability solution.

Finally, I also found this presentation by an AWS solutions architect in Japan very useful in identifying what L2 and L3 networking technologies are available in AWS: http://www.slideshare.net/kentayasukawa/ip-multicast-on-ec2


AWS Advent 2014 – SparkleFormation: Build infrastructure with CloudFormation without losing your sanity.

Today’s post on taming CloudFormation with SparkleFormation, comes to us from Cameron Johnston of Heavy Water Operations.

The source code for this post can be found at https://github.com/hw-labs/sparkleformation-starter-kit

Introduction

This article assumes some familiarity with CloudFormation concepts such as stack parameters, resources, mappings and outputs. See the AWS Advent CloudFormation Primer for an introduction.

Although CloudFormation templates are billed as reusable, many users will attest that as these monolithic JSON documents grow larger, they become “all encompassing JSON file[s] of darkness,” and actually reusing code between templates becomes a frustrating copypasta exercise.

From another perspective these JSON documents are actually just hashes, and with a minimal DSL we can build these hashes programmatically. SparkleFormation provides a Ruby DSL for merging and compiling hashes into CFN templates, and helpers which invoke CloudFormation’s intrinsic functions (e.g. Ref, Attr, Join, Map).

SparkleFormation’s DSL implementation is intentionally loose, imposing little of its own opinion on how your template should be constructed. Provided you are already familiar with CloudFormation template concepts and some minimal amount of Ruby, the rest is merging hashes.

Templates

Just as with CloudFormation, the template is the high-level object. In SparkleFormation we instantiate a new template like so:

But an empty template isn’t going to help us much, so let’s step into it and at least insert the required AWSTemplateFormatVersion specification:

In the above case we use the _set helper method because we are setting a top-level key with a string value. When we are working with hashes we can use a block syntax, as shown here adding a parameter to the top-level Parameters hash that CloudFormation expects:

Reusability

SparkleFormation provides primitives to help you build templates out of reusable code, namely:

  • Components
  • Dynamics
  • Registries

Components

Here’s a component we’ll name environment which defines our allowed environment parameter values:

Resources, parameters and other CloudFormation configuration written into a SparkleFormation component are statically inserted into any templates using the load method. Now all our stack templates can reuse the same component so updating the list of environments across our entire infrastructure becomes a snap. Once a template has loaded a component, it can then step into the configuration provided by the component to make modifications.

In this template example we load the environment component (above) and override the allowed values for the environment parameter the component provides:

Dynamics

Whereas components are loaded once at the instantiation of a SparkleFormation template, dynamics are inserted one or more times throughout a template. They iteratively generate unique resources based on the name and optional configuration they are passed when inserted.

In this example we insert a launch_config dynamic and pass it a config object containing a run list:

The launch_config dynamic (not pictured) can then use intrinsic functions like Fn::Join to insert data passed in the config deep inside a launch configuration, as in this case where we want our template to tell Chef what our run list should be.

Registries

Similar to dynamics, a registry entry can be inserted at any point in a SparkleFormation template or dynamic. For example, a registry entry can be used to share the same metadata between both AWS::AutoScaling::LaunchConfiguration and AWS::EC2::Instance resources.

Translating a ghost of AWS Advent past

This JSON template from a previous AWS Advent article provisions a single EC2 instance into an existing VPC subnet and security group:

Not terrible, but the JSON is a little hard on the eyes. Here’s the same thing in Ruby, using SparkleFormation:

Without taking advantage of any of SparkleFormation’s special capabilities, this translation is already a few lines shorter and easier to read as well. That’s a good start, but we can do better.

The template format version specification and parameters required for this template are common to any stack where EC2 compute resources may be used, whether they be single EC2 instances or Auto Scaling groups, so let’s take advantage of some SparkleFormation features to make them reusable.

Here we have a base component that inserts the common parameters into templates which load it:

Now that the template version and common parameters have moved into the new base component, we can make use of them by loading that component as we instantiate our new template, specifying that the template will override any pieces of the component where the two intersect.

Let’s update the SparkleFormation template to make use of the new base component:

Because the base component includes the parameters we need, the template no longer explicitly describes them.

Advanced tips and tricks

Since SparkleFormation is Ruby, we can get a little fancy. Let’s say we want to build 3 subnets into an existing VPC. If we know the VPC’s /16 subnet we can provide it as an environment variable (export VPC_SUBNET="10.1.0.0/16"), and then call that variable in a template that generates additional subnets:

Of course we could place the subnet and route table association resources into a dynamic, so that we could just call the dynamic with some config:

Okay, this all sounds great! But how do I operate it?

SparkleFormation by itself does not implement any means of sending its output to the CloudFormation API. In this simple case, a SparkleFormation template named ec2_example.rb is output to JSON which you can use with CloudFormation as usual:

The knife-cloudformation plugin for Chef’s knife command adds sub-commands for creating, updating, inspecting and destroying CloudFormation stacks described by SparkleFormation code or plain JSON templates. Using knife-cloudformation does not require Chef to be part of your toolchain, it simply leverages knife as an execution platform.

Advent readers may recall a previous article on strategies for reusable CloudFormation templates which advocates a “layer cake” approach to deploying infrastructure using CloudFormation stacks:

The overall approach is that your templates should have sufficient parameters and outputs to be re-usable across environments like dev, stage, qa, or prod and that each layer’s template builds on the next.

Of course this is all well and good, until we find ourselves, once again, copying and pasting. This time it’s stack outputs instead of JSON, but again, we can do better.

The recent 0.2.0 release of knife-cloudformation adds a new --apply-stack parameter which makes operating “layer cake” infrastructure much easier.

When passed one or more instances of --apply-stack STACKNAME, knife-cloudformation will cache the outputs of the named stack and use the values of those outputs as the default values for parameters of the same name in the stack you are creating.

For example, a stack “coolapp-elb” which provisions an ELB and an associated security group has been configured with the following outputs:

The values from the ElbName and ElbSecurityGroup would be of use to us in attaching an app server auto scaling group to this ELB, and we could use those values automatically by setting parameter names in the app server template which match the ELB stack’s output names:

Once our coolapp_asg template uses parameter names that match the output names from the coolapp-elb stack, we can deploy the app server layer “on top” of the ELB layer using --apply-stack:

Similarly, if we use a SparkleFormation template to build our VPC, we can set a number of VPC outputs that will be useful when building stacks inside the VPC:

This ‘apply stack’ approach is just the latest way in which the SparkleFormation tool chain can help you keep your sanity when building infrastructure with CloudFormation.

Further reading

I hope this brief tour of SparkleFormation’s capabilities has piqued your interest. For some AWS users, the combination of SparkleFormation and knife-cloudformation helps to address a real pain point in the infrastructure-as-code tool chain, easing the development and operation of layered infrastructure.

Here’s some additional material to help you get started:


AWS Advent 2014 – Integrating AWS with Active Directory

Today’s post on Integrating AWS with Active Directory comes to us from Roger Siggs, who currently helps architect clouds at DataLogix.

Introduction

One of the most popular directory services available is Microsoft’s Active Directory. Active Directory serves as the authoritative system to coordinate access between users and their devices to other resources which could include internal applications, servers, or cloud-based systems and applications. The challenge with AD is not that it is inherently a bad piece of software, but more that the dynamics of the IT landscape have changed. In the past few years there has been a dramatic shift to cloud-based infrastructure. IT applications and an organization’s server infrastructure is increasingly based in the cloud. As a result, how does AD manage that remote infrastructure?

Previous security models were built with the idea of protecting the on-premises environment from outside attacks. How does an Active Directory based model support that model when the environment spans a far larger footprint? The second major trend is toward heterogeneous computing environments. AD was introduced to answer the problems of enterprises that were primarily Windows based.

That is no longer true, with Macs and Linux devices infiltrating all sizes of organizations, not to mention tablets and phones. Trends such as these are significant issues, and IT admins are struggling with how to leverage them alongside legacy software and infrastructure. A key part of that struggle is how to connect and manage their employees, their devices, and their IT applications. AD doesn’t let them connect everything together – at least not without significant effort.

Amazon Web Services offers several different methods to provision user access and permissions management through its Identity and Access Management (IAM) service but as a user repository it is lacking in many important features for larger enterprises. The need for more granular levels of access and control on the actual instance, as well as the need to connect and manage employees, devices, and legacy applications, often requires the use of an on-premises directory to provide a centralized and authoritative list of employees, their roles, and their access rights. For many organizations, this on-premises directory is Active Directory. To support their growing customer base, Amazon has released several different methods to integrate your existing directory services with AWS.

Integration Types

The ‘simplest’ form of leveraging AD services into AWS is to just extend your existing footprint into your Amazon environment. This answers some issues around latency for logins, as well as providing relatively quick and easy support for disaster recovery and scalability. Using Cloud-init and various configuration management tools (Puppet, Chef, PowerShell, etc.), instances can be deployed and automatically join an existing domain, centralizing user management at the instance level. A good working knowledge of AWS services (in particular security group configuration and DHCP Option Sets) is required to ensure replication and other AD-specific functionality is supported. The AWS Reference Architecture available here provides much greater detail in both the exact process and step-by-step methodology for this type of integration. This method, while quick and fairly simple, does not allow for access to the back-end systems of AWS. Extending your AD infrastructure in this fashion replicates your existing management processes, but does not provide API or console access to AWS.

Another common form of Directory Service integration is an SSO-based, Federation model. Federation allows for delegated access to AWS resources using a third-party authentication resource. With identity federation, external identities (federated users) are granted secure access to resources in your AWS account without having to create IAM users. These external identities can come from your corporate identity provider (e.g. Active Directory) or from a web identity provider, such as Amazon Cognito, Login with Amazon, Facebook, Google or any OpenID Connect (OIDC) compatible provider. This allows for users to retain their existing set of usernames, passwords and authentication credentials, while still accessing the AWS resources they need to perform their roles. Depending upon the roles allowed to an authenticated user, this method can provide Console Access, API Access (through the STS GetFederationToken API call), and even Workspaces and Zocalo access.

Federation with Active Directory is configured using SAML (Security Assertion Markup Language) to create a connection between an Identity Provider (IDP), and a Service Provider (SP). In this instance, Active Directory is the IDP, and AWS the SP. This process, detailed below, allows for secure and granular access based on the requesting users role within the organization, and the capabilities that role is allowed within the AWS environment.

 


 

  1. The user browses to the internal federation resource server.

  2. If the user is logged into a computer joined to the AD domain and their web browser supports Windows authentication, they will be authenticated using Windows integrated authentication. If the user is not logged into a computer joined to the domain, they will be prompted for their Windows username and password. The proxy determines the Windows username from the web request and uses this when making the session request.

After an AD user is authenticated by the proxy the following occurs:

  1. The proxy retrieves a list of the user’s AD group membership.

  2. The proxy retrieves IAM user credentials from a web configuration file (web.config) configured during setup. By default, the sample encrypts the secret access key using Windows Cryptographic Services. The proxy uses these credentials to call the ListRoles API requesting a list of all the IAM roles in the AWS account created during setup.

  3. The response includes a list of all the IAM roles available within the AWS account for the requesting user.

  4. The proxy determines user entitlements by taking the list of AD groups and the list of IAM roles and determining the intersection of these two lists based on name mapping. The proxy takes the intersection and populates a drop-down box with all the available roles for the current user. The user selects the role they want to use while logging into the AWS Management Console. Note: if the user is not a member of any AD groups that match a corresponding IAM role, the user will be notified that no roles are available and access will be denied.

  5. Using the Amazon Resource Name (ARN) of the selected role, the proxy uses the credentials of the IAM user to make an AssumeRole request (see the sketch after this list). The request includes setting the ExternalId property using the security identifier (SID) of the AD group that matches the name of the role. This adds an additional layer of verification in the event the AD group is ever deleted and recreated using the same display name. By default the expiration is set to the maximum of 3600 seconds.

  6. The proxy receives a session from Amazon Security Token Service (STS) that includes temporary credentials: access key, secret key, expiration and session token.

  7. (ADFS specific) The proxy uses the session token along with the SignInURL and ConsoleURL from the web configuration file (web.config) to generate a temporary sign-in url.

  8. Finally, the user is redirected to the temporary sign-in URL, which automatically logs them into the AWS Management Console. The console (or API) session is valid until it expires.
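
As a rough illustration of steps 5 and 6, the AssumeRole call looks something like this sketch, shown with Python and boto3 rather than the sample's own code; the role ARN, session name and group SID are placeholders.

```python
# Assume the selected role, passing the AD group SID as the
# ExternalId, and receive temporary credentials in return.
import boto3

sts = boto3.client("sts")  # runs with the proxy's IAM user credentials

response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/ADGroup-PowerUsers",
    RoleSessionName="jdoe",
    ExternalId="S-1-5-21-1111111111-2222222222-3333333333-1234",
    DurationSeconds=3600,
)

creds = response["Credentials"]
# creds contains AccessKeyId, SecretAccessKey, SessionToken, Expiration.
print(creds["Expiration"])
```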

Federation methods can become very complex, depending on the individual use case. Amazon has a large amount of documentation around this feature, but a good starting point is the IAM ‘Manage Federation’ topic available here.

The third method of integration is the new AWS Directory Service. This is a cloud-based, managed service that allows for a direct connection between your existing AD environment and your AWS resources. This service has two different directory types: the ‘AD Connector’ for existing systems, and the ‘Simple AD’ directory type for new, cloud-only environments. The ‘AD Connector’ serves as a proxy between your on-premises infrastructure and AWS and eliminates the need for federation services. To use the AWS Directory Service, you must have AWS Direct Connect or another secure VPN connection into an AWS VPC (Virtual Private Cloud). The AD Connector allows you to provision access to Amazon Workspaces and Amazon Zocalo, and to provide access to the AWS console to existing groups in your Active Directory structure. Access is also automatically updated in the event of organizational changes (employee terminations, promotions, team changes) to your AD environment. Additionally, your existing security policies – password expiration, password history, account lockouts and the like – are all enforced and managed from a central location.

Conclusion

There are almost as many methods to address authentication and authorization needs as there are companies who need the problem resolved. With AWS, existing organizations have a number of resources available to custom tailor their hybrid infrastructure to meet the needs of their employees and customers moving forward, without sacrificing the security, stability, and governance that is the hallmark of an on-premises environment. This overview of the topic will hopefully provide some direction for IT Administrators looking to answer the question of how their identity management systems will bridge the gap between yesterday and today.


AWS Advent 2014 – Advanced Network Resilience in VPCs with Consul

Today’s post on building Advanced Network Resilience in AWS VPCs with Consul comes to us from Sam Bashton.

At my company, we’ve been using AWS + VPC for three years or so. On day one of starting to build out an infrastructure within it we sent an email to our Amazon contact asking for ‘a NAT equivalent of an Internet Gateway’ – an AWS managed piece of infrastructure that would do NAT for us. We’re still waiting.

In the mean time, we’ve been through a couple of approaches to providing network resilience for NAT devices. As we’re now using Consul for service discovery everywhere, when we came to re-visiting how to provide resilience at the network layer, it made sense for us to utilise the feature-set it provides.

Autoscaling NAT/bastion instances

For our application to function, it needs to have outbound Internet connectivity at all times. Originally, we provided for this by having one NAT instance per AZ, and having healthchecks fail if this was not available. This meant that a failed NAT instance took down a whole AZ – something that the infrastructure had been designed to cope with, but not ideal, as it meant losing half or a third of capacity until the machine was manually re-provisioned.

The approach I set out below allows us to have NAT provided by instances in an autoscaling group, with minimal downtime in the event of instance failure. This means we now don’t need to worry about machines ‘scheduled for retirement’, being able to terminate them at will.

In this example, we set up a three node consul cluster. One node will be elected as the NAT instance, and will take over NAT duties. A simplistic health check is provided to ensure this instance has Internet access; it sends a ping to google.com and checks for a response. In the event of the node failing in any way, another will quickly step in and take over routing.

In practice, if you already have a consul cluster, you would only need two NAT instances to be running and retain fast failover.

You can try out this setup by using the CloudFormation template at https://s3-eu-west-1.amazonaws.com/awsadvent2014/advent.json

The template only has AMIs defined for us-west-2 and eu-west-1, so you’ll need to launch in one of those regions.

This setup relies on a python script ( https://s3-eu-west-1.amazonaws.com/awsadvent2014/instance.py ) as a wrapper around consul. It discovers the other nodes for consul to connect to via the AWS API, and uses consul session locking to get the cluster to agree on which machine should be the NAT device.
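
The takeover itself boils down to a couple of EC2 API calls. Here is a simplified sketch of the idea in Python with boto3; the route table ID is a placeholder, and the real script does this under a consul session lock rather than unconditionally.

```python
# When this node wins the lock, point the private subnets' default
# route at this instance and disable source/dest checking so it can
# forward traffic as a NAT device.
import boto3
import requests

ec2 = boto3.client("ec2", region_name="eu-west-1")

# This instance's ID, from the EC2 metadata service.
instance_id = requests.get(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2).text

ROUTE_TABLE_ID = "rtb-0123456789abcdef0"  # private subnets' route table

ec2.modify_instance_attribute(
    InstanceId=instance_id,
    SourceDestCheck={"Value": False},
)
ec2.replace_route(
    RouteTableId=ROUTE_TABLE_ID,
    DestinationCidrBlock="0.0.0.0/0",
    InstanceId=instance_id,
)
```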

Hopefully this example gives you enough building blocks to go and implement something similar for your environment.


AWS Advent 2014 – An introduction to DynamoDB

Today’s awesome post on DynamoDB comes to us from Jharrod LaFon.

DynamoDB is a powerful, fully managed, low latency, NoSQL database service provided by Amazon. DynamoDB allows you to pay for dedicated throughput, with predictable performance for “any level of request traffic”. Scalability is handled for you, and data is replicated across multiple availability zones automatically. Amazon handles all of the pain points associated with managing a distributed datastore for you, including replication, load balancing, provisioning, and backups. All that is left is for you to take your data, and its access patterns, and make it work in the denormalized world of NoSQL.

Modeling your data

The single most important part of using DynamoDB begins before you ever put data into it: designing the table(s) and keys. Keys (Amazon calls them primary keys) can be composed of one attribute, called a hash key, or a compound key called the hash and range key. The key is used to uniquely identify an item in a table. The choice of the primary key is particularly important because of the way that Amazon stores and retrieves the data. Amazon shards (partitions) your data internally, based on this key. When you pay for provisioned throughput, that throughput is divided across those shards. If you create keys based on data with too little entropy, then your key values will be similar. If your key values are too similar, so that they hash to the same shard, then you are limiting your own throughput.

Choosing an appropriate key requires that you structure your DynamoDB table appropriately. A relational database uses a schema that defines the primary key, columns, and indexes. DynamoDB on the other hand, only requires that you define a schema for the keys. The key schema must be defined when you create the table. Individual parts of an item in DynamoDB are called attributes, and those attributes have data types (basic scalar types are Number, String, Binary, and Boolean). When you define a key schema, you specify which attributes to use for a key, and their data types.

DynamoDB supports two types of primary keys, a Hash Key and a Hash and Range Key.

  • Hash Key consists of a single attribute that uniquely identifies an item.
  • Hash and Range Key consists of two attributes that together, uniquely identify an item.

Denormalization

If you’ve spent some time with relational databases, then you have probably heard of normalization, which is the process of structuring your data to avoid storing information in more than one place. Normalization is accomplished by storing data in separate tables and then defining relationships between those tables. Data retrieval is possible because you can join all of that data using the flexibility of a query language (such as SQL).

DynamoDB, being a NoSQL database, does not support SQL. So instead of normalizing our data, we denormalize it (and eliminate the need to join). A full discussion of denormalization is beyond the scope of this introductory tutorial, but you can read more about it in DynamoDB’s developer guide.

Accessing data

Performing operations in DynamoDB consumes throughput (which you pay for), and so you should structure your application with that in mind. Individual items in DynamoDB can be retrieved, updated, and deleted. Conditional updates are also supported, which means that the write or update only succeeds if the condition specified is successful. Operations can also be batched for efficiency.

Two other operations that return multiple items are also supported: query and scan. A query returns items in a table using a hash key value, and optionally a range key value or condition. A scan operation examines every item in a table, optionally filtering items before returning them.

Indexes

Items are accessed using their primary key, but you can also use indexes. Indexes provide an alternative (and performant) way of accessing data. Each index has its own primary key and that key is used when performing index lookups. Tables can have multiple indexes, allowing your application to retrieve table data according to its needs. DynamoDB supports two types of indexes.

  • Local secondary index: An index that uses the table’s Hash Key, but can use an alternate range key. Using these indexes consumes throughput capacity from the table.
  • Global secondary index: An index that uses a Hash and Range Key that can be different from the table’s. These indexes have their own throughput capacity, separate from the table’s.

A global secondary index is called global because it applies to the entire table, and secondary because the first real index is the primary hash key. In contrast, local secondary indexes are said to be local to a specific hash key. In that case you could have multiple items with the same hash key, but different range keys, and you could access those items using only the hash key.

Consistency

DynamoDB is a distributed datastore, storing replicas of your data to ensure reliability and durability. Synchronizing those replicas takes time, and may not always be immediately necessary. Because of this, DynamoDB allows the user to specify the desired consistency for reading data. There are two types of consistency available.

  • Eventually consistent reads: This is better for read throughput, but you might read stale data.
  • Strongly consistent reads: Used when you absolutely need the latest result.

Limits

DynamoDB is a great service, but it does have limits.

  • Individual items cannot exceed 400 KB.
  • Item collections (all items sharing a hash key, in tables with local secondary indexes) cannot exceed 10 GB.
  • If you exceed your provisioned throughput, your requests may be throttled.

A simple example

To help understand how to use DynamoDB, let’s look at an example. Suppose that you wanted to store web session data in DynamoDB. An item in your table might have the following attributes.

Attribute Name    Data Type
session_id        String
user_id           Number
session_data      String
last_updated      String
created           String

In this example, each item consists of a unique session ID, an integer user ID, the content of the session as a String, and timestamps for the creation of the session and the last time it was updated.

The simplest way to access our table is to use a Hash Key. We can use the session_id attribute for the Hash Key because it is unique. To look up any session in our session table, we can retrieve it by using its session_id.

Accessing DynamoDB

DynamoDB is provided as an HTTP API. There are multiple libraries that provide a higher level abstraction over the HTTP API, and in many different languages. Python is my primary language, and so I’ll be using Python for this tutorial. Amazon has created boto, their official Python interface. For this example however, I will be using PynamoDB, which has succinct, ORM-like syntax (disclaimer: I wrote PynamoDB).

Installing PynamoDB

You can install PynamoDB directly from PyPI.
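For example, using pip:

    pip install pynamodb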

Specifying your table and key schema

PynamoDB allows you to specify your table attributes and key schema by defining a class with attributes.
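A model for the session table discussed earlier might look something like the following sketch (the table name and capacity values here are assumptions, not requirements):

    from pynamodb.models import Model
    from pynamodb.attributes import (
        UnicodeAttribute, NumberAttribute, UTCDateTimeAttribute
    )

    class Session(Model):
        """A user session, keyed by its unique session ID."""
        class Meta:
            table_name = 'sessions'       # assumed table name
            read_capacity_units = 10      # assumed initial capacity
            write_capacity_units = 10
        session_id = UnicodeAttribute(hash_key=True)
        user_id = NumberAttribute()
        session_data = UnicodeAttribute()
        last_updated = UTCDateTimeAttribute()
        created = UTCDateTimeAttribute()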

The Session class defined above specifies the schema of our table, and its primary key. In just a few lines of code we’ve defined the attributes for a session item as discussed in the table above. PynamoDB provides customized attributes such as the UTCDateTimeAttribute as a convenience, which stores the timestamp as a String in DynamoDB.

The Meta class attributes specify the name of our table, as well as our desired capacity. With DynamoDB, you pay for read and write capacity (as well as data storage), so you need to decide how much capacity you want initially. It’s not fixed, however: you can always scale the capacity up or down for your table using the API. It’s worth reading more about capacity in the official documentation if you plan on using DynamoDB for your project.

Creating a table

Now that we know how our data is structured, and what type of key we will use, let’s create a table.
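With a model like the one sketched above, creating the table is a one-liner; the capacity is taken from the Meta class.

    # Create the table and block until it is ready for use.
    Session.create_table(wait=True)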

Tables created in DynamoDB may not be available for use immediately (as Amazon is provisioning resources for you) and the wait argument above specifies that we would like the function to block until the table is ready.

Reading & Writing Data

Now that we have a table we can use, let’s store a session in it.
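Along these lines (the session ID and session data shown are placeholders):

    from datetime import datetime, timezone

    now = datetime.now(timezone.utc)
    session = Session(
        session_id='6cdc661f-f36a-4a2e-a78a-d0da8a0f94b4',  # placeholder ID
        user_id=1234,
        session_data='{"cart": []}',                         # placeholder payload
        created=now,
        last_updated=now,
    )
    session.save()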

Our session is now saved in the fully managed DynamoDB service, and can be retrieved just as easily.
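For example, a single get by hash key:

    session = Session.get('6cdc661f-f36a-4a2e-a78a-d0da8a0f94b4')
    print(session.user_id, session.created)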

A less simple example

Building upon the previous example, let’s make it more useful. Suppose that in addition to being able to retrieve an individual session, you wanted to be able to retrieve all sessions belonging to a specific user. The simple answer is to create a global secondary index.

Here is how we can define the table with a global secondary index using PynamoDB.
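Something along these lines (the index name, projection, and capacity values are assumptions):

    from pynamodb.models import Model
    from pynamodb.attributes import (
        UnicodeAttribute, NumberAttribute, UTCDateTimeAttribute
    )
    from pynamodb.indexes import GlobalSecondaryIndex, AllProjection

    class UserIndex(GlobalSecondaryIndex):
        """Look up sessions by user_id instead of session_id."""
        class Meta:
            index_name = 'user_index'     # assumed index name
            read_capacity_units = 10      # the index has its own capacity
            write_capacity_units = 10
            projection = AllProjection()  # project all attributes into the index
        user_id = NumberAttribute(hash_key=True)

    class Session(Model):
        class Meta:
            table_name = 'sessions'
            read_capacity_units = 10
            write_capacity_units = 10
        session_id = UnicodeAttribute(hash_key=True)
        user_id = NumberAttribute()
        session_data = UnicodeAttribute()
        last_updated = UTCDateTimeAttribute()
        created = UTCDateTimeAttribute()
        user_index = UserIndex()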

This might seem complicated, but it really isn’t. The Session class is defined as before, but with an extra user_index attribute. That attribute is defined by the UserIndex class, which defines the key schema of the index as well as the throughput capacity for the index.

We can create the table, with its index, just as we did previously.

Now, assuming that our table has data in it, we can use the index to query every session for a given user.
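For example, fetching every session for user 1234 via the index:

    for session in Session.user_index.query(1234):
        print(session.session_id, session.last_updated)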

Conclusion

DynamoDB isn’t perfect, but it is a great service if you need a scalable, highly available NoSQL database, without having to manage any of it yourself. This tutorial shows how easy it is to get started, but there is much more to it than what is mentioned here.

If you are interested, you should definitely check out these awesome resources.


AWS Advent 2014 – Using IAM to secure your account and resources

Today’s AWS Advent post comes to us from Craig Bruce.

AWS Identity and Access Management (IAM) is a service from AWS to aid you in securing your AWS resources. This is accomplished by creating users and roles with specific permissions for both API endpoints and AWS resources.

Ultimately, IAM is a security tool and like all security tools there is a balance between security and practicality (no one wants to enter an MFA code for every single API request). IAM is an optional and free service, but users that do not use IAM have been bitten – most recently BrowserStack (see here). If BrowserStack had followed the IAM best practices it could have avoided this incident.

This article will cover a few areas of IAM.

  • Best practices. There is really no excuse not to follow these.
  • Where your user identities can originate from. AWS offers multiple options now.
  • Various tips and tricks for using IAM.

Best practices

IAM has a best practice guide which is easy to implement. Also, when you access IAM via the AWS management console, it highlights some best practices and indicates whether you have implemented them.

Here is a brief summary of the IAM best practices:

  • Lock away your AWS account access keys

    Ideally just delete the root access keys. Your root user (with a physical MFA) is only required to perform IAM actions via the AWS management console. Create power users for all other tasks, everything except IAM.

  • Create individual IAM users

    Every user has their own access keys/credentials.

  • Use groups to assign permissions to IAM users

    Even if you think a policy is just for one user, still make a group. You’ll be amazed how quickly user policies become forgotten.

  • Grant least privilege

    If a user asks for read access to a single S3 bucket, do not grant s3:* on all resources; be specific with s3:GetObject and select the specific resource. It is easier to grant further access later than to rein in a wildcard.

  • Configure a strong password policy for your users

    Do your users even need access to the AWS Management console? If they do, make sure the passwords are strong (and preferably stored in a password manager, not on a post-it).

  • Use roles for applications that run on Amazon EC2 instances

    Roles on EC2 remove the need for ever including access keys on the instance or in code (which can all too easily end up in version control). Roles give the same permissions as a user, but AWS rotates the temporary keys automatically several times a day. All AWS SDKs can obtain credentials from the instance metadata, so you do not need any extra code.

  • Delegate by using roles instead of by sharing credentials

    If you have multiple AWS accounts (common in larger companies) you can authenticate users from the other account to use your resources.

  • Rotate credentials regularly

    For users with access keys, rotate them. This is a manual step, but you can have two active keys per user to enable a more seamless transition.

  • Remove unnecessary credentials

    If a user has left, or access requirements change, delete the user, alter group memberships, and edit the policies by group (so much easier when all your policies are in groups, not users).

  • Use policy conditions for extra security

    Conditions can include specific IP ranges or a requirement that the request was authenticated with MFA. You can apply these to specific actions to ensure they are only performed from behind your corporate firewall, for example.

  • Keep a history of activity in your AWS account

    CloudTrail is a separate service but having a log of all API access (which includes IAM user information) is incredibly useful for an audit log and even debugging issues with your policies.

Federated users

The default in IAM is to create a new user, which is internal to AWS. You can use this user for any part of AWS. When an IAM user logs in they get a special login screen (via a special URL) to provide their username/password (not an email address like the root account). To provide flexibility, IAM can use 3rd party services for identity, for example Amazon, Facebook, or Google, as well as SAML providers. Another recent product is AWS Directory Service, which lets you use your on-premises corporate identity (Microsoft Active Directory, for example) as your identity provider. For mobile applications you should explore Amazon Cognito, as it is specifically designed for mobile and includes IAM integration. Regardless of your identity source, IAM is still core to managing the access to your AWS resources.

General tips

MFA (multi factor authentication) is available with IAM and highly recommended. One approach you could adopt is:

  • A physical MFA device for the root account; they are not expensive.
  • Power users (whatever you define as a power user) use a virtual MFA (like Google Authenticator or Duo Security).
  • Users with less potentially destructive access have no MFA.

A power user could have access to everything except IAM, as shown below. Learn more about the policy grammar, which is JSON based, on this blog post.
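One way to express such a policy, modelled on AWS’s standard Power User Access template, uses NotAction to allow every action except those in the IAM namespace (treat this as a sketch rather than a definitive policy):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "NotAction": "iam:*",
          "Resource": "*"
        }
      ]
    }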

Advanced policies include the use of resource-level permissions and tags, a good example from the EC2 documentation looks like:
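A possible reconstruction of that style of policy is shown below; the region and account ID in the ARNs are placeholders, and only the instance IDs and tag come from the example described here.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": "ec2:DescribeInstances",
          "Resource": "*"
        },
        {
          "Effect": "Allow",
          "Action": ["ec2:StopInstances", "ec2:StartInstances"],
          "Resource": [
            "arn:aws:ec2:us-east-1:123456789012:instance/i-123abc12",
            "arn:aws:ec2:us-east-1:123456789012:instance/i-4c3b2a1"
          ]
        },
        {
          "Effect": "Allow",
          "Action": "ec2:TerminateInstances",
          "Resource": "arn:aws:ec2:us-east-1:123456789012:instance/*",
          "Condition": {
            "StringEquals": {"ec2:ResourceTag/purpose": "test"}
          }
        }
      ]
    }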

This policy allows the user to describe any EC2 instance, to stop or start two instances (i-123abc12 and i-4c3b2a1), and to terminate any instance with the tag purpose set to test. This is particularly useful if you want to restrict your users to your development EC2 instances, without giving them access to your production instances. Resource-level permissions offer a great deal of flexibility in your policies. A combination of tags and resource-level permissions is AWS’s preferred approach to writing these more complex policies.

While the policies can get complex, here are some final tips:

  • When writing your policies, AWS provides templates which can be a good starting point.
  • The IAM Policy Simulator is very handy at testing your policies.
  • Changes to IAM can take a few minutes to propagate, but IAM is not a service you should be changing constantly.
  • 3rd party resources that require access to your AWS resources should each use their own IAM account and access keys.
  • Use IAM in preference to S3 ACL or bucket policies (although there are specific exceptions – such as CloudFront access)
  • IAM support is not complete across all AWS products, get the latest information here.

Conclusions

IAM is a powerful service that can help you manage and restrict access to your AWS resources. While getting the policies right can be tricky during initial setup, once they are saved as groups you will be set. Time invested in IAM now could save you from an embarrassing situation later. Hopefully this article has touched on the various aspects of IAM, and some of them are directly applicable to your use case.