Unlearning for DynamoDB

04. December 2018

Introduction

Relational databases have been the standard way to store information since the 1980s. Relational database techniques such as normalisation and index creation form part of many university courses around the world. In this article, I will set out how perhaps the hardest part of using DynamoDB is forgetting the years of relational database theory most of us have absorbed so fundamentally that it colours every aspect of how we think about storing and retrieving data.

Let’s start with a brief overview of DynamoDB.

Relational databases were conceived at a time when storage was the most expensive part of a system. Many of the design decisions revolve around this constraint – normalisation places a great deal of emphasis on keeping only a single copy of each item of data.

NoSQL databases are a more recent development and were designed under different constraints. They were conceived in a world where storage is cheap and network latency is often the biggest problem limiting scalability.

DynamoDB is Amazon’s answer to the problem of effectively storing and retrieving data at scale. The marketing would lead you to believe it offers:

  • Very high data durability
  • Very high data availability
  • Infinite capacity
  • Infinite scalability
  • High flexibility with a schemaless design
  • Fully managed and autoscaling with no operational overhead

In reality, whether it lives up to the hype depends on what you take each of these claims to mean. The durability, availability, and capacity points are the easiest to agree with – the chances of data loss are infinitesimally low, the only limit on capacity is the 10GB limit per partition, and the number of DynamoDB outages in the last eight years is tiny.

As we move down the list, though, things get a bit shakier. Scalability depends entirely on sharding data, which means you can run into issues with ‘hot’ partitions, where particular keys are used much more than others. While Amazon has mitigated this to some extent with adaptive capacity, it is still very much something you need to design your data layout to avoid, and simply increasing the provisioned read and write capacity won’t necessarily give you extra performance.

It’s only partly true that DynamoDB is schemaless. Table structures are not uniform, and each row within a table can have different columns of differing types, but you must define your primary key up front, and this can never change. The primary key consists of at minimum a partition key, and optionally a range key (also referred to as a sort key). Global secondary indexes can be added after table creation (local secondary indexes must be defined when the table is created), but you are limited to five of each. All of this makes the last point, that there is no operational overhead when using DynamoDB, obviously false. You won’t spend time upgrading database versions or operating systems, but there’s plenty of ops work to do in designing the correct table structure, ensuring partition usage is evenly spread, and troubleshooting performance issues.

Unlearning

So how do you make your life using DynamoDB as easy as possible? Start by forgetting everything you know about relational databases, because almost none of it is true here. Be careful – along the way you’ll find that a lot of your instincts about how data should be structured are actually optimisations for a relational engine rather than incontrovertible facts.

Everything in a single table

In DynamoDB, the ‘right’ number of tables to power an application is one. There are, of course, exceptions, but start with the assumption that all the data for your application will live in a single table, and move to multiple tables only if really necessary.
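As a rough sketch of what this can look like in practice (the table name, key names, and item layout here are assumptions for illustration, not anything DynamoDB mandates), several entity types can share one table and be distinguished purely by their keys:

    import boto3

    # Hypothetical single table; "pk" and "sk" are generic partition/sort key names.
    table = boto3.resource("dynamodb").Table("app-table")

    # An account profile and one of its orders live side by side in the same table,
    # separated only by their sort keys.
    table.put_item(Item={"pk": "ACCOUNT#123", "sk": "PROFILE", "email": "someone@example.com"})
    table.put_item(Item={"pk": "ACCOUNT#123", "sk": "ORDER#2018-12-04", "total": 42})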

Know how you’re going to use your data up front

In a relational database, the structure of your data stems from the data itself. You group and normalise, and then stitch things back together at query time.

DynamoDB is different. Here you need to know the queries you will run most often, and work backwards from those access patterns to the table design.

It’s OK to duplicate data

Many times when using DynamoDB you will store the same data more than once. This might even be done with multiple copies of data in the same row, to allow it to be used by different indexes. Global Secondary Indexes are just copies of your data, with different keys.
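As a sketch of what that duplication can look like (the table, attribute, and index names below are mine, not anything prescribed), a global secondary index simply re-keys the same items by a duplicated attribute:

    import boto3

    client = boto3.client("dynamodb")

    # Hypothetical table where "email" is stored alongside the primary key purely so
    # the "by-email" global secondary index can hold a second copy of each item,
    # keyed differently.
    client.create_table(
        TableName="app-table",
        AttributeDefinitions=[
            {"AttributeName": "pk", "AttributeType": "S"},
            {"AttributeName": "sk", "AttributeType": "S"},
            {"AttributeName": "email", "AttributeType": "S"},
        ],
        KeySchema=[
            {"AttributeName": "pk", "KeyType": "HASH"},
            {"AttributeName": "sk", "KeyType": "RANGE"},
        ],
        GlobalSecondaryIndexes=[{
            "IndexName": "by-email",
            "KeySchema": [{"AttributeName": "email", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
            "ProvisionedThroughput": {"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
        }],
        ProvisionedThroughput={"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
    )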

Re-use the same column for different data

Imagine you have a table with a compound primary key – an account ID as the partition key, plus a range key. You can use the range key to store different kinds of content about the account: for example, a sort key value of settings for storing account configuration, and timestamp values for recording actions. You can get all the timestamps by executing a query between the start of time and now, and the settings item by looking up the partition key together with the sort key value settings. This is probably the hardest part to get your head around at first, but once you get used to it, it becomes very powerful, particularly when combined with secondary indexes.

This, of course, makes choosing a suitable name for your sort column very difficult.
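A minimal sketch of both access patterns, assuming hypothetical attribute names (account_id as the partition key, sk as the overloaded sort column):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("app-table")

    # The fixed "settings" sort key holds account configuration...
    settings = table.get_item(
        Key={"account_id": "123", "sk": "settings"}
    ).get("Item")

    # ...while ISO-8601 timestamps in the same column record actions, so a range
    # query from the beginning of time until now returns them all.
    actions = table.query(
        KeyConditionExpression=Key("account_id").eq("123")
        & Key("sk").between("0000-01-01T00:00:00", "2018-12-04T23:59:59")
    )["Items"]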

Concatenate data into composite sort keys

If you have hierarchical data, you can use composite sort keys to let you progressively refine queries. For example, if you need to query by location, you might have a sort key that looks like this (an illustrative value, with the hierarchy levels separated by a delimiter):

    AUSTRALIA#NSW#SYDNEY#SURRY_HILLS

Using this sort key, you can find all items within a partition at any level of the hierarchy by using a begins_with condition on the sort key.
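A sketch of what such a query looks like with boto3, reusing the assumed table and key names from earlier:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("app-table")

    # All items in Sydney, whatever suburb sits beneath it in the hierarchy.
    sydney = table.query(
        KeyConditionExpression=Key("pk").eq("LOCATION")
        & Key("sk").begins_with("AUSTRALIA#NSW#SYDNEY#")
    )["Items"]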

Use sparse indexes

Because not every row in DynamoDB must have the same columns, you can have secondary indexes that only contain a subset of the data. A row will only appear in a secondary index if the key attributes of that index exist in the row. Using this, you can make otherwise broad queries, and even scans, efficient enough for frequent use.
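For instance (the attribute and index names here are assumptions), an index keyed on an attribute that only unfinished items carry stays small, so even a scan of it is cheap:

    import boto3

    table = boto3.resource("dynamodb").Table("app-table")

    # Only items that actually have an "unshipped" attribute appear in this
    # hypothetical sparse index, so scanning it reads a small fraction of the table.
    pending = table.scan(IndexName="unshipped-index")["Items"]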

Don’t try and use your database for all types of queries

With relational databases, it’s common to spin up a read-only replica to interrogate data for analytics and trending. With DynamoDB, you’ll need to do this work somewhere else – perhaps even in a relational database. You can use DynamoDB Streams to have data sent to S3, for analysis with Athena, Redshift, or even something like MySQL. Doing this gives you a best-of-both-worlds approach: the high throughput and predictable scalability of DynamoDB, plus the ability to run the ad-hoc queries a relational engine provides.
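One way this can be wired together, sketched below with hypothetical names, is a Lambda function subscribed to the table’s stream that copies new item images to S3, where Athena or a relational engine can pick them up later:

    import json
    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        # Each invocation receives a batch of DynamoDB stream records.
        for record in event["Records"]:
            if record["eventName"] in ("INSERT", "MODIFY"):
                image = record["dynamodb"]["NewImage"]  # DynamoDB's typed JSON format
                s3.put_object(
                    Bucket="analytics-raw-data",  # hypothetical bucket
                    Key="dynamodb/{}.json".format(record["eventID"]),
                    Body=json.dumps(image).encode("utf-8"),
                )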

Conclusions

  • Know what questions you need to ask of your data before designing the schema
  • Question what you think you know about how data should be stored
  • Don’t be fooled into thinking you can ‘set it and forget it.’

About the Author

Sam Bashton
Sam Bashton is a cloud computing expert who recently relocated with his family from Manchester, UK to Sydney, Australia. Sam has been working with AWS since 2006, providing consultancy and support for high traffic e-commerce websites. Recently Sam wrote and released bucketbridge.cloud, an AWS Marketplace solution providing an FTP and FTPS proxy for S3.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Previously, she was a principal site reliability engineer at RealSelf and developed cookbooks to simplify building and managing infrastructure at Chef. Jennifer is the coauthor of Effective DevOps and speaks about DevOps, tech culture, and monitoring. She also gives tutorials on a variety of technical topics. When she’s not working, she enjoys learning to make things and spending quality time with her family.