Using Infrastructure as Code as a Poor Man’s DR

by Brad Adair, 09. December 2018

What is DR?

Let’s start by setting the context of what I mean by Disaster Recovery (DR). There are different interpretations of DR and High Availability, with a very thin and moving line between the two. Here I am specifically referring to the ability to recover your infrastructure after a disaster, such as an AWS region becoming unavailable. I am not talking about situations where immediate failover is needed.

What is Infrastructure as Code?

Infrastructure as Code (IaC) has been around for a while now, but many people are just starting to fully embrace it and see its benefits. Like DR, it has been taken to mean many different things over the years. Some people refer to BASH scripts that generate KVM VMs as IaC. While that is technically correct, it is not what I mean by IaC here. I am explicitly talking about tools such as Terraform that are designed to generate infrastructure based on a configuration file.

Why IaC for DR?

Go to any company or organization that does not have a viable DR strategy and ask them why that is the case. Nine times out of ten, the answer will relate to cost. That makes sense: a true DR environment can be very expensive. Additionally, for people who are not technical and have never experienced a true IT disaster, it can be tough to comprehend why any of this is needed. These factors make it very difficult for IT to get approval to put a secondary environment in place. This is where IaC comes in.

If your IaC is properly set up, you can essentially get DR for free. How? If a disaster takes out your infrastructure, you just re-deploy from your code, and you have your infrastructure back.

But wait, my IaC tool deploys to the AWS region that is down

If your tool is improperly configured, you may not benefit from the DR capabilities of IaC. You need to make sure that you are abstracting the providers and regions out of the actual infrastructure configuration. This allows you to quickly change the region that you are pointing at and re-deploy. For example, in Terraform, you would want to have a separate provider.tf file with a provider section that specifies the region, like this:

    provider "aws" {
      region = "eu-west-1"
    }

This allows you to change a single line and re-deploy your exact infrastructure to another region. The alternative is having the region information embedded in individual .tf files, which unfortunately I see floating around pretty often.
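One way to take this a step further is to make the region an input variable, so that nothing needs to be edited at all during an incident. The following is only a minimal sketch, assuming Terraform 0.12 or later and a recent AWS provider. The variable name aws_region, the file layout, the AMI name pattern, and the example instance are all illustrative assumptions, not part of any particular setup.

```hcl
# variables.tf (hypothetical file name)
variable "aws_region" {
  description = "Region to deploy into; override this to re-deploy elsewhere"
  type        = string
  default     = "eu-west-1"
}

# provider.tf
provider "aws" {
  region = var.aws_region
}

# Region-specific values such as AMI IDs are looked up at plan time
# instead of being hard-coded, so the same code works in any region.
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"] # assumed name pattern
  }
}

# Example resource (hypothetical) that carries no region information itself
resource "aws_instance" "app" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"
}
```

With this in place, a re-deploy to another region could be as simple as running terraform apply -var="aws_region=eu-west-2". Keep in mind that if your Terraform state is stored remotely, it should live somewhere that does not depend on the region that is down.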

What if all of AWS (or GCP, or Azure) completely goes down and not just a region?

A complete outage of a service provider is another concern that I hear from time to time. I have a couple of different thoughts about this scenario.

My first thought is that the chances of that happening are so vanishingly small that it is hardly worth thinking about. If all of AWS is down, you likely have something more serious to worry about, like worldwide thermonuclear war. However, as engineers, sysadmins, and other assorted IT professionals, we have a habit of being unable to stop thinking about these extreme cases.

For those situations, you can have standby code. What do I mean by this? I mean that you can develop code that deploys the equivalent of your current infrastructure with another provider. This is obviously time-consuming, and since none of us have a ton of spare time, that is a real cost. Personally, I don’t think it is worth it, but it is possible, and it is up to each reader to decide whether it is needed for their environment.

Ok, I have my infrastructure back, what about my data?

Well, you are still doing backups, right? I am making a case for replacing a dedicated DR environment with code; I am not making a case for throwing basic common sense out the window.

That being said, there are times when restoring data from backups would take an impractical amount of time for what was only a 1-2 hour outage, especially when you can re-deploy your infrastructure from code to another region in minutes.

This is where I advocate for a hybrid approach between a complete IaC DR plan and a traditional DR setup. In this type of solution, you would have a replicated database (or other data source) running at all times in the region that you plan to use for DR purposes. Then, if disaster strikes, your data is sitting there just waiting for you to deploy the networking and compute resources to access it.
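As an illustration of the hybrid approach, here is a rough sketch of a cross-region read replica, assuming the database is an Amazon RDS instance managed with Terraform 0.12 or later. The variable names, the identifier, the regions, and the instance class are all hypothetical placeholders, and your data store may call for a completely different replication mechanism.

```hcl
# Hypothetical names throughout; adjust to your own environment.
variable "dr_region" {
  description = "Region that hosts the standby replica"
  type        = string
  default     = "eu-west-2"
}

variable "primary_db_arn" {
  description = "ARN of the primary RDS instance in the main region"
  type        = string
}

# Aliased provider pointing at the DR region
provider "aws" {
  alias  = "dr"
  region = var.dr_region
}

# Cross-region read replica that stays running at all times
resource "aws_db_instance" "dr_replica" {
  provider            = aws.dr
  identifier          = "app-db-dr-replica"
  replicate_source_db = var.primary_db_arn # cross-region replicas reference the source by ARN
  instance_class      = "db.t3.medium"
  skip_final_snapshot = true
}
```

Only the replica runs in the DR region day to day; the networking and compute code is applied there when it is actually needed. If the primary is encrypted, the replica will most likely also need a KMS key from the DR region. Promoting the replica to a standalone database during a disaster is a separate step that should be planned and tested ahead of time.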

Since this does require keeping some infrastructure running at all times, it does cost some money. However, it will cost far less than having a whole second DR site sitting around waiting, and it may be an easier pill to swallow for the people who have to spend the money.

Conclusion

I hope that after reading this article you will understand how IaC can provide a DR environment in places where one otherwise would not be feasible. I further hope that you see the benefits of this solution in situations where a full disaster recovery solution is possible, but possibly not needed. Perhaps that money could be better spent elsewhere if you already have IaC in place to cover the worst-case scenario.

What’s next

This article introduces the DR topic to get those that read it thinking about using Infrastructure as Code as a possible disaster recovery plan and solution. It just begins to scratch the surface of what is possible and the different considerations that need to be made.

Please visit my website at https://www.adair.tech over the next several weeks, as I will publish follow-up articles there that will delve further into details, tools, and specific plans for accomplishing this. You can also contact me via my website or Twitter to have a 1-1 conversation on the topic and explore your particular use case in more depth.

About the Author

Brad Adair is an IT professional with over a decade of experience in systems engineering and administration, cloud engineering and architecture, and IT management. He is the President of Adair Technology, LLC, a Columbus-based IT consulting firm specializing in AWS architecture and other IT infrastructure consulting. He is also an AWS Certified Solutions Architect. Outside of the office he enjoys sports, politics, Disney World, and spending time with his wife and kids.

About the Editor

Jennifer Davis is a Senior Cloud Advocate at Microsoft. Jennifer is the coauthor of Effective DevOps. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps. She has spoken and written about DevOps, Operations, Monitoring, and Automation.