Overcoming Terraform state locking issues with ECS tasks

At Simply Business, we’ve integrated Terraform with our automated deployment pipeline as an easy way of building, configuring and versioning programmable infrastructure for our applications on AWS.

Terraform has many useful features, such as templates, modules and the ability to provision resources across providers. However, it does not come without some challenges, one of which is state file locking. In this post, I’ll explain how we run Terraform within AWS ECS tasks to overcome some issues we’ve encountered with Terraform’s state file locking.

The need to maintain a state

One of Terraform’s features is its ability to prevent concurrent runs of the Terraform binary that point at the same state file from persisting changes that would leave the state inconsistent. To draw a parallel with a field in which I am by no means expert, this is the problem that database management theory addresses with the ACID guarantees: atomicity, consistency, isolation and durability of operations. In object-oriented programming, it’s similar to the concept of ‘thread-safe’ object operations.

By way of example, if we have a single entity that needs munging by other entities, how do we ensure that all operating entities start from a known good state and maintain that assumption throughout the lifecycle? How do we ensure that the manipulated entity is left in a state that is consistent and can be picked up by any other worker? For decades, a solution has been found in locking / mutexing.

Terraform allows the lock for its state file to be stored on a shared resource such as the AWS DynamoDB service. If one Terraform process attempts to acquire a lock on a state file that is already locked, Terraform reports an error acquiring the state lock and the run exits.

This is not particularly graceful, especially within a continuous integration / continuous deployment (CI/CD) environment. Software engineers, or in my case, security engineers, raise changes that occasionally fail because of this behaviour. Nothing wrong with the code, just a nuisance of sorts, causing builds to be rerun unnecessarily.

How we use Terraform

At Simply Business, our CI/CD deployment pipeline integrates with the Jenkins automation server to deploy our applications on AWS. Instead of running Terraform directly on Jenkins, we run it as an AWS ECS task.

This setup has a couple of advantages:

1 – Terraform can run with a dedicated AWS Identity and Access Management (IAM) role, distinct from the role for Jenkins. Permissions are limited to the project scope, addressing one bad practice of AWS management, i.e. relying on the Jenkins assigned IAM role for all infrastructure changes.

2 – The load burden on Jenkins is considerably reduced. ECS is a very scalable service. Jenkins is less so, even when deployed as an autoscaling group instance template. The result is that Jenkins agents are protected as the scarce resource, and spinning up ECS tasks becomes a cheap offloading operation.

There is also an additional by-product: the AWS APIs can be used to check the status of running tasks. We will explore this with an example.

Scenario

Consider a scenario in which two separate builds are started at roughly the same time. The builds progress through the Jenkins pipeline until a Terraform plan or apply operation in one build finds the Terraform state file locked by the other build and is forced to exit. We would like to have a way of overcoming these conflicts.

Let’s consider an alternative to understand why running Terraform as an ECS task provides a solution to the stated problem.

What if, in our scenario, we were running Terraform on Jenkins as a local container? Containers running on the same system can be listed by interacting with the local Docker agent; however the Jenkins agents are not set up as a Docker cluster (read Kubernetes / Docker services). What happens if the two builds get assigned to different Jenkins agents? In most cases, this would be the norm. So how can checks be performed on what’s running on a separate Jenkins agent, where one build is completely unaware of the other?

Solution

The solution we’ve come up with is to check for the status of running ECS tasks and recently exited tasks. If no tasks related to the same project are found, a Terraform action (plan or apply) can be executed.

It’s worth noting at this point that locking the Jenkins stage at which Terraform is run is not really an option. Terraform plan and Terraform apply are run in separate Jenkins stages.

Why? This ensures that a Terraform plan action can be run on an integration branch, but the Terraform apply action (and Terraform import, for completeness) can be run only on the main branch. A software engineer must be able to view the planned changes before applying them to the main branch.

By running Terraform as an ECS task, some of the limitations of state file locking can be overcome.

A lock on the Jenkins stages where Terraform is executed would not help, because plan and apply are separate stages and would hold separate locks. While our main branch is in the apply stage, an integration branch may start executing the plan stage, and Jenkins stage locking would allow that. The two builds would then compete for the state file lock, and one of them would fail to acquire it.

So we can instead look at which ECS tasks are executing. If the task currently running for our project has not yet returned a valid numeric exit code, we can wait. In effect, this makes our Terraform runs synchronous across different Jenkins stages and agents. In my opinion, that’s a good by-product of running Terraform as ECS tasks.
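The waiting step can be sketched in bash with the AWS CLI. This is a minimal illustration, not a definitive implementation: the `CLUSTER` and `POLL_SECONDS` settings and both function names are hypothetical, and it assumes the Terraform task has a single container, so only the first container's exit code is read.

```bash
#!/usr/bin/env bash
# Sketch: block until an ECS task reports a numeric exit code.
# CLUSTER and POLL_SECONDS are hypothetical settings for illustration.
CLUSTER="${CLUSTER:-deployments}"
POLL_SECONDS="${POLL_SECONDS:-5}"

# Print the exit code of the first container in the given task.
# With --output text the AWS CLI prints "None" until the container exits.
task_exit_code() {
  aws ecs describe-tasks \
    --cluster "$CLUSTER" \
    --tasks "$1" \
    --query 'tasks[0].containers[0].exitCode' \
    --output text
}

# Poll the task until its exit code is numeric, then print that code.
wait_for_exit_code() {
  local task_arn="$1" code
  while true; do
    code="$(task_exit_code "$task_arn")"
    case "$code" in
      ''|*[!0-9]*) sleep "$POLL_SECONDS" ;;  # not numeric yet: keep waiting
      *) echo "$code"; return 0 ;;
    esac
  done
}
```

A Jenkins stage could call `wait_for_exit_code "$task_arn"` with the ARN of the task it found running for the project, and start its own Terraform action only once the call returns.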

A couple of shortcomings of this approach derive from how ECS tasks can be listed. There is no control over how long it takes for the list-tasks endpoint to include a newly launched task. The list-tasks operation also offers only a few filters, and the one that maps to a project is the task definition family name, something that is not particularly intuitive and that took us some time to realise. The safest approach is to introduce a few seconds’ delay before polling the ECS list-tasks endpoint, to let any newly launched tasks appear. No doubt it won’t take long before AWS announces new features that cover additional ground on both of these points.
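As a rough sketch of the family filter and the settling delay, the functions below count the running tasks of one family and wait until there are none. The cluster name, the delay values and the function names are assumptions made for illustration, and the family is assumed to have few enough tasks for a single page of results.

```bash
#!/usr/bin/env bash
# Sketch: wait until no task of a given family is running.
# CLUSTER, SETTLE_SECONDS and POLL_SECONDS are hypothetical settings.
CLUSTER="${CLUSTER:-deployments}"
SETTLE_SECONDS="${SETTLE_SECONDS:-10}"
POLL_SECONDS="${POLL_SECONDS:-5}"

# Count tasks of the family whose desired status is RUNNING
# (this includes tasks that are still pending).
running_task_count() {
  aws ecs list-tasks \
    --cluster "$CLUSTER" \
    --family "$1" \
    --desired-status RUNNING \
    --query 'length(taskArns)' \
    --output text
}

wait_until_family_idle() {
  local family="$1"
  # Give list-tasks time to include any task launched moments ago.
  sleep "$SETTLE_SECONDS"
  while [ "$(running_task_count "$family")" -gt 0 ]; do
    sleep "$POLL_SECONDS"
  done
}
```

The delay is a heuristic: it narrows the window in which a freshly launched task is invisible, but it cannot close it completely.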

How to implement ECS tasks for Terraform

And now time for some code. Here is a sample bash file to implement ECS tasks synchronisation when running Terraform:
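A minimal sketch of such a script might look like the following. It is an illustration under stated assumptions: the cluster name, the family name, the delays and the function names are all hypothetical, the task definition is assumed to have one container, and it is assumed to be launchable with the cluster's defaults (a Fargate task would also need network configuration passed to `run-task`).

```bash
#!/usr/bin/env bash
# Sketch: run Terraform as an ECS task and return its exit code.
# All names below (cluster, family, delays) are hypothetical.
CLUSTER="${CLUSTER:-deployments}"
FAMILY="${FAMILY:-terraform-my-project}"
SETTLE_SECONDS="${SETTLE_SECONDS:-10}"
POLL_SECONDS="${POLL_SECONDS:-5}"

# Succeeds while any task of our family is running or pending.
family_busy() {
  local arns
  arns="$(aws ecs list-tasks --cluster "$CLUSTER" --family "$FAMILY" \
    --desired-status RUNNING --query 'taskArns' --output text)"
  [ -n "$arns" ] && [ "$arns" != "None" ]
}

run_terraform_task() {
  local task_arn code
  # Let list-tasks catch up, then wait for other runs to finish.
  sleep "$SETTLE_SECONDS"
  while family_busy; do sleep "$POLL_SECONDS"; done

  # Launch the latest active revision of the task definition family.
  task_arn="$(aws ecs run-task --cluster "$CLUSTER" \
    --task-definition "$FAMILY" \
    --query 'tasks[0].taskArn' --output text)"

  # Block until the task stops, then read the container exit code.
  aws ecs wait tasks-stopped --cluster "$CLUSTER" --tasks "$task_arn"
  code="$(aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$task_arn" \
    --query 'tasks[0].containers[0].exitCode' --output text)"
  case "$code" in
    ''|*[!0-9]*) echo "task ended without an exit code" >&2; return 1 ;;
    *) return "$code" ;;
  esac
}
```

A Jenkins stage would call `run_terraform_task` and fail the build on a non-zero return. Two builds can still pass the idle check at the same moment, so Terraform's own state lock remains the last line of defence. The `tasks-stopped` waiter also has its own timeout, which long Terraform runs may exceed.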

Hopefully these tips on how to overcome Terraform state file locking issues by using ECS tasks will be useful in your infrastructure as code projects.

Vincenzo Zambianchi