Jon Leighton

Overhauling Loco2's hosting infrastructure with AWS, Docker and Terraform

Recently I worked on a major overhaul of the infrastructure hosting Loco2.com. In this post, I’ll dig into the details of what we did and why.

Note: while I was until recently Loco2’s Technical Director, I have now gone freelance and I worked on some of this as a freelancer. Feel free to contact me if you would like to discuss a similar project for your organisation. (I still think Loco2 is a great place to work, BTW.)

Table of Contents

  1. History
  2. Infrastructure as code with Terraform
  3. Environment-specific AWS accounts
  4. Transitioning to Virtual Private Cloud
  5. Moving the RDS database into VPC
  6. Provisioning the EC2 Container Service cluster
  7. How ECS runs our containers
  8. Building the Docker image
  9. Deploying the new image
  10. Rolling deploys in ECS
  11. A trick to build Docker images faster
  12. Memory leaks in long-running processes
  13. Developer access to the production environment
  14. Conclusion

History

Loco2 has been hosted on Amazon Web Services for years, but we used it rather like a traditional hosting provider. We had long-running instances which had to be manually provisioned and maintained.

We used Chef to manage the process of configuring servers, but the experience was far from perfect:

Since our servers were long-running, from time to time we had to give them manual attention to fix broken software, prevent disks filling up, and so on. Although rebuilding servers was somewhat automated via Chef, it was still time-consuming and error-prone. We wanted a system where servers could be added or removed quickly and easily.

We were still on EC2 Classic, which was the only option when we first set up our AWS infrastructure. This meant that we missed out on features only available in EC2 Virtual Private Cloud, such as faster, cheaper instance types and better network security.

Infrastructure as code with Terraform

In the past we’d made changes to our cloud resources manually through the AWS Console. This was error-prone, and made it hard for one developer to understand when and why another developer had made a certain change.

To solve this, I introduced a relatively new but powerful tool, Terraform. Terraform allows us to declaratively specify our infrastructure as code, which we store in a git repository. Now, we can see exactly when a certain configuration change was made, who made it, and why.

When we change the configuration code, Terraform finds the differences between our desired infrastructure and our actual infrastructure, and performs the necessary modifications.

It also allows us to refer to values by their logical names. For example, rather than having to find the DNS name of the load balancer and paste it into a field to configure a CNAME record, we can just write our config file to refer to aws_alb.web.dns_name, which will automatically be replaced with the relevant value when Terraform runs.
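As a rough sketch of what that looks like (the resource names and zone are invented for illustration, not our real config):

```hcl
# Hypothetical example — resource names, zone and domain are placeholders.
resource "aws_route53_record" "www" {
  zone_id = "${aws_route53_zone.main.zone_id}"
  name    = "www.loco2.com"
  type    = "CNAME"
  ttl     = "300"

  # Interpolated from the load balancer resource when Terraform runs,
  # so the DNS name never has to be looked up and pasted by hand.
  records = ["${aws_alb.web.dns_name}"]
}
```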

Environment-specific AWS accounts

Our legacy infrastructure had one AWS account for all staging and production resources. In order to enforce better separation between the two environments, I set up completely separate staging and production accounts.

To avoid having to set up users in each separate account, we use our existing account as a gateway. Users log in to this gateway account and can then assume a role in the staging or production account which allows them to administer resources within that account.

The beauty of this approach in conjunction with Terraform is that it allows us to test a change in staging and then when we’ve seen that it works, apply the exact same change to our production account.

Our Terraform repository is laid out like this:

When we’re in staging/ or production/, the Terraform AWS provider is configured with an assume_role block, causing Terraform to operate on the correct account.
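As a sketch (the region, account ID and role name are placeholders, not Loco2's real values), the provider configuration in production/ might look something like:

```hcl
# Hypothetical example of an assume_role provider block — all values
# here are illustrative placeholders.
provider "aws" {
  region = "eu-west-1"

  assume_role {
    role_arn = "arn:aws:iam::123456789012:role/admin"
  }
}
```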

Transitioning to Virtual Private Cloud

Our services (app servers, PostgreSQL, Redis, and so on) would gradually be migrated to a VPC in the new AWS accounts, but during the transition we still needed to have communication to and from our EC2 Classic instances.

To achieve this, multiple steps were required.

First, I created a new VPC inside our gateway account for each environment. Then, I connected our EC2 Classic instances to those VPCs via ClassicLink. This enables private network traffic to flow between EC2 Classic instances and a VPC.

This VPC was within our gateway account though; we still needed to be able to communicate through to a different VPC inside the staging or production account. This is done with a VPC peering connection which allows VPCs to exchange private network traffic with each other and can be configured to support ClassicLink traffic.

A diagram showing the ClassicLink and peering connection configuration

Moving the RDS database into VPC

Our PostgreSQL database was provisioned using the managed Relational Database Service, but unfortunately it was also using EC2 Classic. Our options for connecting to it from our new VPC instances were sub-optimal:

Therefore, I decided to migrate the database to the new accounts.

I was concerned about significant downtime if we went the route of snapshotting the database and then restoring the snapshot in the desired account. So I spent some time trying to configure the Database Migration Service, which promised a zero-downtime migration.

This turned out to be far more complicated than the documentation would have you believe, and after lots of messing around the final nail in the coffin was the realisation that DMS does not properly support complex data types such as json, hstore and arrays (it converts them to text objects).

I decided to test the snapshot-and-restore approach to see how long it would take, and discovered that it would only be an hour or so. (I should really have just tried this in the first place.) Therefore I woke up at 3 AM one morning and took the site down to do this. Not ideal, but acceptable.

Unfortunately, following the migration we started to see quite a bit of latency on disk I/O, which slowed the site down. This caused lots of stress and head-scratching, but ultimately we ended up weathering the storm and the problems eventually settled down after a few days.

This is certainly a downside of RDS; while having a managed service is great, when things aren’t working well it’s very hard to dig into the details of why, or to know whether it’ll eventually sort itself out. Over the years I’ve realised the importance of testing every change (such as a database version update) against a copy of the production database before implementing it for real. But even so, problems like this can still crop up.

Provisioning the EC2 Container Service cluster

We decided early on that we’d like to use Docker for deploying our application. The benefits of Docker have been written about in many other places so I won’t go into detail here, but the aim was to make it easier to change our application’s runtime environment, make deployment more robust and predictable, and to avoid having to maintain complex, long-running EC2 instances.

I considered Elastic Beanstalk and EC2 Container Service as options for managing our Docker containers, and settled on ECS as it seemed a more flexible approach and less tied to a specific blessed “AWS way” of doing things.

(Whilst there is plenty of excitement around Kubernetes at the moment, using Kubernetes on AWS would require us to manage it ourselves, whereas ECS is a managed service. If I were building a new cloud deployment from scratch, I’d certainly look closely at the managed Google Container Engine though, which is built on Kubernetes.)

ECS runs Docker on what it calls container instances, which are grouped in a cluster. You must provision these container instances yourself, which we do via an Auto Scaling Group. This enables us to specify “there must be X instances running in the cluster”, and EC2 will take care of starting and stopping instances to achieve this. In the future, we could implement dynamic auto-scaling where we increase or decrease the required number of instances in response to real-time load. For now, it simply allows our cluster to auto-heal if instances die for any reason. The Auto Scaling Group also balances the instances over two availability zones, ensuring that should one AZ fail, our site will continue running.
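As an illustration (the names, subnets and instance counts are assumptions, not our actual configuration), such a fixed-size group spread over two availability zones might be declared in Terraform roughly as:

```hcl
# Hypothetical sketch: a fixed-size Auto Scaling Group for ECS
# container instances, balanced across two subnets in different AZs.
resource "aws_autoscaling_group" "ecs" {
  name                 = "ecs-container-instances"
  launch_configuration = "${aws_launch_configuration.ecs.name}"
  vpc_zone_identifier  = ["${aws_subnet.a.id}", "${aws_subnet.b.id}"]

  min_size         = 2
  max_size         = 2
  desired_capacity = 2
}
```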

Amazon provides VM images specifically for use with ECS which have the ECS agent pre-installed. We use these images, in conjunction with some cloud-init config which does some lightweight provisioning such as hooking up Papertrail for logging and Librato Agent for more detailed metrics. (In the future it may be better to create our own derivative machine images via Packer, which would make it faster and more reliable to bring new instances up.)

How ECS runs our containers

A Docker container running within ECS is called a task. To tell ECS what container image to use, what command to run, how much memory to allocate, and so on, you create a task definition. The task definition specifies the parameters for running a container; a task is a running instance of that definition.
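As a rough illustration (the values are invented, not our real configuration), a minimal task definition looks something like:

```json
{
  "family": "web",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "loco2/loco2:latest",
      "memory": 1024,
      "command": ["bundle", "exec", "puma"],
      "portMappings": [{ "containerPort": 3000 }]
    }
  ]
}
```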

You run a task on a cluster, but you have no control over which container instance it actually runs on; ECS will pick one based on available system resources.

If you want a certain task to always be running, you create a service. For example, we have a service specifying that we should always have X instances of the web server task definition running. If one of those tasks dies for any reason, ECS will notice and magically start a new one. As with Auto Scaling Groups for EC2 instances, it is also possible to implement auto scaling for ECS services, enabling you to dynamically increase or decrease the number of containers you’re running in response to demand.

Loco2 has two clusters: web and worker. The web cluster has one service, which runs Puma. The worker cluster has three services: worker-core, worker-maintenance-reports and worker-maintenance-other. These all run Sidekiq, but each service picks jobs off a different queue and is set up slightly differently. (As the names suggest, worker-core is the main event and the others deal with less important jobs.)

A diagram showing how our ECS clusters, services and task definitions are related

Building the Docker image

When new code is pushed to Loco2’s git repository, Travis CI runs the tests and builds a Docker image. The image contains everything needed to run the application:

We tag the image with the git commit SHA, as well as with latest (for convenience). When we deploy, we use the git commit tag; this allows us to lock to an exact version of the code. Otherwise, we could have a situation where the latest tag is updated, one of our tasks gets restarted by ECS, and then we have a newer version of the code unintentionally deployed. (This system also makes it crystal clear what version of the code we’re running.)

Once we’ve built the image, we use docker run to invoke rails runner '' (with RAILS_ENV=production). This is a simple smoke test to ensure our Docker image can boot Rails without trouble. The image is then pushed to Docker Hub.

Deploying the new image

When we’re ready to deploy, we type /dockbit deploy production in Slack. This triggers a deployment pipeline in Dockbit. The pipeline has two steps:

  1. Wait for the Travis CI build, and ensure that it passed
  2. Run ./bin/deploy $DOCKBIT_DEPLOYMENT_SHA, which invokes a bash script in our repository

The ./bin/deploy script looks like this. The script defines some functions, and then invokes them at the bottom:

check_image_exists
prepare_deploy

run_concurrently update_service web web
run_concurrently update_service worker worker-core
run_concurrently update_service worker worker-maintenance-reports
run_concurrently update_service worker worker-maintenance-other

wait_for_children

There are two preparation steps:

  1. Check that a Docker image for the commit we’re trying to deploy actually exists in our Docker Hub repository
  2. Run rake deploy:prepare in our new Docker image, which allows us to run arbitrary application code on deploy. We use this to run database migrations, amongst other things. This works by running a task on our worker cluster.

If either of these steps fails, we abort the deployment.

Otherwise, we concurrently update each of our ECS services to tell them we want to start running a newer version of the code. Here’s how we update each service:

  1. Download the JSON describing the latest revision of the task definition
  2. Update the Docker image reference in the JSON to point to the git commit tag we’re deploying (e.g. loco2/loco2:f423bbd8ba70446e09c44848b687512741e54814).
  3. Upload the new JSON, creating a new revision of the task definition
  4. Update the ECS service to tell it to use the newer revision of the task definition
  5. Wait for the ECS deployment to finish (more on this below)
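The steps above can be sketched as a bash function. This is an illustration of the approach, not our exact script: the use of jq, the argument handling and the assumption that the task definition family shares the service's name are all mine.

```shell
# Illustrative sketch of an update_service function — the jq
# transformation and naming conventions are assumptions, not the
# real ./bin/deploy internals.
update_service() {
  local cluster="$1" service="$2" image="$3"

  # 1. Download the JSON for the latest revision of the task definition.
  local taskdef
  taskdef=$(aws ecs describe-task-definition \
    --task-definition "$service" --query 'taskDefinition')

  # 2. Point the container definition at the new image tag, keeping
  #    only the fields needed to re-register it.
  local new_taskdef
  new_taskdef=$(echo "$taskdef" | jq --arg image "$image" \
    '.containerDefinitions[0].image = $image | {family, containerDefinitions}')

  # 3. Upload the new JSON, creating a new revision.
  aws ecs register-task-definition --cli-input-json "$new_taskdef"

  # 4. Tell the service to use the newest revision.
  aws ecs update-service --cluster "$cluster" --service "$service" \
    --task-definition "$service"

  # 5. Block until the rolling deploy has finished.
  aws ecs wait services-stable --cluster "$cluster" --services "$service"
}
```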

Rolling deploys in ECS

Updating a service causes ECS to orchestrate a rolling deploy. This means that there is never a point where zero tasks are running and the application is inaccessible. Instead, ECS gradually starts new tasks and stops old ones until no tasks using the previous task definition are still running.

ECS makes decisions about how to do this based on available memory on the container instances (a new task cannot be started if there is not enough memory available for it), as well as your minimum and maximum healthy percentage settings.

The minimum and maximum healthy percentages govern how the rolling deploy will proceed. If we configure a service with 10 desired tasks running, a minimum healthy percentage of 50% and a maximum of 200%, then during a deploy we may have anywhere between 5 and 20 tasks running. If we set a minimum healthy percentage of 100%, then ECS will need to start new tasks before it stops old ones; this only works if there is sufficient available memory on the container instances.

One crucial fact about rolling deploys is that there may be two versions of the application running at the same time. This means that any database migrations applied in the new deploy must be backwards-compatible with the previously-deployed version of the code, otherwise there will probably be errors.

When a web request comes in to our Application Load Balancer during a rolling deploy, it may be routed to a task running either the old code or the new code. If a user gets routed to one of the new tasks, we don’t want them to get routed to one of the old tasks on a subsequent request, otherwise they may end up inconsistently seeing different versions of a page. To solve this, we use sticky sessions, which ensure that the load balancer always routes the same user to the same ECS task (so long as it’s still running).

Rolling deploys are a fantastic feature of ECS. There is a lot of complex logic going on here which we can just rely on ECS to implement.

A trick to build Docker images faster

Building our Docker image on Travis CI is quite slow. While Docker implements a build cache to maximise the efficiency of rebuilding images, this is irrelevant on Travis CI since we’re always building the image in a completely new VM environment with no data cached.

The most time-consuming part of our image build is installing the bundle, which comes down not only to network speed but also to the time taken to compile the native extensions of various gems.

To speed this up a bit, we have an automated build on Docker Hub which builds an image called loco2/loco2_base every time we push to our git repository.

When we build the image for a given commit, we use this base image as a starting point. However, it probably doesn’t contain the absolute latest code, and our bundle or assets may have changed in the newer code. So we replace all the source code and then re-install the bundle, regenerate the assets and so on.

The Dockerfile we build on Travis CI looks like this:

FROM loco2/loco2_base:latest

# Set the commit ID in an env var
ARG LOCO2_COMMIT
ENV LOCO2_COMMIT $LOCO2_COMMIT

# Clean the source tree so that if any files have been deleted after the base
# image was built, they will get removed from the final image.
RUN docker/clean.sh

# Now, re-add all the source files that still exist.
ADD . /loco2

# Update any generated files to match the updated source tree
RUN docker/prepare.sh

The docker/clean.sh script looks like this:

#!/bin/bash
set -e

# Preserve generated files so we don't have to generate them again when they
# are unchanged.

ls -A | egrep -v "tmp|public|vendor" | xargs rm -r

pushd public/
ls -A | egrep -v "assets" | xargs rm -r
popd

And the docker/prepare.sh script looks like this:

#!/bin/bash
set -e

bundle check || bundle install --deployment --clean --without='development test' --jobs=4
cp docker/database.yml config/
RAILS_ENV=production rake assets:precompile

Since we preserve the bundled gems and compiled assets from the base image, most of the time docker/prepare.sh runs quickly. If the bundle or the assets have changed, it’ll be a bit slower, but still nowhere near as slow as starting from scratch.

This approach makes our image builds faster, but it’s still not exactly instantaneous. There is still quite a lot of time spent actually pulling the loco2/loco2_base image in the first place. Also, if we change our base image Dockerfile (e.g. to add a new operating system package, or upgrade the Ruby version) we must wait for it to be rebuilt before we build the final image on Travis CI.

Why don’t we just use a Docker Hub automated build for our final image, rather than building it on Travis CI?

Memory leaks in long-running processes

As mentioned previously, we use Sidekiq to process background jobs. Unfortunately, over time, the memory used by our Sidekiq processes seems to grow indefinitely (or at least grow pretty large). (This is probably not the fault of Sidekiq itself, but of our own code or code in libraries we’re using.)

While we would ideally spend time finding and fixing the leaks, it’s pretty hard to prioritise this sort of work. So for a long time we have done what many others do and used monit to keep an eye on the memory usage of our Ruby processes, gracefully restarting them when it gets too much.

But using monit doesn’t make a lot of sense in the Docker world, since the container orchestration system (ECS) is already responsible for monitoring our containers.

Docker recently added a “health check” feature which enables you to specify a health check command which will be run inside the container to determine its health. We could implement this to periodically check the memory usage of our process and report the container as unhealthy if it gets too high.

This is all fine and dandy but Docker doesn’t actually do anything about the health check status; that’s really up to the container orchestration system. Ideally, ECS would monitor the Docker container health check status and gracefully restart tasks which are unhealthy.

Unfortunately, ECS doesn’t currently support Docker health checks, though there is an open feature request for it. So we need another solution.

After casting around to try to find out how others were dealing with this problem, I drew a blank, so I ended up writing a simple memory monitoring script, which Loco2 has made available as open source.

The script runs a program and keeps an eye on its memory use. If it gets too high, it sends a SIGQUIT to the program. If the program doesn’t exit after a certain timeout, it sends a SIGKILL. That’s it - once the program has exited we can rely on ECS to notice that the task died and start a new one, so we don’t need to implement any of our own restarting logic.

It works like this:

memory_monitor --limit 1500 --interval 1 --timeout 30 sidekiq ...

This invocation would run sidekiq, monitoring its resident set size every second, and stopping it within 30 seconds if memory use exceeds 1,500 MB.
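In spirit, the approach resembles the following bash sketch. This is an illustration of the technique, not the published script itself; reading the resident set size via `ps` and the exact signal handling are my assumptions.

```shell
# Illustrative sketch of the memory-monitoring approach — not the real
# memory_monitor script. Limit is in MB; interval and timeout in seconds.
memory_monitor() {
  local limit_mb="$1" interval="$2" timeout="$3"
  shift 3

  "$@" &                          # run the wrapped program
  local pid=$!

  while kill -0 "$pid" 2>/dev/null; do
    local rss_kb
    rss_kb=$(ps -o rss= -p "$pid" 2>/dev/null || echo 0)

    if [ "${rss_kb:-0}" -gt $((limit_mb * 1024)) ]; then
      kill -QUIT "$pid"           # ask for a graceful exit...
      local waited=0
      while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$timeout" ]; do
        sleep 1; waited=$((waited + 1))
      done
      kill -0 "$pid" 2>/dev/null && kill -KILL "$pid"  # ...then force it
    fi
    sleep "$interval"
  done

  wait "$pid"                     # propagate the program's exit status
}
```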

Developer access to the production environment

From time to time developers inevitably need to get into the production environment to run rails console, or psql, or a rake task. In our legacy infrastructure we used to just ssh into a server and cd to the directory where the application was. But now we needed a way to get into a Docker container.

While it would theoretically be possible to do this on our ECS container instances, I decided to provision a dedicated admin instance for these sorts of tasks in order to avoid tying up resources on user-facing servers.

This instance is based on a stock Amazon Linux machine image, and we do some light provisioning via cloud-init to:

Then, we can access a container in production like this:

$ ssh -t [server] \
    docker run --rm -it \
    -e RAILS_ENV=production \
    loco2/loco2:latest \
    [bash|rails c|psql|...]

(In practice, we wrap this invocation up in a little script for convenience.)

Conclusion

There were quite a lot of steps to get to this point, but I think Loco2 now has a much more robust and maintainable infrastructure. I was really impressed by Terraform and it’s nice to see how quickly it is maturing. ECS is good at what it does too, but I think there are lots of ways it could improve.

When the time came to switch over to the new system it thankfully happened with very little drama!

No doubt there are many different ways of solving the problems we encountered, but I hope this provides a useful insight into the solutions we arrived at, at Loco2. For me, this whole experience underlined how much harder it can be to iterate on an existing, mature system with lots of real users than to build something from scratch.

16 December 2016
