How to Build a Robust Scheduled Task (Cron) Service in AWS Using CloudWatch, Lambda, DynamoDB and ECS

How we ditched Heroku Scheduler and built our scheduled tasks infrastructure in AWS (+ Terraform!)

Greg Einfrank, 12 May 2017

We've just finished migrating our infrastructure completely from Heroku to AWS. There are a ton of awesome benefits we will get from this switch, but one I want to specifically talk about here is scheduled tasks.

To run our "scheduled tasks" (aka cron tasks) we mostly relied on Heroku Scheduler. This is a super simple add-on to Heroku that you can use to run tasks on one of their predefined frequencies.

[Image: Heroku Scheduler example]

This tool was too limiting for us because:

  • The only interval choices you have are "Daily", "Hourly" or "Every 10 minutes", meaning there's no way to run a job every minute (it's still completely unclear to me why this is so limited and why these were the options they landed on 🤷)
  • There's no way to see execution statistics or history beyond "Last Run"
  • These tasks are built for your Heroku app, so they can only run within your Heroku app environment (Rails in our case)
  • These tasks are solely managed through the Heroku Scheduler Dashboard (web UI), not in source control

So, when planning out how to move the scheduled task infrastructure to AWS, we put together this list of requirements:

  1. Schedules must be configured / managed in code. This is helpful for keeping track of all the running scheduled tasks in one place, and having a history in git of previous task schedules
  2. Schedules should support cron scheduler syntax for maximum flexibility
  3. Each task should have a unique lock, to prevent two instances of the same task from running concurrently
  4. Support for multiple app environments and not just rake tasks
  5. Failure alerts and monitoring

One option was to run a cron process somewhere and kick off jobs on defined schedules based on the crontab. This would work, but would require more overhead, as it's basically the same as starting a new app service that we would have to maintain. What runtime environment would it run in? How would it start new jobs? How would we update and configure the schedule through code? What happens if that process is killed or fails? Because of the added complexity of maintaining a running cron process with no simple way to manage it given our requirements, we decided to try running tasks on ECS and scheduling them using CloudWatch Events.

We were already in the process of moving our app servers to ECS, so piggy-backing off of some of the work we were already putting into running our application(s) in ECS made a lot of sense for scheduled tasks. ECS doesn't have a built-in solution for this use case, but we could take advantage of a bunch of AWS services and put them together to get exactly what we wanted. With AWS, once you buy in, there are a ton of awesome tools at your disposal that work really nicely together. The solution we came up with uses a combination of CloudWatch, Lambda, DynamoDB and ECS.

The basic structure looks something like this, and I'll go into detail about each component below:

[Image: Scheduled task architecture]

CloudWatch Events to Trigger the Task on a Schedule

We use CloudWatch Event Rules to trigger a Lambda function on a schedule of our choosing. This is way more expressive than Heroku Scheduler and supports both cron expressions and rate expressions, which lets us define "every 5 minutes" in two ways: cron(0/5 * * * ? *) means "run every 5 minutes, on minutes that are a multiple of 5 (:00, :05, :10, etc.)", while rate(5 minutes) means "run every 5 minutes starting from when the event rule was created". Rate expressions are nice because they're easier to understand at a quick glance if you don't care about exactly when the task runs.
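To make the two syntaxes concrete, here is a small sketch of helpers that assemble schedule expression strings. The function names are hypothetical and not part of any AWS SDK; they only encode two formatting rules: a rate of 1 takes a singular unit, and a cron expression has six fields in which day-of-month and day-of-week cannot both be given a value.

```python
def rate_expression(value: int, unit: str) -> str:
    """Build a rate() expression; `unit` is 'minute', 'hour' or 'day'."""
    if value < 1:
        raise ValueError("rate value must be a positive integer")
    if unit not in ("minute", "hour", "day"):
        raise ValueError(f"unsupported unit: {unit}")
    # The unit is singular for a value of 1 and plural otherwise.
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


def cron_expression(minutes="*", hours="*", day_of_month="*",
                    month="*", day_of_week="?", year="*") -> str:
    """Build a six-field cron() expression.

    Field order: minutes, hours, day-of-month, month, day-of-week, year.
    """
    # If one of the two day fields has a value or '*', the other must be '?'.
    if day_of_month != "?" and day_of_week != "?":
        raise ValueError("one of day_of_month / day_of_week must be '?'")
    fields = (minutes, hours, day_of_month, month, day_of_week, year)
    return "cron({})".format(" ".join(str(f) for f in fields))


if __name__ == "__main__":
    print(rate_expression(5, "minute"))
    print(cron_expression(minutes="0/5"))
```

The resulting string is what would be passed as the schedule expression when creating the event rule, whether through Terraform or an SDK call.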

Lambda Function to Bridge the Gap Between the Event Trigger and the ECS Task

This step is really only necessary to bridge the gap between the CloudWatch Event and the ECS task. Unfortunately, at the time of writing, CloudWatch Events can trigger Lambda functions but not ECS tasks, so we use a Lambda function that mostly just forwards the request to the ECS task.
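As an illustration of what such a forwarding function might look like, here is a minimal sketch in Python using boto3's ECS run_task call. It is not the code from our module: the environment variable names (ECS_CLUSTER_ARN, TASK_DEFINITION_FAMILY, CONTAINER_NAME, TASK_COMMAND) and the optional ecs_client parameter are assumptions made for this example.

```python
import os
import shlex


def build_run_task_request(cluster, task_definition, container_name, command):
    """Build keyword arguments for the ECS RunTask API call."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "count": 1,
        "overrides": {
            # Reuse the existing task definition, replacing only the command.
            "containerOverrides": [
                {"name": container_name, "command": shlex.split(command)}
            ]
        },
    }


def lambda_handler(event, context, ecs_client=None):
    """Entry point invoked by the scheduled event rule."""
    if ecs_client is None:
        import boto3  # imported lazily so the sketch can be tested without AWS
        ecs_client = boto3.client("ecs")

    # Hypothetical configuration, read from the function's environment.
    request = build_run_task_request(
        cluster=os.environ["ECS_CLUSTER_ARN"],
        task_definition=os.environ["TASK_DEFINITION_FAMILY"],
        container_name=os.environ["CONTAINER_NAME"],
        command=os.environ["TASK_COMMAND"],
    )
    response = ecs_client.run_task(**request)

    # RunTask reports placement problems in "failures" instead of raising.
    failures = response.get("failures", [])
    if failures:
        raise RuntimeError(f"ECS could not start the task: {failures}")
    return [task["taskArn"] for task in response.get("tasks", [])]
```

Raising on a non-empty failures list makes the Lambda invocation itself fail, so a task that could not be placed shows up in the function's error metrics rather than passing silently.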

Running Tasks on ECS

We have a few services already set up using ECS, so we reuse the pre-defined task definitions for our web apps and use container overrides to replace the command that we want to run (for example, bundle exec rake send_nightly_analytics_email). This is nice because we don't have to spin up new instances for the scheduled tasks to run; they run in the same ECS cluster as all of our other services and tasks and use the same IAM roles and permissions.

DynamoDB as a Lock

At this point most of the basic requirements are met by using CloudWatch, Lambda and ECS to run tasks on a defined schedule. Next, we need to ensure that only one instance of each task can run at any given time, so we need some sort of global lock. For that, we use DynamoDB, which is a NoSQL database service run on AWS. It can be used as a lock by taking advantage of its conditional updates.

By using Python's contextmanager, this lock can be used to wrap any block of code like this:

merge_lock = DynamoDBLock(lock_name, dynamodb_table_name, region_name='us-east-1')
with merge_lock.lock():
    # Run scheduled task...
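The lock class itself is not reproduced above, so here is a minimal sketch of how a DynamoDBLock with that interface might be built on a conditional put. The attribute names (LockName, ExpiresAt), the ttl_seconds and client parameters, and the LockNotAcquired exception are all assumptions for illustration, not the actual implementation.

```python
import time
from contextlib import contextmanager


class LockNotAcquired(Exception):
    """Raised when another holder already owns the lock."""


class DynamoDBLock:
    def __init__(self, lock_name, dynamodb_table_name, region_name="us-east-1",
                 ttl_seconds=3600, client=None):
        if client is None:
            import boto3  # imported lazily so a stub client can be injected
            client = boto3.client("dynamodb", region_name=region_name)
        self.client = client
        self.lock_name = lock_name
        self.table_name = dynamodb_table_name
        self.ttl_seconds = ttl_seconds

    @contextmanager
    def lock(self):
        now = int(time.time())
        try:
            # The write succeeds only if no lock item exists or it has expired.
            self.client.put_item(
                TableName=self.table_name,
                Item={
                    "LockName": {"S": self.lock_name},
                    "ExpiresAt": {"N": str(now + self.ttl_seconds)},
                },
                ConditionExpression=(
                    "attribute_not_exists(LockName) OR ExpiresAt < :now"
                ),
                ExpressionAttributeValues={":now": {"N": str(now)}},
            )
        except self.client.exceptions.ConditionalCheckFailedException:
            raise LockNotAcquired(self.lock_name) from None
        try:
            yield
        finally:
            # Simplification: assumes the task finished before the lock expired.
            self.client.delete_item(
                TableName=self.table_name,
                Key={"LockName": {"S": self.lock_name}},
            )
```

The expiry timestamp is there so that a task which crashes without releasing the lock does not block every later run. A more careful version would also store an owner token and delete the item only if that token still matches.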

The great thing about DynamoDB is that it's accessible from any app or service running within our AWS infrastructure, so it's not app-specific.

Terraform

Looking at the process like this makes it seem really complex - is it really worth the added complexity of setting up each task like this?

Yes! We wouldn't dream of setting up something like this without the help of Terraform. One of the main benefits of moving off of Heroku and onto AWS was the ability to manage our infrastructure as code (and it was one of the requirements of this scheduled task project). By taking advantage of Terraform modules, we were able to package up all the resources required for a task to the point where creating a new scheduled task looks like:

module "send_daily_report" {
    source = "git@github.com:finventures/lambda-ecs-scheduled-task"
    task_name = "send_daily_report"                   # Unique name, to use for locking
    command = "bundle exec rake send_daily_report"    # The actual command to be run
    event_schedule = "cron(0 7 * * ? *)"              # This schedule represents "Daily at 7:00 GMT"
    ecs_cluster_arn = "${var.ecs_cluster_arn}"        # ECS Cluster to run the task in
    task_definition_family = "${var.task_def_family}"  # Task definition for ECS
    container_name = "${var.ecs_container_name}"      # ECS container name to run the task in
    lambda_role_arn = "${var.lambda_role_arn}"        # The IAM role used by the lambda function
    locks_table_name = "ScheduledTaskLocks"           # The DynamoDB table, used for locking
    is_enabled = "true"                               # Enable or disable the CloudWatch event
}

You can see the full Terraform module [here](https://github.com/finventures/lambda-ecs-scheduled-task).

One decision we are explicitly making here is that the schedules for these tasks are not coupled with our Rails app code. Some engineers raised this as a concern because they felt the schedules belonged with the app code, but we decided against it because:

  1. This module can be used for multiple apps/services, so coupling it with our Rails code limits our ability to do that
  2. The tasks themselves should be defined as part of app code, but the schedules to run them make more sense as infrastructure
  3. The ability to turn tasks on or off without a full app deploy is powerful

Lessons Learned / Benefits we've enjoyed thus far...

Monitoring

Two scheduled jobs had been running on Heroku Scheduler for over a year, invoking rake tasks that were no longer defined in our codebase. Because of the enhanced logging and alerting we get from our own scheduler, these became immediately obvious, and we fixed all the errors within the first hour of launching.

Locking

We had an implementation of a PostgreSQL lock within our Rails code that we needed to remember to add to each rake task that "needed" it. The problem was that we were relying on every engineer to know to wrap their rake task definition with the lock, and a lock living in PostgreSQL was, again, specific to the Rails app. The forced locking in the new system makes it impossible to accidentally run two instances of the same job at the same time, and the locks can be used by other services and apps within our AWS account.

Reliability and Security

AWS is super reliable, and it allows us to not worry about maintaining our own Cron server. By using IAM roles for each process, each role/process only has permissions to perform the actions that it needs to and no other permissions. By limiting actions to only the ones we expect, we limit the scope of what can go wrong when there is a bug or something else unexpected happens.