Autoscaling Jenkins on EC2
Sid Shanker, 17 Nov 2017
At Fin, we rely pretty heavily on our Jenkins CI cluster for running our test suite to ensure that we aren’t deploying code that breaks our tests. Since we run CI on every push to a branch with a Pull Request up, as we’ve added more developers to the team and written more tests, we’ve had to add more, and beefier, EC2 instances to our Jenkins cluster. While this has been great for making sure that nobody is spending too long waiting for Jenkins to pick up their build, we now waste a lot of money paying for instances that are running at times when nobody is working, say in the middle of the night or on weekends.
We have a custom test runner that intelligently runs our frontend/backend tests based on what files changed with balanced sharding. While we evaluated some SaaS CI solutions in which pricing was based more on usage, almost everything we evaluated would have required us to seriously re-engineer how we run our test suite. We ultimately decided that the easiest and most economical solution was to use EC2 Autoscaling.
EC2 Autoscaling works by creating a pool of instances with a particular tag. You can set the number of instances you’d like to have up (manually in the console, via the AWS API, or via some automated CloudWatch event), and EC2 will kill or bring up instances in order to hit the number of instances that you’ve targeted.
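As a rough illustration of the "via the AWS API" route, here is a minimal Python sketch built around boto3's `set_desired_capacity` call. The group name, the working-hours window, and the peak/off-peak sizes are all hypothetical, and the client is passed in so the logic can be exercised without AWS credentials.

```python
from datetime import datetime


def desired_for(now, peak=6, off_peak=1):
    """Pick a target runner count for a given time.

    The 08:00-20:00 weekday window and the default sizes are
    illustrative assumptions, not values from this post.
    """
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    in_working_hours = 8 <= now.hour < 20
    return peak if is_weekday and in_working_hours else off_peak


def scale_runners(autoscaling_client, group_name, desired):
    """Ask EC2 Auto Scaling to converge the group on `desired` instances."""
    if desired < 0:
        raise ValueError("desired capacity cannot be negative")
    autoscaling_client.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=desired,
        HonorCooldown=False,  # apply immediately instead of waiting out a cooldown
    )
    return desired
```

In real use the client would be `boto3.client("autoscaling")`, for example `scale_runners(client, "jenkins-runners", desired_for(datetime.now()))`. The desired value has to stay within the group's configured minimum and maximum size, or the API call is rejected.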
This meant that we needed to make some serious changes to how our Jenkins infrastructure worked. Here are the main requirements that we needed to hit in order to use EC2 Autoscaling:
- A script that can run on new Jenkins runners that installs all the necessary software to run our builds
- An automated way for Jenkins runners to register themselves with the Jenkins primary
- A mechanism for disconnecting runners gracefully from the primary before they are terminated
- CloudWatch metrics that we can use to trigger autoscaling events
Making these changes to our infrastructure would have other benefits too — for instance, it means that it’s trivially easy to add new instances if we need to for whatever reason, and that if an instance is acting up, we can safely destroy it. It also forced us to break our habit of running arbitrary commands on these boxes to clear space or install/upgrade software, and gets us closer to the “treat your servers like cattle, not pets” devops mantra that we follow closely with the rest of our production infrastructure.
Starting New Instances
The first step of this project was learning how to set up brand new Jenkins runners. While we had some shell scripts for getting a new runner working before this project, we had also done a bunch of custom work on our existing Jenkins runners that was not reflected in those scripts. The answer here was using Ansible. Since our tests run in Docker, the Ansible setup for these machines is fairly simple.
The next step was figuring out a good way to alert the Jenkins primary programmatically that the new runner existed. While we spent some time investigating how to use the Jenkins CLI remotely, the best solution we found for this is the Jenkins Swarm plugin. The way this works is that on the runner, you run a Java jar, specifying the address of the Jenkins primary, and leave it running. The command to run will look something like this:
java -jar /usr/local/bin/swarm-client.jar -master http://JENKINS_MASTER_IP:JENKINS_MASTER_PORT -username VALID_USERNAME -password VALID_PASSWORD
We run this in a service called jenkins-swarm, which we manage with systemd. Our Ansible scripts set up and run this service. An important note here is that if you have authentication set up on your Jenkins primary instance, you’ll have to provide valid authentication credentials in order for the Jenkins runner to connect. We use Github authentication on Jenkins, and used a machine user that we already had for other devops tasks. The Jenkins swarm command allows you to pass an API token (which you can generate in the Github UI) as the “password” value. It’s also worth mentioning that some of the errors that Jenkins throws while attempting to connect are fairly cryptic: authentication errors are 500s with no context, for instance.
Another thing worth noting on the subject of authentication is that you should not store secrets like your machine user’s Github API token in plaintext in Ansible. We store our encrypted secrets on S3 and download them as part of the Ansible setup, but you could also use Ansible Vault to encrypt them, depending on your setup.
Disconnecting Instances Gracefully
The next part of this project was figuring out how to gracefully disconnect Jenkins runners. I initially tried to figure out a way to automatically restart a build if a Jenkins runner was terminated mid-build. However, on the Jenkins side, it was hard to detect whether the build failed because of a legitimate failure or a runner disconnect.
I then discovered that Jenkins supports marking nodes (runners) offline.
This is a great feature that does not stop the current build, but prevents new builds from being scheduled on the node. This, in conjunction with AWS Autoscaling Lifecycle Hooks, turned out to work well for gracefully terminating instances.
Autoscaling lifecycle hooks can be registered on instance termination. What this allows you to do is have an SNS channel that gets notified when the autoscaling rules require that an instance be torn down. Having the lifecycle hook registered also automatically delays the termination of the instance by some time period. We then have an AWS Lambda Python job that listens to this channel and, when it receives a payload, marks the Jenkins runner that is being terminated offline.
The next question here is how to figure out which Jenkins runner to mark offline in the Lambda job. The trick we used here was naming each of the Jenkins runners with the EC2 instance id of the box. Jenkins runners can be named via the -name option to the Jenkins Swarm client jar, as follows:
java -jar /usr/local/bin/swarm-client.jar -master http://JENKINS_MASTER_IP:JENKINS_MASTER_PORT -username VALID_USERNAME -password VALID_PASSWORD -name EC2_INSTANCE_ID
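One way to fill in EC2_INSTANCE_ID at boot is to ask the EC2 instance metadata service. The sketch below is an assumption about how that could be wired up in Python, not a description of our actual service. The helper names are made up, and the plain GET assumes the instance still accepts unauthenticated (IMDSv1-style) metadata requests.

```python
import urllib.request

# Well-known link-local address of the EC2 instance metadata service.
METADATA_URL = "http://169.254.169.254/latest/meta-data/instance-id"


def fetch_instance_id(timeout=2.0):
    """Return this machine's EC2 instance id (only works on an EC2 instance)."""
    with urllib.request.urlopen(METADATA_URL, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def swarm_command(master_url, username, api_token, node_name,
                  jar_path="/usr/local/bin/swarm-client.jar"):
    """Build the argv for the Swarm client, using the same flags as above."""
    return [
        "java", "-jar", jar_path,
        "-master", master_url,
        "-username", username,
        "-password", api_token,
        "-name", node_name,
    ]
```

A systemd unit could then exec something like `swarm_command(url, user, token, fetch_instance_id())` via `subprocess.run`. Depending on the Swarm client version, the node name that Jenkins ends up with may carry an extra unique suffix, so it is worth checking the registered name before relying on an exact match.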
Since the EC2 instance id is passed to the Lambda job when the lifecycle hook fires, it’s very easy to mark the node offline. Here is the code for our Lambda task:
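The listing itself did not survive in this copy of the post. What follows is not that code but a minimal sketch of how such a handler could look, under several assumptions: the lifecycle hook publishes to SNS, the node name equals the instance id, Jenkins accepts HTTP basic auth with an API token and no CSRF crumb, and the environment variable names are hypothetical.

```python
import base64
import json
import os
import urllib.parse
import urllib.request


def terminating_instance_id(event):
    """Pull the instance id out of an SNS-delivered lifecycle hook event.

    Returns None for anything that is not a termination notice, such as the
    test notification Auto Scaling sends when a hook is first attached.
    """
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        if message.get("LifecycleTransition") == "autoscaling:EC2_INSTANCE_TERMINATING":
            return message.get("EC2InstanceId")
    return None


def toggle_offline_request(base_url, node_name, username, api_token, reason):
    """Build the POST that flips a node's offline flag in Jenkins.

    toggleOffline is a toggle: sent to a node that is already offline,
    it would bring the node back online.
    """
    query = urllib.parse.urlencode({"offlineMessage": reason})
    url = f"{base_url}/computer/{urllib.parse.quote(node_name)}/toggleOffline?{query}"
    credentials = base64.b64encode(f"{username}:{api_token}".encode()).decode()
    return urllib.request.Request(
        url, data=b"", method="POST",
        headers={"Authorization": f"Basic {credentials}"},
    )


def handler(event, context):
    """Lambda entry point: mark the terminating runner offline."""
    instance_id = terminating_instance_id(event)
    if instance_id is None:
        return {"marked_offline": None}
    request = toggle_offline_request(
        os.environ["JENKINS_URL"],    # hypothetical variable names
        instance_id,
        os.environ["JENKINS_USER"],
        os.environ["JENKINS_TOKEN"],
        "terminating via autoscaling",
    )
    urllib.request.urlopen(request, timeout=10)
    return {"marked_offline": instance_id}
```

Because the endpoint toggles rather than sets the state, a more careful version would first read the node's `offline` field from `/computer/NODE_NAME/api/json` and only post when the node is still online.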
Writing Metrics to CloudWatch
The final part of this project was collecting metrics on how busy the Jenkins build queue is. While we ended up autoscaling our runners based on a simple schedule, we might in the future use metrics around how deep the build queue is to drive the number of runners we have. Furthermore, having this data makes it easier to adjust the number of runners we have in our schedule. Thankfully, Jenkins has some endpoints built-in for collecting metrics on the status of the cluster.
https://JENKINS-HOST-NAME/queue/api/json provides a JSON blob that contains the number of items currently in the build queue, and https://JENKINS-HOST-NAME/computer/api/json provides a JSON blob with the number of busy and total runners connected to the primary. We have another AWS Lambda job that runs every minute, hits these endpoints, and writes the number of items in the queue, number of busy runners, and number of total runners to CloudWatch. While we currently don’t have alarms set up on any of these metrics, it would be fairly easy to set up a CloudWatch alarm that triggers an autoscaling event at a certain number of items in the queue or certain number of idle runners.
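To make the shape of that job concrete, here is a minimal Python sketch rather than our actual Lambda. The metric names, the "Jenkins" namespace, and the choice to count only online nodes as "total" are assumptions, and the CloudWatch client is passed in so the summarizing logic can be run without AWS.

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """GET a Jenkins JSON endpoint (add auth headers if your primary needs them)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize(queue_json, computer_json):
    """Reduce the two Jenkins payloads to the three numbers we track.

    Note that the `computer` list can include the primary's own built-in
    node, which you may want to filter out by name.
    """
    online = [c for c in computer_json.get("computer", []) if not c.get("offline")]
    busy = sum(1 for c in online if not c.get("idle"))
    return {
        "QueueLength": len(queue_json.get("items", [])),
        "BusyRunners": busy,
        "TotalRunners": len(online),
    }


def to_metric_data(summary):
    """Convert the summary into the MetricData list put_metric_data expects."""
    return [
        {"MetricName": name, "Value": value, "Unit": "Count"}
        for name, value in sorted(summary.items())
    ]


def publish(cloudwatch_client, summary, namespace="Jenkins"):
    """Write one datapoint per metric to CloudWatch."""
    cloudwatch_client.put_metric_data(
        Namespace=namespace,
        MetricData=to_metric_data(summary),
    )
```

A scheduled Lambda handler would call `fetch_json` on the two endpoints above, then `publish(boto3.client("cloudwatch"), summarize(queue, computers))`. An alarm on `QueueLength` could then drive a scaling policy.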
While wrangling with some of the Jenkins quirks was frustrating at times, it feels good to be in a place where we have more control over how much we are spending on CI infrastructure.
Hopefully this guide was helpful, and feel free to reach out if you have any suggestions or questions about your own CI autoscaling setup!