The pipeline ran on an on-prem Jenkins cluster consisting of a Jenkins master and 23 worker nodes. We arrived at that number by adding a few nodes at a time, and each addition sped up the pipeline; 23 turned out to be the sweet spot for us. Adding more nodes beyond that actually slowed the pipeline down and started causing build failures due to timeouts, because the communication between the master and the worker nodes was too chatty for our Jenkins master to handle. I tried vertically scaling the Jenkins master several times, but GoDaddy wasn't really set up to handle that, so this was as good as we could do. I also tried several other things, such as moving Jenkins to AWS, but there were security blockers that prevented that at the time.
The pipeline's features included testing pull requests (lint, jest, and cypress), creating necessary AWS infrastructure, gated deployments, and automated rollbacks. Before attempting a deploy, each job would check for any infrastructure it was going to need and create it if it didn't exist. For us this meant creating ECR repositories for the build and test jobs, EKS clusters for the deploy job, and connecting it all with Global Accelerator. The only thing I chose not to automate was setting traffic on Global Accelerator to go live; I always wanted to leave that as a manual choice for the team. I did, however, make a job that let the team alter traffic flows across all of our regions, so going live was still an easy task.
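To make the "create it if it didn't exist" idea concrete, here is a minimal sketch of that kind of check for an ECR repository, assuming the AWS SDK for JavaScript v3. The region and repository name are placeholders, not the real values, and the actual jobs did this for more than just ECR.

```typescript
import {
  ECRClient,
  DescribeRepositoriesCommand,
  CreateRepositoryCommand,
} from "@aws-sdk/client-ecr";

const ecr = new ECRClient({ region: "us-west-2" }); // region is illustrative

// Ensure an ECR repository exists before the build job tries to push to it.
async function ensureRepository(repositoryName: string): Promise<void> {
  try {
    await ecr.send(
      new DescribeRepositoriesCommand({ repositoryNames: [repositoryName] })
    );
    console.log(`Repository ${repositoryName} already exists, nothing to do.`);
  } catch (err: any) {
    if (err.name !== "RepositoryNotFoundException") throw err;
    await ecr.send(new CreateRepositoryCommand({ repositoryName }));
    console.log(`Created repository ${repositoryName}.`);
  }
}

// Hypothetical usage from a pipeline job:
ensureRepository("products-web").catch((err) => {
  console.error(err);
  process.exit(1);
});
```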
Most teams, including our own before I built the pipeline, had manual deployments. I remember when I first joined the team we were doing manual releases to production on Tuesdays, and the first release step involved using local changes on a dev machine 👀. Teams that did have CICD pipelines usually had a final manual step to go live in production, and all of those pipelines seemed to take over an hour, which was considered good at the company at the time. I set up our pipeline to deploy to our lower environments and then to production. Each time it deployed to an environment, if there were any problems, it would stop immediately, roll back to the previous version, and alert the team. Rollbacks happened mostly because of a few straggling flaky tests that had unmocked calls to APIs that were never quite as stable as we'd like in our lower environments; those APIs were owned by other teams and out of our control. Eventually we mocked those, and the pipeline ran so smoothly we had to remind devs not to walk away from their computers after merging, just in case something went wrong. It rarely did, but best practice is always to verify that your change made it to production successfully if there are no automated checks to do so.
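This isn't the actual deploy job, but a sketch of that gate-then-rollback shape, assuming a Kubernetes deployment on EKS and `kubectl` on the path. The deployment name, smoke-test URL, and `notifyTeam` helper are all hypothetical stand-ins.

```typescript
import { execSync } from "node:child_process";

// Sketch of the per-environment gate: deploy, verify, roll back on any failure.
// "products-web" and the smoke-test URL are placeholders, not the real names.
function deployAndVerify(image: string, context: string): void {
  const kubectl = (args: string) =>
    execSync(`kubectl --context ${context} ${args}`, { stdio: "inherit" });

  kubectl(`set image deployment/products-web web=${image}`);
  try {
    // Wait for the rollout to finish; a hung or crash-looping rollout fails here.
    kubectl(`rollout status deployment/products-web --timeout=5m`);
    // Simple post-deploy smoke check; the real gate ran far more than this.
    execSync(`curl --fail --silent https://lower-env.example.com/healthcheck`);
  } catch (err) {
    // Any failure stops the pipeline, reverts to the previous version, and alerts.
    kubectl(`rollout undo deployment/products-web`);
    notifyTeam(`Deploy of ${image} to ${context} failed and was rolled back.`);
    throw err;
  }
}

// Placeholder for whatever alerting the team uses (Slack, email, etc.).
function notifyTeam(message: string): void {
  console.error(message);
}
```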
I added a lot of optimizations to the build step to cut build times down. This was done by adding some custom logic around when to build the base layer and when to reuse one. Base images were stored separately from release builds in ECR; we would check whether we had a matching base image and pull it down if we did. This took the image build time down from around 10 minutes for every build to seconds for most builds. Any change that affected the underlying base image would of course still take around 10 minutes to build, but the overwhelming majority of our changes didn't touch the base layer.
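One way to implement that matching logic, sketched here under the assumption that the base image is tagged with a hash of the files that define it. The repository name, registry URL, and file list are illustrative only.

```typescript
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";
import { execSync } from "node:child_process";
import { ECRClient, DescribeImagesCommand } from "@aws-sdk/client-ecr";

const ecr = new ECRClient({ region: "us-west-2" }); // illustrative region
const BASE_REPO = "products-web-base"; // hypothetical base-image repository
const REGISTRY = "123456789012.dkr.ecr.us-west-2.amazonaws.com"; // placeholder

// Tag the base image with a hash of the files that define it, so any change
// to those files produces a new tag and forces a rebuild.
function baseTag(): string {
  const hash = createHash("sha256");
  for (const file of ["Dockerfile.base", "package-lock.json"]) {
    hash.update(readFileSync(file));
  }
  return hash.digest("hex").slice(0, 12);
}

async function ensureBaseImage(): Promise<string> {
  const tag = baseTag();
  const image = `${REGISTRY}/${BASE_REPO}:${tag}`;
  try {
    // If the tag already exists in ECR, reuse it: a pull takes seconds.
    await ecr.send(
      new DescribeImagesCommand({
        repositoryName: BASE_REPO,
        imageIds: [{ imageTag: tag }],
      })
    );
    execSync(`docker pull ${image}`, { stdio: "inherit" });
  } catch (err: any) {
    if (err.name !== "ImageNotFoundException") throw err;
    // Only changes to the base layer's inputs pay the ~10 minute build cost.
    execSync(`docker build -f Dockerfile.base -t ${image} .`, { stdio: "inherit" });
    execSync(`docker push ${image}`, { stdio: "inherit" });
  }
  return image;
}
```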
I spent a good amount of time optimizing the test step. The /products page alone accounts for 10% of company revenue, and we had a lengthy cypress test suite to go along with it to make sure we didn't break anything. A simple run of the test suite as we were first setting this up took about 40 minutes. That first pass was the naive approach anyone could do, and it was obviously very slow. I started running tests in parallel as we got more worker nodes to work with, but more optimization was needed. For a while we ran 10 worker nodes with a simple parallelization of all tests, and run times were around 25 minutes. The first problem with this approach involved a Jenkins auth plugin written by another team at GoDaddy and how we connected to the AWS environments: too many connection requests would be rejected, and retrying the connections just burned time. A lazy connection approach did the trick here. With that out of the way, there was another obvious problem: our application took 30 seconds to start up, and while this approach was still far faster than the first pass, that startup cost didn't need to be paid for every test run. Grouping the tests into suites that could run in parallel was a much better approach, and it kept our test time down to around 8 minutes for the full suite.
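As a rough illustration of the grouping idea, here is a sketch that buckets spec files and runs one `cypress run` per bucket, so each runner pays the startup cost once per group instead of once per spec. It assumes specs live under `cypress/e2e`; the directory, extension, and group count are illustrative, and in the real pipeline each group would land on its own Jenkins worker rather than one machine.

```typescript
import { readdirSync } from "node:fs";
import { spawn } from "node:child_process";

const SPEC_DIR = "cypress/e2e"; // illustrative location of the spec files
const GROUPS = 10;              // illustrative group count

const specs = readdirSync(SPEC_DIR)
  .filter((f) => f.endsWith(".cy.js"))
  .map((f) => `${SPEC_DIR}/${f}`);

// Round-robin the specs into a fixed number of groups.
const groups: string[][] = Array.from({ length: GROUPS }, () => []);
specs.forEach((spec, i) => groups[i % GROUPS].push(spec));

// Run one cypress process per non-empty group, in parallel.
const runs = groups
  .filter((group) => group.length > 0)
  .map(
    (group) =>
      new Promise<number>((resolve) => {
        const child = spawn("npx", ["cypress", "run", "--spec", group.join(",")], {
          stdio: "inherit",
        });
        child.on("exit", (code) => resolve(code ?? 1));
      })
  );

// Fail the whole step if any group fails.
Promise.all(runs).then((codes) => {
  process.exit(codes.every((code) => code === 0) ? 0 : 1);
});
```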
Having started with a manual process that at best took over an hour and didn't include any automated testing, I built a CICD pipeline that had automated tests at every critical step and ultimately had a deploy time of around 15 minutes (~8 of which was spent testing). Because we also ran our full cypress suite on pull requests, we rarely had anything slip into main that wasn't supposed to be there. The pipeline worked so well that when the team took on new applications, we ended up using it to deploy those as well. Eventually it deployed all of the infrastructure and pages for our org.