Make deployments more reliable #172

Open
seanh opened this issue May 21, 2024 · 0 comments
Slack thread.

Save developers time by refactoring our deployment workflow to make it more reliable.

Problem

Developers lose a lot of time deploying their code: deployments often fail with a variety of error messages from Elastic Beanstalk, and we do many deploys every day.

Currently, as soon as a pull request is merged into an app's main branch, that app's deployment workflow is automatically triggered and begins building the new version of the app's Docker image, publishing the image to Docker Hub, and deploying the image to the app's Elastic Beanstalk staging environment(s). After manual approval the image is then deployed to the app's production environment(s) as well.
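The current pipeline shape described above can be sketched as a GitHub Actions workflow. This is an illustration only, not our actual workflow files: the job names, image name, and deploy script are assumptions.

```yaml
# Sketch of the current pipeline (names illustrative): every push to
# main builds and publishes the Docker image, auto-deploys it to
# staging, then deploys to production behind a manual approval gate.
on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t example/app:${{ github.sha }} .
      - run: docker push example/app:${{ github.sha }}
  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: ./bin/deploy staging ${{ github.sha }}   # illustrative script
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production   # manual approval is configured on this environment
    steps:
      - run: ./bin/deploy production ${{ github.sha }}   # illustrative script
```

The manual approval step corresponds to a required reviewer on the `production` GitHub environment.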

There are a number of issues with this approach:

  1. If multiple pull requests are merged in quick succession then each pull request starts a new deployment workflow run.

    We merge PRs in quick succession all the time.

    Previously, each subsequent deployment workflow run would wait for any previous runs to complete, but this was far too slow: if you merged several PRs at once you had to wait for several separate deployments to finish one after another. So currently subsequent workflow runs cancel in-progress ones. But this approach has a number of problems of its own:

  2. We're kicking off a large number of GitHub Actions workflow runs and Elastic Beanstalk deploys.

    If you merge six PRs it triggers six deployments (and then cancels five of them). This all costs developers time, and sometimes we're hitting error messages from Elastic Beanstalk about rate limiting that may be caused by us doing lots of deploys in rapid succession.

  3. It causes confusion and wasted time when one developer is in the middle of deploying their PR and another developer merges theirs, cancelling the first developer's deployment.

    Coordination between the two developers is now needed and the first developer has to wait for the second developer's deployment to complete before they can test their code.

  4. Deployments often fail with an error message from Elastic Beanstalk saying something to the effect of "The environment was not ready to accept a deployment at this time." For example:

    An error occurred (InvalidParameterValue) when calling the UpdateEnvironment operation: Environment named h-websocket-staging is in an invalid state for this operation. Must be Ready.
    

    I suspect that @indigobravo is right that this might be happening because a previously cancelled workflow run had already kicked off a deployment in Elastic Beanstalk: the workflow run in GitHub Actions was cancelled, but the Elastic Beanstalk deployment it started was still underway.

  5. I think GitHub's information about which version of the code is currently deployed may get out-of-sync with what is actually deployed.

    GitHub pages like the deployment workflow runs page and the deployments page tell developers which commits are currently deployed to the app's staging and production environments. If a deployment workflow run is cancelled, GitHub will still think that the previous commit is deployed. But if the run had already triggered a deployment in Elastic Beanstalk, and cancelling the workflow run doesn't actually cancel that deployment (which I don't think it does), then the environments may actually contain the newer version of the code, and the GitHub pages will be showing developers stale information.
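One way to reduce the "Must be Ready" failures in point 4 would be to check the environment's status before calling UpdateEnvironment. A minimal sketch of that guard, assuming a boto3-style Elastic Beanstalk client; the function and environment names are illustrative, and this is not what our deploy scripts currently do:

```python
# Sketch: refuse to deploy unless the Elastic Beanstalk environment
# reports Status == "Ready". The client is passed in so the logic can
# be exercised without AWS credentials; all names are illustrative.
def environment_is_ready(eb_client, env_name):
    resp = eb_client.describe_environments(EnvironmentNames=[env_name])
    envs = resp.get("Environments", [])
    return bool(envs) and envs[0].get("Status") == "Ready"

def deploy_if_ready(eb_client, app_name, env_name, version_label):
    if not environment_is_ready(eb_client, env_name):
        raise RuntimeError(f"{env_name} is not Ready; try again later")
    eb_client.update_environment(
        ApplicationName=app_name,
        EnvironmentName=env_name,
        VersionLabel=version_label,
    )
```

With boto3 the real client would be `boto3.client("elasticbeanstalk")`; in tests, any stub object with the same two methods suffices.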

Principles

There are a couple of principles that we want to maintain:

  1. Each developer is empowered to (and responsible for) deploying and testing their own PRs, and can do this at any time of the day/week.
  2. We keep the staging and production environments up-to-date with the HEAD commits of the app's main branches. A pull request doesn't have to be deployed to staging and production immediately after merging but it should be done pretty quickly: within hours, not days.

Solution

Here's an outline of a solution that I think could be much faster and more reliable while still maintaining these principles:

  1. Deployment workflow runs should once again wait for any currently-in-progress runs to finish, rather than cancelling them.
  2. Deployment workflow runs should not be automatically triggered when pull requests are merged into main. Not even just to deploy to staging. Instead, developers must trigger deployments manually after merging one or more PRs.
  3. We'll add a new scheduled GitHub Actions workflow that posts complaints into Slack if a commit has been on main for more than N hours and hasn't yet been deployed, for any of our apps.
    This is to make sure that the app's environments don't fall behind main if developers merge PRs but forget to deploy them.
    (This is already needed with the current workflow anyway: PRs are currently deployed to staging automatically but require manual approval to deploy to production and developers sometimes forget to do that so production sometimes gets out-of-date.)
    (There's a known issue with scheduled GHA workflows that we'll have to fix: Prevent ci.yml being disabled after 60 days of inactivity #100. But this is already an issue with scheduled workflows that we already depend on, so it already needs to be fixed.)
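The first two points above map directly onto GitHub Actions features: a `workflow_dispatch` trigger for manual deploys, and a concurrency group that queues runs instead of cancelling them. A sketch, with an illustrative group name:

```yaml
# Sketch of the proposed triggers: deploys are manual only, and runs
# in the same concurrency group wait for each other rather than
# cancelling in-progress runs. Group name is illustrative.
on:
  workflow_dispatch:   # developers trigger deploys by hand

concurrency:
  group: deploy-${{ github.workflow }}
  cancel-in-progress: false   # queue behind any in-progress run
```

One caveat worth noting: GitHub keeps at most one pending run per concurrency group, so if several deploys are triggered while one is running, the later pending runs supersede the earlier ones. For our purposes that seems fine, since each deploy ships the current HEAD of main anyway.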

Benefits:

  • Developers can merge a PR or batch of PRs and then deploy them all at once without triggering and cancelling several workflow runs.

    This should save time, reduce the number of deployments we're putting through Elastic Beanstalk, and make deployments more reliable by reducing or eliminating certain classes of Elastic Beanstalk errors that we frequently see with deployments (for example rate limiting errors and "environment not ready" errors).

  • GitHub's information about which commits are deployed to which environments shouldn't get out-of-sync as often because we won't be cancelling so many workflows.

    Technically it will still be possible to manually cancel a workflow run and, depending on timing, that may cause the out-of-sync issue. But manually cancelling workflow runs should be rare.

  • If a developer merges a PR while another developer's PR is deploying it won't cancel the other developer's deployment.

    Merging a PR will not automatically trigger a deploy. And even if the second developer does manually trigger a deploy it will wait for the first deploy to finish, which means the second developer has to wait, but that seems better than the current situation.

Limitations of this solution

Sometimes deployments fail with this error from Elastic Beanstalk:

botocore.exceptions.ClientError: An error occurred (Throttling) when calling the 
DescribeEnvironments operation (reached max retries: 4): Rate exceeded

We believe this may be caused by tight polling done by our eb-env-wait script combined with tight rate limits on the Elastic Beanstalk API endpoint that it calls. This is a separate issue and may not be fixed by the above solution. (On the other hand, if the above solution reduces the number of deployments it may get us under this rate limit as well.)
