[Incident] CarbonPlan AWS hub had running infrastructure we didn't track #1666

Closed

choldgraf opened this issue Aug 31, 2022 · 5 comments
choldgraf commented Aug 31, 2022

Summary

We migrated CarbonPlan's cloud infrastructure away from a bespoke kops-based cluster and onto an AWS-native cluster managed with eksctl. In the process, some of the old kops cluster's infrastructure was not fully shut down and was lost track of during the transition. It kept running in the background in a semi-decommissioned state and incurred steady cloud costs over time, even though nobody was using it.

About a year later a member of CarbonPlan asked us to look into an abnormally high cloud bill, and we discovered this running infrastructure.

Impact on users

  • Financial impact: the running cloud infrastructure was adding costs to CarbonPlan's bill each month. Depending on the amount and size of the leftover infrastructure, the impact could be more or less significant (we are investigating).

Important information

Tasks and updates

  • Discuss and address incident, leaving comments below with updates
  • Incident has been dealt with or is over
  • Copy/paste the after-action report below and fill in relevant sections
  • Incident title is discoverable and accurate
  • All actionable items in report have linked GitHub Issues
After-action report template
# After-action report

These sections should be filled out once we've resolved the incident and know what happened.
They should focus on the knowledge we've gained and any improvements we should make.

## Timeline

_A short list of dates / times and major updates, with links to relevant comments in the issue for more context._

All times in {{ most convenient timezone }}.

- {{ yyyy-mm-dd }} - [Summary of first update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of another update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of final update](link to comment)


## What went wrong

_Things that could have gone better. Ideally these should result in concrete
action items that have GitHub issues created for them and linked to under
Action items._

- Thing one
- Thing two

## Where we got lucky

_These are good things that happened to us but not because we had planned for them._

- Thing one
- Thing two

## Follow-up actions

_Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in `infrastructure/`, they can be in other repositories._

### Process improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Documentation improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Technical improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]
@choldgraf (Member Author)

After-action report

These sections should be filled out once we've resolved the incident and know what happened.
They should focus on the knowledge we've gained and any improvements we should make.

Timeline

A short list of dates / times and major updates, with links to relevant comments in the issue for more context.

All times in CET.

2022-08-30 - 10:45am

We receive a FreshDesk request asking us to look into an abnormally high cloud bill on AWS for CarbonPlan.

We investigated the Grafana dashboards at grafana.carbonplan.2i2c.cloud. This was a bit difficult because we weren't sure which dashboards were the most important to look at. The Usage report dashboard showed a number of repeated user names and seemed to have a bug in one of its queries. We decided this was a bug, but probably not important for this issue.

11:30am

We decided we didn't see anything too abnormal in Grafana. A team member tried to log in to the AWS console for CarbonPlan, but noted that it wasn't present in 2i2c's AWS console. We weren't sure how to access the CarbonPlan cluster.

We decided to wait for a team member to wake up, because they were the last one to touch the infrastructure.

2022-08-31 07:00

A team member who had set this up before logged in to the AWS console for CarbonPlan and discovered an old cluster still running from an earlier iteration of the deployment. That earlier iteration had used kops, and we had decided to move away from it in an earlier PR.

As part of that migration, we created a new cluster with eksctl and shut down the old cluster with kops. When we did this, the kops infrastructure was terminated but not entirely decommissioned, and the leftover resources kept incurring costs.

There was also a kops bug that had failed to delete some volumes, which were also incurring costs.
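
For context, here is a minimal sketch (not the exact steps from this incident) of how leftover resources like these could be enumerated with boto3. The region is a hypothetical placeholder, and `KubernetesCluster` is the tag key kops typically applies to the resources it creates.

```python
import boto3

REGION = "us-west-2"               # hypothetical region for the old cluster
CLUSTER_TAG = "KubernetesCluster"  # tag key kops typically applies to its resources

ec2 = boto3.client("ec2", region_name=REGION)

# EBS volumes that are detached ("available") but still being billed.
volumes = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)["Volumes"]
for vol in volumes:
    tags = {t["Key"]: t["Value"] for t in vol.get("Tags", [])}
    print(vol["VolumeId"], vol["Size"], "GiB", tags.get(CLUSTER_TAG, "untagged"))

# EC2 instances still carrying the old kops cluster tag.
reservations = ec2.describe_instances(
    Filters=[{"Name": "tag-key", "Values": [CLUSTER_TAG]}]
)["Reservations"]
for res in reservations:
    for inst in res["Instances"]:
        print(inst["InstanceId"], inst["State"]["Name"])
```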

08:00

We entirely decommissioned the old cluster, and are now monitoring for changes in cloud costs to see how much this will save.
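
To watch for the expected drop in spend, one option is to pull daily costs from the Cost Explorer API. A minimal sketch, assuming boto3 and an illustrative date window around the decommission:

```python
import boto3

# Cost Explorer is served from the us-east-1 endpoint regardless of where resources run.
ce = boto3.client("ce", region_name="us-east-1")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-08-24", "End": "2022-09-07"},  # illustrative window
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

# Sum the per-service groups into a daily total.
for day in response["ResultsByTime"]:
    total = sum(
        float(group["Metrics"]["UnblendedCost"]["Amount"])
        for group in day["Groups"]
    )
    print(day["TimePeriod"]["Start"], round(total, 2), "USD")
```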

What went wrong

Things that could have gone better. Ideally these should result in concrete
action items that have GitHub issues created for them and linked to under
Action items.

Follow-up actions

Process improvements

Documentation improvements

Technical improvements

@damianavila (Contributor)

I think all the checkboxes detailed in the top comment were completed, so closing this one now.

@yuvipanda (Member)

I think @choldgraf suggested to them that we check back on usage in a week and see how it went, so this isn't done yet.

@damianavila (Contributor)

Thanks for the clarification, @yuvipanda!

@choldgraf (Member Author)

I believe that we can close this one; we are following up on the response to AWS separately.
