[Incident] CarbonPlan AWS hub had running infrastructure we didn't track #1666

Closed

choldgraf opened this issue Aug 31, 2022 · 5 comments
choldgraf commented Aug 31, 2022

Summary

We migrated CarbonPlan's cloud infrastructure away from a bespoke kops-based cluster and onto an AWS-native cluster managed with eksctl. In the process, some of the old kops cluster's infrastructure was not fully shut down and was lost track of during the transition. It kept running in the background in a semi-decommissioned state and incurred steady cloud costs over time, even though nobody was using it.

About a year later a member of CarbonPlan asked us to look into an abnormally high cloud bill, and we discovered this running infrastructure.

Impact on users

  • Financial impact: the running cloud infrastructure was adding costs to CarbonPlan's bill each month. Depending on the amount and size of the leftover infrastructure, the impact could be more or less significant (we are investigating).

Important information

Tasks and updates

  • Discuss and address incident, leaving comments below with updates
  • Incident has been dealt with or is over
  • Copy/paste the after-action report below and fill in relevant sections
  • Incident title is discoverable and accurate
  • All actionable items in report have linked GitHub Issues
After-action report template
# After-action report

These sections should be filled out once we've resolved the incident and know what happened.
They should focus on the knowledge we've gained and any improvements we should make.

## Timeline

_A short list of dates / times and major updates, with links to relevant comments in the issue for more context._

All times in {{ most convenient timezone }}.

- {{ yyyy-mm-dd }} - [Summary of first update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of another update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of final update](link to comment)


## What went wrong

_Things that could have gone better. Ideally these should result in concrete
action items that have GitHub issues created for them and linked to under
Action items._

- Thing one
- Thing two

## Where we got lucky

_These are good things that happened to us but not because we had planned for them._

- Thing one
- Thing two

## Follow-up actions

_Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in `infrastructure/`, they can be in other repositories._

### Process improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Documentation improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Technical improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]
@choldgraf (Member Author)

After-action report

These sections should be filled out once we've resolved the incident and know what happened.
They should focus on the knowledge we've gained and any improvements we should make.

Timeline

A short list of dates / times and major updates, with links to relevant comments in the issue for more context.

All times in CET.

2022-08-30 - 10:45am

We receive a FreshDesk request asking us to look into an abnormally high cloud bill on AWS for CarbonPlan.

We investigated the Grafana dashboards at grafana.carbonplan.2i2c.cloud. This was a bit difficult because we weren't sure which dashboards were the most important to look at. The Usage report dashboard showed a number of repeated user names and seemed to have a bug in one of its queries. We decided this was a bug, but probably not important for this issue.

11:30am

We decided we didn't see anything too abnormal in Grafana. A team member tried to log in to the AWS console for CarbonPlan, but noted that it wasn't present in 2i2c's AWS console. We weren't sure how to access the CarbonPlan cluster.

We decided to wait for a team member to wake up, because they were the last one to touch the infrastructure.

2022-08-31 07:00

A team member who had set this up before logged in to the AWS console for CarbonPlan and discovered an old cluster still running from an earlier iteration of the deployment. That earlier iteration had used kops, and we had decided to move away from it in an earlier PR.

As part of that migration, we created a new cluster with eksctl and shut down the old cluster with kops. When we did this, the kops infrastructure was terminated but not entirely decommissioned, and the leftover resources kept incurring costs.

There was also a kops bug that had failed to delete some volumes, which were also incurring costs.
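
For context, here is a minimal sketch (not the exact steps from this incident) of how leftover resources like these could be enumerated with boto3. The region is a hypothetical placeholder, and `KubernetesCluster` is the tag key kops typically applies to the resources it creates.

```python
import boto3

REGION = "us-west-2"               # hypothetical region for the old cluster
CLUSTER_TAG = "KubernetesCluster"  # tag key kops typically applies to its resources

ec2 = boto3.client("ec2", region_name=REGION)

# EBS volumes that are detached ("available") but still being billed.
volumes = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)["Volumes"]
for vol in volumes:
    tags = {t["Key"]: t["Value"] for t in vol.get("Tags", [])}
    print(vol["VolumeId"], vol["Size"], "GiB", tags.get(CLUSTER_TAG, "untagged"))

# EC2 instances still carrying the old kops cluster tag.
reservations = ec2.describe_instances(
    Filters=[{"Name": "tag-key", "Values": [CLUSTER_TAG]}]
)["Reservations"]
for res in reservations:
    for inst in res["Instances"]:
        print(inst["InstanceId"], inst["State"]["Name"])
```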

08:00

We entirely decommissioned the old cluster, and are now monitoring for changes in cloud costs to see how much this will save.
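
To watch for the expected drop in spend, one option is to pull daily costs from the Cost Explorer API. A minimal sketch, assuming boto3 and an illustrative date window around the decommission:

```python
import boto3

# Cost Explorer is served from the us-east-1 endpoint regardless of where resources run.
ce = boto3.client("ce", region_name="us-east-1")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-08-24", "End": "2022-09-07"},  # illustrative window
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

# Sum the per-service groups into a daily total.
for day in response["ResultsByTime"]:
    total = sum(
        float(group["Metrics"]["UnblendedCost"]["Amount"])
        for group in day["Groups"]
    )
    print(day["TimePeriod"]["Start"], round(total, 2), "USD")
```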

What went wrong

Things that could have gone better. Ideally these should result in concrete
action items that have GitHub issues created for them and linked to under
Action items.

Follow-up actions

Process improvements

Documentation improvements

Technical improvements

@damianavila (Contributor)

I think all the checkboxes detailed in the top comment were completed, so closing this one now.

@yuvipanda (Member)

I think @choldgraf suggested to them that we check back on usage in a week and see how it went, so this isn't done yet.

@damianavila (Contributor)

Thanks for the clarification, @yuvipanda!

@choldgraf (Member Author)

I believe that we can close this one; we are following up on the response to AWS separately.
