Highlight metrics exist for backup success #82

Closed
wants to merge 1 commit into from

Conversation

@rcrowe rcrowe commented Oct 9, 2023

No description provided.

@rcrowe rcrowe requested a review from a team as a code owner October 9, 2023 14:53
```
sum by (kubernetes_namespace) (increase(jobs_backup_resume_completed{kubernetes_namespace=~"auth|auth-customer"}[14h])) == 0
```

While the backup job may have successfully run, that is not a foolproof way of confirming you'll be able to restore it 😅
Contributor

Out of curiosity: what are some of the stories here? Is it possible to make a backup that's not actually usable?

Member Author

@matthewhughes-uw For the most part I'm erring on the side of caution: there are those stories of people backing up and then coming to restore, only to find it doesn't work.

If you want full faith in the system, you need to automate and check your restore process. Perhaps the message is alarmist here?

Contributor

> Perhaps the message is alarmist here?

Not alarmist, I just found it raised more questions. I think your point might be valuable to document here.

Suggested change
While the backup job may have successfully run, that is not a foolproof way of confirming you'll be able to restore it 😅
While the backup job may have successfully run, that is not a foolproof way of confirming you'll be able to restore it.
If you want full faith in the system, you need to automate and check your restore process.

Contributor

@matthewhughes-uw matthewhughes-uw left a comment

Question: where does this metric come from? I see on https://www.cockroachlabs.com/docs/stable/backup-and-restore-monitoring (maybe worth linking) there is a metric jobs.restore.resume_completed

EDIT: I looked at an older version of those docs https://www.cockroachlabs.com/docs/v22.2/backup-and-restore-monitoring and see it, so I guess those metrics were renamed?
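
A possible companion to the backup query in this PR, tying in the point about checking the restore process: the sketch below flags namespaces where no restore job has completed recently. It rests on assumptions not confirmed in this thread: that the restore metric is exported to Prometheus as `jobs_restore_resume_completed` (underscore form, by analogy with `jobs_backup_resume_completed`), that a scheduled restore test exists, and that a `7d` window suits its schedule.

```
# Assumption: metric name and 7d window are illustrative, not taken from this PR.
sum by (kubernetes_namespace) (increase(jobs_restore_resume_completed{kubernetes_namespace=~"auth|auth-customer"}[7d])) == 0
```

As with the backup query, this returns nothing at all (rather than 0) if the metric is not being scraped, so it would not catch a missing metric on its own.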

@MarcinGinszt
Contributor

I moved the PR here

3 participants