
Avoid reprocessing data in backfill-index #2280

Open
JasonPowr wants to merge 1 commit into main from set-up-index-tracking-for-backfill-index

Conversation

JasonPowr

Issue: #2279

Summary

Motivation

This PR addresses the issue of reprocessing already-indexed data during repeated runs of the backfill-index job by adding a --enable-redis-index-resume flag to backfill-index. When enabled, backfill-index saves the last processed index in Redis and resumes from it on subsequent runs, processing only new entries.
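
For context, a minimal sketch of how such a resume checkpoint could be kept in Redis with github.com/redis/go-redis/v9. This is not the actual patch; the package name, key name, and function names are assumptions for illustration.

```go
// Package resume sketches a possible checkpoint helper; names are illustrative.
package resume

import (
	"context"
	"errors"
	"strconv"

	"github.com/redis/go-redis/v9"
)

// resumeKey is a hypothetical Redis key for the last processed index.
const resumeKey = "backfill-index:last-processed"

// LoadResumeIndex returns the last processed index, or -1 if no
// checkpoint has been written yet.
func LoadResumeIndex(ctx context.Context, rdb *redis.Client) (int64, error) {
	val, err := rdb.Get(ctx, resumeKey).Result()
	if errors.Is(err, redis.Nil) {
		return -1, nil // first run: nothing to resume from
	}
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(val, 10, 64)
}

// SaveResumeIndex records the last processed index so the next run can
// start from the entry after it.
func SaveResumeIndex(ctx context.Context, rdb *redis.Client, index int64) error {
	return rdb.Set(ctx, resumeKey, strconv.FormatInt(index, 10), 0).Err()
}
```

On a repeat run, the job would read this key, skip everything up to and including it, and write the new value once it finishes.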

Release Note

Added a --enable-redis-index-resume flag to backfill-index, enabling jobs to resume from the last processed index and process only new entries for improved efficiency.

Documentation

@JasonPowr requested a review from a team as a code owner on November 20, 2024 15:32
@JasonPowr force-pushed the set-up-index-tracking-for-backfill-index branch from ca1c83e to 6e46469 on November 20, 2024 15:35
@JasonPowr changed the title from set-up-index-tracking-for-backfill-index to Avoid reprocessing data in backfill-index on Nov 20, 2024
Contributor

@haydentherapper left a comment

Can you use --start?

@JasonPowr
Author

Can you use --start?

Hi @haydentherapper, normally yes; however, we tend to run backfill-index on a schedule (e.g. as a cronjob), and keeping '--start' up to date in that context is awkward: we either have to update it manually or store the value for '--start' somewhere it could be overwritten or modified.

@JasonPowr
Author

@haydentherapper When you get a chance, can I get a review on this? :) Thanks

@haydentherapper
Contributor

What is the use case for running this periodically? It’s expected to be run exceptionally, when there is an outage that’s caused a loss of index entries.

How would you handle --end? If this is a cronjob, are you always processing up to the last entry?


codecov bot commented Dec 4, 2024

Codecov Report

Attention: Patch coverage is 0% with 44 lines in your changes missing coverage. Please review.

Project coverage is 49.80%. Comparing base (488eb97) to head (55041c2).
Report is 250 commits behind head on main.

Files with missing lines | Patch % | Lines
cmd/backfill-index/main.go | 0.00% | 44 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2280       +/-   ##
===========================================
- Coverage   66.46%   49.80%   -16.66%     
===========================================
  Files          92      192      +100     
  Lines        9258    24725    +15467     
===========================================
+ Hits         6153    12314     +6161     
- Misses       2359    11320     +8961     
- Partials      746     1091      +345     
Flag | Coverage | Δ
e2etests | 46.49% <ø> | (-1.06%) ⬇️
unittests | 41.83% <0.00%> | (-5.85%) ⬇️

Flags with carried forward coverage won't be shown.


@JasonPowr
Author

JasonPowr commented Dec 4, 2024

What is the use case for running this periodically?

It was recommended to us to run the cronjob periodically when we were first getting started with Sigstore, so that if there are errors writing to the index for any reason, we can make sure it always stays up to date.

How would you handle --end? If this is a cronjob, are you always processing up to the last entry?

Essentially, we are curling the Rekor API and using the treeSize as the --end value.

Hope that helps to clear things up :)
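
For reference, a minimal sketch of deriving --end from the Rekor log size as described above. The instance URL is illustrative, and error handling is kept short.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// latestTreeSize queries the Rekor log info endpoint and returns the
// current tree size.
func latestTreeSize(rekorURL string) (int64, error) {
	resp, err := http.Get(rekorURL + "/api/v1/log")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var info struct {
		TreeSize int64 `json:"treeSize"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		return 0, err
	}
	return info.TreeSize, nil
}

func main() {
	size, err := latestTreeSize("https://rekor.sigstore.dev")
	if err != nil {
		panic(err)
	}
	// Log indexes are zero-based, so the newest entry is treeSize - 1.
	fmt.Println("tree size:", size, "last index:", size-1)
}
```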

@JasonPowr force-pushed the set-up-index-tracking-for-backfill-index branch from 6e46469 to 668d421 on December 4, 2024 13:35
@haydentherapper
Contributor

haydentherapper commented Dec 6, 2024

It was recommended to us to run the cronjob periodically

This shouldn't be needed. Failures would be exceptional, e.g. if Redis goes down, and they can be monitored for in the logs as well.

Technically, I don't have an issue with this PR, I just don't think it's necessary. I'm happy to merge this if you still need to be reprocessing data.

@JasonPowr force-pushed the set-up-index-tracking-for-backfill-index branch from 668d421 to 55041c2 on December 9, 2024 09:57
@JasonPowr
Author

Apologies @haydentherapper, the linter was giving me trouble :)

Technically, I don't have an issue with this PR, I just don't think it's necessary. I'm happy to merge this if you still need to be reprocessing data.

Thank you, that would be great :) And thanks for your input as well :)

@cmurphy
Contributor

cmurphy commented Dec 10, 2024

This is setting the last filled index to the --end index, which is already known to the script caller since it must be provided. Why not just have the cron job set this value itself when the script completes, and retrieve it to provide to --start when calling the script again?

@JasonPowr
Author

Why not just have the cron job set this value itself when the script completes, and retrieve it to provide to --start when calling the script again?

@cmurphy Something like that is definitely possible; however, I feel that storing it this way is much more reliable. It also makes it more difficult to modify if someone wanted to, given that it can be locked behind auth, as opposed to being written to some env var or file somewhere.

@cmurphy
Contributor

cmurphy commented Dec 10, 2024

It could just write to the Redis instance itself? With redis-cli and the auth credentials you're already passing to the script.
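
For comparison, a rough sketch of the alternative described above, written in Go for illustration (a shell wrapper around redis-cli would work the same way). The checkpoint key name, the Redis address, and the placeholder end value are assumptions; only the --start/--end flags mentioned in this thread are shown, and the other flags backfill-index needs are omitted.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/exec"
	"strconv"

	"github.com/redis/go-redis/v9"
)

const checkpointKey = "backfill-index:checkpoint" // hypothetical key name

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Resume from the entry after the last recorded --end (0 on the first run).
	start := int64(0)
	val, err := rdb.Get(ctx, checkpointKey).Result()
	switch {
	case err == nil:
		last, _ := strconv.ParseInt(val, 10, 64)
		start = last + 1
	case !errors.Is(err, redis.Nil):
		fmt.Fprintln(os.Stderr, "reading checkpoint:", err)
		os.Exit(1)
	}

	end := int64(12345) // in practice, derived from the Rekor tree size

	// Other flags backfill-index requires (index backend, DB credentials, ...)
	// are omitted here.
	cmd := exec.Command("backfill-index",
		"--start", strconv.FormatInt(start, 10),
		"--end", strconv.FormatInt(end, 10))
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "backfill-index failed:", err)
		os.Exit(1) // keep the old checkpoint so the range is retried
	}

	// Advance the checkpoint only after a successful run.
	if err := rdb.Set(ctx, checkpointKey, strconv.FormatInt(end, 10), 0).Err(); err != nil {
		fmt.Fprintln(os.Stderr, "writing checkpoint:", err)
		os.Exit(1)
	}
}
```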
