Avoid reprocessing data in backfill-index #2280
Conversation
Force-pushed from ca1c83e to 6e46469
Can you use --start?
Hi @haydentherapper, normally yes. However, we tend to run backfill-index on a schedule, e.g. as a cronjob, and keeping --start up to date in that context tends to be awkward: we either have to update it manually or store the value for --start somewhere it could be overwritten or modified.
@haydentherapper When you get a chance, could I get a review on this? :) Thanks
What is the use case for running this periodically? It's expected to be run exceptionally, when an outage has caused a loss of index entries. How would you handle --end? If this is a cronjob, are you always processing up to the last entry?
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2280       +/-   ##
===========================================
- Coverage   66.46%   49.80%   -16.66%
===========================================
  Files          92      192      +100
  Lines        9258    24725    +15467
===========================================
+ Hits         6153    12314     +6161
- Misses       2359    11320     +8961
- Partials      746     1091      +345

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
It was recommended to us to run the cronjob periodically when we were first getting started with Sigstore, so that if there are errors writing to the index for any reason, we can make sure it always manages to stay up to date.
Hope that helps to clear things up :)
Force-pushed from 6e46469 to 668d421
This shouldn't be needed. Failures would be exceptional, e.g. if Redis goes down, and they can be monitored for in the logs as well. Technically, I don't have an issue with this PR, I just don't think it's necessary. I'm happy to merge this if you still need to be reprocessing data.
Signed-off-by: JasonPowr <[email protected]>
Force-pushed from 668d421 to 55041c2
Apologies @haydentherapper, the linter was giving me trouble :)
Thank you, that would be great :) And thanks for your input as well :)
This is setting the last filled index to the |
@cmurphy Something like that is definitely possible; however, I feel that storing it this way is much more reliable. It also makes it more difficult to modify if someone wanted to, given that it can be locked behind auth, as opposed to being written to some env var or file somewhere.
It could just write to the Redis instance itself? With redis-cli and the auth credentials you're already passing to the script.
Issue: #2279
Summary
Motivation
This PR helps to solve the issue of reprocessing already-indexed data during repeated runs of the backfill-index job by adding an --enable-redis-index-resume flag to backfill-index. When enabled, backfill-index saves the last processed index in Redis and resumes from it on subsequent runs, processing only new entries. A minimal sketch of how such a resume checkpoint could work is shown below.
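For illustration only, here is a minimal sketch of a Redis-backed resume checkpoint under some assumptions: it is not the code in this PR, and the key name `backfill-index/last-processed`, the function names, and the use of the go-redis client are all hypothetical choices made for the example. The idea is simply to read the checkpoint before processing and write it back after each entry, so a later run skips work that already succeeded.

```go
package main

import (
	"context"
	"fmt"
	"strconv"

	"github.com/redis/go-redis/v9"
)

// checkpointKey is an assumed key name for this sketch.
const checkpointKey = "backfill-index/last-processed"

// loadCheckpoint returns the saved index, or 0 if none has been stored yet.
func loadCheckpoint(ctx context.Context, rdb *redis.Client) (int64, error) {
	val, err := rdb.Get(ctx, checkpointKey).Result()
	if err == redis.Nil {
		return 0, nil // first run: fall back to the beginning (or --start)
	}
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(val, 10, 64)
}

// saveCheckpoint records the last successfully processed log index.
func saveCheckpoint(ctx context.Context, rdb *redis.Client, index int64) error {
	return rdb.Set(ctx, checkpointKey, strconv.FormatInt(index, 10), 0).Err()
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	start, err := loadCheckpoint(ctx, rdb)
	if err != nil {
		panic(err)
	}
	end := int64(42) // in a real job this would come from --end or the current log size

	for i := start + 1; i <= end; i++ {
		// ... index entry i into the search index here ...
		if err := saveCheckpoint(ctx, rdb, i); err != nil {
			panic(err)
		}
	}
	fmt.Printf("processed entries %d through %d\n", start+1, end)
}
```

One upside of this layout, as discussed above, is that the checkpoint lives behind the same Redis auth the job already uses, rather than in an env var or file that a cronjob wrapper would have to maintain.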
Release Note
Added an --enable-redis-index-resume flag to backfill-index, enabling jobs to resume from the last processed index and process only new entries, for improved efficiency.
Documentation