
Avoid reprocessing data in backfill-index #2280

Open
JasonPowr wants to merge 1 commit into main from set-up-index-tracking-for-backfill-index

Conversation

JasonPowr

Issue: #2279

Summary

Motivation

This PR addresses the issue of reprocessing already-indexed data during repeated runs of the backfill-index job by adding a --enable-redis-index-resume flag to backfill-index. When enabled, backfill-index saves the last processed index in Redis and resumes from it on subsequent runs, processing only new entries.
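
For context, a minimal sketch of how such a resume checkpoint could be kept in Redis with github.com/redis/go-redis/v9. This is not the actual patch; the package name, key name, and function names are assumptions for illustration.

```go
// Package resume sketches a possible checkpoint helper; names are illustrative.
package resume

import (
	"context"
	"errors"
	"strconv"

	"github.com/redis/go-redis/v9"
)

// resumeKey is a hypothetical Redis key for the last processed index.
const resumeKey = "backfill-index:last-processed"

// LoadResumeIndex returns the last processed index, or -1 if no
// checkpoint has been written yet.
func LoadResumeIndex(ctx context.Context, rdb *redis.Client) (int64, error) {
	val, err := rdb.Get(ctx, resumeKey).Result()
	if errors.Is(err, redis.Nil) {
		return -1, nil // first run: nothing to resume from
	}
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(val, 10, 64)
}

// SaveResumeIndex records the last processed index so the next run can
// start from the entry after it.
func SaveResumeIndex(ctx context.Context, rdb *redis.Client, index int64) error {
	return rdb.Set(ctx, resumeKey, strconv.FormatInt(index, 10), 0).Err()
}
```

On a repeat run, the job would read this key, skip everything up to and including it, and write the new value once it finishes.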

Release Note

Added a --enable-redis-index-resume flag to backfill-index, enabling jobs to resume from the last processed index and process only new entries for improved efficiency.

Documentation

@JasonPowr requested a review from a team as a code owner on November 20, 2024 15:32
@JasonPowr force-pushed the set-up-index-tracking-for-backfill-index branch from ca1c83e to 6e46469 on November 20, 2024 15:35
@JasonPowr changed the title from set-up-index-tracking-for-backfill-index to Avoid reprocessing data in backfill-index on Nov 20, 2024
Contributor

@haydentherapper left a comment

Can you use --start?

@JasonPowr
Author

Can you use --start?

Hi @haydentherapper, normally yes; however, we tend to run backfill-index on a schedule (e.g. as a cronjob), and keeping '--start' up to date in that context is awkward: we either have to update it manually or store the value for '--start' somewhere it could be overwritten or modified.

@JasonPowr
Author

@haydentherapper When you get a chance, can I get a review on this? :) Thanks

@haydentherapper
Contributor

What is the use case for running this periodically? It’s expected to be run exceptionally, when there is an outage that’s caused a loss of index entries.

How would you handle --end? If this is a cronjob, are you always processing up to the last entry?


codecov bot commented Dec 4, 2024

Codecov Report

Attention: Patch coverage is 0% with 44 lines in your changes missing coverage. Please review.

Project coverage is 49.80%. Comparing base (488eb97) to head (55041c2).
Report is 250 commits behind head on main.

Files with missing lines | Patch % | Lines
cmd/backfill-index/main.go | 0.00% | 44 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2280       +/-   ##
===========================================
- Coverage   66.46%   49.80%   -16.66%     
===========================================
  Files          92      192      +100     
  Lines        9258    24725    +15467     
===========================================
+ Hits         6153    12314     +6161     
- Misses       2359    11320     +8961     
- Partials      746     1091      +345     
Flag | Coverage | Δ
e2etests | 46.49% <ø> | (-1.06%) ⬇️
unittests | 41.83% <0.00%> | (-5.85%) ⬇️

Flags with carried forward coverage won't be shown.


@JasonPowr
Author

JasonPowr commented Dec 4, 2024

What is the use case for running this periodically?

It was recommended to us to run the cronjob periodically when we were first getting started with Sigstore, so that if there are errors writing to the index for any reason, we can make sure it always stays up to date.

How would you handle --end? If this is a cronjob, are you always processing up to the last entry?

Essentially, we are curling the Rekor API and using the treeSize as the --end value.

Hope that helps to clear things up :)
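
For reference, a minimal sketch of deriving --end from the Rekor log size as described above. The instance URL is illustrative, and error handling is kept short.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// latestTreeSize queries the Rekor log info endpoint and returns the
// current tree size.
func latestTreeSize(rekorURL string) (int64, error) {
	resp, err := http.Get(rekorURL + "/api/v1/log")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var info struct {
		TreeSize int64 `json:"treeSize"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		return 0, err
	}
	return info.TreeSize, nil
}

func main() {
	size, err := latestTreeSize("https://rekor.sigstore.dev")
	if err != nil {
		panic(err)
	}
	// Log indexes are zero-based, so the newest entry is treeSize - 1.
	fmt.Println("tree size:", size, "last index:", size-1)
}
```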

@JasonPowr force-pushed the set-up-index-tracking-for-backfill-index branch from 6e46469 to 668d421 on December 4, 2024 13:35
@haydentherapper
Contributor

haydentherapper commented Dec 6, 2024

It was recommended to us to run the cronjob periodically

This shouldn't be needed. Failures would be exceptional, e.g. if Redis goes down, and they can be monitored for in the logs as well.

Technically, I don't have an issue with this PR, I just don't think it's necessary. I'm happy to merge this if you still need to be reprocessing data.

@JasonPowr force-pushed the set-up-index-tracking-for-backfill-index branch from 668d421 to 55041c2 on December 9, 2024 09:57
@JasonPowr
Author

Apologies @haydentherapper, the linter was giving me trouble :)

Technically, I don't have an issue with this PR, I just don't think it's necessary. I'm happy to merge this if you still need to be reprocessing data.

Thank you, that would be great :) And thanks for your input as well :)

@cmurphy
Contributor

cmurphy commented Dec 10, 2024

This is setting the last filled index to the --end index, which is already known to the script caller since it must be provided. Why not just have the cron job set this value itself when the script completes, and retrieve it to provide to --start when calling the script again?

@JasonPowr
Author

Why not just have the cron job set this value itself when the script completes, and retrieve it to provide to --start when calling the script again?

@cmurphy Something like that is definitely possible; however, I feel that storing it this way is much more reliable. It also makes it more difficult to modify if someone wanted to, given that it can be locked behind auth, as opposed to being written to some env var or file somewhere.

@cmurphy
Contributor

cmurphy commented Dec 10, 2024

It could just write to the Redis instance itself? With redis-cli and the auth credentials you're already passing to the script.
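
For comparison, a rough sketch of the alternative described above, written in Go for illustration (a shell wrapper around redis-cli would work the same way). The checkpoint key name, the Redis address, and the placeholder end value are assumptions; only the --start/--end flags mentioned in this thread are shown, and the other flags backfill-index needs are omitted.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/exec"
	"strconv"

	"github.com/redis/go-redis/v9"
)

const checkpointKey = "backfill-index:checkpoint" // hypothetical key name

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Resume from the entry after the last recorded --end (0 on the first run).
	start := int64(0)
	val, err := rdb.Get(ctx, checkpointKey).Result()
	switch {
	case err == nil:
		last, _ := strconv.ParseInt(val, 10, 64)
		start = last + 1
	case !errors.Is(err, redis.Nil):
		fmt.Fprintln(os.Stderr, "reading checkpoint:", err)
		os.Exit(1)
	}

	end := int64(12345) // in practice, derived from the Rekor tree size

	// Other flags backfill-index requires (index backend, DB credentials, ...)
	// are omitted here.
	cmd := exec.Command("backfill-index",
		"--start", strconv.FormatInt(start, 10),
		"--end", strconv.FormatInt(end, 10))
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "backfill-index failed:", err)
		os.Exit(1) // keep the old checkpoint so the range is retried
	}

	// Advance the checkpoint only after a successful run.
	if err := rdb.Set(ctx, checkpointKey, strconv.FormatInt(end, 10), 0).Err(); err != nil {
		fmt.Fprintln(os.Stderr, "writing checkpoint:", err)
		os.Exit(1)
	}
}
```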
