
Clean up extra/temp files before file-based recovery #104473

Open
DaveCTurner opened this issue Jan 17, 2024 · 6 comments · May be fixed by #115142
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement Team:Distributed Meta label for distributed team

Comments

@DaveCTurner
Contributor

If a node crashes part-way through a recovery, it may leave some temp files on disk. A subsequent recovery attempt will eventually clean up any unnecessary files, but that cleanup happens towards the end of the recovery process, after making a new copy of the shard on disk, so there may not be enough space for all this data because of the space wasted on old temp files.

Today one of the preparatory steps on a recovery target is to replay any safe operations from the local translog and then flush to make a new safe commit, which happens before we even start to copy any data from the recovery source. After successfully making that new safe commit I think it'd be reasonable to do another cleanup step to discard anything that isn't referenced from that commit, which would include any temporary files from a previous recovery. I suspect we might also be able to clean some things up before replaying the local translog.
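The ordering proposed above can be sketched as follows. All names here are illustrative stand-ins, not Elasticsearch's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed recovery-target preparation order.
// None of these identifiers correspond to Elasticsearch's real classes or methods.
class RecoveryTargetPrep {
    final List<String> steps = new ArrayList<>();

    void prepare() {
        steps.add("replay-local-translog");     // replay safe ops from the local translog
        steps.add("flush-new-safe-commit");     // flush to make a new safe commit
        // Proposed additional step: discard anything not referenced by the
        // new safe commit, including temp files from a previous recovery.
        steps.add("discard-unreferenced-files");
        steps.add("copy-from-recovery-source"); // only now copy data from the source
    }
}
```

The key point is that the cleanup step runs after the new safe commit exists (so the commit's referenced files are known) but before any data is copied from the recovery source (so the freed space is available for the transfer).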

@DaveCTurner DaveCTurner added >enhancement :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Jan 17, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Jan 17, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@bcully
Contributor

bcully commented Oct 15, 2024

An alternative might be to try to reuse any files transferred in a previous recovery attempt, and only clean up those that we wouldn't retransmit?

@DaveCTurner
Contributor Author

That's true, but I'd rather we didn't introduce any extra complexity in this area unless we're sure it's needed.

@bcully
Contributor

bcully commented Oct 18, 2024

Maybe the simplest, safest thing to do is to delete only known temp files (anything starting with RECOVERY_PREFIX, i.e. recovery.). This can be done at any time before transfer and doesn't have any dependencies on the state of a local commit?
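A minimal sketch of that by-name cleanup, assuming a simple directory walk and using the recovery. prefix mentioned above (the real RECOVERY_PREFIX constant lives in Elasticsearch's recovery code; the class and method names here are illustrative):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: delete files whose names start with the recovery
// temp-file prefix, leaving everything else in the index directory alone.
class TempFileCleaner {
    static final String RECOVERY_PREFIX = "recovery.";

    // Returns the number of files deleted.
    static int deleteRecoveryTempFiles(Path indexDir) throws IOException {
        int deleted = 0;
        // Glob matches names beginning with the literal prefix "recovery."
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(indexDir, RECOVERY_PREFIX + "*")) {
            for (Path p : stream) {
                Files.deleteIfExists(p);
                deleted++;
            }
        }
        return deleted;
    }
}
```

Because the prefix is only ever used for in-flight recovery transfers, deleting by name is safe regardless of what state the local commit is in.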

@DaveCTurner
Contributor Author

Yes, I think that's safe and covers most of the problem.

I'd be interested to know if there's a simple way to ask Lucene to delete any cruft it might have left over from any commit or merge that was running at the point in time that the shard failed. It's possible that it cleans this all up as part of calling IndexWriter#commit() anyway. I took a quick look and saw evidence that it does some cleanup at this point, but didn't dig deeply enough to determine whether that cleanup catches everything.

@bcully
Contributor

bcully commented Oct 18, 2024

I suppose we could attempt to call cleanupAndVerify if we can produce a snapshot locally, which I think should take care of failed merges etc? If that fails (e.g., because there's no segments file or something) we can fall back to deleting temp files by name.

bcully added a commit to bcully/elasticsearch that referenced this issue Oct 18, 2024
If a node crashes during recovery, it may leave temporary files behind
that can consume disk space, which may be needed to complete recovery.
So we attempt to clean up the index before transferring files from
a recovery source. We first attempt to call the store's
`cleanupAndVerify` method, which removes anything not referenced by
the latest local commit. If that fails, because the local index isn't
in a state to produce a local commit, then we fall back to removing
temporary files by name.

Closes elastic#104473
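The try-then-fall-back logic this commit message describes could be sketched as below. The Store interface here is a hypothetical stand-in for Elasticsearch's store abstraction, not its real signatures:

```java
import java.io.IOException;

// Hypothetical sketch of the PR's two-step cleanup: prefer a full cleanup
// against the latest local commit, and fall back to deleting known temp
// files by name if the index can't produce a usable commit.
class PreRecoveryCleanup {
    interface Store {
        // Stand-in for Store#cleanupAndVerify: remove anything not referenced
        // by the latest local commit; throws if there is no usable commit.
        void cleanupAndVerify() throws IOException;

        // Stand-in fallback: delete files matching the recovery temp prefix.
        void deleteTempFilesByName();
    }

    // Returns true if the full cleanup ran, false if we fell back.
    static boolean cleanBeforeTransfer(Store store) {
        try {
            store.cleanupAndVerify();
            return true;
        } catch (IOException e) {
            // e.g. there's no segments file: fall back to name-based deletion
            store.deleteTempFilesByName();
            return false;
        }
    }
}
```

Either branch frees the space wasted by a previous failed attempt before any files are transferred from the recovery source.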
@bcully bcully linked a pull request Oct 18, 2024 that will close this issue