-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up extra/temp files before file-based recovery #104473
Comments
Pinging @elastic/es-distributed (Team:Distributed) |
An alternative might be to try to reuse any files transferred in a previous recovery attempt, and only clean those that we wouldn't retransmit? |
That's true, but I'd rather we didn't introduce any extra complexity in this area unless we're sure it's needed. |
Maybe the simplest, safest thing to do is to only delete known temp files (anything starting with |
Yes I think that's safe and covers most of the problem. I'd be interested to know if there's a simple way to ask Lucene to delete any cruft it might have left over from any commit or merge that was running at the point in time that the shard failed. It's possible that it cleans this all up as part of calling |
I suppose we could attempt to call |
If a node crashes during recovery, it may leave temporary files behind that can consume disk space, which may be needed to complete recovery. So we attempt to clean up the index before transferring files from a recovery source. We first attempt to call the store's `cleanupAndVerify` method, which removes anything not referenced by the latest local commit. If that fails, because the local index isn't in a state to produce a local commit, then we fall back to removing temporary files by name. Closes elastic#104473
If a node crashes part-way through a recovery then it may leave some temp files on disk. A subsequent recovery attempt will eventually clean up any unnecessary files but that cleanup happens towards the end of the recovery process, after making a new copy of the shard on disk, and it's possible that there may not be space for all this data because of the space wasted on old temp files.
Today one of the preparatory steps on a recovery target is to replay any safe operations from the local translog and then flush to make a new safe commit, which happens before we even start to copy any data from the recovery source. After successfully making that new safe commit I think it'd be reasonable to do another cleanup step to discard anything that isn't referenced from that commit, which would include any temporary files from a previous recovery. I suspect we might also be able to clean some things up before replaying the local translog.
The text was updated successfully, but these errors were encountered: