
Clean up extra/temp files before file-based recovery #104473

Open
DaveCTurner opened this issue Jan 17, 2024 · 6 comments · May be fixed by #115142
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement Team:Distributed Meta label for distributed team

Comments

@DaveCTurner
Contributor

If a node crashes part-way through a recovery, it may leave some temp files on disk. A subsequent recovery attempt will eventually clean up any unnecessary files, but that cleanup happens towards the end of the recovery process, after making a new copy of the shard on disk, so there may not be enough space for all this data because of the space wasted on old temp files.

Today one of the preparatory steps on a recovery target is to replay any safe operations from the local translog and then flush to make a new safe commit, which happens before we even start to copy any data from the recovery source. After successfully making that new safe commit I think it'd be reasonable to do another cleanup step to discard anything that isn't referenced from that commit, which would include any temporary files from a previous recovery. I suspect we might also be able to clean some things up before replaying the local translog.
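The ordering proposed above can be sketched as follows. All names here are illustrative stand-ins, not Elasticsearch's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed recovery-target preparation order.
// None of these identifiers correspond to Elasticsearch's real classes or methods.
class RecoveryTargetPrep {
    final List<String> steps = new ArrayList<>();

    void prepare() {
        steps.add("replay-local-translog");     // replay safe ops from the local translog
        steps.add("flush-new-safe-commit");     // flush to make a new safe commit
        // Proposed additional step: discard anything not referenced by the
        // new safe commit, including temp files from a previous recovery.
        steps.add("discard-unreferenced-files");
        steps.add("copy-from-recovery-source"); // only now copy data from the source
    }
}
```

The key point is that the cleanup step runs after the new safe commit exists (so the commit's referenced files are known) but before any data is copied from the recovery source (so the freed space is available for the transfer).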

@DaveCTurner DaveCTurner added >enhancement :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Jan 17, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Jan 17, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@bcully
Contributor

bcully commented Oct 15, 2024

An alternative might be to try to reuse any files transferred in a previous recovery attempt, and only clean up those that we wouldn't retransmit?

@DaveCTurner
Contributor Author

That's true, but I'd rather we didn't introduce any extra complexity in this area unless we're sure it's needed.

@bcully
Contributor

bcully commented Oct 18, 2024

Maybe the simplest, safest thing to do is to delete only known temp files (anything starting with RECOVERY_PREFIX, i.e. recovery.). This can be done at any time before transfer and doesn't have any dependencies on the state of a local commit?
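A minimal sketch of that by-name cleanup, assuming a simple directory walk and using the recovery. prefix mentioned above (the real RECOVERY_PREFIX constant lives in Elasticsearch's recovery code; the class and method names here are illustrative):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: delete files whose names start with the recovery
// temp-file prefix, leaving everything else in the index directory alone.
class TempFileCleaner {
    static final String RECOVERY_PREFIX = "recovery.";

    // Returns the number of files deleted.
    static int deleteRecoveryTempFiles(Path indexDir) throws IOException {
        int deleted = 0;
        // Glob matches names beginning with the literal prefix "recovery."
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(indexDir, RECOVERY_PREFIX + "*")) {
            for (Path p : stream) {
                Files.deleteIfExists(p);
                deleted++;
            }
        }
        return deleted;
    }
}
```

Because the prefix is only ever used for in-flight recovery transfers, deleting by name is safe regardless of what state the local commit is in.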

@DaveCTurner
Contributor Author

Yes, I think that's safe and covers most of the problem.

I'd be interested to know if there's a simple way to ask Lucene to delete any cruft it might have left over from any commit or merge that was running at the point in time that the shard failed. It's possible that it cleans this all up as part of calling IndexWriter#commit() anyway. I took a quick look and saw evidence that it does some cleanup at this point, but didn't dig deeply enough to determine whether that cleanup catches everything.

@bcully
Contributor

bcully commented Oct 18, 2024

I suppose we could attempt to call cleanupAndVerify if we can produce a snapshot locally, which I think should take care of failed merges etc? If that fails (e.g., because there's no segments file or something) we can fall back to deleting temp files by name.

bcully added a commit to bcully/elasticsearch that referenced this issue Oct 18, 2024
If a node crashes during recovery, it may leave temporary files behind
that can consume disk space, which may be needed to complete recovery.
So we attempt to clean up the index before transferring files from
a recovery source. We first attempt to call the store's
`cleanupAndVerify` method, which removes anything not referenced by
the latest local commit. If that fails, because the local index isn't
in a state to produce a local commit, then we fall back to removing
temporary files by name.

Closes elastic#104473
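The try-then-fall-back logic this commit message describes could be sketched as below. The Store interface here is a hypothetical stand-in for Elasticsearch's store abstraction, not its real signatures:

```java
import java.io.IOException;

// Hypothetical sketch of the PR's two-step cleanup: prefer a full cleanup
// against the latest local commit, and fall back to deleting known temp
// files by name if the index can't produce a usable commit.
class PreRecoveryCleanup {
    interface Store {
        // Stand-in for Store#cleanupAndVerify: remove anything not referenced
        // by the latest local commit; throws if there is no usable commit.
        void cleanupAndVerify() throws IOException;

        // Stand-in fallback: delete files matching the recovery temp prefix.
        void deleteTempFilesByName();
    }

    // Returns true if the full cleanup ran, false if we fell back.
    static boolean cleanBeforeTransfer(Store store) {
        try {
            store.cleanupAndVerify();
            return true;
        } catch (IOException e) {
            // e.g. there's no segments file: fall back to name-based deletion
            store.deleteTempFilesByName();
            return false;
        }
    }
}
```

Either branch frees the space wasted by a previous failed attempt before any files are transferred from the recovery source.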
@bcully bcully linked a pull request Oct 18, 2024 that will close this issue