-
Notifications
You must be signed in to change notification settings - Fork 998
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow shutdown during network snapshot transfer #588
base: main
Are you sure you want to change the base?
Allow shutdown during network snapshot transfer #588
Conversation
9e01890
to
1017c47
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @mauri870 !
One possible concern I see here is that I'm not sure that we document that SnapshotSink
implementations have to be thread safe... Until now they've never been used concurrently IIRC (but if you know of a place we do then please correct me).
So it's subtle but it could cause issues if the outer goroutine calls sink.Cancel
on a shutdown concurrently with one of the sink.Cancel
or sink.Close
calls. (Or maybe even just the implicit calls to Write
that io.Copy
will be making.
It would be possible to reduce the number of Sink methods that need to be concurrency safe to just the Write
call (i.e. keep all the error checking and calls to Cancel
and Close
in the main routine here. But it's still imposing a new assumption on Sink's that it's safe to call Cancel or Close concurrently with Write 🤔.
We could check the current implementations in this library and our other tools to see if that would be an issue for our stuff, but even if not, I still wonder if we might break other users of this library in subtle ways?
To be clear it's an edge case - would only happen during raft shutdown. But in some programs you might shutdown raft without intending to exit the program so a race bug that causes invalid state or a panic is still not great.
Not decided if this is a deal breaker for this approach or not but curious to hear your thoughts?
Perhaps we could have a different optional sink interface and only take this approach if the sink "opts in" to being safe for concurrent writes and closes? Then all our build in sinks could implement that (with any changes needed to make that true). And most users would get the fix "for free" while those who implemented custom sinks wouldn't have a subtle new race bug silently creep in on an upgrade?
Thanks for the feedback @banks. I ended up moving the Close and Cancel out of the goroutine, so they can't be called concurrently. |
@mauri870 Thanks! I happened to be looking at I think we'll need a more comprehensive plan for how we can avoid changing the (implicit) interface assumptions that the snapshot sink must allow concurrent writes and close/cancel calls before we can feel OK about merging this. Sorry that's a pain for what seemed like a simple fix but I'm pretty sure we'd introduce new race bugs and possibly panics into our software and other users of this library with the PR as it is. |
With the current implementation a network transfer of the snapshot can block for a long time depending on the size of the snapshot. During this time, if a shutdown is initiated it will block until the network transfer is done.
With this commit it will spawn the network transfer in a separate goroutine, and in a select we can check for either shutdown signal or error/success of the snapshot transfer.
Fixes #585