-
Hello, I was looking into edge cases in how I store my IPersistentState and I ran into odd implementation behavior. I am concerned about ... but the code does not consider that the append or commit operation may fail! In a real-world scenario it is quite possible that the underlying storage infrastructure is not able to handle the write operation just at that moment. Normally this should not be a problem: if an exception is thrown, or false is returned, or a count of 0 appended logs is reported, then the response to the AppendEntries RPC should be false. But the code only does this (RaftCluster.cs, L431ff):

```csharp
await auditTrail.AppendAndCommitAsync(entries, prevLogIndex + 1L, true, commitIndex, token).ConfigureAwait(false);
result = true;
```

which assumes that both append and commit will succeed. If an exception is thrown in this context, it propagates up through the caller logic, is interpreted as "unable to deliver heartbeat", and the follower is marked as unavailable. You could consider this behavior intended, but I think that is really not how it should work. In my test scenario my persistent state "drops" any incoming append or commit entries on followers. So what can happen is:
I am not sure what the formal flow in the Raft algorithm is when it comes to this. The paper imagines a world without flaws and does not consider the log being unavailable for operations. I wonder how this was dealt with in the existing file storage implementation. You could await access to the resource, BUT if that takes longer than the request timeout, the leader will consider his request failed because he got no answer in time, and the follower is marked as unavailable. The only way around this would be to increase the timeout to such a high amount that it compensates for the worst-case log access times... but this feels like a flaw. Why not simply respond to the AppendEntries RPC, so the leader knows the follower is still there, but say "hey, I could not append any entries, you will need to try this again later"? So my proposal is to update the code to something like this (and other places where this logic might be):

```csharp
long count = await auditTrail.AppendAndCommitAsync(entries, prevLogIndex + 1L, true, commitIndex, token).ConfigureAwait(false);
if (count > 0L)
    result = true;
```

The default fall-through there would be result = false.

Edit: The behavior would be that the leader has to a) deliver the AppendEntries RPC AND b) the RPC must be successfully processed. Instead of listening for the count, it might also make sense to wrap the call in a try/catch. This should give fine-grained control over when a log entry really is appended and when a log entry really is committed.

Let me know what you think about this :) Thanks!
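To make the idea concrete, here is a minimal sketch of the proposed handling. It combines the count check with the try/catch variant mentioned in the edit. The `IAppendLog` interface and the method names around it are hypothetical stand-ins for illustration, not the library's actual API:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the audit-trail surface used in RaftCluster.cs;
// the interface and signature below are illustrative, not the library's API.
public interface IAppendLog<TEntry>
{
    Task<long> AppendAndCommitAsync(
        IReadOnlyList<TEntry> entries, long startIndex,
        bool skipCommitted, long commitIndex, CancellationToken token);
}

public static class AppendEntriesHandler
{
    // Answer the AppendEntries RPC with false instead of letting a storage
    // failure escape as an exception ("unable to deliver heartbeat").
    public static async Task<bool> TryAppendAndCommitAsync<TEntry>(
        IAppendLog<TEntry> auditTrail,
        IReadOnlyList<TEntry> entries,
        long prevLogIndex,
        long commitIndex,
        CancellationToken token)
    {
        try
        {
            long count = await auditTrail
                .AppendAndCommitAsync(entries, prevLogIndex + 1L, true, commitIndex, token)
                .ConfigureAwait(false);
            return count > 0L;  // success only if something was actually appended
        }
        catch (IOException)
        {
            return false;       // storage busy right now: leader should retry later
        }
    }
}
```

With this shape, a temporarily unavailable log answers the RPC with false instead of timing out or surfacing an exception.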
Replies: 5 comments 2 replies
-
If the underlying storage device is inaccessible, then the entire node should be considered unavailable (remember that the follower is able to process read requests). Obviously, the leader should mark this member as unavailable. The commit process, as well as appending uncommitted log entries, requires disk I/O.
-
Returning false forces the leader to decrement NextIndex and repeat replication on the next iteration of heartbeats.
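As an illustration of the standard Raft rule this refers to (a sketch, not the library's actual code): a true reply advances NextIndex past what was sent, while a false reply steps it back so earlier entries are resent on the next heartbeat.

```csharp
using System;

// Illustrative sketch of the leader's reaction to an AppendEntries reply;
// names are hypothetical, not taken from the library.
public static class NextIndexRule
{
    public static long OnAppendEntriesReply(bool success, long nextIndex, long lastSentIndex)
        => success
            ? lastSentIndex + 1L                // follower replicated up to lastSentIndex
            : Math.Max(1L, nextIndex - 1L);     // log mismatch: back up and retry
}
```

So a false answer does not mean "try the same entries again later"; it means "your assumption about my log is wrong, send from an earlier index".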
-
I can see how that makes sense, yes, but I still think it's sub-optimal. Membership availability is reduced to "are members able to write to their Raft log" instead of "are they there". It should not matter whether they can write; saying "yeah, I got your request but did nothing" should still result in another round of replication, with the leader not counting it as a successful write toward the majority. I think the other replication logic could still work.

A follower can of course still answer read requests. Any existing log entries that are in a memory cache and do not need to be loaded from the I/O storage can be served. Only if that also fails would they return no log entries, or throw an exception indicating that any read request fails. Throwing exceptions blows up the entire HTTP middleware as well; they are not caught there.

It would be a shame to have to add your own availability tracking for cluster members on top of the information from the Raft cluster, just because their log is unavailable for a brief moment. Usually followers can do more than just maintain their log. For me they are also task workers, and the log is just their way to communicate tasks and results, to ensure that everybody is aware of the overall cluster state. But judging from your comments, a use case like this does not appear to be in scope for this library. I guess I was expecting it to allow me more control than it does. I saw all the return counts and how I had control over what is committed and what is not.
-
But it will work. The leader advances its state machine if the majority of nodes commits the entries. Generally, in Raft the follower may respond with only two answers: true and false. true marks the follower as replicated and increases NextIndex, while false indicates that the assumption about the state of the follower's log is wrong: NextIndex must be decremented and replication must be repeated on the next iteration of heartbeats. There is no third option. To be more precise, the third option is an unavailable member, because it cannot respond with true nor false. What if the majority could not advance their position due to busy …
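The majority rule mentioned here can be sketched as follows (an illustration with hypothetical names, not the library's code): the leader may advance its commit index only when a strict majority of the cluster, counting the leader itself, has replicated the entry.

```csharp
// Sketch of the commit rule: votes from followers that replicated the
// entry, plus the leader's own copy, must exceed half the cluster size.
public static class MajorityRule
{
    public static bool CanCommit(int followersReplicated, int clusterSize)
        => followersReplicated + 1 > clusterSize / 2;  // +1 for the leader
}
```

For example, in a 5-node cluster, two replicated followers plus the leader form a majority of 3, while a single replicated follower does not.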
-
In case of HTTP transport, you can use custom middleware to handle the specific type of exception and convert it to an appropriate HTTP response.
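A minimal sketch of that idea, assuming an ASP.NET Core pipeline and assuming the persistent state surfaces storage failures as IOException (both are assumptions; adapt the exception type to your own implementation). The mapping is kept in a plain function so it is easy to reuse and test:

```csharp
using System;
using System.IO;

// Hypothetical error-to-status mapping for the Raft HTTP endpoint:
// 503 tells the caller "try again later"; anything else stays a 500.
public static class RaftErrorMapping
{
    public static int StatusFor(Exception error)
        => error is IOException ? 503 : 500;
}

// Hypothetical wiring inside Program.cs, registered before the Raft
// endpoint is mapped:
//
// app.Use(async (context, next) =>
// {
//     try { await next(); }
//     catch (Exception e) { context.Response.StatusCode = RaftErrorMapping.StatusFor(e); }
// });
```

This keeps a transient storage failure from blowing up the pipeline while still signaling failure to the leader.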