- Background
- Current status
- How to reproduce this issue
- Root cause
- How is the issue resolved
- How to workaround this issue
If etcd crashes while processing a defragmentation operation, then when the instance starts again it may re-apply some entries which have already been applied, and eventually the member's data & revision may become inconsistent with the other members. Please note that there is no impact if the defragmentation operation is performed offline using etcdutl.
Note that usually there is no data loss, and clients can always get the latest correct data. The only issue is that the problematic etcd member's revision may be a little larger than that of the other members. However, if etcd re-applies some conditional transactions, it may also cause data inconsistency.
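To make the conditional-transaction risk concrete, here is a hypothetical clientv3 transaction (the key names and values are made up, not from the original report) whose outcome depends on the current state. If the crashed member re-applies such an entry during WAL replay, the comparison can evaluate differently than it did the first time, so that member's data can diverge from members that applied the entry only once.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumption: local single endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// "If /state is still 'old', set it to 'new'; otherwise record a conflict."
	// Applying this once flips /state to 'new'. If the crashed member replays
	// the same entry a second time, the Else branch runs instead, so its data
	// diverges from the members that applied the entry only once.
	resp, err := cli.Txn(context.Background()).
		If(clientv3.Compare(clientv3.Value("/state"), "=", "old")).
		Then(clientv3.OpPut("/state", "new")).
		Else(clientv3.OpPut("/conflicts", "state-already-updated")).
		Commit()
	if err != nil {
		panic(err)
	}
	fmt.Println("took Then branch:", resp.Succeeded)
}
```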
This is a regression introduced in pull/12855, and all existing 3.5.x releases (3.5.0 ~ 3.5.5) are impacted. Note that the previous critical issue issues/13766 was also caused by the same PR (12855).
etcd 3.4 doesn't have this issue. Note that etcd 3.5 is also unaffected if the defragmentation operation is performed offline using etcdutl.
It should be very hard to reproduce this issue in a production environment, because:
- Users rarely execute the defragmentation operation.
- The probability of etcd crashing during a defragmentation operation is low.
- Even if etcd crashes during a defragmentation operation, the issue isn't guaranteed to occur. If there is no write traffic while the defragmentation is being performed, the issue will not occur.
I just delivered PR pull/14730 for the main branch (3.6.0) and will backport it to release-3.5 later.
The fix will be included in etcd v3.5.6.
It's really interesting and funny that the PR number 14730 is very similar to that of the previous important issue, 14370.
Run a load test against an etcd cluster, and perform a defragmentation operation on one member. Kill that member while the defragmentation operation is in progress. Afterwards, start the member again; its revision may then be inconsistent with the other members.
Usually the problematic member's revision will be larger than the other members', because etcd re-applies some duplicated entries.
You can also reproduce this issue by executing the E2E test case TestLinearizability.
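The manual steps above can be sketched with the clientv3 API as follows. This is only a minimal sketch: the endpoints and keys are assumptions, and killing the member (e.g. `kill -9` on the etcd process) and restarting it still has to be done out of band while `Defragment` is in flight.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		// Assumption: a local 3-member cluster on these client endpoints.
		Endpoints:   []string{"127.0.0.1:2379", "127.0.0.1:22379", "127.0.0.1:32379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Generate continuous write traffic in the background.
	go func() {
		for i := 0; ; i++ {
			cli.Put(context.Background(), fmt.Sprintf("key-%d", i%100), "value")
		}
	}()

	// Trigger online defragmentation on one member. Kill that member's process
	// while this call is still in progress, then restart it and compare the
	// revisions reported by `etcdctl endpoint status` across members.
	if _, err := cli.Maintenance.Defragment(context.Background(), "127.0.0.1:2379"); err != nil {
		fmt.Println("defragment:", err)
	}
}
```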
When etcd processes a defragmentation operation, it commits all pending data into boltDB, but not the consistent index, so the persisted data may not match the consistent index. If etcd crashes for whatever reason during or immediately after the defragmentation operation, then when it starts again it replays the WAL entries starting from the latest snapshot. Accordingly, it may re-apply some entries which have already been applied, and eventually its revision becomes inconsistent with the other members.
Specifically, when etcd processes a defragmentation operation, it calls unsafeCommit, which doesn't call the OnPreCommitUnsafe hook, so the consistent index isn't persisted.
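As a rough illustration of the mechanism, here is a self-contained toy model (not etcd's actual backend code; only the names commit, unsafeCommit and OnPreCommitUnsafe are taken from the description above): the regular commit path runs a pre-commit hook that persists the consistent index in the same transaction, while calling unsafeCommit directly skips that hook.

```go
package main

import "fmt"

// hooks models the backend hook; OnPreCommitUnsafe is where the consistent
// index would be persisted before the transaction is committed.
type hooks interface {
	OnPreCommitUnsafe(tx *batchTx)
}

type indexHook struct{ consistentIndex *uint64 }

func (h indexHook) OnPreCommitUnsafe(tx *batchTx) {
	// Persist the consistent index together with the pending data.
	tx.persisted["consistent_index"] = fmt.Sprint(*h.consistentIndex)
}

type batchTx struct {
	hooks     hooks
	pending   map[string]string
	persisted map[string]string
}

// commit is the normal path: run the hook first, then do the actual commit.
func (t *batchTx) commit() {
	t.hooks.OnPreCommitUnsafe(t)
	t.unsafeCommit()
}

// unsafeCommit commits the pending data without running the hook; this is the
// path the defragmentation code used in the affected releases.
func (t *batchTx) unsafeCommit() {
	for k, v := range t.pending {
		t.persisted[k] = v
	}
	t.pending = map[string]string{}
}

func main() {
	ci := uint64(42)
	tx := &batchTx{
		hooks:     indexHook{consistentIndex: &ci},
		pending:   map[string]string{"key": "value"},
		persisted: map[string]string{},
	}

	// Defrag-style commit: the data is persisted, but the consistent index is
	// not, so a crash at this point makes the restarted member re-apply WAL
	// entries that are already reflected in the data.
	tx.unsafeCommit()
	fmt.Println("after unsafeCommit:", tx.persisted) // no consistent_index

	tx.pending["key2"] = "value2"
	tx.commit()
	fmt.Println("after commit:", tx.persisted) // consistent_index persisted
}
```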
The fix is simple: call OnPreCommitUnsafe in the method unsafeCommit instead of in commit. Please refer to pull/14730.
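In terms of the toy model above, the fix corresponds to running the pre-commit step inside unsafeCommit, so that every commit path, including the defragmentation one, persists the consistent index together with the data. The names and structure below are illustrative, not etcd's actual code.

```go
package main

import "fmt"

type batchTx struct {
	consistentIndex uint64
	pending         map[string]string
	persisted       map[string]string
}

// unsafeCommit now performs the pre-commit step itself before committing,
// which is the essence of the change in pull/14730 (sketched, not verbatim).
func (t *batchTx) unsafeCommit() {
	t.persisted["consistent_index"] = fmt.Sprint(t.consistentIndex) // formerly only done in commit
	for k, v := range t.pending {
		t.persisted[k] = v
	}
	t.pending = map[string]string{}
}

// commit simply delegates to unsafeCommit.
func (t *batchTx) commit() { t.unsafeCommit() }

func main() {
	tx := &batchTx{consistentIndex: 42, pending: map[string]string{"key": "value"}, persisted: map[string]string{}}
	tx.unsafeCommit() // a defrag-style commit now also persists the consistent index
	fmt.Println(tx.persisted)
}
```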
If you run into this issue, you need to remove the problematic member and clean up its local data. Afterwards, add the member back into the cluster; it will then sync the data from the leader automatically.
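A minimal sketch of those recovery steps using the clientv3 Cluster API; the endpoints, member ID, peer URL and data directory below are placeholders, not values from the original report. The same steps can be performed with `etcdctl member remove` and `etcdctl member add`.

```go
package main

import (
	"context"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to the healthy members only.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:22379", "127.0.0.1:32379"}, // assumption
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// 1. Remove the problematic member (look up its ID via MemberList or
	//    `etcdctl member list`). The ID here is a placeholder.
	const badMemberID = uint64(0xdeadbeefcafe)
	if _, err := cli.MemberRemove(ctx, badMemberID); err != nil {
		panic(err)
	}

	// 2. Clean up the removed member's local data directory (path is a placeholder).
	if err := os.RemoveAll("/var/lib/etcd/member-1"); err != nil {
		panic(err)
	}

	// 3. Add the member back. Once restarted with an empty data directory and
	//    --initial-cluster-state=existing, it syncs a snapshot from the leader.
	if _, err := cli.MemberAdd(ctx, []string{"http://127.0.0.1:2380"}); err != nil {
		panic(err)
	}
}
```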