December 25, 2023
- Background
- bump gRPC-gateway to v2
- livez & readyz probes
- v2 deprecation
- workflow (CI)
- lease revoke issue
- bbolt
- raft
- community
- etcd mentorship program
This article briefly summarizes the major changes in the etcd community in the third and fourth quarters of 2023. I should have written a separate summary for the third quarter long ago, but between too many chores and my own laziness it has been postponed until now.
gRPC-gateway has a very close connection with protobuf. Specifically, gRPC-gateway v1 depends on protobuf v1, and gRPC-gateway v2 depends on protobuf v2.
I originally planned to upgrade protobuf and gRPC-gateway to v2 at the same time, but the combined change was too big and carried potential compatibility risks, so I finally decided to upgrade gRPC-gateway to v2 separately first.
The final result is that gRPC-gateway v2 coexists with protobuf v1. To make them work together correctly, we made some modifications to the code automatically generated by gRPC-gateway v2. Please refer to the PR etcd/pull/16595 for more details.
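For orientation, here is a minimal, self-contained sketch of standing up a gRPC-gateway v2 runtime.ServeMux, mainly to show the /v2 module path; it is not etcd's actual gateway wiring, and the route registered via HandlePath is only there so the example runs without any generated stubs.

```go
package main

import (
	"log"
	"net/http"

	// gRPC-gateway v2 lives under the /v2 module path; v1 was
	// github.com/grpc-ecosystem/grpc-gateway/runtime.
	"github.com/grpc-ecosystem/grpc-gateway/v2/runtime"
)

func main() {
	mux := runtime.NewServeMux()

	// Normally the handlers generated by protoc-gen-grpc-gateway are
	// registered on the mux; HandlePath is used here instead so that the
	// sketch compiles on its own.
	err := mux.HandlePath("GET", "/hello", func(w http.ResponseWriter, r *http.Request, _ map[string]string) {
		w.Write([]byte("hello from gRPC-gateway v2\n"))
	})
	if err != nil {
		log.Fatal(err)
	}

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```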
etcd previously supported only one health check endpoint: /health. Users could query the health status of the local node or the cluster through the serializable parameter. In practice, many people do not fully understand what serializable means, so to reduce that burden on users, two separate health check endpoints, /livez and /readyz, were introduced.
For details, please refer to the discussion in issue etcd/issues/16007 and the PRs associated with it.
In short, /livez tells clients when a node needs to be restarted, while /readyz tells clients when a node is ready to receive client requests.
The design follows exactly the same pattern as Kubernetes apiserver's health check.
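As a quick illustration, a client can probe the two endpoints over plain HTTP. The sketch below assumes a local etcd member serving client traffic on http://127.0.0.1:2379 without TLS.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func probe(url string) {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("%s -> %d %s\n", url, resp.StatusCode, body)
}

func main() {
	// /livez: should the process be restarted?
	probe("http://127.0.0.1:2379/livez")
	// /readyz: is this member ready to serve client requests?
	probe("http://127.0.0.1:2379/readyz")
}
```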
This feature has been backported to etcd 3.4 and 3.5 and will be included in 3.5.12 and 3.4.29. I was originally not in favor of backporting it to 3.4 and 3.5, but in the end I agreed, mainly for two reasons:
- This change is not actually a core part of etcd and has basically no impact on the stability of etcd;
- Some people in the community strongly requested a backport (although no particularly compelling reason was given).
The v2 deprecation moves etcd's storage from v2store to v3store (bbolt). The issue etcd/issues/12913 was raised by ptabor nearly 3 years ago, which means this change has been in progress for nearly 3 years; roughly 4 contributors have worked on it, but it is still not fully finished.
Although there has been considerable progress in the past few months, some tasks and documentation work remain. In 3.6, etcd no longer accepts v2 requests from clients, and only handles specific v2 requests in etcd's apply process (for compatibility with 3.5).
If anyone is interested in finishing the remaining work, please contact me. However, it requires a relatively deep understanding of etcd, specifically of how etcd interacts with raft and of etcd's apply process.
Since etcd was SIG-ified, some of its workflows have been migrated to the Kubernetes prow test platform. A direct benefit is that all contributors can now rerun failed tests themselves; in the past, only maintainers had this permission.
In addition, the tests for the arm64 platform were migrated to the environment provided by actuated. Please refer to PR etcd/pull/16801 for details.
The issue etcd/issues/15247 was raised in the community in February this year and was finally fixed after more than half a year. The problem mainly occurs when the leader blocks for a long time (for example, when writing the WAL), which may eventually cause a lease to be incorrectly revoked.
Please refer to etcd/issues/15247#issuecomment-1777862093 for the root cause of the issue.
However, this problem is currently only fixed on the main branch; I have not yet had time to backport it to 3.4 and 3.5.
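The sketch below is a deliberately simplified illustration of this kind of failure mode, not etcd's actual lessor code: expiry is judged by wall-clock time while renewals have to go through a processing loop, so a renewal that arrived in time can be applied too late when that loop is blocked.

```go
package main

import (
	"fmt"
	"time"
)

type lease struct {
	id       int64
	expireAt time.Time
}

func main() {
	ttl := 2 * time.Second
	l := lease{id: 1, expireAt: time.Now().Add(ttl)}

	// The client renews well before the TTL elapses, but the renewal has to
	// wait for the processing loop to pick it up.
	renewals := make(chan int64, 1)
	renewals <- l.id

	// The processing loop is blocked; the sleep stands in for a very slow
	// disk write such as a WAL fsync.
	time.Sleep(3 * time.Second)

	// By the time the loop unblocks, the wall-clock deadline has passed, so
	// an expiry check revokes the lease even though a renewal was pending.
	if time.Now().After(l.expireAt) {
		fmt.Printf("lease %d revoked, pending renewals: %d\n", l.id, len(renewals))
	}
}
```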
On bbolt, work has mainly continued on reproducing and investigating the data corruption issues. We have made significant progress and discovered a Linux kernel issue, introduced by the new ext4 fast-commit feature, that may also cause data corruption: when fast-commit is enabled, data may be lost if the system suddenly crashes or loses power, eventually leading to bbolt data corruption.
From a programming perspective, the fdatasync system call returns success, but the data is not actually persisted; if the system loses power at that moment, the data may be lost. In other words, fdatasync does not honor its own semantics. For details, please refer to the discussion in issue bbolt/issues/562.
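For reference, this is the durability contract the issue is about: once fdatasync returns successfully, the written data is supposed to survive a crash or power loss. A minimal sketch (Linux-only, with an illustrative file path):

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.OpenFile("/tmp/example.db", os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := f.Write([]byte("payload")); err != nil {
		log.Fatal(err)
	}

	// fdatasync is expected to flush the file's data (plus whatever metadata
	// is needed to read it back) to stable storage before returning. This is
	// the guarantee bbolt relies on when it treats a transaction as
	// committed, and the one the fast-commit bug could violate.
	if err := unix.Fdatasync(int(f.Fd())); err != nil {
		log.Fatal(err)
	}
}
```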
In addition, further improvements have been made to the surgery command, which tries to repair the database file through targeted, surgery-like operations when data corruption occurs.
Raft has not changed much; there have only been a few minor improvements. One of them is the addition of ForgetLeader. When the upper-layer (raft-based) application is reasonably sure that the leader is dead, it can call ForgetLeader on the other followers to quickly elect a new leader; otherwise, it would have to wait for the election timeout before a new leader is elected.
Please refer to PR raft/pull/78 for details.
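Roughly, usage looks like the sketch below, assuming the Node interface gained the ForgetLeader method added in raft/pull/78; the surrounding node setup and the external liveness signal are omitted and illustrative.

```go
package raftutil

import (
	"context"

	"go.etcd.io/raft/v3"
)

// forgetDeadLeader is called on a follower once the embedding application has
// decided, through some external signal of its own, that the current leader
// is really gone.
func forgetDeadLeader(ctx context.Context, n raft.Node) error {
	// Dropping the remembered leader lets a new election start without first
	// waiting for the full election timeout to elapse.
	return n.ForgetLeader(ctx)
}
```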
First of all, raft maintainer @tbg has retired, see PR raft/pull/85. He will leave the community for at least one year. @tbg is very nice and is definitely an expert on raft.
In addition, before @tbg retired, he and I jointly nominated another CockroachDB engineer as a reviewer, see PR raft/pull/87.
In the bbolt community, I recently nominated a new reviewer from Microsoft, see PR bbolt/issues/648. It is worth mentioning that the bbolt data corruption problem caused by the Linux kernel fast-commit bug was discovered by the new bbolt reviewer.
In addition, a new member from Google has been added to the etcd community.
The community launched a mentorship program a few months ago, and I am responsible for mentoring three mentees from different companies. For details, please refer to https://groups.google.com/g/etcd-dev/c/ywjqpJOVkwM.
Unfortunately, the mentee from VMware was unable to continue the program due to a job change.
PS: Today happens to be Christmas, so I wish everyone a Merry Christmas.