raft/c: fix an indefinite hang in transfer leadership #24404

Open
wants to merge 2 commits into base: dev

Conversation

@bharathv bharathv commented Dec 3, 2024

This PR fixes a timeout in the raft leadership transfer path. The timeout is caused by a race condition in raft that leaves a recovery_stm stuck even after it has lost leadership. The exact sequence of events:

  • A request to transfer leadership arrives in term 10.
  • The leader starts a recovery_stm to catch up a follower (term 10).
  • The recovery_stm detects that the follower has already been dispatched the latest log index and blocks on a CV until it hears back from the follower (a potentially indefinite wait).
  • A racing step_down is triggered by a failed replicate request (STM initiated).
  • A new leader takes over and the local term is incremented to 11.
  • The recovery_stm (still in term 10) is stuck waiting for the CV to be signaled, which only happens if the node assumes leadership again in the future or at shutdown; if it never becomes leader again, the hang is indefinite (a sketch of this stuck wait follows the list).
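
To make the stuck wait concrete, here is a minimal sketch in plain C++. The real recovery_stm is built on Seastar primitives, and every name below is a hypothetical illustration, not Redpanda's actual code:

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the recovery_stm wait described above.
struct recovery_wait_sketch {
    std::mutex m;
    std::condition_variable cv;       // signaled when follower state changes
    bool follower_responded = false;  // set once the follower acks the dispatched index

    void wait_for_follower() {
        std::unique_lock lk(m);
        // BUG: nothing wakes this waiter on a step down. If leadership is
        // lost while blocked here, the wait lasts until the node becomes
        // leader again or shuts down, i.e. potentially forever.
        cv.wait(lk, [this] { return follower_responded; });
    }
};
```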

This PR signals the CV upon step down, unblocking the recovery_stm; the stm then detects that it has to shut down due to the step down (loss of leadership), which in turn unblocks the transfer leadership request that was waiting on it (a minimal sketch of the fix follows).
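
Under the same hypothetical names as the sketch above, the fix looks roughly like this: stepping down sets a flag and broadcasts the CV, and the wait predicate also watches that flag so the recovery loop can observe the loss of leadership and exit:

```cpp
#include <condition_variable>
#include <mutex>

// Same sketch with the fix applied (hypothetical names, not Redpanda code).
struct recovery_wait_fixed {
    std::mutex m;
    std::condition_variable cv;
    bool follower_responded = false;
    bool stepped_down = false;

    // Invoked by the raft layer when this node loses leadership.
    void on_step_down() {
        {
            std::lock_guard lk(m);
            stepped_down = true;
        }
        cv.notify_all(); // the new part: wake any stuck recovery waiter
    }

    // Returns true if the follower responded; false if we stepped down and
    // the caller must shut recovery down, which in turn unblocks the pending
    // transfer-leadership request.
    bool wait_for_follower_or_step_down() {
        std::unique_lock lk(m);
        cv.wait(lk, [this] { return follower_responded || stepped_down; });
        return !stepped_down;
    }
};
```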

An unrelated (but deeper) issue this PR avoids fixing is why the step down happens at all while a transfer is already in progress. In this case it was initiated by an STM invariant that requires a step down upon the first failure (required for correctness). One can argue that stepping down while a transfer is in progress is not the right behavior, but fixing that is likely invasive and has correctness implications at the state machine level (e.g. idempotency), so it is avoided for now.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

bharathv commented Dec 3, 2024

/ci-repeat 1
skip-units
release
dt-repeat 20
tests/rptest/chaos_tests/single_fault_test.py::TxSubscribeTest

1 similar comment

bharathv commented Dec 3, 2024

/ci-repeat 1
skip-units
release
dt-repeat=20
tests/rptest/chaos_tests/single_fault_test.py::TxSubscribeTest

bharathv commented Dec 3, 2024

/ci-repeat 1
skip-units
release
dt-repeat=40
tests/rptest/chaos_tests/single_fault_test.py::TxSubscribeTest

This typically happens when there is a step down and downstream consumers, like recovery, need to know about it.

bharathv commented Dec 3, 2024

/ci-repeat 1
skip-units
release
dt-repeat=40
tests/rptest/chaos_tests/single_fault_test.py::TxSubscribeTest

@bharathv bharathv marked this pull request as ready for review December 4, 2024 03:50
@bharathv bharathv changed the title raft/c: hang in transfer leadership - wip raft/c: hang in transfer leadership Dec 4, 2024
@bharathv bharathv changed the title raft/c: hang in transfer leadership raft/c: fix an indefinite hang in transfer leadership Dec 4, 2024

vbotbuildovich commented Dec 4, 2024

Retry command for Build#59208

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/license_enforcement_test.py::LicenseEnforcementTest.test_license_enforcement@{"clean_node_after_recovery":true,"clean_node_before_recovery":false}
tests/rptest/tests/license_enforcement_test.py::LicenseEnforcementTest.test_license_enforcement@{"clean_node_after_recovery":false,"clean_node_before_recovery":false}
tests/rptest/tests/license_enforcement_test.py::LicenseEnforcementTest.test_escape_hatch_license_variable@{"clean_node_before_upgrade":false}
tests/rptest/tests/license_enforcement_test.py::LicenseEnforcementTest.test_license_enforcement@{"clean_node_after_recovery":false,"clean_node_before_recovery":true}
tests/rptest/tests/license_enforcement_test.py::LicenseEnforcementTest.test_license_enforcement@{"clean_node_after_recovery":true,"clean_node_before_recovery":true}
tests/rptest/tests/license_enforcement_test.py::LicenseEnforcementTest.test_escape_hatch_license_variable@{"clean_node_before_upgrade":true}

vbotbuildovich commented Dec 4, 2024

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/59208#0193900d-362c-4a1e-ac0e-951f86073cd9:

"rptest.tests.license_enforcement_test.LicenseEnforcementTest.test_license_enforcement.clean_node_before_recovery=True.clean_node_after_recovery=False"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/59208#0193900d-362e-4948-ba03-87047c5cbc15:

"rptest.tests.license_enforcement_test.LicenseEnforcementTest.test_escape_hatch_license_variable.clean_node_before_upgrade=True"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/59208#01939012-94b1-4626-870d-7023ed923a22:

"rptest.tests.license_enforcement_test.LicenseEnforcementTest.test_license_enforcement.clean_node_before_recovery=True.clean_node_after_recovery=False"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/59208#01939012-94b3-4515-9a2c-ea0f2873d859:

"rptest.tests.license_enforcement_test.LicenseEnforcementTest.test_escape_hatch_license_variable.clean_node_before_upgrade=True"
