This repository has been archived by the owner on Jan 30, 2023. It is now read-only.

Handle the case where the coordinator is replaced with a new host #2

Open
wants to merge 3 commits into master

@gotascii gotascii commented Oct 26, 2017

If the group coordinator is replaced with a different host but the broker id remains the same, the client goes into an endless reconnection loop. This PR refreshes the cluster data when a ConnectionError is raised while joining a group. The issue is reproducible by following these steps:

  • Start up a cluster with 3 nodes.
  • Publish some messages to a topic.
  • Connect to the topic and start an each_message loop.
  • A broker (say, broker 0) is memoized in @coordinator in ConsumerGroup.
  • Stop the each_message loop but do not exit the process.
  • Kill broker 0 and bring up a new host with a different IP as broker 0.
  • With the same consumer instance, run the each_message loop again.

When the above steps are taken:

  • ConsumerGroup#join is called.
  • Then coordinator.join_group on ConsumerGroup L:117 fails with ConnectionError.
  • ConsumerGroup#join sets @coordinator = nil.
  • Cluster#get_group_coordinator asks a broker for the broker id of the coordinator, which is still 0.
  • connect_to_broker pulls cached info for id 0 (i.e. the old IP).
  • Then coordinator.join_group on ConsumerGroup L:117 fails with ConnectionError restarting the loop.
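
To make the loop above concrete, here is a minimal toy model of it. It does not use ruby-kafka: StaleCluster, LIVE_BROKERS, join_group, join_with_retries, and both IP addresses are hypothetical stand-ins, and only the names get_group_coordinator and connect_to_broker are borrowed from the description. It is a sketch of the behaviour as I understand it, not the library's code.

```ruby
# Toy model of the failure; none of these classes are ruby-kafka's own.
class ConnectionError < StandardError; end

# Ground truth: broker 0 now lives on a new IP (addresses are made up).
LIVE_BROKERS = { 0 => "10.0.0.9" }.freeze

class StaleCluster
  def initialize
    # Address cached before broker 0 was replaced; never refreshed.
    @cached_addresses = { 0 => "10.0.0.1" }
  end

  # The lookup itself succeeds and still reports broker id 0.
  def get_group_coordinator
    connect_to_broker(0)
  end

  # Resolves the id through the cache, so it returns the old IP.
  def connect_to_broker(broker_id)
    @cached_addresses.fetch(broker_id)
  end
end

# Stand-in for coordinator.join_group: fails if nothing listens on the address.
def join_group(address)
  raise ConnectionError, "cannot reach #{address}" unless LIVE_BROKERS.value?(address)
  :joined
end

# Mirrors the retry in ConsumerGroup#join, capped so the example terminates.
def join_with_retries(cluster, max_attempts)
  coordinator = nil
  max_attempts.times do |i|
    coordinator ||= cluster.get_group_coordinator
    begin
      join_group(coordinator)
      return { joined: true, attempts: i + 1 }
    rescue ConnectionError
      coordinator = nil # same effect as @coordinator = nil
    end
  end
  { joined: false, attempts: max_attempts }
end

p join_with_retries(StaleCluster.new, 5)
```

Clearing the memoized coordinator does not help here, because every fresh lookup resolves broker id 0 through the same stale cache. Without the attempt cap the loop would never end.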

Seeing as the retry for a ConnectionError is guarded by a sleep 1, I'm hoping this is a pretty safe place to refresh metadata.
