cluster slots not rediscovered on scale-down cluster #2806

Open
carlhopf opened this issue Jul 26, 2024 · 2 comments

@carlhopf
Contributor

carlhopf commented Jul 26, 2024

Description

#slots.rediscover() is correctly called when a MOVED or ASK reply is received:

} else if (err.message.startsWith('MOVED')) {

if (err.message.startsWith('ASK')) {

There is a race condition when the cluster scales down (hash slots are rebalanced away, the node is forgotten, then shut down):

#slots.rediscover() might never be called if the leaving node is not queried between the time its slots migrate away and its shutdown, so it never gets a chance to reply with MOVED and trigger a rediscover.

This results in connection closed errors, because the node-redis cluster.#slots keeps a client active for an outdated cluster topology.

Steps to reproduce (a rough code sketch follows the list):

  1. start a Redis cluster
  2. create a node-redis client and connect
  3. use redis-cli to reshard all slots away from one cluster node
  4. use redis-cli del-node to remove the now-empty node, then shut it down
  5. have the node-redis client execute a command for a key that is sharded to the (now shut down) node
  6. connection closed error
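
For concreteness, here is a rough sketch of the reproduction (ports, the key name, and the exact redis-cli invocations are illustrative; the redis-cli steps happen outside of node):

import { createCluster } from 'redis';

// assuming a freshly created local cluster on 127.0.0.1:7000-7005
const cluster = createCluster({
  rootNodes: [{ url: 'redis://127.0.0.1:7000' }]
});
await cluster.connect();

// steps 3 and 4, performed outside of node:
//   redis-cli --cluster reshard 127.0.0.1:7000      (move all of 7000's slots to the other masters)
//   redis-cli --cluster del-node 127.0.0.1:7000 <node-id-of-7000>
//   redis-cli -p 7000 shutdown nosave

// step 5: access a key that hashes to one of 7000's former slots;
// node-redis still routes the command to its (now dead) client for 7000,
// so this fails with a connection closed error instead of rediscovering the slot map
await cluster.get('some-key-previously-owned-by-7000');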

One way to fix this would be to provide an option for a slot refresh interval (ioredis does this too); I will send a pull request.
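
For reference, ioredis exposes this as the slotsRefreshInterval cluster option; a node-redis equivalent might look roughly like the following (the option name here is hypothetical, not an existing node-redis API):

import { createCluster } from 'redis';

const cluster = createCluster({
  rootNodes: [{ url: 'redis://127.0.0.1:7000' }],
  // HYPOTHETICAL option (name borrowed from ioredis' slotsRefreshInterval):
  // re-run slot discovery every 5 seconds, even if no MOVED/ASK reply arrives
  slotsRefreshInterval: 5000
});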

Node.js Version

No response

Redis Server Version

No response

Node Redis Version

No response

Platform

No response

Logs

No response

@carlhopf carlhopf added the Bug label Jul 26, 2024
@Davrosh

Davrosh commented Aug 28, 2024

Adding a scenario I've found:
Following this guide https://redis.io/learn/operate/redis-at-scale/scalability/exercise-1, I initialized createCluster with six nodes in rootNodes on ports 7100-7105 (instead of 7000-7005, since I have conflicting ports).

Logging all masters and replicas every second or so (while trying to access a certain key) with:

console.log(
    'masters',
    redisClient.masters.map(node => node.port),
    'replicas',
    redisClient.replicas.map(node => node.port)
);

I started my program while 7101 was down; it did not appear in the list (I then restarted 7101, but the list was not updated). This seems to be caused by the issue above.

masters (3) [7104, 7105, 7102] replicas (2) [7100, 7103]

At some point I shut down 7105, which caused an error event and triggered the reconnect strategy:

Redis disconnect. connect ECONNREFUSED 127.0.0.1:7105 

After bringing 7105 back up, 7101 suddenly appears (it became a master after 7105 shut down), but 7105 does NOT appear as a replica. In this scenario, a MOVED error would NOT occur, and the list of replicas would therefore not be updated.

masters (3) [7104, 7101, 7102] replicas (2) [7100, 7103]

Then I followed the rest of the guide and added 7106 and 7107 to the cluster. While running cluster reshard in the background (which inevitably led to a MOVED error when accessing my key), both 7101 and 7105 suddenly appeared as a master/replica respectively, but 7106 and 7107 each appeared 3 times.

masters (6) [7106, 7104, 7106, 7101, 7106, 7102] replicas (6) [7107, 7100, 7107, 7105, 7107, 7103]

I then restarted the application, but the list still shows the duplication above (probably because 7106 and 7107 were not included in rootNodes, which seems like a separate but still alarming issue); at least all actual nodes are now listed.

If only replicas (and not masters) are added or removed (scale up OR down), no MOVED error would likely occur, leading to the issue above.

I think a periodic interval refresh/rediscover as suggested above is a good idea.

@leibale
Collaborator

leibale commented Sep 19, 2024

IMO a better solution would be to run "rediscover" proactively when a connection dies rather than having a "slots refresh interval" (which will solve the problem eventually, but will still have some "downtime").
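A rough sketch of the idea (illustrative only, not the actual node-redis internals; isConnectionError is a hypothetical helper):

// for each underlying node client tracked by the cluster's slot map:
client.on('error', async (err) => {
  if (isConnectionError(err)) {
    // proactively refresh the slot map instead of waiting for a MOVED reply
    await slots.rediscover(client);
  }
});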
I'm not sure if we should invest time into fixing it in v4 or just fix it in v5 (which should be released soon).
Anyway, PRs are more than welcome ;)
