cluster slots not rediscovered on scale-down cluster #2806

Open
carlhopf opened this issue Jul 26, 2024 · 2 comments

@carlhopf
Contributor

carlhopf commented Jul 26, 2024

Description

#slots.rediscover() is correctly called when a MOVED or ASK reply is received:

} else if (err.message.startsWith('MOVED')) {

if (err.message.startsWith('ASK')) {

There is a race condition when the cluster scales down (hash slots are rebalanced away, the node is forgotten, then shut down):

#slots.rediscover() might never be called if the leaving node is not queried between the time its slots migrate away and its shutdown, so it never gets a chance to reply with MOVED and trigger a rediscover.

This results in connection closed errors, because the node-redis cluster.#slots keeps a client active for an outdated cluster topology.

Steps to reproduce (a rough code sketch follows the list):

  1. start a Redis cluster
  2. create a node-redis client and connect
  3. use redis-cli to reshard all slots away from one cluster node
  4. use redis-cli del-node to remove the now-empty node, then shut it down
  5. have the node-redis client execute a command for a key that is sharded to the (now shut down) node
  6. connection closed error
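
For concreteness, here is a rough sketch of the reproduction (ports, the key name, and the exact redis-cli invocations are illustrative; the redis-cli steps happen outside of node):

import { createCluster } from 'redis';

// assuming a freshly created local cluster on 127.0.0.1:7000-7005
const cluster = createCluster({
  rootNodes: [{ url: 'redis://127.0.0.1:7000' }]
});
await cluster.connect();

// steps 3 and 4, performed outside of node:
//   redis-cli --cluster reshard 127.0.0.1:7000      (move all of 7000's slots to the other masters)
//   redis-cli --cluster del-node 127.0.0.1:7000 <node-id-of-7000>
//   redis-cli -p 7000 shutdown nosave

// step 5: access a key that hashes to one of 7000's former slots;
// node-redis still routes the command to its (now dead) client for 7000,
// so this fails with a connection closed error instead of rediscovering the slot map
await cluster.get('some-key-previously-owned-by-7000');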

One way to fix this would be to provide an option for a slot refresh interval (ioredis does this too); I will send a pull request.
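
For reference, ioredis exposes this as the slotsRefreshInterval cluster option; a node-redis equivalent might look roughly like the following (the option name here is hypothetical, not an existing node-redis API):

import { createCluster } from 'redis';

const cluster = createCluster({
  rootNodes: [{ url: 'redis://127.0.0.1:7000' }],
  // HYPOTHETICAL option (name borrowed from ioredis' slotsRefreshInterval):
  // re-run slot discovery every 5 seconds, even if no MOVED/ASK reply arrives
  slotsRefreshInterval: 5000
});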

Node.js Version

No response

Redis Server Version

No response

Node Redis Version

No response

Platform

No response

Logs

No response

@carlhopf carlhopf added the Bug label Jul 26, 2024
@Davrosh

Davrosh commented Aug 28, 2024

Adding a scenario I've found:
Following this guide https://redis.io/learn/operate/redis-at-scale/scalability/exercise-1, I initialized createCluster with six nodes in rootNodes on ports 7100-7105 (instead of 7000-7005, since I have conflicting ports).

Logging all masters and replicas every second or so (while trying to access a certain key) with:

console.log(
    'masters',
    redisClient.masters.map(node => node.port),
    'replicas',
    redisClient.replicas.map(node => node.port)
);

I started my program while 7101 was down; it did not appear in the list (I then restarted 7101, but the list was not updated). This seems to be caused by the issue above.

masters (3) [7104, 7105, 7102] replicas (2) [7100, 7103]

At some point I shut down 7105, which caused an error event and triggered the reconnect strategy:

Redis disconnect. connect ECONNREFUSED 127.0.0.1:7105 

After bringing 7105 back up, 7101 suddenly appears (it became a master after 7105 shut down), but 7105 does NOT appear as a replica. In this scenario, a MOVED error would NOT occur, and the list of replicas would therefore not be updated.

masters (3) [7104, 7101, 7102] replicas (2) [7100, 7103]

Then I followed the rest of the guide and added 7106 and 7107 to the cluster. While running cluster reshard in the background (which inevitably led to a MOVED error when accessing my key), both 7101 and 7105 suddenly appeared as a master/replica respectively, but 7106 and 7107 each appeared 3 times.

masters (6) [7106, 7104, 7106, 7101, 7106, 7102] replicas (6) [7107, 7100, 7107, 7105, 7107, 7103]

I then restarted the application, but the list still shows the duplication above (probably because 7106 and 7107 were not included in rootNodes, which seems like a separate but still alarming issue); at least all actual nodes are now listed.

If only replicas (and not masters) are added or removed (scale up OR down), no MOVED error would likely occur, leading to the issue above.

I think a periodic interval refresh/rediscover as suggested above is a good idea.

@leibale
Collaborator

leibale commented Sep 19, 2024

IMO a better solution would be to run "rediscover" proactively when a connection dies rather than having a "slots refresh interval" (which will solve the problem eventually, but will still have some "downtime").
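A rough sketch of the idea (illustrative only, not the actual node-redis internals; isConnectionError is a hypothetical helper):

// for each underlying node client tracked by the cluster's slot map:
client.on('error', async (err) => {
  if (isConnectionError(err)) {
    // proactively refresh the slot map instead of waiting for a MOVED reply
    await slots.rediscover(client);
  }
});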
I'm not sure if we should invest time into fixing it in v4 or just fix it in v5 (which should be released soon).
Anyway, PRs are more than welcome ;)
