You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hey team. We noticed that ACL Policy replication sometimes does not happen in clusters connected via WAN. We didn't see any errors or failed replication reports while encountering this bug. When it happens, the primary has an updated policy, and one or two secondaries have an old version of the policy (or do not create a policy if it's a new one). Policies don't sync during the next replication poll, and, ultimately, the consul cluster thinks it's in sync with the primary, while it's not true (indexes are correct).
Important: Replication works most of the time, it fails only from time to time and completely randomly (sometimes it happens during the first five tries, sometimes during 40). Replicated Index is up-to-date in all the cases. We see no errors in logs either during failed or successful replication. For testing, we used consul versions 1.16.6 and 1.15.10.
Reproduction Steps
Steps to reproduce this issue:
Create three clusters and connect them via WAN.
Create a testing policy on the primary cluster
Update the policy with slight changes (e.g. change service_prefix from test-1 to test-2....10) 1-15 times.
We have clusters running on ECS Fargate, in the same VPC, SG rules allow all traffic (for debugging purposes).
Tried to disable encryption, verifications, and different consul versions (1.16.6 and 1.15.10). We tried multiple clusters in multiple AWS accounts with the same results. We also tried to use either a master token as a replication or a dedicated one only for replication.
WAN setup:
consul members -wan
Node Address Status Type Build Protocol DC Partition Segment
ip-10-0-0-134.region.compute.internal.consul-1-0 10.0.0.134:8302 alive server 1.15.10 2 consul-1-0 default <all>
ip-10-0-0-145.region.compute.internal.consul-1-2 10.0.0.145:8302 alive server 1.15.10 2 consul-1-2 default <all>
ip-10-0-0-146.region.compute.internal.consul-1-1 10.0.0.146:8302 alive server 1.15.10 2 consul-1-1 default <all>
ip-10-0-0-151.region.compute.internal.consul-1-2 10.0.0.151:8302 alive server 1.15.10 2 consul-1-2 default <all>
ip-10-0-0-154.region.compute.internal.consul-1-1 10.0.0.154:8302 alive server 1.15.10 2 consul-1-1 default <all>
ip-10-0-0-155.region.compute.internal.consul-1-0 10.0.0.155:8302 alive server 1.15.10 2 consul-1-0 default <all>
ip-10-0-0-165.region.compute.internal.consul-1-1 10.0.0.165:8302 alive server 1.15.10 2 consul-1-1 default <all>
ip-10-0-0-167.region.compute.internal.consul-1-1 10.0.0.167:8302 alive server 1.15.10 2 consul-1-1 default <all>
ip-10-0-0-176.region.compute.internal.consul-1-2 10.0.0.176:8302 alive server 1.15.10 2 consul-1-2 default <all>
ip-10-0-0-178.region.compute.internal.consul-1-0 10.0.0.178:8302 alive server 1.15.10 2 consul-1-0 default <all>
ip-10-0-0-186.region.compute.internal.consul-1-2 10.0.0.186:8302 alive server 1.15.10 2 consul-1-2 default <all>
ip-10-0-0-196.region.compute.internal.consul-1-2 10.0.0.196:8302 alive server 1.15.10 2 consul-1-2 default <all>
ip-10-0-0-210.region.compute.internal.consul-1-1 10.0.0.210:8302 alive server 1.15.10 2 consul-1-1 default <all>
ip-10-0-0-211.region.compute.internal.consul-1-0 10.0.0.211:8302 alive server 1.15.10 2 consul-1-0 default <all>
ip-10-0-0-216.region.compute.internal.consul-1-0 10.0.0.216:8302 alive server 1.15.10 2 consul-1-0 default <all>
Overview of the Issue
Hey team. We noticed that ACL Policy replication sometimes does not happen in clusters connected via WAN. We didn't see any errors or failed replication reports while encountering this bug. When it happens, the primary has an updated policy, and one or two secondaries have an old version of the policy (or do not create a policy if it's a new one). Policies don't sync during the next replication poll, and, ultimately, the consul cluster thinks it's in sync with the primary, while it's not true (indexes are correct).
Important: Replication works most of the time, it fails only from time to time and completely randomly (sometimes it happens during the first five tries, sometimes during 40). Replicated Index is up-to-date in all the cases. We see no errors in logs either during failed or successful replication. For testing, we used consul versions 1.16.6 and 1.15.10.
Reproduction Steps
Steps to reproduce this issue:
Consul info for Servers
We dont' have clients in this testing setup
Operating system and Environment details
We have clusters running on ECS Fargate, in the same VPC, SG rules allow all traffic (for debugging purposes).
Tried to disable encryption, verifications, and different consul versions (1.16.6 and 1.15.10). We tried multiple clusters in multiple AWS accounts with the same results. We also tried to use either a master token as a replication or a dedicated one only for replication.
WAN setup:
Log Fragments
The text was updated successfully, but these errors were encountered: