ACL policies replication bug #21959

Open · yburyndi-gh opened this issue Nov 20, 2024 · 0 comments

yburyndi-gh commented Nov 20, 2024

Overview of the Issue


Hey team. We noticed that ACL policy replication sometimes does not happen in clusters connected via WAN. We see no errors or failed-replication reports when we hit this bug. When it happens, the primary has the updated policy, while one or two secondaries keep an old version of it (or never create it, if it is a new policy). The policies do not sync on the next replication poll either, so ultimately the secondary cluster believes it is in sync with the primary even though it is not (the replicated indexes look correct).

Important: replication works most of the time; it fails only occasionally and completely at random (sometimes within the first five tries, sometimes after 40). The replicated index is up to date in all cases. We see no errors in the logs during either failed or successful replication. For testing we used Consul versions 1.16.6 and 1.15.10.
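To make the symptom concrete, this is roughly the check we run by hand. It is only a sketch: the API addresses and the token are placeholders for our real setup, and it assumes jq is installed.

#!/usr/bin/env bash
# Compare one policy's rules across datacenters and print each cluster's
# reported ACL replication status. Addresses and token below are placeholders.
set -euo pipefail

POLICY="test_policy"
TOKEN="${CONSUL_HTTP_TOKEN:?export a token with acl:read}"
ADDRS=("https://primary.example:8501" "https://dc1.example:8501" "https://dc2.example:8501")

for addr in "${ADDRS[@]}"; do
  echo "== ${addr}"
  # Rules of the policy as this cluster sees them.
  curl -sk -H "X-Consul-Token: ${TOKEN}" \
    "${addr}/v1/acl/policy/name/${POLICY}" | jq -r '.Rules'
  # Replication status reported by this cluster (only meaningful on secondaries).
  curl -sk -H "X-Consul-Token: ${TOKEN}" \
    "${addr}/v1/acl/replication" | jq '{ReplicatedIndex, LastSuccess, LastErrorMessage}'
done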

Reproduction Steps

Steps to reproduce this issue (a scripted version of these steps follows the list):

  1. Create three clusters and connect them via WAN.
  2. Create a test policy on the primary cluster.
  3. Update the policy 1-15 times with slight changes (e.g. bump the service_prefix value from test-1 to test-2, ... test-10).
  4. Read the policy on the secondary clusters and compare it with the primary.
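
For reference, this is roughly how we drive steps 2-4. It is only a sketch: the addresses, test.hcl, and the 60-second wait are placeholders, it assumes CONSUL_HTTP_TOKEN is exported with a management token, and TLS verification settings (e.g. CONSUL_HTTP_SSL_VERIFY) are omitted.

#!/usr/bin/env bash
# Repro loop: update the policy on the primary, wait for a replication poll,
# then check whether each secondary already shows the new rules.
set -euo pipefail

PRIMARY="https://primary.example:8501"
SECONDARIES=("https://dc1.example:8501" "https://dc2.example:8501")

consul acl policy create -http-addr="${PRIMARY}" -name test_policy -rules @test.hcl

for i in $(seq 2 10); do
  # Change only the prefix value inside the rules file, then push the update.
  sed -i "s/test-[0-9]*/test-${i}/" test.hcl
  consul acl policy update -http-addr="${PRIMARY}" -name test_policy -rules @test.hcl
  sleep 60
  for addr in "${SECONDARIES[@]}"; do
    consul acl policy read -http-addr="${addr}" -name test_policy | grep -q "test-${i}" \
      || echo "MISMATCH in ${addr} after update ${i}"
  done
done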

Consul info for Servers

Server info
consul info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease =
        revision = a8dca240
        version = 1.15.10
        version_metadata =
consul:
        acl = enabled
        bootstrap = false
        known_datacenters = 3
        leader = false
        leader_addr = ip:8300
        server = true
raft:
        applied_index = 214
        commit_index = 214
        fsm_pending = 0
        last_contact = 34.3041ms
        last_log_index = 214
        last_log_term = 2
        last_snapshot_index = 0
        last_snapshot_term = 0
        latest_configuration = [{Suffrage:Voter ID:297216aa-1206-becd-a357-7fa6a5f64a2e Address:10.0.0.199:8300} {Suffrage:Voter ID:0dbca19c-0310-4eda-4595-6fa73fc91f93 Address:10.0.0.180:8300} {Suffrage:Voter ID:fbcef797-3d7a-d141-bd8c-cbbc36b3f358 Address:10.0.0.141:8300} {Suffrage:Voter ID:3c198304-4604-a298-7786-65f3cc2a241c Address:10.0.0.186:8300} {Suffrage:Voter ID:fbdfb4f5-f3fb-e089-4657-c06290938243 Address:10.0.0.217:8300}]
        latest_configuration_index = 0
        num_peers = 4
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 2
runtime:
        arch = amd64
        cpu_count = 8
        goroutines = 176
        max_procs = 8
        os = linux
        version = go1.21.7
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 2
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 5
        members = 5
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 221
        members = 15
        query_queue = 0
        query_time = 1

We don't have client agents in this testing setup.

Server agent config (JSON)
{
    "client_addr": "{{ GetInterfaceIP \"eth1\" }} 127.0.0.1",
    "bind_addr": "{{ GetInterfaceIP \"eth1\" }}",
    "data_dir": "/consul/data",
    "log_level": "TRACE",
    "datacenter": "${DC}",
    "encrypt": "${GOSSIP_KEY}",
    "primary_datacenter": "${PRIMARY_CLUSTER}",
    "retry_join_wan": ["${RETRY_DNS}:8302"],
    "encrypt_verify_incoming": true,
    "encrypt_verify_outgoing": true,
    "leave_on_terminate": true,
    "skip_leave_on_interrupt": false,
    "tls": {
        "defaults": {
            "tls_cipher_suites": "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
            "verify_incoming": false,
            "verify_outgoing": true,
            "ca_file": "/consul/config/ca.cert.pem",
            "cert_file": "/consul/config/consul.crt",
            "key_file": "/consul/config/consul.key"
        },
        "internal_rpc": {
            "verify_server_hostname": true,
            "verify_incoming": true
        }
    },
    "rpc": {
        "enable_streaming": true
    },
    "ports": {
        "http": -1,
        "https": 8501,
        "grpc": -1,
        "grpc_tls": 8502
    },
    "auto_encrypt": {
        "allow_tls": true
    },
    "acl": {
        "enabled": true,
        "default_policy": "deny",
        "down_policy": "extend-cache",
        "enable_token_persistence": true,
        "enable_token_replication": true,
        "tokens": {
            "master": "${MASTER_TOKEN}",
            "agent": "${MASTER_TOKEN}",
            "replication": "${MASTER_TOKEN}"
        }
    },
    "connect": {
        "enabled": true
    },
    "telemetry": {
        "statsd_address": "localhost:8125",
        "disable_hostname": true
    }
}


Operating system and Environment details

The clusters run on ECS Fargate in the same VPC, and the security group rules allow all traffic (for debugging purposes).
We tried disabling gossip encryption and TLS verification, and tested different Consul versions (1.16.6 and 1.15.10). We tried multiple clusters in multiple AWS accounts, with the same results. We also tried using either the master token or a dedicated replication-only token as the replication token.

WAN setup:

consul members -wan
Node                                                      Address          Status  Type    Build    Protocol  DC          Partition  Segment
ip-10-0-0-134.region.compute.internal.consul-1-0  10.0.0.134:8302  alive   server  1.15.10  2         consul-1-0  default    <all>
ip-10-0-0-145.region.compute.internal.consul-1-2  10.0.0.145:8302  alive   server  1.15.10  2         consul-1-2  default    <all>
ip-10-0-0-146.region.compute.internal.consul-1-1  10.0.0.146:8302  alive   server  1.15.10  2         consul-1-1  default    <all>
ip-10-0-0-151.region.compute.internal.consul-1-2  10.0.0.151:8302  alive   server  1.15.10  2         consul-1-2  default    <all>
ip-10-0-0-154.region.compute.internal.consul-1-1  10.0.0.154:8302  alive   server  1.15.10  2         consul-1-1  default    <all>
ip-10-0-0-155.region.compute.internal.consul-1-0  10.0.0.155:8302  alive   server  1.15.10  2         consul-1-0  default    <all>
ip-10-0-0-165.region.compute.internal.consul-1-1  10.0.0.165:8302  alive   server  1.15.10  2         consul-1-1  default    <all>
ip-10-0-0-167.region.compute.internal.consul-1-1  10.0.0.167:8302  alive   server  1.15.10  2         consul-1-1  default    <all>
ip-10-0-0-176.region.compute.internal.consul-1-2  10.0.0.176:8302  alive   server  1.15.10  2         consul-1-2  default    <all>
ip-10-0-0-178.region.compute.internal.consul-1-0  10.0.0.178:8302  alive   server  1.15.10  2         consul-1-0  default    <all>
ip-10-0-0-186.region.compute.internal.consul-1-2  10.0.0.186:8302  alive   server  1.15.10  2         consul-1-2  default    <all>
ip-10-0-0-196.region.compute.internal.consul-1-2  10.0.0.196:8302  alive   server  1.15.10  2         consul-1-2  default    <all>
ip-10-0-0-210.region.compute.internal.consul-1-1  10.0.0.210:8302  alive   server  1.15.10  2         consul-1-1  default    <all>
ip-10-0-0-211.region.compute.internal.consul-1-0  10.0.0.211:8302  alive   server  1.15.10  2         consul-1-0  default    <all>
ip-10-0-0-216.region.compute.internal.consul-1-0  10.0.0.216:8302  alive   server  1.15.10  2         consul-1-0  default    <all> 

Log Fragments

agent.server.replication.acl.policy: acl replication - upserted batch: number_upserted=1 batch_size=136
agent.server.replication.acl.policy: acl replication - finished updates
agent.server.replication.acl.policy: ACL replication completed through remote index: index=183
Primary: 
consul acl policy update --name test_policy --rules @test.hcl
ID:           5c62c300-a08c-ac54-e00e-8fe8ad5c7503
Name:         test_policy
Description:
Datacenters:
Rules:
#testing policy
node_prefix "test-23" {
  policy = "write"
}

DC1:
consul acl policy read --name test_policy
ID:           5c62c300-a08c-ac54-e00e-8fe8ad5c7503
Name:         test_policy
Description:
Datacenters:
Rules:
#testing policy
node_prefix "test-23" {
  policy = "write"
}
curl -k https://localhost:8501/v1/acl/replication?pretty
{
    "Enabled": true,
    "Running": true,
    "SourceDatacenter": "dc",
    "ReplicationType": "tokens",
    "ReplicatedIndex": 183,
    "ReplicatedRoleIndex": 1,
    "ReplicatedTokenIndex": 7,
    "LastSuccess": "2024-11-20T14:35:40Z",
    "LastError": "0001-01-01T00:00:00Z",
    "LastErrorMessage": ""
}
DC2: 
consul acl policy read --name test_policy
ID:           5c62c300-a08c-ac54-e00e-8fe8ad5c7503
Name:         test_policy
Description:
Datacenters:
Rules:
#testing policy
node_prefix "test-22" {
  policy = "write"
}
curl -k https://localhost:8501/v1/acl/replication?pretty
{
    "Enabled": true,
    "Running": true,
    "SourceDatacenter": "dc",
    "ReplicationType": "tokens",
    "ReplicatedIndex": 183,
    "ReplicatedRoleIndex": 1,
    "ReplicatedTokenIndex": 7,
    "LastSuccess": "2024-11-20T14:35:47Z",
    "LastError": "0001-01-01T00:00:00Z",
    "LastErrorMessage": ""
}