ACL policies replication bug #21959

Open · yburyndi-gh opened this issue Nov 20, 2024 · 0 comments

yburyndi-gh commented Nov 20, 2024

Overview of the Issue


Hey team. We noticed that ACL policy replication sometimes does not happen in clusters connected via WAN. We see no errors or failed-replication reports when we hit this bug. When it happens, the primary has the updated policy, while one or two secondaries keep an old version of it (or never create it, if it is a new policy). The policies do not sync on the next replication poll either, so ultimately the secondary cluster believes it is in sync with the primary even though it is not (the replicated indexes look correct).

Important: replication works most of the time; it fails only occasionally and completely at random (sometimes within the first five tries, sometimes after 40). The replicated index is up to date in all cases. We see no errors in the logs during either failed or successful replication. For testing we used Consul versions 1.16.6 and 1.15.10.
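To make the symptom concrete, this is roughly the check we run by hand. It is only a sketch: the API addresses and the token are placeholders for our real setup, and it assumes jq is installed.

#!/usr/bin/env bash
# Compare one policy's rules across datacenters and print each cluster's
# reported ACL replication status. Addresses and token below are placeholders.
set -euo pipefail

POLICY="test_policy"
TOKEN="${CONSUL_HTTP_TOKEN:?export a token with acl:read}"
ADDRS=("https://primary.example:8501" "https://dc1.example:8501" "https://dc2.example:8501")

for addr in "${ADDRS[@]}"; do
  echo "== ${addr}"
  # Rules of the policy as this cluster sees them.
  curl -sk -H "X-Consul-Token: ${TOKEN}" \
    "${addr}/v1/acl/policy/name/${POLICY}" | jq -r '.Rules'
  # Replication status reported by this cluster (only meaningful on secondaries).
  curl -sk -H "X-Consul-Token: ${TOKEN}" \
    "${addr}/v1/acl/replication" | jq '{ReplicatedIndex, LastSuccess, LastErrorMessage}'
done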

Reproduction Steps

Steps to reproduce this issue (a scripted version of these steps follows the list):

  1. Create three clusters and connect them via WAN.
  2. Create a test policy on the primary cluster.
  3. Update the policy 1-15 times with slight changes (e.g. bump the service_prefix value from test-1 to test-2, ... test-10).
  4. Read the policy on the secondary clusters and compare it with the primary.
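
For reference, this is roughly how we drive steps 2-4. It is only a sketch: the addresses, test.hcl, and the 60-second wait are placeholders, it assumes CONSUL_HTTP_TOKEN is exported with a management token, and TLS verification settings (e.g. CONSUL_HTTP_SSL_VERIFY) are omitted.

#!/usr/bin/env bash
# Repro loop: update the policy on the primary, wait for a replication poll,
# then check whether each secondary already shows the new rules.
set -euo pipefail

PRIMARY="https://primary.example:8501"
SECONDARIES=("https://dc1.example:8501" "https://dc2.example:8501")

consul acl policy create -http-addr="${PRIMARY}" -name test_policy -rules @test.hcl

for i in $(seq 2 10); do
  # Change only the prefix value inside the rules file, then push the update.
  sed -i "s/test-[0-9]*/test-${i}/" test.hcl
  consul acl policy update -http-addr="${PRIMARY}" -name test_policy -rules @test.hcl
  sleep 60
  for addr in "${SECONDARIES[@]}"; do
    consul acl policy read -http-addr="${addr}" -name test_policy | grep -q "test-${i}" \
      || echo "MISMATCH in ${addr} after update ${i}"
  done
done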

Consul info for Servers

Server info
consul info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease =
        revision = a8dca240
        version = 1.15.10
        version_metadata =
consul:
        acl = enabled
        bootstrap = false
        known_datacenters = 3
        leader = false
        leader_addr = ip:8300
        server = true
raft:
        applied_index = 214
        commit_index = 214
        fsm_pending = 0
        last_contact = 34.3041ms
        last_log_index = 214
        last_log_term = 2
        last_snapshot_index = 0
        last_snapshot_term = 0
        latest_configuration = [{Suffrage:Voter ID:297216aa-1206-becd-a357-7fa6a5f64a2e Address:10.0.0.199:8300} {Suffrage:Voter ID:0dbca19c-0310-4eda-4595-6fa73fc91f93 Address:10.0.0.180:8300} {Suffrage:Voter ID:fbcef797-3d7a-d141-bd8c-cbbc36b3f358 Address:10.0.0.141:8300} {Suffrage:Voter ID:3c198304-4604-a298-7786-65f3cc2a241c Address:10.0.0.186:8300} {Suffrage:Voter ID:fbdfb4f5-f3fb-e089-4657-c06290938243 Address:10.0.0.217:8300}]
        latest_configuration_index = 0
        num_peers = 4
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 2
runtime:
        arch = amd64
        cpu_count = 8
        goroutines = 176
        max_procs = 8
        os = linux
        version = go1.21.7
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 2
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 5
        members = 5
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 221
        members = 15
        query_queue = 0
        query_time = 1

We don't have client agents in this testing setup.

Server agent config (JSON)
{
    "client_addr": "{{ GetInterfaceIP \"eth1\" }} 127.0.0.1",
    "bind_addr": "{{ GetInterfaceIP \"eth1\" }}",
    "data_dir": "/consul/data",
    "log_level": "TRACE",
    "datacenter": "${DC}",
    "encrypt": "${GOSSIP_KEY}",
    "primary_datacenter": "${PRIMARY_CLUSTER}",
    "retry_join_wan": ["${RETRY_DNS}:8302"],
    "encrypt_verify_incoming": true,
    "encrypt_verify_outgoing": true,
    "leave_on_terminate": true,
    "skip_leave_on_interrupt": false,
    "tls": {
        "defaults": {
            "tls_cipher_suites": "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
            "verify_incoming": false,
            "verify_outgoing": true,
            "ca_file": "/consul/config/ca.cert.pem",
            "cert_file": "/consul/config/consul.crt",
            "key_file": "/consul/config/consul.key"
        },
        "internal_rpc": {
            "verify_server_hostname": true,
            "verify_incoming": true
        }
    },
    "rpc": {
        "enable_streaming": true
    },
    "ports": {
        "http": -1,
        "https": 8501,
        "grpc": -1,
        "grpc_tls": 8502
    },
    "auto_encrypt": {
        "allow_tls": true
    },
    "acl": {
        "enabled": true,
        "default_policy": "deny",
        "down_policy": "extend-cache",
        "enable_token_persistence": true,
        "enable_token_replication": true,
        "tokens": {
            "master": "${MASTER_TOKEN}",
            "agent": "${MASTER_TOKEN}",
            "replication": "${MASTER_TOKEN}"
        }
    },
    "connect": {
        "enabled": true
    },
    "telemetry": {
        "statsd_address": "localhost:8125",
        "disable_hostname": true
    }
}


Operating system and Environment details

The clusters run on ECS Fargate in the same VPC, and the security group rules allow all traffic (for debugging purposes).
We tried disabling gossip encryption and TLS verification, and tested different Consul versions (1.16.6 and 1.15.10). We tried multiple clusters in multiple AWS accounts, with the same results. We also tried using either the master token or a dedicated replication-only token as the replication token.

WAN setup:

consul members -wan
Node                                                      Address          Status  Type    Build    Protocol  DC          Partition  Segment
ip-10-0-0-134.region.compute.internal.consul-1-0  10.0.0.134:8302  alive   server  1.15.10  2         consul-1-0  default    <all>
ip-10-0-0-145.region.compute.internal.consul-1-2  10.0.0.145:8302  alive   server  1.15.10  2         consul-1-2  default    <all>
ip-10-0-0-146.region.compute.internal.consul-1-1  10.0.0.146:8302  alive   server  1.15.10  2         consul-1-1  default    <all>
ip-10-0-0-151.region.compute.internal.consul-1-2  10.0.0.151:8302  alive   server  1.15.10  2         consul-1-2  default    <all>
ip-10-0-0-154.region.compute.internal.consul-1-1  10.0.0.154:8302  alive   server  1.15.10  2         consul-1-1  default    <all>
ip-10-0-0-155.region.compute.internal.consul-1-0  10.0.0.155:8302  alive   server  1.15.10  2         consul-1-0  default    <all>
ip-10-0-0-165.region.compute.internal.consul-1-1  10.0.0.165:8302  alive   server  1.15.10  2         consul-1-1  default    <all>
ip-10-0-0-167.region.compute.internal.consul-1-1  10.0.0.167:8302  alive   server  1.15.10  2         consul-1-1  default    <all>
ip-10-0-0-176.region.compute.internal.consul-1-2  10.0.0.176:8302  alive   server  1.15.10  2         consul-1-2  default    <all>
ip-10-0-0-178.region.compute.internal.consul-1-0  10.0.0.178:8302  alive   server  1.15.10  2         consul-1-0  default    <all>
ip-10-0-0-186.region.compute.internal.consul-1-2  10.0.0.186:8302  alive   server  1.15.10  2         consul-1-2  default    <all>
ip-10-0-0-196.region.compute.internal.consul-1-2  10.0.0.196:8302  alive   server  1.15.10  2         consul-1-2  default    <all>
ip-10-0-0-210.region.compute.internal.consul-1-1  10.0.0.210:8302  alive   server  1.15.10  2         consul-1-1  default    <all>
ip-10-0-0-211.region.compute.internal.consul-1-0  10.0.0.211:8302  alive   server  1.15.10  2         consul-1-0  default    <all>
ip-10-0-0-216.region.compute.internal.consul-1-0  10.0.0.216:8302  alive   server  1.15.10  2         consul-1-0  default    <all> 

Log Fragments

agent.server.replication.acl.policy: acl replication - upserted batch: number_upserted=1 batch_size=136
agent.server.replication.acl.policy: acl replication - finished updates
agent.server.replication.acl.policy: ACL replication completed through remote index: index=183
Primary: 
consul acl policy update --name test_policy --rules @test.hcl
ID:           5c62c300-a08c-ac54-e00e-8fe8ad5c7503
Name:         test_policy
Description:
Datacenters:
Rules:
#testing policy
node_prefix "test-23" {
  policy = "write"
}

DC1:
consul acl policy read --name test_policy
ID:           5c62c300-a08c-ac54-e00e-8fe8ad5c7503
Name:         test_policy
Description:
Datacenters:
Rules:
#testing policy
node_prefix "test-23" {
  policy = "write"
}
curl -k https://localhost:8501/v1/acl/replication?pretty
{
    "Enabled": true,
    "Running": true,
    "SourceDatacenter": "dc",
    "ReplicationType": "tokens",
    "ReplicatedIndex": 183,
    "ReplicatedRoleIndex": 1,
    "ReplicatedTokenIndex": 7,
    "LastSuccess": "2024-11-20T14:35:40Z",
    "LastError": "0001-01-01T00:00:00Z",
    "LastErrorMessage": ""
}
DC2: 
consul acl policy read --name test_policy
ID:           5c62c300-a08c-ac54-e00e-8fe8ad5c7503
Name:         test_policy
Description:
Datacenters:
Rules:
#testing policy
node_prefix "test-22" {
  policy = "write"
}
curl -k https://localhost:8501/v1/acl/replication?pretty
{
    "Enabled": true,
    "Running": true,
    "SourceDatacenter": "dc",
    "ReplicationType": "tokens",
    "ReplicatedIndex": 183,
    "ReplicatedRoleIndex": 1,
    "ReplicatedTokenIndex": 7,
    "LastSuccess": "2024-11-20T14:35:47Z",
    "LastError": "0001-01-01T00:00:00Z",
    "LastErrorMessage": ""
}