
[DESIGN] swarm config template context: global service: peering Node.ID awareness #3183

Open
zarganum opened this issue Oct 1, 2024 · 0 comments


Dear team,

Design question.

The current Context implementation in context.go populates the template "variable namespace" with identifiers such as .Service.ID and .Node.ID, which are locally scoped: they describe the service, the task, or the local node. There is no information about the peer nodes on which the service's other tasks are placed.
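For illustration, a templated swarm config today can only interpolate local placeholders, roughly like this (a sketch using the Go template syntax of `docker config create --template-driver golang`; the exact field set comes from context.go):

```
node_id   = "{{ .Node.ID }}"        # ID of the node running this task
service   = "{{ .Service.Name }}"   # name of the service
task_slot = "{{ .Task.Slot }}"      # slot of this particular task
```

None of these fields can enumerate the other nodes running tasks of the same service.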

If a global service built on Raft consensus runs across multiple nodes, each replica potentially depends on correct peer configuration. You know these matters better than I do, so please correct me where I'm wrong :)

Use case: HashiCorp Vault (or its open source fork OpenBao) running with Raft integrated storage as a global service.

  • Cluster formation. The retry_join stanza should list all possible sources for cluster initialization (read: all swarm nodes except the current one).
  • Cluster node failure. A permanently failed peer has to be removed manually.
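For reference, the Vault configuration the cluster-formation case needs to generate looks roughly like this (a sketch; hostnames, ports, and paths are placeholders). Every peer except the current node gets its own retry_join stanza, which is exactly the repetitive, node-dependent part a template would have to produce:

```hcl
storage "raft" {
  path    = "/vault/data"
  node_id = "node-a"

  # one stanza per peer node, excluding this node itself
  retry_join {
    leader_api_addr = "https://vault-node-b:8200"
  }
  retry_join {
    leader_api_addr = "https://vault-node-c:8200"
  }
}
```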

Frankly, the node-failure case above is a bit of a spherical cow in a vacuum. I broke my 3-node Vault badly enough that the remaining follower FSMs deadlocked in leader election and the API stopped responding. Another example is offline recovery of a Vault cluster from a single remaining replica (TL;DR: Vault persists its peers in the DB, but the peer list can be overridden by an external JSON file to initiate recovery, followed by deletion of that JSON. Nice, but manual).

IMO the 21st century calls for automation :)

We could of course build automation that subscribes to node state changes, but it would be great to have something like an all-nodes list in the config template context, with the ability to iterate over it (and optionally exclude the current node). Ideally, this list would contain the Node IDs of nodes that match the service's placement constraints.
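To make the ask concrete, here is a purely hypothetical sketch of what such a template could look like. The .Nodes list and its .Hostname field do not exist today; this is the proposed addition, iterating over peers and skipping the current node:

```
{{ range .Nodes }}{{ if ne .ID $.Node.ID }}
retry_join {
  leader_api_addr = "https://{{ .Hostname }}:8200"
}
{{ end }}{{ end }}
```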

Question: does the scenario above, where the Context is aware of peer nodes, align with the general design?
Or is it considered an anti-pattern that clearly calls for external (off-swarm) automation?
I want to understand whether it is worth further R&D.

Many thanks
