Don't rely on InfraStructureTopology for infra HA #3186

Open

wants to merge 1 commit into base: main
Conversation

orenc1
Collaborator

@orenc1 orenc1 commented Dec 1, 2024

There are some cases in which the number of worker nodes changes throughout the lifecycle of the cluster, while the InfrastructureTopology in the Infrastructure resource is statically set at cluster installation time and is not updated dynamically. For example, when a new worker node is added to an SNO cluster, the infrastructure topology should be updated to HighlyAvailable rather than remain SingleReplica.
Instead of relying on InfrastructureTopology, we should count the worker nodes, as we already do for a plain Kubernetes cluster.

In addition, this fixes a potential bug where we used only the node-role.kubernetes.io/master label to find and count the masters/control-plane nodes.
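The counting approach described above, treating a node as control-plane if it carries either the legacy master label or the newer control-plane label, can be sketched roughly as follows. This is a minimal, self-contained sketch: `countNodes` and its label-map input are hypothetical stand-ins for the real client-based node listing in pkg/util/cluster.go.

```go
package main

import "fmt"

// countNodes is a hypothetical sketch of the counting logic: a node counts as
// control-plane if it carries either the legacy "node-role.kubernetes.io/master"
// label or the newer "node-role.kubernetes.io/control-plane" label; every
// other node is counted as a worker.
func countNodes(nodeLabels []map[string]string) (controlPlane, workers int) {
	for _, labels := range nodeLabels {
		_, isMaster := labels["node-role.kubernetes.io/master"]
		_, isControlPlane := labels["node-role.kubernetes.io/control-plane"]
		if isMaster || isControlPlane {
			controlPlane++
		} else {
			workers++
		}
	}
	return controlPlane, workers
}

func main() {
	nodes := []map[string]string{
		{"node-role.kubernetes.io/control-plane": ""}, // new-style label only
		{"node-role.kubernetes.io/master": ""},        // legacy label only
		{"node-role.kubernetes.io/worker": ""},
		{"node-role.kubernetes.io/worker": ""},
	}
	cp, w := countNodes(nodes)
	fmt.Println(cp, w) // 2 2
}
```

With this shape, an SNO cluster that gains workers is detected from the live node list rather than from the install-time InfrastructureTopology value.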

What this PR does / why we need it:

Reviewer Checklist

Reviewers are supposed to review the PR for every aspect below, one by one. Checking an item means the PR is either "OK" or "Not Applicable" in terms of that item. All items must be checked before merging a PR.

  • PR Message
  • Commit Messages
  • How to test
  • Unit Tests
  • Functional Tests
  • User Documentation
  • Developer Documentation
  • Upgrade Scenario
  • Uninstallation Scenario
  • Backward Compatibility
  • Troubleshooting Friendly

Jira Ticket:

https://issues.redhat.com/browse/CNV-50027

Release note:

NONE

@kubevirt-bot kubevirt-bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Dec 1, 2024
@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from orenc1. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls
Collaborator

coveralls commented Dec 1, 2024

Pull Request Test Coverage Report for Build 12155870093

Details

  • 33 of 117 (28.21%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.6%) to 71.25%

Changes Missing Coverage Covered Lines Changed/Added Lines %
controllers/hyperconverged/hyperconverged_controller.go 0 2 0.0%
pkg/util/cluster.go 33 39 84.62%
controllers/commontestutils/testUtils.go 0 9 0.0%
controllers/nodes/nodes_controller.go 0 67 0.0%
Totals Coverage Status
Change from base Build 12107257272: -0.6%
Covered Lines: 6027
Relevant Lines: 8459

💛 - Coveralls

@hco-bot
Collaborator

hco-bot commented Dec 1, 2024

hco-e2e-upgrade-operator-sdk-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-azure
hco-e2e-operator-sdk-gcp lane succeeded.
/override ci/prow/hco-e2e-operator-sdk-aws

@kubevirt-bot
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-operator-sdk-aws, ci/prow/hco-e2e-upgrade-operator-sdk-azure

In response to this:

hco-e2e-upgrade-operator-sdk-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-azure
hco-e2e-operator-sdk-gcp lane succeeded.
/override ci/prow/hco-e2e-operator-sdk-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hco-bot
Collaborator

hco-bot commented Dec 1, 2024

hco-e2e-upgrade-prev-operator-sdk-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure

@kubevirt-bot
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure

In response to this:

hco-e2e-upgrade-prev-operator-sdk-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hco-bot
Collaborator

hco-bot commented Dec 1, 2024

hco-e2e-upgrade-operator-sdk-sno-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws
hco-e2e-upgrade-prev-operator-sdk-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure

@kubevirt-bot
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws, ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure

In response to this:

hco-e2e-upgrade-operator-sdk-sno-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws
hco-e2e-upgrade-prev-operator-sdk-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hco-bot
Collaborator

hco-bot commented Dec 1, 2024

hco-e2e-kv-smoke-gcp lane succeeded.
/override ci/prow/hco-e2e-kv-smoke-azure

@kubevirt-bot
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-kv-smoke-azure

In response to this:

hco-e2e-kv-smoke-gcp lane succeeded.
/override ci/prow/hco-e2e-kv-smoke-azure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hco-bot
Collaborator

hco-bot commented Dec 1, 2024

hco-e2e-upgrade-prev-operator-sdk-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure
hco-e2e-operator-sdk-gcp lane succeeded.
/override ci/prow/hco-e2e-operator-sdk-aws
hco-e2e-upgrade-operator-sdk-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-azure

@kubevirt-bot
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-operator-sdk-aws, ci/prow/hco-e2e-upgrade-operator-sdk-azure, ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure

In response to this:

hco-e2e-upgrade-prev-operator-sdk-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure
hco-e2e-operator-sdk-gcp lane succeeded.
/override ci/prow/hco-e2e-operator-sdk-aws
hco-e2e-upgrade-operator-sdk-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-azure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hco-bot
Collaborator

hco-bot commented Dec 1, 2024

hco-e2e-operator-sdk-sno-azure lane succeeded.
/override ci/prow/hco-e2e-operator-sdk-sno-aws
hco-e2e-upgrade-operator-sdk-sno-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws

@kubevirt-bot
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-operator-sdk-sno-aws, ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws

In response to this:

hco-e2e-operator-sdk-sno-azure lane succeeded.
/override ci/prow/hco-e2e-operator-sdk-sno-aws
hco-e2e-upgrade-operator-sdk-sno-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hco-bot
Collaborator

hco-bot commented Dec 1, 2024

hco-e2e-upgrade-prev-operator-sdk-sno-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-aws

@kubevirt-bot
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-aws

In response to this:

hco-e2e-upgrade-prev-operator-sdk-sno-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Collaborator

@nunnatsa nunnatsa left a comment


Added several inline comments, but my general comment is about the ClusterInfo type and thread-safety.

I think we must split the high-availability info and methods out of this type/interface into a new type, and then protect it with an RWMutex.

@orenc1, WDYT?

@@ -426,3 +424,47 @@ func isValidCipherName(str string) bool {
slices.Contains(openshiftconfigv1.TLSProfiles[openshiftconfigv1.TLSProfileIntermediateType].Ciphers, str) ||
slices.Contains(openshiftconfigv1.TLSProfiles[openshiftconfigv1.TLSProfileModernType].Ciphers, str)
}

func getNodesCount(cl client.Client, nodesType NodesType) (int, error) {
Collaborator


There is no reused code between the two cases, and both are pretty long. Let's split this function into two functions and avoid passing the nodesType parameter; having a single function here gives us no advantage.

Collaborator Author


Yes, at the beginning the majority of the function was shared between the two cases.
But it then turned out there is no logical OR when selecting nodes by two labels where either of them exists, so I needed to make two queries.

Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function refactored; it now returns both the worker and master counts in a single API call.

if nodesType == ControlPlaneNodes {
masterReq, err := labels.NewRequirement("node-role.kubernetes.io/master", selection.Exists, nil)
if err != nil {
return -1, err
Collaborator


Avoid returning a non-zero value when returning an error. There is no need for that, and the convention is to return the zero value plus the error (0 for ints).
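The convention the reviewer refers to can be illustrated with a minimal, self-contained example; `parsePositive` is a made-up function, not code from this PR:

```go
package main

import (
	"errors"
	"fmt"
)

// parsePositive follows the Go convention: on error, return the zero value
// (0 for ints) together with the error, not a sentinel such as -1.
// Callers are expected to check err before using the value.
func parsePositive(n int) (int, error) {
	if n < 0 {
		return 0, errors.New("negative input") // zero value + error
	}
	return n, nil
}

func main() {
	if v, err := parsePositive(-5); err != nil {
		fmt.Println(v, err) // 0 negative input
	}
}
```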

Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. In any case, when an error is returned, the returned int is not read anyway.

}
cpSelector := labels.NewSelector().Add(*controlplaneReq)
cpLabelSelector := client.MatchingLabelsSelector{Selector: cpSelector}
err = cl.List(context.TODO(), cpNodeList, cpLabelSelector)
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the real context instead of context.TODO().

func (r *ReconcileNodeCounter) Reconcile(ctx context.Context, _ reconcile.Request) (reconcile.Result, error) {
log.Info("Triggered by node count change")
logger := logf.Log.WithName("nodes-controller")
clusterInfo := hcoutil.GetClusterInfo()
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're reading and writing from several goroutines, then we must protect the data.

But the ClusterInfo interface is too large for that, and would force us to protect each and every read method. So I think we must move IsControlPlaneHighlyAvailable and IsInfrastructureHighlyAvailable out of the ClusterInfo interface, together with their related data, into a new type/interface, and protect them there.
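One way such a split might look, as a rough sketch under the reviewer's assumptions (the `highAvailability` type and its methods are hypothetical, not the PR's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// highAvailability is a hypothetical standalone type, split out of the larger
// clusterInfo interface so that only these two fields need lock protection.
type highAvailability struct {
	mu                            sync.RWMutex
	controlPlaneHighlyAvailable   bool
	infrastructureHighlyAvailable bool
}

// set updates both flags under the write lock, using the thresholds from the
// PR: 3+ masters for control-plane HA, 2+ workers for infrastructure HA.
func (h *highAvailability) set(masters, workers int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.controlPlaneHighlyAvailable = masters >= 3
	h.infrastructureHighlyAvailable = workers >= 2
}

func (h *highAvailability) IsControlPlaneHighlyAvailable() bool {
	h.mu.RLock()
	defer h.mu.RUnlock()
	return h.controlPlaneHighlyAvailable
}

func (h *highAvailability) IsInfrastructureHighlyAvailable() bool {
	h.mu.RLock()
	defer h.mu.RUnlock()
	return h.infrastructureHighlyAvailable
}

func main() {
	ha := &highAvailability{}
	ha.set(3, 1)
	fmt.Println(ha.IsControlPlaneHighlyAvailable(), ha.IsInfrastructureHighlyAvailable()) // true false
}
```

An RWMutex keeps concurrent readers cheap while the nodes controller, as the single writer, takes the write lock only when the counts change.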

Resolved review threads:
  • controllers/nodes/nodes_controller.go
  • pkg/util/cluster.go (5 outdated threads)
@orenc1 orenc1 force-pushed the update_ha_discovery branch 4 times, most recently from 34027e0 to 7cbaa03 Compare December 3, 2024 15:14
Collaborator

@nunnatsa nunnatsa left a comment


Please handle async locks; see inline comments.

log.Info("Triggered by node count change")
logger := logf.Log.WithName("nodes-controller")
clusterInfo := hcoutil.GetClusterInfo()
err := clusterInfo.Init(ctx, r.client, logger)
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think clusterInfo.Init() does too much and updates too many fields.

I do think we should take the high-availability info out of clusterInfo, but even if we don't, let's at least add a dedicated setter for it.

Comment on lines 131 to 132
c.controlPlaneHighlyAvailable = masterNodeCount >= 3
c.infrastructureHighlyAvailable = workerNodeCount >= 2
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we allow async access to these fields, we MUST protect them.

Please add an RWMutex, and lock it each time we read or write these fields.

Alternatively, use atomic.Bool from the standard library instead of bool for the controlPlaneHighlyAvailable and infrastructureHighlyAvailable fields.
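A minimal sketch of the atomic.Bool alternative (the `haFlags` type is hypothetical). Note that the two Store calls are each atomic on their own but not atomic as a pair, which is the trade-off versus guarding both fields with one mutex:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// haFlags uses atomic.Bool (Go 1.19+) instead of plain bools guarded by a
// mutex. Each flag is individually safe for concurrent Load/Store.
type haFlags struct {
	controlPlaneHighlyAvailable   atomic.Bool
	infrastructureHighlyAvailable atomic.Bool
}

// update applies the same thresholds discussed above: 3+ masters, 2+ workers.
// The two Stores are not a single atomic transaction; a reader between them
// can observe one updated flag and one stale flag.
func (h *haFlags) update(masters, workers int) {
	h.controlPlaneHighlyAvailable.Store(masters >= 3)
	h.infrastructureHighlyAvailable.Store(workers >= 2)
}

func main() {
	var h haFlags
	h.update(1, 2) // e.g. an SNO control plane that gained a second worker
	fmt.Println(h.controlPlaneHighlyAvailable.Load(), h.infrastructureHighlyAvailable.Load()) // false true
}
```

If readers only ever consult one flag at a time, the atomic version is simpler; if they need a consistent snapshot of both, the RWMutex approach is safer.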


openshift-ci bot commented Dec 3, 2024

@orenc1: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/hco-e2e-operator-sdk-gcp 7cbaa03 link true /test hco-e2e-operator-sdk-gcp
ci/prow/hco-e2e-kv-smoke-gcp 7cbaa03 link true /test hco-e2e-kv-smoke-gcp
ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure 7cbaa03 link true /test hco-e2e-upgrade-prev-operator-sdk-azure
ci/prow/hco-e2e-upgrade-operator-sdk-sno-azure 7cbaa03 link false /test hco-e2e-upgrade-operator-sdk-sno-azure
ci/prow/ci-index-hco-upgrade-operator-sdk-bundle 7cbaa03 link true /test ci-index-hco-upgrade-operator-sdk-bundle
ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-azure 7cbaa03 link false /test hco-e2e-upgrade-prev-operator-sdk-sno-azure
ci/prow/hco-e2e-upgrade-operator-sdk-azure 7cbaa03 link true /test hco-e2e-upgrade-operator-sdk-azure
ci/prow/hco-e2e-consecutive-operator-sdk-upgrades-azure 7cbaa03 link true /test hco-e2e-consecutive-operator-sdk-upgrades-azure
ci/prow/hco-e2e-operator-sdk-azure 7cbaa03 link true /test hco-e2e-operator-sdk-azure
ci/prow/hco-e2e-kv-smoke-azure 7cbaa03 link true /test hco-e2e-kv-smoke-azure
ci/prow/hco-e2e-operator-sdk-sno-azure 7cbaa03 link false /test hco-e2e-operator-sdk-sno-azure
ci/prow/hco-e2e-operator-sdk-aws 7cbaa03 link true /test hco-e2e-operator-sdk-aws
ci/prow/hco-e2e-operator-sdk-sno-aws 7cbaa03 link false /test hco-e2e-operator-sdk-sno-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hco-bot
Collaborator

hco-bot commented Dec 3, 2024

hco-e2e-upgrade-prev-operator-sdk-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-azure
hco-e2e-upgrade-operator-sdk-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-sno-azure
hco-e2e-upgrade-operator-sdk-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-azure
hco-e2e-consecutive-operator-sdk-upgrades-aws lane succeeded.
/override ci/prow/hco-e2e-consecutive-operator-sdk-upgrades-azure
hco-e2e-upgrade-prev-operator-sdk-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure

@kubevirt-bot
Contributor

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-consecutive-operator-sdk-upgrades-azure, ci/prow/hco-e2e-upgrade-operator-sdk-azure, ci/prow/hco-e2e-upgrade-operator-sdk-sno-azure, ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure, ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-azure

In response to this:

hco-e2e-upgrade-prev-operator-sdk-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-azure
hco-e2e-upgrade-operator-sdk-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-sno-azure
hco-e2e-upgrade-operator-sdk-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-operator-sdk-azure
hco-e2e-consecutive-operator-sdk-upgrades-aws lane succeeded.
/override ci/prow/hco-e2e-consecutive-operator-sdk-upgrades-azure
hco-e2e-upgrade-prev-operator-sdk-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

There are some cases in which the number of worker nodes changes throughout the lifecycle of the cluster, while the InfrastructureTopology
in the Infrastructure resource is statically set at cluster installation time and is not updated dynamically.
For example, when a new worker node is added to an SNO cluster, the infrastructure topology should be updated to
'HighlyAvailable' rather than remain 'SingleReplica'.

Signed-off-by: Oren Cohen <[email protected]>

sonarcloud bot commented Dec 4, 2024
