Don't rely on InfraStructureTopology for infra HA #3186
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Pull Request Test Coverage Report for Build 12155870093 (Details)
💛 - Coveralls
Force-pushed 04ccc94 to 4a5b861 (Compare)
hco-e2e-upgrade-operator-sdk-aws lane succeeded.
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-operator-sdk-aws, ci/prow/hco-e2e-upgrade-operator-sdk-azure
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
hco-e2e-upgrade-prev-operator-sdk-aws lane succeeded.
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure
hco-e2e-upgrade-operator-sdk-sno-azure lane succeeded.
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws, ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure
hco-e2e-kv-smoke-gcp lane succeeded.
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-kv-smoke-azure
Force-pushed 4a5b861 to 2c541d5 (Compare)
hco-e2e-upgrade-prev-operator-sdk-aws lane succeeded.
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-operator-sdk-aws, ci/prow/hco-e2e-upgrade-operator-sdk-azure, ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure
hco-e2e-operator-sdk-sno-azure lane succeeded.
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-operator-sdk-sno-aws, ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws
hco-e2e-upgrade-prev-operator-sdk-sno-azure lane succeeded.
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-aws
Added several inline comments, but my general comment is about the ClusterInfo type and thread-safety.
I think we must split the high-availability info and methods out of this type/interface into a new type, and then protect it with an RWMutex.
@orenc1, WDYT?
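One possible shape for such a split is sketched below. This is illustrative only, not the PR's actual code: the type name, field names, and setter are assumptions, and only the HA flags live behind the lock, so the rest of ClusterInfo stays untouched.

```go
package main

import (
	"fmt"
	"sync"
)

// highAvailability holds only the HA flags, split out of ClusterInfo
// so that just these fields need locking. All names are illustrative.
type highAvailability struct {
	mu                            sync.RWMutex
	controlPlaneHighlyAvailable   bool
	infrastructureHighlyAvailable bool
}

// Set updates both flags under the write lock.
func (h *highAvailability) Set(cpHA, infraHA bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.controlPlaneHighlyAvailable = cpHA
	h.infrastructureHighlyAvailable = infraHA
}

// IsControlPlaneHighlyAvailable reads under the read lock.
func (h *highAvailability) IsControlPlaneHighlyAvailable() bool {
	h.mu.RLock()
	defer h.mu.RUnlock()
	return h.controlPlaneHighlyAvailable
}

// IsInfrastructureHighlyAvailable reads under the read lock.
func (h *highAvailability) IsInfrastructureHighlyAvailable() bool {
	h.mu.RLock()
	defer h.mu.RUnlock()
	return h.infrastructureHighlyAvailable
}

func main() {
	ha := &highAvailability{}
	ha.Set(true, false)
	fmt.Println(ha.IsControlPlaneHighlyAvailable(), ha.IsInfrastructureHighlyAvailable()) // true false
}
```

Because the new type exposes only two getters and one setter, the locking surface stays small, which is the point of taking it out of the larger interface.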
pkg/util/cluster.go (Outdated)
@@ -426,3 +424,47 @@ func isValidCipherName(str string) bool {
	slices.Contains(openshiftconfigv1.TLSProfiles[openshiftconfigv1.TLSProfileIntermediateType].Ciphers, str) ||
	slices.Contains(openshiftconfigv1.TLSProfiles[openshiftconfigv1.TLSProfileModernType].Ciphers, str)
}

func getNodesCount(cl client.Client, nodesType NodesType) (int, error) {
There is no reused code between the two cases, and both are pretty long. Let's split this function into two functions and avoid passing the nodesType parameter. It gives us no advantage to have one function here.
Yes, at the beginning the majority of the function was shared between the two cases.
But then it turned out there is no logical OR when selecting nodes based on two labels where either of them exists, so I needed to make two queries.
Function refactored; it now returns both the worker and control-plane node counts in a single API call.
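The single-call approach can be sketched as follows: list all nodes once, then classify each node locally by its labels, treating a node as control-plane if it carries either the legacy master label or the newer control-plane label. This is a stdlib sketch of the classification logic only; the real code lists corev1 Nodes through the controller-runtime client, and the function name is an assumption.

```go
package main

import "fmt"

// countNodes classifies nodes by their labels in a single pass over one
// node list, so only one API List call is needed. A node counts as
// control-plane if it has either the "node-role.kubernetes.io/master"
// or the "node-role.kubernetes.io/control-plane" label; everything
// else counts as a worker.
func countNodes(nodeLabels []map[string]string) (workers, controlPlane int) {
	for _, labels := range nodeLabels {
		_, isMaster := labels["node-role.kubernetes.io/master"]
		_, isCP := labels["node-role.kubernetes.io/control-plane"]
		if isMaster || isCP {
			controlPlane++
		} else {
			workers++
		}
	}
	return workers, controlPlane
}

func main() {
	nodes := []map[string]string{
		{"node-role.kubernetes.io/control-plane": ""},
		{"node-role.kubernetes.io/master": ""},
		{"node-role.kubernetes.io/worker": ""},
		{"node-role.kubernetes.io/worker": ""},
	}
	w, cp := countNodes(nodes)
	fmt.Println(w, cp) // 2 2
}
```

Doing the OR in code sidesteps the limitation discussed above: a label selector cannot express "either of two labels exists" in one query.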
pkg/util/cluster.go (Outdated)
if nodesType == ControlPlaneNodes {
	masterReq, err := labels.NewRequirement("node-role.kubernetes.io/master", selection.Exists, nil)
	if err != nil {
		return -1, err
Avoid returning a non-zero value when returning an error. There is no need for that, and the convention is to return the zero value plus the error (0 for ints).
OK.
Anyway, in case of an error the returned int is not read.
pkg/util/cluster.go (Outdated)
}
cpSelector := labels.NewSelector().Add(*controlplaneReq)
cpLabelSelector := client.MatchingLabelsSelector{Selector: cpSelector}
err = cl.List(context.TODO(), cpNodeList, cpLabelSelector)
Use a real context here instead of context.TODO().
func (r *ReconcileNodeCounter) Reconcile(ctx context.Context, _ reconcile.Request) (reconcile.Result, error) {
	log.Info("Triggered by node count change")
	logger := logf.Log.WithName("nodes-controller")
	clusterInfo := hcoutil.GetClusterInfo()
If we're reading and writing from several goroutines, then we must protect the data.
But the ClusterInfo interface is too large for that, and would force us to protect each and every read method. So I think we must take IsControlPlaneHighlyAvailable and IsInfrastructureHighlyAvailable out of the ClusterInfo interface, together with their related data, into a new type/interface, and protect them there.
Force-pushed 34027e0 to 7cbaa03 (Compare)
Please handle async locks - see inline comments.
log.Info("Triggered by node count change")
	logger := logf.Log.WithName("nodes-controller")
	clusterInfo := hcoutil.GetClusterInfo()
	err := clusterInfo.Init(ctx, r.client, logger)
I think clusterInfo.Init() does too much and updates too many fields.
I do think we should take the high availability info out of ClusterInfo, but even if we don't, let's at least add a dedicated setter for it.
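A dedicated setter might look like the sketch below: the node-count controller calls it to refresh only the HA flags, without re-running everything Init() does. The type and method names are assumptions for illustration, and locking is left out here since it is discussed separately.

```go
package main

import "fmt"

// clusterInfoSketch stands in for the real ClusterInfo implementation;
// only the HA-related fields are shown.
type clusterInfoSketch struct {
	controlPlaneHighlyAvailable   bool
	infrastructureHighlyAvailable bool
	// ...other fields, set once by Init()...
}

// SetHighAvailabilityMode updates only the HA-related fields, using the
// same thresholds the PR applies: 3+ masters and 2+ workers.
func (c *clusterInfoSketch) SetHighAvailabilityMode(masterCount, workerCount int) {
	c.controlPlaneHighlyAvailable = masterCount >= 3
	c.infrastructureHighlyAvailable = workerCount >= 2
}

func main() {
	c := &clusterInfoSketch{}
	c.SetHighAvailabilityMode(3, 1)
	fmt.Println(c.controlPlaneHighlyAvailable, c.infrastructureHighlyAvailable) // true false
}
```

This keeps Init() as a one-time bootstrap while node-count changes flow through a narrow, purpose-built entry point.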
pkg/util/cluster.go (Outdated)
c.controlPlaneHighlyAvailable = masterNodeCount >= 3
c.infrastructureHighlyAvailable = workerNodeCount >= 2
If we allow async access to these fields, we MUST protect them.
Please add an RWMutex and lock it each time we read or write these fields.
Alternatively, use atomic.Bool from the standard library instead of bool for the controlPlaneHighlyAvailable and infrastructureHighlyAvailable fields.
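The atomic.Bool alternative (available since Go 1.19) avoids a mutex entirely, since each flag is independently safe to read and write from multiple goroutines. A sketch with illustrative names:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// haFlags shows the atomic.Bool option: each flag can be loaded and
// stored concurrently without explicit locking. Names are illustrative.
type haFlags struct {
	controlPlaneHighlyAvailable   atomic.Bool
	infrastructureHighlyAvailable atomic.Bool
}

// update applies the PR's thresholds via atomic stores.
func (h *haFlags) update(masterCount, workerCount int) {
	h.controlPlaneHighlyAvailable.Store(masterCount >= 3)
	h.infrastructureHighlyAvailable.Store(workerCount >= 2)
}

func main() {
	h := &haFlags{}
	h.update(3, 2)
	fmt.Println(h.controlPlaneHighlyAvailable.Load(), h.infrastructureHighlyAvailable.Load()) // true true
}
```

The trade-off versus an RWMutex: atomics are simpler per field, but the two flags are not updated as one unit, so a reader could briefly observe one flag updated and the other not. A mutex is preferable if the pair must change atomically together.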
@orenc1: The following tests failed, say
Full PR test history. Your PR dashboard. I understand the commands that are listed here.
hco-e2e-upgrade-prev-operator-sdk-sno-aws lane succeeded.
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-consecutive-operator-sdk-upgrades-azure, ci/prow/hco-e2e-upgrade-operator-sdk-azure, ci/prow/hco-e2e-upgrade-operator-sdk-sno-azure, ci/prow/hco-e2e-upgrade-prev-operator-sdk-azure, ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-azure
Force-pushed 7cbaa03 to 5d95295 (Compare)
There are some cases in which the number of worker nodes changes throughout the lifecycle of the cluster, while the InfrastructureTopology in the Infrastructure resource is statically set at cluster installation time and is not updated dynamically. For example, an SNO cluster to which a new worker node is added: in that case, the infrastructure topology should be updated to 'HighlyAvailable' and not remain 'SingleReplica'. Signed-off-by: Oren Cohen <[email protected]>
Force-pushed 5d95295 to 25ed595 (Compare)
Quality Gate passed. Issues | Measures
What this PR does / why we need it:
There are some cases in which the number of worker nodes changes throughout the lifecycle of the cluster, while the InfrastructureTopology in the Infrastructure resource is statically set at cluster installation time and is not updated dynamically. For example, an SNO cluster to which a new worker node is added: in that case, the infrastructure topology should be updated to HighlyAvailable and not remain SingleReplica. Instead, we should count the worker nodes, the same as we do in the case of a k8s cluster.
In addition, this fixes a potential bug where we used only the node-role.kubernetes.io/master label to find and count the masters/control-plane nodes.
Reviewer Checklist
Jira Ticket:
Release note: