
[RayCluster][Fix] leave .Status.State untouched when there is a reconcile error #2622

Merged

Conversation

@rueian (Contributor) commented Dec 6, 2024

Why are these changes needed?

In PR #2258, where we removed rayv1.Failed, we accidentally started setting the .State of the RayCluster CR to rayv1.Ready even when there was a reconcile error. This left users unable to deal with falsely ready RayClusters that had no Pods actually created; see #2357.

Since #2258 has already been released and rayv1.Failed is gone, I don't think we can bring rayv1.Failed back now either. This PR fixes the issue by leaving the .State field untouched when there is a reconcile error and putting the error message in the .Reason field instead. That means the initial .State for the case in #2357 will be blank, and the Pod creation error will appear in the .Reason field. The new unit test case in this PR also covers this behavior.
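
To make the intended rule concrete, here is a minimal sketch written against the rayv1 types from the kuberay ray-operator module. The function name calculateState and the allPodsRunning flag are illustrative placeholders, not the controller's actual code:

package sketch

import (
	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// calculateState only flips .Status.State to Ready when reconciliation succeeded.
// On a reconcile error it leaves .State untouched (blank for a brand-new CR) and
// records the error message in .Reason.
func calculateState(instance *rayv1.RayCluster, reconcileErr error, allPodsRunning bool) {
	if reconcileErr != nil {
		instance.Status.Reason = reconcileErr.Error()
		return
	}
	if allPodsRunning {
		instance.Status.State = rayv1.Ready
	}
}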

Example result:

▶ kubectl describe raycluster raycluster-mini
Name:         raycluster-mini
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  ray.io/v1
Kind:         RayCluster
Metadata:
  Creation Timestamp:  2024-12-06T22:53:44Z
  Generation:          1
  Resource Version:    1393
  UID:                 da4d6b0d-c3a3-4f00-9b9b-e4c411e2d6be
Spec:
  Head Group Spec:
    ...
  Ray Version:       2.9.0
  Suspend:           false
  Worker Group Specs:
    ...
Status:
  Desired CPU:              2
  Desired GPU:              0
  Desired Memory:           2G
  Desired TPU:              0
  Desired Worker Replicas:  1
  Endpoints:
    Client:     10001
    Dashboard:  8265
    Metrics:    8080
    Redis:      6379
  Head:
    Service Name:       raycluster-mini-head-svc
  Last Update Time:     2024-12-06T22:53:46Z
  Max Worker Replicas:  1
  Min Worker Replicas:  1
  Observed Generation:  1
  Reason:               FailedCreateHeadPod
pods "raycluster-mini-head-v6zqb" is forbidden: exceeded quota: demoquota, requested: cpu=1,memory=1G, used: cpu=100m,memory=512Mi, limited: cpu=1,memory=1G
Events:
  Type     Reason                 Age   From                   Message
  ----     ------                 ----  ----                   -------
  Normal   CreatedService         3s    raycluster-controller  Created service default/raycluster-mini-head-svc
  Warning  FailedToCreateHeadPod  3s    raycluster-controller  Failed to create head Pod default/, pods "raycluster-mini-head-tqljd" is forbidden: exceeded quota: demoquota, requested: cpu=1,memory=1G, used: cpu=100m,memory=512Mi, limited: cpu=1,memory=1G
  Warning  FailedToCreateHeadPod  3s    raycluster-controller  Failed to create head Pod default/, pods "raycluster-mini-head-gt52b" is forbidden: exceeded quota: demoquota, requested: cpu=1,memory=1G, used: cpu=100m,memory=512Mi, limited: cpu=1,memory=1G

Related issue number

Closes #2357

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests

@rueian force-pushed the fix-raycluster-ready-with-reconcileerr branch 2 times, most recently from 36d4cb3 to 90d7377, on December 6, 2024 at 22:05
@rueian marked this pull request as ready for review on December 6, 2024 at 22:44
@kevin85421 (Member) commented:

Do we still need this for the case in #2357 (comment)?

@rueian (Contributor, Author) commented Dec 9, 2024

I think we still need this. Currently, we mark the RayCluster CR as Ready even when no Pods are created due to the k8s quota limit, which doesn't look correct in any way.

This PR will leave .State blank in that case, since we are not introducing the Failed state back.

@kevin85421 (Member) commented:

Maybe we need to check the number of Pods before passing them to CheckAllPodsRunning?

@rueian (Contributor, Author) commented Dec 9, 2024

Maybe we need to check the number of Pods before passing them to CheckAllPodsRunning?

Yeap, checking len(runtimePods) > 0 before invoking CheckAllPodsRunning() works as well. Is there any reason we ignore the reconcileErr here?

@kevin85421 (Member) commented:

Is there any reason we ignore the reconcileErr here?

Is it included in the conditions?

@rueian (Contributor, Author) commented Dec 9, 2024

Yes, it is included in the ReplicaFailure condition, but if the feature gate is disabled, users can only find the error in the logs.

Also, I don't think it is a good idea to call CheckAllPodsRunning while reconcileErr is not nil. That can cause us to mark a CR as Ready too early. Shouldn't we mark the CR as Ready only when reconcileErr is nil?

@kevin85421 (Member) commented:

Also, I don't think it is a good idea to call CheckAllPodsRunning while reconcileErr is not nil. That can cause us to mark a CR as Ready too early. Shouldn't we mark the CR as Ready only when reconcileErr is nil?

This is a bug. We should not call CheckAllPodsRunning when the number of running Pods is not equal to the expected number of Pods.

@kevin85421 (Member) commented:

Discussed offline:

  1. we will not set Reason
  2. only call CheckAllPodsRunning when reconcile error is nil and the number of Pods is as expected.

@rueian force-pushed the fix-raycluster-ready-with-reconcileerr branch from 57b4fbf to 456eb43 on December 10, 2024 at 06:01
@rueian force-pushed the fix-raycluster-ready-with-reconcileerr branch from 456eb43 to 605dd33 on December 10, 2024 at 07:38
@rueian (Contributor, Author) commented Dec 10, 2024

Discussed offline:

  1. we will not set Reason
  2. only call CheckAllPodsRunning when reconcile error is nil and the number of Pods is as expected.

Updated. Test cases related to calculateStatus are also modified to make sure we don't mark the CR as Ready when there are not enough Pods.
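
For illustration, the agreed-upon guard might look roughly like the following sketch. It is not the PR's actual code; isReady, desiredPodCount, and the checkAllPodsRunning parameter are stand-ins for the controller's real helpers (e.g., utils.CheckAllPodsRunning):

package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// isReady reports whether the CR may be marked Ready now. When it returns false,
// the caller leaves .Status.State as-is instead of overwriting it.
func isReady(pods corev1.PodList, desiredPodCount int, reconcileErr error,
	checkAllPodsRunning func(corev1.PodList) bool) bool {
	// Only evaluate readiness when the last reconcile succeeded and every
	// expected Pod actually exists.
	if reconcileErr != nil || len(pods.Items) != desiredPodCount {
		return false
	}
	return checkAllPodsRunning(pods)
}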

@@ -1784,6 +1801,147 @@ func TestCalculateStatus(t *testing.T) {
assert.True(t, meta.IsStatusConditionPresentAndEqual(newInstance.Status.Conditions, string(rayv1.RayClusterReplicaFailure), metav1.ConditionTrue))
}

func TestCalculateStatusWithoutDesiredReplicas(t *testing.T) {
setupTest(t)

Review comment (Member):

  1. Add comments to explain this test.
  2. Add assertions to ensure that testRayCluster fulfills the assumptions made by this test (i.e., the worker group's replicas should not be 0).

@rueian (Contributor, Author) replied:

Done.
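
As an aside, the kind of fixture assertion requested above might look like this hedged sketch (not the PR's actual test code); the helper name and the use of testify's require are assumptions:

package sketch

import (
	"testing"

	"github.com/stretchr/testify/require"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// assertHasDesiredWorkerReplicas fails fast if the test fixture no longer has a
// worker group with a non-zero replica count, which the test relies on.
func assertHasDesiredWorkerReplicas(t *testing.T, cluster *rayv1.RayCluster) {
	require.NotEmpty(t, cluster.Spec.WorkerGroupSpecs)
	require.NotNil(t, cluster.Spec.WorkerGroupSpecs[0].Replicas)
	require.NotZero(t, *cluster.Spec.WorkerGroupSpecs[0].Replicas)
}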


func TestCalculateStatusWithReconcileErrorBackAndForth(t *testing.T) {
setupTest(t)

Review comment (Member):

same

@rueian (Contributor, Author) replied:

Done.

@kevin85421 merged commit b5bcb86 into ray-project:master on Dec 11, 2024
23 checks passed
Successfully merging this pull request may close these issues.

[Regression from v1.1] RayCluster Condition should surface resource quota issue correctly