
[Bug] Incorrect labels on the Service for RayCluster head #2564

Open
jegork opened this issue Nov 22, 2024 · 1 comment
Labels
bug Something isn't working raycluster

Comments

jegork commented Nov 22, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

  1. Create a RayCluster
  2. Update the RayCluster head pod's app.kubernetes.io/name label
  3. The operator does not update the selector labels on the head Service
  4. The head pod becomes unreachable through the Service

Expected: the operator updates the head Service's selector labels to match the new pod labels.

Reproduction script

Sample YAML:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  labels:
    app.kubernetes.io/component: converters-cluster
    app.kubernetes.io/name: ray-converters-2
  name: converters
spec:
  autoscalerOptions:
    idleTimeoutSeconds: 60
    imagePullPolicy: IfNotPresent
    upscalingMode: Default
  enableInTreeAutoscaling: true
  headGroupSpec:
    rayStartParams:
      dashboard-host: 0.0.0.0
      num-cpus: '0'
    template:
      metadata:
        annotations:
          prometheus.io/path: /metrics
          prometheus.io/port: '8080'
          prometheus.io/scrape: 'true'
        labels:
          app.kubernetes.io/component: converters-cluster
          app.kubernetes.io/name: ray-converters-2
      spec:
        containers:
          - image: docker.io/rayproject/ray:2.38.0-py311-gpu
            name: ray-head
            ports:
              - containerPort: 6379
                name: gcs-server
                protocol: TCP
              - containerPort: 8265
                name: dashboard
                protocol: TCP
              - containerPort: 10001
                name: client
                protocol: TCP
              - containerPort: 8000
                name: serve
                protocol: TCP
            resources:
              limits:
                cpu: 2
                memory: 2Gi
              requests:
                cpu: 2
                memory: 2Gi
        serviceAccountName: default
  rayVersion: 2.38.0
  workerGroupSpecs:
    - groupName: cpu
      maxReplicas: 3
      minReplicas: 0
      rayStartParams: {}
      replicas: 0
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: '8080'
            prometheus.io/scrape: 'true'
          labels:
            app.kubernetes.io/component: converters-cluster
            app.kubernetes.io/name: ray-converters-2
        spec:
          containers:
            - image: docker.io/rayproject/ray:2.38.0-py311-gpu
              lifecycle:
                preStop:
                  exec:
                    command:
                      - /bin/sh
                      - -c
                      - ray stop
              name: ray-worker
              resources:
                limits:
                  cpu: '4'
                  memory: 8Gi
                requests:
                  cpu: '4'
                  memory: 8Gi
```

Apply the resource, then change ray-converters-2 to ray-converters-3 and re-apply. The head Service still shows the old selector:

```
NAME                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                         AGE   SELECTOR
converters-head-svc   ClusterIP   None            <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   12m   app.kubernetes.io/created-by=kuberay-operator,app.kubernetes.io/name=ray-converters-2,ray.io/cluster=converters,ray.io/identifier=converters-head,ray.io/node-type=head
```

Anything else

Issues with reconciliation of the head Service's selector labels have been happening to me on different clusters, and this becomes really annoying: the only solution is to delete the whole RayCluster, after which the Service is recreated correctly (deleting just the Service does not fix it).
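As a stopgap until the operator reconciles correctly, the stale selector can be patched by hand. This is a hypothetical manual workaround, not an official fix; the key and new value below are taken from the reproduction above, and the operator may revert the patch on a later reconcile:

```yaml
# patch-head-svc.yaml -- hypothetical manual workaround, not an official fix.
# Overwrites the stale selector value on the head Service so it matches the
# renamed head pod again.
spec:
  selector:
    app.kubernetes.io/name: ray-converters-3
```

Applied with something like `kubectl patch service converters-head-svc --type merge --patch-file patch-head-svc.yaml`.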

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@jegork jegork added bug Something isn't working triage labels Nov 22, 2024
@andrewsykim
Collaborator

KubeRay uses the app.kubernetes.io/name label in the head Service selector:

```go
utils.KubernetesApplicationNameLabelKey: utils.ApplicationName,
```

Since you're overwriting it with a different value, the Service selector won't match your pods.
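Given that explanation, one way to avoid the collision is to leave the KubeRay-managed app.kubernetes.io/name label alone and carry your own identifier under a different key. A minimal sketch of the head group spec, assuming app.kubernetes.io/instance as an example key (any key KubeRay does not manage would do):

```yaml
# Sketch: keep app.kubernetes.io/name under the operator's control and put
# the user-chosen value under a different key instead. The choice of
# app.kubernetes.io/instance here is an assumption, not a KubeRay requirement.
headGroupSpec:
  template:
    metadata:
      labels:
        app.kubernetes.io/component: converters-cluster
        app.kubernetes.io/instance: ray-converters-2  # custom value lives here
        # app.kubernetes.io/name is intentionally omitted; KubeRay sets it
```

With this layout, renaming the custom value does not touch the label the head Service selects on.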
