On Slack you mentioned overriding the readiness probe and then using a proxy like Envoy. This could work, but I think the implementation will be tricky to get right because it's not trivial to know which Pods are running the proxy actors without connecting to the Ray Cluster. Another workaround is potentially just running a dummy proxy actor on every node?
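For the probe-override idea, a minimal sketch of what it could look like, assuming KubeRay leaves a user-supplied readinessProbe in the worker pod template untouched. The probe below only checks node-level raylet health through the dashboard agent (the port and endpoint mirror KubeRay's default node health check and may differ by version), and the group/container names are illustrative rather than taken from the sample:

```yaml
# Hypothetical readinessProbe override in the RayService worker group pod template.
workerGroupSpecs:
  - groupName: gpu-group                # illustrative name
    template:
      spec:
        containers:
          - name: ray-worker            # illustrative name
            readinessProbe:
              exec:
                command:
                  - bash
                  - -c
                  # Node-level health only; the Serve proxy port is not checked, so a
                  # worker pod that never hosts a proxy can still report Ready.
                  - "wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success"
              initialDelaySeconds: 10
              periodSeconds: 5
```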
> potentially just running a dummy proxy actor on every node
Yes, that's another idea I thought about: basically a dummy proxy that just routes traffic? I was actually hoping this config would do that: https://docs.ray.io/en/latest/serve/api/doc/ray.serve.config.ProxyLocation.html. But when I was testing it, it seemed to ignore the second pod vLLM was using; it didn't deploy a proxy on that pod.
Is there a different way to start a proxy actor everywhere?
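For context, this is roughly the setting that was tried, as a top-level field in serveConfigV2 (field name per the Serve config schema; the values are EveryNode, HeadOnly, and Disabled). The documented behavior of EveryNode is to start a proxy only on nodes that host at least one Serve replica, which would explain why the second pod, which only runs vLLM's distributed-inference actors, never gets one:

```yaml
serveConfigV2: |
  proxy_location: EveryNode        # proxies only on nodes with at least one Serve replica
  http_options:
    host: 0.0.0.0
    port: 8000
  applications:
    - name: llm
      # ... vLLM application config from the sample ...
```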
Can you just deploy another Ray Serve Deployment that does nothing? You may need to add enough replicas/resources to ensure it runs on every node.
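A rough sketch of that idea, expressed as a second application in serveConfigV2. The noop_app:app import path, the replica count, and the resource numbers are hypothetical; the extra application would just need a trivial @serve.deployment behind it and is served under its own route_prefix:

```yaml
serveConfigV2: |
  applications:
    - name: llm
      route_prefix: /
      # ... vLLM application from the sample, unchanged ...
    - name: readiness-filler
      route_prefix: /noop              # separate prefix, no real functionality
      import_path: noop_app:app        # hypothetical module containing a no-op deployment
      deployments:
        - name: Noop
          num_replicas: 2              # ideally at least one replica per Ray node
          ray_actor_options:
            num_cpus: 0.1
          # Replicas are packed by the Serve scheduler, so this does not strictly
          # guarantee one replica (and hence one proxy) per node; newer Ray versions
          # also have a max_replicas_per_node option that can help force spreading.
```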
Will that work? I think the operator injects the readiness check based on the port of the RayService deployment (say 8000). I won't be able to create another Ray Serve deployment on the same port on every node, right?
Also, I think the proxy actor needs to know where to route the traffic. I am not sure whether the second vLLM pod's port can handle requests?
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
I followed https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/vllm/ray-service.vllm.yaml to set up vLLM serving with RayService. While it works, I see an issue when enabling multi-node inference.
I configured PIPELINE_PARALLELISM = 2 and the service started up correctly, but the second pod always reports Ready = False. With this example, the Serve deployment is only placed on one pod, and under the hood vLLM leverages the second pod for distributed inference. Thus, the proxy isn't deployed to the second pod and its readiness check doesn't pass. The behavior looks like it was introduced by #1808.
While this isn't an immediate problem (serving works as expected), it doesn't seem right to report a pod as unready when it actually is ready. In addition, unready pods have other infrastructure impacts on our k8s cluster (e.g. on node lifecycle management).
Reproduction script
Following https://docs.ray.io/en/latest/cluster/kubernetes/examples/vllm-rayservice.html or https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/vllm/ray-service.vllm.yaml and setting PIPELINE_PARALLELISM=2 will reproduce this issue.
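Concretely, the only change relative to the sample is bumping the pipeline-parallelism environment variable that the serve script reads. A rough sketch (the surrounding application config stays as in the sample; the exact nesting of env_vars may differ between versions):

```yaml
serveConfigV2: |
  applications:
    - name: llm
      # ... import_path, deployments, etc. as in the sample ...
      runtime_env:
        env_vars:
          PIPELINE_PARALLELISM: "2"    # two pipeline stages, so vLLM spans two worker pods
```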
Anything else
It is always reproducible.
I don't have any solution in mind, but I am happy to brainstorm together and implement/try out any solution if there is a suggestion!
Are you willing to submit a PR?