[core][distributed] fix zmq hang #6759
@@ -40,7 +40,7 @@ def _validate_http_url(self, url: str):
             raise ValueError("Invalid HTTP URL: A valid HTTP URL "
                              "must have scheme 'http' or 'https'.")

     def _headers(self, **extras: str) -> Mapping[str, str]:
This is a lint error I fixed along the way.
LGTM, as long as XSUB isn't needed to make this semantically correct. I didn't dig into XPUB/XSUB quite enough to know that for sure myself, yet.
It's nice to be able to fix an issue AND simplify this code a lot :)
@@ -9,7 +9,7 @@
 import torch
 import torch.distributed as dist
 from torch.distributed import ProcessGroup
-from zmq import PUB, REP, REQ, SUB, SUBSCRIBE, Context  # type: ignore
+from zmq import SUB, SUBSCRIBE, XPUB, XPUB_VERBOSE, Context  # type: ignore
Do we need XSUB, too?
No, you can check https://netmq.readthedocs.io/en/latest/xpub-xsub/ .
XPUB connects to SUB. XSUB is used to connect many PUB sockets, which is not our use case.
(cherry picked from commit 740374d)
[core][distributed] fix zmq hang (vllm-project#6759)
Signed-off-by: Alvant <[email protected]>
Fixes #6700.
This is caused by incorrect usage of zmq, or by a bug in zmq, reported at zeromq/libzmq#4713.
By using an XPUB socket, the publisher receives each subscription message, so we can make sure all subscribers have subscribed before we publish (broadcast).
Tested locally: previously it hung about once in 20 runs.
Now it runs without any problem in 1000 runs.
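For reviewers less familiar with XPUB, here is a minimal standalone sketch of the handshake described above. It is not the code from this PR: `run_broadcast`, the thread-based subscribers, and the loopback TCP endpoint are assumptions made for illustration. The publisher sets XPUB_VERBOSE so that every subscription is passed up, waits for one subscription message (`b"\x01"` followed by an empty topic) per subscriber, and only then publishes.

```python
import threading

import zmq


def run_broadcast(n_subscribers: int = 2, payload: bytes = b"hello"):
    """Hypothetical helper: publish once, after every subscriber has subscribed."""
    ctx = zmq.Context()
    pub = ctx.socket(zmq.XPUB)
    # Pass every subscription message up to the application,
    # including duplicates for a topic that is already subscribed.
    pub.setsockopt(zmq.XPUB_VERBOSE, True)
    port = pub.bind_to_random_port("tcp://127.0.0.1")

    results = []
    lock = threading.Lock()

    def subscriber():
        sub = ctx.socket(zmq.SUB)
        sub.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to all topics
        sub.connect(f"tcp://127.0.0.1:{port}")
        msg = sub.recv()
        with lock:
            results.append(msg)
        sub.close(linger=0)

    threads = [threading.Thread(target=subscriber) for _ in range(n_subscribers)]
    for t in threads:
        t.start()

    # Block until each subscriber's subscription has arrived.
    # A subscription to the empty topic arrives as the single byte 0x01.
    for _ in range(n_subscribers):
        assert pub.recv() == b"\x01"

    # All subscribers are known to be connected and subscribed,
    # so the broadcast cannot be dropped by a late joiner.
    pub.send(payload)

    for t in threads:
        t.join()
    pub.close(linger=0)
    ctx.term()
    return results
```

With a plain PUB socket there is no such signal, so a message sent before a subscriber finishes subscribing is silently dropped and that subscriber waits forever, which matches the hang reported here.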