Distributed bug during training #12

Open
CBQ-1223 opened this issue Jun 12, 2024 · 0 comments

  File "train.py", line 191, in <module>
    main()
  File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "train.py", line 120, in main
    torch.distributed.broadcast(seed, src=0)
  File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1090, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

torch.distributed.broadcast(seed, src=0)

This call raises the error above on an A100. Have you run into this before, or do you know how to fix it?
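The traceback only reports a generic `ncclSystemError`, so a first step might be to get NCCL to log which system call failed. Below is a minimal sketch of that, not a confirmed fix. `NCCL_DEBUG` is a real NCCL environment variable. The helper names `enable_nccl_diagnostics` and `broadcast_seed` are hypothetical and not from train.py. The sketch assumes one process per GPU, a process group that is already initialized, and a known `local_rank`.

```python
import os


def enable_nccl_diagnostics(env=None):
    """Turn on verbose NCCL logging without overriding a value already set."""
    env = os.environ if env is None else env
    # NCCL_DEBUG=INFO makes NCCL print setup details and the failing call.
    env.setdefault("NCCL_DEBUG", "INFO")
    return env


def broadcast_seed(seed_value, local_rank):
    """Hypothetical helper: broadcast a seed from rank 0 using this rank's GPU.

    Assumes torch.distributed.init_process_group(...) has already been called.
    """
    # Imported lazily so the diagnostics helper can be used without torch.
    import torch
    import torch.distributed as dist

    # Bind the process to its own GPU before the first collective call.
    torch.cuda.set_device(local_rank)
    seed = torch.tensor([seed_value], dtype=torch.int64,
                        device=f"cuda:{local_rank}")
    dist.broadcast(seed, src=0)
    return int(seed.item())
```

Call `enable_nccl_diagnostics()` at the top of `main()`, before the process group is created. The extra log lines should show whether the failure comes from networking, shared memory, or something else.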
