You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
File "train.py", line 191, in
main()
File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 1130, in call
return self.main(*args, **kwargs)
File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "train.py", line 120, in main
torch.distributed.broadcast(seed, src=0)
File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1090, in broadcast
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
torch.distributed.broadcast(seed, src=0)
这个在A100上有报错,请问您有遇到过或者知道解决方法吗
The text was updated successfully, but these errors were encountered:
File "train.py", line 191, in
main()
File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 1130, in call
return self.main(*args, **kwargs)
File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "train.py", line 120, in main
torch.distributed.broadcast(seed, src=0)
File "/root/anaconda3/envs/py3-mink/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1090, in broadcast
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
torch.distributed.broadcast(seed, src=0)
这个在A100上有报错,请问您有遇到过或者知道解决方法吗
The text was updated successfully, but these errors were encountered: