[Bug]: Error when training on Kaggle #824

Open
Kuchiriel opened this issue Oct 15, 2024 · 1 comment
Labels: bug (Something isn't working)

Project Version

Latest

Platform and OS Version

Kaggle

Affected Devices

Kaggle Latest Environment

Existing Issues

No response

What happened?

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/kaggle/working/program_ml/rvc/train/train.py", line 509, in run
    train_and_evaluate(
  File "/kaggle/working/program_ml/rvc/train/train.py", line 707, in train_and_evaluate
    scaler.scale(loss_disc).backward()
  File "/kaggle/tmp/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/kaggle/tmp/.venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/kaggle/tmp/.venv/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [172.19.2.2]:48294
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 42 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
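
For context, the failing line sits inside a multi-process training step that synchronizes gradients over the gloo backend. Below is a minimal sketch of that pattern, assuming torch.nn.parallel.DistributedDataParallel and torch.cuda.amp.GradScaler as the traceback suggests; apart from `scaler` and `loss_disc`, every name (model, optimizer, data, ports) is a hypothetical stand-in, not the actual train.py code. The point is that `scaler.scale(loss_disc).backward()` is where DDP all-reduces gradients between workers, so if a peer process dies (for example, if one worker is killed for exceeding memory limits), gloo reports it there as "Connection closed by peer".

```python
# Minimal sketch of the failing pattern: DDP over the "gloo" backend with
# torch.cuda.amp.GradScaler. Everything except scaler/loss_disc is a
# hypothetical stand-in for what train.py does.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Use one GPU per process if available (e.g. Kaggle's 2x T4), else CPU.
    use_cuda = torch.cuda.is_available() and rank < torch.cuda.device_count()
    device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")

    model = torch.nn.Linear(80, 1).to(device)  # stand-in for the discriminator
    ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
    optimizer = torch.optim.AdamW(ddp_model.parameters())
    scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

    for step in range(100):  # stand-in for the epoch/batch loop
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=use_cuda):
            loss_disc = ddp_model(torch.randn(8, 80, device=device)).mean()
        # backward() is where DDP all-reduces gradients between workers over
        # gloo; if a peer process has already died, this call fails with
        # "Connection closed by peer" as in the traceback above.
        scaler.scale(loss_disc).backward()
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```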

Steps to reproduce

The error occurs during training, somewhere between 100 and 500 epochs.

Expected behavior

Training should continue without this error.

Attachments

No response

Screenshots or Videos

No response

Additional Information

No response

Kuchiriel added the bug (Something isn't working) label on Oct 15, 2024
aris-py (Contributor) commented on Oct 18, 2024:

@Vidalnt
