[Bug]: Error when training on Kaggle #824

Open
Kuchiriel opened this issue Oct 15, 2024 · 1 comment
Labels: bug (Something isn't working)

Project Version

Latest

Platform and OS Version

Kaggle

Affected Devices

Kaggle Latest Environment

Existing Issues

No response

What happened?

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/kaggle/working/program_ml/rvc/train/train.py", line 509, in run
    train_and_evaluate(
  File "/kaggle/working/program_ml/rvc/train/train.py", line 707, in train_and_evaluate
    scaler.scale(loss_disc).backward()
  File "/kaggle/tmp/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/kaggle/tmp/.venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/kaggle/tmp/.venv/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [172.19.2.2]:48294
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 42 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
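
For context, the failing line sits inside a multi-process training step that synchronizes gradients over the gloo backend. Below is a minimal sketch of that pattern, assuming torch.nn.parallel.DistributedDataParallel and torch.cuda.amp.GradScaler as the traceback suggests; apart from `scaler` and `loss_disc`, every name (model, optimizer, data, ports) is a hypothetical stand-in, not the actual train.py code. The point is that `scaler.scale(loss_disc).backward()` is where DDP all-reduces gradients between workers, so if a peer process dies (for example, if one worker is killed for exceeding memory limits), gloo reports it there as "Connection closed by peer".

```python
# Minimal sketch of the failing pattern: DDP over the "gloo" backend with
# torch.cuda.amp.GradScaler. Everything except scaler/loss_disc is a
# hypothetical stand-in for what train.py does.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Use one GPU per process if available (e.g. Kaggle's 2x T4), else CPU.
    use_cuda = torch.cuda.is_available() and rank < torch.cuda.device_count()
    device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")

    model = torch.nn.Linear(80, 1).to(device)  # stand-in for the discriminator
    ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
    optimizer = torch.optim.AdamW(ddp_model.parameters())
    scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

    for step in range(100):  # stand-in for the epoch/batch loop
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=use_cuda):
            loss_disc = ddp_model(torch.randn(8, 80, device=device)).mean()
        # backward() is where DDP all-reduces gradients between workers over
        # gloo; if a peer process has already died, this call fails with
        # "Connection closed by peer" as in the traceback above.
        scaler.scale(loss_disc).backward()
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```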

Steps to reproduce

The error occurs during training, somewhere between 100 and 500 epochs.

Expected behavior

Training should continue without this error.

Attachments

No response

Screenshots or Videos

No response

Additional Information

No response

Kuchiriel added the bug (Something isn't working) label on Oct 15, 2024
aris-py (Contributor) commented on Oct 18, 2024:

@Vidalnt
