File "tools/train.py", line 145, in
main()
File "tools/train.py", line 141, in main
runner.train()
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1789, in train
model = self.train_loop.run() # type: ignore
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 98, in run
self.run_epoch()
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 115, in run_epoch
self.run_iter(idx, data_batch)
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 131, in run_iter
outputs = self.runner.model.train_step(
File "/opt/conda/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
losses = self._run_forward(data, mode='loss')
File "/opt/conda/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
results = self(**data, mode=mode)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/mmdet3d/models/detectors/base.py", line 75, in forward
return self.loss(inputs, data_samples, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 292, in loss
feats = self.extract_feat(batch_inputs_dict, batch_input_metas)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 268, in extract_feat
img_feature = self.extract_img_feat(imgs, deepcopy(points),
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 156, in extract_img_feat
x = self.view_transform(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 424, in forward
x = super().forward(*args, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 330, in forward
x = self.bev_pool(geom, x)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 140, in bev_pool
x = x[kept]
RuntimeError: CUDA out of memory. Tried to allocate 2.11 GiB (GPU 0; 31.73 GiB total capacity; 17.32 GiB already allocated; 990.94 MiB free; 18.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Additional information
I think Distributed learning doesn't seem to be working properly.
I use nuscense dataset.
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 451731 C /opt/conda/bin/python 2216MiB |
| 0 N/A N/A 451732 C /opt/conda/bin/python 1246MiB |
| 0 N/A N/A 451736 C /opt/conda/bin/python 1104MiB |
| 0 N/A N/A 451737 C /opt/conda/bin/python 1120MiB |
| 0 N/A N/A 451739 C /opt/conda/bin/python 1184MiB |
| 0 N/A N/A 451741 C /opt/conda/bin/python 1222MiB |
| 0 N/A N/A 451743 C /opt/conda/bin/python 1176MiB |
| 0 N/A N/A 451745 C /opt/conda/bin/python 1114MiB |
| 1 N/A N/A 451732 C /opt/conda/bin/python 1768MiB |
| 2 N/A N/A 451736 C /opt/conda/bin/python 1768MiB |
| 3 N/A N/A 451737 C /opt/conda/bin/python 1768MiB |
| 4 N/A N/A 451739 C /opt/conda/bin/python 1768MiB |
| 5 N/A N/A 451741 C /opt/conda/bin/python 1768MiB |
| 6 N/A N/A 451743 C /opt/conda/bin/python 1768MiB |
| 7 N/A N/A 451745 C /opt/conda/bin/python 1648MiB |
+-----------------------------------------------------------------------------------------+
OOM issues occur as the GPU focuses on number 0.
What's the problem?
The text was updated successfully, but these errors were encountered:
Prerequisite
Task
I'm using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
1.x branch https://github.com/open-mmlab/mmdetection3d/tree/dev-1.x
Environment
:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
sys.platform: linux
Python: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15: Tesla V100-SXM3-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.1, V11.1.105
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.1+cu111
PyTorch compiling details: PyTorch built with:
TorchVision: 0.11.2+cu111
OpenCV: 4.10.0
MMEngine: 0.10.5
MMDetection: 3.3.0
MMDetection3D: 1.4.0+962f093
spconv2.0: True
System environment:
sys.platform: linux
Python: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 1686915582
GPU 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15: Tesla V100-SXM3-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.1, V11.1.105
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.1+cu111
PyTorch compiling details: PyTorch built with:
GCC 7.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX512
CUDA Runtime 11.1
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=co
CuDNN 8.0.5
Magma 2.5.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-eroverflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, U
TorchVision: 0.11.2+cu111
OpenCV: 4.10.0
MMEngine: 0.10.5
Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 1686915582
Distributed launcher: pytorch
Distributed training: True
GPU number: 8
Reproduces the problem - code sample
xx
Reproduces the problem - command or script
bash tools/dist_train.sh projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py 8 --cfg-options load_from=work_dirs/bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d/epoch_20.pth model.img_backbone.init_cfg.checkpoint=./swint-nuimages-pretrained.pth
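Since the core question is whether the distributed launch behaves as expected, here is a small diagnostic sketch (an assumption for illustration, not part of the official scripts) that could be printed from tools/train.py right after argument parsing; with --launcher pytorch every worker should report a distinct LOCAL_RANK:

# Hypothetical diagnostic, not part of the official train.py: log what each
# worker process received from the PyTorch launcher. If LOCAL_RANK is missing,
# every worker falls back to the default device, i.e. GPU 0.
import os

print(f"[pid {os.getpid()}] "
      f"RANK={os.environ.get('RANK')} "
      f"LOCAL_RANK={os.environ.get('LOCAL_RANK')} "
      f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}")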
Reproduces the problem - error message
File "tools/train.py", line 145, in
main()
File "tools/train.py", line 141, in main
runner.train()
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1789, in train
model = self.train_loop.run() # type: ignore
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 98, in run
self.run_epoch()
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 115, in run_epoch
self.run_iter(idx, data_batch)
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 131, in run_iter
outputs = self.runner.model.train_step(
File "/opt/conda/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
losses = self._run_forward(data, mode='loss')
File "/opt/conda/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
results = self(**data, mode=mode)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/mmdet3d/models/detectors/base.py", line 75, in forward
return self.loss(inputs, data_samples, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 292, in loss
feats = self.extract_feat(batch_inputs_dict, batch_input_metas)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 268, in extract_feat
img_feature = self.extract_img_feat(imgs, deepcopy(points),
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 156, in extract_img_feat
x = self.view_transform(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 424, in forward
x = super().forward(*args, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 330, in forward
x = self.bev_pool(geom, x)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 140, in bev_pool
x = x[kept]
RuntimeError: CUDA out of memory. Tried to allocate 2.11 GiB (GPU 0; 31.73 GiB total capacity; 17.32 GiB already allocated; 990.94 MiB free; 18.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
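The error text itself suggests trying max_split_size_mb. A minimal sketch of how that allocator option can be supplied (an assumption about a possible mitigation, not a verified fix for this config); it must be in the environment before the first CUDA allocation, so it is usually exported in the shell before dist_train.sh or set at the very top of the entry script:

# Assumption: setting PYTORCH_CUDA_ALLOC_CONF before any CUDA memory is
# allocated (e.g. at the top of tools/train.py) has the same effect as
# exporting it in the shell before launching dist_train.sh.
import os

os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'max_split_size_mb:128')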
Additional information
I think distributed training isn't working properly.
I am using the nuScenes dataset.
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 451731 C /opt/conda/bin/python 2216MiB |
| 0 N/A N/A 451732 C /opt/conda/bin/python 1246MiB |
| 0 N/A N/A 451736 C /opt/conda/bin/python 1104MiB |
| 0 N/A N/A 451737 C /opt/conda/bin/python 1120MiB |
| 0 N/A N/A 451739 C /opt/conda/bin/python 1184MiB |
| 0 N/A N/A 451741 C /opt/conda/bin/python 1222MiB |
| 0 N/A N/A 451743 C /opt/conda/bin/python 1176MiB |
| 0 N/A N/A 451745 C /opt/conda/bin/python 1114MiB |
| 1 N/A N/A 451732 C /opt/conda/bin/python 1768MiB |
| 2 N/A N/A 451736 C /opt/conda/bin/python 1768MiB |
| 3 N/A N/A 451737 C /opt/conda/bin/python 1768MiB |
| 4 N/A N/A 451739 C /opt/conda/bin/python 1768MiB |
| 5 N/A N/A 451741 C /opt/conda/bin/python 1768MiB |
| 6 N/A N/A 451743 C /opt/conda/bin/python 1768MiB |
| 7 N/A N/A 451745 C /opt/conda/bin/python 1648MiB |
+-----------------------------------------------------------------------------------------+
The OOM occurs because all eight worker processes also allocate memory on GPU 0, as the process list above shows.
What's the problem?
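For reference, the pattern that normally prevents this kind of pile-up on GPU 0 is pinning each rank to its own device before any CUDA call; below is a minimal sketch assuming the standard LOCAL_RANK variable set by the PyTorch launcher (MMEngine's distributed setup is expected to do this internally, so this only illustrates where the extra GPU-0 contexts could come from):

# Minimal sketch, not the MMEngine implementation: each distributed worker
# pins itself to its own GPU before doing any CUDA work, so no context (and
# no memory) is created on GPU 0 by the other ranks.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)                # must precede any .cuda() / to('cuda') call
dist.init_process_group(backend='nccl')

x = torch.zeros(1, device=f'cuda:{local_rank}')  # allocates on this rank's GPU only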