File "tools/train.py", line 145, in
main()
File "tools/train.py", line 141, in main
runner.train()
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1789, in train
model = self.train_loop.run() # type: ignore
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 98, in run
self.run_epoch()
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 115, in run_epoch
self.run_iter(idx, data_batch)
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 131, in run_iter
outputs = self.runner.model.train_step(
File "/opt/conda/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
losses = self._run_forward(data, mode='loss')
File "/opt/conda/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
results = self(**data, mode=mode)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/mmdet3d/models/detectors/base.py", line 75, in forward
return self.loss(inputs, data_samples, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 292, in loss
feats = self.extract_feat(batch_inputs_dict, batch_input_metas)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 268, in extract_feat
img_feature = self.extract_img_feat(imgs, deepcopy(points),
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 156, in extract_img_feat
x = self.view_transform(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 424, in forward
x = super().forward(*args, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 330, in forward
x = self.bev_pool(geom, x)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 140, in bev_pool
x = x[kept]
RuntimeError: CUDA out of memory. Tried to allocate 2.11 GiB (GPU 0; 31.73 GiB total capacity; 17.32 GiB already allocated; 990.94 MiB free; 18.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Additional information
I think Distributed learning doesn't seem to be working properly.
I use nuscense dataset.
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 451731 C /opt/conda/bin/python 2216MiB |
| 0 N/A N/A 451732 C /opt/conda/bin/python 1246MiB |
| 0 N/A N/A 451736 C /opt/conda/bin/python 1104MiB |
| 0 N/A N/A 451737 C /opt/conda/bin/python 1120MiB |
| 0 N/A N/A 451739 C /opt/conda/bin/python 1184MiB |
| 0 N/A N/A 451741 C /opt/conda/bin/python 1222MiB |
| 0 N/A N/A 451743 C /opt/conda/bin/python 1176MiB |
| 0 N/A N/A 451745 C /opt/conda/bin/python 1114MiB |
| 1 N/A N/A 451732 C /opt/conda/bin/python 1768MiB |
| 2 N/A N/A 451736 C /opt/conda/bin/python 1768MiB |
| 3 N/A N/A 451737 C /opt/conda/bin/python 1768MiB |
| 4 N/A N/A 451739 C /opt/conda/bin/python 1768MiB |
| 5 N/A N/A 451741 C /opt/conda/bin/python 1768MiB |
| 6 N/A N/A 451743 C /opt/conda/bin/python 1768MiB |
| 7 N/A N/A 451745 C /opt/conda/bin/python 1648MiB |
+-----------------------------------------------------------------------------------------+
OOM issues occur as the GPU focuses on number 0.
What's the problem?
The text was updated successfully, but these errors were encountered:
Prerequisite
Task
I'm using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
1.x branch https://github.com/open-mmlab/mmdetection3d/tree/dev-1.x
Environment
:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
sys.platform: linux
Python: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15: Tesla V100-SXM3-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.1, V11.1.105
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.1+cu111
PyTorch compiling details: PyTorch built with:
TorchVision: 0.11.2+cu111
OpenCV: 4.10.0
MMEngine: 0.10.5
MMDetection: 3.3.0
MMDetection3D: 1.4.0+962f093
spconv2.0: True
System environment:
sys.platform: linux
Python: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 1686915582
GPU 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15: Tesla V100-SXM3-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.1, V11.1.105
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.1+cu111
PyTorch compiling details: PyTorch built with:
GCC 7.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX512
CUDA Runtime 11.1
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=co
CuDNN 8.0.5
Magma 2.5.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-eroverflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, U
TorchVision: 0.11.2+cu111
OpenCV: 4.10.0
MMEngine: 0.10.5
Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 1686915582
Distributed launcher: pytorch
Distributed training: True
GPU number: 8
Reproduces the problem - code sample
xx
Reproduces the problem - command or script
bash tools/dist_train.sh projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py 8 --cfg-options load_from=work_dirs/bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d/epoch_20.pth model.img_backbone.init_cfg.checkpoint=./swint-nuimages-pretrained.pth
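Since the core question is whether the distributed launch behaves as expected, here is a small diagnostic sketch (an assumption for illustration, not part of the official scripts) that could be printed from tools/train.py right after argument parsing; with --launcher pytorch every worker should report a distinct LOCAL_RANK:

# Hypothetical diagnostic, not part of the official train.py: log what each
# worker process received from the PyTorch launcher. If LOCAL_RANK is missing,
# every worker falls back to the default device, i.e. GPU 0.
import os

print(f"[pid {os.getpid()}] "
      f"RANK={os.environ.get('RANK')} "
      f"LOCAL_RANK={os.environ.get('LOCAL_RANK')} "
      f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}")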
Reproduces the problem - error message
File "tools/train.py", line 145, in
main()
File "tools/train.py", line 141, in main
runner.train()
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1789, in train
model = self.train_loop.run() # type: ignore
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 98, in run
self.run_epoch()
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 115, in run_epoch
self.run_iter(idx, data_batch)
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 131, in run_iter
outputs = self.runner.model.train_step(
File "/opt/conda/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
losses = self._run_forward(data, mode='loss')
File "/opt/conda/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
results = self(**data, mode=mode)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/mmdet3d/models/detectors/base.py", line 75, in forward
return self.loss(inputs, data_samples, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 292, in loss
feats = self.extract_feat(batch_inputs_dict, batch_input_metas)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 268, in extract_feat
img_feature = self.extract_img_feat(imgs, deepcopy(points),
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 156, in extract_img_feat
x = self.view_transform(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 424, in forward
x = super().forward(*args, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 330, in forward
x = self.bev_pool(geom, x)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 140, in bev_pool
x = x[kept]
RuntimeError: CUDA out of memory. Tried to allocate 2.11 GiB (GPU 0; 31.73 GiB total capacity; 17.32 GiB already allocated; 990.94 MiB free; 18.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
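The error text itself suggests trying max_split_size_mb. A minimal sketch of how that allocator option can be supplied (an assumption about a possible mitigation, not a verified fix for this config); it must be in the environment before the first CUDA allocation, so it is usually exported in the shell before dist_train.sh or set at the very top of the entry script:

# Assumption: setting PYTORCH_CUDA_ALLOC_CONF before any CUDA memory is
# allocated (e.g. at the top of tools/train.py) has the same effect as
# exporting it in the shell before launching dist_train.sh.
import os

os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'max_split_size_mb:128')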
Additional information
I think distributed training isn't working properly.
I am using the nuScenes dataset.
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 451731 C /opt/conda/bin/python 2216MiB |
| 0 N/A N/A 451732 C /opt/conda/bin/python 1246MiB |
| 0 N/A N/A 451736 C /opt/conda/bin/python 1104MiB |
| 0 N/A N/A 451737 C /opt/conda/bin/python 1120MiB |
| 0 N/A N/A 451739 C /opt/conda/bin/python 1184MiB |
| 0 N/A N/A 451741 C /opt/conda/bin/python 1222MiB |
| 0 N/A N/A 451743 C /opt/conda/bin/python 1176MiB |
| 0 N/A N/A 451745 C /opt/conda/bin/python 1114MiB |
| 1 N/A N/A 451732 C /opt/conda/bin/python 1768MiB |
| 2 N/A N/A 451736 C /opt/conda/bin/python 1768MiB |
| 3 N/A N/A 451737 C /opt/conda/bin/python 1768MiB |
| 4 N/A N/A 451739 C /opt/conda/bin/python 1768MiB |
| 5 N/A N/A 451741 C /opt/conda/bin/python 1768MiB |
| 6 N/A N/A 451743 C /opt/conda/bin/python 1768MiB |
| 7 N/A N/A 451745 C /opt/conda/bin/python 1648MiB |
+-----------------------------------------------------------------------------------------+
The OOM occurs because all eight worker processes also allocate memory on GPU 0, as the process list above shows.
What's the problem?
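For reference, the pattern that normally prevents this kind of pile-up on GPU 0 is pinning each rank to its own device before any CUDA call; below is a minimal sketch assuming the standard LOCAL_RANK variable set by the PyTorch launcher (MMEngine's distributed setup is expected to do this internally, so this only illustrates where the extra GPU-0 contexts could come from):

# Minimal sketch, not the MMEngine implementation: each distributed worker
# pins itself to its own GPU before doing any CUDA work, so no context (and
# no memory) is created on GPU 0 by the other ranks.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)                # must precede any .cuda() / to('cuda') call
dist.init_process_group(backend='nccl')

x = torch.zeros(1, device=f'cuda:{local_rank}')  # allocates on this rank's GPU only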