[Bug] OOM occurs when training BEVFusion with lidar & camera; distributed training doesn't seem to be working properly. #3048

dudqls1994 commented Oct 22, 2024

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

1.x branch https://github.com/open-mmlab/mmdetection3d/tree/dev-1.x

Environment

:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
sys.platform: linux
Python: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15: Tesla V100-SXM3-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.1, V11.1.105
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.1+cu111
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX512
  • CUDA Runtime 11.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  • CuDNN 8.0.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.2+cu111
OpenCV: 4.10.0
MMEngine: 0.10.5
MMDetection: 3.3.0
MMDetection3D: 1.4.0+962f093
spconv2.0: True

Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 1686915582
Distributed launcher: pytorch
Distributed training: True
GPU number: 8


Reproduces the problem - code sample

xx

Reproduces the problem - command or script

bash tools/dist_train.sh projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py 8 --cfg-options load_from=work_dirs/bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d/epoch_20.pth model.img_backbone.init_cfg.checkpoint=./swint-nuimages-pretrained.pth
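
For clarity, the `--cfg-options` overrides in this command amount to the following in-code edits (a sketch using `mmengine.config.Config`; the attribute paths are taken from the command itself, not verified against this exact checkout):

```python
from mmengine.config import Config

# Illustrative equivalent of the --cfg-options flags above (not how
# tools/train.py applies them internally):
cfg = Config.fromfile(
    'projects/BEVFusion/configs/'
    'bevfusion_lidar-cam_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py')
# Start from the lidar-only checkpoint.
cfg.load_from = ('work_dirs/bevfusion_lidar_voxel0075_second_secfpn_'
                 '8xb4-cyclic-20e_nus-3d/epoch_20.pth')
# Initialize the image backbone from the Swin-T nuImages checkpoint.
cfg.model.img_backbone.init_cfg.checkpoint = './swint-nuimages-pretrained.pth'
```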

Reproduces the problem - error message

File "tools/train.py", line 145, in
main()
File "tools/train.py", line 141, in main
runner.train()
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1789, in train
model = self.train_loop.run() # type: ignore
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 98, in run
self.run_epoch()
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 115, in run_epoch
self.run_iter(idx, data_batch)
File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 131, in run_iter
outputs = self.runner.model.train_step(
File "/opt/conda/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
losses = self._run_forward(data, mode='loss')
File "/opt/conda/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
results = self(**data, mode=mode)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/mmdet3d/models/detectors/base.py", line 75, in forward
return self.loss(inputs, data_samples, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 292, in loss
feats = self.extract_feat(batch_inputs_dict, batch_input_metas)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 268, in extract_feat
img_feature = self.extract_img_feat(imgs, deepcopy(points),
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 156, in extract_img_feat
x = self.view_transform(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 424, in forward
x = super().forward(*args, **kwargs)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 330, in forward
x = self.bev_pool(geom, x)
File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 140, in bev_pool
x = x[kept]
RuntimeError: CUDA out of memory. Tried to allocate 2.11 GiB (GPU 0; 31.73 GiB total capacity; 17.32 GiB already allocated; 990.94 MiB free; 18.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
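
As an aside, the allocator hint in the error message can be applied before CUDA is initialized; a minimal sketch (the 128 MiB value is an illustrative guess, not a tuned recommendation):

```python
import os

# Mitigation suggested by the error message itself: cap the block size the
# caching allocator may split, which can reduce fragmentation when reserved
# memory greatly exceeds allocated memory. Must be set before the first CUDA
# allocation, e.g. at the very top of tools/train.py or exported in the shell.
os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'max_split_size_mb:128')

import torch  # noqa: E402  (imported after the env var so the allocator sees it)
```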

Additional information

I think distributed training isn't working properly.
I am using the nuScenes dataset.
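
To check whether the ranks are actually pinned to separate GPUs, here is a quick standalone sketch (assuming the launcher exports LOCAL_RANK, which torch.distributed.launch and torchrun both do), run under the same dist_train.sh setup:

```python
import os
import torch

# Run this under the same launcher: if every rank reports device 0, the
# workers are all sharing GPU 0, which would match the nvidia-smi output below.
local_rank = int(os.environ.get('LOCAL_RANK', 0))
torch.cuda.set_device(local_rank)
print(f"rank={os.environ.get('RANK')} local_rank={local_rank} "
      f"current_device={torch.cuda.current_device()}")
```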

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 451731 C /opt/conda/bin/python 2216MiB |
| 0 N/A N/A 451732 C /opt/conda/bin/python 1246MiB |
| 0 N/A N/A 451736 C /opt/conda/bin/python 1104MiB |
| 0 N/A N/A 451737 C /opt/conda/bin/python 1120MiB |
| 0 N/A N/A 451739 C /opt/conda/bin/python 1184MiB |
| 0 N/A N/A 451741 C /opt/conda/bin/python 1222MiB |
| 0 N/A N/A 451743 C /opt/conda/bin/python 1176MiB |
| 0 N/A N/A 451745 C /opt/conda/bin/python 1114MiB |
| 1 N/A N/A 451732 C /opt/conda/bin/python 1768MiB |
| 2 N/A N/A 451736 C /opt/conda/bin/python 1768MiB |
| 3 N/A N/A 451737 C /opt/conda/bin/python 1768MiB |
| 4 N/A N/A 451739 C /opt/conda/bin/python 1768MiB |
| 5 N/A N/A 451741 C /opt/conda/bin/python 1768MiB |
| 6 N/A N/A 451743 C /opt/conda/bin/python 1768MiB |
| 7 N/A N/A 451745 C /opt/conda/bin/python 1648MiB |
+-----------------------------------------------------------------------------------------+

OOM occurs because memory usage piles up on GPU 0: as the output above shows, all eight worker processes hold memory on GPU 0 in addition to their own GPUs. What is causing this?
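
For reference, one general pattern that produces exactly this footprint (every rank holding an extra context on GPU 0) is deserializing a GPU-saved checkpoint without remapping it, so each worker first materializes the tensors on cuda:0. A hedged illustration of the pattern in plain PyTorch, not a claim about where mmengine or the BEVFusion code does this:

```python
import torch

ckpt_path = ('work_dirs/bevfusion_lidar_voxel0075_second_secfpn_'
             '8xb4-cyclic-20e_nus-3d/epoch_20.pth')

# Problematic pattern: a checkpoint saved from cuda:0 is deserialized back
# onto cuda:0 in every rank, leaving a stray CUDA context on GPU 0.
state = torch.load(ckpt_path)

# Safer pattern: force deserialization onto CPU (or the rank's own device)
# before loading the weights into the model.
state = torch.load(ckpt_path, map_location='cpu')
```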
