[Bug] Low reproducibility? Limit gpus? #32

Closed
mrsempress opened this issue Apr 3, 2024 · 9 comments
Comments

@mrsempress

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

System environment:
sys.platform: linux
Python: 3.8.17 (default, Jul 5 2023, 21:04:15) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 1551893665
GPU 0,1: NVIDIA A100-SXM4-80GB
CUDA_HOME: /mnt/lustre/share/cuda-11.0
NVCC: Cuda compilation tools, release 11.0, V11.0.221
GCC: gcc (GCC) 5.4.0
PyTorch: 1.12.1
PyTorch compiling details: PyTorch built with:

GCC 9.3

C++ Version: 201402

Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications

Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)

OpenMP 201511 (a.k.a. OpenMP 4.5)

LAPACK is enabled (usually provided by MKL)

NNPACK is enabled

CPU capability usage: AVX2

CUDA Runtime 11.3

NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37

CuDNN 8.3.2 (built against CUDA 11.5)

Magma 2.5.2

Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.13.1
OpenCV: 4.9.0
MMEngine: 0.10.3

Reproduces the problem - code sample

N/A

Reproduces the problem - command or script

sh tools/mv-grounding.sh

Reproduces the problem - error message

The reproduced results are:

AP25:
| Type | Easy | Hard | View-Dep | View-Indep | Unique | Multi | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| results | 0.2093 | 0.1840 | 0.1966 | 0.2129 | 0.0000 | 0.2073 | 0.2073 |

AP50:
| Type | Easy | Hard | View-Dep | View-Indep | Unique | Multi | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| results | 0.0535 | 0.0452 | 0.0581 | 0.0501 | 0.0000 | 0.0528 | 0.0528 |

But the results in the paper are:

AP25:
| Type | Easy | Hard | View-Dep | View-Indep | Overall |
| --- | --- | --- | --- | --- | --- |
| results | 0.2711 | 0.2012 | 0.2342 | 0.2637 | 0.2572 |

In addition, training only completes successfully when the number of GPUs is 8.
With 2 or 4 GPUs, the error from issue #30 sometimes occurs, and the error from issue #26 sometimes occurs.

Additional information

  1. Is there a limit on the number of GPUs, or is the failure random and training just happened to succeed with 8 GPUs?
  2. Were the visual grounding results reported in the paper obtained with the default config in tools/mv_grounding.sh, or was fcaf_coder added or were other parameters modified?
@Tai-Wang
Contributor

Tai-Wang commented Apr 3, 2024

You need to reduce the learning rate by a factor of 2 or 4 accordingly, because the actual batch size is only 1/2 or 1/4 of the one in our experiments. It should yield a comparable result once you adjust the optimizer setting, although we have not tried this ourselves.
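The adjustment above is the standard linear scaling rule. A minimal sketch, assuming a hypothetical base learning rate and per-GPU batch size (not the repo's actual values):

```python
def scaled_lr(base_lr: float, ref_total_batch: int, actual_total_batch: int) -> float:
    """Linearly scale the learning rate with the total (all-GPU) batch size."""
    return base_lr * actual_total_batch / ref_total_batch

# Reference run: 8 GPUs; reproduction: 2 GPUs with the same per-GPU batch,
# so the effective batch size is 1/4 of the reference -> divide the lr by 4.
base_lr = 2e-4  # illustrative value, not the repo's actual setting
lr_2gpu = scaled_lr(base_lr, ref_total_batch=8, actual_total_batch=2)
print(lr_2gpu)  # 5e-05
```

The same scaling would be applied to the `lr` field of the optimizer config before launching a 2- or 4-GPU run.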

@mrsempress
Author

When I reproduced it, I still used 8 GPUs, as written in your mv_grounding.sh, but the result was not good. When the number of GPUs changes, an error appears and training does not complete.

@Tai-Wang
Contributor

Tai-Wang commented Apr 3, 2024

Did you remove the pretrained checkpoint from the config? I noticed your result is lower than the performance we reported here. You could first reproduce the performance reported in our repo, because we re-split the training/val/test sets for the challenge, as explained here.

@mrsempress
Author

I removed the pretrained checkpoint from the config because I didn't know the pretrained weights were necessary, and I didn't see the detection branch contributing to the visual grounding branch in the pipeline.
I will obtain the pretrained weights and rerun the visual grounding task.
Thank you for your reply.

@Tai-Wang
Contributor

Tai-Wang commented Apr 3, 2024

OK. We found loading the pretrained detection checkpoint to be a helpful trick, as mentioned in BUTD-DETR. We look forward to your further feedback.

@ZCMax
Collaborator

ZCMax commented Apr 3, 2024

I removed the pretrained checkpoint from the config because I didn't know that pretrained weights were necessary and didn't see the detection branch's role on the visual grounding branch in the pipeline. I will try again to get the pre-training weights and redo the visual grounding task. Thank you for your reply.

Since the feature extraction pipeline is shared by the detection and visual grounding tasks, we can use the 3D detection pretrained checkpoint for weight initialization. This improves grounding performance and accelerates training convergence to some extent.
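The initialization described above can be sketched with plain PyTorch (a toy example; the module names are hypothetical, and in real configs MMEngine's `load_from` handles this). Loading with `strict=False` copies the shared feature-extractor weights and leaves the grounding-only head randomly initialized:

```python
import torch
import torch.nn as nn

class Detector(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)   # feature extractor shared with grounding

class Grounder(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)   # same shape as the detector's backbone
        self.text_head = nn.Linear(4, 2)  # grounding-only head, absent from the checkpoint

det, gnd = Detector(), Grounder()
# strict=False: copy matching keys (backbone.*), report the rest as missing
missing, unexpected = gnd.load_state_dict(det.state_dict(), strict=False)
# missing lists text_head.*; the backbone weights are now shared values
```

In an MMEngine-style config the equivalent is pointing `load_from` at the detection checkpoint; unmatched grounding keys keep their random initialization.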

@Tai-Wang
Contributor

Tai-Wang commented Apr 9, 2024

Close due to inactivity. Please feel free to reopen this issue if you have any further questions.

@Tai-Wang Tai-Wang closed this as completed Apr 9, 2024
@mrsempress
Author

After loading your checkpoint, the performance exceeded what your paper reported (+7.95%).

The results in the paper are:

AP25:
| Type | Easy | Hard | View-Dep | View-Indep | Overall |
| --- | --- | --- | --- | --- | --- |
| results | 0.2711 | 0.2012 | 0.2342 | 0.2637 | 0.2572 |

The reproduced results (with your checkpoint loaded) are:

AP25:
| Type | Easy | Hard | View-Dep | View-Indep | Unique | Multi | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| results | 0.3489 | 0.3018 | 0.3567 | 0.3277 | 0.0000 | 0.3377 | 0.3377 |

AP50:
| Type | Easy | Hard | View-Dep | View-Indep | Unique | Multi | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| results | 0.1168 | 0.0925 | 0.1127 | 0.1159 | 0.0000 | 0.1148 | 0.1148 |
  1. Another question: why is the Overall result identical to the Multi result?
  2. In addition, you mentioned that using the detection checkpoint is important; in my experiment it improved results by 13.04%. If the grounding checkpoint is in turn used to initialize detection, will there be a further improvement? If we keep looping the initialization, can we get better results?

@ZCMax
Collaborator

ZCMax commented Apr 11, 2024

After loading your checkpoint, the performance exceeded what your paper reported (+7.95%).

The results in the paper are:

AP25:
| Type | Easy | Hard | View-Dep | View-Indep | Overall |
| --- | --- | --- | --- | --- | --- |
| results | 0.2711 | 0.2012 | 0.2342 | 0.2637 | 0.2572 |

The reproduced results (with your checkpoint loaded) are:

AP25:
| Type | Easy | Hard | View-Dep | View-Indep | Unique | Multi | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| results | 0.3489 | 0.3018 | 0.3567 | 0.3277 | 0.0000 | 0.3377 | 0.3377 |

AP50:
| Type | Easy | Hard | View-Dep | View-Indep | Unique | Multi | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| results | 0.1168 | 0.0925 | 0.1127 | 0.1159 | 0.0000 | 0.1148 | 0.1148 |

  1. Another question: why is the Overall result identical to the Multi result?
  2. In addition, you mentioned that using the detection checkpoint is important; in my experiment it improved results by 13.04%. If the grounding checkpoint is in turn used to initialize detection, will there be a further improvement? If we keep looping the initialization, can we get better results?
  1. Since all the prompts belong to the Multi type, the Overall performance is exactly the same as Multi.
  2. Actually, one possible exploration is joint grounding and detection training, as illustrated in BUTD-DETR: reformulate the detection task as a category-prompt grounding task. This may boost both detection and grounding performance at the same time.
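The category-prompt reformulation mentioned above can be sketched as follows (a hypothetical illustration; the class names and span convention are made up for the example). Detection becomes grounding against a prompt that lists every category name, with each name's character span acting as the "referring expression" for that class:

```python
def detection_as_grounding_prompt(classes):
    """Join class names into one prompt and record each name's character span."""
    prompt = ". ".join(classes) + "."
    spans, start = {}, 0
    for name in classes:
        spans[name] = (start, start + len(name))  # boxes match back to these spans
        start += len(name) + 2                    # skip the ". " separator
    return prompt, spans

prompt, spans = detection_as_grounding_prompt(["chair", "table", "sofa"])
# prompt -> "chair. table. sofa."
# spans["table"] -> (7, 12)
```

A grounding model trained on such prompts sees detection as just another grounding query, which is how a joint objective could improve both tasks.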
