
[Bug] UPD - ValueError: Plane vertices are not coplanar. #40

Closed
3 tasks done
EricLee0224 opened this issue Apr 15, 2024 · 8 comments
EricLee0224 commented Apr 15, 2024

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

System environment:
sys.platform: linux
Python: 3.8.19 | packaged by conda-forge | (default, Mar 20 2024, 12:47:35) [GCC 12.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 545726448
GPU 0,1,2,3,4,5,6,7: NVIDIA RTX A6000
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.58
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 1.11.0
PyTorch compiling details: PyTorch built with:

  • GCC 7.3

  • C++ Version: 201402

  • Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications

  • Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)

  • OpenMP 201511 (a.k.a. OpenMP 4.5)

  • LAPACK is enabled (usually provided by MKL)

  • NNPACK is enabled

  • CPU capability usage: AVX2

  • CUDA Runtime 11.3

  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37

  • CuDNN 8.2

  • Magma 2.5.2

  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

    TorchVision: 0.12.0
    OpenCV: 4.9.0
    MMEngine: 0.10.3

Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 545726448
Distributed launcher: pytorch
Distributed training: True
GPU number: 8

Reproduces the problem - code sample

Reproduces the problem - command or script

3D mv-Det:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py --work-dir=work_dirs/mv-3ddet --launcher="pytorch"

3D mv-VG:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 tools/train.py configs/grounding/mv-grounding_8xb12_embodiedscan-vg-9dof.py --work-dir=work_dirs/mv-3dground --launcher="pytorch"

Reproduces the problem - error message

04/15 13:56:37 - mmengine - INFO - Checkpoints will be saved to /data/zyp/code/EmbodiedScan/work_dirs/mv-3dground.

/data/zyp/code/EmbodiedScan/embodiedscan/models/layers/fusion_layers/point_fusion.py:48: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
pcd_rotate_mat = (torch.tensor(img_meta['pcd_rotation'],
/data/zyp/code/EmbodiedScan/embodiedscan/models/layers/fusion_layers/point_fusion.py:48: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
pcd_rotate_mat = (torch.tensor(img_meta['pcd_rotation'],
/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmcv/cnn/bricks/transformer.py:524: UserWarning: position encoding of key ismissing in MultiheadAttention.
warnings.warn(f'position encoding of key is'
Traceback (most recent call last):
File "tools/train.py", line 133, in
main()
File "tools/train.py", line 129, in main
runner.train()
File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1777, in train
model = self.train_loop.run() # type: ignore
File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/loops.py", line 96, in run
self.run_epoch()
File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/loops.py", line 112, in run_epoch
self.run_iter(idx, data_batch)
File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/loops.py", line 128, in run_iter
outputs = self.runner.model.train_step(
File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
losses = self._run_forward(data, mode='loss')
File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
results = self(**data, mode=mode)
File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/data/zyp/code/EmbodiedScan/embodiedscan/models/detectors/sparse_featfusion_grounder.py", line 666, in forward
return self.loss(inputs, data_samples, **kwargs)
File "/data/zyp/code/EmbodiedScan/embodiedscan/models/detectors/sparse_featfusion_grounder.py", line 507, in loss
losses = self.bbox_head.loss(**head_inputs_dict,
File "/data/zyp/code/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 637, in loss
losses = self.loss_by_feat(*loss_inputs)
File "/data/zyp/code/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 668, in loss_by_feat
losses_cls, losses_bbox = multi_apply(
File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmdet/models/utils/misc.py", line 219, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/data/zyp/code/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 711, in loss_by_feat_single
cls_reg_targets = self.get_targets(cls_scores_list,
File "/data/zyp/code/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 258, in get_targets
pos_inds_list, neg_inds_list) = multi_apply(self._get_targets_single,
File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmdet/models/utils/misc.py", line 219, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/data/zyp/code/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 398, in _get_targets_single
assign_result = self.assigner.assign(
File "/data/zyp/code/EmbodiedScan/embodiedscan/models/task_modules/assigners/hungarian_assigner.py", line 113, in assign
cost = match_cost(pred_instances=pred_instances_3d,
File "/data/zyp/code/EmbodiedScan/embodiedscan/models/losses/match_cost.py", line 108, in call
overlaps = pred_bboxes.overlaps(pred_bboxes, gt_bboxes)
File "/data/zyp/code/EmbodiedScan/embodiedscan/structures/bbox_3d/euler_box3d.py", line 134, in overlaps
_, iou3d = box3d_overlap(corners1, corners2, eps=eps)
File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/pytorch3d/ops/iou_box3d.py", line 160, in box3d_overlap
_check_coplanar(boxes2, eps)
File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/pytorch3d/ops/iou_box3d.py", line 66, in _check_coplanar
raise ValueError(msg)
ValueError: Plane vertices are not coplanar

Additional information

I can run the 3D mv-det task very smoothly in both training and testing. However, when I run the 3D mv-VG task in the same environment with 8*A6000 (48G), it always encounters a ValueError: Plane vertices are not coplanar in the first epoch.

I have checked the related issues #22, #32, #30, facebookresearch/pytorch3d/issues/992, and facebookresearch/pytorch3d/issues/1771.

I have also tried the following solutions:

  1. Modifying eps in box3d_overlap with values like 1e-2, 1e-3, 1e-4, and 1e-5 (see the call sketch below).
  2. Changing the learning rate (lr) in the training script to values like 5e-2 and 5e-4.
  3. Training both with and without the detection checkpoint.
  4. Using 2xA6000, 4xA6000, and 8xA6000.
  5. Using --resume and --resume auto.

However, none of these solutions have worked so far. Could anyone please share how to solve this issue or provide a successful environment setup? Will the team look into this matter? Many thanks.
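For reference, the eps in attempt 1 is the tolerance argument of pytorch3d's box3d_overlap, which the traceback above shows being called from embodiedscan/structures/bbox_3d/euler_box3d.py. A minimal sketch of the call, using pytorch3d's documented unit-cube corner ordering as illustrative input:

import torch
from pytorch3d.ops import box3d_overlap

# Unit-cube corners in pytorch3d's documented (N, 8, 3) vertex order.
corners = torch.tensor([[
    [0., 0., 0.], [1., 0., 0.], [1., 1., 0.], [0., 1., 0.],
    [0., 0., 1.], [1., 0., 1.], [1., 1., 1.], [0., 1., 1.],
]])
vol, iou = box3d_overlap(corners, corners, eps=1e-4)
print(iou)  # identical boxes -> IoU of 1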

mxh1999 (Collaborator) commented Apr 16, 2024

I completely understand your frustration with this situation.

Based on my understanding, this issue often arises when one of the predicted boxes has a side length that is too short.

Here are some possible solutions:

  1. Adjust the 'val_interval' in the configuration to avoid evaluating the model while it is still unstable.
  2. Clamp any predicted box whose side length falls below a small threshold during evaluation (see the sketch below).
  3. Simply remove the '_check_coplanar' and '_check_nonzero' checks.

I hope this helps!
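For suggestion 2, a minimal sketch of what such a clamp could look like, assuming 9-DoF boxes stored as (x, y, z, dx, dy, dz, alpha, beta, gamma) rows; the function name and the 0.02 m threshold are illustrative, not values from the EmbodiedScan codebase:

import torch

def clamp_box_sizes(bboxes: torch.Tensor, min_size: float = 0.02) -> torch.Tensor:
    """Floor the size columns (3:6) of a 9-DoF box tensor so that no edge
    is short enough to trip pytorch3d's coplanarity/area checks."""
    bboxes = bboxes.clone()
    bboxes[..., 3:6] = bboxes[..., 3:6].clamp(min=min_size)
    return bboxes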

mxh1999 self-assigned this Apr 16, 2024
iris0329 commented Apr 16, 2024

I ran into the same issue when training multi-view 3D grounding on A6000s, while computing the cost of the Hungarian assignment during training.

I think it's inappropriate to simply modify eps in box3d_overlap, because the same eps is shared by the '_check_coplanar' and '_check_nonzero' checks, where it bounds

  1. the maximum dot product of an edge vector with the face normal (coplanarity), and
  2. the minimum area of each face of the box,

respectively. Relaxing the first check calls for a larger eps, while keeping the second strict calls for a smaller one, so a single shared value cannot satisfy both.

So setting as follows fixes my issue:

# file: site-packages/pytorch3d/ops/iou_box3d.py (inside box3d_overlap)
_check_coplanar(boxes1, 1e-2)
_check_coplanar(boxes2, 1e-2)
_check_nonzero(boxes1, 1e-4)
_check_nonzero(boxes2, 1e-4)

Basically, this is similar to the idea of just removing the '_check_coplanar' and '_check_nonzero' checks, as suggested above. But I'm not sure whether there are any serious consequences to relaxing these limits.
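For intuition about what the first limit measures, here is a paraphrase of the coplanarity test (a sketch of the idea, not the exact pytorch3d implementation): the fourth vertex of each box face should lie in the plane spanned by the first three, so its projection onto the unit face normal should be near zero.

import torch

def coplanarity_residual(face: torch.Tensor) -> torch.Tensor:
    """face: (4, 3) vertices of one box face. Returns the distance of the
    fourth vertex from the plane through the first three; the check fails
    when an analogous residual exceeds eps."""
    e1 = face[1] - face[0]
    e2 = face[2] - face[0]
    normal = torch.cross(e1, e2, dim=0)
    normal = normal / normal.norm().clamp(min=1e-12)
    return torch.dot(face[3] - face[0], normal).abs()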

EricLee0224 (Author) commented:

@mxh1999 @iris0329 Thanks for your further solutions! I will try them and update the issue if there is any progress.

EricLee0224 (Author) commented:

> [quoting @iris0329's eps workaround from the comment above]

NB, it really works. While some checks have been removed, the val/test results are not significantly different from the official reports. This seems to be an acceptable solution.
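For anyone who would rather not edit files inside site-packages, the same relaxation can be applied by monkey-patching the two private check functions at startup. A minimal sketch, assuming the pytorch3d 0.7-era layout in which box3d_overlap resolves these names from its module globals at call time (the names are private API, so verify them against your installed version):

import pytorch3d.ops.iou_box3d as iou_box3d

_orig_coplanar = iou_box3d._check_coplanar
_orig_nonzero = iou_box3d._check_nonzero

# Relax only the coplanarity tolerance; keep the zero-area tolerance tight.
iou_box3d._check_coplanar = lambda boxes, eps=1e-4: _orig_coplanar(boxes, 1e-2)
iou_box3d._check_nonzero = lambda boxes, eps=1e-4: _orig_nonzero(boxes, 1e-4)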

iris0329 commented:

Hi @EricLee0224, can you share your reproduced results? The val/test results I got are lower than the official ones.

EricLee0224 commented Apr 17, 2024

Sure. I think minor fluctuations in 3DVG results are normal. You can run multiple experiments to observe the variance, but these fluctuations seem negligible and not worth focusing on. Since the baseline numbers are not very high, the more important direction is exploring better model approaches that improve performance significantly (maybe +5~10 points?).
[image: screenshot of reproduced val/test results]

iris0329 commented:

I see, thank you! I got similar results.

chinagalaxy2002 commented:

> [quoting @iris0329's eps workaround and @EricLee0224's confirmation from the comments above]

System environment:
sys.platform: linux
Python: 3.8.20 | CPython (default, Oct 3 2024, 15:24:27)
OS: Linux 5.15.0-78-generic #85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023
CUDA available: True
MUSA available: False
GPU count: 1
GPU 0: NVIDIA vGPU-32GB
CUDA version: 11.3
CUDNN version: 8200
PyTorch build details:
PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.3
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.2
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
    ...
    NumPy: 1.21.6
    TorchVision: 0.12.0
    OpenCV: 4.10.0
    Distributed training: True

After applying the suggested modification in site-packages/pytorch3d/ops/iou_box3d.py:

_check_coplanar(boxes1, 1e-2)
_check_coplanar(boxes2, 1e-2)
_check_nonzero(boxes1, 1e-4)
_check_nonzero(boxes2, 1e-4)

the problem still occurs, now as:
ValueError: Planes have zero areas

I tried removing that special-case check as well:

if (face_areas < eps).any().item():
    msg = "Planes have zero areas"
    raise ValueError(msg)

Running the demo then produces the following:
[image: demo output screenshot]
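Rather than deleting the zero-area check outright, a gentler fallback is to catch the ValueError and treat the offending pairs as non-overlapping. A minimal sketch (the wrapper name and the zero-IoU fallback are illustrative choices, not part of EmbodiedScan or pytorch3d):

import torch
from pytorch3d.ops import box3d_overlap

def safe_box3d_overlap(corners1: torch.Tensor,
                       corners2: torch.Tensor,
                       eps: float = 1e-4):
    """Wrap box3d_overlap; when pytorch3d rejects degenerate boxes, return
    zero volume/IoU ((N, M) tensors) instead of crashing. Degenerate
    predictions then get maximal matching cost in the Hungarian assignment."""
    try:
        return box3d_overlap(corners1, corners2, eps=eps)
    except ValueError:
        n, m = corners1.shape[0], corners2.shape[0]
        zeros = corners1.new_zeros((n, m))
        return zeros, zeros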
