
Error Code 4: Miscellaneous (IShuffleLayer Reshape_427: reshape changes volume. Reshaping [900,1,256] to [900,7200,32].) #2245

Closed
liangguixing95 opened this issue Aug 15, 2022 · 18 comments

@liangguixing95

Hello, when I converted my ONNX model to TensorRT with the command
./trtexec --onnx=model.onnx --saveEngine=model.engine
I got a large difference between the PyTorch result and the TRT result. I traced the problem to the decoder transformer part of my model, so I converted only the transformer part to ONNX to try to find out what was wrong. But when I ran ./trtexec --onnx=decoder_transformer.onnx --saveEngine=decoder_transformer.engine to convert that ONNX model to TRT, I got an error that did not appear when converting "model.onnx":
(screenshot: error message)
The error comes from the cross-attention part, but it disappears when I convert only the cross-attention module to ONNX and TRT with ./trtexec --onnx=cross_attention.onnx --saveEngine=cross_attention.engine. So I cannot figure out how to solve the problem and get a correct TRT result, and I am opening an issue for help. Thanks~

Environment
TensorRT Version: 8.4.1.5+cuda11.6
NVIDIA GPU: A100
NVIDIA Driver Version: 510.47.03
CUDA Version: 11.6
CUDNN Version: 8.4.0.27
Operating System: Ubuntu 20.04.2 LTS
Python Version: 3.7.13
PyTorch Version: 1.10

@zerollzeng
Collaborator

Usually this happens when your model has a dynamic input shape and a fixed reshape operation. Can you check that first?
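
For illustration, a minimal sketch (not taken from the model in this issue) of how a hard-coded reshape gets baked into the exported ONNX graph as a fixed volume, while deriving the shape from the input keeps it compatible with dynamic axes:

import torch
import torch.nn as nn

class FixedReshape(nn.Module):
    def forward(self, x):
        # The traced graph records the literal target shape (2, 6);
        # any input with a different volume then fails in TensorRT.
        return x.reshape(2, 6)

class DynamicReshape(nn.Module):
    def forward(self, x):
        # Deriving the shape from the input keeps the reshape
        # volume-preserving for any batch size.
        return x.reshape(x.size(0), -1)

dummy = torch.randn(2, 3, 2)
torch.onnx.export(FixedReshape(), dummy, "fixed_reshape.onnx",
                  input_names=["input"], dynamic_axes={"input": {0: "batch"}})
torch.onnx.export(DynamicReshape(), dummy, "dynamic_reshape.onnx",
                  input_names=["input"], dynamic_axes={"input": {0: "batch"}})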

@zerollzeng zerollzeng self-assigned this Aug 15, 2022
@zerollzeng zerollzeng added the triaged Issue has been triaged by maintainers label Aug 15, 2022
@frankvp11

frankvp11 commented Aug 16, 2022

I got the same error. What do you want me to check? @zerollzeng
Edit: I am training using the balloon example (I don't know where the link is anymore) and used their dataset and configurations.

@zerollzeng
Collaborator

Check the ONNX model first, e.g. run it with ONNX Runtime with preset input shapes.
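
For example, a minimal sketch (the file name is taken from this issue and the input shape is just a placeholder; use whatever your model actually expects):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("decoder_transformer.onnx")
inp = sess.get_inputs()[0]
print(inp.name, inp.shape)  # inspect the declared input shape

# Feed a fixed-shape dummy input and check that the output shapes look sane.
data = np.random.randn(900, 1, 256).astype(np.float32)
outputs = sess.run(None, {inp.name: data})
for out in outputs:
    print(out.shape)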

@zerollzeng
Collaborator

The problem here is simple: suppose you have a reshape layer that reshapes a tensor to 2x6, and its input has shape a x b; then a x b must equal 2x6 = 12.
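
Applied to the shapes in the error above: [900,1,256] has volume 900*1*256 = 230,400 while [900,7200,32] has volume 900*7200*32 = 207,360,000, so the reshape cannot be valid. A quick sanity check outside TensorRT:

import numpy as np

src = np.prod([900, 1, 256])     # 230400
dst = np.prod([900, 7200, 32])   # 207360000
print(src, dst, src == dst)      # the volumes differ, so the reshape is invalid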

@frankvp11

Yeah, I made another issue explaining my problem in more detail, but I already knew what you meant. I'll check it later with ONNX Runtime.

@liangguixing95
Author

liangguixing95 commented Aug 19, 2022

I've found the reason, and it is related to the layer norm. In my model, the input of the LN is a tensor of shape [900,1,256], and the LN is called as nn.functional.layer_norm(input, [256,]). The output of the PyTorch version is fine, but the ONNX export gives a wrong output shape of [900,900,256]. I fixed the problem by changing the call to nn.functional.layer_norm(input, [1, 256]). You can check whether your code has the same problem @frankvp11
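
For reference, a minimal sketch of the two calls (the shapes are taken from the comment above; nothing else here is from the actual model). With a size-1 middle dimension both forms normalize the same 256 values, so they match numerically in PyTorch, and the difference the author saw only shows up in the exported ONNX graph:

import torch
import torch.nn.functional as F

x = torch.randn(900, 1, 256)

# Original call: normalize over the last dimension only.
y1 = F.layer_norm(x, [256])

# Revised call from the comment above: normalize over the last two dimensions.
y2 = F.layer_norm(x, [1, 256])

print(torch.allclose(y1, y2, atol=1e-6))  # True
print(y1.shape, y2.shape)                 # both torch.Size([900, 1, 256])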

@liangguixing95
Author

liangguixing95 commented Aug 19, 2022

I've fixed the shape error but ran into another problem: the outputs of the ONNX model and the TRT FP32 engine are quite different after the torch.bmm operator in the cross-attention module.
(screenshot: bmm outputs)
I compared the outputs of q, k, and attn from ONNX and TRT and printed the max diff of each pair. q and k are identical, but attn is quite different, as shown below. I have no idea how to solve this. @zerollzeng
(screenshot: diff comparison log)
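
For anyone who wants to reproduce this kind of comparison, a minimal sketch (the .npy file names are hypothetical; substitute whatever you dumped from the ONNX Runtime and TensorRT runs):

import numpy as np

for name in ["q", "k", "attn"]:
    onnx_out = np.load(f"onnx_{name}.npy")  # hypothetical dump from ONNX Runtime
    trt_out = np.load(f"trt_{name}.npy")    # hypothetical dump from TensorRT
    diff = np.abs(onnx_out - trt_out)
    print(f"{name}: max diff = {diff.max():.6e}, mean diff = {diff.mean():.6e}")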

@frankvp11

I'm working with Detectron2, so it's not realistic for me to edit the source code.

@zerollzeng
Collaborator

I compared the outputs of q, k, and attn from ONNX and TRT and printed the max diff of each pair. q and k are identical, but attn is quite different, as shown below. I have no idea how to solve this.

Can you provide a repro so that I can check it on my side? I would prefer a minimal ONNX model.

@liangguixing95
Author

https://drive.google.com/drive/folders/13LGb4uCEzrLV4k1dRa9FBHPnrrAwXfSf?usp=sharing
Here are the ONNX model and some debug inputs I used to produce the diff comparison log.

@zerollzeng
Collaborator

zerollzeng commented Aug 22, 2022

I can't reproduce it using Polygraphy; all outputs match:

[I] Accuracy Comparison | trt-runner-N0-08/22/22-15:50:44 vs. onnxrt-runner-N0-08/22/22-15:50:44
[I]     Comparing Output: '72' (dtype=float32, shape=(8, 900, 32)) with '72' (dtype=float32, shape=(8, 900, 32))
[I]     Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-08/22/22-15:50:44: 72 | Stats: mean=-0.0027745, std-dev=0.1346, var=0.018118, median=-7.5492e-05, min=-0.53595 at (2, 16, 0), max=0.58039 at (2, 300, 21), avg-magnitude=0.10865
[I]         onnxrt-runner-N0-08/22/22-15:50:44: 72 | Stats: mean=-0.0027745, std-dev=0.1346, var=0.018118, median=-7.5492e-05, min=-0.53595 at (2, 16, 0), max=0.58039 at (2, 300, 21), avg-magnitude=0.10865
[I]         Error Metrics: 72
[I]             Minimum Required Tolerance: elemwise error | [abs=0] OR [rel=0] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0), max=0 at (0, 0, 0), avg-magnitude=0
[I]             Relative Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0), max=0 at (0, 0, 0), avg-magnitude=0
[I]         PASSED | Difference is within tolerance (rel=1e-05, abs=1e-05)
[I]     Comparing Output: '73' (dtype=float32, shape=(8, 12000, 32)) with '73' (dtype=float32, shape=(8, 12000, 32))
[I]     Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-08/22/22-15:50:44: 73 | Stats: mean=0.062328, std-dev=0.72619, var=0.52735, median=0.055339, min=-3.2914 at (3, 5027, 19), max=3.1621 at (1, 3771, 3), avg-magnitude=0.5761
[I]         onnxrt-runner-N0-08/22/22-15:50:44: 73 | Stats: mean=0.062328, std-dev=0.72619, var=0.52735, median=0.055339, min=-3.2914 at (3, 5027, 19), max=3.1621 at (1, 3771, 3), avg-magnitude=0.5761
[I]         Error Metrics: 73
[I]             Minimum Required Tolerance: elemwise error | [abs=0] OR [rel=0] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0), max=0 at (0, 0, 0), avg-magnitude=0
[I]             Relative Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0), max=0 at (0, 0, 0), avg-magnitude=0
[I]         PASSED | Difference is within tolerance (rel=1e-05, abs=1e-05)
[I]     Comparing Output: '76' (dtype=float32, shape=(8, 900, 12000)) with '76' (dtype=float32, shape=(8, 900, 12000))
[I]     Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-08/22/22-15:50:44: 76 | Stats: mean=-0.24013, std-dev=0.44643, var=0.1993, median=-0.23786, min=-3.2709 at (2, 191, 11177), max=2.4214 at (1, 174, 3771), avg-magnitude=0.40642
[I]         onnxrt-runner-N0-08/22/22-15:50:44: 76 | Stats: mean=-0.24013, std-dev=0.44643, var=0.1993, median=-0.23786, min=-3.2709 at (2, 191, 11177), max=2.4214 at (1, 174, 3771), avg-magnitude=0.40642
[I]         Error Metrics: 76
[I]             Minimum Required Tolerance: elemwise error | [abs=0] OR [rel=0] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0), max=0 at (0, 0, 0), avg-magnitude=0
[I]             Relative Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0), max=0 at (0, 0, 0), avg-magnitude=0
[I]         PASSED | Difference is within tolerance (rel=1e-05, abs=1e-05)
[I]     PASSED | All outputs matched | Outputs: ['72', '73', '76']
[I] PASSED | Command: /usr/local/bin/polygraphy run module.onnx --trt --onnxrt

@zerollzeng
Collaborator

A suggestion: after constant folding, the network structure is simpler:
(screenshot: folded network structure)

polygraphy surgeon sanitize module.onnx --fold-constants -o module_folded.onnx

@frankvp11

@zerollzeng does constant folding make the model better/faster?

@liangguixing95
Author

@zerollzeng does constant folding make the model better/faster?
Constant folding brings some performance degradation in my case. The ONNX file I provided is a minimal part of the cross-attention module in my model. Running the ONNX model with Polygraphy suggests there may be no problem, but when using real data, the max diff of the outputs is quite large, as the log above shows.

@zerollzeng
Collaborator

Constant folding brings some performance degradation in my case. The ONNX file I provided is a minimal part of the cross-attention module in my model. Running the ONNX model with Polygraphy suggests there may be no problem, but when using real data, the max diff of the outputs is quite large, as the log above shows.

Are you using real data for the input? The difference might be caused by your input data; e.g. if you feed random binary data, the values can be very large, like 1e+6.
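
One quick way to rule that out is to check the statistics of the inputs being fed (a minimal sketch; "debug_input.npy" is a placeholder for your own dumped input):

import numpy as np

data = np.load("debug_input.npy")  # placeholder for your real/debug input
print("dtype:", data.dtype, "shape:", data.shape)
print("min:", data.min(), "max:", data.max(), "mean:", data.mean())
# Values on the order of 1e+6 (e.g. random bytes reinterpreted as floats)
# will amplify small per-layer differences into large output diffs.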

@ttyio
Collaborator

ttyio commented Dec 6, 2022

Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions, thanks!

@ttyio ttyio closed this as completed Dec 6, 2022
@fanchuanster

Use NGC pytorch:22.12-py3 instead of pytorch:22.07-py3 to fix “Error Code 4: Miscellaneous (IShuffleLayer Reshape_179: reshape changes volume. Reshaping [784] to [1])"

@lix19937

lix19937 commented May 11, 2024

I also came across this problem:

[05/11/2024-15:07:32] [V] [TRT] Insert CopyNode after ConstantNode that produces a Myelin graph output: 25021
[05/11/2024-15:07:33] [E] Error[4]: [shapeCompiler.cpp::evaluateShapeChecks::1180] Error Code 4: Internal Error (kOPT values for profile 0 violate shape constraints: IShuffleLayer Reshape_1933: reshaping failed for tensor: 3516 Reshape would change volume.)
[05/11/2024-15:07:33] [E] Error[2]: [builder.cpp::buildSerializedNetwork::743] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[05/11/2024-15:07:33] [E] Engine could not be created from network
[05/11/2024-15:07:33] [E] Building engine failed
[05/11/2024-15:07:33] [E] Failed to create engine from model or file.
[05/11/2024-15:07:33] [E] Engine set up failed

The ONNX model's inputs all have fixed shapes, but the inner network has data-dependent ops like NonZero. If I replace all the code related to data-dependent operations with plugin implementations, the errors no longer occur.
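
For illustration, a generic sketch (not from the actual network) of why data-dependent ops are problematic: the output shape of NonZero depends on the tensor values, so a later fixed reshape cannot be proven volume-preserving at engine-build time:

import torch

x = torch.tensor([[0.0, 1.5, 0.0],
                  [2.0, 0.0, 3.0]])

# The number of rows returned by nonzero() depends on the values of x,
# not just its shape, so downstream shapes are only known at runtime.
idx = torch.nonzero(x)
print(idx.shape)  # torch.Size([3, 2]) for this particular input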
