
Error Code 4: Miscellaneous (IShuffleLayer Reshape_427: reshape changes volume. Reshaping [900,1,256] to [900,7200,32].) #2245

Closed
liangguixing95 opened this issue Aug 15, 2022 · 18 comments

@liangguixing95

Hello, when I converted my ONNX model to TensorRT with the command
./trtexec --onnx=model.onnx --saveEngine=model.engine
I got a large difference between the PyTorch result and the TRT result. I traced the problem to the decoder transformer part of my model, so I converted only the transformer part to ONNX to try to find out what was wrong. But when I ran ./trtexec --onnx=decoder_transformer.onnx --saveEngine=decoder_transformer.engine to convert that ONNX model to TRT, I got an error that did not appear when converting "model.onnx":
(screenshot: error message)
The error comes from the cross-attention part, but it disappears when I convert only the cross-attention module to ONNX and TRT with ./trtexec --onnx=cross_attention.onnx --saveEngine=cross_attention.engine. So I cannot figure out how to solve the problem and get a correct TRT result, and I am opening an issue for help. Thanks~

Environment
TensorRT Version: 8.4.1.5+cuda11.6
NVIDIA GPU: A100
NVIDIA Driver Version: 510.47.03
CUDA Version: 11.6
CUDNN Version: 8.4.0.27
Operating System: Ubuntu 20.04.2 LTS
Python Version: 3.7.13
PyTorch Version: 1.10

@zerollzeng
Collaborator

Usually this happens when your model has a dynamic input shape and a fixed reshape operation. Can you check that first?
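
For illustration, a minimal sketch (not taken from the model in this issue) of how a hard-coded reshape gets baked into the exported ONNX graph as a fixed volume, while deriving the shape from the input keeps it compatible with dynamic axes:

import torch
import torch.nn as nn

class FixedReshape(nn.Module):
    def forward(self, x):
        # The traced graph records the literal target shape (2, 6);
        # any input with a different volume then fails in TensorRT.
        return x.reshape(2, 6)

class DynamicReshape(nn.Module):
    def forward(self, x):
        # Deriving the shape from the input keeps the reshape
        # volume-preserving for any batch size.
        return x.reshape(x.size(0), -1)

dummy = torch.randn(2, 3, 2)
torch.onnx.export(FixedReshape(), dummy, "fixed_reshape.onnx",
                  input_names=["input"], dynamic_axes={"input": {0: "batch"}})
torch.onnx.export(DynamicReshape(), dummy, "dynamic_reshape.onnx",
                  input_names=["input"], dynamic_axes={"input": {0: "batch"}})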

@zerollzeng zerollzeng self-assigned this Aug 15, 2022
@zerollzeng zerollzeng added the triaged Issue has been triaged by maintainers label Aug 15, 2022
@frankvp11

frankvp11 commented Aug 16, 2022

I got the same error. What do you want me to check? @zerollzeng
Edit: I am training using the balloon example (I don't know where the link is anymore) and used their dataset and configurations.

@zerollzeng
Collaborator

Check the ONNX model first, e.g. run it with ONNX Runtime with preset input shapes.
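
For example, a minimal sketch (the file name is taken from this issue and the input shape is just a placeholder; use whatever your model actually expects):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("decoder_transformer.onnx")
inp = sess.get_inputs()[0]
print(inp.name, inp.shape)  # inspect the declared input shape

# Feed a fixed-shape dummy input and check that the output shapes look sane.
data = np.random.randn(900, 1, 256).astype(np.float32)
outputs = sess.run(None, {inp.name: data})
for out in outputs:
    print(out.shape)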

@zerollzeng
Collaborator

The problem here is simple: suppose you have a reshape layer that reshapes a tensor to 2x6, and its input has shape a x b; then a x b must equal 2x6 = 12.
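
Applied to the shapes in the error above: [900,1,256] has volume 900*1*256 = 230,400 while [900,7200,32] has volume 900*7200*32 = 207,360,000, so the reshape cannot be valid. A quick sanity check outside TensorRT:

import numpy as np

src = np.prod([900, 1, 256])     # 230400
dst = np.prod([900, 7200, 32])   # 207360000
print(src, dst, src == dst)      # the volumes differ, so the reshape is invalid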

@frankvp11

Yeah, I made another issue explaining my problem in more detail, but I already knew what you meant. I'll check it later with ONNX Runtime.

@liangguixing95
Author

liangguixing95 commented Aug 19, 2022

I've found the reason, and it is related to the layer norm. In my model, the input of the LN is a tensor of shape [900,1,256], and the LN is called as nn.functional.layer_norm(input, [256,]). The output of the PyTorch version is fine, but the ONNX export gives a wrong output shape of [900,900,256]. I fixed the problem by changing the call to nn.functional.layer_norm(input, [1, 256]). You can check whether your code has the same problem @frankvp11
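
For reference, a minimal sketch of the two calls (the shapes are taken from the comment above; nothing else here is from the actual model). With a size-1 middle dimension both forms normalize the same 256 values, so they match numerically in PyTorch, and the difference the author saw only shows up in the exported ONNX graph:

import torch
import torch.nn.functional as F

x = torch.randn(900, 1, 256)

# Original call: normalize over the last dimension only.
y1 = F.layer_norm(x, [256])

# Revised call from the comment above: normalize over the last two dimensions.
y2 = F.layer_norm(x, [1, 256])

print(torch.allclose(y1, y2, atol=1e-6))  # True
print(y1.shape, y2.shape)                 # both torch.Size([900, 1, 256])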

@liangguixing95
Author

liangguixing95 commented Aug 19, 2022

I've fixed the shape error but ran into another problem: the outputs of the ONNX model and the TRT FP32 engine are quite different after the torch.bmm operator in the cross-attention module.
(screenshot: bmm outputs)
I compared the outputs of q, k, and attn from ONNX and TRT and printed the max diff of each pair. q and k are identical, but attn is quite different, as shown below. I have no idea how to solve this. @zerollzeng
(screenshot: diff comparison log)
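
For anyone who wants to reproduce this kind of comparison, a minimal sketch (the .npy file names are hypothetical; substitute whatever you dumped from the ONNX Runtime and TensorRT runs):

import numpy as np

for name in ["q", "k", "attn"]:
    onnx_out = np.load(f"onnx_{name}.npy")  # hypothetical dump from ONNX Runtime
    trt_out = np.load(f"trt_{name}.npy")    # hypothetical dump from TensorRT
    diff = np.abs(onnx_out - trt_out)
    print(f"{name}: max diff = {diff.max():.6e}, mean diff = {diff.mean():.6e}")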

@frankvp11

I'm working with Detectron2, so it's not realistic for me to edit the source code.

@zerollzeng
Collaborator

I compared the outputs of q, k, and attn from ONNX and TRT and printed the max diff of each pair. q and k are identical, but attn is quite different, as shown below. I have no idea how to solve this.

Can you provide a repro so that I can check it on my side? I would prefer a minimal ONNX model.

@liangguixing95
Author

https://drive.google.com/drive/folders/13LGb4uCEzrLV4k1dRa9FBHPnrrAwXfSf?usp=sharing
Here are the ONNX model and some debug inputs I used to produce the diff comparison log.

@zerollzeng
Collaborator

zerollzeng commented Aug 22, 2022

I can't reproduce it using Polygraphy; all outputs match:

[I] Accuracy Comparison | trt-runner-N0-08/22/22-15:50:44 vs. onnxrt-runner-N0-08/22/22-15:50:44
[I]     Comparing Output: '72' (dtype=float32, shape=(8, 900, 32)) with '72' (dtype=float32, shape=(8, 900, 32))
[I]     Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-08/22/22-15:50:44: 72 | Stats: mean=-0.0027745, std-dev=0.1346, var=0.018118, median=-7.5492e-05, min=-0.53595 at (2, 16, 0), max=0.58039 at (2, 300, 21), avg-magnitude=0.10865
[I]         onnxrt-runner-N0-08/22/22-15:50:44: 72 | Stats: mean=-0.0027745, std-dev=0.1346, var=0.018118, median=-7.5492e-05, min=-0.53595 at (2, 16, 0), max=0.58039 at (2, 300, 21), avg-magnitude=0.10865
[I]         Error Metrics: 72
[I]             Minimum Required Tolerance: elemwise error | [abs=0] OR [rel=0] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0), max=0 at (0, 0, 0), avg-magnitude=0
[I]             Relative Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0), max=0 at (0, 0, 0), avg-magnitude=0
[I]         PASSED | Difference is within tolerance (rel=1e-05, abs=1e-05)
[I]     Comparing Output: '73' (dtype=float32, shape=(8, 12000, 32)) with '73' (dtype=float32, shape=(8, 12000, 32))
[I]     Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-08/22/22-15:50:44: 73 | Stats: mean=0.062328, std-dev=0.72619, var=0.52735, median=0.055339, min=-3.2914 at (3, 5027, 19), max=3.1621 at (1, 3771, 3), avg-magnitude=0.5761
[I]         onnxrt-runner-N0-08/22/22-15:50:44: 73 | Stats: mean=0.062328, std-dev=0.72619, var=0.52735, median=0.055339, min=-3.2914 at (3, 5027, 19), max=3.1621 at (1, 3771, 3), avg-magnitude=0.5761
[I]         Error Metrics: 73
[I]             Minimum Required Tolerance: elemwise error | [abs=0] OR [rel=0] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0), max=0 at (0, 0, 0), avg-magnitude=0
[I]             Relative Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0), max=0 at (0, 0, 0), avg-magnitude=0
[I]         PASSED | Difference is within tolerance (rel=1e-05, abs=1e-05)
[I]     Comparing Output: '76' (dtype=float32, shape=(8, 900, 12000)) with '76' (dtype=float32, shape=(8, 900, 12000))
[I]     Tolerance: [abs=1e-05, rel=1e-05] | Checking elemwise error
[I]         trt-runner-N0-08/22/22-15:50:44: 76 | Stats: mean=-0.24013, std-dev=0.44643, var=0.1993, median=-0.23786, min=-3.2709 at (2, 191, 11177), max=2.4214 at (1, 174, 3771), avg-magnitude=0.40642
[I]         onnxrt-runner-N0-08/22/22-15:50:44: 76 | Stats: mean=-0.24013, std-dev=0.44643, var=0.1993, median=-0.23786, min=-3.2709 at (2, 191, 11177), max=2.4214 at (1, 174, 3771), avg-magnitude=0.40642
[I]         Error Metrics: 76
[I]             Minimum Required Tolerance: elemwise error | [abs=0] OR [rel=0] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0), max=0 at (0, 0, 0), avg-magnitude=0
[I]             Relative Difference | Stats: mean=0, std-dev=0, var=0, median=0, min=0 at (0, 0, 0), max=0 at (0, 0, 0), avg-magnitude=0
[I]         PASSED | Difference is within tolerance (rel=1e-05, abs=1e-05)
[I]     PASSED | All outputs matched | Outputs: ['72', '73', '76']
[I] PASSED | Command: /usr/local/bin/polygraphy run module.onnx --trt --onnxrt

@zerollzeng
Collaborator

A suggestion: after constant folding, the network structure is simpler:
(screenshot: folded network structure)

polygraphy surgeon sanitize module.onnx --fold-constants -o module_folded.onnx

@frankvp11

@zerollzeng does constant folding make the model better/faster?

@liangguixing95
Author

@zerollzeng does constant folding make the model better/faster?
Constant folding brings some performance degradation in my case. The ONNX file I provided is a minimal part of the cross-attention module in my model. Running the ONNX model with Polygraphy suggests there may be no problem, but when using real data, the max diff of the outputs is quite large, as the log above shows.

@zerollzeng
Collaborator

Constant folding brings some performance degradation in my case. The ONNX file I provided is a minimal part of the cross-attention module in my model. Running the ONNX model with Polygraphy suggests there may be no problem, but when using real data, the max diff of the outputs is quite large, as the log above shows.

Are you using real data for the input? The difference might be caused by your input data; e.g. if you feed random binary data, the values can be very large, like 1e+6.
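
One quick way to rule that out is to check the statistics of the inputs being fed (a minimal sketch; "debug_input.npy" is a placeholder for your own dumped input):

import numpy as np

data = np.load("debug_input.npy")  # placeholder for your real/debug input
print("dtype:", data.dtype, "shape:", data.shape)
print("min:", data.min(), "max:", data.max(), "mean:", data.mean())
# Values on the order of 1e+6 (e.g. random bytes reinterpreted as floats)
# will amplify small per-layer differences into large output diffs.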

@ttyio
Collaborator

ttyio commented Dec 6, 2022

Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions, thanks!

@ttyio ttyio closed this as completed Dec 6, 2022
@fanchuanster

Use NGC pytorch:22.12-py3 instead of pytorch:22.07-py3 to fix “Error Code 4: Miscellaneous (IShuffleLayer Reshape_179: reshape changes volume. Reshaping [784] to [1])"

@lix19937

lix19937 commented May 11, 2024

I also came across this problem:

[05/11/2024-15:07:32] [V] [TRT] Insert CopyNode after ConstantNode that produces a Myelin graph output: 25021
[05/11/2024-15:07:33] [E] Error[4]: [shapeCompiler.cpp::evaluateShapeChecks::1180] Error Code 4: Internal Error (kOPT values for profile 0 violate shape constraints: IShuffleLayer Reshape_1933: reshaping failed for tensor: 3516 Reshape would change volume.)
[05/11/2024-15:07:33] [E] Error[2]: [builder.cpp::buildSerializedNetwork::743] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[05/11/2024-15:07:33] [E] Engine could not be created from network
[05/11/2024-15:07:33] [E] Building engine failed
[05/11/2024-15:07:33] [E] Failed to create engine from model or file.
[05/11/2024-15:07:33] [E] Engine set up failed

The ONNX model's inputs all have fixed shapes, but the inner network has data-dependent ops like NonZero. If I replace all the code related to data-dependent operations with plugin implementations, the errors no longer occur.
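
For illustration, a generic sketch (not from the actual network) of why data-dependent ops are problematic: the output shape of NonZero depends on the tensor values, so a later fixed reshape cannot be proven volume-preserving at engine-build time:

import torch

x = torch.tensor([[0.0, 1.5, 0.0],
                  [2.0, 0.0, 3.0]])

# The number of rows returned by nonzero() depends on the values of x,
# not just its shape, so downstream shapes are only known at runtime.
idx = torch.nonzero(x)
print(idx.shape)  # torch.Size([3, 2]) for this particular input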
