Different output from HF and TensorRT-LLM #2754
Ericoool9614 asked this question in Q&A (Unanswered)
Setup:

- Model: InternVL2-8B
- Precision: BF16, no quantization
- No sampling (temperature=0 and do_sample=False in the HF generation_config, so both sides use greedy search)
- Single-GPU execution, no model parallelism
- Batch size = 1
- Inference is performed with the Hugging Face model.chat() method and the TensorRT-LLM MultimodalModelRunner.run() method (see the sketch after this list).
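For reference, a minimal sketch of the two call paths being compared. The Hugging Face side follows the InternVL2-8B model card; the TensorRT-LLM side is an assumption modeled on the multimodal example runner, whose import location and arguments vary by release:

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-8B"

# --- Hugging Face path (per the InternVL2-8B model card) ---
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Greedy decoding: with do_sample=False, the temperature setting is inert.
generation_config = dict(max_new_tokens=128, do_sample=False)

# pixel_values comes from the model card's image-preprocessing helper
# (BF16, on GPU); question is the text prompt.
# response = model.chat(tokenizer, pixel_values, question, generation_config)

# --- TensorRT-LLM path (assumed; mirrors examples/multimodal/run.py) ---
# from tensorrt_llm.runtime import MultimodalModelRunner  # location varies by version
# runner = MultimodalModelRunner(args)  # args as built by the example's CLI parser
# output = runner.run(question, image_path, max_new_tokens=128)
```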
Replies: 1 comment

The difference appears after the first logits output from runtime.generation.handle_per_step.
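One way to localize a divergence like this (a debugging sketch, not from the thread) is to dump the first-step logits on both sides and diff them. The compare_logits helper below is hypothetical glue; how the TensorRT-LLM logits are obtained depends on how the engine was built:

```python
import torch

def compare_logits(hf_logits: torch.Tensor, trt_logits: torch.Tensor, step: int) -> None:
    """Diff two [vocab_size] logit vectors for one decoding step."""
    hf = hf_logits.float().cpu()
    trt = trt_logits.float().cpu()
    max_abs_diff = (hf - trt).abs().max().item()
    argmax_match = hf.argmax().item() == trt.argmax().item()
    print(f"step {step}: max|diff| = {max_abs_diff:.6f}, argmax match = {argmax_match}")

# HF side: generate() can return per-step scores directly.
# out = model.generate(**inputs, do_sample=False, max_new_tokens=8,
#                      return_dict_in_generate=True, output_scores=True)
# hf_step0 = out.scores[0][0]  # scores for the first generated token, batch index 0

# TRT-LLM side (assumption): build the engine with generation-logit gathering
# enabled (e.g. trtllm-build --gather_generation_logits) and read the logits
# from the runner's output dict; the exact key name depends on the release.
# compare_logits(hf_step0, trt_step0, step=0)
```

Bit-exact logits are not expected across different BF16 kernel implementations; for greedy search the relevant question is at which step the argmax first flips.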