Different output from HF and TensorRT-LLM #2754
Ericoool9614 asked this question in Q&A (Unanswered)
Setup:

- Model: InternVL2-8B
- Precision: BF16, no quantization
- No sampling (temperature=0 and do_sample=False in the HF generation_config, so both sides use greedy search)
- Single-GPU execution, no model parallelism
- Batch size = 1
- Inference is performed with the Hugging Face model.chat() method and the TensorRT-LLM MultimodalModelRunner.run() method (see the sketch after this list).
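For reference, a minimal sketch of the two call paths being compared. The Hugging Face side follows the InternVL2-8B model card; the TensorRT-LLM side is an assumption modeled on the multimodal example runner, whose import location and arguments vary by release:

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-8B"

# --- Hugging Face path (per the InternVL2-8B model card) ---
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Greedy decoding: with do_sample=False, the temperature setting is inert.
generation_config = dict(max_new_tokens=128, do_sample=False)

# pixel_values comes from the model card's image-preprocessing helper
# (BF16, on GPU); question is the text prompt.
# response = model.chat(tokenizer, pixel_values, question, generation_config)

# --- TensorRT-LLM path (assumed; mirrors examples/multimodal/run.py) ---
# from tensorrt_llm.runtime import MultimodalModelRunner  # location varies by version
# runner = MultimodalModelRunner(args)  # args as built by the example's CLI parser
# output = runner.run(question, image_path, max_new_tokens=128)
```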
Replies: 1 comment

The difference appears after the first logits output from runtime.generation.handle_per_step.
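One way to localize a divergence like this (a debugging sketch, not from the thread) is to dump the first-step logits on both sides and diff them. The compare_logits helper below is hypothetical glue; how the TensorRT-LLM logits are obtained depends on how the engine was built:

```python
import torch

def compare_logits(hf_logits: torch.Tensor, trt_logits: torch.Tensor, step: int) -> None:
    """Diff two [vocab_size] logit vectors for one decoding step."""
    hf = hf_logits.float().cpu()
    trt = trt_logits.float().cpu()
    max_abs_diff = (hf - trt).abs().max().item()
    argmax_match = hf.argmax().item() == trt.argmax().item()
    print(f"step {step}: max|diff| = {max_abs_diff:.6f}, argmax match = {argmax_match}")

# HF side: generate() can return per-step scores directly.
# out = model.generate(**inputs, do_sample=False, max_new_tokens=8,
#                      return_dict_in_generate=True, output_scores=True)
# hf_step0 = out.scores[0][0]  # scores for the first generated token, batch index 0

# TRT-LLM side (assumption): build the engine with generation-logit gathering
# enabled (e.g. trtllm-build --gather_generation_logits) and read the logits
# from the runner's output dict; the exact key name depends on the release.
# compare_logits(hf_step0, trt_step0, step=0)
```

Bit-exact logits are not expected across different BF16 kernel implementations; for greedy search the relevant question is at which step the argmax first flips.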