[Bugfix] Avoid repeatedly creating dummy data during engine startup #17935

DarkLight1337 · 2025-05-10T03:07:24Z

This PR fixes an issue where the startup time of multimodal models is multiplied because dummy data is created multiple times during the profiling run and graph capture. Instead of disabling the cache while dummy data is generated, the cache is now always enabled. To conserve memory, the cache is cleared at the end of the engine startup process.

Cache reset code is taken from #16478
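The approach described above can be illustrated with a minimal sketch. This is not the code from this PR: `DummyDataCache`, `get_or_create`, `reset`, and `start_engine` are hypothetical names, and the startup phases are simplified stand-ins for the profiling run and graph capture. It only shows the idea of keeping the cache enabled throughout startup and clearing it once at the end.

```python
from typing import Callable, Dict, Hashable


class DummyDataCache:
    """Hypothetical stand-in for a multimodal processing cache."""

    def __init__(self) -> None:
        self._store: Dict[Hashable, object] = {}
        self.builds = 0  # how many times the expensive builder actually ran

    def get_or_create(self, key: Hashable, build: Callable[[], object]) -> object:
        # Build only on a miss; later requests for the same key reuse the entry.
        if key not in self._store:
            self.builds += 1
            self._store[key] = build()
        return self._store[key]

    def reset(self) -> None:
        # Free the cached entries once startup no longer needs them.
        self._store.clear()

    def __len__(self) -> int:
        return len(self._store)


def start_engine(cache: DummyDataCache) -> None:
    """Simplified startup: one profiling run, then several capture steps."""

    def build_dummy() -> list:
        # Placeholder for the expensive dummy multimodal input.
        return [0] * 1024

    cache.get_or_create(("image", 1), build_dummy)  # profiling run (assumed)
    for _batch_size in (1, 2, 4, 8):  # graph capture steps (assumed)
        cache.get_or_create(("image", 1), build_dummy)

    cache.reset()  # clear at the end of startup to conserve memory


cache = DummyDataCache()
start_engine(cache)
print(cache.builds, len(cache))  # 1 0
```

With the cache disabled, the builder in this sketch would run five times instead of once; resetting at the end keeps the memory cost limited to the startup window.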

github-actions · 2025-05-10T03:07:32Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

ywang96 · 2025-05-12T08:14:39Z

I'm actually not seeing the speedup from this PR with the following test script:

# python3 test.py --model Qwen/Qwen2.5-VL-3B-Instruct

# test.py
from vllm.multimodal import MULTIMODAL_REGISTRY
from vllm.engine.async_llm_engine import AsyncEngineArgs
from vllm.engine.async_llm_engine import UsageContext
from vllm.utils import FlexibleArgumentParser
from vllm.v1.core.encoder_cache_manager import compute_encoder_budget
import time

if __name__ == "__main__":
    parser = FlexibleArgumentParser()
    parser = AsyncEngineArgs.add_cli_args(parser)
    args = parser.parse_args()
    engine_args = AsyncEngineArgs.from_cli_args(args)
    vllm_config = engine_args.create_engine_config(UsageContext.ENGINE_CONTEXT)
    
    start_time = time.perf_counter()
    encoder_compute_budget, encoder_cache_size = compute_encoder_budget(
        model_config=vllm_config.model_config,
        scheduler_config=vllm_config.scheduler_config,
        mm_registry=MULTIMODAL_REGISTRY,
    )
    print(f"Time taken: {time.perf_counter() - start_time}")

On main

Time taken: 10.416103872470558

This branch

Time taken: 16.340763847343624

DarkLight1337 · 2025-05-12T09:00:12Z

The cache is only effective if you run compute_encoder_budget a second time.

DarkLight1337 · 2025-05-12T09:01:18Z

In actual inference, this means that the multi-modal data used in the dummy run is cached (since the dummy run is called after compute_encoder_budget), so there should still be an overall speedup.

DarkLight1337 · 2025-05-12T09:41:49Z

OK it seems that I do need to implement the async version of this method...

DarkLight1337 requested review from WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac, alexm-redhat, zhuohan123 and youkaichao as code owners May 10, 2025 03:07

mergify bot added multi-modality Related to multi-modality (#4194) v1 labels May 10, 2025

[Bugfix] Avoid repeatedly creating dummy data during engine startup

7d0557c

Signed-off-by: DarkLight1337 <[email protected]>

DarkLight1337 force-pushed the no-disable-cache branch from 6265eaa to 7d0557c May 10, 2025 14:00

DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label May 11, 2025

ywang96 reviewed May 11, 2025

vllm/v1/engine/llm_engine.py

DarkLight1337 added 2 commits May 12, 2025 00:44

Should call reset cache from LLMEngine instead of EngineCore

7402be6

Signed-off-by: DarkLight1337 <[email protected]>

Fix

b1af78d

Signed-off-by: DarkLight1337 <[email protected]>

Async reset

37e2fc6

Signed-off-by: DarkLight1337 <[email protected]>

mergify bot added the frontend label May 12, 2025

Fix

1a1080f

Signed-off-by: DarkLight1337 <[email protected]>

ywang96 approved these changes May 13, 2025

vllm-bot merged commit 61e0a50 into vllm-project:main May 13, 2025
62 of 65 checks passed

DarkLight1337 deleted the no-disable-cache branch May 13, 2025 05:40

DarkLight1337 mentioned this pull request May 13, 2025

[Bugfix] Fix entrypoints metrics tests #18063

Merged

mawong-amd pushed a commit to ROCm/vllm that referenced this pull request May 14, 2025

[Bugfix] Avoid repeatedly creating dummy data during engine startup (v…

115dec1

…llm-project#17935) Signed-off-by: DarkLight1337 <[email protected]>

NickLucche mentioned this pull request May 15, 2025

[PD] Heterogenous TP + #7 robertgshaw2-redhat/vllm#14

Closed

zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025

[Bugfix] Avoid repeatedly creating dummy data during engine startup (v…

fa7e846

…llm-project#17935) Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Yuqi Zhang <[email protected]>

DarkLight1337 mentioned this pull request Jun 17, 2025

[Multimodal] Optimize Qwen2/2.5-VL startup time #19756

Merged

4 tasks

minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025

[Bugfix] Avoid repeatedly creating dummy data during engine startup (v…

d63dd27

…llm-project#17935) Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: minpeter <[email protected]>
