test(vllm): add Qwen3-Reranker-4B smoke and benchmark tests by Yadan-Wei · Pull Request #6152 · aws/deep-learning-containers

Yadan-Wei · 2026-05-27T20:44:34Z

Summary

Add Qwen3-Reranker-4B end-to-end test coverage on the AL2023 vLLM image (EC2 smoke + benchmark + SageMaker endpoint), plus a handful of unrelated CI fixes that surfaced while iterating.

Reranker test coverage

New EC2 smoke test (vllm_reranker_smoke_test.sh) and benchmark (vllm_reranker_benchmark_test.sh) under test/vllm/scripts/. Both load the model via Path C — --runner pooling + --hf_overrides {"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true} — so the standard Cohere-compatible /v1/rerank endpoint registers, then exercise it end-to-end.
New SageMaker endpoint test entry in vllm-model-tests.yml sagemaker: section, gated by required_image_pattern: amzn2023. Runs on ml.g6.xlarge (24 GB L4; same VRAM tier as g5.xlarge with better us-west-2 capacity).
Wire-up in .github/config/model-tests/vllm-model-tests.yml for smoke + benchmark on x86-g6xl-runner (1× L4, ~8 GB BF16 weights fit comfortably).
Model artifact at s3://dlc-cicd-models/llm-models/qwen3-reranker-4b.tar.gz (6.0 GiB).

Shared chat template

New test/vllm/scripts/amzn2023/qwen3_reranker_chat_template.jinja is the single source of truth.
EC2: copied into the container alongside other test scripts via the existing docker cp step in reusable-vllm-model-tests.yml and dispatch-vllm-benchmark.yml. Scripts read it from /models/qwen3_reranker_chat_template.jinja.
SageMaker: read at test runtime by test_sm_model_serving.py, flattened (newlines → {{ "\n" }}), shell-quoted, and injected as SM_VLLM_CHAT_TEMPLATE.
The SM test runner is moved to test/vllm/sagemaker/amzn2023/test_sm_model_serving.py and the Ubuntu PR workflow path filters are scoped to top-level files only, so amzn2023-only YAML changes no longer trigger Ubuntu builds.

SM-specific quoting workarounds

The vLLM AMZN2023 SageMaker container runs vLLM under standard-supervisor (from model-hosting-container-standards). It joins argv with single spaces into one string, then supervisord re-parses via shlex.split, which strips unprotected double quotes and breaks tokens on whitespace. Two SM env values needed shell-quote-protective wrapping:

SM_VLLM_HF_OVERRIDES — JSON literal wrapped in single quotes in the YAML value.
SM_VLLM_CHAT_TEMPLATE — _flatten_jinja() switched to double-quoted Jinja \n expressions and outer single-quote-wraps the whole template at injection time.
Upstream fix would be shlex.join() (or shlex.quote() in the join) at model_hosting_container_standards/supervisor/scripts/standard_supervisor.py:198.

Adjacent fixes (rolled in to keep CI green)

transformers<5.10 pin in docker/vllm/Dockerfile.amzn2023. transformers 5.10.0 added an unguarded prepare_inputs_layout → fetch_audio call that breaks voxtral with the MistralCommonFeatureExtractor shipped in mistral-common 1.11.2. Voxtral was passing on e7e19d66 (last build before transformers 5.10.x got resolved into the image) and failing after.
CVE allowlist updates for the existing mooncake libetcd_wrapper.so static-link issue:
- CVE-2026-39821 (golang.org/x/net) added to vllm/ and vllm_server/.
- CVE-2026-42504 (go/stdlib) added to vllm/ and vllm_server/.
- Removed a stale duplicate CVE-2026-39821 v0.38.0 entry in vllm/.
/tmp cleanup in reusable-vllm-upstream-tests.yml. The x86-g6xl-runner CodeBuild fleet has a 7.6 GiB tmpfs that isn't always purged between jobs; earlier ray-ec2 jobs leak multi-GB /tmp/ray-ec2-* dirs, which crashed our cuda-test at "Checkout vLLM tests" with ENOSPC. Added a per-job step that removes known leakage patterns before the heavy steps.
Ubuntu PR workflow path filters — replaced test/vllm/scripts/** + !-negations (which dorny/paths-filter handles unreliably) with explicit enumeration of top-level scripts only. amzn2023-only changes (under test/vllm/scripts/amzn2023/, test/vllm/scripts/benchmark/, test/vllm/sagemaker/amzn2023/) no longer trigger Ubuntu vllm builds.

Test plan

EC2 smoke test-model (qwen3-reranker-4b) on PR - vLLM EC2 AMZN2023 — asserts top result is the relevant doc, descending score order, relevant-vs-irrelevant margin > 0.1, single-doc request works.
EC2 benchmark benchmark (qwen3-reranker-4b) on the dispatch workflow with min_rps: 20.
SM endpoint test-model (qwen3-reranker-4b) on PR - vLLM SageMaker AMZN2023 — deploys to ml.g6.xlarge, hits /v1/rerank, validates results[0].relevance_score.
voxtral-mini-4b smoke test passes on PR - vLLM EC2 AMZN2023 (transformers pin verified).
security-test / ecr-vulnerability-scan passes on PR - vLLM SageMaker AMZN2023 (CVE-2026-42504 allowlisted).
Ubuntu workflows (PR - vLLM EC2, PR - vLLM SageMaker) skip when only amzn2023-only paths change.

Add /v1/rerank tests for Qwen3-Reranker-4B on the AL2023 vLLM image, using the synthetic Qwen3ForSequenceClassification load-time conversion so the Cohere-compatible /v1/rerank endpoint registers (Path C in the upstream qwen3-reranker example). The hf_overrides JSON and chat template are baked into the scripts to keep YAML extra_args clean. Model uploaded to s3://dlc-cicd-models/llm-models/qwen3-reranker-4b.tar.gz. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

…betcd_wrapper.so) Same pattern as the existing CVE-2026-33814 / 39820 entries: golang.org/x/net is statically linked into mooncake libetcd_wrapper.so and cannot be patched without a mooncake-transfer-engine rebuild. Unblocks the security-scan pipeline stage. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

The failing security-test/ecr-vulnerability-scan job uses framework=vllm_server, which reads test/security/data/ecr_scan_allowlist/vllm_server/framework_allowlist.json. Mirrors the entry already added to vllm/ for the same mooncake libetcd_wrapper.so finding. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

First benchmark run on x86-g6xl-runner (1x L4) produced 41.7 rps for 50 requests / concurrency 4 / 8 docs per request (p99 0.107s). Setting min_rps to ~50% of observed catches real regressions while leaving headroom for environmental noise. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

Adds a SageMaker /v1/rerank endpoint test for qwen3-reranker-4b on amzn2023 images and shares the qwen3 reranker chat template across EC2 + SM via the existing test_fixtures plumbing. - Upload qwen3_reranker.jinja to s3://dlc-cicd-models/test-fixtures/ chat-templates/ as the single source of truth. - EC2 smoke + benchmark scripts now read the template from /models/test-fixtures/qwen3_reranker.jinja (downloaded by the reusable workflow) instead of writing it inline via heredoc. - SM test runner downloads the fixture via boto3, flattens newlines to {{ '\n' }} so the template survives the SM entrypoint's line-oriented env-var loop, and injects it via SM_VLLM_CHAT_TEMPLATE. vLLM's --chat-template falls back to inline Jinja for non-path values containing '{', '}', or newline; the {{ '\n' }} expressions render to real newlines at request time, reproducing the original template byte-for-byte. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

Temporary diagnostic to identify what is filling /tmp on the x86-g6xl-runner CodeBuild fleet during cuda-test. Captures df -h and du output at four checkpoints: after DLC checkout, after container pull, after vllm_source checkout, after setup, and after tests. Will be reverted once we have the data. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

This reverts commit f57ceb1.

The x86-g6xl-runner CodeBuild fleet has a 7.6 GiB tmpfs at /tmp that is reused across jobs and is not always purged. Earlier ray ec2 jobs leak multi-GB /tmp/ray-ec2-* dirs (1.1-1.9 GiB each), filling /tmp to 100% before our cuda/regression/example-test jobs even start. The "Checkout vLLM tests" step then dies with ENOSPC and crashes the GHA worker daemon. Add a per-job cleanup step that removes known leakage patterns (ray-ec2-*, agent-log, mcetmp*, ray_test_images) before the heavy container pull and vllm checkout steps. df -h is logged before and after for visibility. A more durable fix would broaden the buildspec pre_build cleanup to match these patterns, but that affects every fleet-using pipeline and warrants a separate review. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

The qwen3 reranker chat template now lives in the repo at test/vllm/scripts/amzn2023/qwen3_reranker_chat_template.jinja instead of s3://dlc-cicd-models/test-fixtures/chat-templates/. EC2 and SageMaker both consume the same file: - EC2: model-test and benchmark workflows docker cp *.jinja from scripts/amzn2023/ into /models/ alongside the test scripts. Reranker scripts read from /models/qwen3_reranker_chat_template.jinja. - SageMaker: test_sm_model_serving.py reads the file directly from the repo, flattens newlines to {{ '\n' }}, and injects it as SM_VLLM_CHAT_TEMPLATE. No more boto3 download, no more test_fixtures plumbing for the SM path. Drops _fetch_fixture_text helper and the inline_chat_template_fixture indirection. The orphaned S3 object can be cleaned up separately. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

…builds for amzn2023-only changes Two unrelated CI fixes, packaged together since they're both YAML-only: 1. SM reranker instance: ml.g5.xlarge -> ml.g6.xlarge for the qwen3-reranker-4b sagemaker entry. ml.g5.xlarge in us-west-2 has been hanging at endpoint provisioning (37+ min on first test of the sequential batch in run 26867713663 / 26869554970, vs ~15 min on the 2026-06-02 baseline). g6.xlarge has the same 24 GiB VRAM (L4 vs A10G) and 4 vCPU / 16 GiB host RAM as g5.xlarge — same-tier replacement, no other config changes needed for the 8 GiB BF16 model with max-model-len=10000 and gpu-memory-utilization=0.85. 2. Ubuntu vLLM PR workflows: the on: pull_request: paths trigger in pr-vllm-ec2.yml and pr-vllm-sagemaker.yml already negates test/vllm/scripts/amzn2023/** and test/vllm/scripts/benchmark/**, but the dorny/paths-filter "build-change" gate further down does not. PRs that touch both an Ubuntu-relevant path AND an amzn2023 script were rebuilding the Ubuntu image unnecessarily. Mirror the same negations into the build-change filter. Verified that test/vllm/sagemaker/** must NOT be excluded — it is consumed by reusable-vllm-sagemaker-tests.yml line 47 + 53, which pr-vllm-sagemaker.yml line 259 invokes. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

Move the config-driven SageMaker test runner under test/vllm/sagemaker/amzn2023/ since every entry it exercises today gates on required_image_pattern: amzn2023, and tighten the Ubuntu PR workflow filters so they only fire on paths Ubuntu actually uses. Replace !-negation patterns with explicit enumeration since negation in dorny/paths-filter (and sometimes GHA's own filter) is unreliable. - mv test/vllm/sagemaker/test_sm_model_serving.py test/vllm/sagemaker/amzn2023/test_sm_model_serving.py (Path(__file__).parents[3] -> parents[4] to keep the repo-root resolution correct after the deeper nesting.) - pr-vllm-sagemaker.yml triggers + check-changes: test/vllm/sagemaker/** -> test/vllm/sagemaker/*.py and test/vllm/sagemaker/requirements.txt. Ubuntu still runs test_sm_endpoint.py (DeepSeek), but no longer fires when only amzn2023/test_sm_model_serving.py changes. - pr-vllm-ec2.yml + pr-vllm-sagemaker.yml: enumerate scripts/vllm/{dockerd_entrypoint.sh,sagemaker_entrypoint.sh, sagemaker_serve.py} explicitly instead of scripts/vllm/** plus !scripts/vllm/amzn2023/** + !scripts/vllm/omni_*. Same for test/vllm/scripts/*.sh instead of test/vllm/scripts/** plus !amzn2023/** + !benchmark/**. - pr-vllm-sagemaker-amzn2023.yml is unchanged: it keeps test/vllm/sagemaker/** so the new amzn2023 subdir + the existing top-level test_sm_endpoint.py both flow through. The reusable workflow's `pytest vllm/sagemaker` recurses, so amzn2023 picks up both files automatically. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

The v0.48.0 entry covers the same CVE on the current mooncake-transfer-engine; the v0.38.0 entry was a stale duplicate from the prior mooncake version. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

…anker

The SM endpoint container runs vllm via standard-supervisor (provided by model-hosting-container-standards). standard-supervisor joins argv with a single space into one string and writes it as the supervisord [program:app] command= field. supervisord then re-parses with shlex.split, which strips unprotected double quotes — so a JSON value like {"architectures":["X"]} arrives at vllm's argparse as {architectures:[X]} and json.loads rejects it with: argument --hf-overrides: Value {architectures:[...]} cannot be converted to <function loads at 0x...> Wrap the JSON in literal single quotes inside the YAML value so the joined command string is `--hf-overrides '{...}'`. shlex.split then strips the single quotes and emits the JSON as one argv element with inner double quotes intact. Upstream fix would be to add shlex.quote() to standard-supervisor's argv joining at scripts/standard_supervisor.py:198, or to use shlex.join() instead of " ".join(). Signed-off-by: Yadan Wei <yadanwei@amazon.com>

transformers v5.10.0 added a prepare_inputs_layout helper to ProcessorMixin (processing_utils.py:704) that calls self.feature_extractor.fetch_audio(...) without first checking hasattr. mistral-common 1.11.2's MistralCommonFeatureExtractor (used by vllm 0.22.1rc0 for voxtral) does not implement fetch_audio, so engine startup raises: AttributeError: 'MistralCommonFeatureExtractor' object has no attribute 'fetch_audio' Pin to <5.10 until either (a) transformers adds a hasattr guard or (b) mistral-common implements fetch_audio. v5.9.0's processing_utils does not contain prepare_inputs_layout, so the bad code path is absent. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

…rapper.so) Same root cause as the existing mooncake-pinned go/stdlib CVEs in this allowlist: go/stdlib is statically linked into /opt/venv/lib/python3.12/site-packages/mooncake/libetcd_wrapper.so and cannot be patched without a mooncake-transfer-engine rebuild. Description: decoding a maliciously-crafted MIME header containing many invalid encoded-words can consume excessive CPU. Pin fix would be go/stdlib>=1.26.4. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

Mirror of the entry I added to vllm/framework_allowlist.json earlier; the AMZN2023 PR workflow (pr-vllm-sagemaker-amzn2023.yml, pr-vllm-ec2-amzn2023.yml) reads the vllm_server/ allowlist while the Ubuntu workflow reads the vllm/ allowlist, so both paths need the same entry. Same root cause as the other mooncake-pinned go/stdlib entries in this allowlist: go/stdlib 1.24.11 is statically linked into /opt/venv/lib/python3.12/site-packages/mooncake/libetcd_wrapper.so and cannot be patched without a mooncake-transfer-engine rebuild. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

Same root cause as the prior --hf-overrides shell-quote fix: inside the SM container, standard-supervisor (model-hosting-container- standards) joins argv with single spaces and supervisord re-splits via shlex, stripping unprotected quotes and breaking on whitespace. The chat template runs through the same path, and the Jinja expressions ({{ '\n' }}) plus the literal angle-brackets/whitespace in the template body break apart on shlex.split: api_server.py: error: unrecognized arguments: }}Judge whether the Document meets the requirements based on the Query and the Instruct provided. ... Two changes in _flatten_jinja: - Use double-quoted Jinja string literals ({{ "\n" }}) instead of single-quoted ({{ '\n' }}) so the template body contains no inner single quotes that would clash with the outer shell-quote wrapping. - Wrap the entire flattened value in literal single quotes so supervisord's shlex.split treats it as one argv element with the inner content (angle brackets, double quotes, spaces) intact. The function now raises if the template happens to contain a single quote, since that would still clash with the outer wrapping. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

aws-deep-learning-containers-ci Bot added the authorized label May 27, 2026

Yadan Wei and others added 18 commits May 27, 2026 16:32

Merge branch 'main' into vllm-al2023-qwen3-reranker

a07c443

Revert "DIAG: log disk usage at cuda-test checkpoints (will revert)"

c5bb5c7

This reverts commit f57ceb1.

test(vllm): remove duplicate CVE-2026-39821 v0.38.0 allowlist entry

120821d

The v0.48.0 entry covers the same CVE on the current mooncake-transfer-engine; the v0.38.0 entry was a stale duplicate from the prior mooncake version. Signed-off-by: Yadan Wei <yadanwei@amazon.com>

Merge remote-tracking branch 'origin/main' into vllm-al2023-qwen3-rer…

ace2ba4

…anker

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(vllm): add Qwen3-Reranker-4B smoke and benchmark tests#6152

test(vllm): add Qwen3-Reranker-4B smoke and benchmark tests#6152
Yadan-Wei wants to merge 19 commits into
mainfrom
vllm-al2023-qwen3-reranker

Yadan-Wei commented May 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Yadan-Wei commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Reranker test coverage

Shared chat template

SM-specific quoting workarounds

Adjacent fixes (rolled in to keep CI green)

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Yadan-Wei commented May 27, 2026 •

edited

Loading