test(vllm): add Qwen3-Reranker-4B smoke and benchmark tests#6152
Open
Yadan-Wei wants to merge 19 commits into
Open
test(vllm): add Qwen3-Reranker-4B smoke and benchmark tests#6152Yadan-Wei wants to merge 19 commits into
Yadan-Wei wants to merge 19 commits into
Conversation
Add /v1/rerank tests for Qwen3-Reranker-4B on the AL2023 vLLM image, using the synthetic Qwen3ForSequenceClassification load-time conversion so the Cohere-compatible /v1/rerank endpoint registers (Path C in the upstream qwen3-reranker example). The hf_overrides JSON and chat template are baked into the scripts to keep YAML extra_args clean. Model uploaded to s3://dlc-cicd-models/llm-models/qwen3-reranker-4b.tar.gz. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…betcd_wrapper.so) Same pattern as the existing CVE-2026-33814 / 39820 entries: golang.org/x/net is statically linked into mooncake libetcd_wrapper.so and cannot be patched without a mooncake-transfer-engine rebuild. Unblocks the security-scan pipeline stage. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The failing security-test/ecr-vulnerability-scan job uses framework=vllm_server, which reads test/security/data/ecr_scan_allowlist/vllm_server/framework_allowlist.json. Mirrors the entry already added to vllm/ for the same mooncake libetcd_wrapper.so finding. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
First benchmark run on x86-g6xl-runner (1x L4) produced 41.7 rps for 50 requests / concurrency 4 / 8 docs per request (p99 0.107s). Setting min_rps to ~50% of observed catches real regressions while leaving headroom for environmental noise. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Adds a SageMaker /v1/rerank endpoint test for qwen3-reranker-4b on
amzn2023 images and shares the qwen3 reranker chat template across
EC2 + SM via the existing test_fixtures plumbing.
- Upload qwen3_reranker.jinja to s3://dlc-cicd-models/test-fixtures/
chat-templates/ as the single source of truth.
- EC2 smoke + benchmark scripts now read the template from
/models/test-fixtures/qwen3_reranker.jinja (downloaded by the
reusable workflow) instead of writing it inline via heredoc.
- SM test runner downloads the fixture via boto3, flattens newlines
to {{ '\n' }} so the template survives the SM entrypoint's
line-oriented env-var loop, and injects it via SM_VLLM_CHAT_TEMPLATE.
vLLM's --chat-template falls back to inline Jinja for non-path
values containing '{', '}', or newline; the {{ '\n' }} expressions
render to real newlines at request time, reproducing the original
template byte-for-byte.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Temporary diagnostic to identify what is filling /tmp on the x86-g6xl-runner CodeBuild fleet during cuda-test. Captures df -h and du output at four checkpoints: after DLC checkout, after container pull, after vllm_source checkout, after setup, and after tests. Will be reverted once we have the data. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
This reverts commit f57ceb1.
The x86-g6xl-runner CodeBuild fleet has a 7.6 GiB tmpfs at /tmp that is reused across jobs and is not always purged. Earlier ray ec2 jobs leak multi-GB /tmp/ray-ec2-* dirs (1.1-1.9 GiB each), filling /tmp to 100% before our cuda/regression/example-test jobs even start. The "Checkout vLLM tests" step then dies with ENOSPC and crashes the GHA worker daemon. Add a per-job cleanup step that removes known leakage patterns (ray-ec2-*, agent-log, mcetmp*, ray_test_images) before the heavy container pull and vllm checkout steps. df -h is logged before and after for visibility. A more durable fix would broaden the buildspec pre_build cleanup to match these patterns, but that affects every fleet-using pipeline and warrants a separate review. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The qwen3 reranker chat template now lives in the repo at
test/vllm/scripts/amzn2023/qwen3_reranker_chat_template.jinja
instead of s3://dlc-cicd-models/test-fixtures/chat-templates/.
EC2 and SageMaker both consume the same file:
- EC2: model-test and benchmark workflows docker cp *.jinja from
scripts/amzn2023/ into /models/ alongside the test scripts.
Reranker scripts read from /models/qwen3_reranker_chat_template.jinja.
- SageMaker: test_sm_model_serving.py reads the file directly from
the repo, flattens newlines to {{ '\n' }}, and injects it as
SM_VLLM_CHAT_TEMPLATE. No more boto3 download, no more
test_fixtures plumbing for the SM path.
Drops _fetch_fixture_text helper and the inline_chat_template_fixture
indirection. The orphaned S3 object can be cleaned up separately.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…builds for amzn2023-only changes Two unrelated CI fixes, packaged together since they're both YAML-only: 1. SM reranker instance: ml.g5.xlarge -> ml.g6.xlarge for the qwen3-reranker-4b sagemaker entry. ml.g5.xlarge in us-west-2 has been hanging at endpoint provisioning (37+ min on first test of the sequential batch in run 26867713663 / 26869554970, vs ~15 min on the 2026-06-02 baseline). g6.xlarge has the same 24 GiB VRAM (L4 vs A10G) and 4 vCPU / 16 GiB host RAM as g5.xlarge — same-tier replacement, no other config changes needed for the 8 GiB BF16 model with max-model-len=10000 and gpu-memory-utilization=0.85. 2. Ubuntu vLLM PR workflows: the on: pull_request: paths trigger in pr-vllm-ec2.yml and pr-vllm-sagemaker.yml already negates test/vllm/scripts/amzn2023/** and test/vllm/scripts/benchmark/**, but the dorny/paths-filter "build-change" gate further down does not. PRs that touch both an Ubuntu-relevant path AND an amzn2023 script were rebuilding the Ubuntu image unnecessarily. Mirror the same negations into the build-change filter. Verified that test/vllm/sagemaker/** must NOT be excluded — it is consumed by reusable-vllm-sagemaker-tests.yml line 47 + 53, which pr-vllm-sagemaker.yml line 259 invokes. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Move the config-driven SageMaker test runner under
test/vllm/sagemaker/amzn2023/ since every entry it exercises today
gates on required_image_pattern: amzn2023, and tighten the Ubuntu PR
workflow filters so they only fire on paths Ubuntu actually uses.
Replace !-negation patterns with explicit enumeration since negation
in dorny/paths-filter (and sometimes GHA's own filter) is unreliable.
- mv test/vllm/sagemaker/test_sm_model_serving.py
test/vllm/sagemaker/amzn2023/test_sm_model_serving.py
(Path(__file__).parents[3] -> parents[4] to keep the repo-root
resolution correct after the deeper nesting.)
- pr-vllm-sagemaker.yml triggers + check-changes:
test/vllm/sagemaker/** -> test/vllm/sagemaker/*.py and
test/vllm/sagemaker/requirements.txt. Ubuntu still runs
test_sm_endpoint.py (DeepSeek), but no longer fires when only
amzn2023/test_sm_model_serving.py changes.
- pr-vllm-ec2.yml + pr-vllm-sagemaker.yml: enumerate
scripts/vllm/{dockerd_entrypoint.sh,sagemaker_entrypoint.sh,
sagemaker_serve.py} explicitly instead of scripts/vllm/** plus
!scripts/vllm/amzn2023/** + !scripts/vllm/omni_*. Same for
test/vllm/scripts/*.sh instead of test/vllm/scripts/** plus
!amzn2023/** + !benchmark/**.
- pr-vllm-sagemaker-amzn2023.yml is unchanged: it keeps
test/vllm/sagemaker/** so the new amzn2023 subdir + the existing
top-level test_sm_endpoint.py both flow through. The reusable
workflow's `pytest vllm/sagemaker` recurses, so amzn2023 picks up
both files automatically.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The v0.48.0 entry covers the same CVE on the current mooncake-transfer-engine; the v0.38.0 entry was a stale duplicate from the prior mooncake version. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The SM endpoint container runs vllm via standard-supervisor (provided
by model-hosting-container-standards). standard-supervisor joins argv
with a single space into one string and writes it as the supervisord
[program:app] command= field. supervisord then re-parses with
shlex.split, which strips unprotected double quotes — so a JSON value
like {"architectures":["X"]} arrives at vllm's argparse as
{architectures:[X]} and json.loads rejects it with:
argument --hf-overrides: Value {architectures:[...]} cannot be
converted to <function loads at 0x...>
Wrap the JSON in literal single quotes inside the YAML value so the
joined command string is `--hf-overrides '{...}'`. shlex.split then
strips the single quotes and emits the JSON as one argv element with
inner double quotes intact.
Upstream fix would be to add shlex.quote() to standard-supervisor's
argv joining at scripts/standard_supervisor.py:198, or to use
shlex.join() instead of " ".join().
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
transformers v5.10.0 added a prepare_inputs_layout helper to ProcessorMixin (processing_utils.py:704) that calls self.feature_extractor.fetch_audio(...) without first checking hasattr. mistral-common 1.11.2's MistralCommonFeatureExtractor (used by vllm 0.22.1rc0 for voxtral) does not implement fetch_audio, so engine startup raises: AttributeError: 'MistralCommonFeatureExtractor' object has no attribute 'fetch_audio' Pin to <5.10 until either (a) transformers adds a hasattr guard or (b) mistral-common implements fetch_audio. v5.9.0's processing_utils does not contain prepare_inputs_layout, so the bad code path is absent. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…rapper.so) Same root cause as the existing mooncake-pinned go/stdlib CVEs in this allowlist: go/stdlib is statically linked into /opt/venv/lib/python3.12/site-packages/mooncake/libetcd_wrapper.so and cannot be patched without a mooncake-transfer-engine rebuild. Description: decoding a maliciously-crafted MIME header containing many invalid encoded-words can consume excessive CPU. Pin fix would be go/stdlib>=1.26.4. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Mirror of the entry I added to vllm/framework_allowlist.json earlier; the AMZN2023 PR workflow (pr-vllm-sagemaker-amzn2023.yml, pr-vllm-ec2-amzn2023.yml) reads the vllm_server/ allowlist while the Ubuntu workflow reads the vllm/ allowlist, so both paths need the same entry. Same root cause as the other mooncake-pinned go/stdlib entries in this allowlist: go/stdlib 1.24.11 is statically linked into /opt/venv/lib/python3.12/site-packages/mooncake/libetcd_wrapper.so and cannot be patched without a mooncake-transfer-engine rebuild. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Same root cause as the prior --hf-overrides shell-quote fix: inside
the SM container, standard-supervisor (model-hosting-container-
standards) joins argv with single spaces and supervisord re-splits
via shlex, stripping unprotected quotes and breaking on whitespace.
The chat template runs through the same path, and the Jinja
expressions ({{ '\n' }}) plus the literal angle-brackets/whitespace
in the template body break apart on shlex.split:
api_server.py: error: unrecognized arguments: }}Judge whether the
Document meets the requirements based on the Query and the Instruct
provided. ...
Two changes in _flatten_jinja:
- Use double-quoted Jinja string literals ({{ "\n" }}) instead of
single-quoted ({{ '\n' }}) so the template body contains no inner
single quotes that would clash with the outer shell-quote wrapping.
- Wrap the entire flattened value in literal single quotes so
supervisord's shlex.split treats it as one argv element with the
inner content (angle brackets, double quotes, spaces) intact.
The function now raises if the template happens to contain a single
quote, since that would still clash with the outer wrapping.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add Qwen3-Reranker-4B end-to-end test coverage on the AL2023 vLLM image (EC2 smoke + benchmark + SageMaker endpoint), plus a handful of unrelated CI fixes that surfaced while iterating.
Reranker test coverage
vllm_reranker_smoke_test.sh) and benchmark (vllm_reranker_benchmark_test.sh) undertest/vllm/scripts/. Both load the model via Path C —--runner pooling+--hf_overrides {"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}— so the standard Cohere-compatible/v1/rerankendpoint registers, then exercise it end-to-end.vllm-model-tests.ymlsagemaker:section, gated byrequired_image_pattern: amzn2023. Runs onml.g6.xlarge(24 GB L4; same VRAM tier as g5.xlarge with better us-west-2 capacity)..github/config/model-tests/vllm-model-tests.ymlfor smoke + benchmark onx86-g6xl-runner(1× L4, ~8 GB BF16 weights fit comfortably).s3://dlc-cicd-models/llm-models/qwen3-reranker-4b.tar.gz(6.0 GiB).Shared chat template
test/vllm/scripts/amzn2023/qwen3_reranker_chat_template.jinjais the single source of truth.docker cpstep inreusable-vllm-model-tests.ymlanddispatch-vllm-benchmark.yml. Scripts read it from/models/qwen3_reranker_chat_template.jinja.test_sm_model_serving.py, flattened (newlines →{{ "\n" }}), shell-quoted, and injected asSM_VLLM_CHAT_TEMPLATE.test/vllm/sagemaker/amzn2023/test_sm_model_serving.pyand the Ubuntu PR workflow path filters are scoped to top-level files only, so amzn2023-only YAML changes no longer trigger Ubuntu builds.SM-specific quoting workarounds
The vLLM AMZN2023 SageMaker container runs vLLM under
standard-supervisor(frommodel-hosting-container-standards). It joins argv with single spaces into one string, then supervisord re-parses viashlex.split, which strips unprotected double quotes and breaks tokens on whitespace. Two SM env values needed shell-quote-protective wrapping:SM_VLLM_HF_OVERRIDES— JSON literal wrapped in single quotes in the YAML value.SM_VLLM_CHAT_TEMPLATE—_flatten_jinja()switched to double-quoted Jinja\nexpressions and outer single-quote-wraps the whole template at injection time.Upstream fix would be
shlex.join()(orshlex.quote()in the join) atmodel_hosting_container_standards/supervisor/scripts/standard_supervisor.py:198.Adjacent fixes (rolled in to keep CI green)
transformers<5.10pin indocker/vllm/Dockerfile.amzn2023. transformers 5.10.0 added an unguardedprepare_inputs_layout → fetch_audiocall that breaks voxtral with theMistralCommonFeatureExtractorshipped in mistral-common 1.11.2. Voxtral was passing one7e19d66(last build before transformers 5.10.x got resolved into the image) and failing after.libetcd_wrapper.sostatic-link issue:CVE-2026-39821(golang.org/x/net) added tovllm/andvllm_server/.CVE-2026-42504(go/stdlib) added tovllm/andvllm_server/.CVE-2026-39821v0.38.0 entry invllm/./tmpcleanup inreusable-vllm-upstream-tests.yml. Thex86-g6xl-runnerCodeBuild fleet has a 7.6 GiB tmpfs that isn't always purged between jobs; earlier ray-ec2 jobs leak multi-GB/tmp/ray-ec2-*dirs, which crashed our cuda-test at "Checkout vLLM tests" with ENOSPC. Added a per-job step that removes known leakage patterns before the heavy steps.test/vllm/scripts/**+!-negations (which dorny/paths-filter handles unreliably) with explicit enumeration of top-level scripts only. amzn2023-only changes (undertest/vllm/scripts/amzn2023/,test/vllm/scripts/benchmark/,test/vllm/sagemaker/amzn2023/) no longer trigger Ubuntu vllm builds.Test plan
test-model (qwen3-reranker-4b)onPR - vLLM EC2 AMZN2023— asserts top result is the relevant doc, descending score order, relevant-vs-irrelevant margin > 0.1, single-doc request works.benchmark (qwen3-reranker-4b)on the dispatch workflow withmin_rps: 20.test-model (qwen3-reranker-4b)onPR - vLLM SageMaker AMZN2023— deploys toml.g6.xlarge, hits/v1/rerank, validatesresults[0].relevance_score.voxtral-mini-4bsmoke test passes onPR - vLLM EC2 AMZN2023(transformers pin verified).security-test / ecr-vulnerability-scanpasses onPR - vLLM SageMaker AMZN2023(CVE-2026-42504 allowlisted).PR - vLLM EC2,PR - vLLM SageMaker) skip when only amzn2023-only paths change.