Skip to content

test(vllm): add Qwen3-Reranker-4B smoke and benchmark tests#6152

Open
Yadan-Wei wants to merge 19 commits into
mainfrom
vllm-al2023-qwen3-reranker
Open

test(vllm): add Qwen3-Reranker-4B smoke and benchmark tests#6152
Yadan-Wei wants to merge 19 commits into
mainfrom
vllm-al2023-qwen3-reranker

Conversation

@Yadan-Wei
Copy link
Copy Markdown
Contributor

@Yadan-Wei Yadan-Wei commented May 27, 2026

Summary

Add Qwen3-Reranker-4B end-to-end test coverage on the AL2023 vLLM image (EC2 smoke + benchmark + SageMaker endpoint), plus a handful of unrelated CI fixes that surfaced while iterating.

Reranker test coverage

  • New EC2 smoke test (vllm_reranker_smoke_test.sh) and benchmark (vllm_reranker_benchmark_test.sh) under test/vllm/scripts/. Both load the model via Path C — --runner pooling + --hf_overrides {"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true} — so the standard Cohere-compatible /v1/rerank endpoint registers, then exercise it end-to-end.
  • New SageMaker endpoint test entry in vllm-model-tests.yml sagemaker: section, gated by required_image_pattern: amzn2023. Runs on ml.g6.xlarge (24 GB L4; same VRAM tier as g5.xlarge with better us-west-2 capacity).
  • Wire-up in .github/config/model-tests/vllm-model-tests.yml for smoke + benchmark on x86-g6xl-runner (1× L4, ~8 GB BF16 weights fit comfortably).
  • Model artifact at s3://dlc-cicd-models/llm-models/qwen3-reranker-4b.tar.gz (6.0 GiB).

Shared chat template

  • New test/vllm/scripts/amzn2023/qwen3_reranker_chat_template.jinja is the single source of truth.
  • EC2: copied into the container alongside other test scripts via the existing docker cp step in reusable-vllm-model-tests.yml and dispatch-vllm-benchmark.yml. Scripts read it from /models/qwen3_reranker_chat_template.jinja.
  • SageMaker: read at test runtime by test_sm_model_serving.py, flattened (newlines → {{ "\n" }}), shell-quoted, and injected as SM_VLLM_CHAT_TEMPLATE.
  • The SM test runner is moved to test/vllm/sagemaker/amzn2023/test_sm_model_serving.py and the Ubuntu PR workflow path filters are scoped to top-level files only, so amzn2023-only YAML changes no longer trigger Ubuntu builds.

SM-specific quoting workarounds

The vLLM AMZN2023 SageMaker container runs vLLM under standard-supervisor (from model-hosting-container-standards). It joins argv with single spaces into one string, then supervisord re-parses via shlex.split, which strips unprotected double quotes and breaks tokens on whitespace. Two SM env values needed shell-quote-protective wrapping:

  • SM_VLLM_HF_OVERRIDES — JSON literal wrapped in single quotes in the YAML value.
  • SM_VLLM_CHAT_TEMPLATE_flatten_jinja() switched to double-quoted Jinja \n expressions and outer single-quote-wraps the whole template at injection time.
    Upstream fix would be shlex.join() (or shlex.quote() in the join) at model_hosting_container_standards/supervisor/scripts/standard_supervisor.py:198.

Adjacent fixes (rolled in to keep CI green)

  • transformers<5.10 pin in docker/vllm/Dockerfile.amzn2023. transformers 5.10.0 added an unguarded prepare_inputs_layout → fetch_audio call that breaks voxtral with the MistralCommonFeatureExtractor shipped in mistral-common 1.11.2. Voxtral was passing on e7e19d66 (last build before transformers 5.10.x got resolved into the image) and failing after.
  • CVE allowlist updates for the existing mooncake libetcd_wrapper.so static-link issue:
    • CVE-2026-39821 (golang.org/x/net) added to vllm/ and vllm_server/.
    • CVE-2026-42504 (go/stdlib) added to vllm/ and vllm_server/.
    • Removed a stale duplicate CVE-2026-39821 v0.38.0 entry in vllm/.
  • /tmp cleanup in reusable-vllm-upstream-tests.yml. The x86-g6xl-runner CodeBuild fleet has a 7.6 GiB tmpfs that isn't always purged between jobs; earlier ray-ec2 jobs leak multi-GB /tmp/ray-ec2-* dirs, which crashed our cuda-test at "Checkout vLLM tests" with ENOSPC. Added a per-job step that removes known leakage patterns before the heavy steps.
  • Ubuntu PR workflow path filters — replaced test/vllm/scripts/** + !-negations (which dorny/paths-filter handles unreliably) with explicit enumeration of top-level scripts only. amzn2023-only changes (under test/vllm/scripts/amzn2023/, test/vllm/scripts/benchmark/, test/vllm/sagemaker/amzn2023/) no longer trigger Ubuntu vllm builds.

Test plan

  • EC2 smoke test-model (qwen3-reranker-4b) on PR - vLLM EC2 AMZN2023 — asserts top result is the relevant doc, descending score order, relevant-vs-irrelevant margin > 0.1, single-doc request works.
  • EC2 benchmark benchmark (qwen3-reranker-4b) on the dispatch workflow with min_rps: 20.
  • SM endpoint test-model (qwen3-reranker-4b) on PR - vLLM SageMaker AMZN2023 — deploys to ml.g6.xlarge, hits /v1/rerank, validates results[0].relevance_score.
  • voxtral-mini-4b smoke test passes on PR - vLLM EC2 AMZN2023 (transformers pin verified).
  • security-test / ecr-vulnerability-scan passes on PR - vLLM SageMaker AMZN2023 (CVE-2026-42504 allowlisted).
  • Ubuntu workflows (PR - vLLM EC2, PR - vLLM SageMaker) skip when only amzn2023-only paths change.

Add /v1/rerank tests for Qwen3-Reranker-4B on the AL2023 vLLM image, using
the synthetic Qwen3ForSequenceClassification load-time conversion so the
Cohere-compatible /v1/rerank endpoint registers (Path C in the upstream
qwen3-reranker example). The hf_overrides JSON and chat template are baked
into the scripts to keep YAML extra_args clean. Model uploaded to
s3://dlc-cicd-models/llm-models/qwen3-reranker-4b.tar.gz.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Yadan Wei and others added 18 commits May 27, 2026 16:32
…betcd_wrapper.so)

Same pattern as the existing CVE-2026-33814 / 39820 entries: golang.org/x/net is
statically linked into mooncake libetcd_wrapper.so and cannot be patched without
a mooncake-transfer-engine rebuild. Unblocks the security-scan pipeline stage.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The failing security-test/ecr-vulnerability-scan job uses framework=vllm_server,
which reads test/security/data/ecr_scan_allowlist/vllm_server/framework_allowlist.json.
Mirrors the entry already added to vllm/ for the same mooncake libetcd_wrapper.so finding.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
First benchmark run on x86-g6xl-runner (1x L4) produced 41.7 rps for
50 requests / concurrency 4 / 8 docs per request (p99 0.107s). Setting
min_rps to ~50% of observed catches real regressions while leaving
headroom for environmental noise.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Adds a SageMaker /v1/rerank endpoint test for qwen3-reranker-4b on
amzn2023 images and shares the qwen3 reranker chat template across
EC2 + SM via the existing test_fixtures plumbing.

- Upload qwen3_reranker.jinja to s3://dlc-cicd-models/test-fixtures/
  chat-templates/ as the single source of truth.
- EC2 smoke + benchmark scripts now read the template from
  /models/test-fixtures/qwen3_reranker.jinja (downloaded by the
  reusable workflow) instead of writing it inline via heredoc.
- SM test runner downloads the fixture via boto3, flattens newlines
  to {{ '\n' }} so the template survives the SM entrypoint's
  line-oriented env-var loop, and injects it via SM_VLLM_CHAT_TEMPLATE.
  vLLM's --chat-template falls back to inline Jinja for non-path
  values containing '{', '}', or newline; the {{ '\n' }} expressions
  render to real newlines at request time, reproducing the original
  template byte-for-byte.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Temporary diagnostic to identify what is filling /tmp on the
x86-g6xl-runner CodeBuild fleet during cuda-test. Captures df -h and
du output at four checkpoints: after DLC checkout, after container
pull, after vllm_source checkout, after setup, and after tests.

Will be reverted once we have the data.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The x86-g6xl-runner CodeBuild fleet has a 7.6 GiB tmpfs at /tmp that
is reused across jobs and is not always purged. Earlier ray ec2 jobs
leak multi-GB /tmp/ray-ec2-* dirs (1.1-1.9 GiB each), filling /tmp to
100% before our cuda/regression/example-test jobs even start. The
"Checkout vLLM tests" step then dies with ENOSPC and crashes the
GHA worker daemon.

Add a per-job cleanup step that removes known leakage patterns
(ray-ec2-*, agent-log, mcetmp*, ray_test_images) before the heavy
container pull and vllm checkout steps. df -h is logged before and
after for visibility.

A more durable fix would broaden the buildspec pre_build cleanup to
match these patterns, but that affects every fleet-using pipeline
and warrants a separate review.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The qwen3 reranker chat template now lives in the repo at
test/vllm/scripts/amzn2023/qwen3_reranker_chat_template.jinja
instead of s3://dlc-cicd-models/test-fixtures/chat-templates/.
EC2 and SageMaker both consume the same file:

- EC2: model-test and benchmark workflows docker cp *.jinja from
  scripts/amzn2023/ into /models/ alongside the test scripts.
  Reranker scripts read from /models/qwen3_reranker_chat_template.jinja.
- SageMaker: test_sm_model_serving.py reads the file directly from
  the repo, flattens newlines to {{ '\n' }}, and injects it as
  SM_VLLM_CHAT_TEMPLATE. No more boto3 download, no more
  test_fixtures plumbing for the SM path.

Drops _fetch_fixture_text helper and the inline_chat_template_fixture
indirection. The orphaned S3 object can be cleaned up separately.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…builds for amzn2023-only changes

Two unrelated CI fixes, packaged together since they're both YAML-only:

1. SM reranker instance: ml.g5.xlarge -> ml.g6.xlarge for the
   qwen3-reranker-4b sagemaker entry. ml.g5.xlarge in us-west-2 has
   been hanging at endpoint provisioning (37+ min on first test of the
   sequential batch in run 26867713663 / 26869554970, vs ~15 min on
   the 2026-06-02 baseline). g6.xlarge has the same 24 GiB VRAM (L4 vs
   A10G) and 4 vCPU / 16 GiB host RAM as g5.xlarge — same-tier
   replacement, no other config changes needed for the 8 GiB BF16
   model with max-model-len=10000 and gpu-memory-utilization=0.85.

2. Ubuntu vLLM PR workflows: the on: pull_request: paths trigger in
   pr-vllm-ec2.yml and pr-vllm-sagemaker.yml already negates
   test/vllm/scripts/amzn2023/** and test/vllm/scripts/benchmark/**,
   but the dorny/paths-filter "build-change" gate further down does
   not. PRs that touch both an Ubuntu-relevant path AND an amzn2023
   script were rebuilding the Ubuntu image unnecessarily. Mirror the
   same negations into the build-change filter.
   Verified that test/vllm/sagemaker/** must NOT be excluded — it is
   consumed by reusable-vllm-sagemaker-tests.yml line 47 + 53, which
   pr-vllm-sagemaker.yml line 259 invokes.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Move the config-driven SageMaker test runner under
test/vllm/sagemaker/amzn2023/ since every entry it exercises today
gates on required_image_pattern: amzn2023, and tighten the Ubuntu PR
workflow filters so they only fire on paths Ubuntu actually uses.
Replace !-negation patterns with explicit enumeration since negation
in dorny/paths-filter (and sometimes GHA's own filter) is unreliable.

- mv test/vllm/sagemaker/test_sm_model_serving.py
     test/vllm/sagemaker/amzn2023/test_sm_model_serving.py
  (Path(__file__).parents[3] -> parents[4] to keep the repo-root
  resolution correct after the deeper nesting.)

- pr-vllm-sagemaker.yml triggers + check-changes:
  test/vllm/sagemaker/** -> test/vllm/sagemaker/*.py and
  test/vllm/sagemaker/requirements.txt. Ubuntu still runs
  test_sm_endpoint.py (DeepSeek), but no longer fires when only
  amzn2023/test_sm_model_serving.py changes.

- pr-vllm-ec2.yml + pr-vllm-sagemaker.yml: enumerate
  scripts/vllm/{dockerd_entrypoint.sh,sagemaker_entrypoint.sh,
  sagemaker_serve.py} explicitly instead of scripts/vllm/** plus
  !scripts/vllm/amzn2023/** + !scripts/vllm/omni_*. Same for
  test/vllm/scripts/*.sh instead of test/vllm/scripts/** plus
  !amzn2023/** + !benchmark/**.

- pr-vllm-sagemaker-amzn2023.yml is unchanged: it keeps
  test/vllm/sagemaker/** so the new amzn2023 subdir + the existing
  top-level test_sm_endpoint.py both flow through. The reusable
  workflow's `pytest vllm/sagemaker` recurses, so amzn2023 picks up
  both files automatically.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The v0.48.0 entry covers the same CVE on the current
mooncake-transfer-engine; the v0.38.0 entry was a stale duplicate
from the prior mooncake version.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The SM endpoint container runs vllm via standard-supervisor (provided
by model-hosting-container-standards). standard-supervisor joins argv
with a single space into one string and writes it as the supervisord
[program:app] command= field. supervisord then re-parses with
shlex.split, which strips unprotected double quotes — so a JSON value
like {"architectures":["X"]} arrives at vllm's argparse as
{architectures:[X]} and json.loads rejects it with:

  argument --hf-overrides: Value {architectures:[...]} cannot be
  converted to <function loads at 0x...>

Wrap the JSON in literal single quotes inside the YAML value so the
joined command string is `--hf-overrides '{...}'`. shlex.split then
strips the single quotes and emits the JSON as one argv element with
inner double quotes intact.

Upstream fix would be to add shlex.quote() to standard-supervisor's
argv joining at scripts/standard_supervisor.py:198, or to use
shlex.join() instead of " ".join().

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
transformers v5.10.0 added a prepare_inputs_layout helper to
ProcessorMixin (processing_utils.py:704) that calls
self.feature_extractor.fetch_audio(...) without first checking
hasattr. mistral-common 1.11.2's MistralCommonFeatureExtractor
(used by vllm 0.22.1rc0 for voxtral) does not implement fetch_audio,
so engine startup raises:

  AttributeError: 'MistralCommonFeatureExtractor' object has no
  attribute 'fetch_audio'

Pin to <5.10 until either (a) transformers adds a hasattr guard or
(b) mistral-common implements fetch_audio. v5.9.0's processing_utils
does not contain prepare_inputs_layout, so the bad code path is
absent.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…rapper.so)

Same root cause as the existing mooncake-pinned go/stdlib CVEs in
this allowlist: go/stdlib is statically linked into
/opt/venv/lib/python3.12/site-packages/mooncake/libetcd_wrapper.so
and cannot be patched without a mooncake-transfer-engine rebuild.

Description: decoding a maliciously-crafted MIME header containing
many invalid encoded-words can consume excessive CPU.
Pin fix would be go/stdlib>=1.26.4.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Mirror of the entry I added to vllm/framework_allowlist.json earlier;
the AMZN2023 PR workflow (pr-vllm-sagemaker-amzn2023.yml,
pr-vllm-ec2-amzn2023.yml) reads the vllm_server/ allowlist while the
Ubuntu workflow reads the vllm/ allowlist, so both paths need the
same entry.

Same root cause as the other mooncake-pinned go/stdlib entries in
this allowlist: go/stdlib 1.24.11 is statically linked into
/opt/venv/lib/python3.12/site-packages/mooncake/libetcd_wrapper.so
and cannot be patched without a mooncake-transfer-engine rebuild.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Same root cause as the prior --hf-overrides shell-quote fix: inside
the SM container, standard-supervisor (model-hosting-container-
standards) joins argv with single spaces and supervisord re-splits
via shlex, stripping unprotected quotes and breaking on whitespace.

The chat template runs through the same path, and the Jinja
expressions ({{ '\n' }}) plus the literal angle-brackets/whitespace
in the template body break apart on shlex.split:

  api_server.py: error: unrecognized arguments: }}Judge whether the
  Document meets the requirements based on the Query and the Instruct
  provided. ...

Two changes in _flatten_jinja:
- Use double-quoted Jinja string literals ({{ "\n" }}) instead of
  single-quoted ({{ '\n' }}) so the template body contains no inner
  single quotes that would clash with the outer shell-quote wrapping.
- Wrap the entire flattened value in literal single quotes so
  supervisord's shlex.split treats it as one argv element with the
  inner content (angle brackets, double quotes, spaces) intact.

The function now raises if the template happens to contain a single
quote, since that would still clash with the outer wrapping.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant