
feat: SGLang refactor, distributed eval fixes, and cache simplification#1253

Merged
Luodian merged 8 commits into main from brianli/sglang-distributed-eval
Mar 15, 2026

Conversation

Contributor

@Luodian Luodian commented Mar 15, 2026

Summary

Follow-up to #1247. Refactors the SGLang model wrapper, fixes distributed eval (TP+DP), simplifies the response cache, and hardens dataset loading for remote filesystems.

SGLang model wrapper refactor

  • Remove qwen_vl_utils dependency from generic wrapper — no longer needed for non-Qwen models
  • Pass per-request image_data to Engine.generate() instead of flattening across the batch
  • Initialize _config with AutoConfig instead of returning the processor object
  • Patch torchvision.io.read_video fallback when video_fps metadata is missing
  • Pass flat image list to Engine.generate() instead of nested lists
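Of the wrapper fixes above, the `read_video` fallback is the most self-contained. A minimal sketch of the pattern, written as a generic wrapper rather than the PR's actual patch (the function name and the `default_fps=24.0` value are illustrative assumptions):

```python
def with_fps_fallback(read_video, default_fps=24.0):
    """Wrap a torchvision.io.read_video-style callable so the metadata
    dict it returns always carries a 'video_fps' key, filling in
    default_fps when the container omits the field (the case this PR
    patches around). default_fps=24.0 is an assumed placeholder."""
    def wrapped(*args, **kwargs):
        frames, audio, info = read_video(*args, **kwargs)
        if "video_fps" not in info:
            info = {**info, "video_fps": default_fps}
        return frames, audio, info
    return wrapped
```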

Distributed eval (TP+DP)

  • Use global rank in model wrappers for correct TP+DP data dispatch
  • Add Slurm-aware progress reporting for batch jobs (lmms_eval/models/model_utils/progress.py)
  • Redirect HF datasets cache to local scratch directory on remote filesystems to avoid NFS file-lock contention
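The cache-redirect bullet can be sketched roughly as follows. `HF_DATASETS_CACHE` is the standard environment variable the `datasets` library reads at import time; the per-user scratch layout and function name here are assumptions, not the PR's actual code:

```python
import os
import getpass

def redirect_hf_datasets_cache(scratch_root="/tmp"):
    """Point the HF datasets cache at node-local scratch so its filelock
    traffic stays off NFS. Must run before `import datasets`, which reads
    HF_DATASETS_CACHE once at import. The per-user subdirectory naming is
    an illustrative assumption."""
    cache_dir = os.path.join(scratch_root, f"hf_datasets_{getpass.getuser()}")
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["HF_DATASETS_CACHE"] = cache_dir
    return cache_dir
```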

Response cache simplification

  • Simplify cache lifecycle to single create / finalize API (removes segment/seal complexity)
  • Context-length and batch-size tuning for long-context thinking models
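A minimal sketch of what a single create/finalize lifecycle could look like (class and method names are illustrative, not the PR's actual API): entries accumulate while the cache is open and are written exactly once, so there is no intermediate segment/seal state to track.

```python
import json

class ResponseCache:
    """Illustrative create/finalize cache: put() while open, one write
    on finalize(), no segments to seal in between."""

    def __init__(self, path):
        self.path = path
        self._entries = {}
        self._finalized = False

    @classmethod
    def create(cls, path):
        return cls(path)

    def put(self, key, response):
        if self._finalized:
            raise RuntimeError("cache already finalized")
        self._entries[key] = response

    def get(self, key, default=None):
        return self._entries.get(key, default)

    def finalize(self):
        with open(self.path, "w") as f:
            json.dump(self._entries, f)
        self._finalized = True
```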

Tests

  • Expanded response cache tests for simplified API
  • Filelock cross-class singleton regression test
  • Task dataset cache redirect test

Deps

  • Add torchcodec to pyproject.toml video extras

Test plan

  • SGLang wrapper tested on Qwen3.5-4B with TP=2 DP=16 (4-node FluidStack)
  • Cache simplification verified with 31-task eval run
  • Filelock patch regression test passes
  • Dataset cache redirect tested on remote FS (NFS/VAST)

Depends on #1247

Luodian and others added 8 commits March 15, 2026 14:49
Merge dummy_video_reader into a single dummy model that serves both use cases:
- Default mode: instant no-op responses for dataset hydration and task smoke tests
- Video-bench mode (read_bytes/decode_num_frames > 0): full IO/decode latency tracking

The old name dummy_video_reader is kept as a MODEL_ALIASES alias for backward compat.
… inputs

SGLang's Engine runs its own Qwen3-VL processor internally. When
lmms-eval pre-tokenized inputs with the HF processor and passed the
expanded input_ids to SGLang, pad tokens were expanded twice, causing
IndexError on image inputs and potential failures on video inputs.

- Image path: pass prompt text directly to Engine.generate() instead of
  pre-tokenized input_ids, letting SGLang handle tokenization end-to-end
- Video path: pass prompt text + video_data to Engine.generate() using
  SGLang's native video support instead of pre-tokenizing and swapping
  video tokens to image tokens
- Fix tools check: use truthy check instead of 'is not None' so empty
  list from disabled MCP does not trigger tool-handling code paths
- Fix tools param: pass tools=None instead of tools=[] to
  apply_chat_template to avoid unexpected preprocessing
- Lazy-import MCP deps: avoid ImportError at module load when mcp
  package is not installed
- Broaden optional metric imports: catch Exception instead of
  ImportError so numpy/spacy binary incompatibilities do not crash
  metric aggregation for unrelated tasks
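The two tools fixes above reduce to one pattern, sketched here with an assumed helper name: treat an empty list (what a disabled MCP setup yields) the same as no tools at all, and normalize it to None before it reaches apply_chat_template.

```python
def normalize_tools(tools):
    """Truthy check rather than `is not None`: an empty list from a
    disabled MCP setup must neither trigger tool-handling code paths nor
    be forwarded as tools=[] to apply_chat_template, so both [] and None
    collapse to None. The helper name is illustrative."""
    return tools if tools else None
```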
SGLang model wrapper:
- Remove qwen_vl_utils dependency from generic wrapper
- Pass per-request image_data instead of flattening across batch
- Initialize _config with AutoConfig instead of returning processor
- Patch torchvision read_video missing video_fps fallback
- Pass flat image list to Engine.generate instead of nested lists

Distributed eval:
- Use global rank in model wrappers for correct TP+DP dispatch
- Add Slurm-aware progress reporting for batch jobs
- Redirect HF datasets cache to local scratch on remote FS

Response cache:
- Simplify to single create/finalize API
- Context-length and batch-size tuning for thinking models

Tests:
- Expanded cache tests for simplified API
- Filelock cross-class singleton regression test
- Task dataset cache redirect test

Deps:
- Add torchcodec to pyproject.toml
@Luodian Luodian merged commit 9e69834 into main Mar 15, 2026
2 checks passed