feat(tasks): add public model benchmark tags #1230

Merged
Luodian merged 11 commits into EvolvingLMMs-Lab:main from Luodian:codex/model-benchmark-tags
Mar 7, 2026

@Luodian Luodian commented Mar 7, 2026

Summary

This PR adds curated tag metadata that maps runnable lmms_eval image tasks to model families whose official public evaluation materials highlight those benchmarks.

The goal is to make it possible to select a practical, vendor-aligned benchmark slice directly from YAML using existing TaskManager tag expansion, without introducing any new evaluator logic.

What changed

This PR adds four model-family tags:

  • public_eval_qwen3_5_family
  • public_eval_seed2_family
  • public_eval_gemini3_family
  • public_eval_gpt5_family

It applies those tags to the corresponding task YAMLs and adds documentation for the naming convention, scope, and maintenance rules in docs/guides/model_benchmark_tags.md.

Current tag expansion

public_eval_qwen3_5_family

  • ai2d -> lmms-lab/ai2d (test)
  • charxiv_val_reasoning -> princeton-nlp/CharXiv (validation)
  • hallusion_bench_image -> lmms-lab/HallusionBench (image)
  • mathvista_testmini_cot -> AI4Math/MathVista (testmini)
  • mathvista_testmini_format -> AI4Math/MathVista (testmini)
  • mathvista_testmini_solution -> AI4Math/MathVista (testmini)
  • mmbench_en_dev -> lmms-lab/MMBench config en (dev)
  • mmlongbench_doc -> yubo2333/MMLongBench-Doc (train)
  • mmmu_pro_standard -> MMMU/MMMU_Pro config standard (10 options) (test)
  • mmmu_pro_vision -> MMMU/MMMU_Pro config vision (test)
  • mmmu_val -> lmms-lab/MMMU (validation)
  • mmstar -> Lin-Chen/MMStar (val)
  • ocrbench -> echo840/OCRBench (test)
  • omnidocbench -> ouyanglinke/OmniDocBench_tsv (train)
  • realworldqa -> lmms-lab/RealWorldQA (test)

public_eval_seed2_family

  • charxiv_val_descriptive -> princeton-nlp/CharXiv (validation)
  • charxiv_val_reasoning -> princeton-nlp/CharXiv (validation)
  • mathvision_testmini -> MathLLMs/MathVision (testmini)
  • mmlongbench_doc -> yubo2333/MMLongBench-Doc (train)
  • ocrbench_v2 -> ling99/OCRBench_v2 (test)
  • omnidocbench -> ouyanglinke/OmniDocBench_tsv (train)

public_eval_gemini3_family

  • charxiv_val_reasoning -> princeton-nlp/CharXiv (validation)
  • mmmu_pro_standard -> MMMU/MMMU_Pro config standard (10 options) (test)
  • mmmu_pro_vision -> MMMU/MMMU_Pro config vision (test)
  • omnidocbench -> ouyanglinke/OmniDocBench_tsv (train)

public_eval_gpt5_family

  • mmmu_val -> lmms-lab/MMMU (validation)
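
The expansion lists above can be treated as plain data to see how the four slices overlap. The following is a minimal sketch, not code from this PR: the dictionary is transcribed by hand from the lists above, and the names `PUBLIC_EVAL_TAGS`, `shared_tasks`, and `all_tasks` are hypothetical helpers introduced only for illustration.

```python
# Tag -> task names, transcribed by hand from the expansion lists in this PR
# description. This is illustrative data, not something lmms_eval exports.
PUBLIC_EVAL_TAGS = {
    "public_eval_qwen3_5_family": {
        "ai2d", "charxiv_val_reasoning", "hallusion_bench_image",
        "mathvista_testmini_cot", "mathvista_testmini_format",
        "mathvista_testmini_solution", "mmbench_en_dev", "mmlongbench_doc",
        "mmmu_pro_standard", "mmmu_pro_vision", "mmmu_val", "mmstar",
        "ocrbench", "omnidocbench", "realworldqa",
    },
    "public_eval_seed2_family": {
        "charxiv_val_descriptive", "charxiv_val_reasoning",
        "mathvision_testmini", "mmlongbench_doc", "ocrbench_v2",
        "omnidocbench",
    },
    "public_eval_gemini3_family": {
        "charxiv_val_reasoning", "mmmu_pro_standard", "mmmu_pro_vision",
        "omnidocbench",
    },
    "public_eval_gpt5_family": {"mmmu_val"},
}


def shared_tasks(*tags):
    """Return the tasks that every one of the given tags expands to."""
    return set.intersection(*(PUBLIC_EVAL_TAGS[tag] for tag in tags))


def all_tasks():
    """Return the union of tasks covered by any of the four tags."""
    return set().union(*PUBLIC_EVAL_TAGS.values())


if __name__ == "__main__":
    common = shared_tasks(
        "public_eval_qwen3_5_family",
        "public_eval_seed2_family",
        "public_eval_gemini3_family",
    )
    print("shared by qwen3_5, seed2, gemini3:", sorted(common))
    print("distinct tasks across all tags:", len(all_tasks()))
```

Because several tasks carry more than one tag, selecting multiple family tags in one run covers fewer distinct tasks than the per-tag counts would suggest.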

Why this helps

Today, the repo already supports tag as a first-class task selector, but there is no built-in way to ask for "the public benchmark slice most associated with model family X".

With these tags in place, users can run commands like:

uv run python -m lmms_eval --tasks public_eval_qwen3_5_family

This keeps the implementation simple, avoids adding hard-coded model-specific logic in Python, and makes future curation incremental at the YAML layer.
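
To illustrate why curation can stay at the YAML layer, here is a simplified stand-in for tag expansion. It is a sketch under stated assumptions and not lmms_eval's actual TaskManager code: the config dictionaries mimic a task YAML with only `task` and `tag` fields (the exact field layout is assumed), and `build_tag_index` and `expand_selectors` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical, trimmed-down task configs. The tag assignments follow the
# expansion lists in this PR; the dict layout itself is an assumption.
TASK_CONFIGS = [
    {"task": "mmmu_val",
     "tag": ["public_eval_qwen3_5_family", "public_eval_gpt5_family"]},
    {"task": "omnidocbench",
     "tag": ["public_eval_qwen3_5_family", "public_eval_seed2_family",
             "public_eval_gemini3_family"]},
    # A single tag written as a bare string rather than a list.
    {"task": "ocrbench_v2", "tag": "public_eval_seed2_family"},
]


def build_tag_index(configs):
    """Map each tag to the task names that declare it, in config order."""
    index = defaultdict(list)
    for cfg in configs:
        tags = cfg.get("tag", [])
        if isinstance(tags, str):
            tags = [tags]  # normalise a single tag to a one-element list
        for tag in tags:
            index[tag].append(cfg["task"])
    return dict(index)


def expand_selectors(selectors, tag_index):
    """Expand tags to their tasks; pass other names through as task names."""
    expanded = []
    for name in selectors:
        for task in tag_index.get(name, [name]):
            if task not in expanded:  # drop duplicates, keep first-seen order
                expanded.append(task)
    return expanded


TAG_INDEX = build_tag_index(TASK_CONFIGS)

if __name__ == "__main__":
    print(expand_selectors(["public_eval_seed2_family", "mmmu_val"], TAG_INDEX))
```

In this toy model, adding a task to a family slice means appending one tag to that task's config; the expansion function never changes, which is the property the PR relies on.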

Validation

  • pre-commit run --all-files
  • Verified that TaskManager indexes and expands all four new tags successfully

Luodian and others added 11 commits March 4, 2026 13:41
Sync TP ranks under external_launcher and keep max_new_tokens at least as large as the model-side setting so reasoning outputs are not task-capped.
Inject a padding request when a rank receives zero docs and align request/filter synchronization across ranks so TP+DP jobs with limit<=world_size no longer crash or hang.
fix(vllm): support TP+DP dispatch and model-side max_new_tokens precedence
…5-printer-compat

fix: pin wandb 0.25 and support printer API rename
@Luodian Luodian changed the title from "[codex] add public model benchmark tags" to "feat(tasks): add public model benchmark tags" on Mar 7, 2026
@Luodian Luodian merged commit 1233094 into EvolvingLMMs-Lab:main Mar 7, 2026
2 checks passed