feat(tasks): add public model benchmark tags #1230
Merged
Luodian merged 11 commits into EvolvingLMMs-Lab:main on Mar 7, 2026
Conversation
Commits in this PR include:

- fix(vllm): support TP+DP dispatch and model-side max_new_tokens precedence. Syncs TP ranks under external_launcher and keeps max_new_tokens at least as large as the model-side setting so reasoning outputs are not task-capped; injects a padding request when a rank receives zero docs and aligns request/filter synchronization across ranks so TP+DP jobs with limit <= world_size no longer crash or hang.
- …5-printer-compat fix: pin wandb 0.25 and support printer API rename
Summary
This PR adds curated tag metadata that maps runnable lmms_eval image tasks to model families whose official public evaluation materials highlight those benchmarks. The goal is to make it possible to select a practical, vendor-aligned benchmark slice directly from YAML using the existing TaskManager tag expansion, without introducing any new evaluator logic.

What changed
This PR adds four model-family tags:

- public_eval_qwen3_5_family
- public_eval_seed2_family
- public_eval_gemini3_family
- public_eval_gpt5_family

It applies those tags to the corresponding task YAMLs (see the example below) and adds documentation for the naming convention, scope, and maintenance rules in docs/guides/model_benchmark_tags.md.
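For illustration, a hypothetical excerpt of what the tag entry might look like in a task YAML; the actual file layout in lmms_eval may differ, and mmmu_val's two tags here follow the expansion table below:

```yaml
# Illustrative excerpt only; real file path and surrounding fields may differ.
task: mmmu_val
tag:
  - public_eval_qwen3_5_family  # MMMU (validation) is in the Qwen3.5-family slice
  - public_eval_gpt5_family     # and in the GPT-5-family slice
# remaining fields (dataset_path, metrics, generation settings) stay unchanged
```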
Current tag expansion

public_eval_qwen3_5_family
- ai2d -> lmms-lab/ai2d (test)
- charxiv_val_reasoning -> princeton-nlp/CharXiv (validation)
- hallusion_bench_image -> lmms-lab/HallusionBench (image)
- mathvista_testmini_cot -> AI4Math/MathVista (testmini)
- mathvista_testmini_format -> AI4Math/MathVista (testmini)
- mathvista_testmini_solution -> AI4Math/MathVista (testmini)
- mmbench_en_dev -> lmms-lab/MMBench config en (dev)
- mmlongbench_doc -> yubo2333/MMLongBench-Doc (train)
- mmmu_pro_standard -> MMMU/MMMU_Pro config standard (10 options) (test)
- mmmu_pro_vision -> MMMU/MMMU_Pro config vision (test)
- mmmu_val -> lmms-lab/MMMU (validation)
- mmstar -> Lin-Chen/MMStar (val)
- ocrbench -> echo840/OCRBench (test)
- omnidocbench -> ouyanglinke/OmniDocBench_tsv (train)
- realworldqa -> lmms-lab/RealWorldQA (test)

public_eval_seed2_family
- charxiv_val_descriptive -> princeton-nlp/CharXiv (validation)
- charxiv_val_reasoning -> princeton-nlp/CharXiv (validation)
- mathvision_testmini -> MathLLMs/MathVision (testmini)
- mmlongbench_doc -> yubo2333/MMLongBench-Doc (train)
- ocrbench_v2 -> ling99/OCRBench_v2 (test)
- omnidocbench -> ouyanglinke/OmniDocBench_tsv (train)

public_eval_gemini3_family
- charxiv_val_reasoning -> princeton-nlp/CharXiv (validation)
- mmmu_pro_standard -> MMMU/MMMU_Pro config standard (10 options) (test)
- mmmu_pro_vision -> MMMU/MMMU_Pro config vision (test)
- omnidocbench -> ouyanglinke/OmniDocBench_tsv (train)

public_eval_gpt5_family
- mmmu_val -> lmms-lab/MMMU (validation)

Why this helps
Today, the repo already supports tag as a first-class task selector, but there is no built-in way to ask for "the public benchmark slice most associated with model family X". With these tags in place, users can run commands like the one sketched below.
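The command block from the original description did not survive extraction; this is an illustrative invocation, assuming the standard lmms_eval CLI flags and using the qwen2_5_vl adapter purely as an example (the model name and model_args are placeholders):

```bash
# Expand a curated family slice by tag, exactly as if it were a task name.
# Model and model_args are placeholders; any lmms_eval model adapter works.
python -m lmms_eval \
    --model qwen2_5_vl \
    --model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
    --tasks public_eval_qwen3_5_family \
    --batch_size 1 \
    --output_path ./logs/
```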
This keeps the implementation simple, avoids adding hard-coded model-specific logic in Python, and makes future curation incremental at the YAML layer.
Validation
- pre-commit run --all-files
- TaskManager indexes and expands all four new tags successfully
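A minimal sketch of how the second check could be reproduced; it assumes lmms_eval's TaskManager follows the lm-eval-harness design (a task index plus a match_tasks resolver), so the exact method names are an assumption and may differ by version:

```python
# Hypothetical tag-expansion check; assumes TaskManager mirrors the
# lm-eval-harness API (match_tasks). Adjust names to the real interface.
from lmms_eval.tasks import TaskManager

task_manager = TaskManager()

for tag in (
    "public_eval_qwen3_5_family",
    "public_eval_seed2_family",
    "public_eval_gemini3_family",
    "public_eval_gpt5_family",
):
    expanded = task_manager.match_tasks([tag])  # tag -> concrete task names
    assert expanded, f"{tag} did not expand to any tasks"
    print(f"{tag}: {len(expanded)} tasks")
```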