
feat: add Multi-Level Existence Benchmark (MLE-Bench)#1228

Merged
Luodian merged 2 commits into EvolvingLMMs-Lab:main from ryf1123:feat/add-mle-bench
Mar 7, 2026

Conversation

ryf1123 (Contributor) commented Mar 7, 2026

Adds evaluation support for the Multi-Level Existence Benchmark (MLE-Bench), introduced in the ICLR 2026 Oral paper:

Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
Junlin Han, Shengbang Tong, David Fan, Yufan Ren, Koustuv Sinha, Philip Torr, Filippos Kokkinos
ICLR 2026 (Oral) — https://openreview.net/forum?id=pfw176o1YJ
Project page: https://junlinhan.github.io/projects/lsbs/
Dataset: https://huggingface.co/datasets/JunlinHan/Multi-Level_Existence_Bench

Summary

  • MLE-Bench is a visual perception benchmark of 1,861 images, each paired with a 4-choice question about object existence
  • Questions are stratified by target object size (small 0–30% / medium 30–60% / large 60–100% of image area)
  • Evaluates "pure" perception ability independent of complex reasoning

In scope

  • New task directory lmms_eval/tasks/mle_bench/ with utils.py and YAML configs
  • 4 tasks: mle_bench (full), mle_bench_small, mle_bench_medium, mle_bench_large
  • Metrics: per-category accuracy (small / medium / large) + macro-average
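
The per-category accuracy and macro-average metrics can be sketched as follows. This is an illustrative Python sketch, not the actual utils.py; the function name, record fields, and category labels are assumptions for illustration:

```python
from collections import defaultdict

def aggregate_results(records):
    """Aggregate per-sample results into per-category accuracy plus macro-average.

    records: iterable of dicts with a 'category' key (one of 'small', 'medium',
    'large') and a boolean 'correct' key. Field names are hypothetical.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        hits[r["category"]] += int(r["correct"])
    # Per-category accuracy: fraction of correct answers within each size bucket.
    per_cat = {c: hits[c] / totals[c] for c in totals}
    # Macro-average: unweighted mean over categories, so each size bucket
    # counts equally regardless of its sample count.
    per_cat["macro_avg"] = sum(per_cat[c] for c in totals) / len(totals)
    return per_cat
```

The macro-average weights each size bucket equally, so the small-object subset (the hardest and largest) cannot dominate the headline number.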

Out of scope

  • No changes to existing tasks, models, or core framework code

Validation

  • Command: python -m lmms_eval --model openai --model_args model=glm-4v-flash,base_url=https://open.bigmodel.cn/api/paas/v4/ --tasks mle_bench --limit 100
  • Sample size: N=100 (stratified)
  • Key metrics: small=91.2%, medium=93.9%, large=93.9%, overall=93.0%
  • Result: pass

Risk / Compatibility

  • No breaking changes; purely additive new task directory
  • Dataset is publicly available on HuggingFace under CC-BY-NC-4.0

Type of Change

  • [ ] Bug fix (non-breaking change)
  • [ ] New feature
  • [x] New benchmark/task
  • [ ] New model integration
  • [ ] Breaking change
  • [ ] Documentation update
  • [ ] Refactoring (no functional changes)

MLE-Bench evaluates fine-grained visual perception in multimodal models
using 4-choice questions about object existence, categorised by the
target object's relative size (proportion of image pixels occupied):

  - small  (existence_0-30):   732 samples, objects occupying 0-30% of image
  - medium (existence_30-60):  698 samples, objects occupying 30-60% of image
  - large  (existence_60-100): 431 samples, objects occupying 60-100% of image
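
The stratification above amounts to bucketing each question by the fraction of image pixels the target object occupies. A minimal sketch, with the boundary edge handling (lower edge inclusive for medium and large) as an assumption; the dataset's exact edge convention is not stated here:

```python
def size_bucket(area_fraction: float) -> str:
    """Map an object's image-area fraction to an MLE-Bench size bucket.

    Bucket names mirror the dataset's existence_0-30 / existence_30-60 /
    existence_60-100 splits; edge handling is an assumption.
    """
    if not 0.0 <= area_fraction <= 1.0:
        raise ValueError("area fraction must be in [0, 1]")
    if area_fraction < 0.30:
        return "small"   # existence_0-30
    if area_fraction < 0.60:
        return "medium"  # existence_30-60
    return "large"       # existence_60-100
```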

Dataset: https://huggingface.co/datasets/JunlinHan/Multi-Level_Existence_Bench

Tasks added:
  - mle_bench        : full evaluation (1,861 samples)
  - mle_bench_small  : small-object subset
  - mle_bench_medium : medium-object subset
  - mle_bench_large  : large-object subset

Metrics: per-category accuracy (small / medium / large) + macro-average
Luodian (Contributor) commented Mar 7, 2026

@claude please lint this

claude bot commented Mar 7, 2026

Claude Code is working…

I'll analyze this and get back to you.


@Luodian Luodian merged commit 06221cc into EvolvingLMMs-Lab:main Mar 7, 2026
3 checks passed
