
feat: add Multi-Level Existence Benchmark (MLE-Bench)#1228

Merged
Luodian merged 2 commits into EvolvingLMMs-Lab:main from ryf1123:feat/add-mle-bench
Mar 7, 2026

Conversation

ryf1123 (Contributor) commented Mar 7, 2026

Adds evaluation support for the Multi-Level Existence Benchmark (MLE-Bench), introduced in the ICLR 2026 Oral paper:

Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
Junlin Han, Shengbang Tong, David Fan, Yufan Ren, Koustuv Sinha, Philip Torr, Filippos Kokkinos
ICLR 2026 (Oral) — https://openreview.net/forum?id=pfw176o1YJ
Project page: https://junlinhan.github.io/projects/lsbs/
Dataset: https://huggingface.co/datasets/JunlinHan/Multi-Level_Existence_Bench

Summary

  • MLE-Bench is a visual perception benchmark of 1,861 images, each paired with a 4-choice question about object existence
  • Questions are stratified by target object size (small 0–30% / medium 30–60% / large 60–100% of image area)
  • Evaluates "pure" perception ability independent of complex reasoning

In scope

  • New task directory lmms_eval/tasks/mle_bench/ with utils.py and YAML configs
  • 4 tasks: mle_bench (full), mle_bench_small, mle_bench_medium, mle_bench_large
  • Metrics: per-category accuracy (small / medium / large) + macro-average
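
The per-category accuracy and macro-average metrics can be sketched as follows. This is an illustrative Python sketch, not the actual utils.py; the function name, record fields, and category labels are assumptions for illustration:

```python
from collections import defaultdict

def aggregate_results(records):
    """Aggregate per-sample results into per-category accuracy plus macro-average.

    records: iterable of dicts with a 'category' key (one of 'small', 'medium',
    'large') and a boolean 'correct' key. Field names are hypothetical.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        hits[r["category"]] += int(r["correct"])
    # Per-category accuracy: fraction of correct answers within each size bucket.
    per_cat = {c: hits[c] / totals[c] for c in totals}
    # Macro-average: unweighted mean over categories, so each size bucket
    # counts equally regardless of its sample count.
    per_cat["macro_avg"] = sum(per_cat[c] for c in totals) / len(totals)
    return per_cat
```

The macro-average weights each size bucket equally, so the small-object subset (the hardest and largest) cannot dominate the headline number.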

Out of scope

  • No changes to existing tasks, models, or core framework code

Validation

  • Command: python -m lmms_eval --model openai --model_args model=glm-4v-flash,base_url=https://open.bigmodel.cn/api/paas/v4/ --tasks mle_bench --limit 100
  • Sample size: N=100 (stratified)
  • Key metrics: small=91.2%, medium=93.9%, large=93.9%, overall=93.0%
  • Result: pass

Risk / Compatibility

  • No breaking changes; purely additive new task directory
  • Dataset is publicly available on HuggingFace under CC-BY-NC-4.0

Type of Change

  • [ ] Bug fix (non-breaking change)
  • [ ] New feature
  • [x] New benchmark/task
  • [ ] New model integration
  • [ ] Breaking change
  • [ ] Documentation update
  • [ ] Refactoring (no functional changes)

MLE-Bench evaluates fine-grained visual perception in multimodal models
using 4-choice questions about object existence, categorised by the
target object's relative size (proportion of image pixels occupied):

  - small  (existence_0-30):   732 samples, objects occupying 0-30% of image
  - medium (existence_30-60):  698 samples, objects occupying 30-60% of image
  - large  (existence_60-100): 431 samples, objects occupying 60-100% of image
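
The stratification above amounts to bucketing each question by the fraction of image pixels the target object occupies. A minimal sketch, with the boundary edge handling (lower edge inclusive for medium and large) as an assumption; the dataset's exact edge convention is not stated here:

```python
def size_bucket(area_fraction: float) -> str:
    """Map an object's image-area fraction to an MLE-Bench size bucket.

    Bucket names mirror the dataset's existence_0-30 / existence_30-60 /
    existence_60-100 splits; edge handling is an assumption.
    """
    if not 0.0 <= area_fraction <= 1.0:
        raise ValueError("area fraction must be in [0, 1]")
    if area_fraction < 0.30:
        return "small"   # existence_0-30
    if area_fraction < 0.60:
        return "medium"  # existence_30-60
    return "large"       # existence_60-100
```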

Dataset: https://huggingface.co/datasets/JunlinHan/Multi-Level_Existence_Bench

Tasks added:
  - mle_bench        : full evaluation (1,861 samples)
  - mle_bench_small  : small-object subset
  - mle_bench_medium : medium-object subset
  - mle_bench_large  : large-object subset

Metrics: per-category accuracy (small / medium / large) + macro-average
Luodian (Contributor) commented Mar 7, 2026

@claude please lint this

claude bot commented Mar 7, 2026

Claude Code is working…

I'll analyze this and get back to you.


@Luodian Luodian merged commit 06221cc into EvolvingLMMs-Lab:main Mar 7, 2026
3 checks passed
