feat: add COVER and WM-aBench video understanding benchmarks by Luodian · Pull Request #1273 · EvolvingLMMs-Lab/lmms-eval

Luodian · 2026-03-26T16:06:13Z

Summary

Add COVER (Counterfactual Video Reasoning, ACL Findings 2025) benchmark for causal video understanding
Add WM-aBench (World Models aBench) with 36+ task variants for comprehensive world model evaluation

Benchmarks

COVER

Tests causal understanding in videos via counterfactual question generation. Includes generate_qa.py for automatic QA pair generation from video annotations.

WM-aBench

Evaluates world model capabilities across multiple dimensions:

Spatial: relative position, occupancy, multiview
Motion: direction, speed, trajectory
Physical: mechanistic knowledge, compositionality
Temporal: extension, positioning, transitivity
Visual: attribute recognition (color, shape, material)
Counting: discrete counting, relative counting

Uses the ManiSkill, TDW, Physion, Habitat, and CARLA simulation environments.
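As a rough illustration of how the dimensions above fan out into task variants, the taxonomy can be written as a plain mapping. This is only a sketch: the `wm_abench_<dimension>_<subtask>` naming scheme and the `variant_names` helper are hypothetical, and the real task names in this PR (36+ variants) may differ.

```python
# Dimension -> subtask mapping copied from the list in this description.
# The identifier spellings (snake_case) are an assumption for illustration.
WM_ABENCH_DIMENSIONS = {
    "spatial": ["relative_position", "occupancy", "multiview"],
    "motion": ["direction", "speed", "trajectory"],
    "physical": ["mechanistic_knowledge", "compositionality"],
    "temporal": ["extension", "positioning", "transitivity"],
    "visual": ["color", "shape", "material"],
    "counting": ["discrete_counting", "relative_counting"],
}


def variant_names(prefix: str = "wm_abench") -> list[str]:
    """Build hypothetical task names, one per (dimension, subtask) pair."""
    return [
        f"{prefix}_{dimension}_{subtask}"
        for dimension, subtasks in WM_ABENCH_DIMENSIONS.items()
        for subtask in subtasks
    ]


names = variant_names()
print(len(names))
print(names[0])
```

The mapping only covers the subtasks named in this description, so it yields fewer names than the full set of variants registered by the PR.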

Test plan

Verify task registration with lmms-eval --tasks list | grep -E "cover|wm_abench"
Run COVER with a video-capable model
Run a WM-aBench subset task
Confirm group yaml correctly aggregates all subtasks
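The registration and aggregation checks above could be scripted roughly as follows. This is a minimal sketch, not part of the PR: it assumes the task listing is available as plain text with one task name per line, and that a loaded group yaml is a dict with `group` and `task` keys. The sample task names and both helper functions are hypothetical.

```python
import re


def matching_tasks(listing: str, pattern: str = r"cover|wm_abench") -> list[str]:
    """Return the lines of a task listing that match the pattern, like grep -E."""
    regex = re.compile(pattern)
    return [line.strip() for line in listing.splitlines() if regex.search(line)]


def missing_subtasks(group_cfg: dict, expected: set[str]) -> set[str]:
    """Return expected subtask names that the group's `task` list does not include."""
    listed = set(group_cfg.get("task", []))
    return expected - listed


# Hypothetical listing text; the real output of `lmms-eval --tasks list` may differ.
sample_listing = """\
mme
cover
wm_abench_spatial_relative_position
wm_abench_motion_direction
"""

# Hypothetical group config as it might look after loading a group yaml.
sample_group = {
    "group": "wm_abench",
    "task": ["wm_abench_spatial_relative_position"],
}

found = matching_tasks(sample_listing)
expected = {name for name in found if name.startswith("wm_abench_")}
missing = missing_subtasks(sample_group, expected)
print(found)
print(sorted(missing))
```

An empty `missing` set would indicate that the group lists every registered subtask. Here the sample group deliberately omits one subtask to show what a failure looks like.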

Add two video understanding benchmarks:
- COVER: Counterfactual Video Reasoning (ACL Findings 2025) - tests causal understanding in videos via counterfactual question generation
- WM-aBench: World Models aBench with 36+ task variants covering spatial reasoning, motion understanding, object interactions, physical properties, temporal reasoning, and visual attributes

Luodian wants to merge 1 commit into main from feat/cover-wm-abench
