Abstract Reasoning Definition
In the context of Large Language Models (LLMs), abstract reasoning is defined as the combination of two capabilities:
- Abstraction: extracting essential patterns and underlying structures from concrete instances, independent of superficial details or specific symbolic representations. This involves information compression, generalization, and a focus on the core, reasoning-relevant features.
- Reasoning: applying consistent rules, logical operations, and inferential processes to these abstracted patterns in order to derive new conclusions, solve problems, and make predictions. This goes beyond simple pattern matching or memorization; it requires genuine understanding of abstract relationships and rules.
The benchmarks listed below are designed to evaluate these two core processes in LLMs, particularly focusing on tasks that require invariance to surface-level changes and generalization to novel situations based on abstract understanding.
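As a minimal illustration of the invariance described above, consider the hypothetical sketch below (the rule and the symbol sets are invented for this example): an abstract rule such as "append the first element of a sequence to its end" depends only on structure, so applying it before or after relabeling the symbols yields the same result.

```python
# Minimal sketch of surface-invariance (hypothetical example).
# The abstract rule "append the first element to the end" should hold
# no matter which concrete symbols are used.

def apply_rule(seq):
    """Abstract rule: copy the first element of the sequence to its end."""
    return seq + [seq[0]]

def relabel(seq, mapping):
    """Surface-level change: rename every symbol, leaving structure intact."""
    return [mapping[s] for s in seq]

original = ["A", "B", "C"]
mapping = {"A": "x", "B": "y", "C": "z"}

# The rule commutes with relabeling: rule-then-rename equals rename-then-rule.
lhs = relabel(apply_rule(original), mapping)   # apply the rule first, then rename
rhs = apply_rule(relabel(original, mapping))   # rename first, then apply the rule
assert lhs == rhs == ["x", "y", "z", "x"]
```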
Benchmark/Dataset | Description | Source (Link) | Abstract Reasoning Relevance |
---|---|---|---|
AI2 Reasoning Challenge (ARC) | Grade-school science exam questions that require general knowledge and reasoning, rather than simple retrieval, to answer. | arXiv | Abstraction & Rule Application: Science questions often require abstracting scientific principles and applying them to specific scenarios. Moving beyond factual recall, it tests the ability to reason with abstract concepts in science. |
LAMBADA | Evaluates language models on predicting the final word of narrative passages, a task that can only be solved by tracking the long-range context of the passage rather than the local sentence alone. | arXiv | Abstraction of Narrative Structure: Understanding long narratives demands abstraction of the overall narrative structure and coherence. Predicting text based on long context requires reasoning about abstract narrative flows and thematic elements. |
MultiNLI (Multi-Genre Natural Language Inference) | Tests natural language inference by assigning labels (entailment, contradiction, neutral) to hypotheses based on premises across genres. | arXiv | Abstracting Sentence Meaning & Logical Rules: Deep NLI requires abstracting the meaning of sentences and applying logical rules to determine relationships (entailment, contradiction). This goes beyond surface-level keyword matching and requires abstract semantic understanding. |
WinoGrande | Problems based on the Winograd Schema Challenge, testing context understanding in sentences with subtle variations. | arXiv | Abstract Contextual Reasoning: Winograd Schemas test pronoun resolution and require nuanced contextual understanding. Solving them involves abstractly reasoning about the situation and entities described, going beyond simple word associations. |
SciQ | Multiple-choice science questions, often with supporting text, testing science-based reasoning and understanding of scientific principles. | arXiv | Applying Abstract Scientific Rules: Answering science questions in SciQ requires applying abstract scientific principles and rules to specific questions, often involving deduction and inference from supporting text. |
GSM8K | Grade-school math word problems requiring basic to intermediate math operations and multi-step problem-solving. | arXiv | Abstraction of Mathematical Concepts & Rules: Math word problems necessitate abstracting the mathematical concepts and operations described in natural language. Solving them involves applying abstract mathematical rules and logical steps (see the worked sketch after the table). |
DROP (Discrete Reasoning Over Paragraphs) | Reading comprehension benchmark requiring models to navigate text and perform discrete operations like addition or sorting. | arXiv | Abstracting Numerical & Relational Information for Rule-Based Operations: DROP explicitly tests the ability to abstract numerical and relational information from text and apply discrete reasoning rules (addition, sorting) to answer questions (see the toy sketch after the table). |
CRASS (Counterfactual Reasoning Assessment) | Evaluates counterfactual reasoning ("what if" scenarios) abilities of LLMs. | arXiv | Abstracting Hypothetical Worlds & Causal Rules: Counterfactual reasoning inherently involves abstracting away from the real world and constructing hypothetical scenarios. It tests the ability to reason about cause and effect and apply rules in these abstract, alternative contexts. |
BBH (Big-Bench Hard) | Subset of BIG-Bench with challenging tasks demanding multi-step and advanced reasoning skills. | arXiv | Diverse Abstract Reasoning Tasks: BBH encompasses a wide range of challenging tasks, many of which inherently require abstract reasoning, complex problem-solving, and the application of diverse rules and patterns in novel domains. |
AGIEval | Human-centric benchmark built from standardized exams (college entrance, law school admission, GRE, GMAT, SAT, LSAT, etc.), evaluating reasoning and problem-solving across academic and professional scenarios. | arXiv | Standardized Tests of Abstract Reasoning Abilities: Standardized tests such as the GRE, GMAT, and LSAT contain sections specifically designed to probe abstract reasoning, logical deduction, and analytical skills, so performance here reflects a broad range of human-level abstract cognitive abilities. |
BoolQ | Yes/no questions from Google searches with Wikipedia context, testing inference from non-explicit contextual information. | arXiv | Abstract Inference from Context: While focused on question answering, BoolQ can involve abstract inference when the answer is not directly stated. It requires understanding implicit relationships and applying logical rules to deduce the correct yes/no answer from context. |
PIQA (Physical Interaction: Question Answering) | Tests knowledge of the physical world through hypothetical scenarios and solutions. | arXiv | Abstracting Physical Laws & Commonsense Rules: Reasoning about physical interactions involves abstracting physical laws and common-sense rules to understand hypothetical situations and predict outcomes in the physical world. |
CodeXGLUE | Evaluates LLMs' ability to understand and work with code across tasks like code completion and translation. | arXiv | Abstracting Code Logic & Rules: Code understanding and generation are fundamentally abstract. It involves manipulating abstract symbols, applying logical programming rules, and reasoning about program structure and semantics. |
HumanEval | Programming challenges evaluating LLMs' ability to write functional code based on instructions. | arXiv | Abstract Rule Application in Code Generation: Code generation requires abstract reasoning to translate natural language requirements into executable code. It tests the ability to apply programming rules and logic in an abstract symbolic domain (a pass@k scoring sketch follows the table). |
MBPP (Mostly Basic Python Programming) | Python programming problems for entry-level programmers. | arXiv | Basic Abstract Programming Logic: Solving even basic programming problems involves abstract thinking to represent problem logic and translate it into code, applying fundamental programming rules. |
NPHardEval | Dynamic benchmark that organizes reasoning tasks by computational complexity class and refreshes its problem instances over time. | arXiv | Abstract Computational Reasoning: Dealing with complexity classes and NP-hard problems requires abstract thinking about problem structures, algorithmic complexity, and computational limits, representing a form of abstract computational and mathematical reasoning. |
LLMs for Relational Reasoning (Survey) | Research direction focusing on relational reasoning capabilities of LLMs. | arXiv | Direct Focus on Abstract Relational Reasoning: Relational reasoning—understanding and manipulating relationships between entities and concepts—is a core component of abstract reasoning, making this research area and related benchmarks directly relevant. |
Logical Reasoning Evaluation (Survey) | Research evaluating the logical reasoning capabilities of LLMs comprehensively. | arXiv | Direct Focus on Abstract Logical Rules: Logical reasoning, emphasizing deduction and induction, is a specific type of abstract reasoning. Benchmarks in this area directly assess the application of abstract logical rules. |
PlanBench | Benchmark for evaluating LLMs on planning and reasoning about change. | arXiv | Abstract Planning & Rule-Based Action Sequences: Planning, especially in dynamic environments, requires abstracting goals, actions, and states. It tests the ability to reason about sequences of actions and apply rules in abstract planning scenarios. |
CogEval | Benchmark for evaluating cognitive maps and planning abilities in LLMs. | arXiv | Abstract Cognitive Maps & Navigational Rules: Cognitive maps are abstract representations of environments, and planning involves abstract goal paths. CogEval assesses abstract reasoning in the context of spatial reasoning and applying navigational rules within abstract maps. |
CRUXEval | Benchmark for code reasoning, understanding, and execution. | arXiv | Abstract Code Logic & Execution Rules: Code reasoning is inherently abstract, involving logical deduction, understanding abstract data structures, and manipulating symbolic representations according to programming execution rules. |
MATH | High school mathematics competition problems, requiring multi-step reasoning to solve. | arXiv | Advanced Mathematical Abstraction: MATH tests high-level mathematical reasoning across various domains, demanding abstraction of complex mathematical concepts and application of theorems and problem-solving strategies. |
AQuA | Algebra Question Answering dataset focusing on symbolic reasoning in algebra word problems. | arXiv | Symbolic and Algebraic Abstraction: AQuA specifically evaluates the ability to abstract algebraic problems from natural language descriptions and perform symbolic reasoning to find solutions (see the symbolic sketch after the table). |
EntailmentBank | Benchmark for deductive reasoning, requiring building logical proof chains in natural language. | arXiv | Deductive Logical Abstraction: EntailmentBank directly assesses deductive reasoning, a core aspect of abstract logic, by requiring models to understand and construct abstract logical arguments. |
CLUTRR | Diagnostic benchmark for inductive reasoning from text, focusing on family relationship inference from narratives. | arXiv | Inductive Relational Abstraction: CLUTRR tests inductive reasoning, another key aspect of abstract thought, by requiring models to abstract relational patterns from text and generalize to new instances (see the kinship sketch after the table). |
HotpotQA | Dataset for multi-hop question answering, requiring reasoning across multiple documents. | arXiv | Multi-Document Information Abstraction & Synthesis: HotpotQA requires abstracting relevant information from multiple documents and synthesizing it to answer complex questions, testing higher-order abstract reasoning. |
CommonsenseQA | Question answering challenge targeting commonsense knowledge and reasoning. | arXiv | Commonsense Abstraction & Inference: CommonsenseQA assesses the ability to apply abstract commonsense knowledge to answer questions, requiring inference beyond factual recall. |
TimeDial | Dataset for temporal commonsense reasoning in dialog, focusing on event sequencing and time-related inferences. | arXiv | Temporal Abstraction & Reasoning: TimeDial directly evaluates temporal reasoning, a form of abstract reasoning about time and event order, in the context of dialogue. |
SpartQA | Textual Question Answering Benchmark for Spatial Reasoning. | arXiv | Spatial Abstraction & Reasoning: SpartQA focuses on spatial reasoning, testing the ability to understand and reason about spatial relationships described in text, a key aspect of abstract spatial cognition. |
GPQA | Graduate-level question answering dataset designed to evaluate expert knowledge and reasoning in biology, physics, and chemistry. | arXiv | Expert Domain Knowledge & Reasoning: GPQA goes beyond general question answering, targeting graduate-level questions in biology, physics, and chemistry that remain difficult even with unrestricted web search. It assesses the capacity to understand complex, nuanced questions that demand in-depth subject-matter expertise and reasoning beyond typical benchmark datasets, pushing models past superficial understanding toward expert-level comprehension. |
MMLU (Massive Multitask Language Understanding) | Benchmark covering 57 diverse subjects, testing broad knowledge and reasoning. | arXiv | Broad Abstract Knowledge Application & Reasoning: MMLU's wide coverage tests the ability to apply abstract knowledge and reasoning across diverse domains, assessing general abstract cognitive abilities. |
C-Eval | A Multi-Level Multi-Discipline Chinese Evaluation Suite, evaluating broad knowledge and reasoning in Chinese. | https://cevalbenchmark.com/ | Broad Abstract Knowledge Application & Reasoning (Chinese): C-Eval, similar to MMLU but in Chinese, assesses broad abstract knowledge and reasoning across disciplines, with a focus on Chinese language and cultural contexts. |
MR-BEN (Meta-Reasoning Benchmark) | Benchmark for meta-reasoning, evaluating the ability to detect and analyze errors in generated reasoning steps. | arXiv | Meta-Abstract Reasoning & Error Correction: MR-BEN tests meta-reasoning, a higher-level abstract ability to reflect on and correct one's own reasoning process, critical for robust abstract problem-solving. |
UGMathBench (Undergraduate Math Benchmark) | Diverse and dynamic benchmark for undergraduate mathematical reasoning. | arXiv | Advanced Undergraduate Mathematical Abstraction: UGMathBench focuses on the complexities of undergraduate-level mathematics, requiring advanced abstract mathematical reasoning and problem-solving skills. |
MARVEL (Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning) | Benchmark for multimodal abstract visual reasoning with geometric and abstract shapes. | arXiv | Multimodal Abstract Visual Reasoning: MARVEL directly targets abstract visual reasoning, testing the ability to understand and reason with abstract geometric shapes and visual patterns, extending abstract reasoning to the visual domain. |
ARB (Advanced Reasoning Benchmark) | Benchmark for advanced reasoning in math, physics, biology, chemistry, and law, targeting expert-level reasoning. | arXiv | Expert-Level Multi-Domain Abstract Reasoning: ARB challenges models with expert-level questions across diverse fields, demanding advanced abstract reasoning capabilities in complex, knowledge-intensive domains. |
AoT Collection (Abstraction-of-Thought Collection) | Dataset introduced with the Abstraction-of-Thought method, designed to improve reasoning through explicit abstraction. | arXiv | Abstraction-Focused Reasoning Data: While primarily a dataset for training, AoT Collection is inherently designed to promote and evaluate reasoning processes that explicitly incorporate abstraction, making it relevant to abstract reasoning evaluation. |
H-ARC (Human performance on ARC) | Dataset and study providing robust human performance estimates on the Abstraction and Reasoning Corpus (ARC), a benchmark of grid-based abstract reasoning puzzles (distinct from the AI2 Reasoning Challenge above). | arXiv | Human-Level Abstract Reasoning Baseline (ARC): H-ARC provides a refined human baseline for the Abstraction and Reasoning Corpus, crucial for contextualizing and evaluating LLMs' abstract reasoning performance relative to human capabilities. |
τ-bench (Tool-Agent-User Interaction Benchmark) | A benchmark for evaluating tool-agent-user interaction in real-world domains, focusing on user simulation, API tool usage, and domain-specific policy adherence. | arXiv | Realistic Interaction and Rule Following: τ-bench assesses abstract reasoning by requiring agents to interact with simulated users and APIs, demonstrating the ability to follow complex, domain-specific rules and generalize beyond pattern matching in realistic scenarios. |
CODEI/O (Condensing Reasoning Patterns via Code Input-Output Prediction) | An approach and dataset transforming code into input-output prediction tasks to evaluate reasoning patterns. | arXiv | Reasoning Pattern Extraction from Code: CODEI/O, while focused on code, probes abstract reasoning by training models on code input-output prediction, forcing them to learn universal reasoning primitives independent of specific syntax, applicable to broader abstract reasoning tasks (see the sketch after the table). |
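To make the multi-step aspect noted in the GSM8K row concrete, here is a small hand-written example in the spirit of the benchmark; the word problem and its numbers are invented and do not come from the dataset. The natural-language problem is abstracted into a short chain of arithmetic steps whose final value can be checked programmatically.

```python
# Invented GSM8K-style word problem (not drawn from the dataset):
# "A bakery sells 24 muffins in the morning and twice as many in the
#  afternoon. Each muffin costs $3. How much money does the bakery make?"

morning = 24                          # step 1: quantity stated directly
afternoon = 2 * morning               # step 2: "twice as many" -> 48
total_muffins = morning + afternoon   # step 3: 24 + 48 = 72
revenue = total_muffins * 3           # step 4: 72 * $3 = $216

assert revenue == 216
```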
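The DROP row refers to discrete operations over text. The toy sketch below illustrates that two-stage pipeline on an invented passage: the numerical information is first abstracted out of the surface text, then a discrete rule (here, addition) is applied. It is only an illustration, not the benchmark's evaluation code.

```python
import re

# Invented DROP-style passage and question (not from the dataset).
passage = ("The Eagles scored 7 points in the first quarter, 14 in the "
           "second, and 3 in the fourth.")
question = "How many points did the Eagles score in total?"

# Step 1: abstract the numerical information away from the surface text.
numbers = [int(n) for n in re.findall(r"\d+", passage)]

# Step 2: apply a discrete operation (addition) to the abstraction.
answer = sum(numbers)
assert answer == 24
```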
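HumanEval-style code benchmarks are commonly scored with the pass@k metric: for each problem, n candidate programs are sampled, c of them pass the unit tests, and pass@k is estimated with the unbiased formula 1 - C(n-c, k) / C(n, k), averaged over problems. A minimal version of that estimator is sketched below; the sample counts in the usage lines are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so every size-k subset
        # contains at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 200 samples per problem, 40 pass the unit tests.
print(round(pass_at_k(n=200, c=40, k=1), 3))    # 0.2, the raw per-sample pass rate
print(round(pass_at_k(n=200, c=40, k=10), 3))   # chance a batch of 10 samples contains a pass
```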
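For the AQuA row, the symbolic step can be made concrete with a small example: an invented algebra word problem is abstracted into an equation and solved symbolically with SymPy. This illustrates the abstraction involved, not how the benchmark itself is scored.

```python
from sympy import Eq, solve, symbols

# Invented AQuA-style problem: "Twice a number increased by 7 equals 31.
# What is the number?"  Abstracted into the equation 2*x + 7 = 31.
x = symbols("x")
equation = Eq(2 * x + 7, 31)

solution = solve(equation, x)
assert solution == [12]
```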
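For the CLUTRR row, the relational abstraction involved can be sketched with a toy kinship-composition table: given a chain of stated relations, the unstated relation is derived by repeatedly applying composition rules. The rule table, story, and names are invented for illustration and are not part of the dataset.

```python
# Toy kinship composition (illustrative; not CLUTRR's actual rule set).
# COMPOSE[(r1, r2)] answers: if A is the r1 of B and B is the r2 of C,
# what is A to C?
COMPOSE = {
    ("father", "father"): "grandfather",
    ("brother", "father"): "uncle",
    ("brother", "mother"): "uncle",
    ("sister", "mother"): "aunt",
}

def infer(chain):
    """Fold a chain of pairwise relations left to right using the rule table."""
    relation = chain[0]
    for nxt in chain[1:]:
        relation = COMPOSE[(relation, nxt)]
    return relation

# Invented story: "Tom is Anna's brother. Anna is Lily's mother."
assert infer(["brother", "mother"]) == "uncle"   # Tom is Lily's uncle
```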
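Finally, for the CODEI/O row, the sketch below shows one way to turn an ordinary function into an input-output prediction item: the code and an input are shown, and the output is withheld as the target. The prompt template and the helper make_io_item are assumptions made for illustration; the original work's exact format may differ.

```python
import inspect

def running_max(xs):
    """Return the running maximum of a list of numbers."""
    result, best = [], float("-inf")
    for x in xs:
        best = max(best, x)
        result.append(best)
    return result

def make_io_item(fn, example_input):
    """Pair the function's source and an input with the hidden expected output."""
    prompt = (
        "Given the function below and the input, predict the output.\n\n"
        + inspect.getsource(fn)
        + f"\nInput: {example_input!r}\n"
    )
    return prompt, fn(example_input)

prompt, expected = make_io_item(running_max, [3, 1, 4, 1, 5])
assert expected == [3, 3, 4, 4, 5]   # the model would be asked to predict this list
```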