CSTutorBench

A benchmark for evaluating LLM tutors for computer science in VEX VR, a block-based robotics programming environment. The benchmark tests whether an AI tutor gives responses that are accurate, concise, age-appropriate, and pedagogically sound for middle school students.

Overview

This is a single-turn benchmark that assesses LLMs on snapshots of common student states. There are 17 scenarios (EXs, short for examples) in which the LLM tutor receives Blockly code in XML format along with a student's question, then responds. Currently, all 17 scenarios belong to VEX's Coral Reef Clean-up assignment.

Features

Running the benchmark on LLMs through Ollama or AIStudio
Evaluating the rubric with an LLM through Ollama or Claude (instructions given to Claude in CLAUDE.md)
Support for multiple assignments
A Question Builder web tool for authoring new EXs
An 8-criterion rubric (0-2 points each, 16 points max per EX)

Dataset

17 questions across four types:

Type	Count	Description
`debugging`	8	Student has a bug and describes unexpected behavior
`debugging_iterative`	6	Student tried to fix a bug 2–3 times before asking; multiple code attempts provided
`optimization`	1	Student's code works but they want to improve it
`conceptual`	2	Student is asking how something works, not fixing broken code

Bug categories covered: missing_forever_loop, blocking_command_drive_for, inverted_logic, sequential_blocking_conflict, wrong_sensor_unavailable, absolute_vs_relative_heading, dead_reckoning_no_sensing, sensor_field_of_view, misunderstood_block_magnet, misplaced_forever_loop, blocking_command_still_too_long, inverted_logic_wrong_fix, finite_repeat_instead_of_forever, sensor_added_wrong_position, absolute_heading_escalating_numbers, micro_step_jitter, optimization_sensor_driven

Location: Dataset/benchmark.jsonl — one JSON object per line. See Dataset/GUIDE.md for the full schema.

benchmark.jsonl is a generated file. The source of truth is Dataset/rubric_template.yaml and Dataset/EX*/question.yaml. Run python Dataset/build.py to rebuild it after editing any YAML file.
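
Because benchmark.jsonl holds one JSON object per line, it can be inspected with the standard library alone. The sketch below parses JSON Lines text and counts entries per question type. The field name `type` and both helper names are assumptions made for illustration; Dataset/GUIDE.md has the actual schema.

```python
import json
from collections import Counter
from typing import Iterable


def parse_jsonl(lines: Iterable[str]) -> list:
    """Parse JSON Lines input: one JSON object per non-blank line."""
    return [json.loads(line) for line in lines if line.strip()]


def count_by_type(entries: list, key: str = "type") -> Counter:
    """Count entries per question type.

    "type" is an assumed field name, not one confirmed by the schema.
    """
    return Counter(entry.get(key, "unknown") for entry in entries)


# Usage against the real file (path taken from this README):
# with open("Dataset/benchmark.jsonl", encoding="utf-8") as fh:
#     print(count_by_type(parse_jsonl(fh)))
```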

Note: A Dataset/EX3/ folder exists but EX3 is excluded from the main benchmark. Dataset/benchmark_with_EX3.jsonl preserves it for reference. Dataset/benchmark_with_cliff.jsonl is a historical snapshot predating the coral boundary correction and the current rubric — kept for reference only.

Rubric

Every response is scored on 8 criteria (16 points max):

#	Criterion	What it measures (abbreviated)
1	`conciseness`	Under 300 characters; no padding or question-restating
2	`vocabulary`	Language accessible to a middle schooler; no unexplained jargon
3	`accuracy`	No incorrect claims about block behavior, sensors, or robot physics
4	`formatting`	Clean prose; no raw XML, markdown artifacts, or complex sentence structures
5	`tone`	Encouraging and patient; praise is meaningful, not automatic filler
6	`actionability`	Student knows what to look at or try next after reading the response
7	`targetedness`	Engages with this student's specific situation, not generic advice
8	(type-specific)	See below

Type-specific criterion (8th):

Criterion	Question type	What it measures
`hint_not_solution`	`debugging`	Guides the student toward the fix without stating it outright (Socratic)
`acknowledges_progression`	`debugging_iterative`	Recognizes and validates the student's self-directed iteration history
`builds_on_success`	`optimization`	Acknowledges what already works before suggesting improvements
`conceptual_clarity`	`conceptual`	Explains the concept in a way that builds intuition, not just states a fact

NOTE

Every EX may also provide per-question rubric additions or overrides to give context specific to that example.
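
The rubric arithmetic above can be sketched as follows: seven shared criteria plus one type-specific criterion, each scored 0 to 2, for a maximum of 16. The criterion names come from the tables above; the function names and the shape of the `scores` dictionary are assumptions for illustration, and per-question overrides are not modeled.

```python
SHARED_CRITERIA = [
    "conciseness", "vocabulary", "accuracy", "formatting",
    "tone", "actionability", "targetedness",
]

# Eighth criterion, chosen by question type (from the table above).
TYPE_SPECIFIC = {
    "debugging": "hint_not_solution",
    "debugging_iterative": "acknowledges_progression",
    "optimization": "builds_on_success",
    "conceptual": "conceptual_clarity",
}

MAX_PER_CRITERION = 2


def criteria_for(question_type: str) -> list:
    """Return the eight criterion names that apply to a question type."""
    return SHARED_CRITERIA + [TYPE_SPECIFIC[question_type]]


def total_and_percent(scores: dict, question_type: str) -> tuple:
    """Sum the eight scores and express the total as a percentage of 16."""
    names = criteria_for(question_type)
    for name in names:
        if scores[name] not in (0, 1, 2):
            raise ValueError(f"{name} must be 0, 1, or 2, got {scores[name]!r}")
    total = sum(scores[name] for name in names)
    return total, 100.0 * total / (MAX_PER_CRITERION * len(names))
```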

Models Evaluated

11 models have been run through the full benchmark (Trial1 and Trial2 each):

Model	Notes
`deepseek-r1_8b`	Thinking model
`gemma-4-31b-it`
`gemma3_27b`
`gemma3_4b`
`gemma4_e4b`	Edge variant
`gpt-oss_20b`
`nemotron-3-super_120b`
`olmo-3_7b-think`	Thinking model
`qwen3-coder_30b`	Coder variant
`qwen3.5_9b`
`qwen3.6_latest`

Results live in Evaluations/<model>/Trial<N>/<judge>/. The most up-to-date results use the hybrid-claude-sonnet-4-6-v4 judge. results_script-check-hybrid-claude-sonnet-4-6-v4.xlsx is a spreadsheet of those results with the character-count conciseness rules applied across all questions.

Running the Benchmark

With a local Ollama model

python run_benchmark.py --model qwen2.5:32b

# Run a subset of questions
python run_benchmark.py --model qwen2.5:32b --questions 1 5   # EX1 through EX5
python run_benchmark.py --model qwen2.5:32b --questions 7     # EX7 only
python run_benchmark.py --model qwen2.5:32b --questions 1 3 7 # EX1, EX3, EX7

Responses are saved to Responses/<model>/Trial<N>/EX*.txt. Each run automatically creates the next trial directory (Trial1, Trial2, ...).
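
One way the "next trial directory" behavior could work is to scan existing TrialN folders and add one to the highest number. This is a sketch, not the code in run_benchmark.py: both function names are hypothetical, and picking max + 1 (rather than filling gaps) is an assumption.

```python
import re
from pathlib import Path


def next_trial_number(existing_names) -> int:
    """Return N for the next TrialN, given the names already present."""
    numbers = [
        int(m.group(1))
        for name in existing_names
        if (m := re.fullmatch(r"Trial(\d+)", name))
    ]
    # With no existing trials, the first run becomes Trial1.
    return max(numbers, default=0) + 1


def next_trial_dir(model_dir: Path) -> Path:
    """Return the path of the next trial directory (does not create it)."""
    names = []
    if model_dir.exists():
        names = [p.name for p in model_dir.iterdir() if p.is_dir()]
    return model_dir / f"Trial{next_trial_number(names)}"
```

For example, `next_trial_dir(Path("Responses/gemma3_27b"))` would point at Trial3 if Trial1 and Trial2 already exist.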

With Google AI Studio (Gemini/Gemma)

python run_benchmark_aistudio.py --model gemini-2.0-flash --api-key YOUR_KEY

Or set AISTUDIO_API_KEY in your environment and omit --api-key. Rate limit retries are handled automatically.
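
The key lookup described above might look like the sketch below. The environment variable name comes from this README; the function name and the choice to let --api-key take precedence over the environment are assumptions.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(cli_value: Optional[str],
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer the --api-key value, then fall back to AISTUDIO_API_KEY."""
    env = os.environ if env is None else env
    key = cli_value or env.get("AISTUDIO_API_KEY")
    if not key:
        # Assumed failure mode: exit with a message instead of calling the API.
        raise SystemExit("No API key: pass --api-key or set AISTUDIO_API_KEY")
    return key
```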

Evaluating Responses

# --model <model to be evaluated> --trial <N> --judge <model used as judge>
python evaluate.py --model <model_slug> --trial <N> --judge <judge_model>

Results are saved to Evaluations/<model>/Trial<N>/<judge>/EX*.json and summary.json.

EX*.json — per-question scores and reasoning for all 8 criteria
summary.json — aggregate report with per-criterion and per-question-type breakdowns
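
To illustrate what per-criterion and per-question-type breakdowns involve, here is a minimal aggregation sketch. The input shape (`question_type` and `scores` keys) and the output key names are assumptions for illustration and may not match the actual layout of EX*.json or summary.json.

```python
from collections import defaultdict
from statistics import mean


def summarize(results: list) -> dict:
    """Aggregate per-question results into two breakdowns.

    Each result is assumed to look like:
    {"question_type": "debugging", "scores": {"accuracy": 2, ...}}
    """
    by_criterion = defaultdict(list)
    by_type = defaultdict(list)
    for result in results:
        for name, score in result["scores"].items():
            by_criterion[name].append(score)
        by_type[result["question_type"]].append(sum(result["scores"].values()))
    return {
        "per_criterion_mean": {k: mean(v) for k, v in by_criterion.items()},
        "per_type_mean_total": {k: mean(v) for k, v in by_type.items()},
    }
```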

With Claude Code

Claude can act directly as the judge without calling evaluate.py. This currently produces higher-quality scores than a local LLM.

Open Claude Code in this repository and tell it to evaluate a model:

evaluate gemma3_27b trial 1
evaluate gemma3_27b and gemma3_4b          # runs both in parallel
evaluate nemotron                          # partial name match, latest trial

Claude reads CLAUDE.md for the full scoring instructions and writes results to Evaluations/<model>/Trial<N>/claude-sonnet-4-6-v4/. After scoring, it runs a connector scan to verify formatting scores, then prints a summary table.

Judge versioning

The judge subfolder name identifies the judge model and rubric version used. Current production judge subfolder is claude-sonnet-4-6-v4. Earlier subfolders (claude-sonnet-4-6, claude-sonnet-4-6-v2, claude-sonnet-4-6-v3) reflect prior rubric iterations and are kept for historical comparison. hybrid-claude-sonnet-4-6-v4 subfolders contain scores that were manually reviewed and corrected after the automated pass.
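
The naming scheme above can be split into its parts with a small parser. This is an illustrative sketch only: the `JudgeName` type and `parse_judge_name` function are hypothetical, and treating an unsuffixed name such as claude-sonnet-4-6 as rubric version 1 is an assumption.

```python
import re
from typing import NamedTuple


class JudgeName(NamedTuple):
    hybrid: bool         # True for manually reviewed "hybrid-" folders
    model: str           # judge model, e.g. "claude-sonnet-4-6"
    rubric_version: int  # assumed to be 1 when no "-vN" suffix is present


def parse_judge_name(name: str) -> JudgeName:
    """Split a judge subfolder name into hybrid flag, model, and version."""
    hybrid = name.startswith("hybrid-")
    rest = name[len("hybrid-"):] if hybrid else name
    match = re.fullmatch(r"(.+?)-v(\d+)", rest)
    if match:
        return JudgeName(hybrid, match.group(1), int(match.group(2)))
    return JudgeName(hybrid, rest, 1)
```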

Utility scripts

# Recalculate total_score and percent from criteria arrays; rebuild summary.json
python fix_scores.py --model <model_slug> --trial <N> --judge <judge>

# Generate an XLSX summary across all models for a given judge
python make_xlsx.py --judge <judge>

# Compare two evaluation runs and report absolute score differences
python abs_diff.py --model <model_slug> --trial <N> --judge-a <judge> --judge-b <judge>
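
As an illustration of the comparison abs_diff.py reports, the sketch below computes per-question absolute differences between two sets of total scores. It is not the script's actual implementation: the function names and the input shape (a mapping from question ID to total score) are assumptions.

```python
def absolute_differences(scores_a: dict, scores_b: dict) -> dict:
    """Return |a - b| for every question scored by both judges."""
    shared = sorted(scores_a.keys() & scores_b.keys())
    return {q: abs(scores_a[q] - scores_b[q]) for q in shared}


def mean_absolute_difference(scores_a: dict, scores_b: dict) -> float:
    """Average the per-question absolute differences."""
    diffs = absolute_differences(scores_a, scores_b)
    if not diffs:
        raise ValueError("the two runs have no questions in common")
    return sum(diffs.values()) / len(diffs)
```

Questions that appear in only one run are skipped here rather than counted as a full-score disagreement; either choice is defensible, so check the script before relying on this behavior.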

Evaluator tools

EvaluatorTools/conciseness_analyze.py counts characters, words, and approximate sentences in a response. evaluate.py calls it automatically when scoring the conciseness criterion. You can also run it manually:

python EvaluatorTools/conciseness_analyze.py --text "Your response here"
python EvaluatorTools/conciseness_analyze.py --model gemma3_27b --trial 1 --ex 6
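
A rough idea of the counts involved is sketched below. This is not the contents of conciseness_analyze.py: the `analyze` function, its output keys, the sentence-splitting rule, and the strict "fewer than 300 characters" comparison are all assumptions based on the rubric description.

```python
import re


def analyze(text: str, char_limit: int = 300) -> dict:
    """Count characters, words, and approximate sentences in a response."""
    stripped = text.strip()
    # Approximate sentences: split on runs of . ! ? followed by space or end.
    parts = re.split(r"[.!?]+(?:\s+|$)", stripped)
    sentences = [part for part in parts if part.strip()]
    return {
        "characters": len(stripped),
        "words": len(stripped.split()),
        "sentences": len(sentences),
        "under_limit": len(stripped) < char_limit,
    }
```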

Human Readable Sheet

python make_grading_sheet.py  # outputs a .xlsx viewable in Excel or Google Sheets
python make_xlsx.py --judge <judge>  # XLSX summary of all evaluation results for a given judge

make_grading_sheet.py produces a blank grading sheet from the current benchmark questions — useful for human review. make_xlsx.py produces a filled results summary from existing evaluation output.

Adding New Assignments

An assignment is the VEX VR challenge the student is working on (e.g. Coral Reef Rescue). Each assignment has its own file that gets injected into the system prompt so the tutor LLM understands the task context.

Steps

Create the assignment file in LLMUnifiedConfig/Assignments/<AssignmentName>.txt. The file should contain the student-facing challenge description plus any environment notes the tutor LLM needs — collection mechanics, boundary behavior, available sensors, what the example solution looks like, and any blocks that are not relevant to this challenge. Use CoralReefRescue.txt as a reference.
Author questions for the new assignment using the Question Builder (see Adding New Questions below). Set the assignment field in the question form to match the filename exactly (without .txt). This value is stored in each question.yaml and carried into benchmark.jsonl.
Rebuild the benchmark if you authored any YAML files manually outside the Question Builder:
```
python Dataset/build.py
```

The run scripts (run_benchmark.py, run_benchmark_aistudio.py) will automatically find and load the correct assignment file by matching the assignment field in each benchmark entry against filenames in LLMUnifiedConfig/Assignments/. If no match is found, the script will exit with a list of available assignment names.
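
The lookup described above can be sketched as an exact match on the filename stem, with a helpful exit when nothing matches. The function names are hypothetical and the error wording is an assumption; only the directory layout and the exact-match rule come from this README.

```python
from pathlib import Path


def match_assignment(assignment: str, available: list) -> str:
    """Return the assignment filename, or exit listing the valid names."""
    if assignment not in available:
        listing = ", ".join(sorted(available)) or "(none)"
        raise SystemExit(
            f"Unknown assignment {assignment!r}. Available: {listing}"
        )
    return f"{assignment}.txt"


def find_assignment_file(assignment: str, assignments_dir: Path) -> Path:
    """Resolve an assignment name against the .txt files in a directory."""
    available = [p.stem for p in assignments_dir.glob("*.txt")]
    return assignments_dir / match_assignment(assignment, available)
```

The match is case-sensitive, which mirrors the instruction to set the assignment field to the filename exactly.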

Adding New Questions

Use the Question Builder — a local web tool for authoring new benchmark entries with a drag-and-drop Blockly interface.

Requirements

Python 3 with PyYAML (pip install pyyaml)
Ollama running locally (only needed for the Test feature)
Internet access (Blockly loads from CDN)

Usage

python question_builder/server.py

Then open http://localhost:8765 in a browser.

The tool provides:

Blockly workspace — drag-and-drop blocks to build the student's code. For debugging_iterative questions, add up to 3 attempt tabs.
Question form — fill in question type, assignment, bug/topic category, student question, and evaluator notes. Entry-specific rubric overrides and scoring guide overrides are optional.
Test panel — select any locally available Ollama model and send the current question to it to preview how a model responds before saving.
Save — writes Dataset/EX{N}/question.yaml and the XML file(s), then automatically rebuilds benchmark.jsonl.

Repository Structure

CSTutorBench/
├── README.md
├── CLAUDE.md                               # Scoring instructions for Claude Code as judge
├── results_script-check-hybrid-claude-sonnet-4-6-v4.xlsx  # Hybrid results with character-count conciseness rules
├── Dataset/
│   ├── GUIDE.md                        # Full schema documentation
│   ├── rubric_template.yaml            # Shared rubric text for all criteria
│   ├── benchmark.jsonl                 # Generated — one JSON object per line
│   ├── benchmark_with_EX3.jsonl        # Variant including the excluded EX3
│   ├── benchmark_with_cliff.jsonl      # Pre-correction snapshot (old boundary logic)
│   ├── build.py                        # Rebuilds benchmark.jsonl from YAML sources
│   ├── EX1/
│   │   ├── VEX.xml                     # Student's Blockly code
│   │   └── question.yaml               # Question metadata and rubric overrides
│   ├── EX2/ … EX18/
│   └── EX3/                            # Excluded from main benchmark
├── LLMUnifiedConfig/
│   ├── SystemPrompt.txt                # System prompt template for the tutor LLM
│   ├── VEXVRReferenceTable.txt         # Block reference table injected into the prompt
│   └── Assignments/
│       └── CoralReefRescue.txt         # Assignment context injected into the prompt
├── question_builder/
│   ├── server.py                       # Local HTTP server for the authoring UI
│   ├── index.html                      # Question Builder web interface
│   ├── blocks.js                       # Blockly block definitions
│   └── VEXBlockly.xml                  # Blockly toolbox definition
├── Responses/
│   └── <model>/Trial<N>/EX*.txt        # Raw model responses
├── Evaluations/
│   ├── review-analysis.md              # Human vs. judge disagreement analysis
│   └── <model>/Trial<N>/<judge>/
│       ├── EX*.json                    # Per-question scores and reasoning
│       └── summary.json                # Aggregate results
├── EvaluatorTools/
│   └── conciseness_analyze.py          # Count chars/words/sentences for conciseness scoring
├── run_benchmark.py                    # Collect responses from a local Ollama model
├── run_benchmark_aistudio.py           # Collect responses from Google AI Studio
├── evaluate.py                         # Score responses with an LLM judge
├── fix_scores.py                       # Recalculate totals from criteria arrays
├── make_xlsx.py                        # Generate XLSX summary across all models
├── abs_diff.py                         # Compare two judge runs, report differences
└── make_grading_sheet.py               # Generate XLSX sheet for human grading

Dependencies

Python 3.8+
pyyaml — for Dataset/build.py and question_builder/server.py (pip install pyyaml)
Ollama — for run_benchmark.py and the Question Builder test feature
openpyxl — for make_grading_sheet.py and make_xlsx.py (pip install openpyxl)
No additional dependencies for run_benchmark_aistudio.py or evaluate.py
