A benchmark for evaluating LLM tutors in CS within VEX VR, a block-based robotics programming environment. The benchmark tests whether an AI tutor gives responses that are accurate, concise, age-appropriate, and pedagogically sound for middle school students.
This is a single-turn benchmark that assesses LLMs on snapshots of common student states. There are 17 scenarios (called EXs, short for examples) in which the LLM tutor receives Blockly code in XML format along with a student's question and then responds. Currently, all 17 scenarios exist within VEX's Coral Reef Clean-up assignment.
- Running the benchmark on LLMs through Ollama or AIStudio
- Evaluating the rubric with an LLM through Ollama or Claude (instructions given to Claude in CLAUDE.md)
- Support for multiple assignments
- A Question Builder web tool for authoring new EXs
- An 8-criterion rubric (0-2 points each, 16 points max per EX)
17 questions across four types:
| Type | Count | Description |
|---|---|---|
| `debugging` | 8 | Student has a bug and describes unexpected behavior |
| `debugging_iterative` | 6 | Student tried to fix a bug 2–3 times before asking; multiple code attempts provided |
| `optimization` | 1 | Student's code works but they want to improve it |
| `conceptual` | 2 | Student is asking how something works, not fixing broken code |
Bug categories covered: missing_forever_loop, blocking_command_drive_for, inverted_logic, sequential_blocking_conflict, wrong_sensor_unavailable, absolute_vs_relative_heading, dead_reckoning_no_sensing, sensor_field_of_view, misunderstood_block_magnet, misplaced_forever_loop, blocking_command_still_too_long, inverted_logic_wrong_fix, finite_repeat_instead_of_forever, sensor_added_wrong_position, absolute_heading_escalating_numbers, micro_step_jitter, optimization_sensor_driven
Location: Dataset/benchmark.jsonl — one JSON object per line. See Dataset/GUIDE.md for the full schema.
`benchmark.jsonl` is a generated file. The source of truth is `Dataset/rubric_template.yaml` and `Dataset/EX*/question.yaml`. Run `python Dataset/build.py` to rebuild it after editing any YAML file.
Note: A `Dataset/EX3/` folder exists, but EX3 is excluded from the main benchmark. `Dataset/benchmark_with_EX3.jsonl` preserves it for reference. `Dataset/benchmark_with_cliff.jsonl` is a historical snapshot predating the coral boundary correction and the current rubric; it is kept for reference only.
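Because the benchmark file holds one JSON object per line, it can be read with the standard library alone. The following is a minimal sketch, not code from this repository: the function name `load_benchmark` is hypothetical, and no field names are assumed (see `Dataset/GUIDE.md` for the actual schema).

```python
import json
from pathlib import Path


def load_benchmark(path):
    """Read a JSON Lines file: one JSON object per non-empty line.

    Hypothetical helper for illustration; it makes no assumption
    about which keys each benchmark entry contains.
    """
    entries = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between entries
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Report which line is malformed instead of failing silently.
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({err})"
                ) from err
    return entries


# Example usage (run from the repository root):
#   benchmark = load_benchmark("Dataset/benchmark.jsonl")
#   print(len(benchmark))
```

Since `benchmark.jsonl` is generated, any edits made through a loader like this should go into the YAML sources rather than back into the JSONL file.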
Every response is scored on 8 criteria (16 points max):
| # | Criterion | What it measures (abbreviated) |
|---|---|---|
| 1 | `conciseness` | Under 300 characters; no padding or question-restating |
| 2 | `vocabulary` | Language accessible to a middle schooler; no unexplained jargon |
| 3 | `accuracy` | No incorrect claims about block behavior, sensors, or robot physics |
| 4 | `formatting` | Clean prose; no raw XML, markdown artifacts, or complex sentence structures |
| 5 | `tone` | Encouraging and patient; praise is meaningful, not automatic filler |
| 6 | `actionability` | Student knows what to look at or try next after reading the response |
| 7 | `targetedness` | Engages with this student's specific situation, not generic advice |
| 8 | (type-specific) | See below |
Type-specific criterion (8th):
| Criterion | Question type | What it measures |
|---|---|---|
| `hint_not_solution` | `debugging` | Guides the student toward the fix without stating it outright (Socratic) |
| `acknowledges_progression` | `debugging_iterative` | Recognizes and validates the student's self-directed iteration history |
| `builds_on_success` | `optimization` | Acknowledges what already works before suggesting improvements |
| `conceptual_clarity` | `conceptual` | Explains the concept in a way that builds intuition, not just states a fact |
Note: Every EX may also provide per-question rubric additions or overrides to give context specific to that example.
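The point arithmetic described above (seven shared criteria plus one type-specific criterion, 0 to 2 points each, 16 points maximum) can be sketched as follows. This is an illustration only: the function names and the dictionary-of-scores input are assumptions rather than the format used by `evaluate.py`, scores are assumed to be the integers 0, 1, or 2, and per-question rubric overrides are ignored.

```python
# Criterion names are taken from the rubric tables; everything else
# (function names, input shape) is hypothetical.
SHARED_CRITERIA = [
    "conciseness", "vocabulary", "accuracy", "formatting",
    "tone", "actionability", "targetedness",
]

TYPE_SPECIFIC_CRITERION = {
    "debugging": "hint_not_solution",
    "debugging_iterative": "acknowledges_progression",
    "optimization": "builds_on_success",
    "conceptual": "conceptual_clarity",
}

MAX_POINTS_PER_CRITERION = 2


def criteria_for(question_type):
    """Return the 8 criterion names that apply to a question type."""
    return SHARED_CRITERIA + [TYPE_SPECIFIC_CRITERION[question_type]]


def total_score(question_type, scores):
    """Sum the scores for one response.

    `scores` maps criterion name -> 0, 1, or 2 (assumed integer scale).
    Returns (total, maximum, percent).
    """
    expected = criteria_for(question_type)
    missing = [name for name in expected if name not in scores]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    for name in expected:
        if scores[name] not in (0, 1, 2):
            raise ValueError(f"{name}: score must be 0, 1, or 2")
    total = sum(scores[name] for name in expected)
    maximum = MAX_POINTS_PER_CRITERION * len(expected)
    return total, maximum, round(100 * total / maximum, 1)
```

For example, a `debugging` response that earns 2 points on every criterion yields `(16, 16, 100.0)`.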
11 models have been run through the full benchmark (Trial1 and Trial2 each):
| Model | Notes |
|---|---|
| `deepseek-r1_8b` | Thinking model |
| `gemma-4-31b-it` | |
| `gemma3_27b` | |
| `gemma3_4b` | |
| `gemma4_e4b` | Edge variant |
| `gpt-oss_20b` | |
| `nemotron-3-super_120b` | |
| `olmo-3_7b-think` | Thinking model |
| `qwen3-coder_30b` | Coder variant |
| `qwen3.5_9b` | |
| `qwen3.6_latest` | |
Results live in Evaluations/<model>/Trial<N>/<judge>/.
The most up-to-date results use the `hybrid-claude-sonnet-4-6-v4` judge. `results_script-check-hybrid-claude-sonnet-4-6-v4.xlsx` is a spreadsheet of those results with the character-count conciseness rules applied across all questions.
python run_benchmark.py --model qwen2.5:32b
# Run a subset of questions
python run_benchmark.py --model qwen2.5:32b --questions 1 5 # EX1 through EX5
python run_benchmark.py --model qwen2.5:32b --questions 7 # EX7 only
python run_benchmark.py --model qwen2.5:32b --questions 1 3 7 # EX1, EX3, EX7

Responses are saved to `Responses/<model>/Trial<N>/EX*.txt`. Each run automatically creates the next trial directory (Trial1, Trial2, ...).
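One way the "next trial directory" behavior could work is to scan the model's folder for existing `Trial<N>` directories and create the one after the highest number found. The sketch below is an assumption about the approach, not the code in `run_benchmark.py`; the function name `next_trial_dir` and its parameters are hypothetical.

```python
import re
from pathlib import Path

# Matches directory names such as "Trial1" or "Trial12".
TRIAL_PATTERN = re.compile(r"Trial(\d+)")


def next_trial_dir(responses_root, model_slug):
    """Create and return <responses_root>/<model_slug>/Trial<N+1>.

    Hypothetical helper: N is the highest existing trial number,
    or 0 when the model has no trials yet.
    """
    model_dir = Path(responses_root) / model_slug
    existing = []
    if model_dir.is_dir():
        for child in model_dir.iterdir():
            match = TRIAL_PATTERN.fullmatch(child.name)
            if child.is_dir() and match:
                existing.append(int(match.group(1)))
    next_number = max(existing, default=0) + 1
    trial_dir = model_dir / f"Trial{next_number}"
    # exist_ok=False guards against overwriting an earlier run.
    trial_dir.mkdir(parents=True, exist_ok=False)
    return trial_dir
```

Comparing the trial numbers as integers rather than as strings keeps `Trial10` ordered after `Trial9`.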
python run_benchmark_aistudio.py --model gemini-2.0-flash --api-key YOUR_KEY

Or set `AISTUDIO_API_KEY` in your environment and omit `--api-key`. Rate limit retries are handled automatically.
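The flag-or-environment behavior described above is a common pattern: prefer the command-line value and fall back to the environment variable. The sketch below shows one way to do this with `argparse`. Only the `--model` and `--api-key` flags and the `AISTUDIO_API_KEY` variable come from this README; the function name `resolve_api_key` and the error text are hypothetical, and the real script may handle this differently.

```python
import argparse
import os


def resolve_api_key(argv=None, environ=None):
    """Return (model, api_key), preferring --api-key over the environment.

    Hypothetical helper for illustration. `environ` can be injected
    for testing; it defaults to the process environment.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--api-key", default=None)
    args = parser.parse_args(argv)

    api_key = args.api_key or environ.get("AISTUDIO_API_KEY")
    if not api_key:
        # parser.error prints a usage message and exits with status 2.
        parser.error("provide --api-key or set AISTUDIO_API_KEY")
    return args.model, api_key
```

Keeping the key in the environment avoids leaving it in shell history, which is the usual reason to offer both options.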
# --model <model to be evaluated> --trial <N> --judge <model used as judge>
python evaluate.py --model <model_slug> --trial <N> --judge <judge_model>

Results are saved to `Evaluations/<model>/Trial<N>/<judge>/EX*.json` and `summary.json`.
- `EX*.json`: per-question scores and reasoning for all 8 criteria
- `summary.json`: aggregate report with per-criterion and per-question-type breakdowns
Claude can act directly as the judge without calling evaluate.py. This currently produces higher quality scores than a local LLM.
Open Claude Code in this repository and tell it to evaluate a model:
evaluate gemma3_27b trial 1
evaluate gemma3_27b and gemma3_4b # runs both in parallel
evaluate nemotron # partial name match, latest trial
Claude reads CLAUDE.md for the full scoring instructions and writes results to Evaluations/<model>/Trial<N>/claude-sonnet-4-6-v4/. After scoring it runs a connector scan to verify formatting scores, then prints a summary table.
The judge subfolder name identifies the judge model and rubric version used. Current production judge subfolder is claude-sonnet-4-6-v4. Earlier subfolders (claude-sonnet-4-6, claude-sonnet-4-6-v2, claude-sonnet-4-6-v3) reflect prior rubric iterations and are kept for historical comparison. hybrid-claude-sonnet-4-6-v4 subfolders contain scores that were manually reviewed and corrected after the automated pass.
# Recalculate total_score and percent from criteria arrays; rebuild summary.json
python fix_scores.py --model <model_slug> --trial <N> --judge <judge>
# Generate an XLSX summary across all models for a given judge
python make_xlsx.py --judge <judge>
# Compare two evaluation runs and report absolute score differences
python abs_diff.py --model <model_slug> --trial <N> --judge-a <judge> --judge-b <judge>

`EvaluatorTools/conciseness_analyze.py` counts characters, words, and approximate sentences in a response. `evaluate.py` calls it automatically when scoring the conciseness criterion. You can also run it manually:
python EvaluatorTools/conciseness_analyze.py --text "Your response here"
python EvaluatorTools/conciseness_analyze.py --model gemma3_27b --trial 1 --ex 6

python make_grading_sheet.py # outputs a .xlsx viewable in Excel or Google Sheets
python make_xlsx.py --judge <judge> # XLSX summary of all evaluation results for a given judge

`make_grading_sheet.py` produces a blank grading sheet from the current benchmark questions, which is useful for human review. `make_xlsx.py` produces a filled results summary from existing evaluation output.
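For readers who want to check a response by hand, the kind of counting attributed to `EvaluatorTools/conciseness_analyze.py` (characters, words, approximate sentences) can be approximated in a few lines. This is a rough sketch and not the script's actual logic: the function name `analyze_text`, the whitespace stripping, and the sentence heuristic are assumptions. The 300-character threshold comes from the conciseness criterion in the rubric.

```python
import re

# "Under 300 characters" is the rubric's conciseness threshold.
CHARACTER_LIMIT = 300


def analyze_text(text):
    """Count characters, words, and approximate sentences in a response.

    Hypothetical helper: leading/trailing whitespace is ignored, and
    sentences are approximated by splitting on runs of '.', '!', or '?'.
    """
    stripped = text.strip()
    words = stripped.split()
    sentences = [
        part for part in re.split(r"[.!?]+", stripped) if part.strip()
    ]
    return {
        "characters": len(stripped),
        "words": len(words),
        "sentences": len(sentences),
        "under_limit": len(stripped) < CHARACTER_LIMIT,
    }
```

The sentence count is only approximate: abbreviations such as "e.g." or decimal numbers would be split into extra sentences by this heuristic.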
An assignment is the VEX VR challenge the student is working on (e.g. Coral Reef Rescue). Each assignment has its own file that gets injected into the system prompt so the tutor LLM understands the task context.
- Create the assignment file in `LLMUnifiedConfig/Assignments/<AssignmentName>.txt`. The file should contain the student-facing challenge description plus any environment notes the tutor LLM needs: collection mechanics, boundary behavior, available sensors, what the example solution looks like, and any blocks that are not relevant to this challenge. Use `CoralReefRescue.txt` as a reference.
- Author questions for the new assignment using the Question Builder (see Adding New Questions below). Set the `assignment` field in the question form to match the filename exactly (without `.txt`). This value is stored in each `question.yaml` and carried into `benchmark.jsonl`.
- Rebuild the benchmark if you authored any YAML files manually outside the Question Builder:
python Dataset/build.py
The run scripts (run_benchmark.py, run_benchmark_aistudio.py) will automatically find and load the correct assignment file by matching the assignment field in each benchmark entry against filenames in LLMUnifiedConfig/Assignments/. If no match is found, the script will exit with a list of available assignment names.
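The lookup described above (match the `assignment` value against filenames, exit with the available names when nothing matches) could be implemented along these lines. This is a sketch under stated assumptions, not the run scripts' actual code: the function names are hypothetical, and the match is assumed to be an exact, case-sensitive comparison against the filename without `.txt`.

```python
import sys
from pathlib import Path

# Directory named in this README; passed as a default so it can be overridden.
ASSIGNMENTS_DIR = Path("LLMUnifiedConfig/Assignments")


def available_assignments(assignments_dir=ASSIGNMENTS_DIR):
    """List assignment names, i.e. the .txt filenames without extension."""
    return sorted(path.stem for path in Path(assignments_dir).glob("*.txt"))


def load_assignment(name, assignments_dir=ASSIGNMENTS_DIR):
    """Return the assignment text, or exit listing the valid names.

    Hypothetical helper: comparing against the directory listing keeps
    the match exact even on case-insensitive filesystems.
    """
    names = available_assignments(assignments_dir)
    if name not in names:
        listing = ", ".join(names) or "(none found)"
        sys.exit(f"Unknown assignment '{name}'. Available: {listing}")
    path = Path(assignments_dir) / f"{name}.txt"
    return path.read_text(encoding="utf-8")
```

Failing fast with the list of valid names makes a typo in a `question.yaml` file easy to spot before any model calls are made.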
Use the Question Builder — a local web tool for authoring new benchmark entries with a drag-and-drop Blockly interface.
- Python 3 with PyYAML (`pip install pyyaml`)
- Ollama running locally (only needed for the Test feature)
- Internet access (Blockly loads from CDN)
python question_builder/server.py

Then open http://localhost:8765 in a browser.
The tool provides:
- Blockly workspace: drag-and-drop blocks to build the student's code. For `debugging_iterative` questions, add up to 3 attempt tabs.
- Question form: fill in question type, assignment, bug/topic category, student question, and evaluator notes. Entry-specific rubric overrides and scoring guide overrides are optional.
- Test panel: select any locally available Ollama model and send the current question to it to preview how a model responds before saving.
- Save: writes `Dataset/EX{N}/question.yaml` and the XML file(s), then automatically rebuilds `benchmark.jsonl`.
CSTutorBench/
├── README.md
├── CLAUDE.md # Scoring instructions for Claude Code as judge
├── results_script-check-hybrid-claude-sonnet-4-6-v4.xlsx # Hybrid results with character-count conciseness rules
├── Dataset/
│ ├── GUIDE.md # Full schema documentation
│ ├── rubric_template.yaml # Shared rubric text for all criteria
│ ├── benchmark.jsonl # Generated — one JSON object per line
│ ├── benchmark_with_EX3.jsonl # Variant including the excluded EX3
│ ├── benchmark_with_cliff.jsonl # Pre-correction snapshot (old boundary logic)
│ ├── build.py # Rebuilds benchmark.jsonl from YAML sources
│ ├── EX1/
│ │ ├── VEX.xml # Student's Blockly code
│ │ └── question.yaml # Question metadata and rubric overrides
│ ├── EX2/ … EX18/
│ └── EX3/ # Excluded from main benchmark
├── LLMUnifiedConfig/
│ ├── SystemPrompt.txt # System prompt template for the tutor LLM
│ ├── VEXVRReferenceTable.txt # Block reference table injected into the prompt
│ └── Assignments/
│ └── CoralReefRescue.txt # Assignment context injected into the prompt
├── question_builder/
│ ├── server.py # Local HTTP server for the authoring UI
│ ├── index.html # Question Builder web interface
│ ├── blocks.js # Blockly block definitions
│ └── VEXBlockly.xml # Blockly toolbox definition
├── Responses/
│ └── <model>/Trial<N>/EX*.txt # Raw model responses
├── Evaluations/
│ ├── review-analysis.md # Human vs. judge disagreement analysis
│ └── <model>/Trial<N>/<judge>/
│ ├── EX*.json # Per-question scores and reasoning
│ └── summary.json # Aggregate results
├── EvaluatorTools/
│ └── conciseness_analyze.py # Count chars/words/sentences for conciseness scoring
├── run_benchmark.py # Collect responses from a local Ollama model
├── run_benchmark_aistudio.py # Collect responses from Google AI Studio
├── evaluate.py # Score responses with an LLM judge
├── fix_scores.py # Recalculate totals from criteria arrays
├── make_xlsx.py # Generate XLSX summary across all models
├── abs_diff.py # Compare two judge runs, report differences
└── make_grading_sheet.py # Generate XLSX sheet for human grading
- Python 3.8+
- `pyyaml`: for `Dataset/build.py` and `question_builder/server.py` (`pip install pyyaml`)
- Ollama: for `run_benchmark.py` and the Question Builder test feature
- `openpyxl`: for `make_grading_sheet.py` and `make_xlsx.py` (`pip install openpyxl`)
- No additional dependencies for `run_benchmark_aistudio.py` or `evaluate.py`