InviteInstitute/CSTutorBench

CSTutorBench

A benchmark for evaluating LLM tutors for computer science (CS) in VEX VR, a block-based robotics programming environment. The benchmark tests whether an AI tutor gives responses that are accurate, concise, age-appropriate, and pedagogically sound for middle school students.


Overview

This is a single-turn benchmark that assesses LLMs on snapshots of common student states. There are 17 scenarios (EXs, short for examples) in which the LLM tutor receives Blockly code in XML format plus a student's question, then responds. Currently, all 17 scenarios exist within VEX's Coral Reef Clean-up assignment.

Features

  • Running the benchmark on LLMs through Ollama or AIStudio
  • Evaluating the rubric with an LLM through Ollama or Claude (instructions given to Claude in CLAUDE.md)
  • Support for multiple assignments
  • A Question Builder web tool for authoring new EXs
  • An 8-criterion rubric (0-2 points each, 16 points max per EX)

Dataset

17 questions across four types:

| Type | Count | Description |
|---|---|---|
| debugging | 8 | Student has a bug and describes unexpected behavior |
| debugging_iterative | 6 | Student tried to fix a bug 2–3 times before asking; multiple code attempts provided |
| optimization | 1 | Student's code works but they want to improve it |
| conceptual | 2 | Student is asking how something works, not fixing broken code |

Bug categories covered: missing_forever_loop, blocking_command_drive_for, inverted_logic, sequential_blocking_conflict, wrong_sensor_unavailable, absolute_vs_relative_heading, dead_reckoning_no_sensing, sensor_field_of_view, misunderstood_block_magnet, misplaced_forever_loop, blocking_command_still_too_long, inverted_logic_wrong_fix, finite_repeat_instead_of_forever, sensor_added_wrong_position, absolute_heading_escalating_numbers, micro_step_jitter, optimization_sensor_driven

Location: Dataset/benchmark.jsonl — one JSON object per line. See Dataset/GUIDE.md for the full schema.
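
Because the file holds one JSON object per line, it can be read with the standard library alone. The following is a minimal sketch, not the project's own loader: `load_benchmark` and `count_by_type` are hypothetical helper names, and the `type` key is an assumed field name (Dataset/GUIDE.md documents the real schema).

```python
import json
from collections import Counter
from pathlib import Path


def load_benchmark(path):
    """Read a JSONL file: one JSON object per non-empty line."""
    entries = []
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc})") from exc
    return entries


def count_by_type(entries, key="type"):
    # "type" is an assumed field name; check Dataset/GUIDE.md for the real one.
    return Counter(entry.get(key, "<missing>") for entry in entries)
```

Calling `count_by_type(load_benchmark("Dataset/benchmark.jsonl"))` would then give a quick tally to compare against the table above, provided the assumed field name matches the schema.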

benchmark.jsonl is a generated file. The source of truth is Dataset/rubric_template.yaml and Dataset/EX*/question.yaml. Run python Dataset/build.py to rebuild it after editing any YAML file.

Note: A Dataset/EX3/ folder exists but EX3 is excluded from the main benchmark. Dataset/benchmark_with_EX3.jsonl preserves it for reference. Dataset/benchmark_with_cliff.jsonl is a historical snapshot predating the coral boundary correction and the current rubric — kept for reference only.


Rubric

Every response is scored on 8 criteria (16 points max):

| # | Criterion | What it measures (abbreviated) |
|---|---|---|
| 1 | conciseness | Under 300 characters; no padding or question-restating |
| 2 | vocabulary | Language accessible to a middle schooler; no unexplained jargon |
| 3 | accuracy | No incorrect claims about block behavior, sensors, or robot physics |
| 4 | formatting | Clean prose; no raw XML, markdown artifacts, or complex sentence structures |
| 5 | tone | Encouraging and patient; praise is meaningful, not automatic filler |
| 6 | actionability | Student knows what to look at or try next after reading the response |
| 7 | targetedness | Engages with this student's specific situation, not generic advice |
| 8 | (type-specific) | See below |

Type-specific criterion (8th):

| Criterion | Question type | What it measures |
|---|---|---|
| hint_not_solution | debugging | Guides the student toward the fix without stating it outright (Socratic) |
| acknowledges_progression | debugging_iterative | Recognizes and validates the student's self-directed iteration history |
| builds_on_success | optimization | Acknowledges what already works before suggesting improvements |
| conceptual_clarity | conceptual | Explains the concept in a way that builds intuition, not just states a fact |

Note: Every EX may also provide per-question rubric additions or overrides to give context specific to that example.
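
The rubric structure described in this section (seven shared criteria plus one chosen by question type, each worth 0-2 points) can be written down as a small lookup. This is an illustrative sketch only: the criterion and type names come from the tables in this section, while `criteria_for` and `max_score` are hypothetical helpers, and per-question overrides are not modeled.

```python
SHARED_CRITERIA = [
    "conciseness",
    "vocabulary",
    "accuracy",
    "formatting",
    "tone",
    "actionability",
    "targetedness",
]

# Eighth criterion, selected by question type.
TYPE_SPECIFIC = {
    "debugging": "hint_not_solution",
    "debugging_iterative": "acknowledges_progression",
    "optimization": "builds_on_success",
    "conceptual": "conceptual_clarity",
}

MAX_POINTS_PER_CRITERION = 2


def criteria_for(question_type):
    """Return the eight criterion names that apply to a question type."""
    try:
        specific = TYPE_SPECIFIC[question_type]
    except KeyError:
        raise ValueError(f"unknown question type: {question_type!r}") from None
    return SHARED_CRITERIA + [specific]


def max_score(question_type):
    """Maximum points for one EX of this type (8 criteria x 2 points)."""
    return len(criteria_for(question_type)) * MAX_POINTS_PER_CRITERION
```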


Models Evaluated

11 models have been run through the full benchmark (Trial1 and Trial2 each):

| Model | Notes |
|---|---|
| deepseek-r1_8b | Thinking model |
| gemma-4-31b-it | |
| gemma3_27b | |
| gemma3_4b | |
| gemma4_e4b | Edge variant |
| gpt-oss_20b | |
| nemotron-3-super_120b | |
| olmo-3_7b-think | Thinking model |
| qwen3-coder_30b | Coder variant |
| qwen3.5_9b | |
| qwen3.6_latest | |

Results live in Evaluations/<model>/Trial<N>/<judge>/. The most up-to-date results use the hybrid-claude-sonnet-4-6-v4 judge. results_script-check-hybrid-claude-sonnet-4-6-v4.xlsx is a spreadsheet of those results with the character-count conciseness rules applied across all questions.


Running the Benchmark

With a local Ollama model

python run_benchmark.py --model qwen2.5:32b
# Run a subset of questions
python run_benchmark.py --model qwen2.5:32b --questions 1 5   # EX1 through EX5
python run_benchmark.py --model qwen2.5:32b --questions 7     # EX7 only
python run_benchmark.py --model qwen2.5:32b --questions 1 3 7 # EX1, EX3, EX7

Responses are saved to Responses/<model>/Trial<N>/EX*.txt. Each run automatically creates the next trial directory (Trial1, Trial2, ...).
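
One way the "next trial directory" behavior could work is to scan the model's folder for existing `Trial<N>` directories and add one to the highest number. The sketch below illustrates that idea; `next_trial_dir` is a hypothetical helper, and run_benchmark.py may implement this differently.

```python
import re
from pathlib import Path

TRIAL_RE = re.compile(r"^Trial(\d+)$")


def next_trial_dir(model_dir):
    """Return the path of the next unused Trial<N> directory (not created here)."""
    model_dir = Path(model_dir)
    numbers = []
    if model_dir.is_dir():
        for child in model_dir.iterdir():
            match = TRIAL_RE.match(child.name)
            if child.is_dir() and match:
                numbers.append(int(match.group(1)))
    # With no existing trials, max(...) falls back to 0 and the result is Trial1.
    return model_dir / f"Trial{max(numbers, default=0) + 1}"
```

Comparing the numbers as integers rather than strings keeps Trial10 ordered after Trial9.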

With Google AI Studio (Gemini/Gemma)

python run_benchmark_aistudio.py --model gemini-2.0-flash --api-key YOUR_KEY

Or set AISTUDIO_API_KEY in your environment and omit --api-key. Rate limit retries are handled automatically.
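
The key lookup and retry behavior described here can be sketched as follows. Everything except the AISTUDIO_API_KEY variable name and the --api-key flag is an assumption for illustration: `resolve_api_key`, `call_with_retries`, and the `RateLimited` exception are hypothetical stand-ins, and the actual script's backoff policy is not documented in this README.

```python
import os
import time


class RateLimited(Exception):
    """Hypothetical stand-in for whatever rate-limit error the API client raises."""


def resolve_api_key(cli_value=None, env=None):
    """Prefer the --api-key value, then fall back to AISTUDIO_API_KEY."""
    env = os.environ if env is None else env
    key = cli_value or env.get("AISTUDIO_API_KEY")
    if not key:
        raise SystemExit("No API key: pass --api-key or set AISTUDIO_API_KEY")
    return key


def call_with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimited with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```

Passing `sleep` as a parameter makes the backoff easy to test without real delays.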


Evaluating Responses

# --model <model to be evaluated> --trial <N> --judge <model used as judge>
python evaluate.py --model <model_slug> --trial <N> --judge <judge_model>

Results are saved to Evaluations/<model>/Trial<N>/<judge>/EX*.json and summary.json.

  • EX*.json — per-question scores and reasoning for all 8 criteria
  • summary.json — aggregate report with per-criterion and per-question-type breakdowns
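
To make the two breakdowns concrete, here is a rough sketch of how per-criterion and per-question-type averages could be derived from per-question results. The record layout (`type`, `criteria`, `name`, `score`) is assumed for illustration and may not match the real EX*.json or summary.json files; `summarize` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def summarize(results):
    """Average total score per question type and average score per criterion."""
    totals_by_type = defaultdict(list)
    scores_by_criterion = defaultdict(list)
    for result in results:
        # Assumed layout: {"type": str, "criteria": [{"name": str, "score": int}, ...]}
        totals_by_type[result["type"]].append(
            sum(c["score"] for c in result["criteria"])
        )
        for criterion in result["criteria"]:
            scores_by_criterion[criterion["name"]].append(criterion["score"])
    return {
        "per_question_type": {t: mean(v) for t, v in totals_by_type.items()},
        "per_criterion": {n: mean(v) for n, v in scores_by_criterion.items()},
    }
```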

With Claude Code

Claude can act directly as the judge without calling evaluate.py. This currently produces higher-quality scores than a local LLM.

Open Claude Code in this repository and tell it to evaluate a model:

evaluate gemma3_27b trial 1
evaluate gemma3_27b and gemma3_4b          # runs both in parallel
evaluate nemotron                          # partial name match, latest trial

Claude reads CLAUDE.md for the full scoring instructions and writes results to Evaluations/<model>/Trial<N>/claude-sonnet-4-6-v4/. After scoring it runs a connector scan to verify formatting scores, then prints a summary table.

Judge versioning

The judge subfolder name identifies the judge model and rubric version used. Current production judge subfolder is claude-sonnet-4-6-v4. Earlier subfolders (claude-sonnet-4-6, claude-sonnet-4-6-v2, claude-sonnet-4-6-v3) reflect prior rubric iterations and are kept for historical comparison. hybrid-claude-sonnet-4-6-v4 subfolders contain scores that were manually reviewed and corrected after the automated pass.
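
The naming pattern above can be split into its parts programmatically. The sketch below is an assumption-based illustration, not a repository utility: `parse_judge` and `JudgeInfo` are hypothetical names, and the rule (optional `hybrid-` prefix, optional `-v<N>` suffix) is inferred only from the folder names listed in this section.

```python
import re
from typing import NamedTuple, Optional

JUDGE_RE = re.compile(r"(?P<hybrid>hybrid-)?(?P<model>.+?)(?:-v(?P<version>\d+))?")


class JudgeInfo(NamedTuple):
    model: str
    rubric_version: Optional[int]  # None when the folder name has no -v<N> suffix
    hybrid: bool  # True for manually reviewed "hybrid-" folders


def parse_judge(folder_name):
    """Split a judge subfolder name into model, rubric version, and hybrid flag."""
    match = JUDGE_RE.fullmatch(folder_name)
    if match is None:
        raise ValueError(f"unrecognized judge folder: {folder_name!r}")
    version = match.group("version")
    return JudgeInfo(
        model=match.group("model"),
        rubric_version=int(version) if version is not None else None,
        hybrid=match.group("hybrid") is not None,
    )
```

For example, `parse_judge("hybrid-claude-sonnet-4-6-v4")` yields the model `claude-sonnet-4-6`, rubric version 4, and `hybrid=True`.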

Utility scripts

# Recalculate total_score and percent from criteria arrays; rebuild summary.json
python fix_scores.py --model <model_slug> --trial <N> --judge <judge>

# Generate an XLSX summary across all models for a given judge
python make_xlsx.py --judge <judge>

# Compare two evaluation runs and report absolute score differences
python abs_diff.py --model <model_slug> --trial <N> --judge-a <judge> --judge-b <judge>
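
As a rough illustration of what recalculating total_score and percent from the criteria array involves, consider the sketch below. The JSON layout (a `criteria` list of objects with a `score` key) is an assumption made for this example, and `recalculate` is a hypothetical function, not the code in fix_scores.py.

```python
def recalculate(result, max_per_criterion=2):
    """Return a copy of result with total_score and percent recomputed."""
    criteria = result["criteria"]  # assumed layout: [{"name": ..., "score": 0-2}, ...]
    for criterion in criteria:
        if not 0 <= criterion["score"] <= max_per_criterion:
            raise ValueError(f"score out of range: {criterion!r}")
    total = sum(criterion["score"] for criterion in criteria)
    max_total = len(criteria) * max_per_criterion
    fixed = dict(result)  # shallow copy; the input dict is left unchanged
    fixed["total_score"] = total
    fixed["percent"] = round(100 * total / max_total, 1) if max_total else 0.0
    return fixed
```

With eight criteria scored 2, 2, 1, 2, 0, 2, 1, 2, this gives a total_score of 12 out of 16 and a percent of 75.0.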

Evaluator tools

EvaluatorTools/conciseness_analyze.py counts characters, words, and approximate sentences in a response. evaluate.py calls it automatically when scoring the conciseness criterion. You can also run it manually:

python EvaluatorTools/conciseness_analyze.py --text "Your response here"
python EvaluatorTools/conciseness_analyze.py --model gemma3_27b --trial 1 --ex 6
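
For a sense of what such counts look like, here is a simplified sketch of character, word, and approximate sentence counting with the 300-character limit from the rubric. It is not the code in conciseness_analyze.py: `analyze` is a hypothetical function, and stripping surrounding whitespace and splitting sentences on `.`, `!`, or `?` are assumptions of this example.

```python
import re


def analyze(text, char_limit=300):
    """Count characters, words, and approximate sentences in a response."""
    stripped = text.strip()
    # Approximate sentence split: break after ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", stripped) if s]
    return {
        "characters": len(stripped),
        "words": len(stripped.split()),
        "sentences": len(sentences),
        "under_limit": len(stripped) < char_limit,
    }
```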

Human Readable Sheet

python make_grading_sheet.py  # outputs a .xlsx viewable in Excel or Google Sheets
python make_xlsx.py --judge <judge>  # XLSX summary of all evaluation results for a given judge

make_grading_sheet.py produces a blank grading sheet from the current benchmark questions — useful for human review. make_xlsx.py produces a filled results summary from existing evaluation output.


Adding New Assignments

An assignment is the VEX VR challenge the student is working on (e.g. Coral Reef Rescue). Each assignment has its own file that gets injected into the system prompt so the tutor LLM understands the task context.

Steps

  1. Create the assignment file in LLMUnifiedConfig/Assignments/<AssignmentName>.txt. The file should contain the student-facing challenge description plus any environment notes the tutor LLM needs — collection mechanics, boundary behavior, available sensors, what the example solution looks like, and any blocks that are not relevant to this challenge. Use CoralReefRescue.txt as a reference.

  2. Author questions for the new assignment using the Question Builder (see Adding New Questions below). Set the assignment field in the question form to match the filename exactly (without .txt). This value is stored in each question.yaml and carried into benchmark.jsonl.

  3. Rebuild the benchmark if you authored any YAML files manually outside the Question Builder:

    python Dataset/build.py

The run scripts (run_benchmark.py, run_benchmark_aistudio.py) will automatically find and load the correct assignment file by matching the assignment field in each benchmark entry against filenames in LLMUnifiedConfig/Assignments/. If no match is found, the script will exit with a list of available assignment names.
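
The lookup described above could be sketched like this. It is an illustration under stated assumptions, not the run scripts' actual code: `find_assignment_file` is a hypothetical helper, and exact, case-sensitive matching on the filename without .txt is assumed from the authoring instructions in step 2.

```python
import sys
from pathlib import Path


def find_assignment_file(assignment, assignments_dir="LLMUnifiedConfig/Assignments"):
    """Return the assignment's .txt path, or exit listing the available names."""
    directory = Path(assignments_dir)
    available = sorted(path.stem for path in directory.glob("*.txt"))
    if assignment not in available:
        names = ", ".join(available) or "(none found)"
        sys.exit(f"Unknown assignment {assignment!r}. Available: {names}")
    return directory / f"{assignment}.txt"
```

Listing the available names in the exit message makes a typo in a question.yaml assignment field quick to spot.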


Adding New Questions

Use the Question Builder — a local web tool for authoring new benchmark entries with a drag-and-drop Blockly interface.

Requirements

  • Python 3 with PyYAML (pip install pyyaml)
  • Ollama running locally (only needed for the Test feature)
  • Internet access (Blockly loads from CDN)

Usage

python question_builder/server.py

Then open http://localhost:8765 in a browser.

The tool provides:

  • Blockly workspace — drag-and-drop blocks to build the student's code. For debugging_iterative questions, add up to 3 attempt tabs.
  • Question form — fill in question type, assignment, bug/topic category, student question, and evaluator notes. Entry-specific rubric overrides and scoring guide overrides are optional.
  • Test panel — select any locally available Ollama model and send the current question to it to preview how a model responds before saving.
  • Save — writes Dataset/EX{N}/question.yaml and the XML file(s), then automatically rebuilds benchmark.jsonl.

Repository Structure

CSTutorBench/
├── README.md
├── CLAUDE.md                               # Scoring instructions for Claude Code as judge
├── results_script-check-hybrid-claude-sonnet-4-6-v4.xlsx  # Hybrid results with character-count conciseness rules
├── Dataset/
│   ├── GUIDE.md                        # Full schema documentation
│   ├── rubric_template.yaml            # Shared rubric text for all criteria
│   ├── benchmark.jsonl                 # Generated — one JSON object per line
│   ├── benchmark_with_EX3.jsonl        # Variant including the excluded EX3
│   ├── benchmark_with_cliff.jsonl      # Pre-correction snapshot (old boundary logic)
│   ├── build.py                        # Rebuilds benchmark.jsonl from YAML sources
│   ├── EX1/
│   │   ├── VEX.xml                     # Student's Blockly code
│   │   └── question.yaml               # Question metadata and rubric overrides
│   ├── EX2/ … EX18/
│   └── EX3/                            # Excluded from main benchmark
├── LLMUnifiedConfig/
│   ├── SystemPrompt.txt                # System prompt template for the tutor LLM
│   ├── VEXVRReferenceTable.txt         # Block reference table injected into the prompt
│   └── Assignments/
│       └── CoralReefRescue.txt         # Assignment context injected into the prompt
├── question_builder/
│   ├── server.py                       # Local HTTP server for the authoring UI
│   ├── index.html                      # Question Builder web interface
│   ├── blocks.js                       # Blockly block definitions
│   └── VEXBlockly.xml                  # Blockly toolbox definition
├── Responses/
│   └── <model>/Trial<N>/EX*.txt        # Raw model responses
├── Evaluations/
│   ├── review-analysis.md              # Human vs. judge disagreement analysis
│   └── <model>/Trial<N>/<judge>/
│       ├── EX*.json                    # Per-question scores and reasoning
│       └── summary.json                # Aggregate results
├── EvaluatorTools/
│   └── conciseness_analyze.py          # Count chars/words/sentences for conciseness scoring
├── run_benchmark.py                    # Collect responses from a local Ollama model
├── run_benchmark_aistudio.py           # Collect responses from Google AI Studio
├── evaluate.py                         # Score responses with an LLM judge
├── fix_scores.py                       # Recalculate totals from criteria arrays
├── make_xlsx.py                        # Generate XLSX summary across all models
├── abs_diff.py                         # Compare two judge runs, report differences
└── make_grading_sheet.py               # Generate XLSX sheet for human grading

Dependencies

  • Python 3.8+
  • pyyaml — for Dataset/build.py and question_builder/server.py (pip install pyyaml)
  • Ollama — for run_benchmark.py and the Question Builder test feature
  • openpyxl — for make_grading_sheet.py and make_xlsx.py (pip install openpyxl)
  • No additional dependencies for run_benchmark_aistudio.py or evaluate.py

About

A benchmark for tutoring of coding skills in VEX/block-based coding environments, and associated LLMs for automated assessment of LLM outputs.
