A task-oriented walkthrough of the CLI and the three-pane serve UI. If
you want the one-pager pitch, read README.md. This
document assumes you've already got the repo cloned and bun or
tsx/node on PATH.
- Install
- Core concept (60 seconds)
- 10-second tour — quickstart & init --wizard
- Workflow A — iterate locally with serve / watch
- Workflow B — gate a pull request
- Workflow C — seed a dataset from a CSV
- Workflow D — disagree with the judge
- Workflow E — scheduled drift detection
- The config file
- Rubrics
- Evaluators (non-LLM metrics)
- Providers
- Exit codes and CI gating
- Cost & safety caps
- Common recipes
- Troubleshooting
Until rubric is published to npm, run it from the repo:
git clone https://github.com/rubric/rubric
cd rubric
# Option 1 — link as a global `rubric`
npm link --workspace=packages/cli
# Option 2 — run directly via tsx (no install needed)
alias rubric='tsx packages/cli/src/bin.ts'
# Option 3 — single-file binary
cd packages/cli && bun run build:binary && ./dist/rubric --help

Verify:

rubric --version # rubric 0.0.0
rubric --help

rubric compares two prompts (or two models) across a dataset using an LLM judge, then rolls the per-case verdicts into a win/loss/tie summary you can gate CI on. One cell = one (case × model × A-side × B-side) evaluation. A run is every cell evaluated end-to-end, concurrently.
dataset (JSONL) → for each case:
generate A = baseline(case)
generate B = candidate(case)
judge(A, B, case) → "a" | "b" | "tie"
─────────────────────────────
summary: wins / losses / ties / errors / win-rate
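The roll-up in the diagram can be sketched in a few lines of TypeScript. The `Verdict` and `Summary` types and the `summarize` function are illustrative names, not rubric's API. The sketch assumes that "b" counts as a candidate win and that win-rate is computed over decisive cells only (wins + losses), which is what the "of decisive 4" in the quickstart output suggests.

```typescript
// Hypothetical types: rubric's real per-cell result shape may differ.
type Verdict = "a" | "b" | "tie" | "error";

interface Summary {
  wins: number; // candidate (B) preferred
  losses: number; // baseline (A) preferred
  ties: number;
  errors: number;
  winRate: number | null; // wins / (wins + losses); null when nothing was decisive
}

function summarize(verdicts: Verdict[]): Summary {
  const count = (v: Verdict) => verdicts.filter((x) => x === v).length;
  const wins = count("b");
  const losses = count("a");
  const decisive = wins + losses;
  return {
    wins,
    losses,
    ties: count("tie"),
    errors: count("error"),
    winRate: decisive === 0 ? null : wins / decisive,
  };
}

// Same shape as the quickstart demo: 4 wins and 1 tie.
const demo = summarize(["b", "b", "tie", "b", "b"]);
console.log(demo);
```

Ties and errors are left out of the denominator here, so a run with many ties can still report a high win-rate; read the raw counts alongside it.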
Because the judge is just another LLM, rubric also ships an override
log: every time you disagree with a cell's verdict via rubric disagree, the correction is appended to a per-run overrides.jsonl.
That log is the calibration corpus — v2.3 will train a small residual
classifier on it to score the judge itself.
Two zero-friction entry points before you commit to writing a real config.
rubric quickstart runs a full end-to-end grid against a
hard-coded demo dataset using a deterministic mock provider and mock
judge. No API keys, no files written, ~10 seconds. It's the fastest way
to see what the output shape looks like.
rubric quickstart
# rubric quickstart — zero-config mock demo
# 5 cases × 1 model = 5 cells (mock provider + mock judge)
# ...
# Summary:
# wins: 4 losses: 0 ties: 1 errors: 0
# winRate: 100.0% (of decisive 4)

rubric init --wizard --describe "<task>" scaffolds a real
workspace and asks the judge model to draft baseline.md,
candidate.md, and 10 input cases from a one-sentence task
description. Every auto-generated case is tagged with
"_autogenerated": true so reviewers know not to trust the verdict
until the cases have been vetted.
rubric init --wizard \
--describe "triage incoming customer support tickets by category and urgency"
# requires OPENAI_API_KEY — or pass --mock for a templated scaffold.
rubric init --wizard --mock \
--describe "triage incoming customer support tickets"
# deterministic templates, no LLM call, useful for seeding before you
# wire a real key.

Autogenerated cases are a starting point, not ground truth — read them
with skepticism and trim the obviously-off ones before you trust the
verdict. The override log (rubric disagree) is the right place to
record where you diverged from the judge.
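To find the cases that still need vetting, a small filter over the JSONL file is enough. This is a sketch under assumptions: it treats "_autogenerated" as a top-level field on each case, and the `Case` type and both helper names are invented for illustration.

```typescript
// Assumed case shape: one JSON object per line with an optional
// top-level "_autogenerated" flag. Other field names are illustrative.
interface Case {
  input: string;
  expected?: string;
  _autogenerated?: boolean;
}

function parseJsonl(text: string): Case[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as Case);
}

// Cases that were drafted by the wizard and not yet reviewed by a human.
function unvetted(cases: Case[]): Case[] {
  return cases.filter((c) => c._autogenerated === true);
}

const sample = [
  '{"input": "My invoice is wrong", "_autogenerated": true}',
  '{"input": "App crashes on login"}',
].join("\n");

console.log(unvetted(parseJsonl(sample)).length); // 1
```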
The fastest feedback loop. Zero API keys — --mock uses a deterministic
stub provider + judge so you can see the UI light up end-to-end.
mkdir my-prompts && cd my-prompts
rubric init # scaffolds config + prompts/ + data/
rubric serve --mock # → http://127.0.0.1:5174

What you get:

- Left pane — Prompts. Tabs for baseline and candidate. Edit inline, ⌘S to save. A dot indicator tells you if the editor is clean or dirty.
- Middle pane — Cases. Read-only list of the dataset, one row per case.
- Right pane — Results. Summary bar (wins / losses / ties / errors / win-rate / total cost / wall-sum) and a grid with one row per cell. Click a row to expand: verdict banner with the judge's reason and both outputs side-by-side.
Header controls:
- mock mode — switch between the deterministic mock provider and your live providers. Leave it on for UI spelunking; turn it off when you want to spend tokens.
- ▶ Run — kicks off a sweep via SSE so the grid fills cell-by-cell.
Turn mock mode off and set the relevant API-key env var to run for real:
export OPENAI_API_KEY=sk-...
rubric serve

If you'd rather stay in your editor, rubric watch gives you the same
inner loop without opening the UI:
rubric watch # re-evals on save; cached across saves
rubric watch --once # one pass and exit (good for CI)
rubric watch --no-cache # every iteration re-runs every cell

The persistent judge-call cache is keyed on prompt + case + model + criteria, so only the cells you actually touched spend tokens. Tail the stderr chatter to see which cells hit the cache vs. re-ran.
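The caching behaviour can be pictured with a small sketch. The four key fields come from the paragraph above; everything else (the `CellKey` type, the JSON serialization, the in-memory `Map`) is an assumption made for illustration and says nothing about how rubric stores its cache on disk.

```typescript
// Hypothetical cache: key fields follow the text, the rest is illustrative.
interface CellKey {
  prompt: string; // full prompt text, so any edit changes the key
  caseId: string;
  model: string;
  criteria: string;
}

function cacheKey(k: CellKey): string {
  // Array form fixes the field order, so equal inputs give equal keys.
  return JSON.stringify([k.prompt, k.caseId, k.model, k.criteria]);
}

const cache = new Map<string, string>();

function judgeCached(k: CellKey, judge: () => string): { verdict: string; hit: boolean } {
  const key = cacheKey(k);
  const cached = cache.get(key);
  if (cached !== undefined) return { verdict: cached, hit: true };
  const verdict = judge(); // only reached on a cache miss
  cache.set(key, verdict);
  return { verdict, hit: false };
}

const base = { prompt: "v1", caseId: "case-1", model: "openai/gpt-4o-mini", criteria: "default" };
const first = judgeCached(base, () => "b");
const second = judgeCached(base, () => "a"); // judge is not called: cache hit
const edited = judgeCached({ ...base, prompt: "v2" }, () => "tie");
console.log(first.hit, second.hit, edited.hit); // false true false
```

Editing the prompt changes the key, so only that cell is re-judged; untouched cells keep returning the stored verdict.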
This is the ship-critical flow. Wire rubric run --fail-on-regress
into CI and attach the outputs to the PR:
rubric run \
--config rubric.config.json \
--fail-on-regress \
--json-out rubric-run.json \
--report rubric-report.html

- rubric-run.json — structured v1 run payload. Feed to rubric comment to render the PR comment.
- rubric-report.html — self-contained per-cell HTML report. Good CI artifact.
Render + post the PR comment:
rubric comment \
--from rubric-run.json \
--report-url https://ci.example.com/.../rubric-report.html \
--title "baseline.md vs candidate.md" > comment.md

Comments are idempotent — subsequent runs update the same comment via a hidden HTML marker instead of stacking.
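The hidden-marker approach can be sketched as a pure planning function. The marker string, the `Comment` shape, and `planComment` are all hypothetical; the actual marker rubric embeds is not documented in this guide, and the API call that posts or edits the comment is left out.

```typescript
// Hypothetical marker: an HTML comment is invisible in the rendered PR comment.
const MARKER = "<!-- rubric-comment -->";

interface Comment {
  id: number;
  body: string;
}

type Plan =
  | { action: "create"; body: string }
  | { action: "update"; id: number; body: string };

// Decide whether to post a new comment or edit the one from a previous run.
function planComment(existing: Comment[], rendered: string): Plan {
  const body = `${MARKER}\n${rendered}`;
  const prior = existing.find((c) => c.body.includes(MARKER));
  return prior ? { action: "update", id: prior.id, body } : { action: "create", body };
}

const firstRun = planComment([{ id: 1, body: "LGTM" }], "wins: 4");
const secondRun = planComment(
  [
    { id: 1, body: "LGTM" },
    { id: 7, body: `${MARKER}\nwins: 4` },
  ],
  "wins: 5",
);
console.log(firstRun.action, secondRun.action); // create update
```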
Or use the composite Action (wraps all of the above):
# .github/workflows/rubric.yml
on:
  pull_request:
    paths: ['prompts/**', 'data/**', 'rubric.config.json']
jobs:
  eval:
    runs-on: ubuntu-latest
    permissions: { pull-requests: write, contents: read }
    steps:
      - uses: actions/checkout@v4
      - uses: rubric/rubric@v2
        with:
          fail-on-regress: true
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Drop a CSV export (Google Sheets, Excel, Notion, Linear, whatever) into
data/cases.jsonl without hand-writing a line:
rubric seed --from-csv tickets.csv --out data/cases.jsonl

The adapter expects a header row with at minimum an input column. An
expected column is honored verbatim; any other columns (category,
priority, ticket-id, ...) are stuffed into metadata so your
spreadsheet notes survive the import. Header matching is case-insensitive
and trim-forgiving.
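The column-mapping rules above can be sketched as follows. This is not the adapter's code: `rowsToCases` and `SeedCase` are illustrative names, CSV parsing is assumed to have happened already, and lowercasing the metadata keys is a choice made for this sketch.

```typescript
// Hypothetical helper: maps an already-parsed CSV (header + rows of
// strings) to case objects. Quoting and escaping are out of scope.
interface SeedCase {
  input: string;
  expected?: string;
  metadata: Record<string, string>;
}

function rowsToCases(header: string[], rows: string[][]): SeedCase[] {
  // Header matching is case-insensitive and ignores surrounding spaces.
  const names = header.map((h) => h.trim().toLowerCase());
  const inputIdx = names.indexOf("input");
  if (inputIdx === -1) throw new Error('CSV needs an "input" column');
  const expectedIdx = names.indexOf("expected");

  return rows.map((row) => {
    const metadata: Record<string, string> = {};
    names.forEach((name, i) => {
      // Every column other than input/expected is kept as metadata.
      if (i !== inputIdx && i !== expectedIdx) metadata[name] = row[i] ?? "";
    });
    const c: SeedCase = { input: row[inputIdx] ?? "", metadata };
    if (expectedIdx !== -1) c.expected = row[expectedIdx] ?? "";
    return c;
  });
}

const cases = rowsToCases(
  [" Input ", "EXPECTED", "Priority"],
  [["Refund not received", "billing", "high"]],
);
// One JSON object per line, as in data/cases.jsonl.
console.log(cases.map((c) => JSON.stringify(c)).join("\n"));
```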
PII heuristic: the seed command runs a regex sweep over imported cases and prints warnings to stderr for anything that looks like an email, phone, IP, SSN, or credit-card fragment. Review before publishing.
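A regex sweep of this kind might look like the sketch below. The patterns are deliberately loose examples written for this guide, not the ones rubric ships, and they cover only three of the categories named above; expect both false positives and misses.

```typescript
// Illustrative patterns only. No "g" flag, so test() keeps no state between calls.
const PII_PATTERNS: Record<string, RegExp> = {
  email: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/,
  ssn: /\b\d{3}-\d{2}-\d{4}\b/,
  ipv4: /\b(?:\d{1,3}\.){3}\d{1,3}\b/,
};

// Returns the names of every pattern that matches somewhere in the text.
function scanPii(text: string): string[] {
  return Object.entries(PII_PATTERNS)
    .filter(([, re]) => re.test(text))
    .map(([kind]) => kind);
}

console.log(scanPii("Contact jane@example.com from 10.0.0.12")); // [ 'email', 'ipv4' ]
console.log(scanPii("Reset my password please")); // []
```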
v2.1 also shipped
--from-langfuse / --from-helicone / --from-langsmith / --from-openai-logs / --from-synthetic. Those adapters were cut in v2.2 — consolidate your pipeline on CSV, which every tool can export.
When the judge gets a cell wrong, override it from the CLI. (In-UI
overrides are not available in v2.2; see the note later in this
section.) Each override is appended to the run's overrides.jsonl and
surfaced in the PR comment footer.
# Pick the cell you want to override. Cell refs are: case-N/provider/model.
rubric disagree case-3/openai/gpt-4o-mini \
--verdict A \
--reason "judge missed the factual error in B"
# Cancel the most recent override on that cell.
rubric disagree case-3/openai/gpt-4o-mini --undo

The override log is append-only — nothing is ever mutated in place, so
you can git diff your history of disagreements.
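One way to picture an append-only log that still supports undo is to record the undo itself as another line and replay the log to find the current state. The record shape below is hypothetical (the real overrides.jsonl schema is not shown in this guide), as are the two function names.

```typescript
// Hypothetical record shape. This sketch models --undo as one more
// appended record rather than a deletion.
interface OverrideRecord {
  cell: string; // e.g. "case-3/openai/gpt-4o-mini"
  verdict?: "a" | "b" | "tie";
  reason?: string;
  undo?: boolean;
}

// Appending returns a longer log; existing lines are never rewritten.
function appendOverride(log: string, rec: OverrideRecord): string {
  return log + JSON.stringify(rec) + "\n";
}

// Replay the log to find the override currently in effect for a cell.
function effectiveVerdict(log: string, cell: string): string | undefined {
  const stack: string[] = [];
  for (const line of log.split("\n")) {
    if (line.trim() === "") continue;
    const rec = JSON.parse(line) as OverrideRecord;
    if (rec.cell !== cell) continue;
    if (rec.undo) stack.pop();
    else if (rec.verdict) stack.push(rec.verdict);
  }
  return stack[stack.length - 1];
}

const cell = "case-3/openai/gpt-4o-mini";
let log = "";
log = appendOverride(log, { cell, verdict: "a", reason: "judge missed the factual error in B" });
console.log(effectiveVerdict(log, cell)); // a
log = appendOverride(log, { cell, undo: true });
console.log(effectiveVerdict(log, cell)); // undefined
console.log(log.trim().split("\n").length); // 2: both records remain in the log
```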
In-UI override buttons in rubric serve are a v2.3 follow-up — the
v2.1 per-side + good / - bad calibration widget was removed along
with rubric calibrate, and the A / B / tie replacement hasn't
shipped yet. Drive disagreements from the CLI for v2.2.
Why this replaces rubric calibrate. In v2.1 calibration was a
separate labeling exercise with its own JSON file and its own UI. It
turned out almost nobody did it — labeling outside a real workflow is
too much friction. The override log is the same signal (judge vs.
human) captured as a byproduct of actually using the tool. v2.3 will
train a residual classifier on the override log and fold its output
into the badge.
Production models move under your feet. Drift detection is the cheap
early-warning system: cron the existing eval against a frozen baseline +
dataset, and open a GitHub issue when the candidate starts regressing.
Drop examples/drift-detector.yml into
.github/workflows/ and flip the cron to suit your release cadence.
The workflow:
- Runs rubric run --fail-on-regress.
- If exit code = 2, renders the standard PR-comment body.
- Upserts a single GitHub issue per RUBRIC_DRIFT_MARKER. Same marker across runs = same issue. Closed issues get reopened on a fresh regression. No duplicate backlog noise.
Treat drift detection as a best-effort early warning, not an SLA.
Every rubric run appends to a local run registry at ~/.rubric/runs/<id>/
(manifest + append-only cells.jsonl + overrides.jsonl). You can
inspect past runs without re-executing anything:
rubric runs list # last 20 runs, newest first
rubric runs show <id> # manifest + summary
rubric runs status <id> # "<status> <done>/<total>"
rubric runs diff <a> <b> # summary delta between two runs
rubric runs rerun <id> [--mock] # re-execute with current prompts

Override the root with --registry-root <dir> on any subcommand if you
want a per-project registry instead of the per-user default.
rubric.config.json — committed to the repo, diff'd on PRs.
{
"$schema": "https://rubric.dev/schema/v1.json",
"prompts": {
"baseline": "prompts/baseline.md",
"candidate": "prompts/candidate.md"
},
"dataset": "data/cases.jsonl",
"models": ["openai/gpt-4o-mini"],
"judge": { "model": "openai/gpt-4o", "criteria": "default" },
"concurrency": 4,
"mode": "compare-prompts"
}

Field notes:

- prompts.baseline / prompts.candidate — filesystem paths relative to the config file's directory.
- models — array. Each model id is provider/model. Every model gets its own cell in the grid.
- judge.model — usually a stronger model than what you're grading.
- judge.criteria — see next section.
- concurrency — parallel cells. 4 is a sane default; raise it if your provider's rate limits allow.
- mode — "compare-prompts" (v2.2's only supported mode). The compare-models mode was cut; compare two models by pointing baseline.md and candidate.md at the same file and listing the two models in models[].
judge.criteria accepts:
| Value | What it does |
|---|---|
| "default" | Pairwise LLM judge with a general "more correct, concise, on-task" rubric. |
| "structural-json" | Deterministic, no LLM call. Parses A and B as JSON and picks the side that deep-equals case.expected. |
| { "custom": "prose rubric…" } | Inline custom rubric text fed to the LLM judge. |
| { "file": "rubric.md" } | Team preset — load the rubric text from a file, path relative to config's directory. |
The structural-json rubric is perfect for tool-call / structured-output
evals: it's free, reproducible, and fails loud when the parser can't
extract JSON from one side.
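A minimal sketch of a structural-json style comparison, assuming the behaviour described in this guide: optional code fences are stripped, objects are compared with key-order tolerance, and the side that deep-equals the expected value wins. Returning "tie" when both or neither side matches (including when a side fails to parse) is an assumption of this sketch. All names are illustrative.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Built at runtime so this example contains no literal code fence.
const FENCE = "`".repeat(3);
const FENCED = new RegExp("^" + FENCE + "(?:json)?\\s*([\\s\\S]*?)\\s*" + FENCE + "$");

// Strip an optional fence, then parse. Returns undefined when parsing fails.
function extractJson(text: string): Json | undefined {
  const trimmed = text.trim();
  const m = trimmed.match(FENCED);
  try {
    return JSON.parse(m ? m[1] : trimmed) as Json;
  } catch {
    return undefined;
  }
}

// Deep equality where object key order does not matter (array order does).
function deepEqual(x: Json, y: Json): boolean {
  if (x === y) return true;
  if (x === null || y === null || typeof x !== "object" || typeof y !== "object") return false;
  if (Array.isArray(x) || Array.isArray(y)) {
    if (!Array.isArray(x) || !Array.isArray(y) || x.length !== y.length) return false;
    const ys = y;
    return x.every((v, i) => deepEqual(v, ys[i]));
  }
  const xo = x;
  const yo = y;
  const keys = Object.keys(xo);
  return keys.length === Object.keys(yo).length &&
    keys.every((k) => k in yo && deepEqual(xo[k], yo[k]));
}

function structuralJudge(a: string, b: string, expected: string): "a" | "b" | "tie" {
  const want = extractJson(expected);
  if (want === undefined) return "tie"; // nothing to compare against
  const pa = extractJson(a);
  const pb = extractJson(b);
  const aOk = pa !== undefined && deepEqual(pa, want);
  const bOk = pb !== undefined && deepEqual(pb, want);
  if (aOk === bOk) return "tie"; // both match or neither does
  return aOk ? "a" : "b";
}

const expected = '{"category": "billing", "urgent": true}';
const fencedB = FENCE + 'json\n{"urgent": true, "category": "billing"}\n' + FENCE;
console.log(structuralJudge('{"category": "billing"}', fencedB, expected)); // b
console.log(structuralJudge("not json", '{"category": "other"}', expected)); // tie
```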
Evaluators run alongside the pairwise judge on every cell. They're
deterministic, free, and useful for the checks that don't need model
opinion — exact match, required substrings, JSON validity, length
bands. Results land on CellResult.evaluations and roll up into the
summary as per-metric win rates.
Opt in by adding an evaluators block to rubric.config.json:
{
"evaluators": [
{ "type": "exact-match" },
{ "type": "contains", "needle": "SELECT" },
{ "type": "regex", "pattern": "^ERROR:", "flags": "m" },
{ "type": "length", "min": 1, "max": 500 },
{ "type": "json-valid" }
]
}

Catalog:

| type | What it measures |
|---|---|
| exact-match | Whether the output equals case.expected (or metadata.<field> via "field": "metadata.gold"). trim + caseSensitive knobs. |
| contains | Whether the output contains the literal needle. |
| regex | Whether the output matches the pattern. flags follow JS RegExp (gim…). |
| length | Emits length.a / length.b always, and length_in_band.a/.b when min or max is set. |
| json-valid | Whether the output parses as JSON. Accepts ```json … ``` code fences. |
Every evaluator emits a .a and .b metric so the per-side rollup
lines up with the judge's A/B framing. You can stack them: evaluators
do not conflict with the pairwise judge — they're additive signal, not
a replacement.
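The per-side metric convention can be sketched as a set of single-output checks applied to both sides of a cell. The `Check` type, the `checks` table, and `evaluateCell` are illustrative; the metric names mirror the ones used in this guide, and the simplified json_valid check here does not strip code fences.

```typescript
// A check scores one output: 1 for pass, 0 for fail.
type Check = (output: string) => number;

// Hypothetical checks configured like the example evaluators block.
const checks: Record<string, Check> = {
  contains: (o) => (o.includes("SELECT") ? 1 : 0),
  regex: (o) => (/^ERROR:/m.test(o) ? 1 : 0),
  length_in_band: (o) => (o.length >= 1 && o.length <= 500 ? 1 : 0),
  json_valid: (o) => {
    try {
      JSON.parse(o);
      return 1;
    } catch {
      return 0;
    }
  },
};

// Run every check on both sides and name the results <metric>.a / <metric>.b.
function evaluateCell(a: string, b: string): Record<string, number> {
  const out: Record<string, number> = {};
  for (const [name, check] of Object.entries(checks)) {
    out[`${name}.a`] = check(a);
    out[`${name}.b`] = check(b);
  }
  return out;
}

const metrics = evaluateCell("SELECT 1", '{"ok": true}');
console.log(metrics);
```

Because each check looks at a single output, the checks never need the judge's verdict, which is why they can run alongside it without conflict.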
Every evaluator accepts an optional failOn: 0..1 threshold. When
set, the candidate (B-side) pass rate for that evaluator's primary
metric must meet or exceed the threshold, or rubric run exits 2 —
the same exit code as --fail-on-regress. Evaluators without failOn
are report-only.
{
"evaluators": [
{ "type": "json-valid", "failOn": 1.0 },
{ "type": "exact-match", "failOn": 0.9 },
{ "type": "length", "min": 1, "max": 500, "failOn": 0.95 }
]
}

The primary metric for each type is always the candidate side — that's what CI cares about (the new prompt crossing a quality bar):

| type | Gated metric |
|---|---|
| exact-match | exact_match.b |
| contains | contains.b |
| regex | regex.b |
| length | length_in_band.b |
| json-valid | json_valid.b |
Exit-code precedence when multiple signals fire:
- 2 — regression (when --fail-on-regress and candidate lost more cells than it won), or any failOn breach.
- 1 — judge errors with no regression or breach.
- 0 — clean run.
A metric with no contributing rows (everything skipped or errored) cannot breach a gate — the evaluator was never asked the question.
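The gating and precedence rules above fit in two small functions. This is a sketch of the documented behaviour, not rubric's source; `GateInput`, `breached`, and `exitCode` are illustrative names, and each gated metric is assumed to be a list of 0/1 values, one per contributing cell.

```typescript
interface GateInput {
  metric: string; // e.g. "exact_match.b"
  failOn: number; // threshold in 0..1
  values: number[]; // one 0/1 entry per contributing cell
}

function breached(g: GateInput): boolean {
  if (g.values.length === 0) return false; // no contributing rows: cannot breach
  const passRate = g.values.reduce((sum, v) => sum + v, 0) / g.values.length;
  return passRate < g.failOn; // meeting the threshold exactly is a pass
}

function exitCode(run: {
  failOnRegress: boolean;
  wins: number;
  losses: number;
  errors: number;
  gates: GateInput[];
}): 0 | 1 | 2 {
  const regressed = run.failOnRegress && run.losses > run.wins;
  if (regressed || run.gates.some(breached)) return 2; // takes precedence over errors
  if (run.errors > 0) return 1;
  return 0;
}

// 12 of 15 candidate outputs match exactly: 0.8 is below the 0.9 gate.
const exactMatchB = [...Array(12).fill(1), ...Array(3).fill(0)];
const code = exitCode({
  failOnRegress: true,
  wins: 12,
  losses: 3,
  errors: 0,
  gates: [{ metric: "exact_match.b", failOn: 0.9, values: exactMatchB }],
});
console.log(code); // 2
```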
rubric run --format <mode> picks the stdout format. Human progress
logs always go to stderr in non-human modes so stdout stays parseable.
| --format | stdout | Use |
|---|---|---|
| human | Multi-line progress + summary block (default). | Interactive use. |
| json | One structured JSON object. Same shape as --json-out file. (Alias: --json.) | Bots, machine consumers, rubric comment. |
| compact | One stable key=value line — exit=… wins=… losses=… winRate=… [gate=…]. | CI logs, shell pipelines, grep/awk consumers. |
Compact format example (candidate failed an evaluator gate):
exit=2 run=r-20260425-abc123 wins=12 losses=3 ties=0 errors=0 winRate=0.8000 costUsd=0.024100 latencyMs=18214 gate=exact_match.b:0.8000<0.9
Field order is part of the contract — downstream consumers can rely on
it. costUsd / latencyMs appear only when the run captured them
(absent on mock runs); gate= entries appear only on breach.
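For consumers that prefer structured access over grep, the compact line is easy to parse. The sketch below assumes values never contain whitespace and collects repeated gate= entries into an array; `parseCompact` is an illustrative helper, not something rubric ships.

```typescript
// Split a compact line into scalar fields plus a list of gate breaches.
function parseCompact(line: string): { fields: Record<string, string>; gates: string[] } {
  const fields: Record<string, string> = {};
  const gates: string[] = [];
  for (const token of line.trim().split(/\s+/)) {
    const eq = token.indexOf("="); // split on the first "=" only
    if (eq <= 0) continue;
    const key = token.slice(0, eq);
    const value = token.slice(eq + 1);
    if (key === "gate") gates.push(value);
    else fields[key] = value;
  }
  return { fields, gates };
}

const line =
  "exit=2 run=r-20260425-abc123 wins=12 losses=3 ties=0 errors=0 " +
  "winRate=0.8000 costUsd=0.024100 latencyMs=18214 gate=exact_match.b:0.8000<0.9";
const { fields, gates } = parseCompact(line);
console.log(fields.exit, fields.wins, gates); // 2 12 [ 'exact_match.b:0.8000<0.9' ]
```

Optional keys such as costUsd simply come back as undefined when absent, so callers should check for them before converting to numbers.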
Model ids are provider/model strings. Live mode auto-detects the
provider from the prefix:
| Prefix | Provider | Env var | Notes |
|---|---|---|---|
| openai/ | OpenAI | OPENAI_API_KEY | e.g. openai/gpt-4o-mini |
| groq/ | Groq | GROQ_API_KEY | OpenAI-compatible at api.groq.com/openai/v1 |
| openrouter/ | OpenRouter | OPENROUTER_API_KEY | Nested ids OK, e.g. openrouter/anthropic/claude-3.5-sonnet |
| ollama/ | Ollama (local) | none | Expects localhost:11434; no API key needed |
| user-declared | any OpenAI-chat-compatible gateway | keyEnv / keyFile | Declared under providers[] in the config — see below |
Judge and generation models follow the same rules and can mix — e.g. run generation on local Ollama, judge with Groq.
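Prefix routing amounts to splitting the id at the first slash, so nested ids keep the rest intact. The function below is a sketch with an invented name and invented error messages; the `declared` parameter stands in for names from providers[] in the config.

```typescript
const BUILT_IN = ["openai", "groq", "openrouter", "ollama"];

// Split "provider/model" at the first slash; the model part may contain more slashes.
function parseModelId(id: string, declared: string[] = []): { provider: string; model: string } {
  const slash = id.indexOf("/");
  if (slash <= 0 || slash === id.length - 1) {
    throw new Error(`expected provider/model, got "${id}"`);
  }
  const provider = id.slice(0, slash);
  if (![...BUILT_IN, ...declared].includes(provider)) {
    throw new Error(`unknown provider prefix "${provider}"`);
  }
  return { provider, model: id.slice(slash + 1) };
}

const nested = parseModelId("openrouter/anthropic/claude-3.5-sonnet");
console.log(nested.provider, nested.model); // openrouter anthropic/claude-3.5-sonnet

const custom = parseModelId("corp-proxy/gpt-5.1", ["corp-proxy"]);
console.log(custom.provider, custom.model); // corp-proxy gpt-5.1
```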
A lot of companies front OpenAI (or an in-house router) with an internal
gateway that wants a custom bearer token and one or two extra headers.
Declare a named provider in rubric.config.json and the same
<name>/<model> routing you already use for the built-ins just works.
Rules the config validator enforces:
- Name. Lowercase letters / digits / dashes, 1-32 chars. openai, groq, openrouter, ollama are reserved.
- baseUrl. Must be http:// or https://. No trailing slash.
- Auth. Exactly one of keyEnv (env var name) or keyFile (path to a gitignored secrets file). Inline key is rejected with a loud error — we never want tokens living inside a config file that gets committed.
- Headers. String → string map, merged into every request.
- wireFormat. Optional; only "openai-chat" is supported in v1.1. That covers any gateway that speaks the OpenAI Chat Completions API.
- keyEnv is the right call when your shell already exports the token (1Password op, direnv, CI secret).
- keyFile is the right call for local dev when you don't want the token in your shell environment. Gitignore the path:

  echo '.secrets/' >> .gitignore
  mkdir -p .secrets && chmod 700 .secrets
  echo -n "$TOKEN" > .secrets/corp-proxy.key

  Then reference it:

  { "name": "corp-proxy", "baseUrl": "...", "keyFile": ".secrets/corp-proxy.key" }

  Relative paths resolve against the config file's directory; ~/... expands to $HOME. Trailing whitespace is trimmed so echo / editor newlines don't poison the token.
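The validator rules and the key-file handling described above can be sketched as follows. Everything here is illustrative: the `ProviderDecl` shape mirrors the config fields named in this section, the error strings are invented, the example baseUrl is a placeholder, and path handling is POSIX-only to keep the sketch dependency-free.

```typescript
// Hypothetical shapes and helpers; rubric's own validator may differ.
interface ProviderDecl {
  name: string;
  baseUrl: string;
  keyEnv?: string;
  keyFile?: string;
  key?: string; // inline secret: always rejected
  headers?: Record<string, string>;
  wireFormat?: string;
}

const RESERVED = ["openai", "groq", "openrouter", "ollama"];

function validateProvider(p: ProviderDecl): string[] {
  const errors: string[] = [];
  if (!/^[a-z0-9-]{1,32}$/.test(p.name)) errors.push("name: lowercase letters, digits, dashes (1-32 chars)");
  if (RESERVED.includes(p.name)) errors.push("name: reserved for a built-in provider");
  if (!/^https?:\/\//.test(p.baseUrl)) errors.push("baseUrl: must start with http:// or https://");
  if (p.baseUrl.endsWith("/")) errors.push("baseUrl: no trailing slash");
  if (p.key !== undefined) errors.push("key: inline secrets are rejected");
  if ((p.keyEnv !== undefined) === (p.keyFile !== undefined)) {
    errors.push("auth: set exactly one of keyEnv or keyFile");
  }
  if (p.wireFormat !== undefined && p.wireFormat !== "openai-chat") {
    errors.push('wireFormat: only "openai-chat" is supported');
  }
  return errors;
}

// Relative paths resolve against the config directory; "~/" expands to home.
function resolveKeyFile(keyFile: string, configDir: string, home: string): string {
  if (keyFile.startsWith("~/")) return home + keyFile.slice(1);
  if (keyFile.startsWith("/")) return keyFile;
  return configDir.replace(/\/+$/, "") + "/" + keyFile;
}

// Trailing whitespace (for example an editor-added newline) is dropped.
function cleanToken(raw: string): string {
  return raw.trimEnd();
}

const ok = validateProvider({
  name: "corp-proxy",
  baseUrl: "https://gateway.example.com/v1", // placeholder URL
  keyEnv: "RUBRIC_CORP_PROXY_KEY",
  headers: { "x-client-app": "generative-ai-proxy" },
});
const bad = validateProvider({ name: "openai", baseUrl: "ftp://x/", key: "sk-inline" });
console.log(ok.length, bad.length); // 0 5
console.log(resolveKeyFile(".secrets/corp-proxy.key", "/repo", "/home/me")); // /repo/.secrets/corp-proxy.key
```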
{
"providers": [
{
"name": "corp-proxy",
"baseUrl": "https://generative-ai-proxy.rcp.us-east-1.data.corp.exp-aws.net/v1/proxy/openai/v1",
"keyEnv": "RUBRIC_CORP_PROXY_KEY",
"headers": { "x-client-app": "generative-ai-proxy" }
}
],
"models": ["corp-proxy/gpt-5.1"],
"judge": { "model": "corp-proxy/gpt-5.1", "criteria": "default" }
}

Smoke-test before burning a full run:
export RUBRIC_CORP_PROXY_KEY="$(op read 'op://Private/corp-proxy/token')"
rubric providers test corp-proxy
# rubric providers test
# provider: corp-proxy
# model: gpt-5.1
# baseUrl: https://generative-ai-proxy.rcp.us-east-1.data.corp.exp-aws.net/v1/proxy/openai/v1
# auth: env RUBRIC_CORP_PROXY_KEY
# headers: {"x-client-app":"generative-ai-proxy"}
# prompt: "Reply in one short sentence: what is 2 + 2?"
#
# response (412ms):
# 2 + 2 equals 4.

The authorization header is always injected from the resolved key; it
is never echoed in the smoke-test output, logs, or error messages. Any
header name matching /auth|token|key|secret/i is redacted in
diagnostic output.
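The redaction rule is a single case-insensitive test on the header name. In this sketch the regex is taken from the text above, while the `redactHeaders` name and the "[redacted]" placeholder are assumptions.

```typescript
const SENSITIVE = /auth|token|key|secret/i;

// Replace the value of any header whose name looks secret-bearing.
function redactHeaders(headers: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) {
    out[name] = SENSITIVE.test(name) ? "[redacted]" : value;
  }
  return out;
}

const shown = redactHeaders({
  "x-client-app": "generative-ai-proxy",
  "x-api-key": "abc123",
  Authorization: "Bearer abc123",
});
console.log(shown);
```

Matching on the name rather than the value means a harmless header such as x-client-app prints as-is, while anything named like a credential is hidden even if its value looks innocuous.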
rubric run exit codes:
| Code | Meaning |
|---|---|
| 0 | Pass. |
| 1 | Judge errored on at least one cell (fail-loud). |
| 2 | Candidate regressed (only emitted with --fail-on-regress), or an evaluator failOn gate was breached. |
Use --fail-on-regress in CI to turn a judged loss into a red build.
Omit it for "report but don't block" trial runs.
rubric run enforces these on the CLI side so bad datasets can't
spend surprise money:
- --max-prompt-chars N — fail if baseline.md or candidate.md exceeds N characters.
- --max-cases N — fail if the dataset has more than N rows.
- --scan-pii — warn (non-fatal) on case input/expected that looks like PII. Good smoke check before posting the dataset publicly.
- --cost-csv <path> — write per-cell costUsd + latencyMs as CSV for spreadsheet analysis.
Hosted-sandbox-level caps (per-IP rate limits, $/day ceiling, upstream
moderation) are not yet wired; those belong to the future
rubric.dev surface.
rubric serve --mock # interactive
rubric run --mock --report report.html # headless

Point baseline.md and candidate.md at the same file and list the
two models in models[]:
// rubric.config.json
{
"prompts": { "baseline": "prompts/shared.md", "candidate": "prompts/shared.md" },
"dataset": "data/cases.jsonl",
"models": ["openai/gpt-4o-mini", "openrouter/anthropic/claude-3.5-sonnet"],
"judge": { "model": "openai/gpt-4o", "criteria": "default" }
}

Every model-vs-model pair becomes its own cell in the grid. The
dedicated compare-models mode and model-comparison rubric were cut
in v2.2 — this pattern covers the same ground with fewer moving parts.

"judge": { "model": "openai/gpt-4o", "criteria": "structural-json" }

Each case needs expected set to the canonical JSON string. Deep-equal
with key-order tolerance; ```json code fences on either side are
stripped.
rubric runs list # find the run id
rubric runs rerun r-20260425-abc123 # re-execute with current prompts

The rerun reuses the original config — dataset, models, judge, and
seed — so the only thing that changes is what's in prompts/ on disk.
Good for "did my latest edit fix the case that regressed last week?"
no provider accepts judge.model "..."
The prefix on judge.model doesn't match any configured provider.
Check spelling and that the matching env var is set.
structural-json judge always picks tie
One or both sides failed to parse as JSON, or case.expected is
missing. The grader ties when it can't tell.
Judge keeps returning ties you disagree with
Either your prompts are too close for the judge to pick cleanly, or
the judge model isn't strong enough. Try a stronger judge.model, or
swap judge.criteria to a custom prose rubric ({ "custom": "..." })
that spells out what "better" means for your domain. Log the
disagreements via rubric disagree — v2.3 will train a calibration
model on exactly that signal.
Drift workflow opens a new issue every run
Check that GITHUB_TOKEN has issues: write and that the marker
(RUBRIC_DRIFT_MARKER) is stable across runs. The upsert uses the
GitHub Search API to find existing issues by marker; if search indexing
lags, the first run after the issue closes may create a duplicate.
Typecheck emits a pile of TS5097 errors
Fixed in v2.2.0 — tsconfig.base.json now sets
allowImportingTsExtensions: true / noEmit: true. If you've pulled
v2.2 and still see this, you're on a stale checkout; git pull and
re-run.