rubric — user guide

A task-oriented walkthrough of the CLI and the three-pane serve UI. If you want the one-pager pitch, read README.md. This document assumes you've already got the repo cloned and bun or tsx/node on PATH.


Install

Until rubric is published to npm, run it from the repo:

git clone https://github.com/rubric/rubric
cd rubric
# Option 1 — link as a global `rubric`
npm link --workspace=packages/cli
# Option 2 — run directly via tsx (no install needed)
alias rubric='tsx packages/cli/src/bin.ts'
# Option 3 — single-file binary
cd packages/cli && bun run build:binary && ./dist/rubric --help

Verify:

rubric --version     # rubric 0.0.0
rubric --help

Core concept (60 seconds)

rubric compares two prompts (or two models) across a dataset using an LLM judge, then rolls the per-case verdicts into a win/loss/tie summary you can gate CI on. One cell = one (case × model × A-side × B-side) evaluation. A run is every cell evaluated end-to-end, concurrently.

dataset (JSONL)  →  for each case:
                      generate A = baseline(case)
                      generate B = candidate(case)
                      judge(A, B, case)   → "a" | "b" | "tie"
                    ─────────────────────────────
                    summary: wins / losses / ties / errors / win-rate
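
The roll-up on the last line of the diagram can be sketched in a few lines of Python. This is an illustration, not rubric's implementation: the `summarize` helper and the `"error"` verdict label are assumptions, and it assumes the candidate is the B side, so a `"b"` verdict counts as a win. The win rate is computed over decisive cells only (wins + losses), in line with the "of decisive" wording in the quickstart output later in this guide.

```python
from collections import Counter


def summarize(verdicts):
    """Roll per-cell verdicts ("a" | "b" | "tie" | "error") into a summary."""
    counts = Counter(verdicts)
    wins = counts["b"]       # assumption: the candidate is the B side
    losses = counts["a"]
    decisive = wins + losses
    return {
        "wins": wins,
        "losses": losses,
        "ties": counts["tie"],
        "errors": counts["error"],
        # Ties and errors are excluded from the denominator.
        "winRate": wins / decisive if decisive else None,
    }


print(summarize(["b", "b", "tie", "b", "b"]))
```

Returning `None` when there are no decisive cells is a design choice of the sketch; it avoids reporting a misleading 0% for an all-tie run.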

Because the judge is just another LLM, rubric also ships an override log: every time you disagree with a cell's verdict via rubric disagree, the correction is appended to a per-run overrides.jsonl. That log is the calibration corpus — v2.3 will train a small residual classifier on it to score the judge itself.


10-second tour — quickstart & init --wizard

Two zero-friction entry points before you commit to writing a real config.

rubric quickstart runs a full end-to-end grid against a hard-coded demo dataset using a deterministic mock provider and mock judge. No API keys, no files written, ~10 seconds. It's the fastest way to see what the output shape looks like.

rubric quickstart
# rubric quickstart — zero-config mock demo
#   5 cases × 1 model = 5 cells (mock provider + mock judge)
#   ...
# Summary:
#   wins: 4   losses: 0   ties: 1   errors: 0
#   winRate: 100.0% (of decisive 4)

rubric init --wizard --describe "<task>" scaffolds a real workspace and asks the judge model to draft baseline.md, candidate.md, and 10 input cases from a one-sentence task description. Every auto-generated case is tagged with "_autogenerated": true so reviewers know not to trust the verdict until the cases have been vetted.

rubric init --wizard \
  --describe "triage incoming customer support tickets by category and urgency"
# requires OPENAI_API_KEY — or pass --mock for a templated scaffold.

rubric init --wizard --mock \
  --describe "triage incoming customer support tickets"
# deterministic templates, no LLM call, useful for seeding before you
# wire a real key.

Autogenerated cases are a starting point, not ground truth — read them with skepticism and trim the obviously-off ones before you trust the verdict. The override log (rubric disagree) is the right place to record where you diverged from the judge.


Workflow A — iterate locally with serve

The fastest feedback loop. Zero API keys — --mock uses a deterministic stub provider + judge so you can see the UI light up end-to-end.

mkdir my-prompts && cd my-prompts
rubric init                      # scaffolds config + prompts/ + data/
rubric serve --mock              # → http://127.0.0.1:5174

What you get:

  • Left pane — Prompts. Tabs for baseline and candidate. Edit inline, ⌘S to save. A dot indicator tells you if the editor is clean or dirty.
  • Middle pane — Cases. Read-only list of the dataset, one row per case.
  • Right pane — Results. Summary bar (wins / losses / ties / errors / win-rate / total cost / wall-sum) and a grid with one row per cell. Click a row to expand: verdict banner with the judge's reason, both outputs side-by-side, ±label buttons for calibration.

Header controls:

  • mock mode — switch between the deterministic mock provider and your live providers. Leave it on for UI spelunking; turn it off when you want to spend tokens.
  • ▶ Run — kicks off a sweep via SSE so the grid fills cell-by-cell.

Turn mock mode off and set the relevant API-key env var to run for real:

export OPENAI_API_KEY=sk-...
rubric serve

File-watch iteration loop (rubric watch)

If you'd rather stay in your editor, rubric watch gives you the same inner loop without opening the UI:

rubric watch                       # re-evals on save; cached across saves
rubric watch --once                # one pass and exit (good for CI)
rubric watch --no-cache            # every iteration re-runs every cell

The persistent judge-call cache is keyed on prompt + case + model + criteria, so only the cells you actually touched spend tokens. Tail the stderr chatter to see which cells hit the cache vs. re-ran.
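
One way to picture the cache key is as a hash over the four inputs listed above. The sketch below shows the general technique only; it is not rubric's actual key derivation, and `cell_cache_key` and its field layout are hypothetical.

```python
import hashlib
import json


def cell_cache_key(prompt, case, model, criteria):
    """Hash the inputs that determine a judged cell (hypothetical layout)."""
    payload = json.dumps(
        {"prompt": prompt, "case": case, "model": model, "criteria": criteria},
        sort_keys=True,           # stable key order, so equal inputs hash equally
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


case = {"input": "Where is my refund?"}
before = cell_cache_key("Be brief.", case, "openai/gpt-4o-mini", "default")
after = cell_cache_key("Be brief and polite.", case, "openai/gpt-4o-mini", "default")
print(before == after)  # False: editing the prompt invalidates the cell
```

Under this scheme, cases whose inputs did not change keep their old key and are served from the cache.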


Workflow B — gate a pull request

This is the ship-critical flow. Wire rubric run --fail-on-regress into CI and attach the outputs to the PR:

rubric run \
  --config rubric.config.json \
  --fail-on-regress \
  --json-out rubric-run.json \
  --report  rubric-report.html

  • rubric-run.json — structured v1 run payload. Feed to rubric comment to render the PR comment.
  • rubric-report.html — self-contained per-cell HTML report. Good CI artifact.

Render + post the PR comment:

rubric comment \
  --from rubric-run.json \
  --report-url https://ci.example.com/.../rubric-report.html \
  --title "baseline.md vs candidate.md" > comment.md

Comments are idempotent — subsequent runs update the same comment via a hidden HTML marker instead of stacking.
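
The idempotent update can be sketched as "find the comment carrying the marker, otherwise create one". The marker string and the `plan_comment_upsert` helper below are hypothetical; the real marker text and the GitHub API calls are not shown in this guide.

```python
MARKER = "<!-- rubric-comment -->"  # hypothetical marker; the real one may differ


def plan_comment_upsert(existing_comments, body):
    """Return ("update", id, body) if a marked comment exists, else ("create", None, body)."""
    marked_body = f"{MARKER}\n{body}"
    for comment in existing_comments:
        if MARKER in comment["body"]:
            return ("update", comment["id"], marked_body)
    return ("create", None, marked_body)


comments = [
    {"id": 1, "body": "LGTM"},
    {"id": 2, "body": f"{MARKER}\nwins: 3 losses: 1"},
]
action, comment_id, _ = plan_comment_upsert(comments, "wins: 4 losses: 0")
print(action, comment_id)  # update 2
```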

Or use the composite Action (wraps all of the above):

# .github/workflows/rubric.yml
on:
  pull_request:
    paths: ['prompts/**', 'data/**', 'rubric.config.json']
jobs:
  eval:
    runs-on: ubuntu-latest
    permissions: { pull-requests: write, contents: read }
    steps:
      - uses: actions/checkout@v4
      - uses: rubric/rubric@v2
        with:
          fail-on-regress: true
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Workflow C — seed a dataset from a CSV

Drop a CSV export (Google Sheets, Excel, Notion, Linear, whatever) into data/cases.jsonl without hand-writing a line:

rubric seed --from-csv tickets.csv --out data/cases.jsonl

The adapter expects a header row with at minimum an input column. An expected column is honored verbatim; any other columns (category, priority, ticket-id, ...) are stuffed into metadata so your spreadsheet notes survive the import. Header matching is case-insensitive and trim-forgiving.
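
The column mapping described above can be sketched as follows. This is a minimal illustration, not the adapter's source: `csv_to_cases` is a hypothetical helper, and lowercasing the metadata keys is an assumption.

```python
import csv
import io
import json


def csv_to_cases(csv_text):
    """Map CSV rows to case dicts: input (required), expected, rest -> metadata."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = [h.strip().lower() for h in next(reader)]  # case-insensitive, trimmed
    if "input" not in headers:
        raise ValueError("CSV needs an 'input' column")
    cases = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(headers, row))
        case = {"input": record.pop("input")}
        if "expected" in record:
            case["expected"] = record.pop("expected")  # kept verbatim
        if record:
            case["metadata"] = record  # every other column survives here
        cases.append(case)
    return cases


sample = " Input ,Expected,Priority\nReset my password,account,high\n"
for case in csv_to_cases(sample):
    print(json.dumps(case))  # one JSONL line per case
```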

PII heuristic: the seed command runs a regex sweep over imported cases and prints warnings to stderr for anything that looks like an email, phone, IP, SSN, or credit-card fragment. Review before publishing.
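
A regex sweep of this kind might look like the sketch below. The patterns are deliberately simple and are assumptions, not the ones rubric ships; expect both false positives and misses.

```python
import re

# Hypothetical, intentionally loose patterns for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_pii(text):
    """Return the sorted names of every pattern that matches the text."""
    return sorted(name for name, rx in PII_PATTERNS.items() if rx.search(text))


print(scan_pii("Contact jane@example.com from 10.0.0.12"))  # ['email', 'ipv4']
```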

v2.1 also shipped --from-langfuse / --from-helicone / --from-langsmith / --from-openai-logs / --from-synthetic. Those adapters were cut in v2.2 — consolidate your pipeline on CSV, which every tool can export.


Workflow D — disagree with the judge

When the judge gets a cell wrong, override it from the CLI. Each override is appended to the run's overrides.jsonl and surfaced in the PR comment footer.

# Pick the cell you want to override. Cell refs are: case-N/provider/model.
rubric disagree case-3/openai/gpt-4o-mini \
  --verdict A \
  --reason  "judge missed the factual error in B"

# Cancel the most recent override on that cell.
rubric disagree case-3/openai/gpt-4o-mini --undo

The override log is append-only — nothing is ever mutated in place, so you can git diff your history of disagreements.
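
An append-only log with undo can be modeled by writing the undo as its own record and replaying the log to get the effective state. The record fields below (`cell`, `verdict`, `reason`, `undo`) are assumptions for illustration; the actual overrides.jsonl schema is not documented here.

```python
import json


def effective_overrides(jsonl_text):
    """Replay an append-only log; an undo record pops the cell's latest override."""
    stacks = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        stack = stacks.setdefault(record["cell"], [])
        if record.get("undo"):
            if stack:
                stack.pop()
        else:
            stack.append(record)
    return {cell: stack[-1]["verdict"] for cell, stack in stacks.items() if stack}


log = "\n".join([
    '{"cell": "case-3/openai/gpt-4o-mini", "verdict": "A", "reason": "factual error in B"}',
    '{"cell": "case-5/openai/gpt-4o-mini", "verdict": "tie", "reason": "equivalent"}',
    '{"cell": "case-5/openai/gpt-4o-mini", "undo": true}',
])
print(effective_overrides(log))  # {'case-3/openai/gpt-4o-mini': 'A'}
```

Nothing is deleted from the log itself, so the full history of disagreements stays diffable.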

In-UI override buttons in rubric serve are a v2.3 follow-up — the v2.1 per-side + good / - bad calibration widget was removed along with rubric calibrate, and the A / B / tie replacement hasn't shipped yet. Drive disagreements from the CLI for v2.2.

Why this replaces rubric calibrate. In v2.1 calibration was a separate labeling exercise with its own JSON file and its own UI. It turned out almost nobody did it — labeling outside a real workflow is too much friction. The override log is the same signal (judge vs. human) captured as a byproduct of actually using the tool. v2.3 will train a residual classifier on the override log and fold its output into the badge.


Workflow E — scheduled drift detection

Production models move under your feet. Drift detection is the cheap early-warning system: cron the existing eval against a frozen baseline + dataset, and open a GitHub issue when the candidate starts regressing.

Drop examples/drift-detector.yml into .github/workflows/ and flip the cron to suit your release cadence. The workflow:

  1. Runs rubric run --fail-on-regress.
  2. If exit code = 2, renders the standard PR-comment body.
  3. Upserts a single GitHub issue per RUBRIC_DRIFT_MARKER. Same marker across runs = same issue. Closed issues get reopened on a fresh regression. No duplicate backlog noise.

Treat drift detection as a best-effort signal, not an SLA.


Run registry (~/.rubric/runs/)

Every rubric run appends to a local run registry at ~/.rubric/runs/<id>/ (manifest + append-only cells.jsonl + overrides.jsonl). You can inspect past runs without re-executing anything:

rubric runs list                           # last 20 runs, newest first
rubric runs show <id>                      # manifest + summary
rubric runs status <id>                    # "<status>  <done>/<total>"
rubric runs diff <a> <b>                   # summary delta between two runs
rubric runs rerun <id> [--mock]            # re-execute with current prompts

Override the root with --registry-root <dir> on any subcommand if you want a per-project registry instead of the per-user default.


The config file

rubric.config.json — committed to the repo, diff'd on PRs.

{
  "$schema": "https://rubric.dev/schema/v1.json",
  "prompts": {
    "baseline":  "prompts/baseline.md",
    "candidate": "prompts/candidate.md"
  },
  "dataset":  "data/cases.jsonl",
  "models":   ["openai/gpt-4o-mini"],
  "judge":    { "model": "openai/gpt-4o", "criteria": "default" },
  "concurrency": 4,
  "mode":     "compare-prompts"
}

Field notes:

  • prompts.baseline / prompts.candidate — filesystem paths relative to the config file's directory.
  • models — array. Each model id is provider/model. Every model gets its own cell in the grid.
  • judge.model — usually a stronger model than what you're grading.
  • judge.criteria — see next section.
  • concurrency — parallel cells. 4 is a sane default; raise it if your provider's rate limits allow.
  • mode — "compare-prompts" (v2.2's only supported mode). The compare-models mode was cut; compare two models by pointing prompts.baseline and prompts.candidate at the same file and listing the two models in models[].

Rubrics

judge.criteria accepts:

| Value | What it does |
| --- | --- |
| "default" | Pairwise LLM judge with a general "more correct, concise, on-task" rubric. |
| "structural-json" | Deterministic, no LLM call. Parses A and B as JSON and picks the side that deep-equals case.expected. |
| { "custom": "prose rubric…" } | Inline custom rubric text fed to the LLM judge. |
| { "file": "rubric.md" } | Team preset: load the rubric text from a file, path relative to the config's directory. |

The structural-json rubric is perfect for tool-call / structured-output evals: it's free, reproducible, and fails loud when the parser can't extract JSON from one side.


Evaluators (non-LLM metrics)

Evaluators run alongside the pairwise judge on every cell. They're deterministic, free, and useful for the checks that don't need model opinion — exact match, required substrings, JSON validity, length bands. Results land on CellResult.evaluations and roll up into the summary as per-metric win rates.

Opt in by adding an evaluators block to rubric.config.json:

{
  "evaluators": [
    { "type": "exact-match" },
    { "type": "contains", "needle": "SELECT" },
    { "type": "regex", "pattern": "^ERROR:", "flags": "m" },
    { "type": "length", "min": 1, "max": 500 },
    { "type": "json-valid" }
  ]
}

Catalog:

| type | What it measures |
| --- | --- |
| exact-match | Whether the output equals case.expected (or metadata.<field> via "field": "metadata.gold"). trim + caseSensitive knobs. |
| contains | Whether the output contains the literal needle. |
| regex | Whether the output matches the pattern. flags follow JS RegExp (gim…). |
| length | Emits length.a / length.b always, and length_in_band.a/.b when min or max is set. |
| json-valid | Whether the output parses as JSON. Accepts ```json … ``` code fences. |

Every evaluator emits a .a and .b metric so the per-side rollup lines up with the judge's A/B framing. You can stack them: evaluators do not conflict with the pairwise judge — they're additive signal, not a replacement.
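
To make the per-side metric shape concrete, here is a small sketch of two evaluators emitting `.a` / `.b` metrics. The function names are hypothetical, the metric keys follow the names used in this guide (`contains`, `length`, `length_in_band`), and encoding pass/fail as 1.0/0.0 is an assumption.

```python
def eval_contains(a, b, needle):
    """Emit contains.a / contains.b as 1.0 (found) or 0.0 (missing)."""
    return {"contains.a": float(needle in a), "contains.b": float(needle in b)}


def eval_length(a, b, min_len=None, max_len=None):
    """Emit length.a/.b always; add length_in_band.a/.b when a bound is set."""
    metrics = {"length.a": len(a), "length.b": len(b)}
    if min_len is not None or max_len is not None:
        lo = 0 if min_len is None else min_len
        hi = float("inf") if max_len is None else max_len
        metrics["length_in_band.a"] = float(lo <= len(a) <= hi)
        metrics["length_in_band.b"] = float(lo <= len(b) <= hi)
    return metrics


a_out, b_out = "select 1", "SELECT 1 FROM users"
metrics = {**eval_contains(a_out, b_out, "SELECT"), **eval_length(a_out, b_out, 1, 10)}
print(metrics)
```

Because each evaluator returns its own keys, stacking several of them is just a dictionary merge per cell.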

Gating CI on evaluator pass rate (failOn)

Every evaluator accepts an optional failOn: 0..1 threshold. When set, the candidate (B-side) pass rate for that evaluator's primary metric must meet or exceed the threshold, or rubric run exits 2 — the same exit code as --fail-on-regress. Evaluators without failOn are report-only.

{
  "evaluators": [
    { "type": "json-valid",   "failOn": 1.0  },
    { "type": "exact-match",  "failOn": 0.9  },
    { "type": "length", "min": 1, "max": 500, "failOn": 0.95 }
  ]
}

The primary metric for each type is always the candidate side — that's what CI cares about (the new prompt crossing a quality bar):

| type | Gated metric |
| --- | --- |
| exact-match | exact_match.b |
| contains | contains.b |
| regex | regex.b |
| length | length_in_band.b |
| json-valid | json_valid.b |

Exit-code precedence when multiple signals fire:

  1. 2 — regression (when --fail-on-regress and candidate lost more cells than it won), or any failOn breach.
  2. 1 — judge errors with no regression or breach.
  3. 0 — clean run.

A metric with no contributing rows (everything skipped or errored) cannot breach a gate — the evaluator was never asked the question.
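
The gate and the precedence rules above can be combined into one small function. This is a sketch under stated assumptions, not rubric's source: `gate_breached` and `exit_code` are hypothetical names, and per-row results are modeled as 1.0 (pass), 0.0 (fail), or None (skipped or errored).

```python
def gate_breached(rows, fail_on):
    """True when the B-side pass rate falls below the failOn threshold."""
    scored = [r for r in rows if r is not None]
    if not scored:             # no contributing rows: the gate cannot breach
        return False
    return sum(scored) / len(scored) < fail_on


def exit_code(wins, losses, errors, fail_on_regress, breaches):
    """Apply the precedence: 2 (regression or breach) > 1 (judge errors) > 0."""
    regressed = fail_on_regress and losses > wins
    if regressed or breaches:
        return 2
    if errors > 0:
        return 1
    return 0


exact_match_b = [1.0] * 12 + [0.0] * 3           # pass rate 0.8
breaches = ["exact_match.b"] if gate_breached(exact_match_b, 0.9) else []
print(exit_code(wins=12, losses=3, errors=0, fail_on_regress=True, breaches=breaches))  # 2
```

Here the candidate won more cells than it lost, yet the run still exits 2 because the exact-match gate was breached.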

Output formats

rubric run --format <mode> picks the stdout format. Human progress logs always go to stderr in non-human modes so stdout stays parseable.

| --format | stdout | Use |
| --- | --- | --- |
| human | Multi-line progress + summary block (default). | Interactive use. |
| json | One structured JSON object. Same shape as --json-out file. (Alias: --json.) | Bots, machine consumers, rubric comment. |
| compact | One stable key=value line — exit=… wins=… losses=… winRate=… [gate=…]. | CI logs, shell pipelines, grep/awk consumers. |

Compact format example (candidate failed an evaluator gate):

exit=2 run=r-20260425-abc123 wins=12 losses=3 ties=0 errors=0 winRate=0.8000 costUsd=0.024100 latencyMs=18214 gate=exact_match.b:0.8000<0.9

Field order is part of the contract — downstream consumers can rely on it. costUsd / latencyMs appear only when the run captured them (absent on mock runs); gate= entries appear only on breach.
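
Because the line is plain `key=value` pairs separated by spaces, a consumer can parse it with a split. The sketch below is one hypothetical consumer-side parser, not part of rubric. It assumes values never contain spaces and collects `gate=` entries into a list; whether multiple breaches appear as repeated `gate=` keys is an assumption.

```python
def parse_compact(line):
    """Parse a compact summary line into a dict; gate entries become a list."""
    fields, gates = {}, []
    for token in line.split():
        key, _, value = token.partition("=")
        if key == "gate":
            gates.append(value)
        else:
            fields[key] = value
    fields["gates"] = gates
    return fields


line = ("exit=2 run=r-20260425-abc123 wins=12 losses=3 ties=0 errors=0 "
        "winRate=0.8000 costUsd=0.024100 latencyMs=18214 gate=exact_match.b:0.8000<0.9")
summary = parse_compact(line)
print(summary["exit"], summary["winRate"], summary["gates"])
```

Values are left as strings so the parser stays lossless; convert `wins`, `winRate`, and friends at the call site as needed.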


Providers

Model ids are provider/model strings. Live mode auto-detects the provider from the prefix:

| Prefix | Provider | Env var | Notes |
| --- | --- | --- | --- |
| openai/ | OpenAI | OPENAI_API_KEY | e.g. openai/gpt-4o-mini |
| groq/ | Groq | GROQ_API_KEY | OpenAI-compatible at api.groq.com/openai/v1 |
| openrouter/ | OpenRouter | OPENROUTER_API_KEY | Nested ids OK, e.g. openrouter/anthropic/claude-3.5-sonnet |
| ollama/ | Ollama (local) | none | Expects localhost:11434; no API key needed |
| user-declared | any OpenAI-chat-compatible gateway | keyEnv / keyFile | Declared under providers[] in the config; see below |

Judge and generation models follow the same rules and can mix — e.g. run generation on local Ollama, judge with Groq.
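
Prefix routing amounts to splitting the id at the first slash, which is why nested ids keep working. A minimal sketch, assuming a hypothetical `split_model_id` helper (rubric's own resolver is not shown in this guide):

```python
def split_model_id(model_id):
    """Split "provider/model" at the first slash; the rest may contain slashes."""
    provider, sep, model = model_id.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected provider/model, got {model_id!r}")
    return provider, model


print(split_model_id("openrouter/anthropic/claude-3.5-sonnet"))
```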

Corporate / self-hosted proxies

A lot of companies front OpenAI (or an in-house router) with an internal gateway that wants a custom bearer token and one or two extra headers. Declare a named provider in rubric.config.json and the same <name>/<model> routing you already use for the built-ins just works.

// rubric.config.json
{
  "prompts": { "baseline": "prompts/baseline.md", "candidate": "prompts/candidate.md" },
  "dataset": "data/cases.jsonl",
  "models":  ["corp-proxy/gpt-5.1"],
  "judge":   { "model": "corp-proxy/gpt-5.1", "criteria": "default" },
  "providers": [
    {
      "name":     "corp-proxy",
      "baseUrl":  "https://gateway.example.internal/v1/proxy/openai/v1",
      "keyEnv":   "CORP_PROXY_TOKEN",
      "headers":  { "x-client-app": "rubric" }
    }
  ]
}

Rules the config validator enforces:

  • Name. Lowercase letters / digits / dashes, 1-32 chars. openai, groq, openrouter, ollama are reserved.
  • baseUrl. Must be http:// or https://. No trailing slash.
  • Auth. Exactly one of keyEnv (env var name) or keyFile (path to a gitignored secrets file). Inline key is rejected with a loud error — we never want tokens living inside a config file that gets committed.
  • Headers. String → string map, merged into every request.
  • wireFormat. Optional; only "openai-chat" is supported in v1.1. That covers any gateway that speaks the OpenAI Chat Completions API.
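
The rules above translate fairly directly into a validator. The sketch below is illustrative only: `validate_provider` is a hypothetical function, the error strings are made up, and the inline-key field is assumed to be called `key`.

```python
import re

RESERVED = {"openai", "groq", "openrouter", "ollama"}
NAME_RE = re.compile(r"[a-z0-9-]{1,32}")


def validate_provider(provider):
    """Return a list of rule violations for one providers[] entry."""
    errors = []
    name = provider.get("name", "")
    if not NAME_RE.fullmatch(name) or name in RESERVED:
        errors.append("name: must be 1-32 lowercase letters/digits/dashes and not reserved")
    base_url = provider.get("baseUrl", "")
    if not base_url.startswith(("http://", "https://")) or base_url.endswith("/"):
        errors.append("baseUrl: must be http(s):// with no trailing slash")
    if "key" in provider:
        errors.append("auth: inline key is rejected; use keyEnv or keyFile")
    if ("keyEnv" in provider) == ("keyFile" in provider):
        errors.append("auth: set exactly one of keyEnv or keyFile")
    headers = provider.get("headers", {})
    if not all(isinstance(k, str) and isinstance(v, str) for k, v in headers.items()):
        errors.append("headers: must be a string-to-string map")
    if provider.get("wireFormat", "openai-chat") != "openai-chat":
        errors.append('wireFormat: only "openai-chat" is supported')
    return errors


ok = {
    "name": "corp-proxy",
    "baseUrl": "https://gateway.example.internal/v1/proxy/openai/v1",
    "keyEnv": "CORP_PROXY_TOKEN",
    "headers": {"x-client-app": "rubric"},
}
print(validate_provider(ok))  # []
```

Collecting every violation instead of stopping at the first one is a choice of this sketch; it lets a user fix a config in one pass.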

Picking keyEnv vs keyFile

  • keyEnv is the right call when your shell already exports the token (1Password op, direnv, CI secret).

  • keyFile is the right call for local dev when you don't want the token in your shell environment. Gitignore the path:

    echo '.secrets/' >> .gitignore
    mkdir -p .secrets && chmod 700 .secrets
    printf '%s' "$TOKEN" > .secrets/corp-proxy.key

    Then reference it:

    { "name": "corp-proxy", "baseUrl": "...", "keyFile": ".secrets/corp-proxy.key" }

    Relative paths resolve against the config file's directory; ~/... expands to $HOME. Trailing whitespace is trimmed so echo / editor newlines don't poison the token.

Worked example — Expedia's internal proxy

{
  "providers": [
    {
      "name":    "corp-proxy",
      "baseUrl": "https://generative-ai-proxy.rcp.us-east-1.data.corp.exp-aws.net/v1/proxy/openai/v1",
      "keyEnv":  "RUBRIC_CORP_PROXY_KEY",
      "headers": { "x-client-app": "generative-ai-proxy" }
    }
  ],
  "models": ["corp-proxy/gpt-5.1"],
  "judge":  { "model": "corp-proxy/gpt-5.1", "criteria": "default" }
}

Smoke-test before burning a full run:

export RUBRIC_CORP_PROXY_KEY="$(op read 'op://Private/corp-proxy/token')"
rubric providers test corp-proxy
# rubric providers test
#   provider: corp-proxy
#   model:    gpt-5.1
#   baseUrl:  https://generative-ai-proxy.rcp.us-east-1.data.corp.exp-aws.net/v1/proxy/openai/v1
#   auth:     env RUBRIC_CORP_PROXY_KEY
#   headers:  {"x-client-app":"generative-ai-proxy"}
#   prompt:   "Reply in one short sentence: what is 2 + 2?"
#
#   response (412ms):
#     2 + 2 equals 4.

The authorization header is always injected from the resolved key; it is never echoed in the smoke-test output, logs, or error messages. Any header name matching /auth|token|key|secret/i is redacted in diagnostic output.
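
The redaction rule can be illustrated by applying the same pattern to a header map. `redact_headers` and the `[redacted]` placeholder are hypothetical; only the regex comes from the text above.

```python
import re

SENSITIVE = re.compile(r"auth|token|key|secret", re.IGNORECASE)


def redact_headers(headers):
    """Replace the value of any header whose name looks sensitive."""
    return {
        name: "[redacted]" if SENSITIVE.search(name) else value
        for name, value in headers.items()
    }


print(redact_headers({"Authorization": "Bearer abc", "x-client-app": "rubric", "X-Api-Key": "k1"}))
```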


Exit codes and CI gating

rubric run exit codes:

| Code | Meaning |
| --- | --- |
| 0 | Pass. |
| 1 | Judge errored on at least one cell (fail-loud). |
| 2 | Candidate regressed (only with --fail-on-regress), or an evaluator failOn gate was breached. |

Use --fail-on-regress in CI to turn a judged loss into a red build. Omit it for "report but don't block" trial runs.


Cost & safety caps

rubric run enforces these on the CLI side so bad datasets can't spend surprise money:

  • --max-prompt-chars N — fail if baseline.md or candidate.md exceed N characters.
  • --max-cases N — fail if the dataset has more than N rows.
  • --scan-pii — warn (non-fatal) on case input/expected that looks like PII. Good smoke check before posting the dataset publicly.
  • --cost-csv <path> — write per-cell costUsd + latencyMs as CSV for spreadsheet analysis.

Hosted-sandbox-level caps (per-IP rate limits, $/day ceiling, upstream moderation) are not yet wired; those belong to the future rubric.dev surface.


Common recipes

"I want a quick sanity-check without spending tokens."

rubric serve --mock                         # interactive
rubric run   --mock --report report.html    # headless

"I want to compare two models at a fixed prompt."

Point prompts.baseline and prompts.candidate at the same file and list the two models in models[]:

// rubric.config.json
{
  "prompts":  { "baseline": "prompts/shared.md", "candidate": "prompts/shared.md" },
  "dataset":  "data/cases.jsonl",
  "models":   ["openai/gpt-4o-mini", "openrouter/anthropic/claude-3.5-sonnet"],
  "judge":    { "model": "openai/gpt-4o", "criteria": "default" }
}

Every model-vs-model pair becomes its own cell in the grid. The dedicated compare-models mode and model-comparison rubric were cut in v2.2 — this pattern covers the same ground with fewer moving parts.

"My eval produces JSON — grade it without an LLM."

"judge": { "model": "openai/gpt-4o", "criteria": "structural-json" }

Each case needs expected set to the canonical JSON string. Deep-equal with key-order tolerance; ```json code fences on either side are stripped.
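
A deterministic grader along these lines could be sketched as below. It is an illustration, not rubric's code: `structural_json_verdict` is a hypothetical name, and returning "tie" when neither side or both sides match follows the troubleshooting note later in this guide.

```python
import json
import re

TICKS = "`" * 3  # a literal code-fence marker, built indirectly
FENCE = re.compile(r"^" + TICKS + r"(?:json)?\s*(.*?)\s*" + TICKS + r"$", re.DOTALL)


def parse_side(text):
    """Strip an optional json code fence and parse; return (ok, value)."""
    stripped = text.strip()
    match = FENCE.match(stripped)
    if match:
        stripped = match.group(1)
    try:
        return True, json.loads(stripped)
    except json.JSONDecodeError:
        return False, None


def structural_json_verdict(a, b, expected):
    """Pick the side whose parsed JSON equals the parsed expected value."""
    target = json.loads(expected)
    ok_a, val_a = parse_side(a)
    ok_b, val_b = parse_side(b)
    a_hit = ok_a and val_a == target    # dict equality ignores key order
    b_hit = ok_b and val_b == target
    if a_hit == b_hit:
        return "tie"
    return "a" if a_hit else "b"


fenced = TICKS + 'json\n{"b": 2, "a": 1}\n' + TICKS
print(structural_json_verdict('{"a": 1, "b": 3}', fenced, '{"a": 1, "b": 2}'))  # b
```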

"I want to replay an old run against my current prompts."

rubric runs list                        # find the run id
rubric runs rerun r-20260425-abc123     # re-execute with current prompts

The rerun reuses the original config — dataset, models, judge, and seed — so the only thing that changes is what's in prompts/ on disk. Good for "did my latest edit fix the case that regressed last week?"


Troubleshooting

no provider accepts judge.model "..."
The prefix on judge.model doesn't match any configured provider. Check spelling and that the matching env var is set.

structural-json judge always picks tie
One or both sides failed to parse as JSON, or case.expected is missing. The grader ties when it can't tell.

Judge keeps returning ties you disagree with
Either your prompts are too close for the judge to pick cleanly, or the judge model isn't strong enough. Try a stronger judge.model, or swap judge.criteria to a custom prose rubric ({ "custom": "..." }) that spells out what "better" means for your domain. Log the disagreements via rubric disagree — v2.3 will train a calibration model on exactly that signal.

Drift workflow opens a new issue every run
Check that GITHUB_TOKEN has issues: write and that the marker (RUBRIC_DRIFT_MARKER) is stable across runs. The upsert uses the GitHub Search API to find existing issues by marker; if search indexing lags, the first run after the issue closes may create a duplicate.

Typecheck emits a pile of TS5097 errors
Fixed in v2.2.0 — tsconfig.base.json now sets allowImportingTsExtensions: true / noEmit: true. If you've pulled v2.2 and still see this, you're on a stale checkout; git pull and re-run.