
Experiments as first-class evidence source for intent authorization #192

@marcus-sa

Description


Context

Evidence-backed intent authorization (DISCUSS wave in docs/feature/intent-evidence/discuss/) requires evidence_refs on intents pointing to graph records. Experiments (#188) produce high-trust evidence through a governed lifecycle. This issue connects the two: concluded experiments and their outputs become a privileged evidence class for intent authorization.

Related issues:

Problem

The evidence verification pipeline treats all evidence_refs equally — a standalone observation created by any agent has the same weight as a decision that emerged from a month-long experiment with human-approved budget, success criteria, and concluded results. This is wrong.

A supply chain team's approval of a new vendor sourcing strategy, backed by a 2-week procurement experiment (hypothesis tested, budget approved, results measured), should carry more weight than an agent's ad-hoc observation that "supplier X looks cheaper." The evidence verification pipeline has no way to distinguish the two today.

Design

Evidence quality tiers

The intent authorizer assigns trust weight based on evidence provenance:

| Tier | Source | Trust weight | Why |
| --- | --- | --- | --- |
| Tier 1 | Concluded experiment output (produced edge from experiment with status: concluded) | Highest | Human-approved hypothesis, bounded budget, explicit success criteria, governed lifecycle |
| Tier 2 | Confirmed decision / resolved observation (independent authorship) | High | Went through confirmation by a different identity than the requester |
| Tier 3 | Provisional decision / open observation | Medium | Exists in the graph but not yet validated by an independent party |
| Tier 4 | Standalone entity with same author as intent requester | Low | No independence, no external validation |
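The tier mapping above can be sketched as a small classifier. The Evidence record shape, its field names, and the ordering of the same-author check are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Evidence:
    """Illustrative evidence record; field names are assumptions."""
    author: str
    status: str  # e.g. "confirmed", "resolved", "provisional", "open"
    experiment_status: Optional[str] = None  # producing experiment's status, if any

def trust_tier(ev: Evidence, requester: str) -> int:
    """Map one evidence_ref to a trust tier (1 = highest weight)."""
    if ev.experiment_status in ("concluded", "absorbed"):
        return 1  # output of a governed, concluded experiment
    if ev.author == requester:
        return 4  # no independence from the intent requester
    if ev.status in ("confirmed", "resolved"):
        return 2  # validated by an independent identity
    return 3      # in the graph, not yet independently validated
```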

Experiment-to-evidence provenance chain

Experiment (proposed → approved → running → concluded → absorbed)
  │
  ├── produced → Decision D-1 (confirmed)     ← Tier 1 evidence
  ├── produced → Observation O-3 (resolved)    ← Tier 1 evidence
  ├── produced → Learning L-5 (active)         ← Tier 1 evidence (informational)
  │
  └── The experiment record itself              ← Tier 1 evidence (proves structured inquiry happened)

When an intent references a decision as evidence, the verification pipeline checks:

  1. Does this decision have a produced edge from an experiment?
  2. Is that experiment concluded or absorbed?
  3. Was the experiment approved (human gate passed)?

If yes → Tier 1 trust weight. If no → fall through to Tier 2/3 based on decision status and authorship.
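The three checks above can be collapsed into a single predicate. The lookup from decision id to its producing experiment record is a hypothetical stand-in for the graph query, not the real API:

```python
from typing import Optional

def is_tier1(decision_id: str, produced_by: dict) -> bool:
    """True when a referenced decision qualifies for Tier 1 trust weight.

    produced_by maps decision id -> producing experiment record; the key
    is absent when no produced edge exists. Shapes are illustrative."""
    exp: Optional[dict] = produced_by.get(decision_id)
    if exp is None:                                     # 1. produced edge exists?
        return False
    if exp["status"] not in ("concluded", "absorbed"):  # 2. results available?
        return False
    return bool(exp.get("approved"))                    # 3. human gate passed?
```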

Verification pipeline additions

Add to the deterministic pre-LLM verification:

  1. Experiment provenance check — for each evidence_ref, query SELECT <-produced<-experiment WHERE status IN ['concluded', 'absorbed']. If found, tag as experiment-backed.
  2. Experiment status gate — reject evidence from experiments still in proposed or running status (results not yet available).
  3. Budget compliance check — flag evidence from experiments that exceeded their approved budget (trust discount, not rejection).
  4. Absorption check — evidence from concluded (not yet absorbed) experiments gets a soft warning: results exist but haven't been formally converted to decisions/learnings yet.
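A minimal sketch of these four deterministic checks for a single evidence_ref. The field names (status, budget, spent) and flag strings are assumptions:

```python
def verify_experiment_provenance(exp):
    """Pre-LLM checks on one evidence_ref's producing experiment record
    (None when no produced edge exists). Illustrative shapes only."""
    out = {"experiment_backed": False, "flags": []}
    if exp is None:
        return out
    if exp["status"] in ("proposed", "running"):
        out["flags"].append("reject:results-not-yet-available")  # status gate
        return out
    out["experiment_backed"] = True                              # provenance check
    if exp.get("spent", 0) > exp.get("budget", float("inf")):
        out["flags"].append("discount:budget-exceeded")          # trust discount
    if exp["status"] == "concluded":                             # not yet absorbed
        out["flags"].append("warn:results-not-absorbed")
    return out
```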

Risk router integration

Experiment-backed evidence lowers effective risk:

  • Intent with 3 Tier 1 evidence refs (all from concluded experiments) → risk score discount of 15-20 points
  • Intent mixing Tier 1 and Tier 3 evidence → standard risk scoring
  • Intent with only Tier 4 evidence → risk score premium

This means well-evidenced intents from concluded experiments are more likely to auto-approve, while poorly-evidenced intents face higher scrutiny. The system rewards structured inquiry.
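A sketch of the tier-based adjustment; the concrete discount (18, within the 15-20 range above) and premium size (15) are assumptions:

```python
def adjusted_risk(base: int, tiers: list) -> int:
    """Apply the evidence-tier discount/premium to a base risk score."""
    if len(tiers) >= 3 and all(t == 1 for t in tiers):
        return max(0, base - 18)  # all evidence from concluded experiments
    if tiers and all(t == 4 for t in tiers):
        return base + 15          # only same-author, unvalidated evidence
    return base                   # mixed evidence: standard scoring
```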

LLM evaluator context

When the LLM evaluator runs (high-risk intents), include experiment context:

  • Experiment hypothesis and success criteria
  • Whether results confirmed or rejected the hypothesis
  • Budget utilization (within bounds = trustworthy)
  • Time from experiment start to conclusion (rushed experiments are less trustworthy)
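The four context fields above could be assembled roughly like this; the experiment record shape and key names are assumptions:

```python
from datetime import datetime

def llm_experiment_context(exp: dict) -> dict:
    """Build the experiment context passed to the LLM evaluator prompt."""
    return {
        "hypothesis": exp["hypothesis"],
        "success_criteria": exp["success_criteria"],
        "hypothesis_confirmed": exp["results_confirmed"],
        "budget_utilization": exp["spent"] / exp["budget"],   # within bounds = trustworthy
        "duration_days": (exp["concluded_at"] - exp["started_at"]).days,
    }
```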

Observer integration

New Observer scan patterns:

  • Evidence without experiment: High-risk intent approved with no experiment-backed evidence → suggest running an experiment first
  • Experiment results unused: Concluded experiment with decisions/observations that have never been referenced as evidence → the knowledge exists but isn't being applied
  • Repeated evidence patterns: Same evidence refs used across many intents → suggest formalizing as a policy rather than re-verifying each time
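The second scan pattern ("experiment results unused") amounts to a set-difference check; the input shapes here are illustrative, not the Observer's real interface:

```python
def unused_experiment_results(produced_by_experiment: dict, cited: set) -> list:
    """Flag concluded experiments whose produced records were never cited.

    produced_by_experiment maps experiment id -> set of produced record
    ids; cited is the set of record ids referenced by any intent."""
    return sorted(exp_id
                  for exp_id, produced in produced_by_experiment.items()
                  if produced and not (produced & cited))
```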

Examples

Vendor sourcing decision

  1. Procurement team proposes experiment: "Test whether Supplier B can meet SLA within 10% cost reduction"
  2. Human approves with 2-week budget and success criteria
  3. Experiment runs → tasks execute → observations logged → experiment concluded with results
  4. Agent creates intent: "Switch primary supplier to Supplier B for component X"
  5. Intent evidence_refs: experiment record + produced decision ("Supplier B met SLA in trial") + produced observation ("Cost reduction confirmed at 12%")
  6. Verification pipeline: all Tier 1 (experiment-backed, human-approved, concluded) → risk discount applied → auto-approve

Contrast: without experiment

  1. Agent creates observation: "Supplier B seems cheaper based on public pricing"
  2. Agent creates intent: "Switch primary supplier to Supplier B for component X"
  3. Intent evidence_refs: single observation by same agent
  4. Verification pipeline: Tier 4 (same author, no independence) → risk premium → veto window or rejection

Dependencies

Phase

After #188 experiments are implemented and evidence-backed intents are in soft/hard enforcement mode.
