## Context
Evidence-backed intent authorization (DISCUSS wave in `docs/feature/intent-evidence/discuss/`) requires `evidence_refs` on intents pointing to graph records. Experiments (#188) produce high-trust evidence through a governed lifecycle. This issue connects the two: concluded experiments and their outputs become a privileged evidence class for intent authorization.
Related issues:

- Evidence-backed intent authorization DISCUSS wave (`docs/feature/intent-evidence/`)
## Problem
The evidence verification pipeline treats all `evidence_refs` equally: a standalone observation created by any agent has the same weight as a decision that emerged from a month-long experiment with human-approved budget, success criteria, and concluded results. This is wrong.

Evidence from a 2-week procurement experiment (hypothesis tested, budget approved, results measured) backing a supply chain team's new vendor sourcing strategy should carry more weight than an agent's ad-hoc observation that "supplier X looks cheaper." The verification pipeline has no way to distinguish the two today.
## Design

### Evidence quality tiers

The intent authorizer assigns trust weight based on evidence provenance:
| Tier | Source | Trust weight | Why |
|------|--------|--------------|-----|
| Tier 1 | Concluded experiment output (`produced` edge from experiment with `status: concluded`) | Highest | Human-approved hypothesis, bounded budget, explicit success criteria, governed lifecycle |
| Tier 2 | Confirmed decision / resolved observation (independent authorship) | High | Went through confirmation by a different identity than the requester |
| Tier 3 | Provisional decision / open observation | Medium | Exists in the graph but not yet validated by an independent party |
| Tier 4 | Standalone entity with same author as intent requester | Low | No independence, no external validation |
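The tier fall-through can be sketched as a single classifier. Field names here (`experiment`, `status`, `confirmed_by`, `author`) are illustrative placeholders for the real graph schema, not the pipeline's actual field names:

```python
from enum import IntEnum


class EvidenceTier(IntEnum):
    EXPERIMENT_BACKED = 1       # Tier 1: concluded, human-approved experiment output
    INDEPENDENT_CONFIRMED = 2   # Tier 2: confirmed/resolved by a different identity
    PROVISIONAL = 3             # Tier 3: in the graph, not yet independently validated
    SAME_AUTHOR = 4             # Tier 4: standalone record by the requester itself


def assign_tier(evidence: dict, requester: str) -> EvidenceTier:
    # Tier 1: produced by a concluded/absorbed experiment that passed the human gate.
    exp = evidence.get("experiment")
    if exp and exp.get("status") in ("concluded", "absorbed") and exp.get("approved"):
        return EvidenceTier.EXPERIMENT_BACKED
    # Tier 2: confirmation came from an identity other than the requester.
    if (evidence.get("status") in ("confirmed", "resolved")
            and evidence.get("confirmed_by") not in (None, requester)):
        return EvidenceTier.INDEPENDENT_CONFIRMED
    # Tier 4: same author as the intent requester, no external validation.
    if evidence.get("author") == requester and not evidence.get("confirmed_by"):
        return EvidenceTier.SAME_AUTHOR
    # Tier 3: everything else awaiting independent validation.
    return EvidenceTier.PROVISIONAL
```

The ordering matters: the experiment check runs first so that an experiment-produced record never falls through to a lower tier on authorship grounds.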
### Experiment-to-evidence provenance chain
Experiment (proposed → approved → running → concluded → absorbed)
│
├── produced → Decision D-1 (confirmed) ← Tier 1 evidence
├── produced → Observation O-3 (resolved) ← Tier 1 evidence
├── produced → Learning L-5 (active) ← Tier 1 evidence (informational)
│
└── The experiment record itself ← Tier 1 evidence (proves structured inquiry happened)
When an intent references a decision as evidence, the verification pipeline checks:

- Does this decision have a `produced` edge from an experiment?
- Is that experiment `concluded` or `absorbed`?
- Was the experiment `approved` (human gate passed)?

If yes → Tier 1 trust weight. If no → fall through to Tier 2/3 based on decision status and authorship.
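The three questions above can be sketched as one traversal. The graph client and its `query` signature are assumptions, standing in for whatever SurrealQL-style interface the pipeline actually uses:

```python
def is_tier1(db, evidence_id: str) -> bool:
    """Return True when the evidence was produced by a concluded or absorbed,
    human-approved experiment. `db` is a hypothetical graph client."""
    rows = db.query(
        "SELECT <-produced<-experiment AS exp FROM $id",  # walk the produced edge backwards
        {"id": evidence_id},
    )
    for row in rows:
        for exp in row.get("exp", []):
            # Status gate and human-approval gate, in that order.
            if exp.get("status") in ("concluded", "absorbed") and exp.get("approved"):
                return True
    return False
```

Anything that fails this predicate falls through to the Tier 2/3 authorship checks rather than being rejected outright.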
### Verification pipeline additions

Add to the deterministic pre-LLM verification:

- Experiment provenance check — for each `evidence_ref`, query `SELECT <-produced<-experiment WHERE status IN ['concluded', 'absorbed']`. If found, tag as experiment-backed.
- Experiment status gate — reject evidence from experiments still in `proposed` or `running` status (results not yet available).
- Budget compliance check — flag evidence from experiments that exceeded their approved budget (trust discount, not rejection).
- Absorption check — evidence from `concluded` (not yet `absorbed`) experiments gets a soft warning: results exist but haven't been formally converted to decisions/learnings yet.
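The four checks above can be sketched as one deterministic pass per `evidence_ref`. Field names (`status`, `spent`, `budget`) and the result shape are illustrative, not the pipeline's real schema:

```python
def pre_llm_checks(evidence: dict) -> dict:
    """Deterministic pre-LLM verification for one evidence_ref (sketch)."""
    result = {"tags": [], "warnings": [], "rejected": False}
    exp = evidence.get("experiment")
    if not exp:
        return result  # no experiment provenance; Tier 2-4 handling applies
    status = exp.get("status")
    # Status gate: results from proposed/running experiments are not yet available.
    if status in ("proposed", "running"):
        result["rejected"] = True
        return result
    # Provenance check: concluded/absorbed experiments tag the evidence.
    if status in ("concluded", "absorbed"):
        result["tags"].append("experiment-backed")
    # Budget compliance: overspend is a trust discount, not a rejection.
    if exp.get("spent", 0) > exp.get("budget", float("inf")):
        result["warnings"].append("budget-exceeded")
    # Absorption check: concluded-but-not-absorbed gets a soft warning.
    if status == "concluded":
        result["warnings"].append("not-yet-absorbed")
    return result
```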
### Risk router integration
Experiment-backed evidence lowers effective risk:
- Intent with 3 Tier 1 evidence refs (all from concluded experiments) → risk score discount of 15-20 points
- Intent mixing Tier 1 and Tier 3 evidence → standard risk scoring
- Intent with only Tier 4 evidence → risk score premium
This means well-evidenced intents from concluded experiments are more likely to auto-approve, while poorly-evidenced intents face higher scrutiny. The system rewards structured inquiry.
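A minimal sketch of that adjustment, assuming a numeric risk score: the 20-point discount sits inside the 15-20 range named above, and the size of the Tier 4 premium is an assumption:

```python
def adjusted_risk(base_score: int, tiers: list[int]) -> int:
    """Tier-aware risk adjustment (illustrative thresholds)."""
    # All-Tier-1 evidence from concluded experiments: apply the discount.
    if len(tiers) >= 3 and all(t == 1 for t in tiers):
        return max(0, base_score - 20)
    # Only same-author standalone evidence: apply a premium.
    if tiers and all(t == 4 for t in tiers):
        return base_score + 20
    # Mixed tiers: standard risk scoring, unchanged.
    return base_score
```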
### LLM evaluator context
When the LLM evaluator runs (high-risk intents), include experiment context:
- Experiment hypothesis and success criteria
- Whether results confirmed or rejected the hypothesis
- Budget utilization (within bounds = trustworthy)
- Time from experiment start to conclusion (rushed experiments are less trustworthy)
### Observer integration
New Observer scan patterns:
- Evidence without experiment: High-risk intent approved with no experiment-backed evidence → suggest running an experiment first
- Experiment results unused: Concluded experiment with decisions/observations that have never been referenced as evidence → the knowledge exists but isn't being applied
- Repeated evidence patterns: Same evidence refs used across many intents → suggest formalizing as a policy rather than re-verifying each time
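The "experiment results unused" pattern can be sketched as a scan over experiments and intents; the record shapes (`produced` output lists, `evidence_refs`) are illustrative:

```python
def unused_experiment_results(experiments: list[dict], intents: list[dict]) -> list[str]:
    """Observer scan (sketch): concluded experiments whose produced outputs
    were never cited as evidence by any intent."""
    cited = {ref for intent in intents for ref in intent.get("evidence_refs", [])}
    return [
        exp["id"]
        for exp in experiments
        if exp.get("status") == "concluded"
        # Flag when none of the experiment's outputs ever appear as evidence.
        and not any(out in cited for out in exp.get("produced", []))
    ]
```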
## Examples
### Vendor sourcing decision
- Procurement team proposes experiment: "Test whether Supplier B can meet SLA within 10% cost reduction"
- Human approves with 2-week budget and success criteria
- Experiment runs → tasks execute → observations logged → experiment concluded with results
- Agent creates intent: "Switch primary supplier to Supplier B for component X"
- Intent `evidence_refs`: experiment record + produced decision ("Supplier B met SLA in trial") + produced observation ("Cost reduction confirmed at 12%")
- Verification pipeline: all Tier 1 (experiment-backed, human-approved, concluded) → risk discount applied → auto-approve
### Contrast: without experiment
- Agent creates observation: "Supplier B seems cheaper based on public pricing"
- Agent creates intent: "Switch primary supplier to Supplier B for component X"
- Intent `evidence_refs`: single observation by same agent
- Verification pipeline: Tier 4 (same author, no independence) → risk premium → veto window or rejection
## Dependencies

### Phase
After #188 experiments are implemented and evidence-backed intents are in soft/hard enforcement mode.