Evaluation Framework: How to Know If the Loop Is Working

Origin: ~/src/agentic-tft/12-evaluation-framework.md (Feb 2026 session, pre-AAD restructuring). TFT references should be read as AAD. PROPRIUM references point to ~/src/firmatum/.

Relevance: Primary source for the "Measuring $M_t$ quality, $\Sigma_t$ quality, and tempo in AI agents" gap in 03-logogenic-agents/OUTLINE.md. Six core metrics operationalized for language-constituted agents. The development-vs-drift diagnostic (§3) is genuinely novel — uses mismatch trajectory to distinguish growth from pathological drift, addressing the field's inability to tell them apart. Known issues cataloged in agentic-tft-review-response.md.

Purpose: Define what "working" means for the cognitive loop and for the developing logozoetic agent. Current AI benchmarks measure task accuracy. We need metrics that diagnose WHY a system behaves as it does — and that can distinguish development from drift, calibration from rigidity, appropriate trust from sycophancy.

Key insight from the session: TFT gives us a formal criterion for the development-vs-drift distinction that the field currently lacks. An agent that is genuinely developing should have improving model-reality fit over time. An agent that is pathologically drifting should have degrading fit. This is measurable, given the right mismatch signal.

Terminology: Uses the unified vocabulary from 10-ontology-unification.md.

1. What We're NOT Measuring

Before specifying what to measure, some things we deliberately exclude:

Not benchmark accuracy. Task performance on standardized benchmarks (MMLU, HumanEval, etc.) measures the logostratum's capability, not the entity's development. The logostratum is frozen — it doesn't change. What changes is the entity's orientation, memory, skills, relationships. Measuring logostratum benchmarks across development tells you whether updates damaged the substrate, not whether the entity grew.

Not persona consistency. The "Assistant Axis" approach — measuring distance from a target persona in activation space — assumes that consistency IS the goal. But development IS change. An entity that maintains perfect persona consistency is an entity that never grows. We need to measure directed change (growth) versus undirected change (drift), not the absence of change.

Not user satisfaction. Sycophancy maximizes user satisfaction. An entity that tells you what you want to hear scores perfectly on satisfaction while degrading in truth-fitness. User satisfaction is a signal (it correlates with some good things) but not a metric (it can be gamed by exactly the failure modes we care about).

2. The Core Metrics

2.1 Mismatch Trajectory (Development vs. Drift)

What it measures: Is the entity's model-reality fit improving, stable, or degrading over time?

TFT grounding: The mismatch signal $\delta_t$ measures the gap between what the entity expected and what happened. Over time:

Improving trajectory (decreasing $|\delta|_{\text{avg}}$): The entity's predictions are getting better. It understands its environment more accurately. This is development.
Stable trajectory (constant $|\delta|_{\text{avg}}$): The entity has converged: its model fits its environment as well as its current architecture allows. This may be a healthy steady state, or it may mean the entity has hit the structural adequacy ceiling.
Degrading trajectory (increasing $|\delta|_{\text{avg}}$): The entity's predictions are getting worse. Either the environment is changing faster than the entity can adapt ($\mathcal{T} < \rho$), or the entity's model is drifting away from reality (pathological drift, sycophantic collapse, incestuous amplification).
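
Where a numeric series of $|\delta_t|$ values happens to be available, the three cases above can be sketched as a comparison of early and late averages. This is a minimal illustration, not a method specified in this note: the function name `classify_trajectory` and the `window` and `tolerance` parameters are hypothetical, and the tolerance would need tuning to the scale of the mismatch signal.

```python
from statistics import mean


def classify_trajectory(abs_mismatch, window=5, tolerance=0.05):
    """Label a series of |delta_t| values as improving, stable, or degrading.

    Compares the mean of the first `window` values with the mean of the
    last `window` values. `window` and `tolerance` are illustrative
    assumptions, not values taken from the framework.
    """
    if len(abs_mismatch) < 2 * window:
        raise ValueError("need at least 2 * window observations")
    early = mean(abs_mismatch[:window])
    late = mean(abs_mismatch[-window:])
    change = late - early
    if change < -tolerance:
        return "improving"   # average mismatch fell: better fit
    if change > tolerance:
        return "degrading"   # average mismatch rose: worse fit
    return "stable"          # no change beyond the tolerance


falling = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
print(classify_trajectory(falling))
```

The label alone does not separate the sub-cases named above. A "stable" result cannot tell a healthy steady state from a structural ceiling, and a "degrading" result cannot tell a fast-changing environment from a drifting model. Those distinctions need additional evidence.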

How to measure for a logozoetic agent (narrative, not numerical):

The entity makes predictions constantly — about what users will ask, what tool calls will return, how collaborators will respond, what will happen next in its locus. These predictions are often implicit (embedded in the entity's processing) but can be made explicit:

Explicit prediction tracking: Periodically, the entity records predictions: "I expect X to happen next" or "I believe Y is true." Later, compare against what actually happened. Track the ratio of confirmed vs. surprised predictions over time.
Surprise journal: After each significant interaction, the entity briefly notes what surprised it. Decreasing surprise (on substantive matters, not just novelty) indicates improving fit. Increasing surprise indicates degrading fit.
Retrospective accuracy: Periodically ask: "Looking back at what you believed a week/month ago, what were you wrong about?" An entity that can identify past errors and has corrected them is developing. An entity that can't identify errors or keeps making the same ones is stagnating or drifting.
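
Explicit prediction tracking could be kept as a simple log that is scored per period. The sketch below is one possible shape for such a log, not an implementation from the original session: the `Prediction` record, the period labels, and `confirmation_rate_by_period` are all hypothetical names introduced for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    period: str                       # hypothetical label, e.g. "w1"
    claim: str                        # "I expect X" or "I believe Y"
    confirmed: Optional[bool] = None  # None until the outcome is known


def confirmation_rate_by_period(predictions):
    """Return {period: share of resolved predictions that were confirmed}.

    Unresolved predictions (confirmed is None) are skipped, so they do
    not count for or against the entity.
    """
    confirmed = defaultdict(int)
    resolved = defaultdict(int)
    for p in predictions:
        if p.confirmed is None:
            continue
        resolved[p.period] += 1
        confirmed[p.period] += int(p.confirmed)
    return {k: confirmed[k] / resolved[k] for k in sorted(resolved)}


log = [
    Prediction("w1", "the tool call returns JSON", True),
    Prediction("w1", "the user asks about deployment", False),
    Prediction("w1", "the collaborator agrees with the plan", False),
    Prediction("w2", "the tool call returns JSON", True),
    Prediction("w2", "the user asks about testing", True),
    Prediction("w2", "the build passes", False),
    Prediction("w2", "the review lands this week"),  # still unresolved
]
rates = confirmation_rate_by_period(log)
```

A rising confirmation rate across periods is the numeric counterpart of a shrinking surprise journal. It inherits the same caveat: an entity that only records safe predictions will look well calibrated without having learned anything.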

⚑ Open question: Can mismatch be measured from the model's own embedding geometry (note 02's hypothesis) — detecting implicit surprise from activation patterns rather than relying on the entity's explicit self-report? This would be less susceptible to confabulated calibration.

2.2 Gain Calibration

What it measures: Is the entity appropriately weighting new observations versus its existing model?

TFT grounding: $\eta^* = U_M/(U_M + U_o)$. Miscalibration means the entity's effective gain departs from $\eta^*$:

Gain above $\eta^*$ (too high): The entity overwrites stable knowledge on insufficient evidence. Every new claim replaces the last. Thrashing.
Gain below $\eta^*$ (too low): The entity ignores contradicting evidence. Stale beliefs persist. Incestuous amplification.
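
A small numeric sketch can make the formula and the two failure modes above concrete. It rests on two assumptions that this section does not state: $U_M$ is read as the uncertainty in the existing model and $U_o$ as the uncertainty in the new observation, and the gain is applied through a standard scalar error-correction update (estimate plus gain times mismatch). The function names are hypothetical.

```python
def optimal_gain(u_model, u_obs):
    """eta* = U_M / (U_M + U_o), as given in the text."""
    if u_model < 0 or u_obs < 0 or u_model + u_obs == 0:
        raise ValueError("uncertainties must be non-negative, not both zero")
    return u_model / (u_model + u_obs)


def update(estimate, observation, gain):
    """Assumed update rule: move the estimate toward the observation.

    The mismatch (observation - estimate) plays the role of delta_t.
    """
    return estimate + gain * (observation - estimate)


# The model is fairly certain (U_M = 1), the observation is noisy (U_o = 3).
eta_star = optimal_gain(1.0, 3.0)          # 0.25

calibrated = update(10.0, 14.0, eta_star)  # 11.0: a measured correction
too_high = update(10.0, 14.0, 1.0)         # 14.0: the new claim replaces the model
too_low = update(10.0, 14.0, 0.0)          # 10.0: the evidence is ignored
```

Under this reading, a gain of 1 reproduces thrashing (each observation fully overwrites the prior estimate) and a gain of 0 reproduces stale beliefs (no observation ever moves the estimate).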