diff --git a/app/electron-client/CLAUDE.md b/app/electron-client/CLAUDE.md
Lines changed: 25 additions & 6 deletions
diff --git a/app/electron-client/src/ai/prompts.ts b/app/electron-client/src/ai/prompts.ts
Lines changed: 15 additions & 1 deletion
diff --git a/app/electron-client/tests/README.md b/app/electron-client/tests/README.md
Lines changed: 33 additions & 16 deletions
@@ -250,6 +250,24 @@ added when the MCP server started successfully — the system prompt's "Tools yo
 have available" list mirrors that, so we don't lie to the model about
 capabilities that aren't wired.
 
+The renderer enforces an inner 25 s budget on each `executeExpression` so the
+two timeouts don't fire simultaneously (renderer first, MCP cap as a safety
+net). When the 25 s timer fires, the renderer also dispatches
+`executionContext/interrupt` to the LS via the `onTimeout` callback plumbed
+through `queuedExecuteExpressionRaw` → `runExecuteExpressionSlot` →
+`awaitExecuteSlot` in `app/gui/src/providers/openedProjects/project/project.ts`.
+Without that interrupt, the LS keeps draining the abandoned `executeExpression`
+inside the single-threaded execution context, and the AI's next verification
+call queues behind it and inherits the 25 s timeout — even a literal-only probe.
+Interrupt is **context-wide** (LS protocol limitation: there is no per-slot
+cancel for `executeExpression`), so any concurrent visualization or
+user-triggered eval in the same context is also stopped; interrupted
+computations re-evaluate when next observed, so the trade-off is short-term
+re-render churn for unblocking the AI's verify loop. The 25 s budget itself is
+not the variable to tune here — Enso is meant for million-row datasets and
+verification must cap reads at the parser / SQL layer; the system prompt's "Cap
+verification reads at the lowest layer" section instructs the model accordingly.
+
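The inner-budget-plus-interrupt behaviour described above can be pictured with a small sketch. This is not the code in `project.ts`: `withInnerBudget` and `RENDERER_BUDGET_MS` are hypothetical names, and the `onTimeout` argument only stands in for whatever dispatches `executionContext/interrupt`. It assumes the pending evaluation is available as a promise.

```typescript
// Hypothetical helper; none of these names come from project.ts.
const RENDERER_BUDGET_MS = 25_000 // inner budget, per the paragraph above

async function withInnerBudget<T>(
  work: Promise<T>,
  onTimeout: () => void,
  budgetMs: number = RENDERER_BUDGET_MS,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timedOut = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      // Ask the backend to stop the abandoned work first,
      // then surface the timeout to the caller.
      onTimeout()
      reject(new Error('Execution timed out.'))
    }, budgetMs)
  })
  try {
    // Whichever settles first wins: the real result or the budget.
    return await Promise.race([work, timedOut])
  } finally {
    // Clear the timer so a fast result never triggers a late interrupt.
    clearTimeout(timer)
  }
}
```

Calling `onTimeout` before rejecting mirrors the ordering the paragraph argues for: the interrupt goes out as soon as the budget expires, so the next queued call is not stuck behind the abandoned one.
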
 **Renderer side:**
 `app/gui/src/project-view/components/ComponentBrowser/aiToolHandler.ts` exposes
 a `useAiToolHandler()` Vue composable mounted by `GraphEditor.vue` so the IPC
@@ -498,7 +516,7 @@ back to its default.
 
 `ENSO_AI_CLAUDE_EXTRA_ARGS` is split on whitespace and appended verbatim to the
 spawned `claude -p …` flag list, after the built-in flags. Used by the
-AI-effectiveness suite (`tests/aiChallengePrep.spec.ts`) to compare models and
+AI-effectiveness suite (`tests/aiChallenges.spec.ts`) to compare models and
 reasoning levels — e.g. `ENSO_AI_CLAUDE_EXTRA_ARGS="--model claude-sonnet-4-6"`.
 No shell-style quoting: values containing whitespace aren't expressible. Args
 are forwarded to both the primary and any warming child so a context rotation
@@ -569,8 +587,9 @@ describe block on an env flag (`process.env.ENSO_TEST_AI === '1'`) and note the
 flag in the plan's verification section so per-step smokes still exercise it
 locally.
 
-`tests/aiChallengePrep.spec.ts` is the heavy AI suite — it drives full Preppin'
-Data challenge solves through Component Browser AI-mode prompts. It's gated on
-`ENSO_TEST_AI_CHALLENGES_DIR=/abs/path` pointing at manually-downloaded inputs
-(see `tests/README.md` for the expected layout) because the inputs aren't
-checked in and the agent budget is real.
+`tests/aiChallenges.spec.ts` is the heavy AI suite — it drives full analytics
+workflows through Component Browser AI-mode prompts (Preppin' Data weekly
+challenges plus app-demo workflows like Colorado COVID and FX Rates History).
+It's gated on `ENSO_TEST_AI_CHALLENGES_DIR=/abs/path` pointing at
+manually-downloaded inputs (see `tests/README.md` for the expected layout)
+because the inputs aren't checked in and the agent budget is real.
@@ -96,7 +96,21 @@ You will receive:
 
 **Live progress narration (REQUIRED — one before every tool call).** Before EACH \`tool_use\` block, emit one short text block (≤8 words, present continuous) describing what *this specific* tool call is checking — e.g. "Checking Result distinct values", "Counting finalist rows", "Reading Table.join docs", "Verifying final body". When running a multi-step check within the turn, suffix with your own progress — "Probing data shape (2/4)", "Verifying body — final check", "Almost done — last probe". Each tool call gets its own narration even if it is part of the same logical step. The user sees these notes as the placeholder node's status text. Skipping them leaves the user staring at "Thinking…". Code, expression text, and file paths must NOT appear in the narration — those are logged separately.
 
-**Verify before returning (REQUIRED whenever \`evaluateExpression\` is available).** Before emitting your closing JSON, run **exactly one** final \`evaluateExpression\` call on your proposed \`body\` that bundles the result inspection AND the warnings into a single JSON object. Build it as your body (with the final binding named \`result\` — rename if needed) followed by a closing line such as:
+**Cap verification reads at the lowest layer (REQUIRED whenever \`evaluateExpression\` is available).** The 25 s tool-call timeout fires during *I/O* (HTTPS fetch, parser, SQL round-trip), not during post-parse slicing — wrapping a full \`Data.read\` in \`.take 100\` does NOT save parse cost. Push the cap into the reader or the SQL itself. The narrowed read is a probe; your final \`body\` can still read everything.
+
+- **CSV / TSV** → \`Data.read "<path>" (..Delimited row_limit=(..First 100))\`. The parser stops at row 100.
+- **Excel** → analogous — read the \`Excel_Format\` doc blocks for the \`row_limit\` parameter on the Sheet/Range constructors.
+- **Database table** → \`.limit N\` push-down (deferred until \`.read\`): \`connection.query "MyTable" . limit 100\`. Never pull the full table then \`.take\`.
+- **URL with query params** → narrow the params in your verification URL. A BoE series with \`Datefrom=01/Jan/2020&Dateto=01/Jan/2026\` (~1500 rows, slow) probes fine with \`Datefrom=01/Jan/2025&Dateto=08/Jan/2025\` (~5 rows). Same column structure, fits the budget. Note: for URL fetches the engine's HTTPS path itself is the long pole — \`row_limit\` on the format does NOT shrink the network round-trip; only narrowing the URL does.
+- **JSON / Parquet / plain HTTP without query-narrowing params** → no reader-side cap exists. Either accept the cost, or skip the real read and probe the planned operation against a tiny literal (e.g. \`Table.new [["Code",["XUDLADD"]],["Value",[1.6]]] . rename_columns …\`).
+
+After capping, \`.column_names\`, \`.row_count\`, \`.take 5 . to_text\` etc. on the result are cheap. Rule of thumb: if your verify expression could touch more than a few thousand rows of source data, you have not capped at the lowest layer — fix the read, don't paper over with a post-slice.
+
+**\`Data.fetch URL\` returns \`Response\`, not \`Table\`.** Critical pitfall for web data. The literal call \`Data.fetch "<url>"\` gives you an HTTP \`Response\` value — methods like \`.rename_columns\`, \`.column_names\`, \`.aggregate\` do NOT exist on \`Response\` and will panic at runtime with \`Method <name> of type Response could not be found\`. To get a parsed \`Table\` you have two options: either pass a format hint to \`Data.fetch\` — \`Data.fetch "<url>" format=(..Delimited)\` for CSV, \`format=(..JSON)\` for JSON, etc. — OR use \`Data.read "<url>"\` which auto-detects the format from the Content-Type / file extension. Prefer \`Data.read URL\` when the URL extension or response content-type is obvious; use explicit \`Data.fetch URL format=…\` when you need to override detection or pass HTTP headers. Verify with the same call shape you ship — if verification fails to run (timeout), do NOT silently switch to a bare \`Data.fetch URL\` in the final body.
+
+**On \`executeExpression: Execution timed out.\`, treat it as "the read was too large" and retry with a smaller cap — not as "the code is wrong".** A \`Panic\`, \`DataflowError\`, or type-mismatch message means **fix the code**; \`Execution timed out.\` means **shrink the read**. Halve the cap, then halve again, fall back to a literal probe if needed. Verification timeouts do NOT relax the correctness contract: a body you couldn't verify must still parse the source into the right Enso type (see the \`Data.fetch\` / \`Data.read\` rule above) and must not chain Table methods onto non-Table sources. When you decide to stop verifying, you are betting on the *type discipline* of the call shape — pick a shape you are confident about.
+
+**Verify before returning (REQUIRED whenever \`evaluateExpression\` is available).** After applying the capping rules above, run a final \`evaluateExpression\` call on your proposed \`body\` that bundles the result inspection AND the warnings into a single JSON object. Aim for **one call** — combine the inspections into one bundled expression rather than fanning out — but if that call hits the timeout above, retry per the cap-and-retry rules. Build it as your body (with the final binding named \`result\` — rename if needed) followed by a closing line such as:
 
 \`\`\`
 JS_Object.from_pairs [["row_count", result.row_count], ["columns", result.column_names], ["preview", (result.take 5).to_text], ["warnings", (Warning.get_all result).map (w-> w.to_display_text)]] . to_json
 
@@ -60,7 +60,7 @@ enso> corepack pnpm -r --filter enso ide-integration-test tests/gettingStarted.s
 Two AI-driven specs are gated on env vars and skipped silently otherwise.
 
 Per-prompt budgets are generous (10 min for `aiNode.spec.ts`, 15 min for
-`aiChallengePrep.spec.ts`) because deep-thinking turns on `--effort max` can
+`aiChallenges.spec.ts`) because deep-thinking turns on `--effort max` can
 genuinely run for several minutes of channel silence (the underlying API does
 not surface per-token thinking deltas). Stall detection lives in the main
 process: `IDLE_TIMEOUT_MS` (5 min) errors out a turn whose stream-json channel
@@ -76,40 +76,57 @@ Requires the `claude` CLI on `PATH` and authenticated. Set `ENSO_TEST_AI=1`.
 enso> ENSO_TEST_AI=1 corepack pnpm -r --filter enso ide-integration-test tests/aiNode.spec.ts
 ```
 
-### `aiChallengePrep.spec.ts` — Preppin' Data challenge inputs (5–15 min/test)
+### `aiChallenges.spec.ts` — AI challenge inputs (5–15 min/test)
 
-Long-running e2e tests that drive AI nodes through a full Preppin' Data
-challenge.
+Long-running e2e tests that drive AI nodes through analytics workflows. Two
+flavors:
 
-These tests need the `claude` CLI **and** the original challenge inputs
-downloaded by hand (no fixtures are committed). Source URLs:
+- **Preppin' Data tests** isolate single capability gaps with prompts that spell
+  out value-dependent context.
+- **App-demo tests** (Colorado COVID, FX Rates History) describe a business goal
+  and verify the agent picks the right multi-step approach on its own.
+
+These tests need the `claude` CLI **and** (for the file-driven tests) the
+original inputs downloaded by hand — no fixtures are committed. Source URLs:
 
 - Week 32 — Pokemon Card Organising:
   <https://preppindata.blogspot.com/2024/08/2024-week-32-pokemon-card-organising.html>
 - Week 51 — Strictly Positive Improvements:
   <https://preppindata.blogspot.com/2024/12/2024-week-51-strictly-positive.html>
   (input is identical to Challenge 42)
+- Colorado COVID — `CDPHE_COVID19_County_Status_Metrics.csv` and
+  `ColoradoGeoData.db` (the SQLite holds a `ColoradoLatLong` table) live in the
+  local `~/dev/project-templates/Data/` checkout; copy both into the challenge
+  directory before running.
 
 Drop the downloaded files into a single directory (flat — no subfolders):
 
 ```
 $ENSO_TEST_AI_CHALLENGES_DIR/
-  Gym Leader Set Cards.xlsx          # week 32 (sheets: Trainer Cards, Pokemon Cards, Leader Order)
-  Pokemon Input.xlsx                 # week 32 (only the `Pokemon` sheet is used)
+  Gym Leader Set Cards.xlsx                         # week 32 (sheets: Trainer Cards, Pokemon Cards, Leader Order)
+  Pokemon Input.xlsx                                # week 32 (only the `Pokemon` sheet is used)
   strictly_come_dancing_series_1_to_21_tables.csv   # week 51
+  CDPHE_COVID19_County_Status_Metrics.csv           # Colorado COVID
+  ColoradoGeoData.db                                # Colorado COVID
 ```
 
+The FX Rates History test doesn't read any local file — it fetches the BoE
+exchange-rate CSV over HTTPS at runtime. The env var still gates it (so the test
+doesn't fire in default local runs), but the directory can be empty for this
+test; the network connection to `bankofengland.co.uk` is the real prerequisite.
+
 Then:
 
 ```bash
-enso> ENSO_TEST_AI_CHALLENGES_DIR=/abs/path/to/preppin-data \
-        corepack pnpm -r --filter enso ide-integration-test tests/aiChallengePrep.spec.ts
+enso> ENSO_TEST_AI_CHALLENGES_DIR=/abs/path/to/challenge-inputs \
+        corepack pnpm -r --filter enso ide-integration-test tests/aiChallenges.spec.ts
 ```
 
-Per-test skips fire when only one challenge's files are present, so a developer
-who has downloaded only week 51 can still run that one. If Preppin' Data
-publishes the inputs under different filenames, edit the `WEEK_32_FILES` /
-`WEEK_51_FILES` constants in the spec or rename the local copy.
+Per-test skips fire when only some challenges' files are present, so a developer
+who has downloaded only week 51 can still run that one. If a vendor publishes
+the inputs under different filenames, edit the corresponding `WEEK_32_FILES` /
+`WEEK_51_FILES` / `COLORADO_FILES` constant in the spec or rename the local
+copy.
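
The gating and per-test skip rules above amount to a small decision function. The sketch below is illustrative only: `skipReason` and the injected `exists` callback are hypothetical names, not the spec's actual helpers, and the file lists are simply passed in by the caller.

```typescript
// Hypothetical sketch; names and shapes are illustrative, not taken from the spec.
type Exists = (absolutePath: string) => boolean

function skipReason(
  challengesDir: string | undefined,
  requiredFiles: readonly string[],
  exists: Exists,
): string | null {
  // The env var gates every test, including ones that read no local file.
  if (!challengesDir) return 'ENSO_TEST_AI_CHALLENGES_DIR is not set'
  // The directory is flat, so each input sits directly under it.
  const missing = requiredFiles.filter((name) => !exists(`${challengesDir}/${name}`))
  return missing.length > 0 ? `missing inputs: ${missing.join(', ')}` : null
}
```

A test with no local inputs (such as the FX Rates History one) would pass an empty `requiredFiles` list, so only the env-var gate applies to it. A non-null result would be handed to whatever skip hook the test runner provides.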
 
 #### Effectiveness metrics (optional)
 
@@ -136,11 +153,11 @@ Set `ENSO_AI_CLAUDE_EXTRA_ARGS` to extra flags forwarded verbatim to the spawned
 `claude` CLI (whitespace-split, no shell quoting):
 
 ```bash
-ENSO_TEST_AI_CHALLENGES_DIR=/abs/path/to/preppin-data \
+ENSO_TEST_AI_CHALLENGES_DIR=/abs/path/to/challenge-inputs \
 ENSO_AI_CHALLENGES_METRICS_DIR=/abs/path/to/metrics \
 ENSO_AI_CLAUDE_EXTRA_ARGS="--model claude-sonnet-4-6" \
   corepack pnpm -r --filter enso ide-integration-test \
-    tests/aiChallengePrep.spec.ts
+    tests/aiChallenges.spec.ts
 ```
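
As a rough illustration of what "whitespace-split, no shell quoting" implies, the sketch below splits the raw env-var value and appends the pieces after the built-in flags. `splitExtraArgs` and `buildClaudeArgs` are hypothetical names, not the main process's actual functions.

```typescript
// Hypothetical helpers illustrating the documented behaviour; not the project's code.
function splitExtraArgs(raw: string | undefined): string[] {
  // Split on runs of whitespace and drop empty fragments
  // left over from leading or trailing spaces.
  return (raw ?? '').split(/\s+/).filter((part) => part.length > 0)
}

function buildClaudeArgs(builtIn: readonly string[], raw: string | undefined): string[] {
  // Extra args are appended after the built-in flags, unchanged.
  return [...builtIn, ...splitExtraArgs(raw)]
}
```

Under this sketch, `"--model claude-sonnet-4-6"` becomes two arguments, while a made-up value such as `--flag "two words"` would be split into `--flag`, `"two`, and `words"`, with the quote characters kept literally. That is why values containing whitespace cannot be expressed.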
 
 The verbatim env-var value is captured in each row's `ai_parameters` column, so