Commit 4ae25f0

Ai verify improvements (#15015)
* aiChallenges: add Colorado COVID + FX Rates History capability tests. Add more tests of AI capabilities and, in this change, specify how verification should work when operating on large datasets.
1 parent d31ec0f commit 4ae25f0

8 files changed

Lines changed: 352 additions & 54 deletions

app/electron-client/CLAUDE.md

Lines changed: 25 additions & 6 deletions
@@ -250,6 +250,24 @@ added when the MCP server started successfully — the system prompt's "Tools yo
250250
have available" list mirrors that, so we don't lie to the model about
251251
capabilities that aren't wired.
252252

253+
The renderer enforces an inner 25 s budget on each `executeExpression` so the
254+
two timeouts don't fire simultaneously (renderer first, MCP cap as a safety
255+
net). When the 25 s timer fires, the renderer also dispatches
256+
`executionContext/interrupt` to the LS via the `onTimeout` callback plumbed
257+
through `queuedExecuteExpressionRaw` → `runExecuteExpressionSlot` →
258+
`awaitExecuteSlot` in `app/gui/src/providers/openedProjects/project/project.ts`.
259+
Without that interrupt, the LS keeps draining the abandoned `executeExpression`
260+
inside the single-threaded execution context, and the AI's next verification
261+
call queues behind it and inherits the 25 s timeout — even a literal-only probe.
262+
Interrupt is **context-wide** (LS protocol limitation: there is no per-slot
263+
cancel for `executeExpression`), so any concurrent visualization or
264+
user-triggered eval in the same context is also stopped; interrupted
265+
computations re-evaluate when next observed, so the trade-off is short-term
266+
re-render churn for unblocking the AI's verify loop. The 25 s budget itself is
267+
not the variable to tune here — Enso is meant for million-row datasets and
268+
verification must cap reads at the parser / SQL layer; the system prompt's "Cap
269+
verification reads at the lowest layer" section instructs the model accordingly.
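The timeout-plus-interrupt behaviour described above can be sketched as follows. This is a minimal illustration, not the code in `project.ts`: `withTimeout` and its parameters are hypothetical names, and the `onTimeout` callback stands in for whatever dispatches `executionContext/interrupt`.

```typescript
// Hypothetical helper: the name and signature are illustrative assumptions,
// not the project's actual API.
type OnTimeout = () => void

function withTimeout<T>(work: Promise<T>, budgetMs: number, onTimeout: OnTimeout): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      // Interrupt first so the abandoned work stops blocking later requests,
      // then report the timeout to the caller.
      onTimeout()
      reject(new Error('Execution timed out.'))
    }, budgetMs)
    work.then(
      (value) => {
        clearTimeout(timer)
        resolve(value)
      },
      (error) => {
        clearTimeout(timer)
        reject(error)
      },
    )
  })
}
```

Calling the callback before rejecting mirrors the ordering in the text: the interrupt has to go out even though the caller has already given up on the result.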
270+
253271
**Renderer side:**
254272
`app/gui/src/project-view/components/ComponentBrowser/aiToolHandler.ts` exposes
255273
a `useAiToolHandler()` Vue composable mounted by `GraphEditor.vue` so the IPC
@@ -498,7 +516,7 @@ back to its default.
498516

499517
`ENSO_AI_CLAUDE_EXTRA_ARGS` is split on whitespace and appended verbatim to the
500518
spawned `claude -p …` flag list, after the built-in flags. Used by the
501-
AI-effectiveness suite (`tests/aiChallengePrep.spec.ts`) to compare models and
519+
AI-effectiveness suite (`tests/aiChallenges.spec.ts`) to compare models and
502520
reasoning levels — e.g. `ENSO_AI_CLAUDE_EXTRA_ARGS="--model claude-sonnet-4-6"`.
503521
No shell-style quoting: values containing whitespace aren't expressible. Args
504522
are forwarded to both the primary and any warming child so a context rotation
@@ -569,8 +587,9 @@ describe block on an env flag (`process.env.ENSO_TEST_AI === '1'`) and note the
569587
flag in the plan's verification section so per-step smokes still exercise it
570588
locally.
571589

572-
`tests/aiChallengePrep.spec.ts` is the heavy AI suite — it drives full Preppin'
573-
Data challenge solves through Component Browser AI-mode prompts. It's gated on
574-
`ENSO_TEST_AI_CHALLENGES_DIR=/abs/path` pointing at manually-downloaded inputs
575-
(see `tests/README.md` for the expected layout) because the inputs aren't
576-
checked in and the agent budget is real.
590+
`tests/aiChallenges.spec.ts` is the heavy AI suite — it drives full analytics
591+
workflows through Component Browser AI-mode prompts (Preppin' Data weekly
592+
challenges plus app-demo workflows like Colorado COVID and FX Rates History).
593+
It's gated on `ENSO_TEST_AI_CHALLENGES_DIR=/abs/path` pointing at
594+
manually-downloaded inputs (see `tests/README.md` for the expected layout)
595+
because the inputs aren't checked in and the agent budget is real.

app/electron-client/src/ai/prompts.ts

Lines changed: 15 additions & 1 deletion
@@ -96,7 +96,21 @@ You will receive:
9696
9797
**Live progress narration (REQUIRED — one before every tool call).** Before EACH \`tool_use\` block, emit one short text block (≤8 words, present continuous) describing what *this specific* tool call is checking — e.g. "Checking Result distinct values", "Counting finalist rows", "Reading Table.join docs", "Verifying final body". When running a multi-step check within the turn, suffix with your own progress — "Probing data shape (2/4)", "Verifying body — final check", "Almost done — last probe". Each tool call gets its own narration even if it is part of the same logical step. The user sees these notes as the placeholder node's status text. Skipping them leaves the user staring at "Thinking…". Code, expression text, and file paths must NOT appear in the narration — those are logged separately.
9898
99-
**Verify before returning (REQUIRED whenever \`evaluateExpression\` is available).** Before emitting your closing JSON, run **exactly one** final \`evaluateExpression\` call on your proposed \`body\` that bundles the result inspection AND the warnings into a single JSON object. Build it as your body (with the final binding named \`result\` — rename if needed) followed by a closing line such as:
99+
**Cap verification reads at the lowest layer (REQUIRED whenever \`evaluateExpression\` is available).** The 25 s tool-call timeout fires during *I/O* (HTTPS fetch, parser, SQL round-trip), not during post-parse slicing — wrapping a full \`Data.read\` in \`.take 100\` does NOT save parse cost. Push the cap into the reader or the SQL itself. The narrowed read is a probe; your final \`body\` can still read everything.
100+
101+
- **CSV / TSV** → \`Data.read "<path>" (..Delimited row_limit=(..First 100))\`. The parser stops at row 100.
102+
- **Excel** → analogous — read the \`Excel_Format\` doc blocks for the \`row_limit\` parameter on the Sheet/Range constructors.
103+
- **Database table** → \`.limit N\` push-down (deferred until \`.read\`): \`connection.query "MyTable" . limit 100\`. Never pull the full table then \`.take\`.
104+
- **URL with query params** → narrow the params in your verification URL. A BoE series with \`Datefrom=01/Jan/2020&Dateto=01/Jan/2026\` (~1500 rows, slow) probes fine with \`Datefrom=01/Jan/2025&Dateto=08/Jan/2025\` (~5 rows). Same column structure, fits the budget. Note: for URL fetches the engine's HTTPS path itself is the long pole — \`row_limit\` on the format does NOT shrink the network round-trip; only narrowing the URL does.
105+
- **JSON / Parquet / plain HTTP without query-narrowing params** → no reader-side cap exists. Either accept the cost, or skip the real read and probe the planned operation against a tiny literal (e.g. \`Table.new [["Code",["XUDLADD"]],["Value",[1.6]]] . rename_columns …\`).
106+
107+
After capping, \`.column_names\`, \`.row_count\`, \`.take 5 . to_text\` etc. on the result are cheap. Rule of thumb: if your verify expression could touch more than a few thousand rows of source data, you have not capped at the lowest layer — fix the read, don't paper over with a post-slice.
108+
109+
**\`Data.fetch URL\` returns \`Response\`, not \`Table\`.** Critical pitfall for web data. The literal call \`Data.fetch "<url>"\` gives you an HTTP \`Response\` value — methods like \`.rename_columns\`, \`.column_names\`, \`.aggregate\` do NOT exist on \`Response\` and will panic at runtime with \`Method <name> of type Response could not be found\`. To get a parsed \`Table\` you have two options: either pass a format hint to \`Data.fetch\` — \`Data.fetch "<url>" format=(..Delimited)\` for CSV, \`format=(..JSON)\` for JSON, etc. — OR use \`Data.read "<url>"\` which auto-detects the format from the Content-Type / file extension. Prefer \`Data.read URL\` when the URL extension or response content-type is obvious; use explicit \`Data.fetch URL format=…\` when you need to override detection or pass HTTP headers. Verify with the same call shape you ship — if verification fails to run (timeout), do NOT silently switch to a bare \`Data.fetch URL\` in the final body.
110+
111+
**On \`executeExpression: Execution timed out.\`, treat it as "the read was too large" and retry with a smaller cap — not as "the code is wrong".** A \`Panic\`, \`DataflowError\`, or type-mismatch message means **fix the code**; \`Execution timed out.\` means **shrink the read**. Halve the cap, then halve again, fall back to a literal probe if needed. Verification timeouts do NOT relax the correctness contract: a body you couldn't verify must still parse the source into the right Enso type (see the \`Data.fetch\` / \`Data.read\` rule above) and must not chain Table methods onto non-Table sources. When you decide to stop verifying, you are betting on the *type discipline* of the call shape — pick a shape you are confident about.
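As an illustration of the cap-and-retry rule above, the sequence of probes can be written out as a small schedule. This is a sketch under assumptions: `probeSchedule`, `ProbeStep`, and the `minCap` floor are hypothetical and do not appear in the prompt, which only says to halve the cap and fall back to a literal probe if needed.

```typescript
// Hypothetical types and function, for illustration only.
type ProbeStep = { kind: 'capped'; rowLimit: number } | { kind: 'literal' }

// Lists the probes to try in order: the initial row cap, then successive
// halvings down to minCap (an assumed floor), then a literal-only probe.
function probeSchedule(initialCap: number, minCap: number): ProbeStep[] {
  const steps: ProbeStep[] = []
  for (let cap = initialCap; cap >= minCap && cap >= 1; cap = Math.floor(cap / 2)) {
    steps.push({ kind: 'capped', rowLimit: cap })
  }
  // Last resort: skip the real read and probe against a tiny literal table.
  steps.push({ kind: 'literal' })
  return steps
}

// Example: caps of 100, 50, 25, 12, then the literal fallback.
const schedule = probeSchedule(100, 10)
```

Only a timeout should advance to the next step; a `Panic` or `DataflowError` means the expression itself needs fixing, so retrying with a smaller cap would not help.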
112+
113+
**Verify before returning (REQUIRED whenever \`evaluateExpression\` is available).** After applying the capping rules above, run a final \`evaluateExpression\` call on your proposed \`body\` that bundles the result inspection AND the warnings into a single JSON object. Aim for **one call** — combine the inspections into one bundled expression rather than fanning out — but if that call hits the timeout above, retry per the cap-and-retry rules. Build it as your body (with the final binding named \`result\` — rename if needed) followed by a closing line such as:
100114
101115
\`\`\`
102116
JS_Object.from_pairs [["row_count", result.row_count], ["columns", result.column_names], ["preview", (result.take 5).to_text], ["warnings", (Warning.get_all result).map (w-> w.to_display_text)]] . to_json

app/electron-client/tests/README.md

Lines changed: 33 additions & 16 deletions
@@ -60,7 +60,7 @@ enso> corepack pnpm -r --filter enso ide-integration-test tests/gettingStarted.s
6060
Two AI-driven specs are gated on env vars and skipped silently otherwise.
6161

6262
Per-prompt budgets are generous (10 min for `aiNode.spec.ts`, 15 min for
63-
`aiChallengePrep.spec.ts`) because deep-thinking turns on `--effort max` can
63+
`aiChallenges.spec.ts`) because deep-thinking turns on `--effort max` can
6464
genuinely run for several minutes of channel silence (the underlying API does
6565
not surface per-token thinking deltas). Stall detection lives in the main
6666
process: `IDLE_TIMEOUT_MS` (5 min) errors out a turn whose stream-json channel
@@ -76,40 +76,57 @@ Requires the `claude` CLI on `PATH` and authenticated. Set `ENSO_TEST_AI=1`.
7676
enso> ENSO_TEST_AI=1 corepack pnpm -r --filter enso ide-integration-test tests/aiNode.spec.ts
7777
```
7878

79-
### `aiChallengePrep.spec.ts`: Preppin' Data challenge inputs (5–15 min/test)
79+
### `aiChallenges.spec.ts`: AI challenge inputs (5–15 min/test)
8080

81-
Long-running e2e tests that drive AI nodes through a full Preppin' Data
82-
challenge.
81+
Long-running e2e tests that drive AI nodes through analytics workflows. Two
82+
flavors:
8383

84-
These tests need the `claude` CLI **and** the original challenge inputs
85-
downloaded by hand (no fixtures are committed). Source URLs:
84+
- **Preppin' Data tests** isolate single capability gaps with prompts that spell
85+
out value-dependent context.
86+
- **App-demo tests** (Colorado COVID, FX Rates History) describe a business goal
87+
and verify the agent picks the right multi-step approach on its own.
88+
89+
These tests need the `claude` CLI **and** (for the file-driven tests) the
90+
original inputs downloaded by hand — no fixtures are committed. Source URLs:
8691

8792
- Week 32 — Pokemon Card Organising:
8893
<https://preppindata.blogspot.com/2024/08/2024-week-32-pokemon-card-organising.html>
8994
- Week 51 — Strictly Positive Improvements:
9095
<https://preppindata.blogspot.com/2024/12/2024-week-51-strictly-positive.html>
9196
(input is identical to Challenge 42)
97+
- Colorado COVID — `CDPHE_COVID19_County_Status_Metrics.csv` and
98+
`ColoradoGeoData.db` (the SQLite database holds a `ColoradoLatLong` table) live in the
99+
local `~/dev/project-templates/Data/` checkout; copy both into the challenge
100+
directory before running.
92101

93102
Drop the downloaded files into a single directory (flat — no subfolders):
94103

95104
```
96105
$ENSO_TEST_AI_CHALLENGES_DIR/
97-
Gym Leader Set Cards.xlsx # week 32 (sheets: Trainer Cards, Pokemon Cards, Leader Order)
98-
Pokemon Input.xlsx # week 32 (only the `Pokemon` sheet is used)
106+
Gym Leader Set Cards.xlsx # week 32 (sheets: Trainer Cards, Pokemon Cards, Leader Order)
107+
Pokemon Input.xlsx # week 32 (only the `Pokemon` sheet is used)
99108
strictly_come_dancing_series_1_to_21_tables.csv # week 51
109+
CDPHE_COVID19_County_Status_Metrics.csv # Colorado COVID
110+
ColoradoGeoData.db # Colorado COVID
100111
```
101112

113+
The FX Rates History test doesn't read any local file — it fetches the BoE
114+
exchange-rate CSV over HTTPS at runtime. The env var still gates it (so the test
115+
doesn't fire in default local runs), but the directory can be empty for this
116+
test; the network connection to `bankofengland.co.uk` is the real prerequisite.
117+
102118
Then:
103119

104120
```bash
105-
enso> ENSO_TEST_AI_CHALLENGES_DIR=/abs/path/to/preppin-data \
106-
corepack pnpm -r --filter enso ide-integration-test tests/aiChallengePrep.spec.ts
121+
enso> ENSO_TEST_AI_CHALLENGES_DIR=/abs/path/to/challenge-inputs \
122+
corepack pnpm -r --filter enso ide-integration-test tests/aiChallenges.spec.ts
107123
```
108124

109-
Per-test skips fire when only one challenge's files are present, so a developer
110-
who has downloaded only week 51 can still run that one. If Preppin' Data
111-
publishes the inputs under different filenames, edit the `WEEK_32_FILES` /
112-
`WEEK_51_FILES` constants in the spec or rename the local copy.
125+
Per-test skips fire when only some challenges' files are present, so a developer
126+
who has downloaded only week 51 can still run that one. If a vendor publishes
127+
the inputs under different filenames, edit the corresponding `WEEK_32_FILES` /
128+
`WEEK_51_FILES` / `COLORADO_FILES` constant in the spec or rename the local
129+
copy.
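The per-test skip described above amounts to checking which expected files are present in the challenge directory. A minimal sketch, assuming a hypothetical `missingInputs` helper; the spec's real constants and skip logic may differ, and the array below only reuses the week 51 filename from the layout listing:

```typescript
import { existsSync } from 'node:fs'
import { join } from 'node:path'

// Hypothetical helper: returns the expected files that are absent from `dir`.
// An unset directory counts as "everything missing".
function missingInputs(dir: string | undefined, files: readonly string[]): string[] {
  if (!dir) return [...files]
  return files.filter((name) => !existsSync(join(dir, name)))
}

// Illustrative stand-in for a per-challenge file list.
const week51Files = ['strictly_come_dancing_series_1_to_21_tables.csv']
const missingWeek51 = missingInputs(process.env.ENSO_TEST_AI_CHALLENGES_DIR, week51Files)
```

A test would then skip itself when its own list of missing files is non-empty, so one challenge's absent inputs do not block another's.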
113130

114131
#### Effectiveness metrics (optional)
115132

@@ -136,11 +153,11 @@ Set `ENSO_AI_CLAUDE_EXTRA_ARGS` to extra flags forwarded verbatim to the spawned
136153
`claude` CLI (whitespace-split, no shell quoting):
137154

138155
```bash
139-
ENSO_TEST_AI_CHALLENGES_DIR=/abs/path/to/preppin-data \
156+
ENSO_TEST_AI_CHALLENGES_DIR=/abs/path/to/challenge-inputs \
140157
ENSO_AI_CHALLENGES_METRICS_DIR=/abs/path/to/metrics \
141158
ENSO_AI_CLAUDE_EXTRA_ARGS="--model claude-sonnet-4-6" \
142159
corepack pnpm -r --filter enso ide-integration-test \
143-
tests/aiChallengePrep.spec.ts
160+
tests/aiChallenges.spec.ts
144161
```
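To make the "whitespace-split, no shell quoting" behaviour concrete, here is a small sketch of that splitting rule. `splitExtraArgs` is a hypothetical name, and the `--flag` argument in the second example is invented purely to show the quoting limitation; the actual implementation in the Electron client may differ in detail.

```typescript
// Hypothetical helper: splits on runs of whitespace and drops empty entries.
// Quotes are not interpreted, so they stay attached to the surrounding tokens.
function splitExtraArgs(raw: string | undefined): string[] {
  return (raw ?? '').split(/\s+/).filter((arg) => arg.length > 0)
}

const modelArgs = splitExtraArgs('--model claude-sonnet-4-6')
// modelArgs is ['--model', 'claude-sonnet-4-6']

const quotedArgs = splitExtraArgs('--flag "a b"')
// quotedArgs is ['--flag', '"a', 'b"']: the quoted value is broken apart
```

The second call shows why values containing whitespace cannot be expressed this way.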
145162

146163
The verbatim env-var value is captured in each row's `ai_parameters` column, so
