CLOUDP-367319 add Braintrust-based LLM accuracy evaluation framework by nima-taheri-mongodb · Pull Request #1141 · mongodb-js/mongodb-mcp-server

nima-taheri-mongodb · 2026-05-04T19:11:18Z

Summary

Introduces an accuracy evaluation framework for testing MCP tools. It uses an LLM-as-judge (to score a conversation) and an LLM-as-interactor (to send follow-up messages in a conversation), and is backed by Braintrust for experiment tracking and scoring.

Architecture

Each eval run executes the following pipeline per test case, per model:

Seed: spins up a MongoDB cluster, loads test data into an isolated, per-run database, and creates indexes as specified in the test case config
Converse: runs the conversation between the user and the target LLM; a secondary LLM agent decides whether any of the follow-up instructions apply and, if so, sends them as user messages
Judge: a third LLM agent checks the conversation and, using read-only MCP tools, the final database state against the test assertions, then scores the result
Report: results are sent to Braintrust for tracking over time across models and runs
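The four stages above can be sketched as a nested loop over test cases and models. This is a minimal illustration only: the `Stages` interface, `TestCase`, `Verdict`, and `runPipeline` are hypothetical names, not the PR's actual types, and the real implementation delegates scheduling to Braintrust's `Eval()` rather than looping by hand.

```typescript
// Hypothetical shapes for illustration; the PR's actual types may differ.
interface TestCase {
    id: string;
    userPrompt: string;
    assertions: string;
}

interface Verdict {
    score: number; // 0 (fail) to 1 (pass)
    rationale: string;
}

// Each stage is injected so the flow can be exercised without a cluster or an LLM.
interface Stages {
    seed(testCase: TestCase): Promise<string>; // resolves to the per-run database name
    converse(testCase: TestCase, model: string, dbName: string): Promise<string[]>;
    judge(testCase: TestCase, transcript: string[], dbName: string): Promise<Verdict>;
    report(testCase: TestCase, model: string, verdict: Verdict): Promise<void>;
}

async function runPipeline(stages: Stages, testCases: TestCase[], models: string[]): Promise<Verdict[]> {
    const verdicts: Verdict[] = [];
    for (const testCase of testCases) {
        for (const model of models) {
            // Seed a fresh database for every (test case, model) pair.
            const dbName = await stages.seed(testCase);
            const transcript = await stages.converse(testCase, model, dbName);
            const verdict = await stages.judge(testCase, transcript, dbName);
            await stages.report(testCase, model, verdict);
            verdicts.push(verdict);
        }
    }
    return verdicts;
}
```

Injecting the stages keeps the ordering logic testable with stubs; whether the PR structures it this way is an assumption.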

The infra lives in tests/accuracy/eval/infra/ and is separated into focused modules:

| File | Role |
| --- | --- |
| `scaffolding.ts` | Braintrust `Eval()` runner: lazy cluster init, task dispatch, scoring |
| `testAgent.ts` | Orchestrates the full conversation → follow-up → judge pipeline |
| `conversation.ts` | Manages message history and runs LLM turns via the Vercel AI SDK's `generateText` |
| `followUpBot.ts` | Decides whether to send a follow-up user message |
| `judgeBot.ts` | Scores the conversation against assertions, using read-only MCP tools if needed |
| `seeding.ts` | Seeds collections and waits for Atlas Search readiness before indexing |

Writing eval cases

Create a *.eval.ts file and call runEval:

import { runEval } from "./infra/scaffolding.js";

runEval({
    clusterConfig: { search: true },
    experimentName: "my-benchmark-<model_name>",
    id: "my-eval",
    tags: ["search", "index", "creation"],
    data: [
        {
            id: "my-test",
            input: {
                systemPrompt: "You are a MongoDB expert.",
                userPrompt: "Create a search index on the 'movies' collection.",
                dbClusterSeed: {
                    collections: [{
                        collection: "movies",
                        documents: "tests/accuracy/test-data-dumps/mflix.movies-with-plot.json",
                    }]
                },
            },
            assertions: "A search index should exist on the 'movies' collection.",
        },
    ],
});

Running evals

npx tsx tests/accuracy/eval/my-eval.eval.ts
# or use braintrust cli to run it
npx braintrust eval tests/accuracy/eval/my-eval.eval.ts

Checklist

I have signed the MongoDB CLA

nirinchev

At a high level, it looks reasonable, though I have some questions. One more general question: my understanding was that with Braintrust we can define evals in the UI, and we'd then need to do some work to translate those into test cases that we run. Instead, the approach here hardcodes the evals in the MCP project and uses Braintrust only to visualize the results. Am I misunderstanding the value prop of Braintrust, or should we aim to support this dynamic evals use case in some shape or form?

nirinchev · 2026-05-08T09:11:56Z

-const __dirname = fileURLToPath(import.meta.url);
-
-export const ROOT_DIR = path.join(__dirname, "..", "..", "..", "..");
+export const ROOT_DIR = process.cwd();


Why did we change this? It will now depend on where the test is run from, which can cause issues.
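One way to keep the root independent of the working directory is to derive it from the module's own URL. The sketch below is an assumption about how that could look, not the project's code: `resolveRootDir` and its `levelsUp` parameter are hypothetical. Note that `fileURLToPath(import.meta.url)` yields the file's path rather than its directory, so the sketch applies `path.dirname` first.

```typescript
import path from "node:path";
import { fileURLToPath } from "node:url";

// Hypothetical helper: walks `levelsUp` directories up from the module's directory.
export function resolveRootDir(moduleUrl: string, levelsUp: number): string {
    // fileURLToPath returns the path of the file itself, so take its directory first.
    const moduleDir = path.dirname(fileURLToPath(moduleUrl));
    const parents = Array.from({ length: levelsUp }, () => "..");
    return path.resolve(moduleDir, ...parents);
}
```

In an ES module this would be called as `resolveRootDir(import.meta.url, n)`, where `n` depends on where the calling file sits in the repository; the result does not change with `process.cwd()`.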

nirinchev · 2026-05-08T09:13:15Z

+
+const mflixMovies = {
+    collection: "movies",
+    documents: "tests/accuracy/test-data-dumps/mflix.movies-with-plot.json",


Not fully opposed to this, but should we set this up to use the default dataset instead?

nirinchev · 2026-05-08T09:27:18Z

+                },
+            },
+            assertions:
+                "The assistant is expected to return at least 1 document, the first returned result should be the document with id 'fbf30e42-ae6d-4775-bb3e-c5c127ddea06' from 'movies' collection.",


[q] should we be more specific with the assertions? E.g. it seems to me that the assistant using find and then manually processing the documents to find the desired one will pass, even though it didn't follow the instructions. Should we be evaluating for tool usage and argument shapes or are we fine with treating the MCP server as a black box and as long as we get the desired results, we don't care how the model got to them?
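If tool usage were to be evaluated alongside the LLM judge, a deterministic scorer could check which tools were called and whether the expected argument keys were present. This is a sketch only: `ToolCall`, `ExpectedToolCall`, and `scoreToolUsage` are hypothetical and not part of the PR, and the tool and argument names in the usage note are illustrative.

```typescript
// Hypothetical record of a tool invocation captured from the conversation.
interface ToolCall {
    toolName: string;
    args: Record<string, unknown>;
}

// Hypothetical expectation: a tool that must be called with at least these argument keys.
interface ExpectedToolCall {
    toolName: string;
    requiredArgs: string[];
}

// Returns the fraction of expected calls that appear in the actual calls (1 if none are expected).
function scoreToolUsage(actual: ToolCall[], expected: ExpectedToolCall[]): number {
    if (expected.length === 0) {
        return 1;
    }
    const matched = expected.filter((exp) =>
        actual.some(
            (call) => call.toolName === exp.toolName && exp.requiredArgs.every((key) => key in call.args)
        )
    );
    return matched.length / expected.length;
}
```

For example, expecting `{ toolName: "aggregate", requiredArgs: ["pipeline"] }` would score 0 for a conversation that only called `find`, even if the final answer happened to be correct. A black-box judge would not make that distinction.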

nirinchev · 2026-05-08T09:31:05Z

+// initialization when multiple Braintrust tasks start concurrently before the first setup completes.
+function createLazyInfrastructure(
+    clusterConfig: MongoClusterConfiguration
+): [getInfra: () => Promise<EvalInfrastructure>, closeInfra: () => Promise<void>] {


[nit] it'd be more idiomatic to return an object rather than an array here.
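An object-returning variant might look like the sketch below. It is generic rather than tied to the PR's `EvalInfrastructure` type, and `createLazyResource`, `LazyResource`, `init`, and `dispose` are hypothetical names. It memoizes the in-flight promise so that concurrent callers share one initialization, which is the behavior the diff's comment describes.

```typescript
interface LazyResource<T> {
    get(): Promise<T>;
    close(): Promise<void>;
}

// Hypothetical helper: initializes the resource at most once, even under concurrent get() calls.
function createLazyResource<T>(init: () => Promise<T>, dispose: (resource: T) => Promise<void>): LazyResource<T> {
    let pending: Promise<T> | undefined;

    return {
        get(): Promise<T> {
            // Store the promise itself (not the resolved value) so callers that arrive
            // before initialization completes reuse the same in-flight work.
            if (!pending) {
                pending = init();
            }
            return pending;
        },
        async close(): Promise<void> {
            if (!pending) {
                return; // never initialized, nothing to release
            }
            const current = pending;
            pending = undefined;
            let resource: T;
            try {
                resource = await current;
            } catch {
                return; // initialization failed, nothing to dispose
            }
            await dispose(resource);
        },
    };
}
```

Callers can then destructure by name, `const { get, close } = createLazyResource(...)`, which avoids depending on positional order.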

nirinchev · 2026-05-08T09:43:27Z

+        return;
+    }
+
+    const [getInfra, closeInfra] = createLazyInfrastructure(clusterConfig);


This creates a separate cluster for each run, but the cluster itself is reused by models, right? Should we create clusters per test suite instead?

nirinchev · 2026-05-08T09:46:06Z

+}
+
+function braintrustNoSendLogs(): boolean {
+    return !process.env.BRAINTRUST_API_KEY;


Does it even make sense to run the test suites if we don't have an API key here? What would the outcome be?
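One option is to make the no-key behavior explicit instead of implicit. The sketch below is an assumption, not the PR's code: `resolveReportingMode` and `ReportingMode` are hypothetical, and the function only inspects the same `BRAINTRUST_API_KEY` variable that `braintrustNoSendLogs()` reads. What a run without the key actually produces depends on the Braintrust SDK and is not asserted here.

```typescript
// Hypothetical result type describing where eval results will go.
type ReportingMode =
    | { kind: "remote" }
    | { kind: "local-only"; warning: string };

// Takes the environment as a parameter so the decision is easy to unit test.
function resolveReportingMode(env: Record<string, string | undefined>): ReportingMode {
    if (env.BRAINTRUST_API_KEY) {
        return { kind: "remote" };
    }
    return {
        kind: "local-only",
        warning: "BRAINTRUST_API_KEY is not set; results will not be sent to Braintrust.",
    };
}
```

A runner could call `resolveReportingMode(process.env)` at startup and either print the warning or fail fast, depending on which outcome the team prefers.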

nima-taheri-mongodb changed the title ~~feat: initial poc~~ CLOUDP-367319 add Braintrust-based LLM accuracy evaluation framework May 4, 2026

feat: initial poc

e42edfa

nima-taheri-mongodb force-pushed the cloudp-367319_braintrust_llm-as-judge_poc branch from 9c37bd6 to e42edfa Compare May 4, 2026 19:32

github-actions Bot added the type: chore label May 4, 2026

nirinchev reviewed May 8, 2026

nima-taheri-mongodb wants to merge 1 commit into main from cloudp-367319_braintrust_llm-as-judge_poc


2 participants
