CLOUDP-367319 add Braintrust-based LLM accuracy evaluation framework #1141

Draft
nima-taheri-mongodb wants to merge 1 commit into main from cloudp-367319_braintrust_llm-as-judge_poc

@nima-taheri-mongodb nima-taheri-mongodb commented May 4, 2026

Summary

Introduces an accuracy evaluation framework for testing MCP tools. It uses an LLM-as-judge to score a conversation and an LLM-as-interactor to send follow-up messages within it, with Braintrust handling experiment tracking and scoring.

Architecture

Each eval run executes the following pipeline per test case, per model:

  1. Seed: spins up a MongoDB cluster, loads test data into an isolated, per-run database, and creates indexes as specified in the test case config
  2. Converse: runs the conversation between the user and the target LLM; a secondary LLM agent decides whether any follow-up instructions apply and, if so, sends them as user messages
  3. Judge: a third LLM agent inspects the conversation and the final database state against the test assertions, using read-only MCP tools, and scores it
  4. Report: results are sent to Braintrust for tracking over time across models and runs
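
The four stages above can be sketched as a plain loop over test cases and models. This is only an illustration of the control flow described in this PR, not the code in testAgent.ts or scaffolding.ts: every type and function name below (`TestCase`, `PipelineSteps`, `runPipeline`) is hypothetical, and the real steps are injected here as callbacks so the sketch stays self-contained.

```typescript
// Hypothetical shapes: none of these names come from the PR.
type Message = { role: "system" | "user" | "assistant"; content: string };

interface TestCase {
    id: string;
    systemPrompt: string;
    userPrompt: string;
    assertions: string;
}

interface PipelineSteps {
    seed(testCase: TestCase): Promise<void>;
    converse(testCase: TestCase, model: string): Promise<Message[]>;
    judge(testCase: TestCase, conversation: Message[]): Promise<number>;
    report(result: { testCaseId: string; model: string; score: number }): Promise<void>;
}

// Runs seed -> converse -> judge -> report once per test case, per model.
async function runPipeline(testCases: TestCase[], models: string[], steps: PipelineSteps): Promise<void> {
    for (const testCase of testCases) {
        for (const model of models) {
            // Each (test case, model) pair starts from freshly seeded data.
            await steps.seed(testCase);
            const conversation = await steps.converse(testCase, model);
            const score = await steps.judge(testCase, conversation);
            await steps.report({ testCaseId: testCase.id, model, score });
        }
    }
}
```

Running the pairs sequentially keeps the sketch simple; the actual runner may well execute tasks concurrently.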

The infra lives in tests/accuracy/eval/infra/ and is separated into focused modules:

| File | Role |
| --- | --- |
| scaffolding.ts | Braintrust Eval() runner: lazy cluster init, task dispatch, scoring |
| testAgent.ts | Orchestrates the full conversation → follow-up → judge pipeline |
| conversation.ts | Manages message history and runs LLM turns via the Vercel AI SDK's generateText function |
| followUpBot.ts | Decides whether to send a follow-up user message |
| judgeBot.ts | Scores the conversation against assertions, using read-only MCP tools if needed |
| seeding.ts | Seeds collections and waits for Atlas Search readiness before indexing |

Writing eval cases

Create a *.eval.ts file and call runEval:

import { runEval } from "./infra/scaffolding.js";

runEval({
    clusterConfig: { search: true },
    experimentName: "my-benchmark-<model_name>",
    id: "my-eval",
    tags: ["search", "index", "creation"],
    data: [
        {
            id: "my-test",
            input: {
                systemPrompt: "You are a MongoDB expert.",
                userPrompt: "Create a search index on the 'movies' collection.",
                dbClusterSeed: {
                    collections: [{
                        collection: "movies",
                        documents: "tests/accuracy/test-data-dumps/mflix.movies-with-plot.json",
                    }]
                },
            },
            assertions: "A search index should exist on the 'movies' collection.",
        },
    ],
});

Running evals

npx tsx tests/accuracy/eval/my-eval.eval.ts
# or use braintrust cli to run it
npx braintrust eval tests/accuracy/eval/my-eval.eval.ts

Checklist

@nima-taheri-mongodb nima-taheri-mongodb changed the title feat: initial poc CLOUDP-367319 add Braintrust-based LLM accuracy evaluation framework May 4, 2026
@nima-taheri-mongodb nima-taheri-mongodb force-pushed the cloudp-367319_braintrust_llm-as-judge_poc branch from 9c37bd6 to e42edfa Compare May 4, 2026 19:32

@nirinchev nirinchev left a comment

At a high level, it looks reasonable, though I have some questions. One more general question: my understanding was that with Braintrust, we can define evals in the UI and would then need to do some work to translate those into test cases that we run. Instead, the approach here hardcodes the evals in the MCP project and uses Braintrust only for visualizing the results. Am I misunderstanding the value prop of Braintrust, or should we aim to support this dynamic evals use case in some shape or form?

const __dirname = fileURLToPath(import.meta.url);

- export const ROOT_DIR = path.join(__dirname, "..", "..", "..", "..");
+ export const ROOT_DIR = process.cwd();

Why did we change this? ROOT_DIR will now depend on where the test is run from, which can cause issues.
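
For illustration, one way to keep the root independent of the working directory is to resolve it from the module's own URL. This is a minimal sketch, not the PR's code: `resolveRootDir` and its `levelsUp` parameter are hypothetical, and the right number of levels depends on where the calling file sits in the repository.

```typescript
import * as path from "node:path";
import { fileURLToPath } from "node:url";

// Hypothetical helper: walks `levelsUp` directories up from the directory
// that contains the module identified by `moduleUrl`.
function resolveRootDir(moduleUrl: string, levelsUp: number): string {
    // fileURLToPath yields the file's path, so take its directory first.
    const moduleDir = path.dirname(fileURLToPath(moduleUrl));
    const parents = Array.from({ length: levelsUp }, () => "..");
    return path.resolve(moduleDir, ...parents);
}
```

In an ES module the call site would pass `import.meta.url` as `moduleUrl`, so the result stays the same no matter which directory the test is launched from.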


const mflixMovies = {
collection: "movies",
documents: "tests/accuracy/test-data-dumps/mflix.movies-with-plot.json",

Not fully opposed to this, but should we set this up to use the default dataset instead?

},
},
assertions:
"The assistant is expected to return at least 1 document, the first returned result should be the document with id 'fbf30e42-ae6d-4775-bb3e-c5c127ddea06' from 'movies' collection.",

[q] should we be more specific with the assertions? E.g. it seems to me that the assistant using find and then manually processing the documents to find the desired one will pass, even though it didn't follow the instructions. Should we be evaluating for tool usage and argument shapes or are we fine with treating the MCP server as a black box and as long as we get the desired results, we don't care how the model got to them?
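
If we did want to evaluate tool usage, a deterministic check could sit alongside the LLM judge. The sketch below assumes the conversation layer can expose the tool calls it recorded; the `ToolCall` and `ExpectedToolCall` shapes and the `findMissingToolCalls` helper are hypothetical and not part of this PR.

```typescript
// Hypothetical record of one tool invocation captured during the conversation.
interface ToolCall {
    toolName: string;
    args: Record<string, unknown>;
}

// Hypothetical expectation: the tool must be called, optionally with certain argument keys.
interface ExpectedToolCall {
    toolName: string;
    requiredArgKeys?: string[];
}

// Returns the expectations that no recorded call satisfies; an empty result means all were met.
function findMissingToolCalls(actual: ToolCall[], expected: ExpectedToolCall[]): ExpectedToolCall[] {
    return expected.filter(
        (exp) =>
            !actual.some(
                (call) =>
                    call.toolName === exp.toolName &&
                    (exp.requiredArgKeys ?? []).every((key) => key in call.args)
            )
    );
}
```

This only checks tool names and the presence of argument keys, so it would complement the judge's assertion-based score rather than replace it.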

// initialization when multiple Braintrust tasks start concurrently before the first setup completes.
function createLazyInfrastructure(
clusterConfig: MongoClusterConfiguration
): [getInfra: () => Promise<EvalInfrastructure>, closeInfra: () => Promise<void>] {

[nit] it'd be more idiomatic to return an object rather than an array here.
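
A minimal sketch of the object-returning shape, under the assumption that the goal is a single shared initialization for concurrent tasks. `createLazyResource` is a hypothetical generic stand-in, not the PR's `createLazyInfrastructure`, and it takes the setup and teardown functions as parameters so it runs on its own.

```typescript
// Hypothetical helper: memoizes an async factory so concurrent callers share one initialization.
function createLazyResource<T>(
    create: () => Promise<T>,
    destroy: (resource: T) => Promise<void>
): { get: () => Promise<T>; close: () => Promise<void> } {
    let pending: Promise<T> | undefined;

    return {
        // The promise is stored synchronously, so callers that arrive before
        // setup completes all await the same in-flight initialization.
        get: () => (pending ??= create()),
        // Closing is a no-op if setup never started.
        close: async () => {
            const current = pending;
            pending = undefined;
            if (!current) return;
            await destroy(await current);
        },
    };
}
```

A call site would then read `const { get, close } = createLazyResource(...)`, which keeps the names attached to the values instead of relying on tuple position.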

return;
}

const [getInfra, closeInfra] = createLazyInfrastructure(clusterConfig);

This creates a separate cluster for each run, but the cluster itself is reused across models, right? Should we create clusters per test suite instead?

}

function braintrustNoSendLogs(): boolean {
return !process.env.BRAINTRUST_API_KEY;

Does it even make sense to run the test suites if we don't have an API key here? What would the outcome be?
