CLOUDP-367319 add Braintrust-based LLM accuracy evaluation framework #1141
nima-taheri-mongodb wants to merge 1 commit into
nirinchev left a comment
At a high level, this looks reasonable, though I have some questions. One more question: my understanding was that with Braintrust we can define evals in the UI, and then we'd need to do some work to translate those into test cases that we run. The approach here instead hardcodes the evals in the MCP project and uses Braintrust only for visualizing the results. Am I misunderstanding the value prop of Braintrust, or should we aim to support this dynamic evals use case in some shape or form?
    const __dirname = fileURLToPath(import.meta.url);

    export const ROOT_DIR = path.join(__dirname, "..", "..", "..", "..");
    export const ROOT_DIR = process.cwd();
Why did we change this - this will now depend on where the test is run from, which can cause issues.
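One way to address this concern, sketched here under assumptions, is to locate the root by walking upward from a known directory until a marker file is found, so the result does not depend on where the command is launched. `findRootDir` and its `markerFile` parameter are hypothetical names, not part of the PR.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper (not from the PR): walk up from `startDir` until a
// directory containing `markerFile` is found, and return that directory.
function findRootDir(startDir: string, markerFile = "package.json"): string {
    let current = path.resolve(startDir);
    for (;;) {
        if (fs.existsSync(path.join(current, markerFile))) {
            return current;
        }
        const parent = path.dirname(current);
        if (parent === current) {
            // Reached the filesystem root without finding the marker.
            throw new Error(`No ${markerFile} found above ${startDir}`);
        }
        current = parent;
    }
}
```

In an ES module, the starting directory could be `path.dirname(fileURLToPath(import.meta.url))`, which is tied to the file's location rather than to the working directory.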
    const mflixMovies = {
        collection: "movies",
        documents: "tests/accuracy/test-data-dumps/mflix.movies-with-plot.json",
Not fully opposed to this, but should we set this up to use the default dataset instead?
        },
    },
    assertions:
        "The assistant is expected to return at least 1 document, the first returned result should be the document with id 'fbf30e42-ae6d-4775-bb3e-c5c127ddea06' from 'movies' collection.",
[q] should we be more specific with the assertions? E.g. it seems to me that the assistant using find and then manually processing the documents to find the desired one will pass, even though it didn't follow the instructions. Should we be evaluating for tool usage and argument shapes or are we fine with treating the MCP server as a black box and as long as we get the desired results, we don't care how the model got to them?
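If we do decide to evaluate tool usage alongside the natural-language assertion, a deterministic check could look roughly like the sketch below. The `ToolCall` and `ExpectedToolCall` shapes and the `scoreToolUsage` function are hypothetical and only illustrate the idea; the PR's actual types may differ.

```typescript
// Hypothetical shapes for illustration; not taken from the PR.
interface ToolCall {
    toolName: string;
    args: Record<string, unknown>;
}

interface ExpectedToolCall {
    toolName: string;
    // Keys that must be present in the call's arguments.
    requiredArgKeys: string[];
}

// Returns the fraction (0..1) of expected calls that are matched by some
// actual call with the same tool name and all required argument keys.
function scoreToolUsage(actual: ToolCall[], expected: ExpectedToolCall[]): number {
    if (expected.length === 0) {
        return 1;
    }
    const matched = expected.filter((exp) =>
        actual.some(
            (call) =>
                call.toolName === exp.toolName &&
                exp.requiredArgKeys.every((key) => key in call.args)
        )
    );
    return matched.length / expected.length;
}
```

With a check like this, a run that only calls `find` and post-processes the documents would score 0 against an expectation naming a different tool, even if the final answer happens to be right.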
    // initialization when multiple Braintrust tasks start concurrently before the first setup completes.
    function createLazyInfrastructure(
        clusterConfig: MongoClusterConfiguration
    ): [getInfra: () => Promise<EvalInfrastructure>, closeInfra: () => Promise<void>] {
[nit] It'd be more idiomatic to return an object rather than an array here.
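As a rough illustration of the object-returning shape, here is a generic sketch that also memoizes the setup promise so concurrent callers share one initialization. `createLazyResource`, `LazyResource`, `setup`, and `teardown` are hypothetical names; this is not the PR's implementation.

```typescript
// Hypothetical generic helper; not the PR's createLazyInfrastructure.
interface LazyResource<T> {
    get: () => Promise<T>;
    close: () => Promise<void>;
}

function createLazyResource<T>(
    setup: () => Promise<T>,
    teardown: (resource: T) => Promise<void>
): LazyResource<T> {
    // The in-flight (or settled) setup promise, shared by all callers.
    let pending: Promise<T> | undefined;
    return {
        get: () => {
            if (!pending) {
                pending = setup();
            }
            return pending;
        },
        close: async () => {
            if (!pending) {
                return; // Nothing was ever initialized.
            }
            const current = pending;
            pending = undefined;
            await teardown(await current);
        },
    };
}
```

A caller would then destructure by name, for example `const { get, close } = createLazyResource(...)`, which does not depend on element order the way a tuple does.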
        return;
    }

    const [getInfra, closeInfra] = createLazyInfrastructure(clusterConfig);
This creates a separate cluster for each run, but the cluster itself is reused by models, right? Should we create clusters per test suite instead?
    }

    function braintrustNoSendLogs(): boolean {
        return !process.env.BRAINTRUST_API_KEY;
Does it even make sense to run the test suites if we don't have an API key here? What would the outcome be?
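One option for making the behavior explicit is to resolve a run mode up front instead of silently falling back. The sketch below is only an assumption about how that could be expressed; `EvalRunMode`, `resolveEvalRunMode`, and `allowLocalOnly` are hypothetical names and not part of the PR.

```typescript
// Hypothetical run modes; names are illustrative and not from the PR.
type EvalRunMode = "send-logs" | "local-only" | "skip";

// Decide what to do based on whether BRAINTRUST_API_KEY is set.
// `allowLocalOnly` is an assumed opt-in for running without uploading results.
function resolveEvalRunMode(
    env: Record<string, string | undefined>,
    allowLocalOnly: boolean
): EvalRunMode {
    if (env.BRAINTRUST_API_KEY) {
        return "send-logs";
    }
    return allowLocalOnly ? "local-only" : "skip";
}
```

A suite could call this with `process.env` and log the chosen mode, so a missing key results in a visible skip or a clearly labeled local-only run.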
Summary
Introduces an accuracy evaluation framework for testing MCP tools with LLM-as-judge (to score a conversation) and LLM-as-interactor (to follow up on a conversation), backed by Braintrust for experiment tracking and scoring.
Architecture
Each eval run executes the following pipeline per test case, per model:
The infra lives in tests/accuracy/eval/infra/ and is separated into focused modules:
- scaffolding.ts: Eval() runner (lazy cluster init, task dispatch, scoring)
- testAgent.ts
- conversation.ts: generateText, tool
- followUpBot.ts
- judgeBot.ts
- seeding.ts
Writing eval cases
Create a *.eval.ts file and call runEval:
Running evals
Checklist