Evaluation pipeline performance bottlenecks: BC benchmarks 3-5x slower than necessary #137

@wangbinluo

Description

Summary

BrowseComp (BC-EN/BC-ZH) evaluations are significantly slower than other benchmarks. A single BC-ZH task averages 48.6 minutes (median 37.3 min), with tail tasks reaching 265 minutes (4.4 hours). BC-EN with 1266 tasks takes 60-70 hours per run.

Profiling shows LLM inference is the dominant bottleneck (80%+ of total task time), followed by tool-execution overhead. Both need optimization to reach the target of a 2x+ overall speedup.


1. LLM Inference Optimization (Biggest Impact) 🔴

LLM inference accounts for the vast majority of evaluation time. A BC task runs ~300-400 turns, each requiring a full LLM call with growing context (up to tens of thousands of tokens). Conservative estimate: 20-60 min per task on inference alone, often much more in practice.
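As a back-of-envelope check (the turn count is from profiling above; the per-call latency is an assumption):

```python
# Rough estimate of per-task inference time.
turns = 350          # ~300-400 LLM calls per BC task (from profiling)
avg_call_s = 6.0     # assumed average end-to-end latency per call
minutes = turns * avg_call_s / 60
print(minutes)  # -> 35.0, inside the 20-60 min conservative range
```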

1.1 Prefix Caching

Current state: All evaluation tasks share the same system prompt (several thousand tokens of agent instructions), yet every request recomputes the full KV cache from scratch.

Optimization: Enable prefix caching (RadixTree) on the sglang server so the system prompt KV cache is computed once and reused across all concurrent requests.

Additional benefit: Within a single task's multi-turn conversation, each turn's context is a prefix of the next. With cache-aware routing (routing same-task requests to the same worker), KV cache from previous turns can be reused.

Expected impact: Significant reduction in prefill compute, especially for later turns with long context. Estimated 30-50% inference time reduction.
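A toy model of why prefix reuse compounds across turns (all token counts below are illustrative assumptions, not measurements):

```python
# Toy prefill model for one multi-turn task.
system = 4000      # shared system-prompt tokens (assumed)
per_turn = 150     # new tokens appended each turn (assumed)
turns = 300

# Without caching, every turn re-prefills the entire context so far.
no_cache = sum(system + per_turn * t for t in range(turns))

# With prefix caching + cache-aware routing, each turn prefills only
# its new tokens; the system prompt is computed once.
with_cache = system + per_turn * turns

print(no_cache, with_cache)
```

Even if the real ratio is far smaller (decode time, cache evictions, imperfect routing all eat into it), the model shows why the savings grow with turn count and context length.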

1.2 Chunked Prefill with Pipeline Parallelism

Current state: Basic sglang launch with --tp 8, no chunked prefill.

Optimization: Enable --chunked-prefill-size 4096 --enable-dynamic-chunking to pipeline long prefills. SGLang benchmarks show up to 3.3x prefill throughput and 67.9% TTFT reduction for long contexts.

Why this matters for evaluation: BC tasks accumulate long conversation histories (10K-60K+ tokens). Chunked prefill directly reduces the time spent on these long-context prefill operations.

1.3 Prefill-Decode Disaggregation (Multi-Node)

Current state: Single-node TP8 serving both prefill and decode.

Optimization: When multiple nodes are available, separate prefill-heavy and decode-heavy workloads into dedicated workers. Evaluation workloads are prefill-dominant (long contexts, relatively short generations), making this particularly beneficial.

Expected impact: Better GPU utilization and throughput when scaling beyond single node.

1.4 Current vs Optimized sglang Config

Current (basic):

```bash
python3 -m sglang.launch_server \
    --model-path <path> --tp 8 --host 0.0.0.0 --port 1234 \
    --trust-remote-code --enable-metrics --mem-fraction-static 0.9
```

Optimized (recommended):

```bash
python3 -m sglang.launch_server \
    --model-path <path> --tp 8 --host 0.0.0.0 --port 1234 \
    --trust-remote-code --enable-metrics --mem-fraction-static 0.9 \
    --chunked-prefill-size 4096 --enable-dynamic-chunking
```

Our internal infra framework already supports these advanced sglang features (prefix caching, chunked pipeline, PD disaggregation, cache-aware routing). Integrating them into the evaluation serving setup is the highest-leverage optimization.


2. Evaluation Pipeline Optimization

2.1 MCP Tool Server Parallel Initialization ✅ Fixed

Location: `libs/miroflow-tools/src/miroflow_tools/manager.py`, `get_all_tool_definitions()`

Each task initialized its 3 MCP tool servers sequentially; under high concurrency this took 234s on average (max 945s).

Fix: asyncio.gather() to parallelize. Measured 13.8x speedup (234s → 17s avg).

Status: ✅ Implemented in PR #139.
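The fix follows the standard pattern below (a minimal sketch with a stand-in coroutine; the real code initializes MCP servers):

```python
import asyncio

async def init_server(name: str) -> str:
    # Stand-in for one MCP tool server's startup + handshake.
    await asyncio.sleep(0.05)
    return f"{name} ready"

async def init_all(names: list[str]) -> list[str]:
    # asyncio.gather starts all initializations concurrently, so total
    # wall time is roughly the slowest single server, not the sum.
    return await asyncio.gather(*(init_server(n) for n in names))

results = asyncio.run(init_all(["search", "browser", "sandbox"]))
print(results)  # ['search ready', 'browser ready', 'sandbox ready']
```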


2.2 MCP Server Connection Not Reused Across Tool Calls 🔴

Location: `libs/miroflow-tools/src/miroflow_tools/manager.py`, `execute_tool_call()`

Every single tool call spawns a new MCP server subprocess, performs stdio handshake, executes, then destroys:

```python
# Called ~400 times per BC task!
async with stdio_client(server_params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()  # Handshake every time
        tool_result = await session.call_tool(tool_name, arguments)
# Process destroyed here
```

Note: playwright already does connection reuse correctly (manager.py:247-252).

Proposed fix: Keep MCP server sessions alive for the lifetime of a task.

Expected impact: Eliminate ~400 process spawns per task. Estimated 2-5 min saved per task.
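A self-contained sketch of the difference (`FakeSession` is a stand-in for `ClientSession` that just counts expensive handshakes; it is not the real MCP API):

```python
import asyncio
from contextlib import AsyncExitStack

class FakeSession:
    """Stand-in for ClientSession that counts expensive handshakes."""
    handshakes = 0

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False

    async def initialize(self):
        FakeSession.handshakes += 1

    async def call_tool(self, name, args):
        return f"{name}:ok"

async def per_call(n):
    # Current pattern: new session + handshake for every tool call.
    for _ in range(n):
        async with FakeSession() as s:
            await s.initialize()
            await s.call_tool("search", {})

async def reused(n):
    # Proposed pattern: one session held open for the task's lifetime.
    async with AsyncExitStack() as stack:
        s = await stack.enter_async_context(FakeSession())
        await s.initialize()  # handshake once
        for _ in range(n):
            await s.call_tool("search", {})

FakeSession.handshakes = 0
asyncio.run(per_call(5))
per_call_handshakes = FakeSession.handshakes   # 5 handshakes

FakeSession.handshakes = 0
asyncio.run(reused(5))
reused_handshakes = FakeSession.handshakes     # 1 handshake
print(per_call_handshakes, reused_handshakes)
```

In the real manager, the `AsyncExitStack` would live on the task object and be closed in the task's cleanup path, mirroring what the playwright path already does.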


2.3 httpx Connection Pooling ✅ Fixed

Location: libs/miroflow-tools/src/miroflow_tools/dev_mcp_servers/search_and_scrape_webpage.py

Each search call created a new TCP connection; the fix reuses a shared `httpx.AsyncClient`.

Status: ✅ Implemented in PR #139.


2.4 max_turns Reduced from 400 to 300 ✅ Fixed

The algorithm team confirmed that tasks not solved within 300 turns essentially never succeed in turns 300-400. The new config mirothinker_1.7_keep5_max300.yaml saves ~25% of wasted compute on tail tasks.

Status: ✅ Already on main.


2.5 Concurrency Overloading (No Backpressure)

With NUM_RUNS=2 and MAX_CONCURRENT=60, up to 120 processes run simultaneously, causing E2B sandbox init time to spike from 33s to 631s under contention.

Proposed fix: Shared semaphore across runs, or adaptive concurrency.


Won't Fix (confirmed with algorithm team)

  • scrape_and_extract_info optimization — Jina + LLM extraction are both necessary, cannot be shortened without accuracy loss.
  • LLM retry parameters (base_wait=30s, max_retries=10) — Required for reliability under high load.

Benchmark-Specific Impact

| Benchmark | Tasks | Avg Task Time | Main Bottleneck |
|-----------|-------|---------------|-----------------|
| BC-EN | 1266 | ~43 min | LLM inference (long context) + scrape calls |
| BC-ZH | 289 | ~49 min | LLM inference + high turn count |
| HLE | 500 | ~15 min | E2B sandbox latency |
| GAIA | 103 | ~20 min | Mixed tools |

Progress Tracker

| Priority | Optimization | Expected Impact | Status |
|----------|--------------|-----------------|--------|
| P0 | LLM inference: prefix caching + chunked prefill | 30-50% inference speedup | 🔴 To do |
| P0 | Parallel tool server init | 13.8x init speedup (234s → 17s) | ✅ PR #139 |
| P0 | MCP server connection reuse | Save 2-5 min/task | 🔴 To do |
| P1 | httpx connection pooling | Reduce TCP overhead | ✅ PR #139 |
| P1 | max_turns 400 → 300 | ~25% less wasted compute | ✅ On main |
| P2 | Concurrency backpressure | Reduce E2B init spike | 🔴 To do |

Environment

  • Agent config: mirothinker_1.7_keep5_max300 (previously v1.5 max400)
  • Typical: 30B model, 8×GPU sglang, MAX_CONCURRENT=60, NUM_RUNS=2
  • Profiled on: BC-ZH/BC-EN completed evaluations (xxg and lxx checkpoints)
