
perf: parallelize tool server init and reduce LLM retry overhead #139

Open
wangbinluo wants to merge 4 commits into MiroMindAI:main from wangbinluo:dev_wbl

Conversation


@wangbinluo wangbinluo commented Mar 17, 2026

Summary

Addresses evaluation pipeline performance bottlenecks identified in #137.

Three changes targeting the slowest parts of BC benchmark evaluation.

Closes #137

Changes

  • P0: Parallelize MCP tool server initialization. manager.py get_all_tool_definitions() now uses asyncio.gather() instead of a sequential for loop.
  • P0: Reduce LLM retry overhead. openai_client.py base_wait_time 30s → 10s, max_retries 10 → 5. Prevents 60-90s wasted on retries.
  • P1: httpx connection pooling. search_and_scrape_webpage.py reuses a shared httpx.AsyncClient instead of creating a new one per request (~346 search calls per BC task).
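The diff itself isn't shown on this page, but the parallel-initialization change is a standard pattern. A minimal sketch, assuming hypothetical server and method names (the real ones live in manager.py), to show why the total time drops from the sum of per-server init times to roughly the max:

```python
import asyncio

# Hypothetical stand-in for spawning one MCP server and fetching its tools;
# the real code in manager.py does a subprocess spawn plus MCP handshake here.
async def init_server(name: str) -> tuple[str, list[str]]:
    await asyncio.sleep(0.01)  # simulate init latency
    return name, [f"{name}.tool_a", f"{name}.tool_b"]

async def get_all_tool_definitions(server_names: list[str]) -> dict[str, list[str]]:
    # Before: a sequential `for` loop awaited each server in turn, so total
    # time was the SUM of per-server init times. With gather() all servers
    # initialize concurrently, so total time is roughly the MAX of them.
    results = await asyncio.gather(*(init_server(n) for n in server_names))
    return dict(results)

tools = asyncio.run(get_all_tool_definitions(["tool-python", "search", "jina"]))
```

One caveat with this pattern: gather() propagates the first exception, so if one server fails to start, the others' results are discarded unless `return_exceptions=True` is passed.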

Benchmark Results

Tool server initialization time (3 runs on dev_wbl)

Run                  dev_wbl (parallel)
Run 1 (cold start)   29.3s
Run 2                12.1s
Run 3                 9.6s
Average              17.0s

Comparison with main branch (extracted from existing evaluation logs)

          main (sequential)   dev_wbl (parallel)   Speedup
Average   234s                17.0s                13.8x
Min       22s                 9.6s                 2.3x
Max       945s (15.8 min)     29.3s                32.3x

Note: main branch baseline was extracted from BC-ZH evaluation logs (qwen_xxg_negative_r10_new1_step50, 30 tasks). The high average (234s) includes E2B sandbox queueing under high concurrency (MAX_CONCURRENT=60). Per-server breakdown on main: tool-python avg=145s (max=631s), search avg=69s, jina avg=19s.

Test plan

  • Run bench_init_time.py on dev_wbl — 17.0s avg vs main baseline 234s avg
  • Run BC-ZH evaluation end-to-end and verify no regression in accuracy
  • Verify LLM retry behavior works correctly with reduced parameters
  • Check Serper API connection pooling doesn't cause stale connection issues
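The retry-parameter change can be sanity-checked with a small backoff calculator. This sketch assumes plain exponential backoff (base_wait_time * 2**attempt) with no jitter or cap; the actual formula in openai_client.py may differ, so treat the numbers as illustrative:

```python
def backoff_schedule(base_wait_time: int, max_retries: int) -> list[int]:
    # ASSUMPTION: plain exponential backoff, no jitter, no cap.
    # The real openai_client.py implementation may differ.
    return [base_wait_time * (2 ** attempt) for attempt in range(max_retries)]

old = backoff_schedule(30, 10)  # pre-PR settings
new = backoff_schedule(10, 5)   # settings in this PR

print(new)                          # [10, 20, 40, 80, 160]
print(sum(old[:2]), sum(new[:2]))   # 90 30
```

Under this model, two transient failures cost 90s of waiting with the old settings versus 30s with the new ones, which lines up with the "60-90s wasted on retries" figure in the Changes list.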

…oMindAI#137)

- P0: Parallelize MCP tool server initialization with asyncio.gather()
  (saves ~40-50s per task, previously ~71s sequential)
- P0: Reduce LLM retry base_wait_time from 30s to 10s, max_retries from 10 to 5
- P1: Add httpx connection pooling for Serper API requests
  (reuse TCP connections across ~346 search calls per task)

Ref: MiroMindAI#137

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

wangbinluo commented Mar 17, 2026

@shawnlimn @xingxuanli Could you review this PR when you get a chance?

This PR addresses the performance bottlenecks identified in #137 with two optimizations:

  1. Parallel tool server init (asyncio.gather) — measured 13.8x speedup (234s → 17s avg)
  2. httpx connection pooling — reuse TCP connections for Serper API (~346 calls per BC task)

Note: PR #138 by @JasonOA888 covers only item 1; this PR is a superset with additional optimization and benchmark data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wangbinluo and others added 2 commits March 17, 2026 14:54
Algorithm team confirmed these values are needed for reliability under
high load with self-hosted sglang servers. Ref: MiroMindAI#137

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…iroMindAI#137)

Previously every tool call (except playwright) spawned a new MCP server
subprocess, initialized it, called the tool, then killed it.  BC tasks
average ~300+ tool calls, so the spawn/teardown overhead adds up.

Introduce PersistentMCPSession that keeps the subprocess alive for the
entire task lifetime.  On connection failure it transparently reconnects
once.  Sessions are cleaned up via close_all_sessions() at task end.

This is the P0 "MCP server connection reuse" item from MiroMindAI#137, estimated
to save 2-5 min per task.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Evaluation pipeline performance bottlenecks: BC benchmarks 3-5x slower than necessary
