Investigate validator throughput ceiling and reshape dispatch hot path#494

Merged
HudsonGraeme merged 2 commits into testnet from investigate/validator-bottlenecks
May 13, 2026

@HudsonGraeme HudsonGraeme commented May 13, 2026

Context

Mainnet profiling captured a validator running 26h:

  • one main-loop thread pinned at 100%, fifteen tokio workers idle (load avg ~1 on a 16-core box, 94% CPU idle overall)
  • 26 GB RSS + 11 GB swap; [heap] region at 16 GB in pmap
  • stacked_dslice_queue health log showed ~31k entries (each a DSliceRequest holding a cloned Circuit plus a serde_json::Value input tree)
  • throughput ~17 proofs/sec against ~243 queryable miners; ~30% of miner queries failing with reconnect-in-flight transport errors

The ceiling was the validator itself: dispatch_ceiling = verification_concurrency * 2 = 32 in-flight tasks across the entire metagraph, and dispatch_requests() was cloning the full NeuronInfo Vec plus running flat_map+sort over every miner's history on each call (triggered by every task/verify completion and every tick).

What this changes

Queue payload shape: DSliceRequest.circuit: Circuit → Arc<Circuit>, and inputs: serde_json::Value / outputs: Option<serde_json::Value> → bytes::Bytes / Option<Bytes> containing msgpack-encoded values built once at queue-insert time via a new input_data_payload helper. The dispatch path decodes once per dispatched request (decode_msgpack_to_json) to keep the existing DSliceProofGenerationDataModel wire compat. A 30k-entry queue drops from hundreds of MB of fragmented enum trees to a few MB of contiguous blobs plus 16-byte Arc refs. Dead Request struct removed.
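
As a rough illustration of the queue-entry shape described above, the following std-only sketch shares one circuit allocation across all queued entries and stores each payload as an opaque, already-encoded byte blob. The names `QueuedSlice` and `enqueue_tiles` are hypothetical, the `Circuit` fields are placeholders, and `Arc<[u8]>` stands in for `bytes::Bytes` so the example needs no external crates; it is not the PR's actual code.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the real `Circuit`; the fields are placeholders.
#[derive(Clone)]
pub struct Circuit {
    pub name: String,
    pub metadata: Vec<u8>,
}

/// Simplified queue entry: a shared circuit handle plus an opaque,
/// already-encoded payload. `Arc<[u8]>` stands in for `bytes::Bytes`.
pub struct QueuedSlice {
    pub circuit: Arc<Circuit>,
    pub inputs: Arc<[u8]>,
    pub outputs: Option<Arc<[u8]>>,
}

/// Build one queue entry per pre-encoded payload. Every entry points at the
/// same `Circuit` allocation; only the reference count is bumped.
pub fn enqueue_tiles(circuit: &Arc<Circuit>, payloads: Vec<Vec<u8>>) -> Vec<QueuedSlice> {
    payloads
        .into_iter()
        .map(|payload| QueuedSlice {
            circuit: Arc::clone(circuit),
            inputs: Arc::from(payload),
            outputs: None,
        })
        .collect()
}
```

With this shape, the per-entry cost is the payload bytes plus a pointer to the shared circuit, rather than a full circuit copy and a tree of heap-allocated JSON nodes.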

Dispatch hot path — new DispatchCache (capacities, adaptive_timeout, api_eligible) refreshed lazily with a 2 s TTL, so the per-call flat_map+sort over miner history and the snapshot HashMap clones run at most once every two seconds instead of hundreds of times per second. Per-dispatch full NeuronInfo clone replaced with a Vec<u16> of UIDs + index shuffle; NeuronInfo resolved just-in-time via the existing O(1) uid_to_idx index. spawn_miner_task now takes owned (ip, port, hotkey) instead of &NeuronInfo, removing the aliased self-borrow that forced the full clone.
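
The lazy TTL refresh can be sketched roughly as follows. This is a minimal std-only illustration, not the PR's implementation: the `DispatchSnapshot` type, the `get_or_refresh` method, the `refresh_count` field, and the choice to pass `now` in as a parameter (so the behaviour can be exercised without sleeping) are all assumptions made for the example.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical bundle of the derived values the dispatch path reads.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct DispatchSnapshot {
    pub capacities: HashMap<u16, usize>,
    pub adaptive_timeout: Duration,
    pub api_eligible: Vec<u16>,
}

/// Lazily refreshed cache: the expensive recomputation runs at most once per TTL.
pub struct DispatchCache {
    ttl: Duration,
    refreshed_at: Option<Instant>,
    snapshot: DispatchSnapshot,
    /// Counts recomputations; included only to make the behaviour observable.
    pub refresh_count: u64,
}

impl DispatchCache {
    pub fn new(ttl: Duration) -> Self {
        Self {
            ttl,
            refreshed_at: None,
            snapshot: DispatchSnapshot::default(),
            refresh_count: 0,
        }
    }

    /// Return the cached snapshot, calling `recompute` only when the cache
    /// has never been filled or is at least `ttl` old.
    pub fn get_or_refresh<F>(&mut self, now: Instant, recompute: F) -> &DispatchSnapshot
    where
        F: FnOnce() -> DispatchSnapshot,
    {
        let stale = match self.refreshed_at {
            None => true,
            Some(at) => now.saturating_duration_since(at) >= self.ttl,
        };
        if stale {
            self.snapshot = recompute();
            self.refreshed_at = Some(now);
            self.refresh_count += 1;
        }
        &self.snapshot
    }
}
```

Calls that arrive within the TTL return the existing snapshot without invoking the closure, which is what moves the history scan and map clones off the per-call path.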

Throughput ceiling: dispatch_ceiling bumped from verification_concurrency * 2 to verification_concurrency * 8 (32 → 128 on a 16-core box). The verification backpressure check was conflating in-flight verification (CPU-bound, must stay near verification_concurrency) with pending-but-not-yet-verifying results (memory-bound, can buffer). They now have independent caps, so I/O fanout to miners is no longer throttled by CPU-bound proof verification draining.
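
One way to picture the decoupled caps is the sketch below. Only the 8x multiplier and the idea of separate limits come from the description above; the `Limits` and `Load` types, the two predicate functions, and the way the pending cap is supplied by the caller are assumptions for illustration.

```rust
/// Illustrative limits derived from `verification_concurrency`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Limits {
    pub dispatch_ceiling: usize,
    pub verify_tasks_cap: usize,
    pub pending_verifications_cap: usize,
}

impl Limits {
    /// The 8x multiplier is from the PR text; the pending cap value is
    /// left to the caller because the PR text does not state it.
    pub fn new(verification_concurrency: usize, pending_verifications_cap: usize) -> Self {
        Self {
            dispatch_ceiling: verification_concurrency * 8,
            verify_tasks_cap: verification_concurrency,
            pending_verifications_cap,
        }
    }
}

/// Current counters, as a hypothetical snapshot of validator state.
#[derive(Debug, Clone, Copy, Default)]
pub struct Load {
    pub in_flight_miner_tasks: usize,
    pub verify_tasks: usize,
    pub pending_verifications: usize,
}

/// Dispatch is gated by network fanout and buffered results only,
/// not by how many verifications are currently running.
pub fn can_dispatch(limits: &Limits, load: &Load) -> bool {
    load.in_flight_miner_tasks < limits.dispatch_ceiling
        && load.pending_verifications < limits.pending_verifications_cap
}

/// Starting a verification is gated by the CPU-bound cap alone.
pub fn can_start_verification(limits: &Limits, load: &Load) -> bool {
    load.pending_verifications > 0 && load.verify_tasks < limits.verify_tasks_cap
}
```

In this model a validator whose verification slots are all busy can still fan out to miners as long as the pending buffer has room, which matches the intent stated above.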

Deferred to follow-ups

  • Eliminate the dispatch-time msgpack→JSON transcode (push rmpv::Value through DSliceProofGenerationDataModel and miner handlers). Requires coordinated miner release.
  • Custom global allocator (mimalloc or tikv-jemallocator). Re-measure RSS 24h after this lands; if [heap] still climbs after the queue compaction, do it.
  • Lift verification_concurrency itself once the CPU headroom freed by the dispatch-cache changes is confirmed in production.

Verification

cargo check --workspace, cargo clippy --workspace --tests -- -D warnings, cargo fmt --check, cargo test --workspace --lib, cargo build -p sn2-validator --release all clean.

Summary by CodeRabbit

  • New Features

    • MessagePack support for tensor inputs and validation against optional input schemas
    • Exposed utilities for encoding/decoding MessagePack payloads
  • Bug Fixes & Improvements

    • Switched request payloads from JSON to compact binary format for lower overhead
    • Request dispatch caching and improved miner selection with adaptive timeout handling

Profiling on mainnet showed one main-loop thread saturated at 100% while
fifteen tokio workers idled, a thirty-thousand-entry dslice queue
materialised as serde_json::Value trees with by-value Circuit clones, and
in-flight task fanout capped at thirty-two against ~240 queryable miners.

Replace queue payloads with Arc<Circuit> and msgpack bytes so each entry
collapses from kilobytes of nested Value enums to a single contiguous
allocation. Introduce a DispatchCache that memoises miner capacities,
adaptive timeout, and the api-eligible set with a two-second TTL so the
per-call flat_map+sort and HashMap clones no longer pin a core. Replace
the per-dispatch full NeuronInfo clone with a Vec<u16> of UIDs plus
just-in-time lookup through the existing uid_to_idx index. Raise the
dispatch ceiling from 2x to 8x verification_concurrency and decouple
the pending_verifications cap from the verify_tasks cap so I/O fanout
is no longer gated by CPU-bound proof verification.
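
The UID-based selection described in this commit message might look roughly like the sketch below: copy only the UIDs, shuffle them, and resolve connection details through the index at the moment a task is spawned. The `Metagraph` wrapper, the `NeuronInfo` fields, the method names, and the inline xorshift generator (a stand-in for whatever RNG the real code uses) are all assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down neuron record.
#[derive(Debug, Clone, PartialEq)]
pub struct NeuronInfo {
    pub uid: u16,
    pub ip: String,
    pub port: u16,
    pub hotkey: String,
}

pub struct Metagraph {
    neurons: Vec<NeuronInfo>,
    uid_to_idx: HashMap<u16, usize>,
}

impl Metagraph {
    pub fn new(neurons: Vec<NeuronInfo>) -> Self {
        let uid_to_idx = neurons.iter().enumerate().map(|(i, n)| (n.uid, i)).collect();
        Self { neurons, uid_to_idx }
    }

    /// Copy only the UIDs (two bytes each) and shuffle them in place with
    /// Fisher-Yates. The xorshift64 generator is illustrative only.
    pub fn shuffled_uids(&self, seed: u64) -> Vec<u16> {
        let mut uids: Vec<u16> = self.neurons.iter().map(|n| n.uid).collect();
        let mut state = seed | 1; // xorshift state must be non-zero
        for i in (1..uids.len()).rev() {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            let j = (state % (i as u64 + 1)) as usize;
            uids.swap(i, j);
        }
        uids
    }

    /// Resolve one UID to owned connection details just before spawning a
    /// task, so no borrow of the neuron list outlives this call.
    pub fn endpoint(&self, uid: u16) -> Option<(String, u16, String)> {
        let idx = *self.uid_to_idx.get(&uid)?;
        let neuron = self.neurons.get(idx)?;
        Some((neuron.ip.clone(), neuron.port, neuron.hotkey.clone()))
    }
}
```

Returning an owned `(ip, port, hotkey)` tuple mirrors the new `spawn_miner_task` signature: the spawned task holds no reference into the metagraph, so the caller never needs a full clone of the neuron list.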

coderabbitai Bot commented May 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a9857c8c-2462-42e7-bd74-150073d52644

📥 Commits

Reviewing files that changed from the base of the PR and between 3bbc8ee and 69dbb71.

📒 Files selected for processing (4)
  • Cargo.toml
  • crates/sn2-types/src/miner_response.rs
  • crates/sn2-validator/src/validator_loop/dispatch.rs
  • crates/sn2-validator/src/validator_loop/mod.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/sn2-validator/src/validator_loop/dispatch.rs

Walkthrough

This PR migrates tensor inputs from JSON to MessagePack bytes, adds MessagePack tensor codec helpers and Circuit msgpack validation, updates request types to use bytes::Bytes and Arc<Circuit>, refactors the dslice staging/dispatch pipeline to use binary payloads, and adds a dispatch cache used during miner selection.

Changes

MessagePack codec and DSlice byte-oriented pipeline

| Layer | File(s) | Summary |
|---|---|---|
| MessagePack codec foundation and dependencies | Cargo.toml, crates/sn2-types/Cargo.toml, crates/sn2-validator/Cargo.toml, crates/sn2-types/src/tensor_codec.rs, crates/sn2-types/src/lib.rs, crates/sn2-types/src/circuit.rs | Adds rmpv, rmp-serde, and bytes deps; implements arrayd_to_msgpack_value, encode_msgpack_value, input_data_payload, decode_msgpack_value, decode_msgpack_to_json; re-exports codec helpers and adds Circuit::validate_inputs_msgpack. |
| DSliceRequest and MinerResponse type migration to bytes and Arc | crates/sn2-types/src/request.rs, crates/sn2-types/src/miner_response.rs, Cargo.toml | Removes old Request; updates DSliceRequest to use Arc<Circuit>, bytes::Bytes for inputs/outputs and drops Serde derives; updates MinerResponse.circuit to Option<Arc<Circuit>>; enables serde rc feature in workspace. |
| DSlice submission pipeline: Arc and bytes migration | crates/sn2-validator/src/validator_loop/dslice.rs | Refactors enqueue/staging to accept &Arc<Circuit>, wraps circuits in Arc, converts tiled and non-tiled tile payloads to bytes::Bytes via input_data_payload, and updates tile request builder and benchmark enqueue call sites. |
| Dispatch cache and selection adjustments | crates/sn2-validator/src/validator_loop/dispatch.rs, crates/sn2-validator/src/validator_loop/mod.rs | Adds DispatchCache with TTL, capacity lookup, adaptive timeout, and api_eligible set; refreshes cache when stale; decodes msgpack inputs to JSON during dispatch selection; refactors spawn_miner_task signature; removes old neuron-based compute_api_eligible. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

run-build

Poem

🐰 Hopping bytes instead of JSON down the lane,
Arc-wrapped circuits share without the strain,
Msgpack carrots packed—compact and neat,
Dispatch cache hums, keeping miners fleet,
A rabbit cheers: small bytes, big gain!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 13.04% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the PR's main objective: investigating validator throughput bottlenecks and optimizing the dispatch hot path with caching, binary payloads, and higher concurrency ceilings. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
crates/sn2-validator/src/validator_loop/dispatch.rs (1)

265-266: 🏗️ Heavy lift

Avoid cloning the full Circuit back into each dispatched dslice.

Arc<Circuit> shrinks the queue, but Some((*dslice.circuit).clone()) reintroduces one full circuit allocation per in-flight request. With the higher dispatch ceiling, that can eat into the memory win from this PR. If the verification path can accept shared ownership, carry Arc<Circuit> through DispatchedRequest/MinerResponse instead.
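
To make the distinction in this comment concrete, here is a small std-only sketch contrasting a deep copy of the inner value with a clone of the `Arc` handle. The `Circuit` field and the function names `clone_inner` and `dispatch_shared` are placeholders, and `DispatchedRequest` is reduced to the single field under discussion.

```rust
use std::sync::Arc;

/// Placeholder circuit; the real type is assumed to be much larger.
#[derive(Clone)]
pub struct Circuit {
    pub constraints: Vec<u64>,
}

/// Reduced to the one field relevant to the review comment.
pub struct DispatchedRequest {
    pub task_circuit: Option<Arc<Circuit>>,
}

/// Deep copy: dereferences the `Arc` and clones the `Circuit` itself,
/// allocating a fresh copy of its contents for every request.
pub fn clone_inner(circuit: &Arc<Circuit>) -> Circuit {
    (**circuit).clone()
}

/// Handle copy: increments the reference count and shares the allocation.
pub fn dispatch_shared(circuit: &Arc<Circuit>) -> DispatchedRequest {
    DispatchedRequest {
        task_circuit: Some(Arc::clone(circuit)),
    }
}
```

Writing `Arc::clone(circuit)` rather than `circuit.clone()` is a common convention that makes it obvious at the call site that only the handle is being copied.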

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/sn2-validator/src/validator_loop/dispatch.rs` around lines 265 - 266,
The code is cloning the full Circuit into each dispatched dslice via
Some((*dslice.circuit).clone()), undoing the Arc memory benefit; change the
field types for task_circuit in DispatchedRequest and MinerResponse to
Option<Arc<Circuit>> and stop cloning the inner Circuit—set task_circuit to
Some(dslice.circuit.clone()) (cloning the Arc, not the Circuit) and update all
downstream consumers (verification path) to accept Arc<Circuit> shared ownership
instead of owned Circuit so no full Circuit allocations are reintroduced.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/sn2-validator/src/validator_loop/dispatch.rs`:
- Around line 226-230: The call to sn2_types::decode_msgpack_to_json currently
uses unwrap_or_default on dslice.inputs which hides decoding errors; replace
that with explicit error handling: call decode_msgpack_to_json(&dslice.inputs)
and match the Result, and on Err log or propagate the decoding error (including
the error details and uid/dslice id) and drop/fail the request instead of
constructing dslice_model and calling self.pipeline.prepare_dslice_request; only
continue to call prepare_dslice_request on Ok(inputs_json). Ensure references in
the fix are to decode_msgpack_to_json, dslice.inputs, prepare_dslice_request and
remove the unwrap_or_default usage.
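
The difference between the masked and explicit handling requested here can be sketched as follows. The decoder is a deliberately trivial stand-in (UTF-8 validation in place of msgpack-to-JSON decoding), and `DecodeError`, `SliceId`, `decode_payload`, `prepare_or_drop`, and `prepare_masking_errors` are hypothetical names chosen for the example.

```rust
use std::fmt;

/// Hypothetical decode error carrying a human-readable reason.
#[derive(Debug, PartialEq)]
pub struct DecodeError(pub String);

impl fmt::Display for DecodeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "decode failed: {}", self.0)
    }
}

/// Stand-in decoder: treats the payload as UTF-8 text. The real code
/// decodes msgpack into JSON; only the `Result` shape matters here.
pub fn decode_payload(bytes: &[u8]) -> Result<String, DecodeError> {
    std::str::from_utf8(bytes)
        .map(str::to_owned)
        .map_err(|e| DecodeError(e.to_string()))
}

/// Identifiers used to key the log line (assumed fields).
pub struct SliceId {
    pub uid: u16,
    pub slice_num: u32,
    pub tile_idx: u32,
}

/// Explicit handling: log the failure with its identifiers and drop the
/// request by returning `None`, so nothing is sent with empty inputs.
pub fn prepare_or_drop(id: &SliceId, inputs: &[u8]) -> Option<String> {
    match decode_payload(inputs) {
        Ok(decoded) => Some(decoded),
        Err(err) => {
            eprintln!(
                "dropping dslice uid={} slice_num={} tile_idx={}: {}",
                id.uid, id.slice_num, id.tile_idx, err
            );
            None
        }
    }
}

/// The pattern being replaced: a decode failure silently becomes an
/// empty default value and the request proceeds anyway.
pub fn prepare_masking_errors(inputs: &[u8]) -> String {
    decode_payload(inputs).unwrap_or_default()
}
```

With the explicit form, a corrupt payload produces a log line and no outgoing request, whereas the masked form would forward an empty value that the miner could not meaningfully prove against.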

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e08dd6e5-d208-44d2-b16b-68347da1eae3

📥 Commits

Reviewing files that changed from the base of the PR and between 4926e40 and 3bbc8ee.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • Cargo.toml
  • crates/sn2-types/Cargo.toml
  • crates/sn2-types/src/circuit.rs
  • crates/sn2-types/src/lib.rs
  • crates/sn2-types/src/request.rs
  • crates/sn2-types/src/tensor_codec.rs
  • crates/sn2-validator/Cargo.toml
  • crates/sn2-validator/src/validator_loop/dispatch.rs
  • crates/sn2-validator/src/validator_loop/dslice.rs
  • crates/sn2-validator/src/validator_loop/mod.rs

Comment thread crates/sn2-validator/src/validator_loop/dispatch.rs Outdated
The previous Arc<Circuit> rework still cloned the inner Circuit at the
dispatch boundary (Some((*dslice.circuit).clone())) and silently masked
msgpack decode errors via unwrap_or_default on inputs, both flagged in
review.

Propagate Arc<Circuit> through DispatchedRequest.task_circuit and
MinerResponse.circuit so dispatch clones only the Arc handle; enable
serde rc feature so the existing derives accept Arc transparently. RWR
path retains a single Arc::new wrap because ensure_circuit still returns
an owned Circuit (out of scope for this change). Replace unwrap_or_default
on decode_msgpack_to_json with explicit Err logging keyed on uid /
run_uid / slice_num / tile_idx; drop the request on decode failure
rather than constructing a request with Null inputs.
@HudsonGraeme HudsonGraeme merged commit 0291f33 into testnet May 13, 2026
18 checks passed
@HudsonGraeme HudsonGraeme deleted the investigate/validator-bottlenecks branch May 13, 2026 16:43