Router multiget high latency test #2477

huangminchn · 2026-02-12T00:09:14Z

Problem Statement

Solution

Code changes

Added new code behind a config. If so list the config names and their default values in the PR description.
Introduced new log lines.
- Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

Code has no race conditions or thread safety issues.
Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

New unit tests added.
New integration tests added.
Modified or extended existing tests.
Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

No. You can skip the rest of this section.
Yes. Clearly explain the behavior change and its impact.

…pikes

Prior investigation proved that pipeline_latency (HTTP decode to scatter-gather handler entry) accounts for ~97% of the 1-second P99 multiget_streaming latency spikes. However, pipeline_latency is a monolithic span with no sub-breakdown, making it impossible to distinguish between three remaining root cause candidates:

1. EventLoop write contention from slow client I/O
2. Venice handler chain intermittent slowness (throttle/ACL/metadata)
3. EventLoop task queue buildup

This change adds targeted diagnostic metrics to differentiate these causes.

New latency sub-breakdown (P50/P95/P99):
- pre_handler_latency: decode -> first Venice handler entry (captures EventLoop queuing and chunk aggregation time)
- handler_chain_latency: first Venice handler -> last Venice handler (captures Venice handler processing time: throttle, ACL, etc.)

New infrastructure gauges:
- eventloop_pending_tasks_avg/max: pending tasks across worker EventLoop threads (detects EventLoop task queue buildup)
- unwritable_channel_count: channels with full write buffers (detects slow client write backpressure)

Key diagnostic relationships:
- pre_handler_latency spike -> EventLoop starvation
- handler_chain_latency spike -> Venice handler slowness
- eventloop_pending_tasks_max spike -> task queue buildup
- unwritable_channel_count spike -> client write backpressure

Co-Authored-By: Claude Opus 4.6 <[email protected]>
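For readers following the metric definitions above, here is a minimal sketch of how the two latency sub-spans and the three gauges could be derived from raw samples. It is not the code in this PR: the class name `RouterDiagnosticsSketch`, the method names, and the array inputs are all hypothetical, and it assumes nanosecond timestamps captured at decode, first Venice handler entry, and last Venice handler. In a Netty-based router the per-EventLoop counts would presumably come from something like `SingleThreadEventExecutor.pendingTasks()` and the writability flags from `Channel.isWritable()`, but that wiring is an assumption and is left out so the example runs on the JDK alone.

```java
import java.util.stream.IntStream;

/** Hypothetical illustration only; names do not come from the PR. */
final class RouterDiagnosticsSketch {
  private RouterDiagnosticsSketch() {}

  /** pre_handler_latency: decode -> first Venice handler entry, in milliseconds. */
  static double preHandlerLatencyMs(long decodeNanos, long firstHandlerNanos) {
    return (firstHandlerNanos - decodeNanos) / 1_000_000.0;
  }

  /** handler_chain_latency: first Venice handler -> last Venice handler, in milliseconds. */
  static double handlerChainLatencyMs(long firstHandlerNanos, long lastHandlerNanos) {
    return (lastHandlerNanos - firstHandlerNanos) / 1_000_000.0;
  }

  /** eventloop_pending_tasks_avg: mean pending-task count across worker EventLoops (0 if none). */
  static double pendingTasksAvg(int[] pendingPerEventLoop) {
    return IntStream.of(pendingPerEventLoop).average().orElse(0.0);
  }

  /** eventloop_pending_tasks_max: largest pending-task count on any single EventLoop (0 if none). */
  static int pendingTasksMax(int[] pendingPerEventLoop) {
    return IntStream.of(pendingPerEventLoop).max().orElse(0);
  }

  /** unwritable_channel_count: number of channels whose writability flag is false. */
  static long unwritableChannelCount(boolean[] channelWritable) {
    long count = 0;
    for (boolean writable : channelWritable) {
      if (!writable) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Made-up sample: a request that waited a long time before reaching the first handler.
    long decode = 0L;
    long firstHandler = 900_000_000L;
    long lastHandler = 905_000_000L;
    System.out.println("pre_handler_latency_ms=" + preHandlerLatencyMs(decode, firstHandler));
    System.out.println("handler_chain_latency_ms=" + handlerChainLatencyMs(firstHandler, lastHandler));

    // Made-up snapshot of three worker EventLoops and three client channels.
    int[] pending = {0, 2, 10};
    boolean[] writable = {true, false, false};
    System.out.println("eventloop_pending_tasks_avg=" + pendingTasksAvg(pending));
    System.out.println("eventloop_pending_tasks_max=" + pendingTasksMax(pending));
    System.out.println("unwritable_channel_count=" + unwritableChannelCount(writable));
  }
}
```

In the made-up sample, almost all of the time falls in the pre-handler span, which under the diagnostic relationships listed above would point toward EventLoop starvation rather than Venice handler slowness. Reporting both the average and the maximum for pending tasks matters because a single backed-up EventLoop can be hidden by the average.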

huangminchn and others added 2 commits February 11, 2026 16:08

Even more metrics

72919b3
