-
-
Notifications
You must be signed in to change notification settings - Fork 9.4k
Feat/sliding window metrics — Related to #22480 #22488
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Feat/sliding window metrics — Related to #22480 #22488
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces sliding window metrics for enhanced performance monitoring, which is a valuable addition. However, I've identified two critical issues that need to be addressed. First, there's a bug in SlidingWindowMetric
where the window size is hardcoded, which will cause incorrect behavior for any non-default configuration. Second, in PrometheusStatLogger
, variables are used before they are defined, which will lead to a NameError
at runtime. Please see my detailed comments for suggestions on how to fix these issues.
# Build a long-lived SlidingWindowStats to accumulate across iterations | ||
window_size = getattr(vllm_config.observability_config, | ||
"sliding_window_request_count", 100) | ||
self.sliding_window_stats = SlidingWindowStats(window_size=window_size) | ||
|
||
# Exporter for sliding window averages as gauges | ||
# We use raw prometheus_client here to align with the existing pattern | ||
# and avoid extra module-level registration. | ||
self._sliding_latency_ms = self._gauge_cls( | ||
name="vllm:sliding_window_avg_latency_ms", | ||
documentation= | ||
"Average latency (ms) over last N finished requests (request-count window)", | ||
multiprocess_mode="mostrecent", | ||
labelnames=labelnames + ["window_size"], | ||
) | ||
self._sliding_throughput = self._gauge_cls( | ||
name="vllm:sliding_window_avg_throughput_tokens_per_sec", | ||
documentation= | ||
"Average throughput (tokens/s) over last N finished requests", | ||
multiprocess_mode="mostrecent", | ||
labelnames=labelnames + ["window_size"], | ||
) | ||
self._sliding_ttft_ms = self._gauge_cls( | ||
name="vllm:sliding_window_avg_ttft_ms", | ||
documentation= | ||
"Average time-to-first-token (ms) over last N finished requests", | ||
multiprocess_mode="mostrecent", | ||
labelnames=labelnames + ["window_size"], | ||
) | ||
self._sliding_queued_ms = self._gauge_cls( | ||
name="vllm:sliding_window_avg_queued_time_ms", | ||
documentation="Average queued time (ms) over last N finished requests", | ||
multiprocess_mode="mostrecent", | ||
labelnames=labelnames + ["window_size"], | ||
) | ||
self._sliding_prefill_ms = self._gauge_cls( | ||
name="vllm:sliding_window_avg_prefill_time_ms", | ||
documentation="Average prefill time (ms) over last N finished requests", | ||
multiprocess_mode="mostrecent", | ||
labelnames=labelnames + ["window_size"], | ||
) | ||
self._sliding_decode_ms = self._gauge_cls( | ||
name="vllm:sliding_window_avg_decode_time_ms", | ||
documentation="Average decode time (ms) over last N finished requests", | ||
multiprocess_mode="mostrecent", | ||
labelnames=labelnames + ["window_size"], | ||
) | ||
# Per-engine label helpers for these gauges | ||
self._sliding_labels = { | ||
idx: { | ||
"latency": self._sliding_latency_ms.labels( | ||
model_name, str(idx), str(window_size)), | ||
"throughput": self._sliding_throughput.labels( | ||
model_name, str(idx), str(window_size)), | ||
"ttft": self._sliding_ttft_ms.labels( | ||
model_name, str(idx), str(window_size)), | ||
"queued": self._sliding_queued_ms.labels( | ||
model_name, str(idx), str(window_size)), | ||
"prefill": self._sliding_prefill_ms.labels( | ||
model_name, str(idx), str(window_size)), | ||
"decode": self._sliding_decode_ms.labels( | ||
model_name, str(idx), str(window_size)), | ||
} | ||
for idx in self.engine_indexes | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variables labelnames
and model_name
are used in this block before they are defined (on lines 235-236). This will cause a NameError
at runtime.
To fix this, I've added their definitions at the beginning of this block in the suggestion. Please make sure to remove the original definitions on lines 235-236 to avoid having them defined too late.
# Build a long-lived SlidingWindowStats to accumulate across iterations
labelnames = ["model_name", "engine"]
model_name = vllm_config.model_config.served_model_name
window_size = getattr(vllm_config.observability_config,
"sliding_window_request_count", 100)
self.sliding_window_stats = SlidingWindowStats(window_size=window_size)
# Exporter for sliding window averages as gauges
# We use raw prometheus_client here to align with the existing pattern
# and avoid extra module-level registration.
self._sliding_latency_ms = self._gauge_cls(
name="vllm:sliding_window_avg_latency_ms",
documentation=
"Average latency (ms) over last N finished requests (request-count window)",
multiprocess_mode="mostrecent",
labelnames=labelnames + ["window_size"],
)
self._sliding_throughput = self._gauge_cls(
name="vllm:sliding_window_avg_throughput_tokens_per_sec",
documentation=
"Average throughput (tokens/s) over last N finished requests",
multiprocess_mode="mostrecent",
labelnames=labelnames + ["window_size"],
)
self._sliding_ttft_ms = self._gauge_cls(
name="vllm:sliding_window_avg_ttft_ms",
documentation=
"Average time-to-first-token (ms) over last N finished requests",
multiprocess_mode="mostrecent",
labelnames=labelnames + ["window_size"],
)
self._sliding_queued_ms = self._gauge_cls(
name="vllm:sliding_window_avg_queued_time_ms",
documentation="Average queued time (ms) over last N finished requests",
multiprocess_mode="mostrecent",
labelnames=labelnames + ["window_size"],
)
self._sliding_prefill_ms = self._gauge_cls(
name="vllm:sliding_window_avg_prefill_time_ms",
documentation="Average prefill time (ms) over last N finished requests",
multiprocess_mode="mostrecent",
labelnames=labelnames + ["window_size"],
)
self._sliding_decode_ms = self._gauge_cls(
name="vllm:sliding_window_avg_decode_time_ms",
documentation="Average decode time (ms) over last N finished requests",
multiprocess_mode="mostrecent",
labelnames=labelnames + ["window_size"],
)
# Per-engine label helpers for these gauges
self._sliding_labels = {
idx: {
"latency": self._sliding_latency_ms.labels(
model_name, str(idx), str(window_size)),
"throughput": self._sliding_throughput.labels(
model_name, str(idx), str(window_size)),
"ttft": self._sliding_ttft_ms.labels(
model_name, str(idx), str(window_size)),
"queued": self._sliding_queued_ms.labels(
model_name, str(idx), str(window_size)),
"prefill": self._sliding_prefill_ms.labels(
model_name, str(idx), str(window_size)),
"decode": self._sliding_decode_ms.labels(
model_name, str(idx), str(window_size)),
}
for idx in self.engine_indexes
}
vllm/v1/metrics/stats.py
Outdated
class SlidingWindowMetric: | ||
"""Represents a single sliding window metric.""" | ||
window_size: int | ||
values: deque = field(default_factory=lambda: deque(maxlen=100)) | ||
|
||
def add_value(self, value: float) -> None: | ||
"""Add a new value to the sliding window.""" | ||
self.values.append(value) | ||
|
||
@property | ||
def average(self) -> float: | ||
"""Calculate the average of values in the window.""" | ||
if not self.values: | ||
return 0.0 | ||
return sum(self.values) / len(self.values) | ||
|
||
@property | ||
def count(self) -> int: | ||
"""Return the number of values in the window.""" | ||
return len(self.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The values
field in the SlidingWindowMetric
dataclass is initialized with a deque
that has a hardcoded maxlen
of 100. This ignores the window_size
parameter passed during initialization, causing the sliding window to behave incorrectly for any window size other than 100. This is a critical bug.
To fix this, you should use __post_init__
to initialize the deque
with the correct maxlen
based on window_size
.
class SlidingWindowMetric:
"""Represents a single sliding window metric."""
window_size: int
values: deque = field(init=False)
def __post_init__(self):
self.values = deque(maxlen=self.window_size)
def add_value(self, value: float) -> None:
"""Add a new value to the sliding window."""
self.values.append(value)
@property
def average(self) -> float:
"""Calculate the average of values in the window."""
if not self.values:
return 0.0
return sum(self.values) / len(self.values)
@property
def count(self) -> int:
"""Return the number of values in the window."""
return len(self.values)
…st-count-based rolling averages; add tests\n\nImplements sliding window averages for latency, throughput, TTFT, queued/prefill/decode times. Exposes gauges labeled by window_size. Integrates into IterationStats update path. Adds unit tests. Related to vllm-project#22480 Signed-off-by: Jeff Wan <[email protected]>
…cle and add observability config for window size; remove module-level exporter\n\n- Create long-lived SlidingWindowStats in PrometheusStatLogger so values persist across iterations\n- Add ObservabilityConfig.sliding_window_request_count (default 100)\n- Update gauges for latency/throughput/queued/prefill/decode averages labeled by window_size\n- Remove v1/metrics/sliding_window_metrics.py and in-iteration updates Signed-off-by: Jeff Wan <[email protected]>
…atLogger; fix SlidingWindowMetric to honor window_size via __post_init__ Signed-off-by: Jeff Wan <[email protected]>
Signed-off-by: Jeff Wan <[email protected]>
8f42bd3
to
f7dd2c1
Compare
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
This pull request has merge conflicts that must be resolved before it can be |
Essential Elements of an Effective PR Description Checklist
supported_models.md
andexamples
for a new model.Purpose
Add request-count-based sliding-window moving averages to improve stability and interpretability of performance monitoring (e.g., average latency, throughput, and phase times). Metrics update on each finished request and keep the average of the most recent N finished requests, where N is configurable.
Related to #22480.
Key changes:
Introduce a long-lived SlidingWindowStats integrated into PrometheusStatLogger lifecycle (persists across iterations; no per-iteration reset).
Add ObservabilityConfig.sliding_window_request_count (default 100) to configure window size.
Expose Prometheus gauges (with window_size label):
vllm:sliding_window_avg_latency_ms
vllm:sliding_window_avg_throughput_tokens_per_sec
vllm:sliding_window_avg_queued_time_ms
vllm:sliding_window_avg_prefill_time_ms
vllm:sliding_window_avg_decode_time_ms
Note:
TTFT average gauge is reserved for follow-up once per-request TTFT is exposed consistently on finish path.
Test Plan
Unit tests:
Run only new tests:
pytest -q tests/v1/metrics/test_sliding_window_stats.py
Optional full test:
pytest -q
Test Result
Unit tests verify initialization, single-request update, sliding eviction behavior, and empty-window behavior.
Manual: Gauges updated per finished request; values reflect averages over last N completed requests; window_size label matches configuration.
(Optional) Documentation Update
Add an observability/metrics section documenting:
The new gauges and their semantics.
How to configure ObservabilityConfig.sliding_window_request_count.
Reference:
Issue [Feature]: Add Moving Average Statistics for Better Performance Monitoring #22480