Feat/sliding window metrics — Related to #22480 #22488

NumberWan · 2025-08-08T03:29:57Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Add request-count-based sliding-window moving averages to improve stability and interpretability of performance monitoring (e.g., average latency, throughput, and phase times). Metrics update on each finished request and keep the average of the most recent N finished requests, where N is configurable.
Related to #22480.
Key changes:
Introduce a long-lived SlidingWindowStats integrated into PrometheusStatLogger lifecycle (persists across iterations; no per-iteration reset).
Add ObservabilityConfig.sliding_window_request_count (default 100) to configure window size.
Expose Prometheus gauges (with window_size label):
vllm:sliding_window_avg_latency_ms
vllm:sliding_window_avg_throughput_tokens_per_sec
vllm:sliding_window_avg_queued_time_ms
vllm:sliding_window_avg_prefill_time_ms
vllm:sliding_window_avg_decode_time_ms
Note:
TTFT average gauge is reserved for follow-up once per-request TTFT is exposed consistently on finish path.

Test Plan

Unit tests:
Run only new tests:
pytest -q tests/v1/metrics/test_sliding_window_stats.py
Optional full test:
pytest -q

Test Result

Unit tests verify initialization, single-request update, sliding eviction behavior, and empty-window behavior.
Manual: Gauges updated per finished request; values reflect averages over last N completed requests; window_size label matches configuration.

(Optional) Documentation Update

Add an observability/metrics section documenting:
The new gauges and their semantics.
How to configure ObservabilityConfig.sliding_window_request_count.

Allow edits by maintainers.
Reference:
Issue [Feature]: Add Moving Average Statistics for Better Performance Monitoring #22480

gemini-code-assist

Code Review

This pull request introduces sliding window metrics for enhanced performance monitoring, which is a valuable addition. However, I've identified two critical issues that need to be addressed. First, there's a bug in SlidingWindowMetric where the window size is hardcoded, which will cause incorrect behavior for any non-default configuration. Second, in PrometheusStatLogger, variables are used before they are defined, which will lead to a NameError at runtime. Please see my detailed comments for suggestions on how to fix these issues.

gemini-code-assist · 2025-08-08T03:31:30Z

vllm/v1/metrics/loggers.py

+        # Build a long-lived SlidingWindowStats to accumulate across iterations
+        window_size = getattr(vllm_config.observability_config,
+                              "sliding_window_request_count", 100)
+        self.sliding_window_stats = SlidingWindowStats(window_size=window_size)
+
+        # Exporter for sliding window averages as gauges
+        # We use raw prometheus_client here to align with the existing pattern
+        # and avoid extra module-level registration.
+        self._sliding_latency_ms = self._gauge_cls(
+            name="vllm:sliding_window_avg_latency_ms",
+            documentation=
+            "Average latency (ms) over last N finished requests (request-count window)",
+            multiprocess_mode="mostrecent",
+            labelnames=labelnames + ["window_size"],
+        )
+        self._sliding_throughput = self._gauge_cls(
+            name="vllm:sliding_window_avg_throughput_tokens_per_sec",
+            documentation=
+            "Average throughput (tokens/s) over last N finished requests",
+            multiprocess_mode="mostrecent",
+            labelnames=labelnames + ["window_size"],
+        )
+        self._sliding_ttft_ms = self._gauge_cls(
+            name="vllm:sliding_window_avg_ttft_ms",
+            documentation=
+            "Average time-to-first-token (ms) over last N finished requests",
+            multiprocess_mode="mostrecent",
+            labelnames=labelnames + ["window_size"],
+        )
+        self._sliding_queued_ms = self._gauge_cls(
+            name="vllm:sliding_window_avg_queued_time_ms",
+            documentation="Average queued time (ms) over last N finished requests",
+            multiprocess_mode="mostrecent",
+            labelnames=labelnames + ["window_size"],
+        )
+        self._sliding_prefill_ms = self._gauge_cls(
+            name="vllm:sliding_window_avg_prefill_time_ms",
+            documentation="Average prefill time (ms) over last N finished requests",
+            multiprocess_mode="mostrecent",
+            labelnames=labelnames + ["window_size"],
+        )
+        self._sliding_decode_ms = self._gauge_cls(
+            name="vllm:sliding_window_avg_decode_time_ms",
+            documentation="Average decode time (ms) over last N finished requests",
+            multiprocess_mode="mostrecent",
+            labelnames=labelnames + ["window_size"],
+        )
+        # Per-engine label helpers for these gauges
+        self._sliding_labels = {
+            idx: {
+                "latency": self._sliding_latency_ms.labels(
+                    model_name, str(idx), str(window_size)),
+                "throughput": self._sliding_throughput.labels(
+                    model_name, str(idx), str(window_size)),
+                "ttft": self._sliding_ttft_ms.labels(
+                    model_name, str(idx), str(window_size)),
+                "queued": self._sliding_queued_ms.labels(
+                    model_name, str(idx), str(window_size)),
+                "prefill": self._sliding_prefill_ms.labels(
+                    model_name, str(idx), str(window_size)),
+                "decode": self._sliding_decode_ms.labels(
+                    model_name, str(idx), str(window_size)),
+            }
+            for idx in self.engine_indexes
+        }


The variables labelnames and model_name are used in this block before they are defined (on lines 235-236). This will cause a NameError at runtime.

To fix this, I've added their definitions at the beginning of this block in the suggestion. Please make sure to remove the original definitions on lines 235-236 to avoid having them defined too late.

# Build a long-lived SlidingWindowStats to accumulate across iterations labelnames = ["model_name", "engine"] model_name = vllm_config.model_config.served_model_name window_size = getattr(vllm_config.observability_config, "sliding_window_request_count", 100) self.sliding_window_stats = SlidingWindowStats(window_size=window_size) # Exporter for sliding window averages as gauges # We use raw prometheus_client here to align with the existing pattern # and avoid extra module-level registration. self._sliding_latency_ms = self._gauge_cls( name="vllm:sliding_window_avg_latency_ms", documentation= "Average latency (ms) over last N finished requests (request-count window)", multiprocess_mode="mostrecent", labelnames=labelnames + ["window_size"], ) self._sliding_throughput = self._gauge_cls( name="vllm:sliding_window_avg_throughput_tokens_per_sec", documentation= "Average throughput (tokens/s) over last N finished requests", multiprocess_mode="mostrecent", labelnames=labelnames + ["window_size"], ) self._sliding_ttft_ms = self._gauge_cls( name="vllm:sliding_window_avg_ttft_ms", documentation= "Average time-to-first-token (ms) over last N finished requests", multiprocess_mode="mostrecent", labelnames=labelnames + ["window_size"], ) self._sliding_queued_ms = self._gauge_cls( name="vllm:sliding_window_avg_queued_time_ms", documentation="Average queued time (ms) over last N finished requests", multiprocess_mode="mostrecent", labelnames=labelnames + ["window_size"], ) self._sliding_prefill_ms = self._gauge_cls( name="vllm:sliding_window_avg_prefill_time_ms", documentation="Average prefill time (ms) over last N finished requests", multiprocess_mode="mostrecent", labelnames=labelnames + ["window_size"], ) self._sliding_decode_ms = self._gauge_cls( name="vllm:sliding_window_avg_decode_time_ms", documentation="Average decode time (ms) over last N finished requests", multiprocess_mode="mostrecent", labelnames=labelnames + ["window_size"], ) # Per-engine label helpers for these gauges self._sliding_labels = { idx: { "latency": self._sliding_latency_ms.labels( model_name, str(idx), str(window_size)), "throughput": self._sliding_throughput.labels( model_name, str(idx), str(window_size)), "ttft": self._sliding_ttft_ms.labels( model_name, str(idx), str(window_size)), "queued": self._sliding_queued_ms.labels( model_name, str(idx), str(window_size)), "prefill": self._sliding_prefill_ms.labels( model_name, str(idx), str(window_size)), "decode": self._sliding_decode_ms.labels( model_name, str(idx), str(window_size)), } for idx in self.engine_indexes }

gemini-code-assist · 2025-08-08T03:31:30Z

vllm/v1/metrics/stats.py

+class SlidingWindowMetric:
+    """Represents a single sliding window metric."""
+    window_size: int
+    values: deque = field(default_factory=lambda: deque(maxlen=100))
+
+    def add_value(self, value: float) -> None:
+        """Add a new value to the sliding window."""
+        self.values.append(value)
+
+    @property
+    def average(self) -> float:
+        """Calculate the average of values in the window."""
+        if not self.values:
+            return 0.0
+        return sum(self.values) / len(self.values)
+
+    @property
+    def count(self) -> int:
+        """Return the number of values in the window."""
+        return len(self.values)


The values field in the SlidingWindowMetric dataclass is initialized with a deque that has a hardcoded maxlen of 100. This ignores the window_size parameter passed during initialization, causing the sliding window to behave incorrectly for any window size other than 100. This is a critical bug.

To fix this, you should use __post_init__ to initialize the deque with the correct maxlen based on window_size.

class SlidingWindowMetric: """Represents a single sliding window metric.""" window_size: int values: deque = field(init=False) def __post_init__(self): self.values = deque(maxlen=self.window_size) def add_value(self, value: float) -> None: """Add a new value to the sliding window.""" self.values.append(value) @property def average(self) -> float: """Calculate the average of values in the window.""" if not self.values: return 0.0 return sum(self.values) / len(self.values) @property def count(self) -> int: """Return the number of values in the window.""" return len(self.values)

…st-count-based rolling averages; add tests\n\nImplements sliding window averages for latency, throughput, TTFT, queued/prefill/decode times. Exposes gauges labeled by window_size. Integrates into IterationStats update path. Adds unit tests. Related to vllm-project#22480 Signed-off-by: Jeff Wan <[email protected]>

…cle and add observability config for window size; remove module-level exporter\n\n- Create long-lived SlidingWindowStats in PrometheusStatLogger so values persist across iterations\n- Add ObservabilityConfig.sliding_window_request_count (default 100)\n- Update gauges for latency/throughput/queued/prefill/decode averages labeled by window_size\n- Remove v1/metrics/sliding_window_metrics.py and in-iteration updates Signed-off-by: Jeff Wan <[email protected]>

…atLogger; fix SlidingWindowMetric to honor window_size via __post_init__ Signed-off-by: Jeff Wan <[email protected]>

Signed-off-by: Jeff Wan <[email protected]>

github-actions · 2025-08-08T04:14:18Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

mergify · 2025-08-11T02:39:52Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NumberWan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

NumberWan requested review from WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac, alexm-redhat, simon-mo, youkaichao, mgoin, tlrmchlsmth, houseroad and hmellor as code owners August 8, 2025 03:29

mergify bot added the v1 label Aug 8, 2025

gemini-code-assist bot reviewed Aug 8, 2025

View reviewed changes

NumberWan added 4 commits August 8, 2025 11:48

fix(metrics): define labelnames/model_name before use in PrometheusSt…

00870a0

…atLogger; fix SlidingWindowMetric to honor window_size via __post_init__ Signed-off-by: Jeff Wan <[email protected]>

ci: trigger pipelines

f7dd2c1

Signed-off-by: Jeff Wan <[email protected]>

NumberWan force-pushed the feat/sliding-window-metrics branch from 8f42bd3 to f7dd2c1 Compare August 8, 2025 03:48

mergify bot added the needs-rebase label Aug 11, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Feat/sliding window metrics — Related to #22480 #22488

Feat/sliding window metrics — Related to #22480 #22488

NumberWan commented Aug 8, 2025 •

edited by github-actions bot

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Aug 8, 2025

Uh oh!

gemini-code-assist bot Aug 8, 2025

Uh oh!

github-actions bot commented Aug 8, 2025

Uh oh!

mergify bot commented Aug 11, 2025

Uh oh!

Uh oh!

Uh oh!

Feat/sliding window metrics — Related to #22480 #22488

Are you sure you want to change the base?

Feat/sliding window metrics — Related to #22480 #22488

Conversation

NumberWan commented Aug 8, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Essential Elements of an Effective PR Description Checklist

Purpose

Test Plan

Test Result

(Optional) Documentation Update

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Aug 8, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Aug 8, 2025

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Aug 8, 2025

Uh oh!

mergify bot commented Aug 11, 2025

Uh oh!

Uh oh!

NumberWan commented Aug 8, 2025 •

edited by github-actions bot

Loading