Feat/sliding window metrics — Related to #22480 #22488


Open: wants to merge 4 commits into main from feat/sliding-window-metrics

Conversation

@NumberWan NumberWan commented Aug 8, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after comparison or e2e results.
  • (Optional) Any necessary documentation updates, such as updating supported_models.md and examples for a new model.

Purpose

Add request-count-based sliding-window moving averages to improve stability and interpretability of performance monitoring (e.g., average latency, throughput, and phase times). Metrics update on each finished request and keep the average of the most recent N finished requests, where N is configurable.
Related to #22480.
Key changes:
  • Introduce a long-lived SlidingWindowStats integrated into the PrometheusStatLogger lifecycle (persists across iterations; no per-iteration reset).
  • Add ObservabilityConfig.sliding_window_request_count (default 100) to configure the window size.
  • Expose Prometheus gauges (with a window_size label):
      • vllm:sliding_window_avg_latency_ms
      • vllm:sliding_window_avg_throughput_tokens_per_sec
      • vllm:sliding_window_avg_queued_time_ms
      • vllm:sliding_window_avg_prefill_time_ms
      • vllm:sliding_window_avg_decode_time_ms
Note: the TTFT average gauge is reserved for a follow-up, once per-request TTFT is exposed consistently on the finish path.
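The mechanism described above (a rolling average over the most recent N finished requests) can be sketched with a plain deque. This is an illustrative standalone snippet, not the PR's actual API; the class and method names here are hypothetical:

```python
from collections import deque


class RollingAverage:
    """Average over the most recent `window_size` observations."""

    def __init__(self, window_size: int) -> None:
        # deque(maxlen=...) evicts the oldest value automatically.
        self.values: deque = deque(maxlen=window_size)

    def record(self, value: float) -> None:
        self.values.append(value)

    @property
    def average(self) -> float:
        # An empty window reports 0.0 rather than raising.
        return sum(self.values) / len(self.values) if self.values else 0.0


latency_ms = RollingAverage(window_size=3)
for v in [10.0, 20.0, 30.0, 40.0]:
    latency_ms.record(v)  # 10.0 is evicted when the 4th value arrives
print(latency_ms.average)  # → 30.0
```

Because the deque's maxlen handles eviction, the update path stays O(1) per finished request regardless of window size.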

Test Plan

Unit tests:
  • Run only the new tests: pytest -q tests/v1/metrics/test_sliding_window_stats.py
  • Optional full run: pytest -q

Test Result

  • Unit tests verify initialization, single-request updates, sliding eviction, and empty-window behavior.
  • Manual: gauges update per finished request; values reflect averages over the last N completed requests; the window_size label matches the configuration.

(Optional) Documentation Update

Add an observability/metrics section documenting:
  • The new gauges and their semantics.
  • How to configure ObservabilityConfig.sliding_window_request_count.
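For reference, the new option would look roughly like the following. This is a hedged sketch of the config field only (default per the PR description); the real ObservabilityConfig has many other fields, which are omitted here:

```python
from dataclasses import dataclass


@dataclass
class ObservabilityConfig:
    """Illustrative excerpt; only the field added by this PR is shown."""

    # Number of most recent finished requests kept in the sliding window.
    sliding_window_request_count: int = 100


cfg = ObservabilityConfig()
print(cfg.sliding_window_request_count)  # → 100
```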

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces sliding window metrics for enhanced performance monitoring, which is a valuable addition. However, I've identified two critical issues that need to be addressed. First, there's a bug in SlidingWindowMetric where the window size is hardcoded, which will cause incorrect behavior for any non-default configuration. Second, in PrometheusStatLogger, variables are used before they are defined, which will lead to a NameError at runtime. Please see my detailed comments for suggestions on how to fix these issues.

Comment on lines +164 to +231
        # Build a long-lived SlidingWindowStats to accumulate across iterations
        window_size = getattr(vllm_config.observability_config,
                              "sliding_window_request_count", 100)
        self.sliding_window_stats = SlidingWindowStats(window_size=window_size)

        # Exporter for sliding window averages as gauges
        # We use raw prometheus_client here to align with the existing pattern
        # and avoid extra module-level registration.
        self._sliding_latency_ms = self._gauge_cls(
            name="vllm:sliding_window_avg_latency_ms",
            documentation=
            "Average latency (ms) over last N finished requests (request-count window)",
            multiprocess_mode="mostrecent",
            labelnames=labelnames + ["window_size"],
        )
        self._sliding_throughput = self._gauge_cls(
            name="vllm:sliding_window_avg_throughput_tokens_per_sec",
            documentation=
            "Average throughput (tokens/s) over last N finished requests",
            multiprocess_mode="mostrecent",
            labelnames=labelnames + ["window_size"],
        )
        self._sliding_ttft_ms = self._gauge_cls(
            name="vllm:sliding_window_avg_ttft_ms",
            documentation=
            "Average time-to-first-token (ms) over last N finished requests",
            multiprocess_mode="mostrecent",
            labelnames=labelnames + ["window_size"],
        )
        self._sliding_queued_ms = self._gauge_cls(
            name="vllm:sliding_window_avg_queued_time_ms",
            documentation="Average queued time (ms) over last N finished requests",
            multiprocess_mode="mostrecent",
            labelnames=labelnames + ["window_size"],
        )
        self._sliding_prefill_ms = self._gauge_cls(
            name="vllm:sliding_window_avg_prefill_time_ms",
            documentation="Average prefill time (ms) over last N finished requests",
            multiprocess_mode="mostrecent",
            labelnames=labelnames + ["window_size"],
        )
        self._sliding_decode_ms = self._gauge_cls(
            name="vllm:sliding_window_avg_decode_time_ms",
            documentation="Average decode time (ms) over last N finished requests",
            multiprocess_mode="mostrecent",
            labelnames=labelnames + ["window_size"],
        )
        # Per-engine label helpers for these gauges
        self._sliding_labels = {
            idx: {
                "latency": self._sliding_latency_ms.labels(
                    model_name, str(idx), str(window_size)),
                "throughput": self._sliding_throughput.labels(
                    model_name, str(idx), str(window_size)),
                "ttft": self._sliding_ttft_ms.labels(
                    model_name, str(idx), str(window_size)),
                "queued": self._sliding_queued_ms.labels(
                    model_name, str(idx), str(window_size)),
                "prefill": self._sliding_prefill_ms.labels(
                    model_name, str(idx), str(window_size)),
                "decode": self._sliding_decode_ms.labels(
                    model_name, str(idx), str(window_size)),
            }
            for idx in self.engine_indexes
        }

critical

The variables labelnames and model_name are used in this block before they are defined (on lines 235-236). This will cause a NameError at runtime.

To fix this, I've added their definitions at the beginning of this block in the suggestion. Please make sure to remove the original definitions on lines 235-236 to avoid having them defined too late.

        # Build a long-lived SlidingWindowStats to accumulate across iterations
        labelnames = ["model_name", "engine"]
        model_name = vllm_config.model_config.served_model_name
        window_size = getattr(vllm_config.observability_config,
                              "sliding_window_request_count", 100)
        self.sliding_window_stats = SlidingWindowStats(window_size=window_size)

        # Exporter for sliding window averages as gauges
        # We use raw prometheus_client here to align with the existing pattern
        # and avoid extra module-level registration.
        self._sliding_latency_ms = self._gauge_cls(
            name="vllm:sliding_window_avg_latency_ms",
            documentation=
            "Average latency (ms) over last N finished requests (request-count window)",
            multiprocess_mode="mostrecent",
            labelnames=labelnames + ["window_size"],
        )
        self._sliding_throughput = self._gauge_cls(
            name="vllm:sliding_window_avg_throughput_tokens_per_sec",
            documentation=
            "Average throughput (tokens/s) over last N finished requests",
            multiprocess_mode="mostrecent",
            labelnames=labelnames + ["window_size"],
        )
        self._sliding_ttft_ms = self._gauge_cls(
            name="vllm:sliding_window_avg_ttft_ms",
            documentation=
            "Average time-to-first-token (ms) over last N finished requests",
            multiprocess_mode="mostrecent",
            labelnames=labelnames + ["window_size"],
        )
        self._sliding_queued_ms = self._gauge_cls(
            name="vllm:sliding_window_avg_queued_time_ms",
            documentation="Average queued time (ms) over last N finished requests",
            multiprocess_mode="mostrecent",
            labelnames=labelnames + ["window_size"],
        )
        self._sliding_prefill_ms = self._gauge_cls(
            name="vllm:sliding_window_avg_prefill_time_ms",
            documentation="Average prefill time (ms) over last N finished requests",
            multiprocess_mode="mostrecent",
            labelnames=labelnames + ["window_size"],
        )
        self._sliding_decode_ms = self._gauge_cls(
            name="vllm:sliding_window_avg_decode_time_ms",
            documentation="Average decode time (ms) over last N finished requests",
            multiprocess_mode="mostrecent",
            labelnames=labelnames + ["window_size"],
        )
        # Per-engine label helpers for these gauges
        self._sliding_labels = {
            idx: {
                "latency": self._sliding_latency_ms.labels(
                    model_name, str(idx), str(window_size)),
                "throughput": self._sliding_throughput.labels(
                    model_name, str(idx), str(window_size)),
                "ttft": self._sliding_ttft_ms.labels(
                    model_name, str(idx), str(window_size)),
                "queued": self._sliding_queued_ms.labels(
                    model_name, str(idx), str(window_size)),
                "prefill": self._sliding_prefill_ms.labels(
                    model_name, str(idx), str(window_size)),
                "decode": self._sliding_decode_ms.labels(
                    model_name, str(idx), str(window_size)),
            }
            for idx in self.engine_indexes
        }

Comment on lines 190 to 213
class SlidingWindowMetric:
    """Represents a single sliding window metric."""
    window_size: int
    values: deque = field(default_factory=lambda: deque(maxlen=100))

    def add_value(self, value: float) -> None:
        """Add a new value to the sliding window."""
        self.values.append(value)

    @property
    def average(self) -> float:
        """Calculate the average of values in the window."""
        if not self.values:
            return 0.0
        return sum(self.values) / len(self.values)

    @property
    def count(self) -> int:
        """Return the number of values in the window."""
        return len(self.values)

critical

The values field in the SlidingWindowMetric dataclass is initialized with a deque that has a hardcoded maxlen of 100. This ignores the window_size parameter passed during initialization, causing the sliding window to behave incorrectly for any window size other than 100. This is a critical bug.

To fix this, you should use __post_init__ to initialize the deque with the correct maxlen based on window_size.

class SlidingWindowMetric:
    """Represents a single sliding window metric."""
    window_size: int
    values: deque = field(init=False)

    def __post_init__(self):
        self.values = deque(maxlen=self.window_size)

    def add_value(self, value: float) -> None:
        """Add a new value to the sliding window."""
        self.values.append(value)

    @property
    def average(self) -> float:
        """Calculate the average of values in the window."""
        if not self.values:
            return 0.0
        return sum(self.values) / len(self.values)

    @property
    def count(self) -> int:
        """Return the number of values in the window."""
        return len(self.values)
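The suggested fix can be exercised standalone. This sketch adds the @dataclass decorator (implied by the review's wording, but outside the quoted line range) and checks that a non-default window size is honored:

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SlidingWindowMetric:
    """Represents a single sliding window metric."""

    window_size: int
    values: deque = field(init=False)

    def __post_init__(self) -> None:
        # maxlen now tracks window_size instead of a hardcoded 100.
        self.values = deque(maxlen=self.window_size)

    def add_value(self, value: float) -> None:
        self.values.append(value)

    @property
    def average(self) -> float:
        return sum(self.values) / len(self.values) if self.values else 0.0


m = SlidingWindowMetric(window_size=2)
for v in (1.0, 2.0, 3.0):
    m.add_value(v)
print(m.average)  # → 2.5 (only 2.0 and 3.0 remain in the window)
```

With the original default_factory, the third value would not have evicted anything until 100 entries accumulated, regardless of window_size.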

…st-count-based rolling averages; add tests

Implements sliding window averages for latency, throughput, TTFT, queued/prefill/decode times. Exposes gauges labeled by window_size. Integrates into IterationStats update path. Adds unit tests. Related to vllm-project#22480

Signed-off-by: Jeff Wan <[email protected]>
…cle and add observability config for window size; remove module-level exporter

- Create long-lived SlidingWindowStats in PrometheusStatLogger so values persist across iterations
- Add ObservabilityConfig.sliding_window_request_count (default 100)
- Update gauges for latency/throughput/queued/prefill/decode averages labeled by window_size
- Remove v1/metrics/sliding_window_metrics.py and in-iteration updates

Signed-off-by: Jeff Wan <[email protected]>
…atLogger; fix SlidingWindowMetric to honor window_size via __post_init__

Signed-off-by: Jeff Wan <[email protected]>
Signed-off-by: Jeff Wan <[email protected]>
@NumberWan force-pushed the feat/sliding-window-metrics branch from 8f42bd3 to f7dd2c1 on August 8, 2025 03:48

github-actions bot commented Aug 8, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which executes a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀


mergify bot commented Aug 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NumberWan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 11, 2025