Expand monitoring.html for the Vespa Cloud metrics dashboard by bjormel · Pull Request #4710 · vespa-engine/documentation

bjormel · 2026-05-15T15:21:27Z

Documents the new Feed-tab rows (Persistence Engine, Per-Document-Type Feed, Memory Index Pressure), the CPU-IOWait / docsum / document-store-cache chain, multi-group correctness for replicated document counts, the dashboard's threshold/colour scheme, and the cardinality-reduction behaviour readers will encounter in the Vespa Cloud metric tier. Refreshes the dashboard / health indicators / JVM memory / container thread-pool screenshots and adds the landing dashboard image. Inline forward-links between tab summary and dedicated sections; em-dash density reduced to a normal technical-prose level.

I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Expands the Vespa Cloud monitoring dashboard documentation with new sections describing recently added Feed-tab rows, the IOWait/docsum/document-store-cache relationship, multi-group correctness for replicated document counts, the dashboard's threshold/colour scheme, and the Vespa Cloud metric-tier cardinality-reduction behaviour. Also refreshes screenshots, normalizes em-dash usage to colons in bullet/cell intros, and adds forward links between tab summaries and dedicated sections.

Changes:

New subsections: Stability/Cluster availability/Resource pressure splits in Overview; Persistence Engine, Per-Document-Type Feed, Memory Index pressure in Feed tab; CPU IOWait, Saturation thresholds at a glance, GC panels, Requests per HTTP Connection in Resources tab.
New "A note on metric cardinality in Vespa Cloud" section explaining the cloud-side label-stripping policy and where it forces panel choices.
Stylistic refresh: em-dash → colon in many list/table cells, yellow → orange for dashed max line, anchor links added across tab summary table and inline references, screenshot swap.

Comments suppressed due to low confidence (3)

en/operations/monitoring.html:192

The threshold values stated here for container thread pools conflict with the "Saturation thresholds at a glance" table further down. This row says "search + document-api" goes orange at 80% / red at 95%, "search only" 80% / 95%, and "document-api only" 90% / 98%. The summary table at lines 837–838 says search-handler thread pool & queue util is 90% / 95% and "Other thread pools (default-handler, feedapi-handler)" 80% / 90%. The two tables disagree on both the per-pool breakpoints (e.g. document-api-only is 98% red here vs. 90% red in the summary table) and the search-handler warning level (80% vs. 90%). Please reconcile so readers don't see conflicting numbers for the same metric family.

    <tr><td><strong>Container Thread Saturation &mdash; search + document-api</strong></td>
        <td>Per container cluster (with both <code>&lt;search&gt;</code> and <code>&lt;document-api&gt;</code>): worst <code>active / size</code> ratio across all JDisc thread pools</td>
        <td>&lt; 80% (green); 80&ndash;95% orange; &ge; 95% red: search-handler saturation directly degrades query latency</td></tr>
    <tr><td><strong>Container Thread Saturation &mdash; search only</strong></td>
        <td>Same as above, for clusters with only <code>&lt;search&gt;</code></td>
        <td>Same thresholds (80% / 95%): latency-critical</td></tr>
    <tr><td><strong>Container Thread Saturation &mdash; document-api only</strong></td>
        <td>For clusters with only <code>&lt;document-api&gt;</code></td>
        <td>&lt; 90% (green); 90&ndash;98% orange; &ge; 98% red: later warning since feed delays don't surface as user-visible query failures</td></tr>
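
To make the conflict concrete, here is a minimal sketch that classifies a utilization value under both sets of breakpoints quoted in this comment and reports where the colours differ. The names `Thresholds`, `colour`, and `disagreements` are hypothetical and not part of the dashboard or the docs. Pairing the "document-api only" row with the summary table's "Other thread pools" row follows the comparison made in the comment above and is an assumption about which rows correspond.

```python
from typing import Dict, NamedTuple, Tuple


class Thresholds(NamedTuple):
    orange: float  # utilization (%) at which the indicator turns orange
    red: float     # utilization (%) at which the indicator turns red


def colour(util_pct: float, t: Thresholds) -> str:
    """Map a utilization percentage to green / orange / red."""
    if util_pct >= t.red:
        return "red"
    if util_pct >= t.orange:
        return "orange"
    return "green"


# Breakpoints as quoted in the review comment (health-indicator rows).
HEALTH_ROWS: Dict[str, Thresholds] = {
    "search": Thresholds(80, 95),
    "document-api only": Thresholds(90, 98),
}

# Breakpoints as quoted for the "Saturation thresholds at a glance" table.
SUMMARY_TABLE: Dict[str, Thresholds] = {
    "search": Thresholds(90, 95),             # search-handler pool
    "document-api only": Thresholds(80, 90),  # "Other thread pools" (assumed pairing)
}


def disagreements(util_pct: float) -> Dict[str, Tuple[str, str]]:
    """Return {pool: (health_row_colour, summary_colour)} where the two differ."""
    result = {}
    for pool, row in HEALTH_ROWS.items():
        a = colour(util_pct, row)
        b = colour(util_pct, SUMMARY_TABLE[pool])
        if a != b:
            result[pool] = (a, b)
    return result
```

At 85% utilization, for example, a reader would see orange for search under one table and green under the other, which is the kind of mismatch the reconciliation should remove.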

en/operations/monitoring.html:841

The Memory utilization row in this summary table (80% / 90%) disagrees with the "Typical healthy values" table just above (line 813), where Memory is documented as < 70% healthy / 70–80% watch / approaching feed-block limit critical. Either the dashboard uses different thresholds for node-memory utilization than the doc claimed above, or one of the two tables is wrong. Please reconcile.

    <tr><td>Memory utilization (node)</td><td>80%</td><td>90%</td></tr>

en/operations/monitoring.html:159

The indicator is now titled "Container: % Nodes Down" (implying a percentage), but the "What it counts" description still reads "Active container nodes where some service isn't running" — describing a raw count of nodes, not a percentage. Either the title should drop the "%", or the description should clarify that the value is the percentage of active container nodes with a service down.

    <tr><td><strong>Container: % Nodes Down</strong></td>
        <td>Active container nodes where some service isn't running</td>
        <td>0 during steady state; brief spikes during deployments are expected</td></tr>
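
The difference between the two readings can be sketched as follows. This is an illustration only: the function name `nodes_down` and the input shape (a mapping from node name to per-service running flags) are hypothetical, and how the dashboard actually derives the value is not stated in this PR.

```python
from typing import Dict, List, Tuple


def nodes_down(services_by_node: Dict[str, List[bool]]) -> Tuple[int, float]:
    """Return (count, percentage) of nodes where some service isn't running.

    services_by_node maps a node name to one boolean per service,
    True meaning the service is running (hypothetical input shape).
    """
    total = len(services_by_node)
    # A node counts as "down" if at least one of its services is not running.
    down = sum(1 for flags in services_by_node.values() if not all(flags))
    pct = 100.0 * down / total if total else 0.0
    return down, pct
```

With four active nodes and one of them missing a service, the count reading is 1 while the percentage reading is 25.0, so the title and the description should agree on which of the two the panel shows.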


        <td>NNS distance computations, visit efficiency</td>
-        <td>Tuning HNSW parameters (hidden when not in use)</td></tr>
-    <tr><td><strong>Content Node</strong></td>
+        <td>Tuning HNSW parameters (<a href="#nns-tab">hidden when not in use</a>)</td></tr>


bjormel requested a review from Copilot May 15, 2026 15:21

Copilot started reviewing on behalf of bjormel May 15, 2026 15:22

Copilot AI reviewed May 15, 2026

bjormel wants to merge 1 commit into master from monitoring-dashboard-doc-updates
