Expand monitoring.html for the Vespa Cloud metrics dashboard #4710

Open
bjormel wants to merge 1 commit into master from monitoring-dashboard-doc-updates

bjormel (Member) commented May 15, 2026

Documents the new Feed-tab rows (Persistence Engine, Per-Document-Type Feed, Memory Index Pressure), the CPU-IOWait / docsum / document-store-cache chain, multi-group correctness for replicated document counts, the dashboard's threshold/colour scheme, and the cardinality-reduction behaviour readers will encounter in the Vespa Cloud metric tier. Refreshes the dashboard / health indicators / JVM memory / container thread-pool screenshots and adds the landing dashboard image. Inline forward-links between tab summary and dedicated sections; em-dash density reduced to a normal technical-prose level.

I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI (Contributor) left a comment

Pull request overview

Expands the Vespa Cloud monitoring dashboard documentation with new sections describing recently added Feed-tab rows, the IOWait/docsum/document-store-cache relationship, multi-group correctness for replicated document counts, the dashboard's threshold/colour scheme, and the Vespa Cloud metric-tier cardinality-reduction behaviour. Also refreshes screenshots, normalizes em-dash usage to colons in bullet/cell intros, and adds forward links between tab summaries and dedicated sections.

Changes:

  • New subsections: Stability/Cluster availability/Resource pressure splits in Overview; Persistence Engine, Per-Document-Type Feed, Memory Index pressure in Feed tab; CPU IOWait, Saturation thresholds at a glance, GC panels, Requests per HTTP Connection in Resources tab.
  • New "A note on metric cardinality in Vespa Cloud" section explaining the cloud-side label-stripping policy and where it forces panel choices.
  • Stylistic refresh: em-dash → colon in many list/table cells, yellow → orange for dashed max line, anchor links added across tab summary table and inline references, screenshot swap.
Comments suppressed due to low confidence (3)

en/operations/monitoring.html:192

  • The threshold values stated here for container thread pools conflict with the "Saturation thresholds at a glance" table further down. This row says "search + document-api" goes orange at 80% / red at 95%, "search only" 80% / 95%, and "document-api only" 90% / 98%. The summary table at lines 837–838 says search-handler thread pool & queue util is 90% / 95% and "Other thread pools (default-handler, feedapi-handler)" 80% / 90%. The two tables disagree on both the per-pool breakpoints (e.g. document-api-only is 98% red here vs. 90% red in the summary table) and the search-handler warning level (80% vs. 90%). Please reconcile so readers don't see conflicting numbers for the same metric family.
    <tr><td><strong>Container Thread Saturation &mdash; search + document-api</strong></td>
        <td>Per container cluster (with both <code>&lt;search&gt;</code> and <code>&lt;document-api&gt;</code>): worst <code>active / size</code> ratio across all JDisc thread pools</td>
        <td>&lt; 80% (green); 80&ndash;95% orange; &ge; 95% red: search-handler saturation directly degrades query latency</td></tr>
    <tr><td><strong>Container Thread Saturation &mdash; search only</strong></td>
        <td>Same as above, for clusters with only <code>&lt;search&gt;</code></td>
        <td>Same thresholds (80% / 95%): latency-critical</td></tr>
    <tr><td><strong>Container Thread Saturation &mdash; document-api only</strong></td>
        <td>For clusters with only <code>&lt;document-api&gt;</code></td>
        <td>&lt; 90% (green); 90&ndash;98% orange; &ge; 98% red: later warning since feed delays don't surface as user-visible query failures</td></tr>
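
A disagreement like this one can be checked mechanically. The following is a minimal sketch, not part of the PR: it takes the orange/red breakpoints quoted in the comment above and compares each health-indicator row with the summary-table row it appears to correspond to. The names `HEALTH_INDICATORS`, `SUMMARY_TABLE`, `ROW_TO_SUMMARY`, and `find_conflicts` are hypothetical, and the row-to-row mapping is an assumption inferred from the comment's wording.

```python
# Breakpoints (orange %, red %) as quoted in the review comment.
HEALTH_INDICATORS = {
    "search + document-api": (80, 95),
    "search only": (80, 95),
    "document-api only": (90, 98),
}
SUMMARY_TABLE = {
    "search-handler": (90, 95),
    "other thread pools": (80, 90),
}

# Assumed correspondence between the two tables (illustrative only).
ROW_TO_SUMMARY = {
    "search + document-api": "search-handler",
    "search only": "search-handler",
    "document-api only": "other thread pools",
}


def find_conflicts():
    """Return (row, level, indicator_value, summary_value) for each mismatch."""
    conflicts = []
    for row, (orange, red) in HEALTH_INDICATORS.items():
        s_orange, s_red = SUMMARY_TABLE[ROW_TO_SUMMARY[row]]
        if orange != s_orange:
            conflicts.append((row, "orange", orange, s_orange))
        if red != s_red:
            conflicts.append((row, "red", red, s_red))
    return conflicts


if __name__ == "__main__":
    for row, level, here, summary in find_conflicts():
        print(f"{row}: {level} at {here}% here vs {summary}% in summary table")
```

Under the assumed mapping, every health-indicator row disagrees with the summary table on at least one breakpoint.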

en/operations/monitoring.html:841

  • The Memory utilization row in this summary table (80% / 90%) disagrees with the "Typical healthy values" table just above (line 813), where Memory is documented as < 70% healthy / 70–80% watch / approaching feed-block limit critical. Either the dashboard uses different thresholds for node-memory utilization than the doc claimed above, or one of the two tables is wrong. Please reconcile.
    <tr><td>Memory utilization (node)</td><td>80%</td><td>90%</td></tr>
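
To make the mismatch concrete, here is a small sketch that classifies one memory reading under both tables. The function names are hypothetical. The 80% / 90% breakpoints and the 70% / 80% bands come from the comment above. The comment gives no numeric bound for "critical", so the sketch does not guess one, and treating exactly 80% as outside the watch band is an assumption.

```python
def dashboard_colour(value_pct, orange_at, red_at):
    """Colour for a utilization percentage given orange/red breakpoints."""
    if value_pct >= red_at:
        return "red"
    if value_pct >= orange_at:
        return "orange"
    return "green"


def healthy_values_band(value_pct):
    """Band per the 'Typical healthy values' wording quoted in the comment.

    No numeric bound is given for 'critical', so readings at or above 80%
    are reported as beyond the watch band rather than labelled critical.
    """
    if value_pct < 70:
        return "healthy"
    if value_pct < 80:
        return "watch"
    return "beyond watch band"


if __name__ == "__main__":
    reading = 75
    # Summary table: orange at 80%, red at 90%.
    print(dashboard_colour(reading, 80, 90), healthy_values_band(reading))
```

A reading of 75% is green under the summary table but falls in the watch band under the healthy-values table, which is the inconsistency the comment asks the author to resolve.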

en/operations/monitoring.html:159

  • The indicator is now titled "Container: % Nodes Down" (implying a percentage), but the "What it counts" description still reads "Active container nodes where some service isn't running" — describing a raw count of nodes, not a percentage. Either the title should drop the "%", or the description should clarify that the value is the percentage of active container nodes with a service down.
    <tr><td><strong>Container: % Nodes Down</strong></td>
        <td>Active container nodes where some service isn't running</td>
        <td>0 during steady state; brief spikes during deployments are expected</td></tr>
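
The difference between the two readings of this indicator is sketched below, assuming a hypothetical input that maps each active container node to the running state of its services. Neither the data shape nor the function names come from the dashboard; they only illustrate a raw count versus a percentage.

```python
def nodes_down(services_running_by_node):
    """Count nodes where at least one service is not running."""
    return sum(
        1 for states in services_running_by_node.values() if not all(states)
    )


def percent_nodes_down(services_running_by_node):
    """Share of active nodes with a service down, as a percentage."""
    total = len(services_running_by_node)
    if total == 0:
        return 0.0  # no active nodes: report 0 rather than divide by zero
    return 100.0 * nodes_down(services_running_by_node) / total


if __name__ == "__main__":
    # Hypothetical cluster: four nodes, one with a stopped service.
    cluster = {
        "node-0": [True, True],
        "node-1": [True, False],
        "node-2": [True, True],
        "node-3": [True, True],
    }
    print(nodes_down(cluster), percent_nodes_down(cluster))
```

For the same cluster the count is 1 while the percentage is 25.0, so the title and the "What it counts" text need to agree on which value the panel shows.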


<td>NNS distance computations, visit efficiency</td>
<td>Tuning HNSW parameters (hidden when not in use)</td></tr>
<tr><td><strong>Content Node</strong></td>
<td>Tuning HNSW parameters (<a href="#nns-tab">hidden when not in use</a>)</td></tr>