Expand monitoring.html for the Vespa Cloud metrics dashboard #4710
Open
bjormel wants to merge 1 commit into
Documents the new Feed-tab rows (Persistence Engine, Per-Document-Type Feed, Memory Index Pressure), the CPU-IOWait / docsum / document-store-cache chain, multi-group correctness for replicated document counts, the dashboard's threshold/colour scheme, and the cardinality-reduction behaviour readers will encounter in the Vespa Cloud metric tier. Refreshes the dashboard / health indicators / JVM memory / container thread-pool screenshots and adds the landing dashboard image. Adds inline forward links between the tab summary and the dedicated sections, and reduces em-dash density to a normal technical-prose level.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Expands the Vespa Cloud monitoring dashboard documentation with new sections describing recently added Feed-tab rows, the IOWait/docsum/document-store-cache relationship, multi-group correctness for replicated document counts, the dashboard's threshold/colour scheme, and the Vespa Cloud metric-tier cardinality-reduction behaviour. Also refreshes screenshots, normalizes em-dash usage to colons in bullet/cell intros, and adds forward links between tab summaries and dedicated sections.
Changes:
- New subsections: Stability/Cluster availability/Resource pressure splits in Overview; Persistence Engine, Per-Document-Type Feed, Memory Index pressure in Feed tab; CPU IOWait, Saturation thresholds at a glance, GC panels, Requests per HTTP Connection in Resources tab.
- New "A note on metric cardinality in Vespa Cloud" section explaining the cloud-side label-stripping policy and where it forces panel choices.
- Stylistic refresh: em-dash → colon in many list/table cells, yellow → orange for dashed max line, anchor links added across tab summary table and inline references, screenshot swap.
Comments suppressed due to low confidence (3)
en/operations/monitoring.html:192
- The threshold values stated here for container thread pools conflict with the "Saturation thresholds at a glance" table further down. This row says "search + document-api" goes orange at 80% / red at 95%, "search only" 80% / 95%, and "document-api only" 90% / 98%. The summary table at lines 837–838 says "search-handler thread pool & queue util" is 90% / 95% and "Other thread pools (default-handler, feedapi-handler)" 80% / 90%. The two tables disagree on both the per-pool breakpoints (e.g. document-api-only is 98% red here vs. 90% red in the summary table) and the search-handler warning level (80% vs. 90%). Please reconcile so readers don't see conflicting numbers for the same metric family.
<tr><td><strong>Container Thread Saturation — search + document-api</strong></td>
<td>Per container cluster (with both <code><search></code> and <code><document-api></code>): worst <code>active / size</code> ratio across all JDisc thread pools</td>
<td>< 80% (green); 80–95% orange; ≥ 95% red: search-handler saturation directly degrades query latency</td></tr>
<tr><td><strong>Container Thread Saturation — search only</strong></td>
<td>Same as above, for clusters with only <code><search></code></td>
<td>Same thresholds (80% / 95%): latency-critical</td></tr>
<tr><td><strong>Container Thread Saturation — document-api only</strong></td>
<td>For clusters with only <code><document-api></code></td>
<td>< 90% (green); 90–98% orange; ≥ 98% red: later warning since feed delays don't surface as user-visible query failures</td></tr>
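To make the breakpoints in the three quoted rows concrete, here is a minimal Python sketch of how such a panel could colour a cluster. It is an illustration only, not the dashboard's actual implementation: the function names, the `THRESHOLDS` mapping, and the input shape (pool name mapped to an `(active, size)` pair) are assumptions. Only the percentages and the "worst `active / size` ratio" rule come from the rows above.

```python
from typing import Dict, Tuple

# (orange_from, red_from) as fractions, copied from the three quoted rows.
# The keys are hypothetical labels for the three cluster kinds.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "search+document-api": (0.80, 0.95),
    "search": (0.80, 0.95),
    "document-api": (0.90, 0.98),
}


def worst_ratio(pools: Dict[str, Tuple[int, int]]) -> float:
    """Return the highest active/size ratio across all thread pools.

    `pools` maps a pool name to (active threads, pool size).
    Pools with size 0 are skipped; an empty input yields 0.0.
    """
    return max(
        (active / size for active, size in pools.values() if size > 0),
        default=0.0,
    )


def colour(cluster_kind: str, pools: Dict[str, Tuple[int, int]]) -> str:
    """Classify a container cluster as green, orange or red."""
    orange_from, red_from = THRESHOLDS[cluster_kind]
    ratio = worst_ratio(pools)
    if ratio >= red_from:
        return "red"
    if ratio >= orange_from:
        return "orange"
    return "green"
```

With these numbers, a worst ratio of 85% is orange for a search-only cluster but still green for a document-api-only cluster, which is the asymmetry the rows describe.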
en/operations/monitoring.html:841
- The Memory utilization row in this summary table (80% / 90%) disagrees with the "Typical healthy values" table just above (line 813), where Memory is documented as < 70% healthy / 70–80% watch / approaching feed-block limit critical. Either the dashboard uses different thresholds for node-memory utilization than the doc claimed above, or one of the two tables is wrong. Please reconcile.
<tr><td>Memory utilization (node)</td><td>80%</td><td>90%</td></tr>
en/operations/monitoring.html:159
- The indicator is now titled "Container: % Nodes Down" (implying a percentage), but the "What it counts" description still reads "Active container nodes where some service isn't running" — describing a raw count of nodes, not a percentage. Either the title should drop the "%", or the description should clarify that the value is the percentage of active container nodes with a service down.
<tr><td><strong>Container: % Nodes Down</strong></td>
<td>Active container nodes where some service isn't running</td>
<td>0 during steady state; brief spikes during deployments are expected</td></tr>
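The difference between the two readings of this indicator can be sketched in a few lines of Python. This is a hedged illustration: the input shape (node name mapped to a list of per-service "is running" flags) and both function names are hypothetical, and the dashboard's real query is not shown in this PR.

```python
from typing import Dict, List


def nodes_down_count(services_up: Dict[str, List[bool]]) -> int:
    """Raw count of nodes where at least one service is not running."""
    return sum(1 for flags in services_up.values() if not all(flags))


def nodes_down_percent(services_up: Dict[str, List[bool]]) -> float:
    """Share of active nodes with at least one service down, in percent."""
    if not services_up:
        return 0.0
    return 100.0 * nodes_down_count(services_up) / len(services_up)
```

For four active container nodes with one node missing a service, the first function returns 1 and the second returns 25.0, so the title and the description need to agree on which of the two the panel shows.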
<td>NNS distance computations, visit efficiency</td>
<td>Tuning HNSW parameters (hidden when not in use)</td></tr>
<tr><td><strong>Content Node</strong></td>
<td>Tuning HNSW parameters (<a href="#nns-tab">hidden when not in use</a>)</td></tr>
I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner.