Add storage.wal.fsync.latency and other metrics #19425

Open · wants to merge 3 commits into base: main
6 changes: 6 additions & 0 deletions src/current/_includes/v25.1/essential-metrics.md
@@ -35,6 +35,7 @@ The **Usage** column explains why each metric is important to visualize in a cus
| <a id="capacity"></a>capacity | {% if include.deployment == 'self-hosted' %}capacity.total |{% elsif include.deployment == 'advanced' %}capacity |{% endif %} Total storage capacity | This metric gives total storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
| <a id="capacity-available"></a>capacity.available | capacity.available | Available storage capacity | This metric gives available storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
| capacity.used | capacity.used | Used storage capacity | This metric gives used storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
| <a id="storage-wal-fsync-latency"></a>storage.wal.fsync.latency | {% if include.deployment == 'self-hosted' %}storage.wal.fsync.latency |{% elsif include.deployment == 'advanced' %}storage.wal.fsync.latency |{% endif %} This metric reports the latency of writes to the [WAL]({% link {{ page.version.version }}/architecture/storage-layer.md %}#memtable-and-write-ahead-log). | If this value is greater than `100ms`, it is an indication of a [disk stall]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#disk-stalls). To mitigate the effects of disk stalls, consider deploying your cluster with [WAL failover]({% link {{ page.version.version }}/wal-failover.md %}) configured. |
| <a id="storage-write-stalls"></a>storage.write-stalls | {% if include.deployment == 'self-hosted' %}storage.write.stalls |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of instances of intentional write stalls to backpressure incoming writes | This metric reports actual disk stall events. Ideally, investigate all reports of disk stalls. As a pratical guideline, one stall per minute is not likely to have a material impact on workload beyond an occasional increase in response time. However one stall per second should be viewed as problematic and investigated actively. It is particularly problematic if the rate persists over an extended period of time, and worse, if it is increasing. |
| rocksdb.compactions | rocksdb.compactions.total | Number of SST compactions | This metric reports the number of a node's [LSM compactions]({% link {{ page.version.version }}/common-issues-to-monitor.md %}#lsm-health). If the number of compactions remains elevated while the LSM health does not improve, compactions are not keeping up with the workload. If the condition persists for an extended period, the cluster will initially exhibit performance issues that will eventually escalate into stability issues. |
| rocksdb.block.cache.hits | rocksdb.block.cache.hits | Count of block cache hits | This metric gives hits to the block cache, which is reserved memory. It is allocated upon the start of a node process by the [`--cache` flag]({% link {{ page.version.version }}/cockroach-start.md %}#general) and never shrinks. By observing block cache hits and misses, you can fine-tune memory allocations in the node process for the demands of the workload. |
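
To make the `100ms` disk-stall guidance above concrete, here is a minimal sketch of one way the WAL fsync histogram could be checked outside of Datadog. It is not CockroachDB tooling: the endpoint URL `http://localhost:8080/_status/vars`, the exported name `storage_wal_fsync_latency` (dots exported as underscores), and nanosecond bucket bounds are assumptions; adjust them for your deployment.

```python
# Sketch only: estimates the p99 of storage.wal.fsync.latency from a node's
# Prometheus endpoint and flags values above the 100ms disk-stall guideline.
# The URL, exported metric name, and nanosecond units are assumptions.
import re
import urllib.request
from collections import defaultdict

VARS_URL = "http://localhost:8080/_status/vars"  # assumed DB Console/HTTP port
DISK_STALL_THRESHOLD_NS = 100e6                  # 100ms, per the table above


def fetch_histogram_buckets(url: str, metric: str) -> list[tuple[float, float]]:
    """Return sorted (upper_bound, cumulative_count) pairs, summed across labels."""
    text = urllib.request.urlopen(url).read().decode()
    pattern = re.compile(rf'^{metric}_bucket{{.*?le="([^"]+)".*?}}\s+(\S+)$', re.M)
    totals: dict[float, float] = defaultdict(float)
    for le, count in pattern.findall(text):
        bound = float("inf") if le == "+Inf" else float(le)
        totals[bound] += float(count)  # merge per-store series, if any
    return sorted(totals.items())


def approx_quantile(buckets: list[tuple[float, float]], q: float) -> float:
    """Smallest bucket bound covering quantile q (coarse, bucket-resolution only)."""
    if not buckets:
        return 0.0
    target = q * buckets[-1][1]  # the +Inf bucket holds the total count
    for bound, cumulative in buckets:
        if cumulative >= target:
            return bound
    return float("inf")


if __name__ == "__main__":
    buckets = fetch_histogram_buckets(VARS_URL, "storage_wal_fsync_latency")
    p99 = approx_quantile(buckets, 0.99)
    print(f"approx p99 WAL fsync latency: {p99 / 1e6:.1f} ms")
    if p99 > DISK_STALL_THRESHOLD_NS:
        print("above 100ms: possible disk stall; consider configuring WAL failover")
```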
@@ -92,7 +93,12 @@ The **Usage** column explains why each metric is important to visualize in a cus
| <div style="width:225px">CockroachDB Metric Name</div> | {% if include.deployment == 'self-hosted' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)<br>(add `cockroachdb.` prefix)</div> |{% elsif include.deployment == 'advanced' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb_dedicated/#metrics)<br>(add `crdb_dedicated.` prefix)</div> |{% endif %}<div style="width:150px">Description</div>| Usage |
| ----------------------------------------------------- | {% if include.deployment == 'self-hosted' %}------ |{% elsif include.deployment == 'advanced' %}---- |{% endif %} ------------------------------------------------------------ | ------------------------------------------------------------ |
| leases.transfers.success | leases.transfers.success | Number of successful lease transfers | A high number of [lease](architecture/replication-layer.html#leases) transfers is not a negative or positive signal; rather, it is a reflection of elastic cluster activities. For example, this metric is high during cluster topology changes. A high value is often the reason for NotLeaseHolderErrors, which are normal and expected during rebalancing. Observing this metric may provide confirmation of the cause of such errors. |
| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. These lease transfers are tracked by a component that looks for [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance when looking at either QPS (`rebalancing.queriespersecond`) or CPU (`rebalancing.cpunanospersecond`) depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is viewed as a rate; see the rate sketch after this table), it indicates more rebalancing activity is taking place due to load imbalance between stores. |
| rebalancing_range_rebalances | {% if include.deployment == 'self-hosted' %}rebalancing.range.rebalances | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter of the number of [load-based range rebalances]({% link {{ page.version.version }}/architecture/replication-layer.md %}#load-based-replica-rebalancing). This range movement is tracked by a component that looks for [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance when looking at either QPS (`rebalancing.queriespersecond`) or CPU (`rebalancing.cpunanospersecond`) depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is viewed as a rate), it indicates more rebalancing activity is taking place due to load imbalance between stores. |
| rebalancing_replicas_queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.queriespersecond | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter for the kv-level requests received per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates all of the CPU and QPS stats across all its replicas and then creates a histogram that maintains buckets so you can query, e.g., the P95 replica's QPS or CPU. | A high value of this metric could indicate that one of the store's replicas is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also: `rebalancing_replicas_cpunanospersecond`. |
| rebalancing_replicas_cpunanospersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.cpunanospersecond | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter for the CPU nanoseconds of execution time per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates all of the CPU and QPS stats across all its replicas and then creates a histogram that maintains buckets so you can query, e.g., the P95 replica's QPS or CPU. | A high value of this metric could indicate that one of the store's replicas is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also the non-histogram variant: `rebalancing.cpunanospersecond`. |
| rebalancing.queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.queriespersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of kv-level requests received per second by the store, considering the last 30 minutes, as used in rebalancing decisions. | This metric shows hotspots along the queries per second (QPS) dimension. It provides insights into the ongoing rebalancing activities. |
| rebalancing.cpunanospersecond | {% if include.deployment == 'self-hosted' %}rebalancing.cpunanospersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Non-histogram variant of `rebalancing_replicas_cpunanospersecond`. | See usage of `rebalancing_replicas_cpunanospersecond`. |
| ranges | ranges | Number of ranges | This metric provides a measure of the scale of the data size. |
| replicas | {% if include.deployment == 'self-hosted' %}replicas.total |{% elsif include.deployment == 'advanced' %}replicas |{% endif %} Number of replicas | This metric provides an essential characterization of the data distribution across cluster nodes. |
| replicas.leaseholders | replicas.leaseholders | Number of lease holders | This metric provides an essential characterization of the data processing points across cluster nodes. |
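
The rebalancing rows above describe cumulative counters whose values are mainly meaningful as rates. As a hedged illustration (not CockroachDB tooling), the sketch below samples one such counter twice from the same assumed Prometheus endpoint and prints a per-second rate; the underscore-form metric name is an assumption about how the dotted name is exported. In practice, a monitoring system's own rate function (for example, a Datadog or Prometheus rate over the counter) does this for you.

```python
# Sketch only: turns a cumulative rebalancing counter into a per-second rate by
# sampling it twice. The endpoint URL and the exported (underscored) metric
# name are assumptions; adjust them for your deployment.
import re
import time
import urllib.request

VARS_URL = "http://localhost:8080/_status/vars"  # assumed DB Console/HTTP port
METRIC = "rebalancing_lease_transfers"           # assumed exported name
INTERVAL_SECONDS = 60


def read_counter(url: str, metric: str) -> float:
    """Sum a counter across all of its label combinations (e.g., per-store series)."""
    text = urllib.request.urlopen(url).read().decode()
    pattern = re.compile(rf"^{metric}(?:{{[^}}]*}})?\s+(\S+)$", re.M)
    return sum(float(value) for value in pattern.findall(text))


if __name__ == "__main__":
    first = read_counter(VARS_URL, METRIC)
    time.sleep(INTERVAL_SECONDS)
    second = read_counter(VARS_URL, METRIC)
    print(f"{METRIC}: {(second - first) / INTERVAL_SECONDS:.2f} per second "
          f"over the last {INTERVAL_SECONDS}s")
```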