
Commit 78c6838

Add storage.wal.fsync.latency and other metrics
Fixes DOC-11996. Adds the following metrics to the docs:

- storage.wal.fsync.latency
- rebalancing.range.rebalances
- rebalancing.replicas.queriespersecond
1 parent 57c67ca commit 78c6838

File tree

1 file changed: +4 -0 lines changed

src/current/_includes/v25.1/essential-metrics.md (+4)
@@ -35,6 +35,7 @@ The **Usage** column explains why each metric is important to visualize in a cus
 | <a id="capacity"></a>capacity | {% if include.deployment == 'self-hosted' %}capacity.total |{% elsif include.deployment == 'advanced' %}capacity |{% endif %} Total storage capacity | This metric gives total storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
 | <a id="capacity-available"></a>capacity.available | capacity.available | Available storage capacity | This metric gives available storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
 | capacity.used | capacity.used | Used storage capacity | This metric gives used storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
+| <a id="storage-wal-fsync-latency"></a>storage.wal.fsync.latency | {% if include.deployment == 'self-hosted' %}storage.wal.fsync.latency |{% elsif include.deployment == 'advanced' %}storage.wal.fsync.latency |{% endif %} The latency of writes to the [WAL]({% link {{ page.version.version }}/architecture/storage-layer.md %}#memtable-and-write-ahead-log). | If this value is greater than `100ms`, it is an indication of a [disk stall]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#disk-stalls). To mitigate the effects of disk stalls, consider deploying your cluster with [WAL failover]({% link {{ page.version.version }}/wal-failover.md %}) configured. |
 | <a id="storage-write-stalls"></a>storage.write-stalls | {% if include.deployment == 'self-hosted' %}storage.write.stalls |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of instances of intentional write stalls to backpressure incoming writes | This metric reports actual disk stall events. Ideally, investigate all reports of disk stalls. As a practical guideline, one stall per minute is not likely to have a material impact on workload beyond an occasional increase in response time. However, one stall per second should be viewed as problematic and investigated actively. It is particularly problematic if the rate persists over an extended period of time, and worse, if it is increasing. |
 | rocksdb.compactions | rocksdb.compactions.total | Number of SST compactions | This metric reports the number of a node's [LSM compactions]({% link {{ page.version.version }}/common-issues-to-monitor.md %}#lsm-health). If the number of compactions remains elevated while the LSM health does not improve, compactions are not keeping up with the workload. If the condition persists for an extended period, the cluster will initially exhibit performance issues that will eventually escalate into stability issues. |
 | rocksdb.block.cache.hits | rocksdb.block.cache.hits | Count of block cache hits | This metric gives the number of hits to the block cache, which is reserved memory. The block cache is allocated upon the start of a node process by the [`--cache` flag]({% link {{ page.version.version }}/cockroach-start.md %}#general) and never shrinks. By observing block cache hits and misses, you can fine-tune memory allocations in the node process for the demands of the workload. |
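
The thresholds called out in these rows (storage utilization above 60%, WAL fsync latency above `100ms`) can be made concrete with a short script against a node's Prometheus endpoint. Below is a minimal sketch, not part of this commit, assuming that the endpoint is `/_status/vars` on the DB Console port (8080 by default), that dots in metric names become underscores in that export, and that the fsync latency histogram records nanoseconds; verify each assumption against your build.

```python
# Minimal monitoring sketch for the thresholds described above.
# Assumptions to verify against your build: the node serves
# Prometheus-format metrics at /_status/vars on the DB Console port;
# dots in metric names become underscores in that export; and
# storage.wal.fsync.latency histogram values are in nanoseconds.
import urllib.request

def scrape(url: str) -> dict[str, float]:
    """Parse Prometheus text format into {metric_name: summed_value},
    summing across label sets (e.g. per-store series)."""
    metrics: dict[str, float] = {}
    with urllib.request.urlopen(url) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith("#") or not line.strip():
                continue
            name, _, value = line.rpartition(" ")
            base = name.split("{")[0]  # drop labels such as {store="1"}
            try:
                metrics[base] = metrics.get(base, 0.0) + float(value)
            except ValueError:
                continue
    return metrics

m = scrape("http://localhost:8080/_status/vars")

# Rule: storage volumes should not be utilized more than 60%.
total, available = m["capacity"], m["capacity_available"]
utilization = (total - available) / total
print(f"storage utilization: {utilization:.1%}")
if utilization > 0.60:
    print("WARNING: above the recommended 60% utilization")

# Rule: WAL fsync latency above 100ms indicates a possible disk stall.
# Approximate the mean from the histogram's cumulative sum and count.
fsync_sum = m.get("storage_wal_fsync_latency_sum", 0.0)
fsync_count = m.get("storage_wal_fsync_latency_count", 0.0)
if fsync_count:
    mean_ms = fsync_sum / fsync_count / 1e6  # ns -> ms (assumed units)
    print(f"mean WAL fsync latency: {mean_ms:.2f}ms")
    if mean_ms > 100:
        print("WARNING: possible disk stall; consider WAL failover")

# Block cache hit ratio, useful when tuning the --cache allocation.
hits = m.get("rocksdb_block_cache_hits", 0.0)
misses = m.get("rocksdb_block_cache_misses", 0.0)
if hits + misses:
    print(f"block cache hit ratio: {hits / (hits + misses):.1%}")
```

In production, the same checks are better expressed as alerting rules in your monitoring stack; the script only illustrates the thresholds.
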
@@ -92,6 +93,9 @@ The **Usage** column explains why each metric is important to visualize in a cus
 | <div style="width:225px">CockroachDB Metric Name</div> | {% if include.deployment == 'self-hosted' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)<br>(add `cockroachdb.` prefix)</div> |{% elsif include.deployment == 'advanced' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb_dedicated/#metrics)<br>(add `crdb_dedicated.` prefix)</div> |{% endif %}<div style="width:150px">Description</div>| Usage |
 | ----------------------------------------------------- | {% if include.deployment == 'self-hosted' %}------ |{% elsif include.deployment == 'advanced' %}---- |{% endif %} ------------------------------------------------------------ | ------------------------------------------------------------ |
 | leases.transfers.success | leases.transfers.success | Number of successful lease transfers | A high number of [lease](architecture/replication-layer.html#leases) transfers is neither a negative nor a positive signal; rather, it reflects elastic cluster activities. For example, this metric is high during cluster topology changes. A high value is often the reason for NotLeaseHolderErrors, which are normal and expected during rebalancing. Observing this metric may provide confirmation of the cause of such errors. |
+| rebalancing.lease.transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. | [XXX](): USAGE??? |
+| rebalancing.range.rebalances | rebalancing.range.rebalances | Counter of the number of [load-based range rebalances]({% link {{ page.version.version }}/architecture/replication-layer.md %}#load-based-replica-rebalancing) motivated by store-level load imbalances. | [XXX](): USAGE??? |
+| rebalancing.replicas.queriespersecond | rebalancing.replicas.queriespersecond | Number of kv-level requests received per second by a given [replica]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-replica). | A high value of this metric for a particular replica could indicate that the replica is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). |
 | rebalancing.queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.queriespersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of kv-level requests received per second by the store, considering the last 30 minutes, as used in rebalancing decisions. | This metric shows hotspots along the queries per second (QPS) dimension. It provides insights into the ongoing rebalancing activities. |
 | ranges | ranges | Number of ranges | This metric provides a measure of the scale of the data size. |
 | replicas | {% if include.deployment == 'self-hosted' %}replicas.total |{% elsif include.deployment == 'advanced' %}replicas |{% endif %} Number of replicas | This metric provides an essential characterization of the data distribution across cluster nodes. |
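
A high `rebalancing.replicas.queriespersecond` for a particular replica suggests a hot range. As a minimal sketch of one way to confirm, the script below queries the status server's hot-ranges endpoint. Both the `/_status/hotranges` path and the `queries_per_second` field name are assumptions to check against your version's API, and secure clusters will additionally require an authenticated session; the response schema varies, so the JSON is walked generically rather than parsed against a fixed shape.

```python
# Minimal sketch: surface candidate hot ranges from the status server.
# Assumptions to verify against your version: the DB Console port serves
# /_status/hotranges as JSON containing per-range queries_per_second
# figures; secure clusters may require an authenticated session cookie.
import json
import urllib.request

def qps_entries(obj, found):
    """Recursively collect any dicts that carry a queries_per_second."""
    if isinstance(obj, dict):
        if "queries_per_second" in obj:
            found.append(obj)
        for v in obj.values():
            qps_entries(v, found)
    elif isinstance(obj, list):
        for v in obj:
            qps_entries(v, found)
    return found

with urllib.request.urlopen("http://localhost:8080/_status/hotranges") as r:
    payload = json.load(r)

ranges = qps_entries(payload, [])
ranges.sort(key=lambda e: e.get("queries_per_second", 0), reverse=True)
for entry in ranges[:5]:  # print the five busiest ranges
    print(entry.get("queries_per_second"), entry)
```
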
