From 78c68380b6ffa005d6640dfe8a0a22cb632d5c02 Mon Sep 17 00:00:00 2001
From: Rich Loveland <rich@cockroachlabs.com>
Date: Thu, 6 Mar 2025 11:42:02 -0500
Subject: [PATCH 1/3] Add `storage.wal.fsync.latency` and other metrics

Fixes DOC-11996

Adds the following metrics to the docs:

- storage.wal.fsync.latency
- rebalancing.lease.transfers
- rebalancing.range.rebalances
- rebalancing.replicas.queriespersecond
---
 src/current/_includes/v25.1/essential-metrics.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/current/_includes/v25.1/essential-metrics.md b/src/current/_includes/v25.1/essential-metrics.md
index 41039d31a75..8a4ae347c7c 100644
--- a/src/current/_includes/v25.1/essential-metrics.md
+++ b/src/current/_includes/v25.1/essential-metrics.md
@@ -35,6 +35,7 @@ The **Usage** column explains why each metric is important to visualize in a cus
 | <a id="capacity"></a>capacity                                            | {% if include.deployment == 'self-hosted' %}capacity.total  |{% elsif include.deployment == 'advanced' %}capacity |{% endif %} Total storage capacity                                       | This metric gives total storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
 | <a id="capacity-available"></a>capacity.available                                  | capacity.available                                           | Available storage capacity                                   | This metric gives available storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
 | capacity.used                                       | capacity.used                                                | Used storage capacity                                        | This metric gives used storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
+| <a id="storage-wal-fsync-latency"></a>storage.wal.fsync.latency | {% if include.deployment == 'self-hosted' %}storage.wal.fsync.latency |{% elsif include.deployment == 'advanced' %}storage.wal.fsync.latency |{% endif %} This metric reports the latency of writes to the [WAL]({% link {{ page.version.version }}/architecture/storage-layer.md %}#memtable-and-write-ahead-log). | If this value is greater than `100ms`, it is an indication of a [disk stall]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#disk-stalls). To mitigate the effects of disk stalls, consider deploying your cluster with [WAL failover]({% link {{ page.version.version }}/wal-failover.md %}) configured. |
 | <a id="storage-write-stalls"></a>storage.write-stalls                                | {% if include.deployment == 'self-hosted' %}storage.write.stalls |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of instances of intentional write stalls to backpressure incoming writes | This metric reports actual disk stall events. Ideally, investigate all reports of disk stalls. As a pratical guideline, one stall per minute is not likely to have a material impact on workload beyond an occasional increase in response time. However one stall per second should be viewed as problematic and investigated actively.  It is particularly problematic if the rate persists over an extended period of time, and worse, if it is increasing. |
 | rocksdb.compactions                                 | rocksdb.compactions.total                                    | Number of SST compactions                                    | This metric reports the number of a node's [LSM compactions]({% link {{ page.version.version }}/common-issues-to-monitor.md %}#lsm-health). If the number of compactions remains elevated while the LSM health does not improve, compactions are not keeping up with the workload. If the condition persists for an extended period, the cluster will initially exhibit performance issues that will eventually escalate into stability issues. |
 | rocksdb.block.cache.hits                            | rocksdb.block.cache.hits                                     | Count of block cache hits                                    | This metric gives hits to block cache which is reserved memory. It is allocated upon the start of a node process by the [`--cache` flag]({% link {{ page.version.version }}/cockroach-start.md %}#general) and never shrinks. By observing block cache hits and misses, you can fine-tune memory allocations in the node process for the demands of the workload. |
@@ -92,6 +93,9 @@ The **Usage** column explains why each metric is important to visualize in a cus
 | <div style="width:225px">CockroachDB Metric Name</div> | {% if include.deployment == 'self-hosted' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)<br>(add `cockroachdb.` prefix)</div> |{% elsif include.deployment == 'advanced' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb_dedicated/#metrics)<br>(add `crdb_dedicated.` prefix)</div> |{% endif %}<div style="width:150px">Description</div>| Usage |
 | ----------------------------------------------------- | {% if include.deployment == 'self-hosted' %}------ |{% elsif include.deployment == 'advanced' %}---- |{% endif %} ------------------------------------------------------------ | ------------------------------------------------------------ |
 | leases.transfers.success                            | leases.transfers.success                                     | Number of successful lease transfers                         | A high number of [lease](architecture/replication-layer.html#leases) transfers is not a negative or positive signal, rather it is a reflection of the elastic cluster activities. For example, this metric is high during cluster topology changes. A high value is often the reason for NotLeaseHolderErrors which are normal and expected during rebalancing. Observing this metric may provide a confirmation of the cause of such errors. |
+| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. | [XXX](): USAGE??? |
+| rebalancing_range_rebalances | rebalancing.range.rebalances | Counter of the number of [load-based range rebalances]({% link {{ page.version.version }}/architecture/replication-layer.md %}#load-based-replica-rebalancing) motivated by store-level load imbalances. | [XXX](): USAGE??? |
+| rebalancing_replicas_queriespersecond | rebalancing.replicas.queriespersecond | Counter for the kv-level requests received per second by a given [replica]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-replica). | A high value of this metric for a particular replica could indicate that the replica is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). |
 | rebalancing.queriespersecond                        | {% if include.deployment == 'self-hosted' %}rebalancing.queriespersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of kv-level requests received per second by the store, considering the last 30 minutes, as used in rebalancing decisions. | This metric shows hotspots along the queries per second (QPS) dimension. It provides insights into the ongoing rebalancing activities. |
 | ranges                                              | ranges                                                       | Number of ranges                                             | This metric provides a measure of the scale of the data size.                  |
 | replicas                                            | {% if include.deployment == 'self-hosted' %}replicas.total |{% elsif include.deployment == 'advanced' %}replicas |{% endif %} Number of replicas                                           | This metric provides an essential characterization of the data distribution across cluster nodes. |

From f4986ce99e54478323ffad9bc55331edaea08d6d Mon Sep 17 00:00:00 2001
From: Rich Loveland <rich@cockroachlabs.com>
Date: Thu, 13 Mar 2025 14:10:49 -0400
Subject: [PATCH 2/3] Update with kvoli feedback (1)

---
 src/current/_includes/v25.1/essential-metrics.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/current/_includes/v25.1/essential-metrics.md b/src/current/_includes/v25.1/essential-metrics.md
index 8a4ae347c7c..dfc1f5c96aa 100644
--- a/src/current/_includes/v25.1/essential-metrics.md
+++ b/src/current/_includes/v25.1/essential-metrics.md
@@ -93,10 +93,12 @@ The **Usage** column explains why each metric is important to visualize in a cus
 | <div style="width:225px">CockroachDB Metric Name</div> | {% if include.deployment == 'self-hosted' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)<br>(add `cockroachdb.` prefix)</div> |{% elsif include.deployment == 'advanced' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb_dedicated/#metrics)<br>(add `crdb_dedicated.` prefix)</div> |{% endif %}<div style="width:150px">Description</div>| Usage |
 | ----------------------------------------------------- | {% if include.deployment == 'self-hosted' %}------ |{% elsif include.deployment == 'advanced' %}---- |{% endif %} ------------------------------------------------------------ | ------------------------------------------------------------ |
 | leases.transfers.success                            | leases.transfers.success                                     | Number of successful lease transfers                         | A high number of [lease](architecture/replication-layer.html#leases) transfers is not a negative or positive signal, rather it is a reflection of the elastic cluster activities. For example, this metric is high during cluster topology changes. A high value is often the reason for NotLeaseHolderErrors which are normal and expected during rebalancing. Observing this metric may provide a confirmation of the cause of such errors. |
-| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. | [XXX](): USAGE??? |
-| rebalancing_range_rebalances | rebalancing.range.rebalances | Counter of the number of [load-based range rebalances]({% link {{ page.version.version }}/architecture/replication-layer.md %}#load-based-replica-rebalancing) motivated by store-level load imbalances. | [XXX](): USAGE??? |
-| rebalancing_replicas_queriespersecond | rebalancing.replicas.queriespersecond | Counter for the kv-level requests received per second by a given [replica]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-replica). | A high value of this metric for a particular replica could indicate that the replica is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). |
+| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. These lease transfers are tracked by a component which looks for [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance when looking at either QPS (`rebalancing.queriespersecond`) or CPU (`rebalancing.cpunanospersecond`) depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is rated), it can indicate thrashing where stores continuously move replicas and leases among each other without actually improving the overall load distribution in the medium to long term. If there is no activity on this metric and the stores remain imbalanced, this is an indicator that no rebalance activity is being undertaken to fix the imbalance, potentially signalling an issue. |
+| rebalancing_range_rebalances | {% if include.deployment == 'self-hosted' %}rebalancing.range.rebalances | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter of the number of [load-based range rebalances]({% link {{ page.version.version }}/architecture/replication-layer.md %}#load-based-replica-rebalancing). This range movement is tracked by a component which looks for [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance when looking at either QPS (`rebalancing.queriespersecond`) or CPU (`rebalancing.cpunanospersecond`) depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is rated), it can indicate thrashing where stores continuously move replicas and leases among each other without actually improving the overall load distribution in the medium to long term. If there is no activity on this metric and the stores remain imbalanced, this is an indicator that no rebalance activity is being undertaken to fix the imbalance, potentially signalling an issue. |
+| rebalancing_replicas_queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.queriespersecond | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter for the kv-level requests received per second by a given [replica]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-replica). This metric is a histogram which maintains buckets so you can query, e.g., the P95 replica's QPS or CPU. | A high value of this metric for a particular replica could indicate that the replica is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also: `rebalancing_replicas_cpunanospersecond`. |
+| rebalancing_replicas_cpunanospersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.cpunanospersecond | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter for the CPU nanoseconds of execution time per second by a given [replica]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-replica). This metric is a histogram which maintains buckets so you can query, e.g., the P95 replica's CPU. | A high value of this metric for a particular replica could indicate that the replica is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also the non-histogram variant: `rebalancing.cpunanospersecond`. |
 | rebalancing.queriespersecond                        | {% if include.deployment == 'self-hosted' %}rebalancing.queriespersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of kv-level requests received per second by the store, considering the last 30 minutes, as used in rebalancing decisions. | This metric shows hotspots along the queries per second (QPS) dimension. It provides insights into the ongoing rebalancing activities. |
+| rebalancing.cpunanospersecond                        | {% if include.deployment == 'self-hosted' %}rebalancing.cpunanospersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Non-histogram variant of `rebalancing_replicas_cpunanospersecond`. | See usage of `rebalancing_replicas_cpunanospersecond`. |
 | ranges                                              | ranges                                                       | Number of ranges                                             | This metric provides a measure of the scale of the data size.                  |
 | replicas                                            | {% if include.deployment == 'self-hosted' %}replicas.total |{% elsif include.deployment == 'advanced' %}replicas |{% endif %} Number of replicas                                           | This metric provides an essential characterization of the data distribution across cluster nodes. |
 | replicas.leaseholders                               | replicas.leaseholders                                        | Number of lease holders                                      | This metric provides an essential characterization of the data processing points across cluster nodes. |

From 41d6b0d6672ddfcc6d5066c19bd367296ef664ab Mon Sep 17 00:00:00 2001
From: Rich Loveland <rich@cockroachlabs.com>
Date: Tue, 18 Mar 2025 11:09:09 -0400
Subject: [PATCH 3/3] Update with kvoli feedback (2)

---
 src/current/_includes/v25.1/essential-metrics.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
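
Note for reviewers: the Usage text for `rebalancing_lease_transfers` and `rebalancing_range_rebalances` asks readers to interpret these counters as a rate. The sketch below shows one way that conversion could be done from scraped samples. It is illustrative only: the function name `per_second_rates`, the `(timestamp, value)` sample shape, and the counter-reset handling are assumptions of this note, not part of CockroachDB or of any monitoring integration.

```python
from typing import List, Sequence, Tuple


def per_second_rates(samples: Sequence[Tuple[float, float]]) -> List[float]:
    """Turn cumulative counter samples into per-second rates.

    Each sample is a hypothetical (timestamp_in_seconds, counter_value) pair,
    ordered by time. One rate is produced per pair of adjacent samples.
    """
    rates: List[float] = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            # Skip duplicate or out-of-order scrapes.
            continue
        delta = v1 - v0
        if delta < 0:
            # Assumption: a drop means the counter restarted from zero
            # (for example, after a node restart), so count v1 as the delta.
            delta = v1
        rates.append(delta / elapsed)
    return rates


# Hypothetical samples of a cumulative counter scraped every 10 seconds.
example = [(0.0, 0.0), (10.0, 20.0), (20.0, 50.0)]
print(per_second_rates(example))
```

A sustained rise in the resulting rate is what the Usage column describes as "high"; the raw cumulative value by itself only ever increases.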

diff --git a/src/current/_includes/v25.1/essential-metrics.md b/src/current/_includes/v25.1/essential-metrics.md
index dfc1f5c96aa..89120615172 100644
--- a/src/current/_includes/v25.1/essential-metrics.md
+++ b/src/current/_includes/v25.1/essential-metrics.md
@@ -93,10 +93,10 @@ The **Usage** column explains why each metric is important to visualize in a cus
 | <div style="width:225px">CockroachDB Metric Name</div> | {% if include.deployment == 'self-hosted' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)<br>(add `cockroachdb.` prefix)</div> |{% elsif include.deployment == 'advanced' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb_dedicated/#metrics)<br>(add `crdb_dedicated.` prefix)</div> |{% endif %}<div style="width:150px">Description</div>| Usage |
 | ----------------------------------------------------- | {% if include.deployment == 'self-hosted' %}------ |{% elsif include.deployment == 'advanced' %}---- |{% endif %} ------------------------------------------------------------ | ------------------------------------------------------------ |
 | leases.transfers.success                            | leases.transfers.success                                     | Number of successful lease transfers                         | A high number of [lease](architecture/replication-layer.html#leases) transfers is not a negative or positive signal, rather it is a reflection of the elastic cluster activities. For example, this metric is high during cluster topology changes. A high value is often the reason for NotLeaseHolderErrors which are normal and expected during rebalancing. Observing this metric may provide a confirmation of the cause of such errors. |
-| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. These lease transfers are tracked by a component which looks for [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance when looking at either QPS (`rebalancing.queriespersecond`) or CPU (`rebalancing.cpunanospersecond`) depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is rated), it can indicate thrashing where stores continuously move replicas and leases among each other without actually improving the overall load distribution in the medium to long term. If there is no activity on this metric and the stores remain imbalanced, this is an indicator that no rebalance activity is being undertaken to fix the imbalance, potentially signalling an issue. |
-| rebalancing_range_rebalances | {% if include.deployment == 'self-hosted' %}rebalancing.range.rebalances | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter of the number of [load-based range rebalances]({% link {{ page.version.version }}/architecture/replication-layer.md %}#load-based-replica-rebalancing). This range movement is tracked by a component which looks for [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance when looking at either QPS (`rebalancing.queriespersecond`) or CPU (`rebalancing.cpunanospersecond`) depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is rated), it can indicate thrashing where stores continuously move replicas and leases among each other without actually improving the overall load distribution in the medium to long term. If there is no activity on this metric and the stores remain imbalanced, this is an indicator that no rebalance activity is being undertaken to fix the imbalance, potentially signalling an issue. |
-| rebalancing_replicas_queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.queriespersecond | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter for the kv-level requests received per second by a given [replica]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-replica). This metric is a histogram which maintains buckets so you can query, e.g., the P95 replica's QPS or CPU. | A high value of this metric for a particular replica could indicate that the replica is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also: `rebalancing_replicas_cpunanospersecond`. |
-| rebalancing_replicas_cpunanospersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.cpunanospersecond | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter for the CPU nanoseconds of execution time per second by a given [replica]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-replica). This metric is a histogram which maintains buckets so you can query, e.g., the P95 replica's CPU. | A high value of this metric for a particular replica could indicate that the replica is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also the non-histogram variant: `rebalancing.cpunanospersecond`. |
+| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. These lease transfers are tracked by a component that looks for [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance in either QPS (`rebalancing.queriespersecond`) or CPU (`rebalancing.cpunanospersecond`), depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Use this metric to identify periods of increased rebalancing activity triggered by a QPS or CPU imbalance between stores. A high value, when the counter is viewed as a rate, indicates that more rebalancing activity is taking place due to load imbalance between stores. |
+| rebalancing_range_rebalances | {% if include.deployment == 'self-hosted' %}rebalancing.range.rebalances | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter of the number of [load-based range rebalances]({% link {{ page.version.version }}/architecture/replication-layer.md %}#load-based-replica-rebalancing). This range movement is tracked by a component that looks for [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance in either QPS (`rebalancing.queriespersecond`) or CPU (`rebalancing.cpunanospersecond`), depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Use this metric to identify periods of increased rebalancing activity triggered by a QPS or CPU imbalance between stores. A high value, when the counter is viewed as a rate, indicates that more rebalancing activity is taking place due to load imbalance between stores. |
+| rebalancing_replicas_queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.queriespersecond | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter for the kv-level requests received per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates all of the CPU and QPS stats across all its replicas and then creates a histogram which maintains buckets so you can query, e.g., the P95 replica's QPS or CPU. | A high value of this metric could indicate that one of the store's replicas is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also: `rebalancing_replicas_cpunanospersecond`. |
+| rebalancing_replicas_cpunanospersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.cpunanospersecond | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter for the CPU nanoseconds of execution time per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates all of the CPU and QPS stats across all its replicas and then creates a histogram which maintains buckets so you can query, e.g., the P95 replica's QPS or CPU. | A high value of this metric could indicate that one of the store's replicas is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also the non-histogram variant: `rebalancing.cpunanospersecond`. |
 | rebalancing.queriespersecond                        | {% if include.deployment == 'self-hosted' %}rebalancing.queriespersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of kv-level requests received per second by the store, considering the last 30 minutes, as used in rebalancing decisions. | This metric shows hotspots along the queries per second (QPS) dimension. It provides insights into the ongoing rebalancing activities. |
 | rebalancing.cpunanospersecond                        | {% if include.deployment == 'self-hosted' %}rebalancing.cpunanospersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Non-histogram variant of `rebalancing_replicas_cpunanospersecond`. | See usage of `rebalancing_replicas_cpunanospersecond`. |
 | ranges                                              | ranges                                                       | Number of ranges                                             | This metric provides a measure of the scale of the data size.                  |