diff --git a/src/current/_includes/v24.1/essential-metrics.md b/src/current/_includes/v24.1/essential-metrics.md
index ec2e607f932..5f5754d17a9 100644
--- a/src/current/_includes/v24.1/essential-metrics.md
+++ b/src/current/_includes/v24.1/essential-metrics.md
@@ -35,6 +35,7 @@ The **Usage** column explains why each metric is important to visualize in a cus
| capacity | {% if include.deployment == 'self-hosted' %}capacity.total |{% elsif include.deployment == 'advanced' %}capacity |{% endif %} Total storage capacity | This metric gives total storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
| capacity.available | capacity.available | Available storage capacity | This metric gives available storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
| capacity.used | capacity.used | Used storage capacity | This metric gives used storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
+| storage.wal.fsync.latency | {% if include.deployment == 'self-hosted' %}storage.wal.fsync.latency |{% elsif include.deployment == 'advanced' %}storage.wal.fsync.latency |{% endif %} Latency of writes to the [WAL]({% link {{ page.version.version }}/architecture/storage-layer.md %}#memtable-and-write-ahead-log). | If this value is greater than `100ms`, it is an indication of a [disk stall]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#disk-stalls). To mitigate the effects of disk stalls, consider deploying your cluster with [WAL failover]({% link {{ page.version.version }}/cockroach-start.md %}#flag-wal-failover) configured. |
| storage.write-stalls | {% if include.deployment == 'self-hosted' %}storage.write.stalls |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of instances of intentional write stalls to backpressure incoming writes | This metric reports actual disk stall events. Ideally, investigate all reports of disk stalls. As a pratical guideline, one stall per minute is not likely to have a material impact on workload beyond an occasional increase in response time. However one stall per second should be viewed as problematic and investigated actively. It is particularly problematic if the rate persists over an extended period of time, and worse, if it is increasing. |
| rocksdb.compactions | rocksdb.compactions.total | Number of SST compactions | This metric reports the number of a node's [LSM compactions]({% link {{ page.version.version }}/common-issues-to-monitor.md %}#lsm-health). If the number of compactions remains elevated while the LSM health does not improve, compactions are not keeping up with the workload. If the condition persists for an extended period, the cluster will initially exhibit performance issues that will eventually escalate into stability issues. |
| rocksdb.block.cache.hits | rocksdb.block.cache.hits | Count of block cache hits | This metric gives hits to block cache which is reserved memory. It is allocated upon the start of a node process by the [`--cache` flag]({% link {{ page.version.version }}/cockroach-start.md %}#general) and never shrinks. By observing block cache hits and misses, you can fine-tune memory allocations in the node process for the demands of the workload. |
@@ -92,7 +93,12 @@ The **Usage** column explains why each metric is important to visualize in a cus
|
CockroachDB Metric Name
| {% if include.deployment == 'self-hosted' %}[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)
(add `cockroachdb.` prefix)
|{% elsif include.deployment == 'advanced' %}[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb_dedicated/#metrics)
(add `crdb_dedicated.` prefix)
|{% endif %}Description
| Usage |
| ----------------------------------------------------- | {% if include.deployment == 'self-hosted' %}------ |{% elsif include.deployment == 'advanced' %}---- |{% endif %} ------------------------------------------------------------ | ------------------------------------------------------------ |
| leases.transfers.success | leases.transfers.success | Number of successful lease transfers | A high number of [lease](architecture/replication-layer.html#leases) transfers is not a negative or positive signal, rather it is a reflection of the elastic cluster activities. For example, this metric is high during cluster topology changes. A high value is often the reason for NotLeaseHolderErrors which are normal and expected during rebalancing. Observing this metric may provide a confirmation of the cause of such errors. |
+| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. These lease transfers are tracked by a component that looks for a [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance of either QPS (`rebalancing.queriespersecond`) or CPU usage (`rebalancing.cpunanospersecond`), depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify rebalancing activity triggered by load imbalance between stores. A high value of this counter, when viewed as a rate, indicates that more rebalancing activity is taking place due to QPS or CPU imbalance between stores. |
+| rebalancing_range_rebalances | {% if include.deployment == 'self-hosted' %}rebalancing.range.rebalances |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter of the number of [load-based range rebalances]({% link {{ page.version.version }}/architecture/replication-layer.md %}#load-based-replica-rebalancing). This range movement is tracked by a component that looks for a [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance of either QPS (`rebalancing.queriespersecond`) or CPU usage (`rebalancing.cpunanospersecond`), depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify rebalancing activity triggered by load imbalance between stores. A high value of this counter, when viewed as a rate, indicates that more rebalancing activity is taking place due to QPS or CPU imbalance between stores. |
+| rebalancing_replicas_queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.queriespersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Histogram of the KV-level requests received per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates the CPU and QPS stats across all of its replicas and then creates a histogram with buckets that can be queried for, e.g., the QPS or CPU of the P95 replica. | A high value of this metric could indicate that one of the store's replicas is part of a hot range. See also: `rebalancing_replicas_cpunanospersecond`. |
+| rebalancing_replicas_cpunanospersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.cpunanospersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Histogram of the CPU nanoseconds of execution time per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates the CPU and QPS stats across all of its replicas and then creates a histogram with buckets that can be queried for, e.g., the QPS or CPU of the P95 replica. | A high value of this metric could indicate that one of the store's replicas is part of a hot range. See also the non-histogram variant: `rebalancing.cpunanospersecond`. |
| rebalancing.queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.queriespersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of kv-level requests received per second by the store, considering the last 30 minutes, as used in rebalancing decisions. | This metric shows hotspots along the queries per second (QPS) dimension. It provides insights into the ongoing rebalancing activities. |
+| rebalancing.cpunanospersecond | {% if include.deployment == 'self-hosted' %}rebalancing.cpunanospersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Non-histogram variant of `rebalancing_replicas_cpunanospersecond`. | See usage of `rebalancing_replicas_cpunanospersecond`. |
| ranges | ranges | Number of ranges | This metric provides a measure of the scale of the data size. |
| replicas | {% if include.deployment == 'self-hosted' %}replicas.total |{% elsif include.deployment == 'advanced' %}replicas |{% endif %} Number of replicas | This metric provides an essential characterization of the data distribution across cluster nodes. |
| replicas.leaseholders | replicas.leaseholders | Number of lease holders | This metric provides an essential characterization of the data processing points across cluster nodes. |
diff --git a/src/current/_includes/v24.3/essential-metrics.md b/src/current/_includes/v24.3/essential-metrics.md
index 84cd57318f5..1b0aac55a0b 100644
--- a/src/current/_includes/v24.3/essential-metrics.md
+++ b/src/current/_includes/v24.3/essential-metrics.md
@@ -35,6 +35,7 @@ The **Usage** column explains why each metric is important to visualize in a cus
| capacity | {% if include.deployment == 'self-hosted' %}capacity.total |{% elsif include.deployment == 'advanced' %}capacity |{% endif %} Total storage capacity | This metric gives total storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
| capacity.available | capacity.available | Available storage capacity | This metric gives available storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
| capacity.used | capacity.used | Used storage capacity | This metric gives used storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
+| storage.wal.fsync.latency | {% if include.deployment == 'self-hosted' %}storage.wal.fsync.latency |{% elsif include.deployment == 'advanced' %}storage.wal.fsync.latency |{% endif %} Latency of writes to the [WAL]({% link {{ page.version.version }}/architecture/storage-layer.md %}#memtable-and-write-ahead-log). | If this value is greater than `100ms`, it is an indication of a [disk stall]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#disk-stalls). To mitigate the effects of disk stalls, consider deploying your cluster with [WAL failover]({% link {{ page.version.version }}/wal-failover.md %}) configured. |
| storage.write-stalls | {% if include.deployment == 'self-hosted' %}storage.write.stalls |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of instances of intentional write stalls to backpressure incoming writes | This metric reports actual disk stall events. Ideally, investigate all reports of disk stalls. As a pratical guideline, one stall per minute is not likely to have a material impact on workload beyond an occasional increase in response time. However one stall per second should be viewed as problematic and investigated actively. It is particularly problematic if the rate persists over an extended period of time, and worse, if it is increasing. |
| rocksdb.compactions | rocksdb.compactions.total | Number of SST compactions | This metric reports the number of a node's [LSM compactions]({% link {{ page.version.version }}/common-issues-to-monitor.md %}#lsm-health). If the number of compactions remains elevated while the LSM health does not improve, compactions are not keeping up with the workload. If the condition persists for an extended period, the cluster will initially exhibit performance issues that will eventually escalate into stability issues. |
| rocksdb.block.cache.hits | rocksdb.block.cache.hits | Count of block cache hits | This metric gives hits to block cache which is reserved memory. It is allocated upon the start of a node process by the [`--cache` flag]({% link {{ page.version.version }}/cockroach-start.md %}#general) and never shrinks. By observing block cache hits and misses, you can fine-tune memory allocations in the node process for the demands of the workload. |
@@ -92,7 +93,12 @@ The **Usage** column explains why each metric is important to visualize in a cus
| CockroachDB Metric Name
| {% if include.deployment == 'self-hosted' %}[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)
(add `cockroachdb.` prefix)
|{% elsif include.deployment == 'advanced' %}[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb_dedicated/#metrics)
(add `crdb_dedicated.` prefix)
|{% endif %}Description
| Usage |
| ----------------------------------------------------- | {% if include.deployment == 'self-hosted' %}------ |{% elsif include.deployment == 'advanced' %}---- |{% endif %} ------------------------------------------------------------ | ------------------------------------------------------------ |
| leases.transfers.success | leases.transfers.success | Number of successful lease transfers | A high number of [lease](architecture/replication-layer.html#leases) transfers is not a negative or positive signal, rather it is a reflection of the elastic cluster activities. For example, this metric is high during cluster topology changes. A high value is often the reason for NotLeaseHolderErrors which are normal and expected during rebalancing. Observing this metric may provide a confirmation of the cause of such errors. |
+| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. These lease transfers are tracked by a component that looks for a [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance of either QPS (`rebalancing.queriespersecond`) or CPU usage (`rebalancing.cpunanospersecond`), depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify rebalancing activity triggered by load imbalance between stores. A high value of this counter, when viewed as a rate, indicates that more rebalancing activity is taking place due to QPS or CPU imbalance between stores. |
+| rebalancing_range_rebalances | {% if include.deployment == 'self-hosted' %}rebalancing.range.rebalances |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter of the number of [load-based range rebalances]({% link {{ page.version.version }}/architecture/replication-layer.md %}#load-based-replica-rebalancing). This range movement is tracked by a component that looks for a [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance of either QPS (`rebalancing.queriespersecond`) or CPU usage (`rebalancing.cpunanospersecond`), depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify rebalancing activity triggered by load imbalance between stores. A high value of this counter, when viewed as a rate, indicates that more rebalancing activity is taking place due to QPS or CPU imbalance between stores. |
+| rebalancing_replicas_queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.queriespersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Histogram of the KV-level requests received per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates the CPU and QPS stats across all of its replicas and then creates a histogram with buckets that can be queried for, e.g., the QPS or CPU of the P95 replica. | A high value of this metric could indicate that one of the store's replicas is part of a hot range. See also: `rebalancing_replicas_cpunanospersecond`. |
+| rebalancing_replicas_cpunanospersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.cpunanospersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Histogram of the CPU nanoseconds of execution time per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates the CPU and QPS stats across all of its replicas and then creates a histogram with buckets that can be queried for, e.g., the QPS or CPU of the P95 replica. | A high value of this metric could indicate that one of the store's replicas is part of a hot range. See also the non-histogram variant: `rebalancing.cpunanospersecond`. |
| rebalancing.queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.queriespersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of kv-level requests received per second by the store, considering the last 30 minutes, as used in rebalancing decisions. | This metric shows hotspots along the queries per second (QPS) dimension. It provides insights into the ongoing rebalancing activities. |
+| rebalancing.cpunanospersecond | {% if include.deployment == 'self-hosted' %}rebalancing.cpunanospersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Non-histogram variant of `rebalancing_replicas_cpunanospersecond`. | See usage of `rebalancing_replicas_cpunanospersecond`. |
| ranges | ranges | Number of ranges | This metric provides a measure of the scale of the data size. |
| replicas | {% if include.deployment == 'self-hosted' %}replicas.total |{% elsif include.deployment == 'advanced' %}replicas |{% endif %} Number of replicas | This metric provides an essential characterization of the data distribution across cluster nodes. |
| replicas.leaseholders | replicas.leaseholders | Number of lease holders | This metric provides an essential characterization of the data processing points across cluster nodes. |
diff --git a/src/current/_includes/v25.1/essential-metrics.md b/src/current/_includes/v25.1/essential-metrics.md
index 8ab0885e76a..8586b4f9390 100644
--- a/src/current/_includes/v25.1/essential-metrics.md
+++ b/src/current/_includes/v25.1/essential-metrics.md
@@ -35,6 +35,7 @@ The **Usage** column explains why each metric is important to visualize in a cus
| capacity | {% if include.deployment == 'self-hosted' %}capacity.total |{% elsif include.deployment == 'advanced' %}capacity |{% endif %} Total storage capacity | This metric gives total storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
| capacity.available | capacity.available | Available storage capacity | This metric gives available storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
| capacity.used | capacity.used | Used storage capacity | This metric gives used storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
+| storage.wal.fsync.latency | {% if include.deployment == 'self-hosted' %}storage.wal.fsync.latency |{% elsif include.deployment == 'advanced' %}storage.wal.fsync.latency |{% endif %} Latency of writes to the [WAL]({% link {{ page.version.version }}/architecture/storage-layer.md %}#memtable-and-write-ahead-log). | If this value is greater than `100ms`, it is an indication of a [disk stall]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#disk-stalls). To mitigate the effects of disk stalls, consider deploying your cluster with [WAL failover]({% link {{ page.version.version }}/wal-failover.md %}) configured. |
| storage.write-stalls | {% if include.deployment == 'self-hosted' %}storage.write.stalls |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of instances of intentional write stalls to backpressure incoming writes | This metric reports actual disk stall events. Ideally, investigate all reports of disk stalls. As a pratical guideline, one stall per minute is not likely to have a material impact on workload beyond an occasional increase in response time. However one stall per second should be viewed as problematic and investigated actively. It is particularly problematic if the rate persists over an extended period of time, and worse, if it is increasing. |
| rocksdb.compactions | rocksdb.compactions.total | Number of SST compactions | This metric reports the number of a node's [LSM compactions]({% link {{ page.version.version }}/common-issues-to-monitor.md %}#lsm-health). If the number of compactions remains elevated while the LSM health does not improve, compactions are not keeping up with the workload. If the condition persists for an extended period, the cluster will initially exhibit performance issues that will eventually escalate into stability issues. |
| rocksdb.block.cache.hits | rocksdb.block.cache.hits | Count of block cache hits | This metric gives hits to block cache which is reserved memory. It is allocated upon the start of a node process by the [`--cache` flag]({% link {{ page.version.version }}/cockroach-start.md %}#general) and never shrinks. By observing block cache hits and misses, you can fine-tune memory allocations in the node process for the demands of the workload. |
@@ -92,7 +93,12 @@ The **Usage** column explains why each metric is important to visualize in a cus
| CockroachDB Metric Name
| {% if include.deployment == 'self-hosted' %}[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)
(add `cockroachdb.` prefix)
|{% elsif include.deployment == 'advanced' %}[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb_dedicated/#metrics)
(add `crdb_dedicated.` prefix)
|{% endif %}Description
| Usage |
| ----------------------------------------------------- | {% if include.deployment == 'self-hosted' %}------ |{% elsif include.deployment == 'advanced' %}---- |{% endif %} ------------------------------------------------------------ | ------------------------------------------------------------ |
| leases.transfers.success | leases.transfers.success | Number of successful lease transfers | A high number of [lease](architecture/replication-layer.html#leases) transfers is not a negative or positive signal, rather it is a reflection of the elastic cluster activities. For example, this metric is high during cluster topology changes. A high value is often the reason for NotLeaseHolderErrors which are normal and expected during rebalancing. Observing this metric may provide a confirmation of the cause of such errors. |
+| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. These lease transfers are tracked by a component that looks for a [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance of either QPS (`rebalancing.queriespersecond`) or CPU usage (`rebalancing.cpunanospersecond`), depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify rebalancing activity triggered by load imbalance between stores. A high value of this counter, when viewed as a rate, indicates that more rebalancing activity is taking place due to QPS or CPU imbalance between stores. |
+| rebalancing_range_rebalances | {% if include.deployment == 'self-hosted' %}rebalancing.range.rebalances |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter of the number of [load-based range rebalances]({% link {{ page.version.version }}/architecture/replication-layer.md %}#load-based-replica-rebalancing). This range movement is tracked by a component that looks for a [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance of either QPS (`rebalancing.queriespersecond`) or CPU usage (`rebalancing.cpunanospersecond`), depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify rebalancing activity triggered by load imbalance between stores. A high value of this counter, when viewed as a rate, indicates that more rebalancing activity is taking place due to QPS or CPU imbalance between stores. |
+| rebalancing_replicas_queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.queriespersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Histogram of the KV-level requests received per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates the CPU and QPS stats across all of its replicas and then creates a histogram with buckets that can be queried for, e.g., the QPS or CPU of the P95 replica. | A high value of this metric could indicate that one of the store's replicas is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also: `rebalancing_replicas_cpunanospersecond`. |
+| rebalancing_replicas_cpunanospersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.cpunanospersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Histogram of the CPU nanoseconds of execution time per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates the CPU and QPS stats across all of its replicas and then creates a histogram with buckets that can be queried for, e.g., the QPS or CPU of the P95 replica. | A high value of this metric could indicate that one of the store's replicas is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also the non-histogram variant: `rebalancing.cpunanospersecond`. |
| rebalancing.queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.queriespersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of kv-level requests received per second by the store, considering the last 30 minutes, as used in rebalancing decisions. | This metric shows hotspots along the queries per second (QPS) dimension. It provides insights into the ongoing rebalancing activities. |
+| rebalancing.cpunanospersecond | {% if include.deployment == 'self-hosted' %}rebalancing.cpunanospersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Non-histogram variant of `rebalancing_replicas_cpunanospersecond`. | See usage of `rebalancing_replicas_cpunanospersecond`. |
| ranges | ranges | Number of ranges | This metric provides a measure of the scale of the data size. |
| replicas | {% if include.deployment == 'self-hosted' %}replicas.total |{% elsif include.deployment == 'advanced' %}replicas |{% endif %} Number of replicas | This metric provides an essential characterization of the data distribution across cluster nodes. |
| replicas.leaseholders | replicas.leaseholders | Number of lease holders | This metric provides an essential characterization of the data processing points across cluster nodes. |
diff --git a/src/current/_includes/v25.2/essential-metrics.md b/src/current/_includes/v25.2/essential-metrics.md
index 2ff8bb6098c..82ce5f26c8c 100644
--- a/src/current/_includes/v25.2/essential-metrics.md
+++ b/src/current/_includes/v25.2/essential-metrics.md
@@ -35,6 +35,7 @@ The **Usage** column explains why each metric is important to visualize in a cus
| capacity | {% if include.deployment == 'self-hosted' %}capacity.total |{% elsif include.deployment == 'advanced' %}capacity |{% endif %} Total storage capacity | This metric gives total storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
| capacity.available | capacity.available | Available storage capacity | This metric gives available storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
| capacity.used | capacity.used | Used storage capacity | This metric gives used storage capacity. Measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). |
+| storage.wal.fsync.latency | {% if include.deployment == 'self-hosted' %}storage.wal.fsync.latency |{% elsif include.deployment == 'advanced' %}storage.wal.fsync.latency |{% endif %} Latency of writes to the [WAL]({% link {{ page.version.version }}/architecture/storage-layer.md %}#memtable-and-write-ahead-log). | If this value is greater than `100ms`, it is an indication of a [disk stall]({% link {{ page.version.version }}/cluster-setup-troubleshooting.md %}#disk-stalls). To mitigate the effects of disk stalls, consider deploying your cluster with [WAL failover]({% link {{ page.version.version }}/wal-failover.md %}) configured. |
| storage.write-stalls | {% if include.deployment == 'self-hosted' %}storage.write.stalls |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of instances of intentional write stalls to backpressure incoming writes | This metric reports actual disk stall events. Ideally, investigate all reports of disk stalls. As a pratical guideline, one stall per minute is not likely to have a material impact on workload beyond an occasional increase in response time. However one stall per second should be viewed as problematic and investigated actively. It is particularly problematic if the rate persists over an extended period of time, and worse, if it is increasing. |
| rocksdb.compactions | rocksdb.compactions.total | Number of SST compactions | This metric reports the number of a node's [LSM compactions]({% link {{ page.version.version }}/common-issues-to-monitor.md %}#lsm-health). If the number of compactions remains elevated while the LSM health does not improve, compactions are not keeping up with the workload. If the condition persists for an extended period, the cluster will initially exhibit performance issues that will eventually escalate into stability issues. |
| rocksdb.block.cache.hits | rocksdb.block.cache.hits | Count of block cache hits | This metric gives hits to block cache which is reserved memory. It is allocated upon the start of a node process by the [`--cache` flag]({% link {{ page.version.version }}/cockroach-start.md %}#general) and never shrinks. By observing block cache hits and misses, you can fine-tune memory allocations in the node process for the demands of the workload. |
@@ -92,7 +93,12 @@ The **Usage** column explains why each metric is important to visualize in a cus
| CockroachDB Metric Name
| {% if include.deployment == 'self-hosted' %}[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)
(add `cockroachdb.` prefix)
|{% elsif include.deployment == 'advanced' %}[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb_dedicated/#metrics)
(add `crdb_dedicated.` prefix)
|{% endif %}Description
| Usage |
| ----------------------------------------------------- | {% if include.deployment == 'self-hosted' %}------ |{% elsif include.deployment == 'advanced' %}---- |{% endif %} ------------------------------------------------------------ | ------------------------------------------------------------ |
| leases.transfers.success | leases.transfers.success | Number of successful lease transfers | A high number of [lease](architecture/replication-layer.html#leases) transfers is not a negative or positive signal, rather it is a reflection of the elastic cluster activities. For example, this metric is high during cluster topology changes. A high value is often the reason for NotLeaseHolderErrors which are normal and expected during rebalancing. Observing this metric may provide a confirmation of the cause of such errors. |
+| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. These lease transfers are tracked by a component that looks for a [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance of either QPS (`rebalancing.queriespersecond`) or CPU usage (`rebalancing.cpunanospersecond`), depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify rebalancing activity triggered by load imbalance between stores. A high value of this counter, when viewed as a rate, indicates that more rebalancing activity is taking place due to QPS or CPU imbalance between stores. |
+| rebalancing_range_rebalances | {% if include.deployment == 'self-hosted' %}rebalancing.range.rebalances |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter of the number of [load-based range rebalances]({% link {{ page.version.version }}/architecture/replication-layer.md %}#load-based-replica-rebalancing). This range movement is tracked by a component that looks for a [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance of either QPS (`rebalancing.queriespersecond`) or CPU usage (`rebalancing.cpunanospersecond`), depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify rebalancing activity triggered by load imbalance between stores. A high value of this counter, when viewed as a rate, indicates that more rebalancing activity is taking place due to QPS or CPU imbalance between stores. |
+| rebalancing_replicas_queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.queriespersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Histogram of the KV-level requests received per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates the CPU and QPS stats across all of its replicas and then creates a histogram with buckets that can be queried for, e.g., the QPS or CPU of the P95 replica. | A high value of this metric could indicate that one of the store's replicas is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also: `rebalancing_replicas_cpunanospersecond`. |
+| rebalancing_replicas_cpunanospersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.cpunanospersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Histogram of the CPU nanoseconds of execution time per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates the CPU and QPS stats across all of its replicas and then creates a histogram with buckets that can be queried for, e.g., the QPS or CPU of the P95 replica. | A high value of this metric could indicate that one of the store's replicas is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also the non-histogram variant: `rebalancing.cpunanospersecond`. |
| rebalancing.queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.queriespersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Number of kv-level requests received per second by the store, considering the last 30 minutes, as used in rebalancing decisions. | This metric shows hotspots along the queries per second (QPS) dimension. It provides insights into the ongoing rebalancing activities. |
+| rebalancing.cpunanospersecond | {% if include.deployment == 'self-hosted' %}rebalancing.cpunanospersecond |{% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Non-histogram variant of `rebalancing_replicas_cpunanospersecond`. | See usage of `rebalancing_replicas_cpunanospersecond`. |
| ranges | ranges | Number of ranges | This metric provides a measure of the scale of the data size. |
| replicas | {% if include.deployment == 'self-hosted' %}replicas.total |{% elsif include.deployment == 'advanced' %}replicas |{% endif %} Number of replicas | This metric provides an essential characterization of the data distribution across cluster nodes. |
| replicas.leaseholders | replicas.leaseholders | Number of lease holders | This metric provides an essential characterization of the data processing points across cluster nodes. |
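The Usage text for the new rebalancing counters depends on reading them as a rate rather than as a raw cumulative count. Below is a minimal sketch of that reading, assuming the underscore-style names in the first column match the names exposed on a node's Prometheus endpoint (`/_status/vars`), that `localhost:8080` stands in for the node's HTTP address and port, and that a two-sample delta is an acceptable approximation of what a monitoring system would compute with PromQL's `rate()`:

```python
import time
import urllib.request

# Hypothetical node address; substitute your node's HTTP address and port.
VARS_URL = "http://localhost:8080/_status/vars"

# Counter names assumed from the table above (dots in the CockroachDB
# names become underscores in the Prometheus-exposed format).
COUNTERS = ("rebalancing_lease_transfers", "rebalancing_range_rebalances")


def scrape(url):
    """Return {counter_name: summed value} from the Prometheus exposition text."""
    totals = {}
    with urllib.request.urlopen(url) as resp:
        for raw in resp.read().decode("utf-8").splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            # Exposition lines look like `name{labels} value`; split off the value.
            name_and_labels, _, value = line.rpartition(" ")
            name = name_and_labels.split("{", 1)[0]
            if name in COUNTERS:
                try:
                    # Sum across label combinations (for example, per-store labels).
                    totals[name] = totals.get(name, 0.0) + float(value)
                except ValueError:
                    pass
    return totals


def counter_rates(url, window_seconds=60):
    """Approximate each counter's per-second rate over a sampling window."""
    before = scrape(url)
    time.sleep(window_seconds)
    after = scrape(url)
    return {
        name: (after.get(name, 0.0) - before.get(name, 0.0)) / window_seconds
        for name in COUNTERS
    }


if __name__ == "__main__":
    for name, per_second in counter_rates(VARS_URL).items():
        print(f"{name}: {per_second:.3f} per second")
```

In practice this measurement is better left to the monitoring system itself; the point of the sketch is only that these metrics are cumulative counters, so their rate of change, not their absolute value, is what signals load-imbalance-driven rebalancing.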