
Add storage.wal.fsync.latency and other metrics #19425

Open

rmloveland wants to merge 3 commits into main from 20250306-DOC-11996-storage-essential-metrics

Conversation

rmloveland (Contributor)

Fixes DOC-11996

rmloveland marked this pull request as draft on March 6, 2025 16:43

github-actions bot commented Mar 6, 2025

Files changed:

netlify bot commented Mar 6, 2025

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

🔨 Latest commit: 41d6b0d
🔍 Latest deploy log: https://app.netlify.com/sites/cockroachdb-interactivetutorials-docs/deploys/67d98cac27965f0008e00f04

netlify bot commented Mar 6, 2025

Deploy Preview for cockroachdb-api-docs canceled.

🔨 Latest commit: 41d6b0d
🔍 Latest deploy log: https://app.netlify.com/sites/cockroachdb-api-docs/deploys/67d98caca888940008f9cf3f

netlify bot commented Mar 6, 2025

Netlify Preview

🔨 Latest commit: 41d6b0d
🔍 Latest deploy log: https://app.netlify.com/sites/cockroachdb-docs/deploys/67d98cacc67ffb0009ae7bb7
😎 Deploy Preview: https://deploy-preview-19425--cockroachdb-docs.netlify.app

rmloveland force-pushed the 20250306-DOC-11996-storage-essential-metrics branch from 2392785 to 5d8469e on March 10, 2025 15:16
rmloveland changed the title from "Add storage.wal.fsync.latency to metrics" to "Add storage.wal.fsync.latency and other metrics" on Mar 10, 2025
Fixes DOC-11996

Adds the following metrics to the docs:

- storage.wal.fsync.latency
- rebalancing.range.rebalances
- rebalancing.replicas.queriespersecond
rmloveland force-pushed the 20250306-DOC-11996-storage-essential-metrics branch from 5d8469e to 78c6838 on March 10, 2025 15:32
@@ -92,6 +93,9 @@ The **Usage** column explains why each metric is important to visualize in a cus
| <div style="width:225px">CockroachDB Metric Name</div> | {% if include.deployment == 'self-hosted' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)<br>(add `cockroachdb.` prefix)</div> |{% elsif include.deployment == 'advanced' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb_dedicated/#metrics)<br>(add `crdb_dedicated.` prefix)</div> |{% endif %}<div style="width:150px">Description</div>| Usage |
| ----------------------------------------------------- | {% if include.deployment == 'self-hosted' %}------ |{% elsif include.deployment == 'advanced' %}---- |{% endif %} ------------------------------------------------------------ | ------------------------------------------------------------ |
| leases.transfers.success | leases.transfers.success | Number of successful lease transfers | A high number of [lease](architecture/replication-layer.html#leases) transfers is not a negative or positive signal, rather it is a reflection of the elastic cluster activities. For example, this metric is high during cluster topology changes. A high value is often the reason for NotLeaseHolderErrors which are normal and expected during rebalancing. Observing this metric may provide a confirmation of the cause of such errors. |
| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. | [XXX](): USAGE??? |
rmloveland (Contributor, Author) commented:

hi @kvoli

i've been asked to add some docs for the following metrics which appear to be replication-related (via https://cockroachlabs.atlassian.net/browse/DOC-11996):

  • rebalancing_lease_transfers
  • rebalancing_range_rebalances
  • rebalancing_replicas_queriespersecond

i've taken a shot at writing descriptions for each which i'd really appreciate your feedback on

there is also a 'usage' column which appears to explain "what is this metric for / why should i watch it?". in the case of rebalancing_replicas_queriespersecond, i tried to write this based on an old slack convo of yours i found in glean - please let me know what you think, happy to update

for the other two metrics, I wasn't sure what to write w.r.t. usage. can you please explain what these metrics are used for / why a user should pay attention to them? my sense is something like: too much range rebalancing/lease transfer = bad, but i'm guessing there is more to it than that!

kvoli commented:

Hey @rmloveland,

For rebalancing_lease_transfers, these are lease transfers specific to a certain component which looks for store-level load imbalance, looking at either QPS (rebalancing.queriespersecond) or CPU (rebalancing.cpunanospersecond) depending on what the kv.allocator.load_based_rebalancing.objective cluster setting is set to (qps or cpu). Likewise for rebalancing_range_rebalances, except the action is moving the replicas of a range (potentially including the leaseholder), instead of just the lease.

For rebalancing_replicas_queriespersecond, this is a histogram (also see rebalancing_replicas_cpunanospersecond) which, instead of only maintaining a gauge of the store's currently reported QPS or CPU, maintains buckets so a user could query the PXX replica's QPS or CPU. If we don't already have docs for the non-histogram variants rebalancing.queriespersecond and rebalancing.cpunanospersecond, I'd suggest including these instead.
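
(To make this concrete, here is a quick sketch of checking which objective the rebalancer is using and reading the corresponding store-level load metrics over SQL. The `crdb_internal.node_metrics` virtual table and the dotted metric names are assumptions here and may vary by version; the setting name and its qps/cpu values are as described above.)

```sql
-- Show which load signal the rebalancer is balancing on: 'qps' or 'cpu'.
SHOW CLUSTER SETTING kv.allocator.load_based_rebalancing.objective;

-- Optionally switch the objective (illustrative; 'cpu' chosen arbitrarily).
SET CLUSTER SETTING kv.allocator.load_based_rebalancing.objective = 'cpu';

-- Read the store-level load metrics mentioned above, assuming they are
-- exposed under dotted names in the crdb_internal.node_metrics table.
SELECT name, value
FROM crdb_internal.node_metrics
WHERE name IN ('rebalancing.queriespersecond', 'rebalancing.cpunanospersecond');
```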

I generally use and have seen the two rebalance action counter metrics rebalancing_(lease_transfers|range_rebalances) used to identify when there has been more rebalancing activity triggered by imbalance (QPS or CPU) between stores.

For the rebalancing load metrics, since these map directly to what the rebalancing algorithm is looking at when attempting to balance load, I have seen these used to identify the efficiency of rebalancing and to check for store-level imbalances (in cases where there is a hardware-level load imbalance).

my sense is something like: too much range rebalancing/lease transfer = bad, but i'm guessing there is more to it than that!

Partially answered above: if these are high (when the count is rated), it can indicate thrashing, where stores continuously move replicas and leases among each other without actually improving the overall load distribution in the medium to long term. If there is no activity and the stores remain imbalanced, this is an indicator that no rebalance activity is being undertaken to fix the imbalance, potentially signalling an issue to the user.
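
(As a minimal illustration of "when the count is rated": sample the action counters at two points in time and divide the delta by the interval. This again assumes the `crdb_internal.node_metrics` virtual table and dotted metric names; a monitoring system such as Prometheus would normally compute this rate for you.)

```sql
-- Sample the rebalancing action counters. Run this twice, e.g. 60 seconds
-- apart, and compute (later_value - earlier_value) / 60 for a per-second
-- rate; a persistently high rate can indicate thrashing.
SELECT name, value
FROM crdb_internal.node_metrics
WHERE name IN ('rebalancing.lease.transfers', 'rebalancing.range.rebalances');
```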

rmloveland (Contributor, Author) replied:


Thanks, in the latest commit I have tried to:

  • update rebalancing_lease_transfers and rebalancing_range_rebalances to incorporate info from your response above in both the 'Description' and 'Usage' columns, as well as adding a "see also: the non-histogram variant X" to the metrics that had such variants
  • make sure we had docs for the non-histogram variants you mentioned: rebalancing.queriespersecond (already existed) and rebalancing.cpunanospersecond (wrote new)

PTAL and let me know what you think

PS i'm updating this to a "real PR" now since I think these will be the bulk of the changes (for the WAL failover metric, i'll get a separate review from an engineer on the Storage team once we're happy with these replication-related metrics)

rmloveland marked this pull request as ready for review on March 13, 2025 18:11
rmloveland requested a review from kvoli on March 13, 2025 18:11
@@ -92,7 +93,12 @@ The **Usage** column explains why each metric is important to visualize in a cus
| <div style="width:225px">CockroachDB Metric Name</div> | {% if include.deployment == 'self-hosted' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)<br>(add `cockroachdb.` prefix)</div> |{% elsif include.deployment == 'advanced' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb_dedicated/#metrics)<br>(add `crdb_dedicated.` prefix)</div> |{% endif %}<div style="width:150px">Description</div>| Usage |
| ----------------------------------------------------- | {% if include.deployment == 'self-hosted' %}------ |{% elsif include.deployment == 'advanced' %}---- |{% endif %} ------------------------------------------------------------ | ------------------------------------------------------------ |
| leases.transfers.success | leases.transfers.success | Number of successful lease transfers | A high number of [lease](architecture/replication-layer.html#leases) transfers is not a negative or positive signal, rather it is a reflection of the elastic cluster activities. For example, this metric is high during cluster topology changes. A high value is often the reason for NotLeaseHolderErrors which are normal and expected during rebalancing. Observing this metric may provide a confirmation of the cause of such errors. |
| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. These lease transfers are tracked by a component which looks for [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance when looking at either QPS (`rebalancing.queriespersecond`) or CPU (`rebalancing.cpunanospersecond`) depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is rated), it can indicate thrashing where stores continuously move replicas and leases among each other without actually improving the overall load distribution in the medium to long term. If there is no activity on this metric and the stores remain imbalanced, this is an indicator that no rebalance activity is being undertaken to fix the imbalance, potentially signalling an issue. |
kvoli commented:

Consider shortening the latter parts regarding thrashing and no action being taken during imbalance into something like:

... it indicates more rebalancing activity is taking place due to load imbalance between stores.

The additional parts about thrashing and no activity are useful in a support context (and for sharing context, as in our earlier thread).

wdyt?

rmloveland (Contributor, Author) replied:

sgtm! I can see the stuff about thrashing, etc., not necessarily being helpful in docs since ... at that point you may need help from support, etc.

shortened based on your feedback to:

Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is rated), it indicates more rebalancing activity is taking place due to load imbalance between stores.

@@ -92,7 +93,12 @@ The **Usage** column explains why each metric is important to visualize in a cus
| <div style="width:225px">CockroachDB Metric Name</div> | {% if include.deployment == 'self-hosted' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)<br>(add `cockroachdb.` prefix)</div> |{% elsif include.deployment == 'advanced' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb_dedicated/#metrics)<br>(add `crdb_dedicated.` prefix)</div> |{% endif %}<div style="width:150px">Description</div>| Usage |
| ----------------------------------------------------- | {% if include.deployment == 'self-hosted' %}------ |{% elsif include.deployment == 'advanced' %}---- |{% endif %} ------------------------------------------------------------ | ------------------------------------------------------------ |
| leases.transfers.success | leases.transfers.success | Number of successful lease transfers | A high number of [lease](architecture/replication-layer.html#leases) transfers is not a negative or positive signal, rather it is a reflection of the elastic cluster activities. For example, this metric is high during cluster topology changes. A high value is often the reason for NotLeaseHolderErrors which are normal and expected during rebalancing. Observing this metric may provide a confirmation of the cause of such errors. |
| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. These lease transfers are tracked by a component which looks for [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance when looking at either QPS (`rebalancing.queriespersecond`) or CPU (`rebalancing.cpunanospersecond`) depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is rated), it can indicate thrashing where stores continuously move replicas and leases among each other without actually improving the overall load distribution in the medium to long term. If there is no activity on this metric and the stores remain imbalanced, this is an indicator that no rebalance activity is being undertaken to fix the imbalance, potentially signalling an issue. |
| rebalancing_range_rebalances | {% if include.deployment == 'self-hosted' %}rebalancing.range.rebalances | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter of the number of [load-based range rebalances]({% link {{ page.version.version }}/architecture/replication-layer.md %}#load-based-replica-rebalancing). This range movement is tracked by a component which looks for [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance when looking at either QPS (`rebalancing.queriespersecond`) or CPU (`rebalancing.cpunanospersecond`) depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is rated), it can indicate thrashing where stores continuously move replicas and leases among each other without actually improving the overall load distribution in the medium to long term. If there is no activity on this metric and the stores remain imbalanced, this is an indicator that no rebalance activity is being undertaken to fix the imbalance, potentially signalling an issue. |
| rebalancing_replicas_queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.queriespersecond | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter for the kv-level requests received per second by a given [replica]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-replica). This metric is a histogram which maintains buckets so you can query, e.g., the P95 replica's QPS or CPU. | A high value of this metric for a particular replica could indicate that the replica is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also: `rebalancing_replicas_cpunanospersecond`. |
kvoli commented:

For the histogram metrics, the metric is per-store, not per-replica. The store aggregates all of its replicas' CPU and QPS stats and then creates a histogram. Consider removing the "particular replica" wording.

rmloveland (Contributor, Author) replied:

Ah ok thanks for clarifying that. I have updated the description to add the info re: "the store aggregates ..." from your comment as well as remove the "particular replica" wording.

Updated to the following, PTAL and suggest edits as needed!

Counter for the {kv-level requests received,CPU nanoseconds of execution time} per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates all of the CPU and QPS stats across all its replicas and then creates a histogram which maintains buckets so you can query, e.g., the P95 replica's QPS or CPU. | A high value of this metric could indicate that one of the store's replicas is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also: rebalancing_replicas_cpunanospersecond. |

rmloveland requested a review from kvoli on March 18, 2025 15:09