Add storage.wal.fsync.latency and other metrics #19425
Fixes DOC-11996
Adds the following metrics to the docs:
- `storage.wal.fsync.latency`
- `rebalancing.range.rebalances`
- `rebalancing.replicas.queriespersecond`
@@ -92,6 +93,9 @@ The **Usage** column explains why each metric is important to visualize in a cus | |||
| <div style="width:225px">CockroachDB Metric Name</div> | {% if include.deployment == 'self-hosted' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)<br>(add `cockroachdb.` prefix)</div> |{% elsif include.deployment == 'advanced' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb_dedicated/#metrics)<br>(add `crdb_dedicated.` prefix)</div> |{% endif %}<div style="width:150px">Description</div>| Usage | | |||
| ----------------------------------------------------- | {% if include.deployment == 'self-hosted' %}------ |{% elsif include.deployment == 'advanced' %}---- |{% endif %} ------------------------------------------------------------ | ------------------------------------------------------------ | | |||
| leases.transfers.success | leases.transfers.success | Number of successful lease transfers | A high number of [lease](architecture/replication-layer.html#leases) transfers is not a negative or positive signal, rather it is a reflection of the elastic cluster activities. For example, this metric is high during cluster topology changes. A high value is often the reason for NotLeaseHolderErrors which are normal and expected during rebalancing. Observing this metric may provide a confirmation of the cause of such errors. | | |||
| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. | [XXX](): USAGE??? | |
hi @kvoli
i've been asked to add some docs for the following metrics which appear to be replication-related (via https://cockroachlabs.atlassian.net/browse/DOC-11996):
rebalancing_lease_transfers
rebalancing_range_rebalances
rebalancing_replicas_queriespersecond
i've taken a shot at writing descriptions for each which i'd really appreciate your feedback on
there is also a 'usage' column which appears to explain "what is this metric for / why should i watch it?"; in the case of rebalancing_replicas_queriespersecond, i tried to write this based on an old slack convo of yours i found in glean - please let me know what you think, happy to update
for the other two metrics, I wasn't sure what to write w.r.t. usage, can you please explain what these metrics are used for / why a user should pay attention to them? my sense is something like: too much range rebalancing/lease transfer = bad, but i'm guessing there is more to it than that!
Hey @rmloveland,

For `rebalancing_lease_transfers`, these are lease transfers specific to a certain component which looks for store-level load imbalance when looking at either QPS (`rebalancing.queriespersecond`) or CPU (`rebalancing.cpunanospersecond`), depending on what `kv.allocator.load_based_rebalancing.objective` is set to (`qps` or `cpu`). Likewise for `rebalancing_range_rebalances`, except the action is moving the replicas of a range (potentially including the leaseholder), instead of just the lease.

`rebalancing_replicas_queriespersecond` is a histogram (also see `rebalancing_replicas_cpunanospersecond`): instead of only maintaining a gauge of the store's currently reported QPS or CPU, it maintains buckets so a user could query the PXX replica's QPS or CPU. If we don't already have docs for the non-histogram variants `rebalancing.queriespersecond` and `rebalancing.cpunanospersecond`, I'd suggest including these instead.

I generally use, and have seen, the two rebalance action counter metrics `rebalancing_(lease_transfers|range_rebalances)` used to identify when there has been more rebalancing activity triggered by imbalance (QPS or CPU) between stores.

For the rebalancing load metrics, since these map directly to what the rebalancing algorithm is looking at when attempting to balance load, I have seen these used to identify the efficiency of rebalancing and to check for store-level imbalances (in cases where there is a hardware-level load imbalance).

> my sense is something like: too much range rebalancing/lease transfer = bad, but i'm guessing there is more to it than that!

Partially answered above: if these are high (when the count is rated), it can indicate thrashing, where stores continuously move replicas and leases among each other without actually improving the overall load distribution in the medium to long term. If there is no activity and the stores remain imbalanced, this is an indicator that no rebalance activity is being undertaken to fix the imbalance, potentially signalling an issue to the user.
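
To illustrate what "when the count is rated" means in the comment above: a counter such as `rebalancing_lease_transfers` only ever increases, so it is usually read as a per-second rate between successive scrapes. The following is a minimal Python sketch of that conversion, not CockroachDB or Datadog code. The function name `rate_per_second`, the sample values, and the reset handling (a decrease is treated as a restart from zero) are all assumptions made for illustration.

```python
from typing import List, Tuple


def rate_per_second(samples: List[Tuple[float, float]]) -> List[float]:
    """Convert (timestamp_seconds, counter_value) samples into per-second rates.

    Hypothetical helper: one rate is produced per pair of adjacent samples.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        delta = v1 - v0
        if delta < 0:
            # Assumption: a decreasing counter means the process restarted,
            # so everything counted since the restart is the increase.
            delta = v1
        rates.append(delta / dt)
    return rates


# Made-up scrapes of a lease transfer counter, 10 seconds apart.
samples = [(0.0, 100.0), (10.0, 120.0), (20.0, 180.0)]
rates = rate_per_second(samples)
print(rates)
```

A rate that stays elevated while the store-level load gauges do not converge is the pattern described above as possible thrashing; a flat rate alongside persistent imbalance is the "no activity" case.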
Thanks, in the latest commit I have tried to:
- update `rebalancing_lease_transfers` and `rebalancing_range_rebalances` to incorporate info from your response above in both the 'Description' and 'Usage' columns, as well as adding a "see also: the non-histogram variant X" to the metrics that had such variants
- make sure we had docs for the non-histogram variants you mentioned: `rebalancing.queriespersecond` (already existed) and `rebalancing.cpunanospersecond` (wrote new)
PTAL and let me know what you think
PS i'm updating this to a "real PR" now since I think these will be the bulk of the changes (the WAL failover metric i'll get a separate review from an engineer on Storage team once we're happy with these replication-related metrics)
@@ -92,7 +93,12 @@ The **Usage** column explains why each metric is important to visualize in a cus | |||
| <div style="width:225px">CockroachDB Metric Name</div> | {% if include.deployment == 'self-hosted' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)<br>(add `cockroachdb.` prefix)</div> |{% elsif include.deployment == 'advanced' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb_dedicated/#metrics)<br>(add `crdb_dedicated.` prefix)</div> |{% endif %}<div style="width:150px">Description</div>| Usage | | |||
| ----------------------------------------------------- | {% if include.deployment == 'self-hosted' %}------ |{% elsif include.deployment == 'advanced' %}---- |{% endif %} ------------------------------------------------------------ | ------------------------------------------------------------ | | |||
| leases.transfers.success | leases.transfers.success | Number of successful lease transfers | A high number of [lease](architecture/replication-layer.html#leases) transfers is not a negative or positive signal, rather it is a reflection of the elastic cluster activities. For example, this metric is high during cluster topology changes. A high value is often the reason for NotLeaseHolderErrors which are normal and expected during rebalancing. Observing this metric may provide a confirmation of the cause of such errors. | | |||
| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. These lease transfers are tracked by a component which looks for [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance when looking at either QPS (`rebalancing.queriespersecond`) or CPU (`rebalancing.cpunanospersecond`) depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is rated), it can indicate thrashing where stores continuously move replicas and leases among each other without actually improving the overall load distribution in the medium to long term. If there is no activity on this metric and the stores remain imbalanced, this is an indicator that no rebalance activity is being undertaken to fix the imbalance, potentially signalling an issue. | |
Consider shortening the latter parts regarding thrashing and no action being taken during imbalance into something like:
... it indicates more rebalancing activity is taking place due to load imbalance between stores.
The additional thrashing and no activity parts are useful in a support context (and in sharing context in our earlier thread).
wdyt?
sgtm! I can see the stuff about thrashing, etc., not necessarily being helpful in docs since ... at that point you may need help from support, etc.
shortened based on your feedback to:
Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is rated), it indicates more rebalancing activity is taking place due to load imbalance between stores.
@@ -92,7 +93,12 @@ The **Usage** column explains why each metric is important to visualize in a cus | |||
| <div style="width:225px">CockroachDB Metric Name</div> | {% if include.deployment == 'self-hosted' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb/?tab=host#metrics)<br>(add `cockroachdb.` prefix)</div> |{% elsif include.deployment == 'advanced' %}<div style="width:225px">[Datadog Integration Metric Name](https://docs.datadoghq.com/integrations/cockroachdb_dedicated/#metrics)<br>(add `crdb_dedicated.` prefix)</div> |{% endif %}<div style="width:150px">Description</div>| Usage | | |||
| ----------------------------------------------------- | {% if include.deployment == 'self-hosted' %}------ |{% elsif include.deployment == 'advanced' %}---- |{% endif %} ------------------------------------------------------------ | ------------------------------------------------------------ | | |||
| leases.transfers.success | leases.transfers.success | Number of successful lease transfers | A high number of [lease](architecture/replication-layer.html#leases) transfers is not a negative or positive signal, rather it is a reflection of the elastic cluster activities. For example, this metric is high during cluster topology changes. A high value is often the reason for NotLeaseHolderErrors which are normal and expected during rebalancing. Observing this metric may provide a confirmation of the cause of such errors. | | |||
| rebalancing_lease_transfers | rebalancing.lease.transfers | Counter of the number of [lease transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#epoch-based-leases-table-data) that occur during replica rebalancing. These lease transfers are tracked by a component which looks for [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance when looking at either QPS (`rebalancing.queriespersecond`) or CPU (`rebalancing.cpunanospersecond`) depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is rated), it can indicate thrashing where stores continuously move replicas and leases among each other without actually improving the overall load distribution in the medium to long term. If there is no activity on this metric and the stores remain imbalanced, this is an indicator that no rebalance activity is being undertaken to fix the imbalance, potentially signalling an issue. | | |||
| rebalancing_range_rebalances | {% if include.deployment == 'self-hosted' %}rebalancing.range.rebalances | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter of the number of [load-based range rebalances]({% link {{ page.version.version }}/architecture/replication-layer.md %}#load-based-replica-rebalancing). This range movement is tracked by a component which looks for [store-level]({% link {{ page.version.version }}/cockroach-start.md %}#store) load imbalance when looking at either QPS (`rebalancing.queriespersecond`) or CPU (`rebalancing.cpunanospersecond`) depending on the value of the `kv.allocator.load_based_rebalancing.objective` [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}#setting-kv-allocator-load-based-rebalancing-objective). | Used to identify when there has been more rebalancing activity triggered by imbalance between stores (of QPS or CPU). If this is high (when the count is rated), it can indicate thrashing where stores continuously move replicas and leases among each other without actually improving the overall load distribution in the medium to long term. If there is no activity on this metric and the stores remain imbalanced, this is an indicator that no rebalance activity is being undertaken to fix the imbalance, potentially signalling an issue. | | |||
| rebalancing_replicas_queriespersecond | {% if include.deployment == 'self-hosted' %}rebalancing.replicas.queriespersecond | {% elsif include.deployment == 'advanced' %}NOT AVAILABLE |{% endif %} Counter for the kv-level requests received per second by a given [replica]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-replica). This metric is a histogram which maintains buckets so you can query, e.g., the P95 replica's QPS or CPU. | A high value of this metric for a particular replica could indicate that the replica is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also: `rebalancing_replicas_cpunanospersecond`. | |
For the histogram metrics, the metric is per-store, not per-replica. The store aggregates all of its replicas' CPU and QPS stats and then creates a histogram. Consider removing the "particular replica" wording.
Ah ok thanks for clarifying that. I have updated the description to add the info re: "the store aggregates ..." from your comment as well as remove the "particular replica" wording.
Updated to the following, PTAL and suggest edits as needed!
Counter for the {kv-level requests received,CPU nanoseconds of execution time} per second by a given [store]({% link {{ page.version.version }}/cockroach-start.md %}#store). The store aggregates all of the CPU and QPS stats across all its replicas and then creates a histogram which maintains buckets so you can query, e.g., the P95 replica's QPS or CPU. | A high value of this metric could indicate that one of the store's replicas is part of a [hot range]({% link {{ page.version.version }}/understand-hotspots.md %}#hot-range). See also: `rebalancing_replicas_cpunanospersecond`. |
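
As a rough illustration of the per-store histogram described in this thread, the Python sketch below buckets the QPS of each replica on one store and then reads off a percentile as a bucket upper bound. It is not CockroachDB's implementation: the bucket boundaries, the replica QPS values, and the helper names `build_histogram` and `percentile_upper_bound` are hypothetical, chosen only to show why a histogram can answer "what is the P95 replica's QPS?" when a single gauge cannot.

```python
import bisect
from typing import List, Sequence


def build_histogram(values: Sequence[float], bounds: Sequence[float]) -> List[int]:
    """Count values per bucket; bucket i holds values <= bounds[i].

    The final slot is an overflow bucket for values above the last bound.
    """
    counts = [0] * (len(bounds) + 1)
    for v in values:
        counts[bisect.bisect_left(bounds, v)] += 1
    return counts


def percentile_upper_bound(counts: Sequence[int], bounds: Sequence[float], p: float) -> float:
    """Return the upper bound of the bucket containing the p-th percentile."""
    total = sum(counts)
    if total == 0:
        return 0.0
    target = p / 100.0 * total
    running = 0
    for i, c in enumerate(counts):
        running += c
        if running >= target:
            return bounds[i] if i < len(bounds) else float("inf")
    return float("inf")


# Made-up QPS for the replicas on a single store, and made-up bucket bounds.
replica_qps = [1, 2, 3, 5, 8, 12, 40, 900]
bounds = [10, 100, 1000]

counts = build_histogram(replica_qps, bounds)
p50 = percentile_upper_bound(counts, bounds, 50)
p95 = percentile_upper_bound(counts, bounds, 95)
print(counts, p50, p95)
```

In this made-up data, a P95 far above the P50 reflects one replica receiving much more traffic than the rest, which is the situation the Usage text associates with a possible hot range.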
Fixes DOC-11996