kvserver: auto-tune closed timestamp updates #141832
base: master
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from 975e6db to 63bf726
⚪ Sysbench [SQL, 3node, oltp_read_write]

Reproduce

benchdiff binaries:

```
mkdir -p benchdiff/495dbb8/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/495dbb879c76e25de50358facded7cafdca1ab42/bin/pkg_sql_tests benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/9547614/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/954761451e68c1c5db5fbe0f336225955dad1075/bin/pkg_sql_tests benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
```

benchdiff command:

```
benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=9547614 --new=495dbb8 ./pkg/sql/tests
```

⚪ Sysbench [KV, 1node, local, oltp_read_only]

Reproduce

benchdiff binaries:

```
mkdir -p benchdiff/495dbb8/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/495dbb879c76e25de50358facded7cafdca1ab42/bin/pkg_sql_tests benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/9547614/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/954761451e68c1c5db5fbe0f336225955dad1075/bin/pkg_sql_tests benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
```

benchdiff command:

```
benchdiff --run=^BenchmarkSysbench/KV/1node_local/oltp_read_only$ --old=9547614 --new=495dbb8 ./pkg/sql/tests
```

🟡 Sysbench [KV, 1node, local, oltp_write_only]

Reproduce

benchdiff binaries:

```
mkdir -p benchdiff/495dbb8/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/495dbb879c76e25de50358facded7cafdca1ab42/bin/pkg_sql_tests benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/9547614/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/954761451e68c1c5db5fbe0f336225955dad1075/bin/pkg_sql_tests benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
```

benchdiff command:

```
benchdiff --run=^BenchmarkSysbench/KV/1node_local/oltp_write_only$ --old=9547614 --new=495dbb8 ./pkg/sql/tests
```

Artifacts

download:

```
mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/495dbb879c76e25de50358facded7cafdca1ab42/13460484212-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/954761451e68c1c5db5fbe0f336225955dad1075/13460484212-1/\* old/
```

No regressions detected!

built with commit: 495dbb879c76e25de50358facded7cafdca1ab42
Force-pushed from 495dbb8 to 11023f2
TODO: more testing to make sure that using the weighted average time and a 30-measurement window size here is fine.
Force-pushed from 11023f2 to 97017fa
Wanted to get some high-level feedback and check two things I'm not sure about:
Force-pushed from 8bbb30b to 85798af
This patch adds a new field `avgProposalToLocalApplicationLatency` to replica, tracking latency similar to `raft.replication.latency` with the following differences:

1. It is tracked at the per-replica level.
2. It only includes successful write commands, excluding errors like `AmbiguousResultError` or rejected proposals.
3. It only starts recording when the proposal is being proposed to raft.

This field tracks the average time between proposal and local application of a command on successful writes. An exponentially weighted moving average is used with an effective window size of 30, where recent data points have a higher influence and older data gradually decays. The metric is thread-safe.

Notes:

1. It does not include full application on follower replicas.
2. No measurements are recorded for read-only commands or read-write commands that do not result in writes.
3. No measurements are recorded for proposal failures (e.g. due to `AmbiguousResultError`, rejected proposals, or request evaluation that did not lead to raft proposals).

Note that `avgProposalToLocalApplicationLatency` is left unused. Future commits will use it to compute the closed timestamp target for global tables.

Informs: cockroachdb#72393

Release note: none
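To make the smoothing concrete, here is a minimal, self-contained Go sketch of a thread-safe exponentially weighted moving average with an effective window of 30 samples. The type and method names, and the choice of alpha = 2/(N+1), are illustrative assumptions and not the actual CockroachDB implementation.

```go
// A sketch of a thread-safe EWMA latency tracker with an effective window of 30.
package main

import (
	"fmt"
	"sync"
	"time"
)

const effectiveWindowSize = 30 // per the commit message

// movingAvgLatency keeps a smoothed latency where recent samples have more
// influence and older samples gradually decay.
type movingAvgLatency struct {
	mu  sync.Mutex
	avg time.Duration
	n   int // samples seen; used to seed the first value
}

// record folds a new proposal-to-local-application measurement into the average.
func (m *movingAvgLatency) record(sample time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.n == 0 {
		m.avg = sample // seed with the first observation
	} else {
		// alpha = 2/(N+1) gives an effective window of roughly N samples (assumption).
		const alpha = 2.0 / (effectiveWindowSize + 1)
		m.avg = time.Duration(alpha*float64(sample) + (1-alpha)*float64(m.avg))
	}
	m.n++
}

// get returns the current smoothed latency (zero if nothing was recorded).
func (m *movingAvgLatency) get() time.Duration {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.avg
}

func main() {
	var l movingAvgLatency
	for _, d := range []time.Duration{10 * time.Millisecond, 12 * time.Millisecond, 50 * time.Millisecond} {
		l.record(d)
	}
	fmt.Println("smoothed latency:", l.get())
}
```

With alpha = 2/31, each new sample shifts the average by roughly 6.5%, so recent measurements dominate while older ones decay gradually.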
Force-pushed from 5200bdb to 2a6b0a4
Force-pushed from d7388e8 to 091d18e
This patch adds a new field `avgMaxNetWorkLatency` to Sender, tracking the average maximum network latency between the sender and any of its follower nodes. An exponentially weighted moving average is used with an effective window size of 30, where recent data points have a higher influence and older data gradually decays.

Note that `avgMaxNetWorkLatency` is left unused. Future commits will use it to compute the closed timestamp target for global tables.

Part of: cockroachdb#59680

Release note: none
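For illustration, a small sketch of the other half of this measurement: taking the maximum observed latency across the follower nodes the sender publishes closed timestamps to, which would then be folded into the same kind of smoothed average as in the previous sketch. The node IDs, latency map, and function names are hypothetical.

```go
// A sketch of selecting the maximum one-hop latency across follower nodes.
package main

import (
	"fmt"
	"time"
)

// maxFollowerLatency returns the largest observed latency to any follower node.
func maxFollowerLatency(latencies map[int]time.Duration) time.Duration {
	var maxLat time.Duration
	for _, l := range latencies {
		if l > maxLat {
			maxLat = l
		}
	}
	return maxLat
}

func main() {
	// Hypothetical node ID -> observed one-hop latency.
	observed := map[int]time.Duration{
		2: 3 * time.Millisecond,
		3: 45 * time.Millisecond, // a geographically distant follower dominates
		4: 7 * time.Millisecond,
	}
	// This value would then be recorded into an EWMA with an effective window of 30.
	fmt.Println("avgMaxNetWorkLatency input:", maxFollowerLatency(observed))
}
```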
Previously, if a cluster defined `kv.closed_timestamp.lead_for_global_reads_override`, the lead time calculated at the sender was simply overridden at the end. This patch moves the override check earlier in the logic and uses an early return for better readability.

Epic: none

Release note: none
Previously, a conditional check was used to find the maximum of sideTransportPropTime and raftTransportPropTime. This patch changes it to use the max builtin for better readability.

Release note: none

Epic: none
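A tiny before/after sketch of this readability change, assuming Go 1.21+ where max is a builtin; the variable values are made up for the example.

```go
// Comparing a conditional maximum with the max builtin.
package main

import (
	"fmt"
	"time"
)

func main() {
	sideTransportPropTime := 300 * time.Millisecond
	raftTransportPropTime := 250 * time.Millisecond

	// Before: a conditional check to find the maximum.
	propTime := sideTransportPropTime
	if raftTransportPropTime > propTime {
		propTime = raftTransportPropTime
	}

	// After: the max builtin expresses the same thing in one line.
	propTime = max(sideTransportPropTime, raftTransportPropTime)

	fmt.Println("closed_ts_propagation:", propTime)
}
```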
…_tune

This patch introduces a new cluster setting, `kv.closed_timestamp.lead_for_global_reads_auto_tune`, which enables auto-tuning of the lead time that global-read ranges use to publish closed timestamps. The tuning is based on observed raft and side-transport closed timestamp propagation latencies. The `kv.closed_timestamp.lead_for_global_reads_override` cluster setting takes precedence over this one. If auto-tuning is disabled or no data is observed, the system falls back to a hardcoded computed lead time.

Note that the setting is currently unused, and no release notes have been added yet. Future commits will populate and integrate it.

Part of: cockroachdb#59680

Release note: none
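A minimal sketch of the precedence this setting introduces, as described in the commit message: the override wins, then auto-tuning when it is enabled and data has been observed, otherwise the hardcoded lead time. The function signature and names are illustrative assumptions, not the actual CockroachDB code.

```go
// A sketch of the lead-time precedence: override > auto-tuned > hardcoded.
package main

import (
	"fmt"
	"time"
)

func leadTimeForGlobalReads(
	override time.Duration, // kv.closed_timestamp.lead_for_global_reads_override
	autoTune bool, // kv.closed_timestamp.lead_for_global_reads_auto_tune
	observedLead time.Duration, // lead derived from observed propagation latencies
	hardcodedLead time.Duration, // existing computed default
) time.Duration {
	if override != 0 {
		return override // the override setting takes precedence over everything else
	}
	if autoTune && observedLead != 0 {
		return observedLead // auto-tuned from observed raft/side-transport latencies
	}
	return hardcodedLead // fall back when auto-tuning is off or no data is observed
}

func main() {
	fmt.Println(leadTimeForGlobalReads(0, true, 450*time.Millisecond, 800*time.Millisecond))
}
```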
This patch refactors `closedts.TargetForPolicy` by extracting part of its code into a helper function. It does not change any existing behavior but makes future commits cleaner.

Part of: cockroachdb#59680

Release note: none
Previously, for global reads, we calculated the target closed timestamp using the `LEAD_FOR_GLOBAL_READS` policy. This policy indicates that the range’s closed timestamp is configured to lead the present time so that all follower replicas of the range can serve present-time reads. Note that this is an estimate/goal but not the lead time itself.

This is how the calculation is currently done:

```
raft_propagation_time = max_network_rtt*1.5 (closed timestamp propagation by raft from a leader)
                        + raft_overhead (proposal evaluation, log sync, and state machine apply)

side_propagation_time = max_network_rtt*0.5 (closed timestamp propagation by side transport)
                        + side_transport_close_interval

// Maximum time required to propagate a closed timestamp update to followers from the sender.
closed_ts_propagation = max(raft_propagation_time, side_propagation_time)

closed_ts_at_sender = now (sender’s current clock)
                      + maxClockOffset (max clock uncertainty)
                      + closed_ts_propagation
```

By default, this results in a closed timestamp leading the sender's current clock by 800ms. Users can override this by tuning `kv.closed_timestamp.lead_for_global_reads_override`.

**Problems with the previous approach**

1. Tuning trade-offs: A higher lead timestamp forces leaseholders to wait longer for the current time to catch up, incurring high write latency. A lower lead timestamp may result in followers not fully replicating the changes before control returns to the client, forcing follower replicas to redirect queries to the leaseholder and increasing read latency.

2. Cluster-wide setting: This setting applies to the entire cluster, not per range. For geographically distant replicas, a short lead time may not be enough for full application. However, increasing the lead time cluster-wide can harm write performance for replicas that are close together and don’t require as much time to replicate.

**Improvements in this patch**

This patch auto-tunes closed timestamp propagation for global reads by providing more accurate estimates for raft and side-transport propagation time.

1. Raft propagation time adjustment

Instead of `raft_propagation_time = max_network_rtt*1.5 + raft_overhead`, the `max_network_rtt*1.5` factor is now replaced with a smoothed average of the time taken between proposal evaluation and local application at the leader replica. This measurement considers only successful write commands, capturing the time up to when the proposal is committed to the raft logs and applied locally. It does not include the full application on follower replicas. Ideally, we would account for full application time on follower replicas, but the leader has no direct visibility into when follower replicas apply updates to their state machine. Tracking this propagation across all followers seems complex, so for now we rely on raft_overhead to approximate this delay.

2. Side-transport propagation time adjustment

Instead of `side_propagation_time = max_network_rtt*0.5 + side_transport_close_interval`, the `max_network_rtt*0.5` factor is now replaced by a smoothed average of the maximum latency observed by the sender node. This average considers the maximum latency across all nodes where the sender has leaseholder ranges and needs to send closed timestamps to nodes with follower replicas. Ideally, every node-to-node RPC would have its own network latency average, implying a different closed timestamp target for each pair. However, this would complicate the side transport design. Currently, senders and receivers share state by using a single closed timestamp for each policy, with any new or removed ranges noted in the group. To assign a unique timestamp for each range based on different network latencies, we would need to introduce a node-to-node latency mapping on top of the group ranges state view. In addition, the side transport propagation calculation is usually dominated by sideTransportCloseInterval (200ms by default). To avoid the complexity, we simply use the maximum latency to estimate the one-hop network latency.

Resolves: cockroachdb#59680

Release note: `kv.closed_timestamp.lead_for_global_reads_auto_tune` can now be used to auto-tune the lead time that ranges of global tables use to publish closed timestamps. The `kv.closed_timestamp.lead_for_global_reads_override` cluster setting takes precedence over this one. If auto-tuning is disabled or no data is observed, the system falls back to a hardcoded computed lead time.
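Putting the two adjustments together, here is a Go sketch of what the auto-tuned target computation described above looks like, paraphrasing the commit message; the function name, parameter names, and sample values are illustrative assumptions, not the actual closedts.TargetForPolicy code.

```go
// A sketch of the auto-tuned closed timestamp target for global reads.
package main

import (
	"fmt"
	"time"
)

func autoTunedClosedTSTarget(
	now time.Time,
	maxClockOffset time.Duration, // max clock uncertainty
	avgProposalToLocalApplication time.Duration, // replaces max_network_rtt*1.5
	raftOverhead time.Duration, // proposal evaluation, log sync, state machine apply
	avgMaxNetworkLatency time.Duration, // replaces max_network_rtt*0.5
	sideTransportCloseInterval time.Duration, // 200ms by default
) time.Time {
	raftPropagationTime := avgProposalToLocalApplication + raftOverhead
	sidePropagationTime := avgMaxNetworkLatency + sideTransportCloseInterval
	// Maximum time required to propagate a closed timestamp update to followers.
	closedTSPropagation := max(raftPropagationTime, sidePropagationTime)
	return now.Add(maxClockOffset + closedTSPropagation)
}

func main() {
	// Illustrative values only; real inputs come from the smoothed averages above.
	target := autoTunedClosedTSTarget(
		time.Now(), 500*time.Millisecond,
		20*time.Millisecond, 20*time.Millisecond,
		40*time.Millisecond, 200*time.Millisecond,
	)
	fmt.Println("closed timestamp target:", target)
}
```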
…uning

This patch adds more tests for closed timestamp auto-tuning, specifically for raft-proposal closed timestamp propagation.

Part of: cockroachdb#59680

Release note: none
Force-pushed from 091d18e to 843beac