kvserver: auto-tune closed timestamp updates #141832

Draft
wants to merge 8 commits into master
Conversation

@wenyihu6 (Contributor) commented Feb 21, 2025

Previously, for global reads, we calculated the target closed timestamp using
the LEAD_FOR_GLOBAL_READS policy. This policy indicates that the range’s closed
timestamp is configured to lead the present time so that all follower replicas
of the range can serve present-time reads. Note that this is an estimated target,
not the lead time itself.

This is how the calculation is currently done:

```
raft_propagation_time = max_network_rtt*1.5 (closed timestamp propagation by raft from a leader)
+ raft_overhead (proposal evaluation, log sync, and state machine apply)

side_propagation_time = max_network_rtt*0.5 (closed timestamp propagation by side transport)
+ side_transport_close_interval

// Maximum time required to propagate a closed timestamp update to followers from the sender.
closed_ts_propagation = max(raft_propagation_time, side_propagation_time)

closed_ts_at_sender = now(sender’s current clock) + maxClockOffset (max clock uncertainty)
+ closed_ts_propagation
```

By default, this results in a closed timestamp leading the sender's current clock by 800ms. Users
can override this by tuning `kv.closed_timestamp.lead_for_global_reads_override`.
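For a concrete picture, here is a minimal Go sketch of the formula above; the function name and the inputs passed in `main` are illustrative assumptions, not the actual kvserver constants or cluster-setting defaults:

```go
package main

import (
	"fmt"
	"time"
)

// leadTimeForGlobalReads mirrors the pseudocode above: the closed timestamp
// leads the sender's clock by the max clock uncertainty plus the worst-case
// propagation path (raft vs. side transport).
func leadTimeForGlobalReads(
	maxNetworkRTT, raftOverhead, sideTransportCloseInterval, maxClockOffset time.Duration,
) time.Duration {
	raftPropagationTime := maxNetworkRTT*3/2 + raftOverhead
	sidePropagationTime := maxNetworkRTT/2 + sideTransportCloseInterval
	closedTSPropagation := max(raftPropagationTime, sidePropagationTime)
	return maxClockOffset + closedTSPropagation
}

func main() {
	// Purely illustrative inputs; the real values come from cluster settings
	// and constants in the kvserver closed timestamp code.
	fmt.Println(leadTimeForGlobalReads(
		150*time.Millisecond, 20*time.Millisecond, 200*time.Millisecond, 500*time.Millisecond,
	))
}
```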

**Problems with the previous approach**

  1. Tuning trade-offs: A higher lead timestamp forces leaseholders to wait longer
    for the current time to catch up, incurring higher write latency. A lower lead
    timestamp may result in followers not having fully replicated the changes before
    control returns to the client, forcing follower replicas to redirect queries to
    the leaseholder and increasing read latency.

  2. Cluster-wide setting: This setting applies to the entire cluster, not per
    range. For geographically distant replicas, a short lead time may not be enough
    for full application. However, increasing the lead time cluster-wide can harm
    write performance for replicas that are close together and don’t require as much
    time to replicate.

**Improvements in this patch**

This patch auto-tunes closed timestamp propagation for global reads by providing
more accurate estimates for raft and side transport propagation time.

  1. Raft Propagation Time Adjustment

Instead of `raft_propagation_time = max_network_rtt*1.5 + raft_overhead`, the
`max_network_rtt*1.5` factor is now replaced with a smoothed average of the time taken
between proposal evaluation and local application at the leader replica. This
measurement considers only successful write commands, capturing the time up to
when the proposal is committed to the Raft log and applied locally. It does not
include the full application on follower replicas.

Ideally, we would account for full application time on follower replicas, but
the leader has no direct visibility into when follower replicas apply updates to
their state machine. Tracking this propagation across all followers seems
complex, so for now, we rely on raft_overhead to approximate this delay.

  2. Side Transport Propagation Time Adjustment

Instead of `side_propagation_time = max_network_rtt*0.5 + side_transport_close_interval`, the `max_network_rtt*0.5` factor is now replaced
by a smoothed average of the maximum latency observed by the sender node. This average
takes the maximum latency across all nodes to which the sender must send closed
timestamps, i.e. nodes holding follower replicas of ranges the sender holds leases for.

Ideally, every node-to-node RPC would have its own network latency average,
implying a different closed timestamp target for each pair. However, this would
complicate the side transport design. Currently, senders and receivers share
state by using a single closed timestamp for each policy, with any new or
removed ranges noted in the group. To assign a unique timestamp for each range
based on different network latencies, we would need to introduce a node-to-node
latency mapping on top of the group ranges state view. In addition, the side
transport propagation calculation is usually dominated by
sideTransportCloseInterval (200ms by default). To avoid the complexity, we
simply use the maximum latency to estimate the one-hop network latency; a
sketch of the adjusted computation follows.
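Conceptually, the auto-tuned computation swaps the hardcoded RTT terms for the smoothed observations. Building on the sketch earlier in this description, a hedged Go sketch; the names are hypothetical stand-ins, not the PR's actual API:

```go
// autoTunedLeadTime sketches the adjusted calculation. The two smoothed inputs
// stand in for the measurements described above:
// avgProposalToLocalApplicationLatency replaces max_network_rtt*1.5, and
// avgMaxNetworkLatency replaces max_network_rtt*0.5.
func autoTunedLeadTime(
	avgProposalToLocalApplicationLatency, avgMaxNetworkLatency,
	raftOverhead, sideTransportCloseInterval, maxClockOffset time.Duration,
) time.Duration {
	raftPropagationTime := avgProposalToLocalApplicationLatency + raftOverhead
	sidePropagationTime := avgMaxNetworkLatency + sideTransportCloseInterval
	return maxClockOffset + max(raftPropagationTime, sidePropagationTime)
}
```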

Resolves: #59680
Release note: (add a cluster setting to guard this)


blathers-crl bot commented Feb 21, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@wenyihu6 wenyihu6 force-pushed the rpcnetwork-sc branch 8 times, most recently from 975e6db to 63bf726 Compare February 21, 2025 15:28

github-actions bot commented Feb 21, 2025

⚪ Sysbench [SQL, 3node, oltp_read_write]
| Metric | Old Commit | New Commit | Delta | Note | Threshold |
|---|---|---|---|---|---|
| sec/op | 11.29m ±0% | 11.32m ±0% | ~ | p=0.353 n=10 | 3.0% |
| errs/op | 0.000 ±0% | 0.000 ±0% | ~ | p=1.000 n=10 | 0.0% |
| allocs/op | 10.19k ±1% | 10.19k ±1% | ~ | p=0.869 n=10 | 2.0% |
| B/op | 2.204Mi ±0% | 2.206Mi ±1% | ~ | p=0.971 n=10 | 2.0% |
Reproduce

benchdiff binaries:

mkdir -p benchdiff/495dbb8/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/495dbb879c76e25de50358facded7cafdca1ab42/bin/pkg_sql_tests benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/9547614/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/954761451e68c1c5db5fbe0f336225955dad1075/bin/pkg_sql_tests benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=9547614 --new=495dbb8 ./pkg/sql/tests
⚪ Sysbench [KV, 1node, local, oltp_read_only]
| Metric | Old Commit | New Commit | Delta | Note | Threshold |
|---|---|---|---|---|---|
| sec/op | 720.7µ ±1% | 717.5µ ±0% | ~ | p=0.075 n=10 | 2.0% |
| errs/op | 0.000 ±0% | 0.000 ±0% | ~ | p=1.000 n=10 | 0.0% |
| allocs/op | 446.0 ±0% | 446.0 ±0% | ~ | p=1.000 n=10 | 1.5% |
| B/op | 254.7Ki ±0% | 254.7Ki ±0% | ~ | p=0.868 n=10 | 1.5% |
Reproduce

benchdiff binaries:

mkdir -p benchdiff/495dbb8/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/495dbb879c76e25de50358facded7cafdca1ab42/bin/pkg_sql_tests benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/9547614/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/954761451e68c1c5db5fbe0f336225955dad1075/bin/pkg_sql_tests benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

benchdiff --run=^BenchmarkSysbench/KV/1node_local/oltp_read_only$ --old=9547614 --new=495dbb8 ./pkg/sql/tests
🟡 Sysbench [KV, 1node, local, oltp_write_only]
| Metric | Old Commit | New Commit | Delta | Note | Threshold |
|---|---|---|---|---|---|
| 🟡 sec/op | 1.427m ±1% | 1.437m ±1% | +0.71% | p=0.011 n=10 | 2.5% |
| errs/op | 0.000 ±0% | 0.000 ±0% | ~ | p=1.000 n=10 | 0.0% |
| allocs/op | 1.399k ±0% | 1.400k ±0% | ~ | p=0.138 n=10 | 1.8% |
| B/op | 290.3Ki ±0% | 290.4Ki ±0% | ~ | p=0.796 n=10 | 1.8% |
Reproduce

benchdiff binaries:

mkdir -p benchdiff/495dbb8/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/495dbb879c76e25de50358facded7cafdca1ab42/bin/pkg_sql_tests benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/9547614/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/954761451e68c1c5db5fbe0f336225955dad1075/bin/pkg_sql_tests benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

benchdiff --run=^BenchmarkSysbench/KV/1node_local/oltp_write_only$ --old=9547614 --new=495dbb8 ./pkg/sql/tests
Artifacts

download:

mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/495dbb879c76e25de50358facded7cafdca1ab42/13460484212-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/954761451e68c1c5db5fbe0f336225955dad1075/13460484212-1/\* old/
Legend
  • Neutral: No significant performance change.
  • 🟡 Warning: Slight degradation, likely due to variance, but still within thresholds.
  • 🔴 Regression: Likely performance regression, requiring investigation.
  • 🟢 Improvement: Possible performance gain.

No regressions detected!

built with commit: 495dbb879c76e25de50358facded7cafdca1ab42

@wenyihu6 wenyihu6 force-pushed the rpcnetwork-sc branch 2 times, most recently from 495dbb8 to 11023f2 Compare February 21, 2025 19:32
@wenyihu6 (Contributor, Author) commented Feb 21, 2025

TODO: more testing to make sure the weighted-average approach with a 30-measurement window size is fine here.

@wenyihu6 (Contributor, Author) commented Feb 21, 2025

Wanted to get some high-level feedback and check two things I'm not sure about:

  1. Raft propagation estimate: Ideally, we’d account for the full application time on followers, but the leader lacks visibility into when follower replicas apply to their state machine. Tracking this across all followers is complex, so we approximate the delay by using the time it takes for the leader replica to apply locally plus a hardcoded raft_overhead.

  2. Side transport estimate: Every node-to-node RPC should have its own latency average, but this would complicate the side transport design. Currently, senders and receivers share a single closed timestamp per policy:

    ```go
    // streamState encapsulates the state that's tracked by a stream. Both the
    // Sender and the Receiver use this struct and, for a given stream, both ends
    // are supposed to correspond (modulo message delays), in wonderful symmetry.
    type streamState struct {
    	// lastSeqNum is the sequence number of the last message published.
    	lastSeqNum ctpb.SeqNum
    	// lastClosed is the closed timestamp published for each policy in the
    	// last message.
    	lastClosed [roachpb.MAX_CLOSED_TIMESTAMP_POLICY]hlc.Timestamp
    	// tracked maintains the information that was communicated to connections in
    	// the last sent message (implicitly or explicitly). A range enters this
    	// structure as soon as it's included in a message, and exits it when it's
    	// removed through Update.Removed.
    	tracked map[roachpb.RangeID]trackedRange
    }
    ```
    Having a different closed timestamp target per range would require a node-to-node latency mapping shared between the sender and receiver. Since propagation is mostly dominated by sideTransportCloseInterval (200ms), we simplify by picking the maximum observed latency to any follower node.

@wenyihu6 wenyihu6 force-pushed the rpcnetwork-sc branch 11 times, most recently from 8bbb30b to 85798af Compare February 22, 2025 22:30
This patch adds a new field `avgProposalToLocalApplicationLatency` to the replica,
tracking a latency similar to `raft.replication.latency` with the following
differences:
1. It is tracked at the per-replica level.
2. It only includes successful write commands, excluding errors like
`AmbiguousResultError` or rejected proposals.
3. Recording only starts when the proposal is proposed to raft.

This field tracks the average time between proposal and local application
of a command on successful writes. An exponentially weighted moving average
is used with an effective window size of 30, where recent data points
have a higher influence, and older data gradually decays. The metric is
thread-safe.

Notes:
1. It does not include full application on follower replicas.
2. No measurements are recorded for read-only commands or read-write commands
that do not result in writes.
3. No measurements are recorded for proposal failures (e.g. due to
AmbiguousResultError, rejected proposals, or for request
evaluation that did not lead to raft proposals).

Note that avgProposalToLocalApplicationLatency is left unused. Future
commits will use it to compute the closed timestamp target
for global tables.

Informs: cockroachdb#72393
Release note: none
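A minimal sketch of the exponentially weighted moving average described above, assuming the common convention that an effective window of N samples corresponds to alpha = 2/(N+1); this is an illustration, not the PR's actual implementation:

```go
package closedtsdemo // hypothetical package, for illustration only

import (
	"sync"
	"time"
)

// movingAvgLatency is an illustrative, thread-safe exponentially weighted
// moving average with an effective window of n samples (alpha = 2/(n+1));
// recent samples dominate and older ones gradually decay.
type movingAvgLatency struct {
	mu    sync.Mutex
	alpha float64
	avg   float64 // smoothed latency in nanoseconds
	init  bool
}

func newMovingAvgLatency(windowSize int) *movingAvgLatency {
	return &movingAvgLatency{alpha: 2.0 / (float64(windowSize) + 1)}
}

// record folds a new measurement into the average. Per the commit message,
// callers would invoke this only for successful write commands, measured from
// raft proposal to local application.
func (m *movingAvgLatency) record(d time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.init {
		m.avg, m.init = float64(d.Nanoseconds()), true
		return
	}
	m.avg = m.alpha*float64(d.Nanoseconds()) + (1-m.alpha)*m.avg
}

// value returns the current smoothed latency, or zero if nothing was recorded.
func (m *movingAvgLatency) value() time.Duration {
	m.mu.Lock()
	defer m.mu.Unlock()
	return time.Duration(m.avg)
}
```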
@wenyihu6 wenyihu6 force-pushed the rpcnetwork-sc branch 2 times, most recently from 5200bdb to 2a6b0a4 Compare February 24, 2025 03:27
@wenyihu6 wenyihu6 force-pushed the rpcnetwork-sc branch 8 times, most recently from d7388e8 to 091d18e Compare February 24, 2025 04:57
This patch adds a new field `avgMaxNetWorkLatency` to the Sender, tracking the
average maximum network latency between the sender and any of its follower nodes.
An exponentially weighted moving average is used with an effective window size
of 30, where recent data points have a higher influence and older data
gradually decays.

Note that `avgMaxNetWorkLatency` is left unused. Future
commits will use it to compute the closed timestamp target
for global tables.

Part of: cockroachdb#59680
Release note: none
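Reusing the illustrative EWMA tracker sketched after the previous commit message, a hedged sketch of the sender-side update; the node-ID slice and latency lookup are hypothetical, not the Sender's real fields:

```go
// recordMaxFollowerLatency is an illustrative helper: once per side-transport
// interval, take the worst observed latency to any node that must receive
// closed timestamps from this sender, and fold it into the smoothed average.
func recordMaxFollowerLatency(
	avg *movingAvgLatency,
	followerNodes []int32, // hypothetical: IDs of nodes with follower replicas
	latencyTo func(nodeID int32) (time.Duration, bool), // hypothetical latency source
) {
	var maxLatency time.Duration
	for _, n := range followerNodes {
		if l, ok := latencyTo(n); ok && l > maxLatency {
			maxLatency = l
		}
	}
	if maxLatency > 0 {
		avg.record(maxLatency)
	}
}
```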
Previously, if a cluster defined `kv.closed_timestamp.lead_for_global_reads_override`,
the lead time calculated at the sender was simply overridden at the end.

This patch moves the override earlier in the logic and
returns early for better readability.

Epic: none
Release note: none
Previously, a conditional check was used to find the maximum
between sideTransportPropTime and raftTransportPropTime.

This patch changes it to use the max builtin for better
readability.

Release note: none
Epic: none
…_tune

This patch introduces a new cluster setting,
`kv.closed_timestamp.lead_for_global_reads_auto_tune`, which enables auto-tuning
of the lead time that global-read ranges use to publish closed timestamps. The
tuning is based on observed raft and side-transport closed timestamp propagation
latencies. The `kv.closed_timestamp.lead_for_global_reads_override` cluster
setting takes precedence over this one. If auto-tuning is disabled or no data has
been observed, the system falls back to a hardcoded computed lead time.

Note that the setting is currently unused, and no release notes have been added yet.
Future commits will populate and integrate it.

Part of: cockroachdb#59680
Release note: none
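The precedence described in this commit message (override first, then auto-tuning when enabled and backed by observed data, then the hardcoded fallback) can be summarized in a small sketch; the function and parameter names are hypothetical:

```go
// chooseLeadTime sketches the selection order: the override setting wins, then
// the auto-tuned estimate (when enabled and when data has been observed), and
// finally the hardcoded computed lead time. Durations use the standard time package.
func chooseLeadTime(
	override time.Duration, // kv.closed_timestamp.lead_for_global_reads_override
	autoTuneEnabled bool, // kv.closed_timestamp.lead_for_global_reads_auto_tune
	observed, hardcoded time.Duration,
) time.Duration {
	if override > 0 {
		return override
	}
	if autoTuneEnabled && observed > 0 {
		return observed
	}
	return hardcoded
}
```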
This patch refactors `closedts.TargetForPolicy` by extracting part of its code
into a helper function. It does not change any existing behavior but makes
future commits cleaner.

Part of: cockroachdb#59680
Release note: none
Previously, for global reads, we calculated the target closed timestamp using
the `LEAD_FOR_GLOBAL_READS` policy. This policy indicates that the range’s closed
timestamp is configured to lead the present time so that all follower replicas
of the range can serve present-time reads. Note that this is an estimated target,
not the lead time itself.

This is how the calculation is currently done:

```
raft_propagation_time = max_network_rtt*1.5 (closed timestamp propagation by raft from a leader)
+ raft_overhead (proposal evaluation, log sync, and state machine apply)

side_propagation_time = max_network_rtt*0.5 (closed timestamp propagation by side transport)
+ side_transport_close_interval

// Maximum time required to propagate a closed timestamp update to followers from the sender.
closed_ts_propagation = max(raft_propagation_time, side_propagation_time)

closed_ts_at_sender = now(sender’s current clock) + maxClockOffset (max clock uncertainty)
+ closed_ts_propagation
```

By default, this results in a closed timestamp leading the sender's current clock by 800ms. Users
can override this by tuning `kv.closed_timestamp.lead_for_global_reads_override`.

**Problems with the previous approach**

1. Tuning trade-offs: A higher lead timestamp forces leaseholders to wait longer
for the current time to catch up, incurring higher write latency. A lower lead
timestamp may result in followers not having fully replicated the changes before
control returns to the client, forcing follower replicas to redirect queries to
the leaseholder and increasing read latency.

2. Cluster-wide setting: This setting applies to the entire cluster, not per
range. For geographically distant replicas, a short lead time may not be enough
for full application. However, increasing the lead time cluster-wide can harm
write performance for replicas that are close together and don’t require as much
time to replicate.

**Improvements in this patch**

This patch auto-tunes closed timestamp propagation for global reads by providing
more accurate estimates for raft and side transport propagation time.

1. Raft Propagation Time Adjustment

Instead of `raft_propagation_time = max_network_rtt*1.5 + raft_overhead`, the
`max_network_rtt*1.5` factor is now replaced with a smoothed average of the time taken
between proposal evaluation and local application at the leader replica. This
measurement considers only successful write commands, capturing the time up to
when the proposal is committed to the Raft log and applied locally. It does not
include the full application on follower replicas.

Ideally, we would account for full application time on follower replicas, but
the leader has no direct visibility into when follower replicas apply updates to
their state machine. Tracking this propagation across all followers seems
complex, so for now, we rely on raft_overhead to approximate this delay.

2. Side Transport Propagation Time Adjustment

Instead of `side_propagation_time = max_network_rtt*0.5 +
side_transport_close_interval`, the `max_network_rtt*0.5` factor is now replaced
by a smoothed average of the maximum latency observed by the sender node. This average
takes the maximum latency across all nodes to which the sender must send closed
timestamps, i.e. nodes holding follower replicas of ranges the sender holds leases for.

Ideally, every node-to-node RPC would have its own network latency average,
implying a different closed timestamp target for each pair. However, this would
complicate the side transport design. Currently, senders and receivers share
state by using a single closed timestamp for each policy, with any new or
removed ranges noted in the group. To assign a unique timestamp for each range
based on different network latencies, we would need to introduce a node-to-node
latency mapping on top of the group ranges state view. In addition, the side
transport propagation calculation is usually dominated by
sideTransportCloseInterval (200ms by default). To avoid the complexity, we
simply use the maximum latency to estimate the one-hop network latency.

Resolves: cockroachdb#59680
Release note: `kv.closed_timestamp.lead_for_global_reads_auto_tune` can now
be used to auto-tune the lead time that ranges of global tables use to publish
closed timestamps. The `kv.closed_timestamp.lead_for_global_reads_override`
cluster setting takes precedence over this one. If auto-tuning is disabled or
no data is observed, the system falls back to a hardcoded computed lead time.
…uning

This patch adds more tests for closed timestamp auto-tuning, specifically for
closed timestamp propagation via raft proposals.

Part of: cockroachdb#59680
Release note: none
Successfully merging this pull request may close these issues.

kv: tune closed timestamp duration for ranges with global_reads based on replication latency