kvserver: auto-tune closed timestamp updates #141832

Draft
wants to merge 8 commits into master
Conversation

@wenyihu6 (Contributor) commented Feb 21, 2025

Previously, for global reads, we calculated the target closed timestamp using
the LEAD_FOR_GLOBAL_READS policy. This policy indicates that the range’s closed
timestamp is configured to lead the present time so that all follower replicas
of the range can serve present-time reads. Note that this is an estimated target,
not the lead time itself.

This is how the calculation is currently done:

```
raft_propagation_time = max_network_rtt*1.5 (closed timestamp propagation by raft from a leader)
+ raft_overhead (proposal evaluation, log sync, and state machine apply)

side_propagation_time = max_network_rtt*0.5 (closed timestamp propagation by side transport)
+ side_transport_close_interval

// Maximum time required to propagate a closed timestamp update to followers from the sender.
closed_ts_propagation = max(raft_propagation_time, side_propagation_time)

closed_ts_at_sender = now(sender’s current clock) + maxClockOffset (max clock uncertainty)
+ closed_ts_propagation
```

By default, this results in a closed timestamp leading the sender's current clock by 800ms. Users
can override this by tuning `kv.closed_timestamp.lead_for_global_reads_override`.
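For a concrete picture, here is a minimal Go sketch of the formula above; the function name and the inputs passed in `main` are illustrative assumptions, not the actual kvserver constants or cluster-setting defaults:

```go
package main

import (
	"fmt"
	"time"
)

// leadTimeForGlobalReads mirrors the pseudocode above: the closed timestamp
// leads the sender's clock by the max clock uncertainty plus the worst-case
// propagation path (raft vs. side transport).
func leadTimeForGlobalReads(
	maxNetworkRTT, raftOverhead, sideTransportCloseInterval, maxClockOffset time.Duration,
) time.Duration {
	raftPropagationTime := maxNetworkRTT*3/2 + raftOverhead
	sidePropagationTime := maxNetworkRTT/2 + sideTransportCloseInterval
	closedTSPropagation := max(raftPropagationTime, sidePropagationTime)
	return maxClockOffset + closedTSPropagation
}

func main() {
	// Purely illustrative inputs; the real values come from cluster settings
	// and constants in the kvserver closed timestamp code.
	fmt.Println(leadTimeForGlobalReads(
		150*time.Millisecond, 20*time.Millisecond, 200*time.Millisecond, 500*time.Millisecond,
	))
}
```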

**Problems with the previous approach**

  1. Tuning trade-offs: A higher lead timestamp forces leaseholders to wait longer
    for the current time to catch up, incurring higher write latency. A lower lead
    timestamp may result in followers not having fully replicated the changes before
    control returns to the client, forcing follower replicas to redirect queries to
    the leaseholder and increasing read latency.

  2. Cluster-wide setting: This setting applies to the entire cluster, not per
    range. For geographically distant replicas, a short lead time may not be enough
    for full application. However, increasing the lead time cluster-wide can harm
    write performance for replicas that are close together and don’t require as much
    time to replicate.

**Improvements in this patch**

This patch auto-tunes closed timestamp propagation for global reads by providing
more accurate estimates for raft and side transport propagation time.

  1. Raft Propagation Time Adjustment

Instead of `raft_propagation_time = max_network_rtt*1.5 + raft_overhead`, the
`max_network_rtt*1.5` factor is now replaced with a smoothed average of the time taken
between proposal evaluation and local application at the leader replica. This
measurement considers only successful write commands, capturing the time up to
when the proposal is committed to the Raft log and applied locally. It does not
include the full application on follower replicas.

Ideally, we would account for full application time on follower replicas, but
the leader has no direct visibility into when follower replicas apply updates to
their state machine. Tracking this propagation across all followers seems
complex, so for now, we rely on raft_overhead to approximate this delay.

  2. Side Transport Propagation Time Adjustment

Instead of `side_propagation_time = max_network_rtt*0.5 + side_transport_close_interval`, the `max_network_rtt*0.5` factor is now replaced
by a smoothed average of the maximum latency observed by the sender node. This average
takes the maximum latency across all nodes to which the sender must send closed
timestamps, i.e. nodes holding follower replicas of ranges the sender holds leases for.

Ideally, every node-to-node RPC would have its own network latency average,
implying a different closed timestamp target for each pair. However, this would
complicate the side transport design. Currently, senders and receivers share
state by using a single closed timestamp for each policy, with any new or
removed ranges noted in the group. To assign a unique timestamp for each range
based on different network latencies, we would need to introduce a node-to-node
latency mapping on top of the group ranges state view. In addition, the side
transport propagation calculation is usually dominated by
sideTransportCloseInterval (200ms by default). To avoid the complexity, we
simply use the maximum latency to estimate the one-hop network latency; a
sketch of the adjusted computation follows.
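Conceptually, the auto-tuned computation swaps the hardcoded RTT terms for the smoothed observations. Building on the sketch earlier in this description, a hedged Go sketch; the names are hypothetical stand-ins, not the PR's actual API:

```go
// autoTunedLeadTime sketches the adjusted calculation. The two smoothed inputs
// stand in for the measurements described above:
// avgProposalToLocalApplicationLatency replaces max_network_rtt*1.5, and
// avgMaxNetworkLatency replaces max_network_rtt*0.5.
func autoTunedLeadTime(
	avgProposalToLocalApplicationLatency, avgMaxNetworkLatency,
	raftOverhead, sideTransportCloseInterval, maxClockOffset time.Duration,
) time.Duration {
	raftPropagationTime := avgProposalToLocalApplicationLatency + raftOverhead
	sidePropagationTime := avgMaxNetworkLatency + sideTransportCloseInterval
	return maxClockOffset + max(raftPropagationTime, sidePropagationTime)
}
```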

Resolves: #59680
Release note: (add a cluster setting to guard this)


blathers-crl bot commented Feb 21, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@wenyihu6 wenyihu6 force-pushed the rpcnetwork-sc branch 8 times, most recently from 975e6db to 63bf726 Compare February 21, 2025 15:28

github-actions bot commented Feb 21, 2025

⚪ Sysbench [SQL, 3node, oltp_read_write]
| Metric | Old Commit | New Commit | Delta | Note | Threshold |
|---|---|---|---|---|---|
| sec/op | 11.29m ±0% | 11.32m ±0% | ~ | p=0.353 n=10 | 3.0% |
| errs/op | 0.000 ±0% | 0.000 ±0% | ~ | p=1.000 n=10 | 0.0% |
| allocs/op | 10.19k ±1% | 10.19k ±1% | ~ | p=0.869 n=10 | 2.0% |
| B/op | 2.204Mi ±0% | 2.206Mi ±1% | ~ | p=0.971 n=10 | 2.0% |
Reproduce

benchdiff binaries:

mkdir -p benchdiff/495dbb8/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/495dbb879c76e25de50358facded7cafdca1ab42/bin/pkg_sql_tests benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/9547614/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/954761451e68c1c5db5fbe0f336225955dad1075/bin/pkg_sql_tests benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=9547614 --new=495dbb8 ./pkg/sql/tests
⚪ Sysbench [KV, 1node, local, oltp_read_only]
| Metric | Old Commit | New Commit | Delta | Note | Threshold |
|---|---|---|---|---|---|
| sec/op | 720.7µ ±1% | 717.5µ ±0% | ~ | p=0.075 n=10 | 2.0% |
| errs/op | 0.000 ±0% | 0.000 ±0% | ~ | p=1.000 n=10 | 0.0% |
| allocs/op | 446.0 ±0% | 446.0 ±0% | ~ | p=1.000 n=10 | 1.5% |
| B/op | 254.7Ki ±0% | 254.7Ki ±0% | ~ | p=0.868 n=10 | 1.5% |
Reproduce

benchdiff binaries:

mkdir -p benchdiff/495dbb8/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/495dbb879c76e25de50358facded7cafdca1ab42/bin/pkg_sql_tests benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/9547614/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/954761451e68c1c5db5fbe0f336225955dad1075/bin/pkg_sql_tests benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

benchdiff --run=^BenchmarkSysbench/KV/1node_local/oltp_read_only$ --old=9547614 --new=495dbb8 ./pkg/sql/tests
🟡 Sysbench [KV, 1node, local, oltp_write_only]
| Metric | Old Commit | New Commit | Delta | Note | Threshold |
|---|---|---|---|---|---|
| 🟡 sec/op | 1.427m ±1% | 1.437m ±1% | +0.71% | p=0.011 n=10 | 2.5% |
| errs/op | 0.000 ±0% | 0.000 ±0% | ~ | p=1.000 n=10 | 0.0% |
| allocs/op | 1.399k ±0% | 1.400k ±0% | ~ | p=0.138 n=10 | 1.8% |
| B/op | 290.3Ki ±0% | 290.4Ki ±0% | ~ | p=0.796 n=10 | 1.8% |
Reproduce

benchdiff binaries:

mkdir -p benchdiff/495dbb8/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/495dbb879c76e25de50358facded7cafdca1ab42/bin/pkg_sql_tests benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/495dbb8/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/9547614/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/954761451e68c1c5db5fbe0f336225955dad1075/bin/pkg_sql_tests benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/9547614/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

benchdiff --run=^BenchmarkSysbench/KV/1node_local/oltp_write_only$ --old=9547614 --new=495dbb8 ./pkg/sql/tests
Artifacts

download:

mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/495dbb879c76e25de50358facded7cafdca1ab42/13460484212-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/954761451e68c1c5db5fbe0f336225955dad1075/13460484212-1/\* old/
Legend
  • Neutral: No significant performance change.
  • 🟡 Warning: Slight degradation, likely due to variance, but still within thresholds.
  • 🔴 Regression: Likely performance regression, requiring investigation.
  • 🟢 Improvement: Possible performance gain.

No regressions detected!

built with commit: 495dbb879c76e25de50358facded7cafdca1ab42

@wenyihu6 wenyihu6 force-pushed the rpcnetwork-sc branch 2 times, most recently from 495dbb8 to 11023f2 Compare February 21, 2025 19:32
@wenyihu6 (Contributor, Author) commented Feb 21, 2025

TODO: more testing to make sure the weighted-average approach with a 30-measurement window size is fine here.

@wenyihu6 (Contributor, Author) commented Feb 21, 2025

Wanted to get some high-level feedback and check two things I'm not sure about:

  1. Raft propagation estimate: Ideally, we’d account for the full application time on followers, but the leader lacks visibility into when follower replicas apply to their state machine. Tracking this across all followers is complex, so we approximate the delay by using the time it takes for the leader replica to apply locally plus a hardcoded raft_overhead.

  2. Side transport estimate: Every node-to-node RPC should have its own latency average, but this would complicate the side transport design. Currently, senders and receivers share a single closed timestamp per policy:

    ```go
    // streamState encapsulates the state that's tracked by a stream. Both the
    // Sender and the Receiver use this struct and, for a given stream, both ends
    // are supposed to correspond (modulo message delays), in wonderful symmetry.
    type streamState struct {
    	// lastSeqNum is the sequence number of the last message published.
    	lastSeqNum ctpb.SeqNum
    	// lastClosed is the closed timestamp published for each policy in the
    	// last message.
    	lastClosed [roachpb.MAX_CLOSED_TIMESTAMP_POLICY]hlc.Timestamp
    	// tracked maintains the information that was communicated to connections in
    	// the last sent message (implicitly or explicitly). A range enters this
    	// structure as soon as it's included in a message, and exits it when it's
    	// removed through Update.Removed.
    	tracked map[roachpb.RangeID]trackedRange
    }
    ```
    Having a different closed timestamp target per range would require a node-to-node latency mapping shared between the sender and receiver. Since propagation is mostly dominated by sideTransportCloseInterval (200ms), we simplify by picking the maximum observed latency to any follower node.

@wenyihu6 wenyihu6 force-pushed the rpcnetwork-sc branch 11 times, most recently from 8bbb30b to 85798af Compare February 22, 2025 22:30
This patch adds a new field `avgProposalToLocalApplicationLatency` to the replica,
tracking a latency similar to `raft.replication.latency` with the following
differences:
1. It is tracked at the per-replica level.
2. It only includes successful write commands, excluding errors like
`AmbiguousResultError` or rejected proposals.
3. Recording only starts when the proposal is proposed to raft.

This field tracks the average time between proposal and local application
of a command on successful writes. An exponentially weighted moving average
is used with an effective window size of 30, where recent data points
have a higher influence, and older data gradually decays. The metric is
thread-safe.

Notes:
1. It does not include full application on follower replicas.
2. No measurements are recorded for read-only commands or read-write commands
that do not result in writes.
3. No measurements are recorded for proposal failures (e.g. due to
AmbiguousResultError, rejected proposals, or for request
evaluation that did not lead to raft proposals).

Note that avgProposalToLocalApplicationLatency is left unused. Future
commits will use it to compute the closed timestamp target
for global tables.

Informs: cockroachdb#72393
Release note: none
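A minimal sketch of the exponentially weighted moving average described above, assuming the common convention that an effective window of N samples corresponds to alpha = 2/(N+1); this is an illustration, not the PR's actual implementation:

```go
package closedtsdemo // hypothetical package, for illustration only

import (
	"sync"
	"time"
)

// movingAvgLatency is an illustrative, thread-safe exponentially weighted
// moving average with an effective window of n samples (alpha = 2/(n+1));
// recent samples dominate and older ones gradually decay.
type movingAvgLatency struct {
	mu    sync.Mutex
	alpha float64
	avg   float64 // smoothed latency in nanoseconds
	init  bool
}

func newMovingAvgLatency(windowSize int) *movingAvgLatency {
	return &movingAvgLatency{alpha: 2.0 / (float64(windowSize) + 1)}
}

// record folds a new measurement into the average. Per the commit message,
// callers would invoke this only for successful write commands, measured from
// raft proposal to local application.
func (m *movingAvgLatency) record(d time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.init {
		m.avg, m.init = float64(d.Nanoseconds()), true
		return
	}
	m.avg = m.alpha*float64(d.Nanoseconds()) + (1-m.alpha)*m.avg
}

// value returns the current smoothed latency, or zero if nothing was recorded.
func (m *movingAvgLatency) value() time.Duration {
	m.mu.Lock()
	defer m.mu.Unlock()
	return time.Duration(m.avg)
}
```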
@wenyihu6 wenyihu6 force-pushed the rpcnetwork-sc branch 2 times, most recently from 5200bdb to 2a6b0a4 Compare February 24, 2025 03:27
@wenyihu6 wenyihu6 force-pushed the rpcnetwork-sc branch 8 times, most recently from d7388e8 to 091d18e Compare February 24, 2025 04:57
This patch adds a new field `avgMaxNetWorkLatency` to the Sender, tracking the
average maximum network latency between the sender and any of its follower nodes.
An exponentially weighted moving average is used with an effective window size
of 30, where recent data points have a higher influence and older data
gradually decays.

Note that `avgMaxNetWorkLatency` is left unused. Future
commits will use it to compute the closed timestamp target
for global tables.

Part of: cockroachdb#59680
Release note: none
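Reusing the illustrative EWMA tracker sketched after the previous commit message, a hedged sketch of the sender-side update; the node-ID slice and latency lookup are hypothetical, not the Sender's real fields:

```go
// recordMaxFollowerLatency is an illustrative helper: once per side-transport
// interval, take the worst observed latency to any node that must receive
// closed timestamps from this sender, and fold it into the smoothed average.
func recordMaxFollowerLatency(
	avg *movingAvgLatency,
	followerNodes []int32, // hypothetical: IDs of nodes with follower replicas
	latencyTo func(nodeID int32) (time.Duration, bool), // hypothetical latency source
) {
	var maxLatency time.Duration
	for _, n := range followerNodes {
		if l, ok := latencyTo(n); ok && l > maxLatency {
			maxLatency = l
		}
	}
	if maxLatency > 0 {
		avg.record(maxLatency)
	}
}
```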
Previously, if a cluster defined `kv.closed_timestamp.lead_for_global_reads_override`,
the lead time calculated at the sender was simply overridden at the end.

This patch moves the override earlier in the logic and
returns early for better readability.

Epic: none
Release note: none
Previously, a conditional check was used to find the maximum
between sideTransportPropTime and raftTransportPropTime.

This patch changes it to use the max builtin for better
readability.

Release note: none
Epic: none
…_tune

This patch introduces a new cluster setting,
`kv.closed_timestamp.lead_for_global_reads_auto_tune`, which enables auto-tuning
of the lead time that global-read ranges use to publish closed timestamps. The
tuning is based on observed raft and side-transport closed timestamp propagation
latencies. The `kv.closed_timestamp.lead_for_global_reads_override` cluster
setting takes precedence over this one. If auto-tuning is disabled or no data has
been observed, the system falls back to a hardcoded computed lead time.

Note that the setting is currently unused, and no release notes have been added yet.
Future commits will populate and integrate it.

Part of: cockroachdb#59680
Release note: none
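The precedence described in this commit message (override first, then auto-tuning when enabled and backed by observed data, then the hardcoded fallback) can be summarized in a small sketch; the function and parameter names are hypothetical:

```go
// chooseLeadTime sketches the selection order: the override setting wins, then
// the auto-tuned estimate (when enabled and when data has been observed), and
// finally the hardcoded computed lead time. Durations use the standard time package.
func chooseLeadTime(
	override time.Duration, // kv.closed_timestamp.lead_for_global_reads_override
	autoTuneEnabled bool, // kv.closed_timestamp.lead_for_global_reads_auto_tune
	observed, hardcoded time.Duration,
) time.Duration {
	if override > 0 {
		return override
	}
	if autoTuneEnabled && observed > 0 {
		return observed
	}
	return hardcoded
}
```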
This patch refactors `closedts.TargetForPolicy` by extracting part of its code
into a helper function. It does not change any existing behavior but makes
future commits cleaner.

Part of: cockroachdb#59680
Release note: none
Previously, for global reads, we calculated the target closed timestamp using
the `LEAD_FOR_GLOBAL_READS` policy. This policy indicates that the range’s closed
timestamp is configured to lead the present time so that all follower replicas
of the range can serve present-time reads. Note that this is an estimated target,
not the lead time itself.

This is how the calculation is currently done:

```
raft_propagation_time = max_network_rtt*1.5 (closed timestamp propagation by raft from a leader)
+ raft_overhead (proposal evaluation, log sync, and state machine apply)

side_propagation_time = max_network_rtt*0.5 (closed timestamp propagation by side transport)
+ side_transport_close_interval

// Maximum time required to propagate a closed timestamp update to followers from the sender.
closed_ts_propagation = max(raft_propagation_time, side_propagation_time)

closed_ts_at_sender = now(sender’s current clock) + maxClockOffset (max clock uncertainty)
+ closed_ts_propagation
```

By default, this results in a closed timestamp leading the sender's current clock by 800ms. Users
can override this by tuning `kv.closed_timestamp.lead_for_global_reads_override`.

**Problems with the previous approach**

1. Tuning trade-offs: A higher lead timestamp forces leaseholders to wait longer
for the current time to catch up, incurring higher write latency. A lower lead
timestamp may result in followers not having fully replicated the changes before
control returns to the client, forcing follower replicas to redirect queries to
the leaseholder and increasing read latency.

2. Cluster-wide setting: This setting applies to the entire cluster, not per
range. For geographically distant replicas, a short lead time may not be enough
for full application. However, increasing the lead time cluster-wide can harm
write performance for replicas that are close together and don’t require as much
time to replicate.

**Improvements in this patch**

This patch auto-tunes closed timestamp propagation for global reads by providing
more accurate estimates for raft and side transport propagation time.

1. Raft Propagation Time Adjustment

Instead of `raft_propagation_time = max_network_rtt*1.5 + raft_overhead`, the
`max_network_rtt*1.5` factor is now replaced with a smoothed average of the time taken
between proposal evaluation and local application at the leader replica. This
measurement considers only successful write commands, capturing the time up to
when the proposal is committed to the Raft log and applied locally. It does not
include the full application on follower replicas.

Ideally, we would account for full application time on follower replicas, but
the leader has no direct visibility into when follower replicas apply updates to
their state machine. Tracking this propagation across all followers seems
complex, so for now, we rely on raft_overhead to approximate this delay.

2. Side Transport Propagation Time Adjustment

Instead of `side_propagation_time = max_network_rtt*0.5 +
side_transport_close_interval`, the `max_network_rtt*0.5` factor is now replaced
by a smoothed average of the maximum latency observed by the sender node. This average
takes the maximum latency across all nodes to which the sender must send closed
timestamps, i.e. nodes holding follower replicas of ranges the sender holds leases for.

Ideally, every node-to-node RPC would have its own network latency average,
implying a different closed timestamp target for each pair. However, this would
complicate the side transport design. Currently, senders and receivers share
state by using a single closed timestamp for each policy, with any new or
removed ranges noted in the group. To assign a unique timestamp for each range
based on different network latencies, we would need to introduce a node-to-node
latency mapping on top of the group ranges state view. In addition, the side
transport propagation calculation is usually dominated by
sideTransportCloseInterval (200ms by default). To avoid the complexity, we
simply use the maximum latency to estimate the one-hop network latency.

Resolves: cockroachdb#59680
Release note: `kv.closed_timestamp.lead_for_global_reads_auto_tune` can now
be used to auto-tune the lead time that ranges of global tables use to publish
closed timestamps. The `kv.closed_timestamp.lead_for_global_reads_override`
cluster setting takes precedence over this one. If auto-tuning is disabled or
no data is observed, the system falls back to a hardcoded computed lead time.
…uning

This patch adds more tests for closed timestamp auto-tuning, specifically for
closed timestamp propagation via raft proposals.

Part of: cockroachdb#59680
Release note: none
Successfully merging this pull request may close these issues.

kv: tune closed timestamp duration for ranges with global_reads based on replication latency