@tbg tbg commented Nov 28, 2025

This PR is stacked on top of #158373


Bring replica rebalancing in line with the candidate filtering strategy
documented in rebalanceStore.

Pre-means filtering (via retainReadyReplicaTargetStoresOnly):
The mean should be computed over stores that could hold a replica.
Stores are excluded if they're not viable replica hosts:

  • Health is not HealthOK (dead/unhealthy/unknown)
  • Replica disposition is not OK (refusing/shedding), except for the
    shedding store which already has the replica

Post-means filtering (via postMeansExclusions):
Stores that should contribute to the mean but aren't valid targets
for this specific transfer:

  • Stores on nodes that already have a replica of this range
  • All stores on the shedding store's node if that node is CPU
    overloaded (legacy behavior, TODO to remove)
  • The shedding store itself (it's a viable location, but we're
    moving away from it)
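
The two phases above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `store` type, its fields, and the helpers `preMeansFilter`, `meanLoad`, `postMeansFilter`, and `exampleStores` are all hypothetical stand-ins, and the legacy CPU-overload exclusion is left out for brevity.

```go
package main

import "fmt"

// store is a hypothetical stand-in for the allocator's per-store state.
type store struct {
	id, node       int
	healthy        bool    // stands in for "health is HealthOK"
	acceptsReplica bool    // stands in for "replica disposition is OK"
	load           float64 // a single illustrative load dimension
}

// preMeansFilter keeps only viable replica hosts. The shedding store is kept
// even though its disposition is not OK, since it already holds the replica.
func preMeansFilter(stores []store, sheddingID int) []store {
	var out []store
	for _, s := range stores {
		if !s.healthy {
			continue
		}
		if !s.acceptsReplica && s.id != sheddingID {
			continue
		}
		out = append(out, s)
	}
	return out
}

// meanLoad averages load over the stores that survived pre-means filtering.
func meanLoad(stores []store) float64 {
	if len(stores) == 0 {
		return 0
	}
	var sum float64
	for _, s := range stores {
		sum += s.load
	}
	return sum / float64(len(stores))
}

// postMeansFilter drops stores that counted toward the mean but cannot be
// targets: the shedding store and any store on a node that already has a replica.
func postMeansFilter(stores []store, sheddingID int, replicaNodes map[int]bool) []int {
	var ids []int
	for _, s := range stores {
		if s.id == sheddingID || replicaNodes[s.node] {
			continue
		}
		ids = append(ids, s.id)
	}
	return ids
}

// exampleStores returns made-up data for the illustration.
func exampleStores() []store {
	return []store{
		{id: 1, node: 1, healthy: true, acceptsReplica: false, load: 90}, // shedding store
		{id: 2, node: 2, healthy: true, acceptsReplica: true, load: 50},  // existing replica
		{id: 3, node: 3, healthy: true, acceptsReplica: true, load: 40},
		{id: 4, node: 4, healthy: false, acceptsReplica: true, load: 0},  // unhealthy
		{id: 5, node: 5, healthy: true, acceptsReplica: false, load: 20}, // refusing
		{id: 6, node: 2, healthy: true, acceptsReplica: true, load: 30},  // shares node 2
	}
}

func main() {
	eligible := preMeansFilter(exampleStores(), 1)
	targets := postMeansFilter(eligible, 1, map[int]bool{1: true, 2: true})
	fmt.Println(meanLoad(eligible), targets) // prints "52.5 [3]"
}
```

In this made-up data, stores 1, 2, 3, and 6 contribute to the mean, but only store 3 remains a valid target once the shedding store and the nodes that already hold a replica are excluded.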

Other changes:

  • Rename computeCandidatesForRange to computeCandidatesForReplicaTransfer
  • Consolidate exclusions into single postMeansExclusions parameter
  • Move candidate logging into computeCandidatesForReplicaTransfer
  • Add unit tests for retainReadyReplicaTargetStoresOnly

Part of #156776.
^-- need to plumb the mmaprototype.Status inputs to mmaAllocator still, both in asim and production code.

Epic: CRDB-55052

tbg added 5 commits November 26, 2025 09:32
Move tracer creation and recording span setup to be shared across all
datadriven commands. Previously only rebalance-stores created a tracer;
now every command automatically gets a ctx with a recording span and
access to finishAndGet() to collect the trace.

This is prep for adding more commands that need tracing output.

Epic: CRDB-55052

Add health and store-level disposition checks to lease target filtering,
and move all eligibility filtering (including the existing per-replica
disposition check) to happen before computing load means.

Computing means only over eligible targets is semantically correct: when
asking "is store X above or below average?", we mean among stores that
could actually receive the lease. Including ineligible stores (dead,
unhealthy, refusing) distorts the average.

Example: with eligible targets at 30/40/50 QPS and a dead store at 0 QPS:
- Including dead store: mean=30, so 50 QPS looks "significantly above average"
- Excluding dead store: mean=40, so 50 QPS is "slightly above average"

The latter correctly reflects reality among viable candidates.
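
The arithmetic in this example can be checked with a few lines of Go. This is only an illustrative sketch: `mean` and `percentAbove` are hypothetical helpers, not functions from the allocator.

```go
package main

import "fmt"

// mean returns the arithmetic mean of loads, or 0 for an empty slice.
func mean(loads []float64) float64 {
	if len(loads) == 0 {
		return 0
	}
	var sum float64
	for _, l := range loads {
		sum += l
	}
	return sum / float64(len(loads))
}

// percentAbove reports how far x sits above m, as a percentage of m.
func percentAbove(x, m float64) float64 {
	return (x - m) / m * 100
}

func main() {
	eligible := []float64{30, 40, 50}
	// Copy before appending so the eligible slice is left untouched.
	withDead := append(append([]float64{}, eligible...), 0)

	mWith, mWithout := mean(withDead), mean(eligible)
	fmt.Printf("including dead store: mean=%.0f, 50 QPS is %.0f%% above\n",
		mWith, percentAbove(50, mWith))
	fmt.Printf("excluding dead store: mean=%.0f, 50 QPS is %.0f%% above\n",
		mWithout, percentAbove(50, mWithout))
}
```

With the dead store included, the 50 QPS store sits about 67% above the mean of 30; with it excluded, the same store sits 25% above the mean of 40.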

Add a datadriven directive and testdata file to exercise the new
retainReadyLeaseTargetStoresOnly function, covering health, store-level
disposition, and per-replica disposition filtering. The functionality
is also exercised through a newly introduced TestClusterState test.

`asim` testing is deferred to a different PR because it is
contingent on cockroachdb#158455 and an additional in-progress plumbing PR.

Epic: CRDB-55052

Add a detailed comment to rebalanceStore explaining the two-phase
candidate filtering approach and how it relates to load mean computation.

Pre-means filtering excludes stores whose load data is irrelevant:
- Dead stores (stale/zero load distorts averages)
- Unhealthy stores for lease transfers (leases will move anyway)

Post-means filtering excludes stores with accurate load that shouldn't
be targets:
- Unhealthy stores for replica transfers (data persists, capacity counts)
- Disposition-based (refusing/shedding but load is real)
- Load-based criteria (context-dependent)

The key insight is that filtering timing determines mean composition,
which affects whether stores appear under/overloaded.

Rename rebalance_stores_cpu_replica_unhealthy_store.txt to
rebalance_stores_cpu_lease_refusing_target.txt to accurately
reflect that the test covers lease disposition filtering, not
replica health filtering.
@tbg tbg requested review from a team as code owners November 28, 2025 17:03
@tbg tbg marked this pull request as draft November 28, 2025 17:03