stmtstats: investigate mutex contention in ssmemstorage.(*Container).getStatsForStmtWithKey #140590

Open · tbg opened this issue Feb 6, 2025 · 2 comments

Labels: branch-master (Failures and bugs on the master branch), C-performance (Perf of queries or internals; solution not expected to change functional behavior), o-perf-efficiency (Related to performance efficiency), P-2 (Issues/test failures with a fix SLA of 3 months), T-observability

tbg (Member) commented Feb 6, 2025
See #139557 for BenchmarkParallelSysbench, whose mutex profile prompted creation of this issue.

The mutex profile there is potentially interesting because it shows a very prominently contended mutex in ssmemstorage.(*Container).getStatsForStmtWithKey. It accounts for >50% of contention (329.36s) in a benchmark invocation that ran for ~122s. One way to read this is that we lose ~2.7 (= 329/122) of the 24 worker threads to contention, so throughput could be higher by a factor of 24/(24−2.7) ≈ 24/21.3 ≈ 1.13x. That's 13%, which would be pretty massive. I would take this with a large grain of salt, though: I didn't see this contention as prominently in mutex profiles from a real cluster [1], and it may instead be an artifact of having too many worker goroutines (we have 24 in this benchmark, but throughput is only ~10x higher than when running with a single worker). On the other hand, the mutex being hot might highlight one reason why linear scaling breaks down at that point.
We should take a look.

[Mutex contention profile: image]
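(For reference, a mutex profile like the one above can be collected for a Go benchmark with the standard -mutexprofile test flag, or programmatically along the lines of the sketch below. The helper name is hypothetical and not part of the CockroachDB tree; it only uses the stock runtime/pprof API, and the output is inspected with go tool pprof.)

package profutil

import (
	"os"
	"runtime"
	"runtime/pprof"
)

// withMutexProfile is a hypothetical helper: it records mutex contention
// events while workload runs and writes them to path for inspection with
// `go tool pprof`.
func withMutexProfile(path string, workload func()) error {
	// Record every contention event for the duration of the workload, then
	// restore the previous sampling rate.
	prev := runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(prev)

	workload()

	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return pprof.Lookup("mutex").WriteTo(f, 0)
}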

Epic: CRDB-42584

Jira issue: CRDB-47252

Footnotes

[1] https://github.com/cockroachdb/cockroach/issues/140235

tbg (Member, Author) commented Feb 6, 2025

A quick experiment with this very hacky diff on top of #139557


commit b99b2d6e9e34e67044c252c566c5e5354f7fb50d
Author: Tobias Grieger <[email protected]>
Date:   Thu Feb 6 15:02:48 2025 +0100

    [dnm] use RLock for getStatsFor{Stmt,Txn}WithKey

diff --git a/pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go b/pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go
index a0d4455e138..464d11fced9 100644
--- a/pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go
+++ b/pkg/sql/sqlstats/ssmemstorage/ss_mem_storage.go
@@ -477,8 +477,8 @@ func (s *Container) getStatsForStmt(
 func (s *Container) getStatsForStmtWithKey(
 	key stmtKey, stmtFingerprintID appstatspb.StmtFingerprintID, createIfNonexistent bool,
 ) (stats *stmtStats, created, throttled bool) {
-	s.mu.Lock()
-	defer s.mu.Unlock()
+	s.mu.RLock()
+	defer s.mu.RUnlock()
 	return s.getStatsForStmtWithKeyLocked(key, stmtFingerprintID, createIfNonexistent)
 }
 
@@ -496,8 +496,13 @@ func (s *Container) getStatsForStmtWithKeyLocked(
 		}
 		stats = &stmtStats{}
 		stats.ID = stmtFingerprintID
+		// Bad hack alert!
+		s.mu.RUnlock()
+		s.mu.Lock()
 		s.mu.stmts[key] = stats
 		s.mu.sampledStatementCache[key.sampledPlanKey] = struct{}{}
+		s.mu.Unlock()
+		s.mu.RLock()
 
 		return stats, true /* created */, false /* throttled */
 	}
@@ -509,8 +514,8 @@ func (s *Container) getStatsForTxnWithKey(
 	stmtFingerprintIDs []appstatspb.StmtFingerprintID,
 	createIfNonexistent bool,
 ) (stats *txnStats, created, throttled bool) {
-	s.mu.Lock()
-	defer s.mu.Unlock()
+	s.mu.RLock()
+	defer s.mu.RUnlock()
 
 	return s.getStatsForTxnWithKeyLocked(key, stmtFingerprintIDs, createIfNonexistent)
 }
@@ -531,7 +536,12 @@ func (s *Container) getStatsForTxnWithKeyLocked(
 		}
 		stats = &txnStats{}
 		stats.statementFingerprintIDs = stmtFingerprintIDs
+		// Hack time.
+		s.mu.RUnlock()
+		s.mu.Lock()
 		s.mu.txns[key] = stats
+		s.mu.Unlock()
+		s.mu.RLock()
 		return stats, true /* created */, false /* throttled */
 	}
 	return stats, false /* created */, false /* throttled */

shows promising results:

old:  7fbc3d2 tests: extract BenchmarkParallelSysbench
new:  b99b2d6 [dnm] use RLock for getStatsFor{Stmt,Txn}WithKey
args: benchdiff "./pkg/sql/tests" "-b" "-r" "BenchmarkParallelSysbench/SQL/3node/oltp_read_write.*" "-d" "10000x" "-c" "20"

running benchmarks:
name                                           old time/op    new time/op    delta
ParallelSysbench/SQL/3node/oltp_read_write-24    1.12ms ± 1%    1.09ms ± 2%   -2.48%  (p=0.000 n=11+11)

name                                           old errs/op    new errs/op    delta
ParallelSysbench/SQL/3node/oltp_read_write-24      0.01 ±11%      0.01 ±17%  +27.94%  (p=0.000 n=12+12)

name                                           old alloc/op   new alloc/op   delta
ParallelSysbench/SQL/3node/oltp_read_write-24    2.05MB ± 0%    2.04MB ± 0%   -0.31%  (p=0.002 n=12+11)

name                                           old allocs/op  new allocs/op  delta
ParallelSysbench/SQL/3node/oltp_read_write-24     8.73k ± 1%     8.65k ± 1%   -0.85%  (p=0.000 n=12+12)

Also, predictably, the mutex disappears from the contention profile:

[Mutex profile with the hack: image]

I think this kind of improvement would generally benefit all SQL gateway nodes on which certain statement fingerprints are very hot, which is common enough, and certainly is common in benchmarks.

I believe it's straightforward to change this code to take RLock by default, and to upgrade to the write lock only when the map entry has to be created for the first time. That's a standard pattern, for example here:

func (s *Store) tryGetOrCreateReplica(
	ctx context.Context,
	rangeID roachpb.RangeID,
	replicaID roachpb.ReplicaID,
	creatingReplica *roachpb.ReplicaDescriptor,
) (_ *Replica, created bool, _ error) {
	// The common case: look up an existing replica.
	if repl, err := s.tryGetReplica(ctx, rangeID, replicaID, creatingReplica); err != nil {
		return nil, false, err
	} else if repl != nil {
		return repl, false, nil
	}
	// No replica currently exists, so try to create one. Multiple goroutines may
	// be racing at this point, so grab a "lock" over this rangeID (represented by
	// s.mu.creatingReplicas[rangeID]) by one goroutine, and retry others.
	s.mu.Lock()
	if _, ok := s.mu.creatingReplicas[rangeID]; ok {
		// Lost the race - another goroutine is currently creating that replica. Let
		// the caller retry so that they can eventually see it.
		s.mu.Unlock()
		return nil, false, errRetry
	}
	s.mu.creatingReplicas[rangeID] = struct{}{}
	s.mu.Unlock()
	// ...

or here (cockroach/pkg/rpc/context.go, lines 2033 to 2053 at b99b2d6):

	k := peerKey{TargetAddr: target, NodeID: remoteNodeID, Class: class}
	if p, ok := rpcCtx.peers.get(k); ok {
		// There's a cached peer, so we have a cached connection, use it.
		return p.c
	}
	// Slow path. Race to create a peer.
	conns := &rpcCtx.peers
	conns.mu.Lock()
	defer conns.mu.Unlock()
	if p, lostRace := conns.getRLocked(k); lostRace {
		return p.c
	}
	// Won race. Actually create a peer.
	if conns.mu.m == nil {
		conns.mu.m = map[peerKey]*peer{}
	}

I haven't checked, but a sync.Map may also be an option here.
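For concreteness, here is a minimal sketch of the RLock-first, double-checked pattern applied to a stats container. The names (container, getOrCreate) are stand-ins rather than the actual ssmemstorage code, and the sketch ignores the fingerprint-limit/throttling checks that the real getStatsForStmtWithKeyLocked performs:

package statsdemo

import "sync"

type stmtStats struct{ /* elided */ }

type container struct {
	mu struct {
		sync.RWMutex
		stmts map[string]*stmtStats
	}
}

func (c *container) getOrCreate(key string) (stats *stmtStats, created bool) {
	// Fast path: most lookups hit an existing entry and only need the read lock.
	c.mu.RLock()
	stats, ok := c.mu.stmts[key]
	c.mu.RUnlock()
	if ok {
		return stats, false
	}

	// Slow path: take the write lock and re-check, since another goroutine may
	// have created the entry between RUnlock and Lock.
	c.mu.Lock()
	defer c.mu.Unlock()
	if existing, ok := c.mu.stmts[key]; ok {
		return existing, false // lost the race to another creator
	}
	if c.mu.stmts == nil {
		c.mu.stmts = map[string]*stmtStats{}
	}
	stats = &stmtStats{}
	c.mu.stmts[key] = stats
	return stats, true
}

A sync.Map would remove the explicit lock from the read path entirely, at the cost of per-entry allocations and a less typed API; which option wins here would have to be measured.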

Either way, this seems like a small effort with a definite upside, so I would recommend taking this on.

xinhaoz self-assigned this Feb 6, 2025

xinhaoz (Member) commented Feb 6, 2025

Thank you! This is being pursued in #140393. I'm extracting changes from that right now, including the lock improvements, but I expect the biggest benefit will come from moving the recording out of the execution path, which will take a bit longer for us to test thoroughly.
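(One possible shape of "moving the recording out of the execution path", purely as a generic illustration and not taken from #140393: hand events to a single background goroutine via a bounded channel, so the query path never touches the shared maps or their lock. All names here, statsEvent and asyncRecorder, are hypothetical.)

package statsdemo

import "time"

type statsEvent struct {
	key     string
	latency time.Duration
}

type asyncRecorder struct {
	ch chan statsEvent
}

// newAsyncRecorder starts a single background goroutine that applies events,
// so the shared stats structures are only touched off the query path.
func newAsyncRecorder(apply func(statsEvent)) *asyncRecorder {
	r := &asyncRecorder{ch: make(chan statsEvent, 1024)}
	go func() {
		for ev := range r.ch {
			apply(ev)
		}
	}()
	return r
}

// Record never blocks the execution path: if the buffer is full the event is
// dropped, a deliberate trade-off for a hot path.
func (r *asyncRecorder) Record(ev statsEvent) {
	select {
	case r.ch <- ev:
	default:
	}
}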
