Fix WBM concurrency control, Add SetAllowStall(), Cleanup #11253
Conversation
private:
Cleanup 1: make the functions that are marked as "should be called within RocksDB internals" private.
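For illustration, a minimal sketch of that shape; the member names and the `friend` mechanism here are assumptions, not the PR's exact code:

```cpp
// Illustration only; member names and the friend mechanism are assumptions.
class WriteBufferManagerSketch {
 public:
  bool enabled() const;  // genuinely public API stays public

 private:
  // "Should be called within RocksDB internals": now unreachable from user
  // code, and exposed to internal callers explicitly instead.
  void MaybeEndWriteStall();
  friend class InternalCaller;  // stand-in for the RocksDB-internal user
};
```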
// Value should only be changed by BeginWriteStall() and MaybeEndWriteStall()
// while holding mu_, but it can be read without a lock.
std::atomic<bool> stall_active_;
Cleanup 2: I don't see how `stall_active_` and its related functions are needed. It seems like as long as we have our stall condition check (`ShouldStall()`) implemented right, we will always have `stall_active_ == ShouldStall()`. But let me know if I overlooked anything :)
@akankshamahajan15 shared a perspective that `stall_active_` might exist as a perf optimization to reduce lock contention in the case of multiple DBs using the same WBM. She will cite more of the previous discussion on this soon.
I am not entirely sure about keeping it yet, mainly because such a perf optimization makes the concurrency model of WBM harder to understand: some state can be accessed without a lock while some can't.
[TODO for me] Understand the previous conversation on having `stall_active_` and the need for it; reconsider the perf cost vs. model simplicity.
if (new_size == 0) {
  assert(false);
  return;
}
Cleanup 3: based on a comment in the code, `Cannot early-exit on !enabled() because SetBufferSize(0) needs to unblock...`, I believe we don't want the buffer size to be set to 0 at runtime. So I added an explicit check here for both prod and debug builds.
// If the node was not consumed, the stall has ended already and we can signal
// the caller.
if (!new_node.empty()) {
  new_node.front()->Signal();
Cleanup 4: I don't see why we need to signal `wbm_stall` here. If `wbm_stall` has been blocked because another writer thread of the same DB set it to blocked, that writer thread would be responsible for unblocking/signaling this `wbm_stall`, not the current one. But again, let me know if I missed anything :)
// Cannot early-exit on !enabled() because SetBufferSize(0) needs to unblock
// the writers.
Question 2: I am not quite clear about this comment.
Does it mean we don't need to early-exit on `!enabled()` because that case won't happen, since "SetBufferSize(0) needs to unblock the writers"?
Or does it mean we are not able to early-exit because we have to satisfy the requirement that "SetBufferSize(0) needs to unblock the writers"?
// Cannot early-exit on !enabled() because SetBufferSize(0) needs to unblock
// the writers.
if (!allow_stall_) {
  if (ShouldStall()) {
Cleanup 5: I decided to encapsulate the stall condition check inside `ShouldStall()` and to use this function whenever we need to check whether the stall condition has been met. This makes it easier for us to change the stall condition check in the future, e.g., making some of it runtime-changeable, or adding a new stall condition (cc @ajkr: you know what I'm talking about :p :p :p, the stall on global memory full).
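For illustration, a minimal sketch of what such an encapsulated check could look like; the field names and the exact condition are assumptions based on this thread, not necessarily the PR's final code:

```cpp
#include <atomic>
#include <cstddef>

// Sketch only: field names and the exact condition are assumptions.
struct WriteBufferManagerSketch {
  std::atomic<size_t> memory_used{0};
  size_t buffer_size = 0;   // 0 means the WBM is disabled
  bool allow_stall = false;

  bool enabled() const { return buffer_size > 0; }

  // Single home for the stall condition: every caller asks this function
  // instead of re-deriving the condition, so future conditions (e.g. a
  // global-memory-full stall) only touch this one spot.
  bool ShouldStall() const {
    return allow_stall && enabled() &&
           memory_used.load(std::memory_order_relaxed) >= buffer_size;
  }
};
```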
return;  // Stall conditions have not resolved.
}

// Perform all deallocations outside of the lock.
Note: we can't do this now, since we need to hold the lock when calling MaybeEndWriteStall(). See the perf test results below for any perf regression concern.
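For context, the pattern being given up here is the common one sketched below (a generic sketch, not the PR's code): mutate the container under the lock, but let the freed nodes be destroyed outside it.

```cpp
#include <list>
#include <mutex>

// Generic "deallocate outside the lock" pattern: detach the nodes while
// holding the mutex, then let them be destroyed after the lock is released
// so destructor work doesn't extend the critical section.
template <typename T>
void DrainOutsideLock(std::mutex& mu, std::list<T>& guarded_queue) {
  std::list<T> cleanup;  // empty; allocation-free to construct
  {
    std::lock_guard<std::mutex> lock(mu);
    cleanup.splice(cleanup.begin(), guarded_queue);  // O(1), no alloc/free
  }
  // `cleanup` is destroyed here, outside the lock.
}
```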
@@ -173,9 +179,8 @@ void WriteBufferManager::RemoveDBFromQueue(StallInterface* wbm_stall) {

// Deallocate the removed nodes outside of the lock.
std::list<StallInterface*> cleanup;

if (enabled() && allow_stall_) {
Optimization 2: it appears to me that we shouldn't check `enabled() && allow_stall_` again with this PR, and we didn't need to check it before this PR either. Reasons (see the sketch after this list):
- `allow_stall_` can be changed to false at runtime, and we don't want that to interfere with `RemoveDBFromQueue()` as part of DB cleanup.
- Even when `allow_stall_` was not runtime-changeable before, `queue_` wouldn't contain anything if `enabled() && allow_stall_` was false, because no stall would have been started then.
assert(wbm_stall != nullptr);
assert(allow_stall_);

// Allocate outside of the lock.
Question 3: just to clarify, this "allocate" means the allocation of the list node holding the pointer in `new_node`, not of the actual StallInterface object `wbm_stall`, right? Same for all the deallocation done outside the lock below, like `// Deallocate the removed nodes outside of the lock.`
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
void FreeMemWithCache(size_t mem);

// Mutex used to protect WriteBufferManager's data variables.
mutable std::mutex wbm_mutex_;
Cleanup 6: use one mutex instead of multiple mutexes to guard separate groups of WBM data, for simplicity.
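A rough before/after sketch of the shape of this change (field names are illustrative, not the actual WBM members):

```cpp
#include <mutex>

// Before (illustrative): separate locks per data group, which forces callers
// to reason about lock ordering between them.
struct MultiMutexWBM {
  std::mutex stall_mu_;  // guards the stall queue
  std::mutex cache_mu_;  // guards cache reservation state
};

// After (Cleanup 6, illustrative): one mutex guards all of WBM's mutable
// state, trading some potential parallelism for a simpler concurrency model.
struct SingleMutexWBM {
  mutable std::mutex wbm_mutex_;  // guards queue_, cache state, flags, ...
};
```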
@hx235 has updated the pull request. You must reimport the pull request before landing.
void WriteBufferManagerStallWrites();
// If stall conditions are met, begin stalling of writes with help of
// `WriteBufferManager`
void MaybeWriteBufferManagerStallWrites();
Cleanup 7: rename this function to reflect that the stall condition is checked again within this function.
static_cast<WBMStallInterface*>(wbm_stall_.get())
    ->SetState(WBMStallInterface::State::BLOCKED);
Cleanup 8: make this part of the responsibility of MaybeBeginWriteStall(), as the cast `static_cast<WBMStallInterface*>` indicates we are dealing with something too low-level here.
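A sketch of that direction (the names and signatures are assumptions, not the PR's exact code): the WBM-side function owns both the state transition and the queueing, so DB code never needs the cast.

```cpp
#include <list>
#include <mutex>

struct StallInterface {
  enum class State { RUNNING, BLOCKED };
  virtual ~StallInterface() = default;
  virtual void SetState(State s) = 0;
};

struct WBMSketch {
  std::mutex wbm_mutex_;
  std::list<StallInterface*> queue_;

  bool ShouldStall() const { return true; }  // placeholder condition

  void MaybeBeginWriteStall(StallInterface* wbm_stall) {
    std::lock_guard<std::mutex> lock(wbm_mutex_);
    if (!ShouldStall()) {
      return;  // condition re-checked inside, hence the "Maybe"
    }
    // The low-level state transition lives here, not in DB code.
    wbm_stall->SetState(StallInterface::State::BLOCKED);
    queue_.push_back(wbm_stall);
  }
};
```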
@hx235 has updated the pull request. You must reimport the pull request before landing.
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
if (enabled()) {
  memory_active_.fetch_sub(mem, std::memory_order_relaxed);
}
}

void WriteBufferManager::FreeMem(size_t mem) {
  std::unique_lock<std::mutex> lock(wbm_mutex_);
Can we acquire the lock inside these functions (FreeMemWithCache, MaybeEndWriteStall, etc.) to reduce the lock duration?
Is there a concrete concern about where and how such lock duration can be problematic? As far as I know, the longest duration comes from `cache_res_mgr_->UpdateCacheReservation()` evicting dummy entries linear in the `mem` to be freed from cache. Would that be long enough to be a problem?
My intention is to make `FreeMem` an atomic operation, since the read/write of `memory_used_` and `MaybeEndWriteStall()` are closely related through `ShouldStall()`. Making these two into one atomic operation relieves us from thinking about several data races.
We also have a plan of factoring the returned status `s` of `cache_res_mgr_->UpdateCacheReservation()` in `WriteBufferManager::FreeMemWithCache()` (or `ReserveMemWithCache()`) into the decision of whether we should stall or not; right now we only do `s.PermitUncheckedError();`.
So from this perspective, I'd also like to make `FreeMemWithCache` atomic with the read/write of `memory_used_` and `MaybeEndWriteStall()`.
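To make the intent concrete, a simplified sketch of `FreeMem` as one critical section (the real function also updates cache reservations; this is not the PR's exact code):

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>

// Sketch of the "FreeMem as one atomic operation" intent: the memory
// accounting and the stall re-check happen under one critical section, so
// no thread can observe the counter updated but the stall state stale,
// or vice versa.
struct WBMFreeMemSketch {
  std::mutex wbm_mutex_;
  std::atomic<size_t> memory_used_{0};

  void MaybeEndWriteStall() { /* requires wbm_mutex_ held; wakes writers */ }

  void FreeMem(size_t mem) {
    std::unique_lock<std::mutex> lock(wbm_mutex_);
    memory_used_.fetch_sub(mem, std::memory_order_relaxed);
    MaybeEndWriteStall();  // sees the counter it just updated, atomically
  }
};
```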
@@ -84,26 +117,25 @@ void WriteBufferManager::ReserveMemWithCache(size_t mem) {
}

void WriteBufferManager::ScheduleFreeMem(size_t mem) {
  std::unique_lock<std::mutex> lock(wbm_mutex_);
IIRC, we don't use a lock for `memory_active_` and other similar variables, to avoid acquiring/releasing the mutex, which can be expensive; these variables are already atomic, so they don't really need the mutex.
Since we are reading `buffer_size_`'s value with order `std::memory_order_relaxed`, if we don't lock, do we worry about not getting the latest value written to `buffer_size_` by another thread through `SetBufferSize()`, because of `std::memory_order_relaxed`?
This is another reason, in addition to "making related operations atomic" for simplicity as mentioned in https://github.com/facebook/rocksdb/pull/11253/files#r1151045690.
This reminds me that we should sync up on whether we care about these two things, since the answers to these questions can impact this PR, e.g., removing some of the locks added.
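To make the question concrete, here is an illustrative sketch of the two access patterns being compared (not the WBM's actual code):

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>

struct BufferSizeSketch {
  std::atomic<size_t> buffer_size_{64 << 20};
  std::mutex wbm_mutex_;

  // Lock-free pattern: a relaxed load is atomic (no torn read) and will see
  // a stored value eventually, but it establishes no happens-before ordering
  // with writes guarded elsewhere, so surrounding reads may be stale.
  size_t buffer_size_relaxed() const {
    return buffer_size_.load(std::memory_order_relaxed);
  }

  // Locked pattern: the mutex orders this read after any SetBufferSize()
  // that completed under the same mutex, at the cost of lock traffic.
  size_t buffer_size_locked() {
    std::lock_guard<std::mutex> lock(wbm_mutex_);
    return buffer_size_.load(std::memory_order_relaxed);
  }
};
```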
Update: based on the discussion with @akankshamahajan15, most of the "Fix WBM concurrency control" work will probably add overhead to each write, as it requires many threads (write threads × number of DBs sharing the same WBM) to compete for a lock. Therefore, I am planning to only add the SetAllowStall() part to the PR. Since this PR is full of conversation focused on WBM concurrency control, I will close this one (for the historical record) and open another one.
Context:
Allowing `WBM::allow_stall_` to be changed at runtime gives flexibility to users. In order to do that, we need to close a few gaps in WBM's concurrency control so this parameter change can happen concurrently.
Summary:
- Fix WBM concurrency control, where e.g. reading `buffer_size_` from another thread was not protected by such synchronization. In particular, use one internal lock instead of a few to do so.
- Add `SetAllowStall()`
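A minimal sketch of what a runtime-changeable `SetAllowStall()` could look like under the single-lock model (an assumption for illustration, not the merged code):

```cpp
#include <mutex>

// Sketch only: flipping allow_stall_ to false must also release any writers
// already queued, mirroring how SetBufferSize(0) needs to unblock writers.
struct SetAllowStallSketch {
  std::mutex wbm_mutex_;
  bool allow_stall_ = false;

  void MaybeEndWriteStall() { /* requires wbm_mutex_ held; signals queue_ */ }

  void SetAllowStall(bool new_allow_stall) {
    std::unique_lock<std::mutex> lock(wbm_mutex_);
    allow_stall_ = new_allow_stall;
    if (!allow_stall_) {
      MaybeEndWriteStall();  // wake writers stalled under the old setting
    }
  }
};
```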
Test:
pre-change:
fillseq [AVG 38 runs] : 960 (± 276) ops/sec; 0.5 (± 0.1) MB/sec
post-change (no regression but a slight improvement):
fillseq [AVG 38 runs] : 997 (± 298) ops/sec; 0.5 (± 0.2) MB/sec