
Fix WBM concurrency control, Add SetAllowStall(), Cleanup #11253

Closed · @hx235 wants to merge 3 commits from the runtime-changeable-allow-stall branch

Conversation

@hx235 (Contributor) commented Feb 25, 2023

Context:
Allowing WBM::allow_stall_ to be changed at runtime gives users flexibility. To do that, we need to close a few gaps in WBM's concurrency control so this parameter change can happen concurrently with other operations.

Summary:

  • Add synchronization to ALL of WBM's data members, since they all feed into the decisions below. Previously, the dynamically changeable buffer_size_ was not protected against concurrent changes from other threads. In particular, use one internal lock instead of several (see the sketch after this list):
Write => optionally charge memory to cache => check stall condition (allow_stall_, enabled(), buffer_size_ vs. memory usage) => if stalled, create stall and wait, and ShouldFlush() returns true => flush decreases memory usage (optionally releasing memory from cache) => stall condition no longer holds => notify waiting writes to stop stalling
  • Add a new function SetAllowStall()
  • Misc:
    • A few cleanups in the logic - see the PR conversation.
    • Clarify class/function comments about concurrency; made some functions that were public but should only be called internally by RocksDB private
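A minimal sketch of the single-lock approach described above (member layout and the helper MaybeEndWriteStallLocked() are assumptions for illustration, not the PR's exact code):

#include <cassert>
#include <cstddef>
#include <mutex>

class WriteBufferManager {
 public:
  // Runtime toggle added by this PR; serialized by the single wbm_mutex_.
  void SetAllowStall(bool new_allow_stall) {
    std::unique_lock<std::mutex> lock(wbm_mutex_);
    allow_stall_ = new_allow_stall;
    if (!allow_stall_) {
      // Stalling was just disallowed: wake up any writers currently
      // stalled by this WBM.
      MaybeEndWriteStallLocked();
    }
  }

  void SetBufferSize(size_t new_size) {
    if (new_size == 0) {  // setting the buffer size to 0 at runtime is rejected
      assert(false);
      return;
    }
    std::unique_lock<std::mutex> lock(wbm_mutex_);
    buffer_size_ = new_size;
    MaybeEndWriteStallLocked();  // a size change can change the stall decision
  }

 private:
  // Hypothetical helper: re-evaluates the stall condition and signals any
  // stalled writers; requires wbm_mutex_ to be held.
  void MaybeEndWriteStallLocked() { /* signal stalled writers; see PR */ }

  mutable std::mutex wbm_mutex_;  // single lock guarding all members below
  bool allow_stall_ = false;
  size_t buffer_size_ = 0;
};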

Test:

  • New UT
  • Performance test
./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3

pre-change: fillseq [AVG 38 runs] : 960 (± 276) ops/sec; 0.5 (± 0.1) MB/sec
post-change (no regression but a slight improvement): fillseq [AVG 38 runs] : 997 (± 298) ops/sec; 0.5 (± 0.2) MB/sec

@hx235 changed the title from "Draft" to "[ForDiscussion] Improve WBM concurrency control + add SetAllowStall + cleanup/refactory" on Feb 25, 2023
@hx235 force-pushed the runtime-changeable-allow-stall branch from cae3e96 to 8bbb04a on February 27, 2023 19:57
@hx235 force-pushed the runtime-changeable-allow-stall branch from 8bbb04a to ed61838 on March 16, 2023 19:41
@hx235 changed the title from "[ForDiscussion] Improve WBM concurrency control + add SetAllowStall + cleanup/refactory" to "Improve WBM concurrency control + add SetAllowStall + cleanup/refactory" on Mar 16, 2023
@hx235 force-pushed the runtime-changeable-allow-stall branch from ed61838 to 9eeeb70 on March 16, 2023 23:12
@hx235 changed the title from "Improve WBM concurrency control + add SetAllowStall + cleanup/refactory" to "Improve WBM concurrency control, add SetAllowStall(), cleanup/refractory" on Mar 16, 2023
@hx235 changed the title from "Improve WBM concurrency control, add SetAllowStall(), cleanup/refractory" to "Improve WBM concurrency control and add SetAllowStall()" on Mar 16, 2023
@hx235 changed the title from "Improve WBM concurrency control and add SetAllowStall()" to "Fix WBM concurrency control, Add SetAllowStall()" on Mar 16, 2023
@hx235 force-pushed the runtime-changeable-allow-stall branch from 9eeeb70 to 5ee9591 on March 17, 2023 00:08

private:
@hx235 (Contributor, Author) commented Mar 17, 2023

Cleanup 1: make the functions that are documented as "should be called within RocksDB internal" private.

@hx235 changed the title from "Fix WBM concurrency control, Add SetAllowStall()" to "Fix WBM concurrency control, Add SetAllowStall(), cleanup" on Mar 17, 2023
Comment on lines -169 to -171
// Value should only be changed by BeginWriteStall() and MaybeEndWriteStall()
// while holding mu_, but it can be read without a lock.
std::atomic<bool> stall_active_;
@hx235 (Contributor, Author) commented Mar 17, 2023

Cleanup 2: I don't see why stall_active_ and its related functions are needed. It seems that as long as our stall condition check (ShouldStall()) is implemented correctly, we will always have stall_active_ == ShouldStall(). But let me know if I overlooked anything :)

@hx235 (Contributor, Author) commented:

@akankshamahajan15 shared a perspective that stall_active_ might exist as a perf optimization to reduce lock contention in the case of multiple DBs sharing the same WBM. She will cite more of the previous discussion on this soon.

I am not entirely sure about keeping it yet, mainly because such a perf optimization makes the concurrency model of WBM harder to understand: some members can be accessed without a lock while others can't.

[TODO for me] Understand the previous conversation on having stall_active_ and the need for it; reconsider the perf cost vs. model simplicity.
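For context, a minimal sketch of the trade-off under discussion (the struct shapes and member names are illustrative assumptions, not the actual WBM code): an atomic flag lets writers of every DB sharing the WBM peek at the stall state without touching the mutex, while dropping it funnels every check through the lock.

#include <atomic>
#include <cstddef>
#include <mutex>

struct WithAtomicFlag {  // keep stall_active_: lock-free fast path
  std::atomic<bool> stall_active_{false};
  bool IsStallActive() const {
    return stall_active_.load(std::memory_order_relaxed);  // no mutex taken
  }
};

struct WithoutAtomicFlag {  // drop stall_active_: every check contends on the lock
  bool ShouldStall() const {
    std::unique_lock<std::mutex> lock(wbm_mutex_);  // all writers of all DBs queue here
    return allow_stall_ && memory_used_ >= buffer_size_;
  }
  mutable std::mutex wbm_mutex_;
  bool allow_stall_ = false;
  size_t memory_used_ = 0;
  size_t buffer_size_ = 0;
};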

Comment on lines +57 to +60
if (new_size == 0) {
  assert(false);
  return;
}
@hx235 (Contributor, Author) commented Mar 17, 2023

Cleanup 3: based on a comment in the code, "Cannot early-exit on !enabled() because SetBufferSize(0) needs to unblock...", I believe we don't want the buffer size to be set to 0 at runtime. So I added an explicit check here for both prod and debug builds.

Comment on lines -134 to -137
// If the node was not consumed, the stall has ended already and we can signal
// the caller.
if (!new_node.empty()) {
  new_node.front()->Signal();
@hx235 (Contributor, Author) commented Mar 17, 2023

Cleanup 4: I don't see why we need to signal wbm_stall here. If wbm_stall has been blocked because another writer thread of the same DB set it to blocked, that writer thread is responsible for unblocking/signaling this wbm_stall, not the current one. But again, let me know if I missed anything :)

Comment on lines -143 to -144
// Cannot early-exit on !enabled() because SetBufferSize(0) needs to unblock
// the writers.
@hx235 (Contributor, Author) commented:

Question 2: I am not quite clear about this comment.

Does it mean we don't need to early-exit on !enabled() because that won't happen, as "SetBufferSize(0) needs to unblock the writers"?

Or does it mean we are not able to early-exit because we don't meet the requirement that "SetBufferSize(0) needs to unblock the writers"?

// Cannot early-exit on !enabled() because SetBufferSize(0) needs to unblock
// the writers.
if (!allow_stall_) {
if (ShouldStall()) {
@hx235 (Contributor, Author) commented Mar 17, 2023

Cleanup 5: I decided to encapsulate the stall condition check inside ShouldStall() and use this function as much as possible whenever we need to check whether the stall condition has been met. This makes it easier to change the stall condition check in the future, e.g., making some of it runtime-changeable or adding a new stall condition (cc @ajkr: you know what I'm talking about :p :p :p - the stall on global memory full).
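A minimal sketch of the encapsulation described here (the exact condition is an assumption pieced together from the factors listed in the Summary; the real check may differ):

// Caller is assumed to hold wbm_mutex_ under this PR's single-lock model.
bool WriteBufferManager::ShouldStall() const {
  return allow_stall_ && enabled() &&
         memory_used_ >= buffer_size_;  // future stall conditions slot in here
}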

return; // Stall conditions have not resolved.
}

// Perform all deallocations outside of the lock.
@hx235 (Contributor, Author) commented:

Note: we can't do this now, since we need to hold the lock while calling MaybeEndWriteStall(). See the perf test results for any perf regression concern.

@@ -173,9 +179,8 @@ void WriteBufferManager::RemoveDBFromQueue(StallInterface* wbm_stall) {

// Deallocate the removed nodes outside of the lock.
std::list<StallInterface*> cleanup;

if (enabled() && allow_stall_) {
@hx235 (Contributor, Author) commented Mar 17, 2023

Optimization 2: it appears to me that we shouldn't check enabled() && allow_stall_ again with this PR, and didn't need to check it before this PR either. Reasons:

  • allow_stall_ can be changed to false at runtime, and we don't want that to interfere with RemoveDBFromQueue() as a DB cleanup.
  • Even when allow_stall_ was not runtime-changeable, queue_ wouldn't contain anything if enabled() && allow_stall_ was false, because no stall would have been active then.

assert(wbm_stall != nullptr);
assert(allow_stall_);

// Allocate outside of the lock.
@hx235 (Contributor, Author) commented Mar 17, 2023

Question 3: just to clarify, this "allocate" means allocating the list of pointers new_node, not the actual StallInterface object wbm_stall, right? The same goes for all the deallocation done outside the lock below, like // Deallocate the removed nodes outside of the lock.

@hx235 force-pushed the runtime-changeable-allow-stall branch 2 times, most recently from 342301b to 4cc5b83 on March 17, 2023 02:45
@hx235 changed the title from "Fix WBM concurrency control, Add SetAllowStall(), cleanup" to "Fix WBM concurrency control, Add SetAllowStall(), Cleanup" on Mar 17, 2023
@hx235 requested a review from akankshamahajan15 on March 17, 2023 04:50
@hx235 force-pushed the runtime-changeable-allow-stall branch from 4cc5b83 to dfedf99 on March 17, 2023 06:15
@facebook-github-bot (Contributor) commented:

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 requested a review from ajkr on March 17, 2023 06:27
void FreeMemWithCache(size_t mem);

// Mutex used to protect WriteBufferManager's data variables.
mutable std::mutex wbm_mutex_;
@hx235 (Contributor, Author) commented:

Cleanup 6: use one mutex instead of multiple mutexes guarding separate groups of WBM data, for simplicity.
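Roughly, the consolidation looks like this (a sketch only; the member list and the pre-PR lock split are illustrative assumptions, not the full class):

#include <cstddef>
#include <list>
#include <mutex>

class StallInterface;

// Before (illustrative): separate locks for separate member groups, e.g.
//   std::mutex buffer_mutex_;  // guarded buffer_size_
//   std::mutex queue_mutex_;   // guarded queue_
// After: one mutex for all of WBM's mutable state.
class WriteBufferManager {
 private:
  mutable std::mutex wbm_mutex_;      // guards every member below
  size_t buffer_size_ = 0;
  size_t memory_used_ = 0;
  size_t memory_active_ = 0;
  bool allow_stall_ = false;
  std::list<StallInterface*> queue_;  // DBs currently stalled on this WBM
};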

@facebook-github-bot (Contributor) commented:

@hx235 has updated the pull request. You must reimport the pull request before landing.

void WriteBufferManagerStallWrites();
// If stall conditions are met, begin stalling of writes with help of
// `WriteBufferManager`
void MaybeWriteBufferManagerStallWrites();
@hx235 (Contributor, Author) commented:

Cleanup 7: rename this function to reflect that we still perform the stall condition check again within it.

Comment on lines -1908 to -1912
static_cast<WBMStallInterface*>(wbm_stall_.get())
    ->SetState(WBMStallInterface::State::BLOCKED);
@hx235 (Contributor, Author) commented:

Cleanup 8: make this part of the responsibility of MaybeBeginWriteStall(), as the static_cast<WBMStallInterface*> cast indicates we are dealing with something too low-level here.

@hx235 force-pushed the runtime-changeable-allow-stall branch from ad47215 to fbfd947 on March 22, 2023 00:13

@facebook-github-bot (Contributor) commented:

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot (Contributor) commented:

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

  if (enabled()) {
    memory_active_.fetch_sub(mem, std::memory_order_relaxed);
  }
}

void WriteBufferManager::FreeMem(size_t mem) {
  std::unique_lock<std::mutex> lock(wbm_mutex_);
A reviewer (Contributor) commented:

Can we acquire the lock inside these functions (FreeMemWithCache, MaybeEndWriteStall, etc.) to reduce the lock duration?

@hx235 (Contributor, Author) replied:

Is there a concrete concern about where and how such lock duration can be problematic? As far as I know, the longest duration comes from cache_res_mgr_->UpdateCacheReservation() evicting dummy entries, linear in the amount of mem to be freed from the cache. Would that be long enough to be a problem?

My intention is to make FreeMem an atomic function, since reads/writes to memory_used_ and MaybeEndWriteStall() are closely related through ShouldStall(). Making these two into one atomic operation relieves us from thinking about several data races.

We also have a plan to factor the returned status s of cache_res_mgr_->UpdateCacheReservation() in WriteBufferManager::FreeMemWithCache() (or ReserveMemWithCache()) into the decision of whether we should stall; right now we only do s.PermitUncheckedError(). So from this perspective too, I'd like to make FreeMemWithCache atomic with the reads/writes to memory_used_ and MaybeEndWriteStall().
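To make the atomicity argument concrete, a sketch of the intended shape (an assumption based on this discussion; FreeMemWithCache and MaybeEndWriteStall come from the PR, but their bodies and locking contracts here are illustrative):

// One critical section covers both the accounting update and the stall
// re-check, so no writer can observe memory_used_ and the stall state
// out of sync with each other.
void WriteBufferManager::FreeMem(size_t mem) {
  std::unique_lock<std::mutex> lock(wbm_mutex_);
  if (cache_res_mgr_ != nullptr) {
    FreeMemWithCache(mem);  // updates memory_used_ and the cache reservation
  } else {
    memory_used_ -= mem;
  }
  MaybeEndWriteStall();  // assumed to require wbm_mutex_ held in this design
}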

@@ -84,26 +117,25 @@ void WriteBufferManager::ReserveMemWithCache(size_t mem) {
}

void WriteBufferManager::ScheduleFreeMem(size_t mem) {
  std::unique_lock<std::mutex> lock(wbm_mutex_);
A reviewer (Contributor) commented:

IIRC, we don't use a lock for memory_active_ and other similar variables, to avoid locking/releasing the mutex, which can be expensive; these variables are already atomic, so they don't really need the mutex.

@hx235 (Contributor, Author) replied:

Since we read buffer_size_'s value with std::memory_order_relaxed, without locking, should we be concerned about not seeing the latest value written to buffer_size_ by another thread through SetBufferSize(), because of std::memory_order_relaxed?

This is another reason, in addition to "making related operations atomic" for simplicity, as mentioned in https://github.com/facebook/rocksdb/pull/11253/files#r1151045690.

This reminds me that we should sync up on whether we care about these two things, as the answers to these questions can impact this PR, e.g., removing some of the locks added.
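A tiny sketch of the question being raised (the accessor shapes are assumptions): relaxed loads are atomic and an updated value becomes visible eventually, but relaxed operations establish no ordering with reads of other members, whereas the mutex's acquire/release pairing makes everything read under the lock mutually consistent.

#include <atomic>
#include <cstddef>
#include <mutex>

struct LockFreeRead {
  std::atomic<size_t> buffer_size_{0};
  size_t buffer_size() const {
    // Atomic and eventually visible, but unordered relative to other
    // members: a reader may pair a fresh buffer_size_ with a stale
    // memory_used_, or vice versa.
    return buffer_size_.load(std::memory_order_relaxed);
  }
};

struct LockedRead {
  mutable std::mutex wbm_mutex_;
  size_t buffer_size_ = 0;
  size_t buffer_size() const {
    // Acquire/release on the mutex orders this read after the latest
    // SetBufferSize() that released the same lock.
    std::unique_lock<std::mutex> lock(wbm_mutex_);
    return buffer_size_;
  }
};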

@hx235 (Contributor, Author) commented Mar 28, 2023

Update: based on the discussion with @akankshamahajan15, most of the "Fix WBM concurrency control" part will probably add overhead to each write, as it requires many threads (write threads × number of DBs sharing the same WBM) to compete for one lock. Therefore, I am planning to keep only the SetAllowStall() part. Since this PR is full of conversation focused on WBM concurrency control, I will close this one (for the historical record) and open another one.

@hx235 closed this on Mar 28, 2023