Skip to content

Move replica banning to its own task #181

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

drdrsh
Copy link
Collaborator

@drdrsh drdrsh commented Sep 27, 2022

This PR moves banning logic to be owned an async task.

The async task will receives events from various sources that want to request a replica be banned.

This design allows for agents other than clients to ban replicas. This should allows us to introduce manual instance banning or banning based on periodic health checks.

We also replace the RWLock around banlists and use ArcSwap instead. This should make Banlist reads more or less lock-free

src/pool.rs Outdated
Comment on lines 434 to 440
// We check if banning this address will result in all replica being banned
// If so, we unban all replicas instead
let pool_banned_addresses = self.ban_reporter.banlist(
self.settings.name.clone(),
self.settings.user.username.clone(),
);

Copy link
Collaborator Author

@drdrsh drdrsh Sep 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the check for "All replicas are banned" to be in ban to avoid making the super hot is_banned method slower. Under normal operations, we rarely ever call ban but we call is_banned all the time.

src/bans.rs Outdated

/// Send statistics to the task keeping track of stats.
fn send(&self, event: BanManagerEvent) {
let result = self.channel_to_worker.try_send(event.clone());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this send instead of try_send. We really don't want to ever lose this message.

src/bans.rs Outdated

pub fn report_failed_checkout(&self, pool_name: String, username: String, address: Address) {
let event = BanManagerEvent::Ban {
address: address,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to create some kind of structure uniquely identifying a "pool", if we don't have one already. Passing around strings is fine, but they are arbitrarily large and error-prone.

@drdrsh drdrsh changed the title Mostafa ban task Move replica banning to its own task Oct 9, 2022
src/bans.rs Outdated

impl Default for BanManager {
fn default() -> BanManager {
let (channel_to_worker, _rx) = channel(1000);
Copy link
Contributor

@levkk levkk Oct 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking out loud:

This channel can get saturated during incidents. Imagine we have 30k clients that together produce 50k QPS. If a replica goes down, all of a sudden we'll get 50k banning events which will quickly saturate this channel. The async worker task may not even get scheduled to ban the replica.

With an RW lock, one of the clients will ban, and all the others will read the banlist and see that it's banned immediately. RW locks are probably more expensive than ArcSwap (something we should actually validate), but they are more effective than channels I think (something we should investigate).

I think the middle ground can be a Mutex for banning, and an ArcSwap for reading the list, so only one task gets to ban at a time instead of a thundering herd (pseudo-code):

fn ban() {
  if (is_banned()) return;

  let guard = ban_mutex.lock();

  let new_ban_list = ...;

  arc_swap.set(new_ban_list);
}

fn is_banned() {
   let ban_list = (*arc_swap);
   ban_list.contains_key(address);
}

The thesis that this architecture allows for other agents than clients to ban seems shaky. Another agent, e.g. an admin command, can easily use the pool.ban and pool.unban methods, for example, and it would require very little changes to the code and no changes to the arch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants