
refactor concurrency to not use qsize #323

Closed
StevenSong opened this issue Jun 15, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@StevenSong
Collaborator

StevenSong commented Jun 15, 2020

EDIT: split off the deadlock issue that spawned this issue into a standalone bug report at #326

What
This segment of code tries to synchronize the workers before proceeding: https://github.com/broadinstitute/ml/blob/e3540e1eff2fc45301255c1e89b87c8bb5d18405/ml4cvd/tensor_generators.py#L429-L430

This is a task traditionally accomplished with barriers. Additionally, the Python multiprocessing documentation for the qsize method used here states that the returned value is unreliable, and notes that the method may raise NotImplementedError on platforms such as macOS: https://docs.python.org/3.6/library/multiprocessing.html#multiprocessing.Queue.qsize

Why
reusing familiar coding patterns is good for readability
additionally, code portability is important

How
implement barriers in tensor generators
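
A minimal sketch of the idea (illustrative names and worker count, not the ml4cvd code): give a multiprocessing.Barrier one party per worker plus one for the main process, so the main process can wait for all workers without ever polling qsize.

```python
# Minimal sketch, not the ml4cvd implementation: a multiprocessing.Barrier
# with one party per worker plus one for the main process replaces qsize polling.
import multiprocessing as mp

NUM_WORKERS = 4  # illustrative; the real worker count comes from configuration

def worker(worker_id, stats_q, barrier):
    stats_q.put({"worker": worker_id})  # placeholder per-worker stats
    barrier.wait()  # block until all workers and the main process arrive

if __name__ == "__main__":
    stats_q = mp.Queue(maxsize=NUM_WORKERS)  # maxsize=0 would mean unbounded
    barrier = mp.Barrier(NUM_WORKERS + 1)  # +1 party for the main process
    workers = [
        mp.Process(target=worker, args=(i, stats_q, barrier))
        for i in range(NUM_WORKERS)
    ]
    for p in workers:
        p.start()
    barrier.wait()  # returns only after every worker has called put()
    stats = [stats_q.get() for _ in range(NUM_WORKERS)]  # exactly one get per worker
    for p in workers:
        p.join()
    print(stats)
```

Barrier.wait() only returns once all parties have arrived, so by the time the main process resumes, every worker is guaranteed to have called put().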

Acceptance Criteria
barriers in tensor generators

@StevenSong StevenSong self-assigned this Jun 15, 2020
@StevenSong StevenSong removed their assignment Jun 15, 2020
@erikr erikr added the enhancement New feature or request label Jun 16, 2020

@StevenSong
Collaborator Author

StevenSong commented Jun 25, 2020

taking some notes here:

  1. the point of this line
    https://github.com/broadinstitute/ml/blob/dd13b4518b3547ebdcad13698f0c71a4abaaafb8/ml4cvd/tensor_generators.py#L164-L165
    is to check whether all the workers have put something in the queue. this can be accomplished by setting the queue's max size to num_workers and checking q.full()
    EDIT: it seems stats_q's max size is indeed set to the number of workers, so this is even easier to replace with q.full()
    EDIT 2: actually, it seems the queue is initialized with max_size = 0, since len(worker_instances) is 0 before the workers are subsequently initialized
    https://github.com/broadinstitute/ml/blob/dd13b4518b3547ebdcad13698f0c71a4abaaafb8/ml4cvd/tensor_generators.py#L128
    a max size of 0 results in an unbounded queue
    https://docs.python.org/3/library/queue.html#queue.Queue

  2. this line is easy to replace with q.empty()
    https://github.com/broadinstitute/ml/blob/dd13b4518b3547ebdcad13698f0c71a4abaaafb8/ml4cvd/tensor_generators.py#L175

  3. looking at the consumer function that dequeues
    https://github.com/broadinstitute/ml/blob/dd13b4518b3547ebdcad13698f0c71a4abaaafb8/ml4cvd/tensor_generators.py#L171-L186
    and the producer function that enqueues
    https://github.com/broadinstitute/ml/blob/dd13b4518b3547ebdcad13698f0c71a4abaaafb8/ml4cvd/tensor_generators.py#L426-L437
    there is no protection against the following order of operations (sketched in code after this list):

let's say workers = 4
1. the 4 workers each put an item into the queue and block while the queue is full
2. the consumer begins to dequeue and pops 1 item off the queue
3. the workers begin putting more items into the queue because it is no longer "full"
4. the consumer pops from the queue until it is empty
at this point, the consumer has consumed more than the 4 items it originally wanted.
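
Here's a small thread-based sketch of that ordering problem (counts and names are made up; the real code uses multiprocessing, but the put/get race is the same):

```python
# A thread-based sketch of the race above (illustrative counts; the real code
# uses multiprocessing, but the put/get ordering problem is identical).
import queue
import threading
import time

NUM_WORKERS = 4
ITEMS_PER_WORKER = 3
q = queue.Queue(maxsize=NUM_WORKERS)

def worker(i):
    for _ in range(ITEMS_PER_WORKER):
        q.put(i)  # steps 1 and 3: blocks whenever the queue is full

for i in range(NUM_WORKERS):
    threading.Thread(target=worker, args=(i,), daemon=True).start()

time.sleep(0.1)  # step 1: all four workers fill the queue, then block on put()

consumed = 0
while not q.empty():  # steps 2 and 4: pop until the queue looks empty
    q.get()           # step 3: each get() frees a slot that a blocked worker refills
    consumed += 1
    time.sleep(0.01)  # give the workers a chance to refill

# Typically prints consumed > 4: draining until q.empty() does not mean
# "exactly one item per worker".
print(f"wanted {NUM_WORKERS} items, consumed {consumed}")
```

The converse can also happen: the queue can look momentarily empty while a worker is between puts, so draining until q.empty() can under-consume as well.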

@paolodi
Contributor

paolodi commented Jun 25, 2020

WOW that's great investigative work, @StevenSong !!! Let me digest it a bit, and let's get some more eyes on it.

FYI @lucidtronix @ndiamant @mklarqvist: see above for some possible side effects of relying on qsize

@StevenSong
Collaborator Author

StevenSong commented Jun 25, 2020

to be honest @paolodi, I think the reliance on Queue.qsize() is the culprit for logs like this:

!!!!>~~~~~~~~~~~~ validation_worker completed true epoch 28 ~~~~~~~~~~~~<!!!!
Aggregated information string:
    Generator looped & shuffled over 200 paths. Epoch: 28
    229 tensors were presented.
    0 paths were skipped because they previously failed.
    No errors raised. 

dug some more; this line was at least partially the culprit for this log. It's an easy fix as (i + 1) % ...:
https://github.com/broadinstitute/ml/blob/dd13b4518b3547ebdcad13698f0c71a4abaaafb8/ml4cvd/tensor_generators.py#L450-L451
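
For what it's worth, the off-by-one looks like this in isolation (steps_per_epoch is a made-up number, not the actual training config):

```python
# Illustrative only: with a zero-based step counter, i % steps_per_epoch == 0
# fires on the *first* step of each epoch (including step 0), while
# (i + 1) % steps_per_epoch == 0 fires on the last step, as intended.
steps_per_epoch = 200
for i in range(2 * steps_per_epoch):
    if (i + 1) % steps_per_epoch == 0:
        print(f"completed epoch {(i + 1) // steps_per_epoch}")  # epochs 1 and 2
```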

@StevenSong StevenSong changed the title refactor concurrency to use proper barriers refactor concurrency to not use qsize Jun 25, 2020
@StevenSong StevenSong removed their assignment Feb 24, 2021