
refactor concurrency to not use qsize #323

Closed
StevenSong opened this issue Jun 15, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@StevenSong
Collaborator

StevenSong commented Jun 15, 2020

EDIT: split off the deadlock issue that spawned this issue into a standalone bug report at #326

What
This segment of code tries to synchronize the workers before proceeding: https://github.com/broadinstitute/ml/blob/e3540e1eff2fc45301255c1e89b87c8bb5d18405/ml4cvd/tensor_generators.py#L429-L430

This is a task traditionally accomplished with barriers. Additionally, the Python multiprocessing documentation for the qsize method used here states that the returned value is unreliable, and notes that the method may raise NotImplementedError on platforms such as macOS: https://docs.python.org/3.6/library/multiprocessing.html#multiprocessing.Queue.qsize

Why
reusing familiar coding patterns is good for readability
additionally, code portability is important

How
implement barriers in tensor generators
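
A minimal sketch of the idea (illustrative names and worker count, not the ml4cvd code): give a multiprocessing.Barrier one party per worker plus one for the main process, so the main process can wait for all workers without ever polling qsize.

```python
# Minimal sketch, not the ml4cvd implementation: a multiprocessing.Barrier
# with one party per worker plus one for the main process replaces qsize polling.
import multiprocessing as mp

NUM_WORKERS = 4  # illustrative; the real worker count comes from configuration

def worker(worker_id, stats_q, barrier):
    stats_q.put({"worker": worker_id})  # placeholder per-worker stats
    barrier.wait()  # block until all workers and the main process arrive

if __name__ == "__main__":
    stats_q = mp.Queue(maxsize=NUM_WORKERS)  # maxsize=0 would mean unbounded
    barrier = mp.Barrier(NUM_WORKERS + 1)  # +1 party for the main process
    workers = [
        mp.Process(target=worker, args=(i, stats_q, barrier))
        for i in range(NUM_WORKERS)
    ]
    for p in workers:
        p.start()
    barrier.wait()  # returns only after every worker has called put()
    stats = [stats_q.get() for _ in range(NUM_WORKERS)]  # exactly one get per worker
    for p in workers:
        p.join()
    print(stats)
```

Barrier.wait() only returns once all parties have arrived, so by the time the main process resumes, every worker is guaranteed to have called put().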

Acceptance Criteria
barriers in tensor generators

@StevenSong StevenSong self-assigned this Jun 15, 2020
@StevenSong StevenSong removed their assignment Jun 15, 2020
@erikr erikr added the enhancement New feature or request label Jun 16, 2020

@StevenSong
Collaborator Author

StevenSong commented Jun 25, 2020

taking some notes here:

  1. the point of this line
    https://github.com/broadinstitute/ml/blob/dd13b4518b3547ebdcad13698f0c71a4abaaafb8/ml4cvd/tensor_generators.py#L164-L165
    is to check whether all the workers have put something in the queue. this can be accomplished by setting the queue's max size to num_workers and checking q.full()
    EDIT: it seems stats_q's max size is indeed set to the number of workers, so this is even easier to replace with q.full()
    EDIT 2: actually, it seems the queue is initialized with max_size = 0, since len(worker_instances) is 0 before the workers are subsequently initialized
    https://github.com/broadinstitute/ml/blob/dd13b4518b3547ebdcad13698f0c71a4abaaafb8/ml4cvd/tensor_generators.py#L128
    a max size of 0 results in an unbounded queue
    https://docs.python.org/3/library/queue.html#queue.Queue

  2. this line is easy to replace with q.empty()
    https://github.com/broadinstitute/ml/blob/dd13b4518b3547ebdcad13698f0c71a4abaaafb8/ml4cvd/tensor_generators.py#L175

  3. looking at the consumer function that dequeues
    https://github.com/broadinstitute/ml/blob/dd13b4518b3547ebdcad13698f0c71a4abaaafb8/ml4cvd/tensor_generators.py#L171-L186
    and the producer function that enqueues
    https://github.com/broadinstitute/ml/blob/dd13b4518b3547ebdcad13698f0c71a4abaaafb8/ml4cvd/tensor_generators.py#L426-L437
    there is no protection against the following order of operations (sketched in code after this list):

let's say workers = 4
1. the 4 workers each put an item into the queue and block while the queue is full
2. the consumer begins to dequeue and pops 1 item off the queue
3. the workers begin putting more items into the queue because it is no longer "full"
4. the consumer pops from the queue until it is empty
at this point, the consumer has consumed more than the 4 items it originally wanted.
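
Here's a small thread-based sketch of that ordering problem (counts and names are made up; the real code uses multiprocessing, but the put/get race is the same):

```python
# A thread-based sketch of the race above (illustrative counts; the real code
# uses multiprocessing, but the put/get ordering problem is identical).
import queue
import threading
import time

NUM_WORKERS = 4
ITEMS_PER_WORKER = 3
q = queue.Queue(maxsize=NUM_WORKERS)

def worker(i):
    for _ in range(ITEMS_PER_WORKER):
        q.put(i)  # steps 1 and 3: blocks whenever the queue is full

for i in range(NUM_WORKERS):
    threading.Thread(target=worker, args=(i,), daemon=True).start()

time.sleep(0.1)  # step 1: all four workers fill the queue, then block on put()

consumed = 0
while not q.empty():  # steps 2 and 4: pop until the queue looks empty
    q.get()           # step 3: each get() frees a slot that a blocked worker refills
    consumed += 1
    time.sleep(0.01)  # give the workers a chance to refill

# Typically prints consumed > 4: draining until q.empty() does not mean
# "exactly one item per worker".
print(f"wanted {NUM_WORKERS} items, consumed {consumed}")
```

The converse can also happen: the queue can look momentarily empty while a worker is between puts, so draining until q.empty() can under-consume as well.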

@paolodi
Contributor

paolodi commented Jun 25, 2020

WOW that's great investigative work, @StevenSong !!! Let me digest it a bit, and let's get some more eyes on it.

FYI @lucidtronix @ndiamant @mklarqvist: see above for some possible side effects of relying on qsize

@StevenSong
Collaborator Author

StevenSong commented Jun 25, 2020

to be honest @paolodi, I think the reliance on Queue.qsize() is the culprit for logs like this:

!!!!>~~~~~~~~~~~~ validation_worker completed true epoch 28 ~~~~~~~~~~~~<!!!!
Aggregated information string:
    Generator looped & shuffled over 200 paths. Epoch: 28
    229 tensors were presented.
    0 paths were skipped because they previously failed.
    No errors raised. 

dug some more; this line was at least partially the culprit for this log. It's an easy fix as (i + 1) % ...:
https://github.com/broadinstitute/ml/blob/dd13b4518b3547ebdcad13698f0c71a4abaaafb8/ml4cvd/tensor_generators.py#L450-L451
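
For what it's worth, the off-by-one looks like this in isolation (steps_per_epoch is a made-up number, not the actual training config):

```python
# Illustrative only: with a zero-based step counter, i % steps_per_epoch == 0
# fires on the *first* step of each epoch (including step 0), while
# (i + 1) % steps_per_epoch == 0 fires on the last step, as intended.
steps_per_epoch = 200
for i in range(2 * steps_per_epoch):
    if (i + 1) % steps_per_epoch == 0:
        print(f"completed epoch {(i + 1) // steps_per_epoch}")  # epochs 1 and 2
```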

@StevenSong StevenSong changed the title refactor concurrency to use proper barriers refactor concurrency to not use qsize Jun 25, 2020
@StevenSong StevenSong removed their assignment Feb 24, 2021