
synchronize with signals and fix tensors presented #345

Closed · wants to merge 5 commits

Conversation

StevenSong (Collaborator) commented Jun 25, 2020

resolves #323
resolves #346

this PR aims to use more reliable and portable python functions (no longer using qsize), to fix the number of tensors presented (`(i + 1) % ...`), and to fix 2 possible concurrency edge cases:

  1. the consumer dequeues more items than were produced at the time it decided to consume. the current code only blocks producers when qsize == num_workers, and the consumer consumes until the queue is empty (see appendix for the current code). this opens the door to the following scenario:
let's say workers = 4
1. 4 workers put items into the queue and block while the queue is full
2. the consumer begins to dequeue and pops 1 item off the queue
3. workers begin putting more items into the queue because the queue is no longer "full"
4. the consumer pops from the queue until the queue is empty
at this point, the consumer has consumed more than the 4 items it originally wanted.

the fix is to have producers wait until the consumer gives the "all clear" to begin producing again. the consumer waits for all the producers to have enqueued, processes all the items, and then gives the all clear. in the meantime, the producers wait before producing their next item.

  2. the other edge case is a single producer enqueuing more than once before the consumer dequeues. with the current code this can happen:
let's say workers = 2
1. worker 0 enqueues an item
2. worker 1 is lazy
3. worker 0 enqueues another item
4. the consumer detects 2 items in the queue and starts to dequeue
since each worker gets a distinct set of paths, worker 0 has just reported stats for the same set of paths twice

the fix is again to have the producers wait until the "all clear" is given before enqueuing the next set of items.
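
a minimal sketch of this hand-off for one produce/consume round, assuming multiprocessing Events; the names here (work_q, worker_done, all_clear, NUM_WORKERS) are illustrative and not the identifiers used in this PR:

```python
import multiprocessing as mp

NUM_WORKERS = 4

def producer(idx, work_q, worker_done, all_clear):
    work_q.put(f"stats from worker {idx}")  # enqueue exactly one item this round
    worker_done[idx].set()                  # signal "I have enqueued"
    all_clear.wait()                        # do not enqueue again until the consumer says so

def consume_round(work_q, worker_done, all_clear):
    for done in worker_done:                # wait for every producer, not for a qsize() guess
        done.wait()
    items = [work_q.get() for _ in range(NUM_WORKERS)]  # take exactly NUM_WORKERS items
    for done in worker_done:
        done.clear()
    all_clear.set()                         # the "all clear": producers may start the next round
    return items

if __name__ == "__main__":
    work_q = mp.Queue()
    worker_done = [mp.Event() for _ in range(NUM_WORKERS)]
    all_clear = mp.Event()
    procs = [mp.Process(target=producer, args=(i, work_q, worker_done, all_clear))
             for i in range(NUM_WORKERS)]
    for p in procs:
        p.start()
    print(consume_round(work_q, worker_done, all_clear))
    for p in procs:
        p.join()
```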

Appendix:

current consumer code:
https://github.com/broadinstitute/ml/blob/dd13b4518b3547ebdcad13698f0c71a4abaaafb8/ml4cvd/tensor_generators.py#L171-L186

current producer code:
https://github.com/broadinstitute/ml/blob/dd13b4518b3547ebdcad13698f0c71a4abaaafb8/ml4cvd/tensor_generators.py#L426-L437

StevenSong (Collaborator, Author) commented Jun 26, 2020

this change might currently be causing deadlock, investigating

StevenSong (Collaborator, Author) commented Jun 26, 2020

found the deadlock...

let's say num workers = 4

  • 4 producers all put items in the queue, then signal that they have all enqueued
  • the consumer begins consuming until the queue is empty - the consumer reports only 3 items dequeued, and the counts of tensors presented are off because the last item in the queue was never reported
  • since the queue was initialized with max size num_workers, the next time the producers try to enqueue, the last producer finds a queue that already has 4 items in it (1 left over from the last round). it gets stuck because the queue is full, so it can't signal the consumer, the consumer never sees the signal, and never dequeues - deadlock

the implication here is that Queue.put is non-atomic and that there is a lag between an item being put into the queue in a producer process and the item being visible in the queue to the consumer process

there's some documentation of non-atomicity here:
https://bugs.python.org/issue14976
and here:
https://codewithoutrules.com/2017/08/16/concurrency-python/ (not exactly the same bug, but it implies put is non-atomic)
and here:
https://stackoverflow.com/questions/34641807/why-does-pythons-multiprocessing-queue-have-a-buffer-and-a-pipe

the worst part is that apparently this issue was fixed in python 3.7 (see the python bug report link). we're on 3.6 (let's upgrade eventually! @lucidtronix)

the fix is hopefully to enforce that the consumer consumes exactly num_workers items - currently testing (fixed)
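
as a small aside, the lag described above can be seen with a toy script like this (behavior is timing dependent; multiprocessing.Queue.put hands the item to a background feeder thread, so a non-blocking read right after put may or may not see it):

```python
import multiprocessing as mp
import queue

q = mp.Queue()
q.put("item")  # returns before the feeder thread has necessarily flushed the item to the pipe
try:
    print("visible immediately:", q.get_nowait())
except queue.Empty:
    # empty()/qsize()/get_nowait() can lag behind put(), even within one process
    print("not visible yet; a blocking get() is the reliable way to read")
    print("visible after waiting:", q.get(timeout=1))
```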

```diff
@@ -161,7 +161,7 @@ def set_worker_paths(self, paths: List[Path]):
     def __next__(self) -> Tuple[Dict[str, np.ndarray], Dict[str, np.ndarray], Optional[List[str]]]:
         if not self._started:
             self._init_workers()
-        if self.stats_q.qsize() == self.num_workers:
+        if all(worker.signal.is_set() for worker in self.worker_instances):
```
Contributor:

Can we abstract this to a function, e.g. def epoch_just_finished() or something? I want to use it to know when to stop inference with multiprocessing

Collaborator (Author):

sure, though aggregate_and_print_stats will reset this condition - just something to be aware of
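
a possible shape for that helper, sketched under the assumption that the generator keeps its workers on self.worker_instances as in the diff above (not code from this PR):

```python
def epoch_just_finished(self) -> bool:
    """True once every worker has signaled the end of its current cycle.

    Note: aggregate_and_print_stats clears the worker signals, so check this
    before stats are aggregated.
    """
    return all(worker.signal.is_set() for worker in self.worker_instances)
```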

```diff
@@ -161,7 +161,7 @@ def set_worker_paths(self, paths: List[Path]):
     def __next__(self) -> Tuple[Dict[str, np.ndarray], Dict[str, np.ndarray], Optional[List[str]]]:
```
Contributor:

Can you write a test that runs one epoch worth of batches (with multiprocessing) and checks that every path is visited exactly once?
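
a sketch of what such a test might look like; TensorGenerator and make_test_paths are placeholder names for illustration, and __next__ is assumed to return paths as its third element per the signature above:

```python
from collections import Counter

def test_one_epoch_visits_every_path_exactly_once():
    paths = make_test_paths(200)                                 # placeholder fixture
    gen = TensorGenerator(paths, batch_size=10, num_workers=6)   # placeholder constructor
    seen = Counter()
    for _ in range(len(paths) // 10):                            # one epoch worth of batches
        _, _, batch_paths = next(gen)                            # (input, output, paths)
        seen.update(batch_paths)
    assert set(seen) == set(paths)                               # every path visited
    assert all(count == 1 for count in seen.values())            # exactly once (relaxed below to "at least once")
```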

StevenSong (Collaborator, Author), Jun 30, 2020:

oh, this was a really good idea. during development of this test, it seems it's possible to see a path twice before a true epoch is reached (i.e. before all the paths have been seen at least once)

Collaborator (Author):

screenshot from a notebook, batch size = 10, validation steps = 20

[screenshot]

Contributor:

Just realized this may be inevitable if some workers finish before others. Maybe the test should just check that each path is visited at least once. 200 vs 121 feels extreme though

Collaborator (Author):

200 vs 121 might be because of the small batch size; the worker that is started first can probably get through 20 tensors (10 per batch * 2 batches) before the last worker is even started

Collaborator (Author):

I can get this closer, but fixing the issue in such a way that tensors are repeated correctly when the number of tensors is not a multiple of the batch size is proving very difficult

[screenshot]

see: #345 (comment)

StevenSong (Collaborator, Author) commented Jul 1, 2020

the intermediate implementation is to prevent workers from completing cycle N+1 before other workers complete cycle N. this led to this discovery:

using the old example: batch size = 10, validation steps = 20, validation tensors = 200, num workers = 6. each worker has 33-34 tensors - i.e. to visit each tensor at least once, all workers must complete 4 cycles

validation workers 1-6 complete cycle 1 and yield 60 tensors (batch size 10) (6 validation steps)
validation workers 1-6 complete cycle 2 and yield 60 tensors, 120 total (12 validation steps)
validation workers 1-6 complete cycle 3 and yield 60 tensors, 180 total (18 validation steps)
2 more validation workers complete cycle 4 and yield 20 tensors - these include 6-8 more unique tensors and 12-14 non-unique tensors (depending on whether those workers have 33 or 34 tensors each). this completes the 20 validation steps. it is not possible to reach all 200 validation tensors in 20 validation steps with 6 validation workers.

the fix is to mess with num workers again to allocate according to batch size, number of steps and number of tensors.
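
working the arithmetic above through as a quick script (numbers from the example; variable names are just for illustration):

```python
import math

num_tensors, batch_size, num_workers, num_steps = 200, 10, 6, 20
per_worker = math.ceil(num_tensors / num_workers)        # 34 (some workers get 33)
full_rounds = num_steps // num_workers                   # 3 complete rounds of all 6 workers
leftover = num_steps % num_workers                       # 2 extra worker-cycles
unique_after_rounds = num_workers * min(full_rounds * batch_size, per_worker)  # 6 * 30 = 180
extra_unique = leftover * max(per_worker - full_rounds * batch_size, 0)        # at most 2 * 4 = 8
print(unique_after_rounds + extra_unique, "unique tensors of", num_tensors)    # at most 188 of 200
```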

StevenSong (Collaborator, Author):

> the fix is to mess with num workers again to allocate according to batch size, number of steps and number of tensors.

it's not as simple as changing the number of workers. changing the number of workers is okay when the number of tensors is a multiple of the batch size, because then you can reach a true epoch by requesting N batches. if the number of tensors in a true epoch is not a multiple of the batch size, then every true epoch after the first will require a varying number of batches, depending on which workers finish first

StevenSong (Collaborator, Author) commented Jul 1, 2020

> … the fix is to mess with num workers again to allocate according to batch size, number of steps and number of tensors.

A proposal for this is to set the number of steps large enough that all tensors are seen by the model at least once.

Probably num_steps = int(2 * num_tensors / batch_size) + 1 would be good
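
plugging the example numbers into that heuristic as a sanity check (not code from this PR):

```python
import math

num_tensors, batch_size, num_workers = 200, 10, 6
num_steps = int(2 * num_tensors / batch_size) + 1    # proposed heuristic: 41 steps

per_worker = math.ceil(num_tensors / num_workers)    # 34 tensors on the largest worker
cycles_needed = math.ceil(per_worker / batch_size)   # 4 cycles per worker to see all of its tensors
steps_needed = cycles_needed * num_workers           # 24 steps, assuming the round-robin enforced above
print(num_steps, ">=", steps_needed, "->", num_steps >= steps_needed)  # 41 >= 24 -> True
```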

StevenSong (Collaborator, Author):

rewriting tensor generator #353

Successfully merging this pull request may close these issues: "not all tensors properly used by train", "refactor concurrency to not use qsize"