synchronize with signals and fix tensors presented #345
Conversation
found the deadlock... let's say num workers = 4
the implication here is that Queue.put is non-atomic and that there is a lag between an item being put into the queue in a producer process and the item becoming visible in the queue in a consumer process. there's some documentation of the non-atomicity here. the worst part is that apparently this issue was fixed in python 3.7 (see the python bug report link). we're on 3.6 (let's upgrade eventually! @lucidtronix). the fix is hopefully to enforce that the consumer consumes num_workers number of items
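A minimal sketch of that fix (the name consume_one_cycle is illustrative, not an actual function in tensor_generators.py): instead of trusting qsize(), the consumer blocks on get() exactly num_workers times, so the lag between a producer's put() and the item becoming visible can no longer cause a miscount.

```python
def consume_one_cycle(stats_q, num_workers, timeout_s=60.0):
    """Block until exactly one item per worker has been received.

    stats_q is a multiprocessing.Queue. Relying on stats_q.qsize() == num_workers
    is racy: an item put() by a producer process may not be visible to the
    consumer yet, and qsize() is documented as approximate. Calling get() a
    fixed num_workers times sidesteps the problem entirely.
    """
    stats = []
    for _ in range(num_workers):
        stats.append(stats_q.get(timeout=timeout_s))  # blocks until an item arrives
    return stats
```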
ml4cvd/tensor_generators.py
@@ -161,7 +161,7 @@ def set_worker_paths(self, paths: List[Path]):
     def __next__(self) -> Tuple[Dict[str, np.ndarray], Dict[str, np.ndarray], Optional[List[str]]]:
         if not self._started:
             self._init_workers()
-        if self.stats_q.qsize() == self.num_workers:
+        if all(worker.signal.is_set() for worker in self.worker_instances):
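For context, a rough sketch of the signal pattern the new line relies on (the WorkerHandle name and structure are illustrative, not the actual classes in this PR): each worker owns a multiprocessing.Event that the child process sets when it finishes its cycle, and the parent checks that every worker's event is set.

```python
import multiprocessing as mp

def _worker_loop(paths, signal):
    """Illustrative worker body: work through its share of paths, then
    signal the parent that this cycle is complete."""
    for _path in paths:
        pass                      # ...load tensors, enqueue batches, etc.
    signal.set()                  # visible immediately in the parent process

class WorkerHandle:
    """Illustrative stand-in for a tensor-generator worker."""
    def __init__(self, paths):
        self.signal = mp.Event()
        self.process = mp.Process(target=_worker_loop, args=(paths, self.signal))
        self.process.start()

# Parent-side check, mirroring the changed line in __next__:
def epoch_finished(worker_instances):
    return all(worker.signal.is_set() for worker in worker_instances)
```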
Can we abstract this to a function, e.g. def epoch_just_finished() or something? I want to use it to know when to stop inference with multiprocessing.
sure, aggregate_and_print_stats will reset this condition though, just a thing to be aware of
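A sketch of how that helper might look (the method name follows the suggestion above; the caveat about aggregate_and_print_stats is the one just mentioned):

```python
def epoch_just_finished(self) -> bool:
    """True once every worker has signalled the end of its current cycle.

    Caveat: aggregate_and_print_stats() clears the worker signals, so this
    flips back to False as soon as the epoch's stats have been collected.
    """
    return all(worker.signal.is_set() for worker in self.worker_instances)
```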
@@ -161,7 +161,7 @@ def set_worker_paths(self, paths: List[Path]):
     def __next__(self) -> Tuple[Dict[str, np.ndarray], Dict[str, np.ndarray], Optional[List[str]]]:
Can you write a test that runs one epoch's worth of batches (with multiprocessing) and checks that every path is visited exactly once?
oh, this was a really good idea. during development of this test, it seems it's possible to see a path twice before a true epoch is reached (i.e. before all the paths have been seen at least once)
Just realized this may be inevitable if some workers finish before others. Maybe the test should just check that each path is visited at least once. 200 vs 121 feels extreme though.
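A sketch of the relaxed test (the generator, all_paths, and batch_size fixtures are hypothetical; the only assumption taken from the code is that __next__ returns (inputs, outputs, paths)):

```python
import math

def test_each_path_visited_at_least_once(generator, all_paths, batch_size):
    """Keep drawing batches (up to a generous cap) and check that every path
    eventually shows up. Exactly-once cannot be asserted, because workers
    that finish early start their next cycle before the others catch up."""
    seen = set()
    max_steps = 4 * math.ceil(len(all_paths) / batch_size)   # generous upper bound
    for _ in range(max_steps):
        _, _, paths = next(generator)    # assumes paths are returned (not None)
        seen.update(paths)
        if set(all_paths) <= seen:
            break
    assert set(all_paths) <= seen, f"missing: {set(all_paths) - seen}"
```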
200 vs 121 might be because of the small batch size: the worker that is started first is probably able to get through 20 tensors (10 per batch * 2 batches) before the last worker is even started.
can get this closer, but fixing the issue in such a way that tensors are repeated correctly when the number of tensors is not a multiple of the batch size is proving very difficult
see: #345 (comment)
intermediate implementation is to prevent workers from completing cycle N+1 before other workers complete cycle N. this led to a discovery. using the old example: batch size = 10, validation steps = 20, validation tensors = 200, num workers = 6. each worker gets 33-34 tensors, i.e. to visit each tensor at least once, all workers must complete 4 cycles. validation workers 1-6 complete cycle 1 and yield 60 tensors (batch size 10, i.e. 6 validation steps), so getting every worker through 4 cycles takes 24 validation steps, more than the 20 configured. the fix is to mess with num workers again, allocating according to batch size, number of steps and number of tensors.
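Spelling out the arithmetic from that example (the helper name is made up; the numbers are the ones quoted above):

```python
import math

def cycles_and_steps_for_true_epoch(num_tensors, num_workers, batch_size):
    """How many cycles every worker must complete so each tensor is seen at
    least once, and how many generator steps that corresponds to (one batch
    per worker per cycle)."""
    tensors_per_worker = math.ceil(num_tensors / num_workers)   # 200 / 6 -> 34
    cycles = math.ceil(tensors_per_worker / batch_size)         # 34 / 10 -> 4
    steps = cycles * num_workers                                 # 4 * 6  -> 24
    return cycles, steps

# batch size 10, 200 validation tensors, 6 workers -> 4 cycles, 24 steps,
# but only 20 validation steps are configured, so a true epoch is never reached.
print(cycles_and_steps_for_true_epoch(200, 6, 10))               # (4, 24)
```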
it's not as simple as changing the number of workers. changing the number of workers is okay when the number of tensors is a multiple of the batch size, because then you can reach a true epoch by requesting N batches. if the number of tensors in a true epoch is not a multiple of the batch size, then every true epoch after the first will require a varying number of batches, depending on which workers finish first.
A proposal for this is to set the number of steps large enough that all tensors are seen by the model at least once.
rewriting tensor generator #353
resolves #323
resolves #346
this PR aims to use more reliable and portable python functions (no longer using `qsize`), to fix the number of tensors presented (the `(i + 1) % ...` logic), and to fix 2 possible concurrency edge cases.

the first edge case: the consumer waits for `qsize == num_workers` and then consumes until the queue is empty (see appendix for current code). this opens the door for a problematic scenario; the fix is to have producers wait until the consumer gives the "all clear" to begin producing again. the consumer waits for all the producers to have enqueued, processes all the items, and then gives the all clear. concurrently, the producers are waiting to produce the next item.

for the second edge case, the fix is again to have the producers wait until the "all clear" is given before enqueuing the next set of items.
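A minimal, self-contained sketch of that handshake (an illustration of the idea, not the PR's actual code; it uses a multiprocessing.Barrier as a convenient way to implement the repeated "all clear", whereas the PR uses per-worker signals):

```python
import multiprocessing as mp

def producer(q, barrier, cycles, worker_id):
    """Enqueue one item per cycle, then wait for the consumer's all clear
    before starting the next cycle."""
    for cycle in range(cycles):
        q.put((worker_id, cycle))
        barrier.wait()                                  # released only once the consumer is done

def consumer(q, barrier, num_workers, cycles):
    """Drain exactly num_workers items per cycle, process them, then join the
    barrier, which is the all clear for the producers."""
    for _ in range(cycles):
        items = [q.get() for _ in range(num_workers)]   # exactly one per producer
        print(sorted(items))                            # aggregate stats, etc.
        barrier.wait()                                  # the "all clear"

if __name__ == "__main__":
    num_workers, cycles = 4, 3
    q = mp.Queue()
    barrier = mp.Barrier(num_workers + 1)               # all producers + the consumer
    procs = [mp.Process(target=producer, args=(q, barrier, cycles, i))
             for i in range(num_workers)]
    for p in procs:
        p.start()
    consumer(q, barrier, num_workers, cycles)
    for p in procs:
        p.join()
```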
Appendix:
current consumer code:
https://github.com/broadinstitute/ml/blob/dd13b4518b3547ebdcad13698f0c71a4abaaafb8/ml4cvd/tensor_generators.py#L171-L186
current producer code:
https://github.com/broadinstitute/ml/blob/dd13b4518b3547ebdcad13698f0c71a4abaaafb8/ml4cvd/tensor_generators.py#L426-L437