2523ea1 to
16e9a86
Contributor
Author
Stuff is gonna move around, don't worry about reviewing too in-depth for now
Northbadge
added a commit
to Northbadge/ml-compiler-opt
that referenced
this pull request
Aug 11, 2022
Part of google#96
Merged
Northbadge
added a commit
to Northbadge/ml-compiler-opt
that referenced
this pull request
Aug 11, 2022
- Allows a user to start/stop processes at will, via OS signals SIGSTOP and SIGCONT.
- Allows a user to bind processes to specific CPUs.
- Allows local_worker_pool to be used outside of a context manager
- Switch workers to be Protocol based, so Workers are effectively duck-typed (i.e. anything that has the required methods passes as a Worker)

Part of google#96
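For readers unfamiliar with the mechanism, here is a minimal sketch of pausing, resuming, and pinning a child process using only the Python standard library. It is not the code from this PR: `pause`, `resume`, `bind_to_cpus`, and `spawn_sleeper` are hypothetical helper names, and `os.sched_setaffinity` is only available on Linux.

```python
import os
import signal
import subprocess
import sys
from typing import Iterable, Set


def pause(proc: subprocess.Popen) -> None:
    """Suspend the child. SIGSTOP cannot be caught or ignored by the target."""
    os.kill(proc.pid, signal.SIGSTOP)


def resume(proc: subprocess.Popen) -> None:
    """Let a previously stopped child continue running."""
    os.kill(proc.pid, signal.SIGCONT)


def bind_to_cpus(proc: subprocess.Popen, cpus: Iterable[int]) -> Set[int]:
    """Restrict the child to the given logical CPUs (Linux-only call)."""
    os.sched_setaffinity(proc.pid, set(cpus))
    return set(os.sched_getaffinity(proc.pid))


def spawn_sleeper() -> subprocess.Popen:
    """Hypothetical stand-in for a worker process: it just sleeps."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"])
```

A stopped process keeps its memory and open files but gets no CPU time until it receives SIGCONT, which is what makes this useful for yielding cores to another workload.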
Northbadge
added a commit
to Northbadge/ml-compiler-opt
that referenced
this pull request
Aug 16, 2022
Will be used directly by google#96
yundiqian
pushed a commit
that referenced
this pull request
Aug 16, 2022
Will be used directly by #96
Northbadge
added a commit
to Northbadge/ml-compiler-opt
that referenced
this pull request
Aug 18, 2022
- Allows a user to start/stop processes at will, via OS signals SIGSTOP and SIGCONT.
- Allows a user to bind processes to specific CPUs.
- Allows local_worker_pool to be used outside of a context manager
- Switch workers to be Protocol based, so Workers are effectively duck-typed (i.e. anything that has the required methods passes as a Worker)

Part of google#96
mtrofin
pushed a commit
that referenced
this pull request
Aug 18, 2022
Add pause/resume/context to workers

- Allows a user to start/stop processes at will, via OS signals SIGSTOP and SIGCONT.
- Allows a user to bind processes to specific CPUs.
- Allows local_worker_pool to be used outside of a context manager
- Switch workers to be Protocol based, so Workers are effectively duck-typed (i.e. anything that has the required methods passes as a Worker)

Part of #96
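To illustrate what "Protocol based" duck typing means here, a small sketch with `typing.Protocol` follows. The method names `pause` and `resume` and the class `SleepyWorker` are hypothetical; the actual Worker interface in the repository may require different methods.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Worker(Protocol):
    """Hypothetical worker interface: anything with these methods qualifies."""

    def pause(self) -> None:
        ...

    def resume(self) -> None:
        ...


class SleepyWorker:
    """Does not inherit from Worker, yet satisfies it structurally."""

    def __init__(self) -> None:
        self.paused = False

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False


def pause_all(workers: list) -> int:
    """Pause every object that passes the structural Worker check."""
    count = 0
    for w in workers:
        if isinstance(w, Worker):
            w.pause()
            count += 1
    return count
```

Note that `runtime_checkable` only verifies that the methods exist, not their signatures; static type checkers do the stricter check.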
- Allows concurrent evaluation of models on a separate dataset during training, with --validation_data_path
- This is done with minimal impact on training time by only utilizing the CPU for the validation dataset when it is mostly idle doing tf.train(), and pinning processes to specific CPUs
- The amount of impact can be adjusted via a gin.config on cpu_affinity.py
- CPU affinities are only optimized for internal AMD-Zen based systems at the moment, but can be extended in the future.
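As a rough illustration of how pinning can keep validation from competing with training, the sketch below splits an ordered CPU list into two disjoint sets. This is an assumption about one possible scheme, not the logic in cpu_affinity.py: `split_cpus` and `validation_share` are hypothetical names and do not correspond to the PR's gin parameters.

```python
from typing import List, Sequence, Tuple


def split_cpus(cpu_order: Sequence[int],
               validation_share: float) -> Tuple[List[int], List[int]]:
    """Reserve the tail of an ordered CPU list for validation workers.

    cpu_order is assumed to list CPUs that share cache next to each other,
    so cutting the list keeps neighbouring CPUs on the same side.
    """
    if not 0.0 <= validation_share <= 1.0:
        raise ValueError("validation_share must be between 0 and 1")
    n_validation = int(len(cpu_order) * validation_share)
    cut = len(cpu_order) - n_validation
    return list(cpu_order[:cut]), list(cpu_order[cut:])
```

For example, `split_cpus([0, 4, 1, 5, 2, 6, 3, 7], 0.25)` leaves six CPUs for training and reserves `[3, 7]` for validation.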
16e9a86 to
bd8f87a
mtrofin
reviewed
Aug 19, 2022
_NR_CPUS = psutil.cpu_count()

_CPU_CONFIG = {  # List of CPU numbers in cache-sharing order.
    # 'google-epyc' assumes logical core 0 and N/2 are the same physical core.
Collaborator
It could probably be named something more neutral, like 'default'?
Contributor
Author
'default' kind of implies it'd work fine for any system, whereas we can only guarantee it'll work fine on a google server running Epyc CPUs
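To make the 'google-epyc' assumption concrete, here is a sketch of an ordering that places the two logical CPUs of each physical core next to each other, given only the stated rule that logical core i and i + N/2 share a physical core. `sibling_pair_order` is a hypothetical helper; the real `_CPU_CONFIG` entry may also account for shared L3 cache groups, which this sketch ignores.

```python
from typing import List


def sibling_pair_order(nr_cpus: int) -> List[int]:
    """Order logical CPUs so SMT siblings (i and i + N/2) are adjacent."""
    if nr_cpus <= 0 or nr_cpus % 2 != 0:
        raise ValueError("expected a positive, even number of logical CPUs")
    half = nr_cpus // 2
    order: List[int] = []
    for core in range(half):
        # Both hardware threads of the same physical core, back to back.
        order.extend((core, core + half))
    return order
```

On a machine with a different sibling numbering (for example, adjacent logical IDs per core) this ordering would be wrong, which is the reason for not calling the entry 'default'.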
mtrofin
reviewed
Aug 19, 2022
        cancelled work.
      RuntimeError: if llvm-size produces unexpected output.
    """
    if cancellation_manager is None:
Collaborator
weird... hmm, should we just not pass cancellation_manager because there's self._cancellation_manager anyway?
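For context, the pattern being questioned looks roughly like the sketch below: an optional argument that falls back to an instance attribute. Only `cancellation_manager` and `self._cancellation_manager` come from the diff; the classes `CancellationManager` and `Runner`, the method `manager_for_call`, and the fallback assignment itself are assumptions made for illustration.

```python
from typing import Optional


class CancellationManager:
    """Hypothetical stand-in that records whether cancellation was requested."""

    def __init__(self) -> None:
        self.cancelled = False

    def signal(self) -> None:
        self.cancelled = True


class Runner:
    """Hypothetical runner that owns a default cancellation manager."""

    def __init__(self) -> None:
        self._cancellation_manager = CancellationManager()

    def manager_for_call(
        self,
        cancellation_manager: Optional[CancellationManager] = None
    ) -> CancellationManager:
        # Assumed fallback: use the instance's manager when none is passed.
        if cancellation_manager is None:
            cancellation_manager = self._cancellation_manager
        return cancellation_manager
```

Dropping the parameter, as suggested, would simplify the signature at the cost of no longer letting a caller cancel one call independently of the rest.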
Northbadge
added a commit
to Northbadge/ml-compiler-opt
that referenced
this pull request
Aug 19, 2022
This was referenced Aug 19, 2022
mtrofin
reviewed
Aug 19, 2022
    self._running_policy = None
    self._default_futures: List[worker.WorkerFuture] = []
    self._current_work: List[Tuple[corpus.ModuleSpec, worker.WorkerFuture]] = []
    self._last_time = None
Collaborator
do we need the time stuff anymore?
Contributor
Author
the time stuff is purely cosmetic, just to get the total wall time spent compiling validation modules
mtrofin
pushed a commit
that referenced
this pull request
Sep 7, 2022
Missing tests at the moment, will add after the interface is more concrete