
Commit 187b38a

Merge pull request #101 from jgaglione/master
jgaglione status update
2 parents: 466d4f1 + e08c6c0

File tree

1 file changed: +4 -1 lines changed


pages/postdocs/jgaglione.md

+4 -1
@@ -21,5 +21,8 @@ mentors:
 proposal: /assets/pdfs/Jethro-Gaglione.pdf
 presentations:
 current_status: >
-  We have compiled ML training examples to familiarize new users with our recommended running environments and packages. These include functionality to log training runs on MLflow, a platform that facilitates important aspects of the training cycle, such as metric monitoring, reproducibility, and deployment. We have put together a "quick start" guide, which we are working on expanding into full-fledged user documentation. We have single-GPU training capabilities in place and are currently working on interfacing with multiple GPUs and on submission to multi-GPU nodes via our schedulers.
+  Over the last quarter we have incorporated and finalized testing of job submission to the cluster via an mlflow-slurm interface, which allows users to take advantage of cluster capabilities through MLflow Projects. These neatly package the intended training project along with its package and system requirements for easy reproducibility, without users having to explicitly learn Slurm scheduler directives and scripting. Work is ongoing on a REST-based MLflow interface that will allow CMS users without ACCRE accounts to submit training workflows remotely.
+
+  We have also begun investigating the use of Optuna as a suggested framework for hyperparameter optimization (HPO). It works seamlessly with MLflow and Slurm, and takes advantage of Bayesian optimization and pruning to run HPO efficiently.
+
 ---
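
As a rough illustration of the mlflow-slurm submission path described in the status update, the sketch below hands an MLflow Project to the Slurm scheduler through MLflow's projects API. It is a minimal sketch under assumptions, not the group's actual setup: the project directory, entry-point name, hyperparameters, and slurm_config.json file are placeholders, and the "slurm" backend assumes the mlflow-slurm plugin is installed.

```python
# Rough sketch: submitting a packaged training job to the cluster through
# MLflow Projects with a Slurm backend. Assumes the current directory holds
# an MLproject file (entry point "main" wrapping a training script plus an
# environment spec) and that the mlflow-slurm plugin is installed so MLflow
# can hand the run to the Slurm scheduler. Parameter names and the
# backend-config file are illustrative placeholders.
import mlflow.projects

submitted = mlflow.projects.run(
    uri=".",                                # project directory containing MLproject
    entry_point="main",
    parameters={"epochs": 20, "lr": 1e-3},  # example hyperparameters
    backend="slurm",                        # registered by the mlflow-slurm plugin
    backend_config="slurm_config.json",     # partition/account/GPU settings (site-specific)
    synchronous=False,                      # return once the job is queued
)
print("queued MLflow run:", submitted.run_id)
```

Because the project bundles its own environment specification, the same directory can be run locally with MLflow's default backend for debugging before it is submitted to the cluster.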
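
For the Optuna investigation, a minimal sketch of how its Bayesian (TPE) sampler and median pruning could be combined with MLflow run logging is shown below. The objective and its synthetic validation loss are stand-ins for a real training loop, and all names are illustrative rather than taken from the project.

```python
# Minimal sketch (illustrative only): Optuna hyperparameter optimization with
# each trial logged as a nested MLflow run. The TPE sampler supplies the
# Bayesian-style search and MedianPruner stops unpromising trials early; the
# "validation loss" is a synthetic placeholder for a real training loop.
import math

import mlflow
import optuna


def fake_validation_loss(lr: float, hidden: int, epoch: int) -> float:
    """Stand-in for a real train/validate step; smaller is better."""
    return (abs(math.log10(lr) + 3.0) + 64.0 / hidden) / (epoch + 1)


def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    hidden = trial.suggest_int("hidden_units", 32, 512)

    with mlflow.start_run(nested=True):     # one MLflow run per trial
        mlflow.log_params(trial.params)
        val_loss = math.inf
        for epoch in range(10):
            val_loss = fake_validation_loss(lr, hidden, epoch)
            mlflow.log_metric("val_loss", val_loss, step=epoch)
            trial.report(val_loss, step=epoch)
            if trial.should_prune():        # prune unpromising trials early
                raise optuna.TrialPruned()
    return val_loss


if __name__ == "__main__":
    study = optuna.create_study(
        direction="minimize",
        sampler=optuna.samplers.TPESampler(),   # Bayesian optimization
        pruner=optuna.pruners.MedianPruner(),
    )
    with mlflow.start_run(run_name="optuna_hpo"):  # parent run for the study
        study.optimize(objective, n_trials=20)
    print("best parameters:", study.best_params)
```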
