
Commit 926f4da
Merge pull request #107 from jgaglione/master
Fixed jgaglione.md
2 parents: e9b5114 + 8a311a0

File tree: 1 file changed (+3 −3 lines)

pages/postdocs/jgaglione.md (+3 −3)
@@ -15,12 +15,12 @@ e-mail: [email protected]
 project_title: Development of a Distributed GPU Machine Learning Training Facility at Vanderbilt's ACCRE Cluster
 project_goal: >
   The objective of this project is to leverage the knowledge and hardware available within the Vanderbilt research computing center (ACCRE) to develop a prototype distributed GPU machine learning training facility. We aim to provide the efficiency boost of training on a multi-GPU platform to CMS users and beyond, while abstracting away the highly technical details necessary to do so.
-mentors:
-  - Andrew Melo - (Vanderbilt University)>
+mentors: >
+  - Andrew Melo - (Vanderbilt University)
 
 proposal: /assets/pdfs/Jethro-Gaglione.pdf
 presentations:
 current_status: >
-The last quarter we have incorporated and finalized testing on submission of jobs to the cluster via an mlflow-slurm interface that allows users to take advantage of cluster capabilities via MLflow projects. These nicely package the intended training project along with package and system requirements for easy reproducibility without having to explicitly learn Slurm scheduler directives and scripting. Work is ongoing to make a REST-based MLflow interface, which will allow CMS users without ACCRE accounts to submit training workflows remotely. We have also begun investigating the incorporating the use of Optuna as a suggested framework for hyperparameter optimization. This can work seamlessly with MLFlow and Slurm, and takes advantage of Bayesian optimization and pruning to efficiently run HPO.
+  The last quarter we have incorporated and finalized testing on submission of jobs to the cluster via an mlflow-slurm interface that allows users to take advantage of cluster capabilities via MLflow projects. These nicely package the intended training project along with package and system requirements for easy reproducibility without having to explicitly learn Slurm scheduler directives and scripting. Work is ongoing to make a REST-based MLflow interface, which will allow CMS users without ACCRE accounts to submit training workflows remotely. We have also begun investigating the incorporating the use of Optuna as a suggested framework for hyperparameter optimization. This can work seamlessly with MLFlow and Slurm, and takes advantage of Bayesian optimization and pruning to efficiently run HPO.
 
 ---
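For context, the bug being fixed concerns YAML's folded block scalar indicator (`>`). In the original front matter the `>` sat at the end of the mentor entry, where it is just a stray character inside a list item; the commit moves it onto the `mentors:` key so the value is parsed as a folded string, matching how `project_goal` and `current_status` are written elsewhere in the file. A minimal sketch of the two forms (field values abbreviated):

```yaml
# Before: plain key; the value is a YAML list whose single item
# ends with a literal, unintended ">" character.
mentors:
  - Andrew Melo - (Vanderbilt University)>

# After: ">" is the folded-scalar indicator on the key, so the
# indented line below is read as one plain string value.
mentors: >
  - Andrew Melo - (Vanderbilt University)
```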
