Commit ad4ed4d

make user descriptions technology generic
1 parent: c5cda4c

File tree

1 file changed: +5 −6 lines changed

target_users.md (+5 −6)
@@ -14,19 +14,18 @@
 * Gang-Scheduling for Distributed Compute
 * Job/Infrastructure Queuing
 
-I want to enable a team of data scientists to have self-serve, but limited, access to a shared pool of distributed compute resources such as GPUs for large scale machine learning model training jobs. If the existing pool of resources is insufficient, I want my cluster to scale up (to a defined quota) to meet my users’ needs and scale back down automatically when their jobs have completed. I want these features to be made available through simple installation of Operators via the OperatorHub UI. I also want the ability to see the current MCAD queue, active and requested resources on my clusters, and the progress of all current jobs visualized in a simple dashboard.
+I want to enable a team of data scientists to have self-serve, but limited, access to a shared pool of distributed compute resources such as GPUs for large scale machine learning model training jobs. If the existing pool of resources is insufficient, I want my cluster to scale up (to a defined quota) to meet my users’ needs and scale back down automatically when their jobs have completed. I want these features to be made available through simple installation of generic modules via a user-friendly interface. I also want the ability to monitor the current queue of pending tasks, the utilization of active resources, and the progress of all current jobs visualized in a simple dashboard.
 
 ## Data Scientist I
 
 * Training Mid-Size Models (less than 1,000 nodes)
 * Fine-Tuning Existing Models
-* Ray/KubeRay
+* Distributed Compute Framework
 
-I need temporary access to a reasonably large set of GPU enabled nodes on my team’s shared cluster for short term experimentation, parallelizing my existing ML workflow, or fine-tuning existing large scale models. I’d prefer to work from a notebook environment with access to a python sdk that I can use to request the creation of Ray Clusters that I can distribute my workloads across. In addition to interactive experimentation work, I also want the ability to “fire-and-forget” longer running ML jobs onto temporarily deployed Ray Clusters with the ability to monitor these jobs while they are running and access to all of their artifacts once complete. I also want to see where my jobs are in the current MCAD queue and the progress of all my current jobs visualized in a simple dashboard.
+I need temporary access to a reasonably large set of GPU enabled nodes on my team’s shared cluster for short term experimentation, parallelizing my existing ML workflow, or fine-tuning existing large scale models. I’d prefer to work from a notebook environment with access to a python sdk that I can use to request the creation of Framework Clusters that I can distribute my workloads across. In addition to interactive experimentation work, I also want the ability to “fire-and-forget” longer running ML jobs onto temporarily deployed Framework Clusters with the ability to monitor these jobs while they are running and access to all of their artifacts once complete. I also want to see where my jobs are in the current queue and the progress of all my current jobs visualized in a simple dashboard.
 
 ## Data Scientist II
 * Training Foundation Models (1,000+ nodes)
-* TorchX-MCAD
-* Ray/KubeRay
+* Distributed Compute Framework
 
-I need temporary (but long term) access to a massive amount of GPU enabled infrastructure to train a foundation model. I want to be able to “fire-and-forget” my ML Job into this environment, which involves submitting my job directly to MCAD via TorchX, with the MCAD-Kubernetes scheduler or a Ray Cluster via TorchX, with the Ray scheduler. Due to the size and cost associated with this job, it has already been well tested and validated, so access to jupyter notebooks is unnecessary. I would prefer to write my job as a bash script leveraging the CodeFlare CLI, or as a python script leveraging the CodeFlare SDK. I need the ability to monitor the job while it is running, as well as access to all of its artifacts once complete. I also want to see where my jobs are in the current MCAD queue and the progress of all my current jobs visualized in a simple dashboard.
+I need temporary (but long term) access to a massive amount of GPU enabled infrastructure to train a foundation model. I want to be able to “fire-and-forget” my ML Job into this environment. Due to the size and cost associated with this job, it has already been well tested and validated, so access to jupyter notebooks is unnecessary. I would prefer to write my job as a bash script leveraging a CLI, or as a python script leveraging an SDK. I need the ability to monitor the job while it is running, as well as access to all of its artifacts once complete. I also want to see where my jobs are in the current queue and the progress of all my current jobs visualized in a simple dashboard.
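The workflow described in the Data Scientist I persona, requesting a temporary Framework (Ray) Cluster from a notebook through a Python SDK, can be pictured with a minimal sketch. It is not part of this commit; it assumes the CodeFlare SDK named in the pre-change text, and the exact import path, configuration fields, resource values, and names used here are illustrative assumptions that vary by SDK version.

```python
# Minimal sketch (not part of the commit): request a temporary distributed
# compute cluster from a notebook, wait for the queued resources, then release
# them. Field and method names are assumptions based on the CodeFlare SDK and
# may differ by version; "fine-tune-demo" and "ml-team" are hypothetical.
from codeflare_sdk import Cluster, ClusterConfiguration

config = ClusterConfiguration(
    name="fine-tune-demo",   # hypothetical cluster name
    namespace="ml-team",     # hypothetical shared-team namespace
    num_workers=2,           # small worker set for short-term experimentation
)
cluster = Cluster(config)

cluster.up()              # submit the cluster request to the shared queue
cluster.wait_ready()      # block until the queued resources are granted
print(cluster.details())  # inspect allocated resources and dashboard links

# ... distribute the ML workload across the cluster here ...

cluster.down()            # release the shared GPU nodes once finished
```

The "fire-and-forget" path in the Data Scientist II persona follows the same request-and-release pattern, except the job is handed to the queue by a batch script via a CLI or SDK rather than driven interactively from a notebook.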
