
Commit 3464c14

Authored Jul 25, 2023
add target users doc (#209)
* add target users doc
* make user descriptions technology generic
1 parent 4d523cc · commit 3464c14

File tree

1 file changed: +31 -0 lines changed

 

Diff for: target_users.md

@@ -0,0 +1,31 @@
# CodeFlare Stack Target Users

[Cluster Admin](#cluster-administrator)

[Data Scientist I](#data-scientist-i)

[Data Scientist II](#data-scientist-ii)

## Cluster Administrator

* Quota Management
* Gang-Scheduling for Distributed Compute
* Job/Infrastructure Queuing

I want to enable a team of data scientists to have self-serve, but limited, access to a shared pool of distributed compute resources such as GPUs for large-scale machine learning model training jobs. If the existing pool of resources is insufficient, I want my cluster to scale up (to a defined quota) to meet my users’ needs and scale back down automatically when their jobs have completed. I want these features to be made available through simple installation of generic modules via a user-friendly interface. I also want the ability to monitor the current queue of pending tasks, the utilization of active resources, and the progress of all current jobs, visualized in a simple dashboard.
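For illustration only, the kind of limit an administrator would declare might be captured as a small piece of configuration data. The type and field names below are hypothetical and are not part of the CodeFlare stack; they simply sketch what a per-team quota with scale-up bounds could contain:

```python
# Hypothetical sketch of a per-team quota that bounds automatic scale-up.
# None of these names come from the CodeFlare stack; they only illustrate
# the limits a cluster administrator would want to declare.
from dataclasses import dataclass


@dataclass
class TeamQuota:
    team: str
    max_gpus: int         # hard ceiling the shared pool may scale up to
    max_cpu_cores: int
    max_memory_gib: int


# The shared pool scales toward these limits while jobs are queued
# and back down once the queue drains.
ml_team = TeamQuota(team="ml-research", max_gpus=16, max_cpu_cores=256, max_memory_gib=1024)
print(ml_team)
```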
## Data Scientist I

* Training Mid-Size Models (less than 1,000 nodes)
* Fine-Tuning Existing Models
* Distributed Compute Framework

I need temporary access to a reasonably large set of GPU-enabled nodes on my team’s shared cluster for short-term experimentation, parallelizing my existing ML workflow, or fine-tuning existing large-scale models. I’d prefer to work from a notebook environment with access to a Python SDK that I can use to request the creation of Framework Clusters that I can distribute my workloads across. In addition to interactive experimentation work, I also want the ability to “fire-and-forget” longer-running ML jobs onto temporarily deployed Framework Clusters, with the ability to monitor these jobs while they are running and access to all of their artifacts once complete. I also want to see where my jobs are in the current queue and the progress of all my current jobs, visualized in a simple dashboard.
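Purely as an illustrative sketch of this workflow: since the description above is deliberately technology-generic, the `framework_sdk` module and every class, method, and parameter below are hypothetical stand-ins for whatever Python SDK the stack exposes, not a real CodeFlare API:

```python
# Hypothetical notebook-style workflow; "framework_sdk" and all names in it
# are illustrative stand-ins, not an actual CodeFlare API.
from framework_sdk import FrameworkCluster, Job

# Request a temporary Framework Cluster from the shared pool.
cluster = FrameworkCluster(
    name="finetune-experiment",
    namespace="ml-team",
    workers=4,
    gpus_per_worker=1,
)
cluster.up()          # queued until quota allows it to be scheduled
cluster.wait_ready()

# Interactive experimentation happens here, with work distributed
# across the cluster directly from the notebook.

# "Fire-and-forget" a longer-running job onto the same cluster.
job = Job(script="finetune.py", requirements="requirements.txt").submit(cluster)
print(job.status())   # monitor while it runs
print(job.logs())     # collect logs/artifacts once complete

# Release the resources back to the shared pool.
cluster.down()
```

The point of the sketch is the lifecycle: the cluster is requested, used, and torn down entirely from the notebook, while queue position and job progress surface in the shared dashboard.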
## Data Scientist II

* Training Foundation Models (1,000+ nodes)
* Distributed Compute Framework

I need temporary (but long-term) access to a massive amount of GPU-enabled infrastructure to train a foundation model. I want to be able to “fire-and-forget” my ML job into this environment. Due to the size and cost associated with this job, it has already been well tested and validated, so access to Jupyter notebooks is unnecessary. I would prefer to write my job as a bash script leveraging a CLI, or as a Python script leveraging an SDK. I need the ability to monitor the job while it is running, as well as access to all of its artifacts once complete. I also want to see where my jobs are in the current queue and the progress of all my current jobs, visualized in a simple dashboard.
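Again purely as an illustration, and with the same caveat as above (`framework_sdk` and all names in it are hypothetical stand-ins, not a real CodeFlare API or CLI), a fire-and-forget submission script might look something like this:

```python
#!/usr/bin/env python
# Hypothetical batch submission script; "framework_sdk" and its members are
# illustrative only and do not name a real CodeFlare API.
import sys
import time

from framework_sdk import Job

# Submit the pre-validated training job against the shared pool; the
# scheduler queues it until quota and capacity are available.
job = Job(
    script="train_foundation_model.py",
    workers=1024,
    gpus_per_worker=8,
    requirements="requirements.txt",
).submit()

print(f"submitted job {job.id}; current queue position: {job.queue_position()}")

# Optional lightweight monitoring loop; artifacts are fetched once the job ends.
while not job.done():
    print(job.status())
    time.sleep(300)

job.download_artifacts("./artifacts")
sys.exit(0 if job.succeeded() else 1)
```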
