|
| 1 | +**Spark Job Submit Tool (SJST)** is a tool to submit Spark jobs on **Aurora**. SJST enables [OAP MLlib](https://github.com/oap-project/oap-mllib), which takes advantage of [Intel® oneAPI Data Analytics Library (oneDAL)](https://github.com/oneapi-src/oneDAL) and [Intel® oneAPI Collective Communications Library (oneCCL)](https://github.com/oneapi-src/oneCCL) to implement highly optimized machine learning algorithms. It can get most out of CPU and GPU capabilities and take efficient communication patterns in multi-node multi-GPU clusters. |
| 2 | + |
| 3 | +The following is a quick start for SJST with an example about how to run K-Means program with GPU, and a detailed configuration description. |
| 4 | + |
| 5 | +- [Quick Start](#quick-start) |
| 6 | + - [Introduction To SJST](#introduction-to-sjst) |
| 7 | + - [Create Spark Configuration Files](#create-spark-configuration-files) |
| 8 | + - [Submit Spark Job](#submit-spark-job) |
| 9 | + - [Submit Spark Job (interactive model)](#submit-spark-job-interactive-model) |
| 10 | + - [Check Your Job Status](#check-your-job-status) |
| 11 | + - [Check Output of Your Job](#check-output-of-your-job) |
| 12 | + - [Command Line Help](#command-line-help) |
| 13 | + - [Bash Mode](#bash-mode) |
| 14 | + - [Interactive Mode](#interactive-mode) |
| 15 | +- [Configurations](#configurations) |
| 16 | + - [env_aurora.sh](#env_aurorash) |
| 17 | + - [env_local.sh](#env_localsh) |
| 18 | +- [Example Configuration](#example-configuration) |
| 19 | + |
| 20 | +## Quick Start |
| 21 | +### Introduction To SJST |
| 22 | +SJST is available at /lus/flare/projects/Aurora_deployment/spark/spark-job. The component of SJST is listed in the table below. |
| 23 | +```text |
| 24 | +spark-job |
| 25 | +├── bin/ // The scripts that setup spark cluster and submit jobs |
| 26 | +├── conf-to-your-submit-dir/ // Configurations about workers and executors |
| 27 | +│ // You'll mainly work with this directory. |
| 28 | +├── example/ // Example for testing and reference |
| 29 | +├── jars/ // Files for loading OAP MLlib |
| 30 | +├── LICENSE |
| 31 | +├── README.docx |
| 32 | +└── README.txt |
| 33 | +
|
| 34 | +``` |
| 35 | + |
| 36 | +The following sections give a detailed example about how to submit a job. |
| 37 | + |
| 38 | +### Create Spark Configuration Files |
| 39 | +The spark-job/conf-to-your-submit-dir/ directory contains configuration files you want. You need to make a copy in your working directory. |
| 40 | + |
| 41 | +For example: |
| 42 | +```shell |
| 43 | +$ mkdir ~/spark_work_home |
| 44 | +$ cp /lus/flare/projects/Aurora_deployment/spark/spark-job/conf-to-your-submit-dir/* ~/spark_work_home |
| 45 | +``` |
| 46 | + |
| 47 | +### Submit Spark Job |
| 48 | +After setup configurations, you can use the scripts in spark-job/bin/ to submit jobs from your working directory. Please make sure your configuration files are placed properly. |
| 49 | + |
| 50 | +Here are the steps to submit the dense K-Means job with 2 nodes to queue “lustre_scaling” from Aurora login node. The data path given below is a 192G CSV file from DAOS |
| 51 | + |
| 52 | +```shell |
| 53 | +$ cd ~/spark_work_home |
| 54 | +# Submit a Spark job with 2 nodes to read data from DAOS and run K-Means example. |
| 55 | +$ /lus/flare/projects/Aurora_deployment/spark/spark-job/bin/submit-spark.sh -A Aurora_deployment -l walltime=2:00:00 -l select=2 -l filesystems=flare:daos_user -q lustre_scaling /lus/flare/projects/Aurora_deployment/spark/spark-job/example/kmeans-pyspark.py daos://Intel2/hadoop_fs/HiBench/Kmeans/Input/36000000 |
| 56 | +# Submit a Spark job with 2 nodes to read data from Lustre Filesystem and run K-Means example. |
| 57 | +$ /lus/flare/projects/Aurora_deployment/spark/spark-job/bin/submit-spark.sh -A Aurora_deployment -l walltime=2:00:00 -l select=2 -l filesystems=flare:daos_user -q lustre_scaling /lus/flare/projects/Aurora_deployment/spark/spark-job/example/kmeans-pyspark.py file:///lus/flare/projects/Aurora_deployment/spark/DataRoot/HiBench/Kmeans/Input/36000000 |
| 58 | + |
| 59 | +``` |
| 60 | +After submitted successfully, you can get your job ID which is generated by Aurora. It is 27643.amn-0001 in this example. The following shows a successful submission: |
| 61 | + |
| 62 | +```shell |
| 63 | +# Submitting job: /lus/flare/projects/Aurora_deployment/spark/spark-job/example/kmeans-pyspark.py daos://Intel/hadoop_fs2/HiBench/Kmeans/Input/36000000 |
| 64 | +debug 1286330.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov |
| 65 | +debug -A Aurora_deployment -l select=2 -l walltime=2:00:00 -l filesystems=flare:daos_user -q lustre_scaling -v SPARKJOB_SCRIPTS_DIR=/lus/flare/projects/Aurora_deployment/spark/spark-job/bin,SPARKJOB_CONFIG_DIR=/home/damon/spark_work_home,SPARKJOB_INTERACTIVE=0,SPARKJOB_SCRIPTMODE=0,SPARKJOB_OUTPUT_DIR=/home/damon/spark_work_home,SPARKJOB_SEPARATE_MASTER=0,SPARKJOB_OAPML=1,SPARKJOB_DAOS=1,SPARKJOB_ARG=/lus/flare/projects/Aurora_deployment/spark/spark-job/example/kmeans-pyspark.py^daos://Intel2/hadoop_fs/HiBench/Kmeans/Input/36000000 -o /home/damon/spark_work_home -e /home/damon/spark_work_home /lus/flare/projects/Aurora_deployment/spark/spark-job/bin/start-spark.sh |
| 66 | +# Submitted |
| 67 | +SPARKJOB_JOBID=1286330.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov |
| 68 | +``` |
| 69 | +If the submission failed, the output will be like the following: |
| 70 | +```shell |
| 71 | +# Submitting job: /lus/flare/projects/Aurora_deployment/spark/spark-job/example/kmeans-pyspark.py daos://Intel2/hadoop_fs/HiBench/Kmeans/Input/36000000 |
| 72 | +qsub: would exceed queue generic's per-user limit of jobs in 'Q' state |
| 73 | +debug |
| 74 | +
|
| 75 | +debug -A Aurora_deployment -l select=2 -l walltime=2:00:00 -l filesystems=flare:daos_user -q lustre_scaling -v SPARKJOB_SCRIPTS_DIR=/lus/flare/projects/Aurora_deployment/spark/spark-job/bin,SPARKJOB_CONFIG_DIR=/home/damon/spark_work_home,SPARKJOB_INTERACTIVE=0,SPARKJOB_SCRIPTMODE=0,SPARKJOB_OUTPUT_DIR=/home/damon/spark_work_home,SPARKJOB_SEPARATE_MASTER=0,SPARKJOB_OAPML=1,SPARKJOB_DAOS=1,SPARKJOB_ARG=/lus/flare/projects/Aurora_deployment/spark/spark-job/example/kmeans-pyspark.py^daos://Intel2/hadoop_fs/HiBench/Kmeans/Input/36000000 -o /home/damon/spark_work_home -e /home/damon/spark_work_home /lus/flare/projects/Aurora_deployment/spark/spark-job/bin/start-spark.sh |
| 76 | +
|
| 77 | +# Submitting failed. |
| 78 | +``` |
| 79 | +
|
| 80 | +### Submit Spark Job (interactive model) |
| 81 | +After setup configurations, you can use the scripts in spark-job/bin/ to submit jobs from your working directory. Please make sure your configuration files are placed properly. |
| 82 | +
|
| 83 | +Here are the steps to enter the interactive interface with 2 nodes to queue “lustre_scaling” from Aurora login node. |
| 84 | +```shell |
| 85 | +$ cd ~/spark_work_home |
| 86 | +$ /lus/flare/projects/Aurora_deployment/spark/spark-job/bin/submit-spark.sh -A Aurora_deployment -l walltime=2:00:00 -l select=2 -l filesystems=flare:daos_user -q lustre_scaling -I |
| 87 | +``` |
| 88 | +After submitted successfully, you can get your job ID which is generated by Aurora. It is 67690.amn-0001 in this example. The following shows a successful submission: |
| 89 | +```shell |
| 90 | +Submitting an interactive job and wait for at most 1800 sec. |
| 91 | +debug 67690.amn-0001 |
| 92 | +debug -A Aurora_deployment -l select=2 -l walltime=2:00:00 -l filesystems=flare:daos_user -q lustre_scaling -v SPARKJOB_SCRIPTS_DIR=/lus/flare/projects/Aurora_deployment/spark/spark-job/bin,SPARKJOB_CONFIG_DIR=/home/damon/spark_work_home,SPARKJOB_INTERACTIVE=1,SPARKJOB_SCRIPTMODE=0,SPARKJOB_OUTPUT_DIR=/home/damon/spark_work_home,SPARKJOB_SEPARATE_MASTER=0,SPARKJOB_OAPML=1,SPARKJOB_DAOS=1,SPARKJOB_ARG= -o /home/damon/spark_work_home -e /home/damon/spark_work_home /lus/flare/projects/Aurora_deployment/spark/spark-job/bin/start-spark.sh |
| 93 | +# Submitted |
| 94 | +SPARKJOB_JOBID=67690.amn-0001 |
| 95 | +need spark job host: 0 |
| 96 | +Waiting for Spark to launch by checking /home/damon/spark_work_home/67690.amn-0001 |
| 97 | +# Spark is now running (SPARKJOB_JOBID=67690.amn-0001) on: |
| 98 | +# x1921c1s2b0n0.hostmgmt2000.cm.americas.sgi.com |
| 99 | +# x1921c1s5b0n0.hostmgmt2000.cm.americas.sgi.com |
| 100 | +declare -x SPARK_MASTER="spark://x1921c1s2b0n0.hostmgmt.cm.americas.sgi.com:7077" |
| 101 | +# Spawning bash on host: x1921c1s2b0n0 |
| 102 | +Adding oap mllib to additional class path |
| 103 | +need spark job host: 1 |
| 104 | + |
| 105 | +sourced /lus/flare/projects/Aurora_deployment/spark/spark-job/bin/env_gpu.sh |
| 106 | +need spark job host: 1 |
| 107 | +loading DAOS module |
| 108 | +sourced /home/damon/spark_work_home/env_aurora.sh |
| 109 | +sourced /lus/flare/projects/Aurora_deployment/spark/spark-job/bin/env_spark_daos.sh |
| 110 | +GPU Options: --conf spark.oap.mllib.device=GPU |
| 111 | +sourced /home/damon/spark_work_home/env_local.sh |
| 112 | +``` |
| 113 | +After entering the interactive interface, you can run command line. |
| 114 | +```shell |
| 115 | +$ spark-submit /lus/flare/projects/Aurora_deployment/spark/spark-job/example/kmeans-pyspark.py daos://Intel2/hadoop_fs/HiBench/Kmeans/Input/36000000 |
| 116 | +``` |
| 117 | +
|
| 118 | +### Check Your Job Status |
| 119 | +After submitting jobs, Aurora will allocate resources and schedule your jobs. You can check your job with command qstat. |
| 120 | +
|
| 121 | +For the example above, the output will be like this: |
| 122 | +```shell |
| 123 | +// The job is waiting for resources. |
| 124 | +$ qstat 27643.amn-0001 |
| 125 | +Job id Name User Time Use S Queue |
| 126 | +---------------- ---------------- ---------------- -------- - ----- |
| 127 | +27643.amn-0001 start-spark.sh damon 0 Q lustre_scaling |
| 128 | + |
| 129 | +// The job is running. |
| 130 | +$ qstat 27643.amn-0001 |
| 131 | +Job id Name User Time Use S Queue |
| 132 | +---------------- ---------------- ---------------- -------- - ----- |
| 133 | +27643.amn-0001 start-spark.sh damon 0 R lustre_scaling |
| 134 | + |
| 135 | +// The job is end. |
| 136 | +$ qstat 27643.amn-0001 |
| 137 | +Job id Name User Time Use S Queue |
| 138 | +---------------- ---------------- ---------------- -------- - ----- |
| 139 | +27643.amn-0001 start-spark.sh damon 0 E lustre_scaling |
| 140 | +``` |
| 141 | +To check your job status periodically, you can use watch command. After the job ends, you can use Ctrl-C to stop "watch" process. |
| 142 | +```shell |
| 143 | +$ watch -n 2 qstat 27643.amn-0001 |
| 144 | +Job id Name User Time Use S Queue |
| 145 | +---------------- ---------------- ---------------- -------- - ----- |
| 146 | +27643.amn-0001 start-spark.sh damon 0 Q lustre_scaling |
| 147 | +``` |
| 148 | +
|
| 149 | +### Check Output of Your Job |
| 150 | +The output of your job is stored in your working directory and here is an example. |
| 151 | +```text |
| 152 | +spark_work_home |
| 153 | +├── 27643.amn-0001/ |
| 154 | +│ └── conf/ // Spark conf like spark-env.sh, spark-default.xml and slaves |
| 155 | +│ └── logs/ // Spark master and worker logs |
| 156 | +│ └── app-20230301080732-0000/ // After Spark submitted successfully, you will get application ID that contains stderr and stdout log for each executor. |
| 157 | +├── 27643.amn-0001.ER // All stderr output for your job. |
| 158 | +└── 27643.amn-0001.OU // All stdout output for your job. |
| 159 | +``` |
| 160 | +
|
| 161 | +The results of dense K-Means example are at the end of **27643.amn-0001.OU**. |
| 162 | +
|
| 163 | +### Command Line Help |
| 164 | +submit-spark.sh provides a few arguments. You can get detailed info about the arguments and examples with below command. |
| 165 | +```shell |
| 166 | +$ /lus/flare/projects/Aurora_deployment/spark/spark-job/bin/submit-spark.sh -h |
| 167 | +Spark Job Submit Tool v1.0.3 for Aurora |
| 168 | + |
| 169 | +Usage: |
| 170 | + submit-spark.sh [options] JOBFILE [arguments ...] |
| 171 | + |
| 172 | +JOBFILE can be: |
| 173 | + script.py pyspark scripts |
| 174 | + run-example examplename run Spark examples, like "SparkPi" |
| 175 | + --class classname example.jar run Spark job "classname" wrapped in example.jar |
| 176 | + shell-script.sh when option "-s" is specified |
| 177 | + |
| 178 | +Required options: |
| 179 | + -l walltime=<v> Max run time, e.g., -l walltime=30:00 for running 30 minutes at most |
| 180 | + -l select=<v> Job node count, e.g., -l select=2 for requesting 2 nodes |
| 181 | + -l filesystems=<v> Filesystem type, e.g., -l filesystems=flare:daos_user for requesting flare and daos filesystems |
| 182 | + -q QUEUE Queue name |
| 183 | + |
| 184 | +Optional options: |
| 185 | + -A PROJECT Allocation name |
| 186 | + -o OUTPUTDIR Directory for job output files (default: current dir) |
| 187 | + -s Enable shell script mode |
| 188 | + -y Spark master uses a separate node |
| 189 | + -I Start an interactive ssh session |
| 190 | + -b Prefer Spark built-in mllib to default OAP mllib |
| 191 | + -n Run without DAOS loaded |
| 192 | + -h Print this help messages |
| 193 | +Example: |
| 194 | + submit-spark.sh -A Aurora_deployment -l walltime=60 -l select=2 -l filesystems=flare:daos_user -q workq kmeans-pyspark.py daos://pool0/cont1/kmeans/input/libsvm 10 |
| 195 | + submit-spark.sh -A Aurora_deployment -l walltime=30 -l select=1 -l filesystems=flare:daos_user -q workq -o output-dir run-example SparkPi |
| 196 | + submit-spark.sh -A Aurora_deployment -I -l walltime=30 -l select=2 -l filesystems=flare:daos_user -q workq --class com.intel.jlse.ml.KMeansExample example/jlse-ml-1.0-SNAPSHOT.jar daos://pool0/cont1/kmeans/input/csv csv 10 |
| 197 | + submit-spark.sh -A Aurora_deployment -l walltime=30 -l select=1 -l filesystems=flare:daos_user -q workq -s example/test_script.sh job.log |
| 198 | +``` |
| 199 | +#### Bash Mode: |
| 200 | +```shell |
| 201 | +$ /lus/flare/projects/Aurora_deployment/spark/spark-job/bin/submit-spark.sh -A Aurora_deployment -l walltime=2:00:00 -l select=1 -l filesystems=flare:daos_user -q lustre_scaling -s test_script.sh |
| 202 | +``` |
| 203 | +#### Interactive Mode: |
| 204 | +```shell |
| 205 | +$ /lus/flare/projects/Aurora_deployment/spark/spark-job/bin/submit-spark.sh -A Aurora_deployment -l walltime=2:00:00 -l select=1 -l filesystems=flare:daos_user -q lustre_scaling |
| 206 | +$ ./test_script.sh |
| 207 | +``` |
| 208 | +
|
| 209 | +## Configurations |
| 210 | +
|
| 211 | +This section introduces how to configure SJST and make Spark work as you want. By default, SJST provides **env_aurora.sh** and **env_local.sh** that already configured resources properly for Aurora. If you want to customize resources, the following information may help you. |
| 212 | +
|
| 213 | +### env_aurora.sh |
| 214 | +From here you can change Spark worker's CPU, memory and GPU resources. |
| 215 | +| Configuration Option | Description | |
| 216 | +|----------------------|-------------| |
| 217 | +| SPARK_WORKER_CORES | Maximum number of cores for each worker | |
| 218 | +| SPARK_WORKER_MEMORY | Maximum memory for each worker | |
| 219 | +| GPU_RESOURCE_FILE | Path to resources file which is used to find various resources while worker starting up | |
| 220 | +| GPU_WORKER_AMOUNT | Amount of GPUs resource each worker to use. | |
| 221 | +### env_local.sh |
| 222 | +From here you can change Spark and OAP MLlib configurations. For more Spark configuration details, you can refer to Spark configuration docs. |
| 223 | +| Configuration Option | Description | |
| 224 | +|----------------------|-------------| |
| 225 | +| spark.oap.mllib.device | Select compute device as CPU or GPU. default value is GPU. (Currently OAP MLlib Jar only supports GPU, CPU support will be added in the future releases.) | |
| 226 | +| spark.executor.cores | The number of cores to use on each executor. | |
| 227 | +| spark.driver.memory | Amount of memory to use for the driver process. | |
| 228 | +| spark.executor.memory | Amount of memory to use per executor process. | |
| 229 | +| spark.executor.instances | If `--num-executors` (or `spark.executor.instances`) is set and larger than this value, it will be used as the initial number of executors. | |
| 230 | +| spark.worker.resourcesFile | Path to resource file which is used to find various resources while worker starting up. default value is $GPU_RESOURCE_FILE | |
| 231 | +| spark.worker.resource.gpu.amount | Amount of a particular resource to use on the worker. default value is $GPU_WORKER_AMOUNT | |
| 232 | +| spark.executor.resource.gpu.amount | Amount of GPUs resource for each executor | |
| 233 | +| spark.task.resource.gpu.amount | Amount of GPUs resource for each task | |
| 234 | +| spark.driver.extraJavaOptions | A string of extra JVM options to pass to driver. | |
| 235 | +| spark.executor.extraJavaOptions | A string of extra JVM options to pass to executors. | |
| 236 | +| spark.executor.extraClassPath | Extra classpath entries to prepend to the classpath of executors. OAP MLlib Jar path is added here. | |
| 237 | +| spark.driver.extraClassPath | Extra classpath entries to prepend to the classpath of the driver. OAP MLlib Jar path is added here. | |
| 238 | +| spark.eventLog.enabled | Whether to log Spark events. | |
| 239 | +| spark.eventLog.dir | Base directory in which Spark events are logged. | |
| 240 | + |
| 241 | +## Example Configuration |
| 242 | +Here is an example that prioritizes making the dense K-Means example most efficient. |
| 243 | + |
| 244 | +Suppose that we have a cluster as bellow. |
| 245 | +| Cluster | Parameters | |
| 246 | +|----------------------|-------------| |
| 247 | +| Node | 3 | |
| 248 | +| Memory per node | 1024G | |
| 249 | +| Physical cores | 104 | |
| 250 | +| Logical cores | 208 | |
| 251 | +| GPUs | 12 | |
| 252 | + |
| 253 | +We want to ensure each spark executor corresponds to one GPU which is most efficient. Therefore, we want to launch 12 executors and make each executor get enough resources to run the dense K-Means example. We will set 96 physical cores and 840G memory for the worker (leaving some cores and memory for the OS). Meanwhile we set each executor uses 8(96/12) logical cores and 70G (840G/12) memory. Then we come up with the configuration files as bellow. |
| 254 | + |
| 255 | +#### env_aurora.sh |
| 256 | +```shell |
| 257 | +# set Spark Worker resources |
| 258 | +export SPARK_WORKER_CORES=96 # Maximum number of cores that each worker can get |
| 259 | +export SPARK_WORKER_MEMORY=840G # Maximum memory that each worker can get |
| 260 | + |
| 261 | +# set GPU options |
| 262 | +export GPU_RESOURCE_FILE=$SPARKJOB_CONFIG_DIR/gpuResourceFile_aurora.json # Path to resources file which is used to find various resources while worker starting up |
| 263 | +export GPU_WORKER_AMOUNT=12 |
| 264 | + |
| 265 | +``` |
| 266 | + |
| 267 | +#### env_local.sh |
| 268 | +```shell |
| 269 | +spark.driver.memory 20g # each driver has 20g memory |
| 270 | +spark.executor.cores 8 # each executor has 8 (96/12) cores |
| 271 | +spark.executor.memory 70g # each executor has 70g (840G/12) memory |
| 272 | +spark.worker.resource.gpu.amount 12 # each node has 1 worker and each worker has 12 GPUs |
| 273 | +spark.executor.resource.gpu.amount 1 # each executor has 1 GPU |
| 274 | +spark.executor.instances 36 # 36 (12 x 3) executors in total |
| 275 | +spark.task.resource.gpu.amount 0.125 # Each task uses 1/8 of a GPU. This value should be determined based on your **target task concurrency**. For example, if each node has 12 GPUs and you want 96 concurrent GPU tasks per node, then spark.task.resource.gpu.amount = 12 / 96 = 0.125. In this case, 8 tasks share each GPU, allowing better parallelism for lightweight stages like preprocessing. |
| 276 | +``` |
0 commit comments