
Commit 701bd2a

Merge pull request apache#246 from palantir/resync-kube
[NOSQUASH] Resync from apache-spark-on-k8s upstream
2 parents 692e6f8 + 50c690d commit 701bd2a

File tree

34 files changed (+1202, -102 lines)


core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala

Lines changed: 3 additions & 1 deletion
```diff
@@ -653,7 +653,9 @@ object SparkSubmit extends CommandLineUtils {
     if (args.isPython) {
       childArgs ++= Array("--primary-py-file", args.primaryResource)
       childArgs ++= Array("--main-class", "org.apache.spark.deploy.PythonRunner")
-      childArgs ++= Array("--other-py-files", args.pyFiles)
+      if (args.pyFiles != null) {
+        childArgs ++= Array("--other-py-files", args.pyFiles)
+      }
     } else {
       childArgs ++= Array("--primary-java-resource", args.primaryResource)
       childArgs ++= Array("--main-class", args.mainClass)
```
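The guard matters because `args.pyFiles` is null when no `--py-files` option is supplied, so the unconditional append would forward a literal null as the value of `--other-py-files`. A minimal, self-contained sketch of the difference (the names here are illustrative, not the real `SparkSubmit` code):

```scala
import scala.collection.mutable.ArrayBuffer

object PyFilesGuardSketch {
  def main(args: Array[String]): Unit = {
    // Stands in for SparkSubmitArguments.pyFiles, which is null when --py-files is not given.
    val pyFiles: String = null
    val childArgs = ArrayBuffer[String]()

    // Without the guard, the child process would receive a literal null value:
    //   childArgs ++= Array("--other-py-files", pyFiles)
    // With the guard, the flag is simply omitted when no py-files were supplied.
    if (pyFiles != null) {
      childArgs ++= Array("--other-py-files", pyFiles)
    }

    println(childArgs.mkString(" ")) // prints an empty line: no spurious "--other-py-files null"
  }
}
```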

docs/running-on-kubernetes.md

Lines changed: 16 additions & 0 deletions
```diff
@@ -770,6 +770,22 @@ from the other deployment modes. See the [configuration page](configuration.html
     <code>myIdentifier</code>. Multiple node selector keys can be added by setting multiple configurations with this prefix.
   </td>
 </tr>
+<tr>
+  <td><code>spark.executorEnv.[EnvironmentVariableName]</code></td>
+  <td>(none)</td>
+  <td>
+    Add the environment variable specified by <code>EnvironmentVariableName</code> to
+    the Executor process. The user can specify multiple of these to set multiple environment variables.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.kubernetes.driverEnv.[EnvironmentVariableName]</code></td>
+  <td>(none)</td>
+  <td>
+    Add the environment variable specified by <code>EnvironmentVariableName</code> to
+    the Driver process. The user can specify multiple of these to set multiple environment variables.
+  </td>
+</tr>
 </table>
 
 
```
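For reference, a minimal sketch of setting these two properties from application code; the property names come from the table above, while the environment-variable names and values are made up for illustration:

```scala
import org.apache.spark.SparkConf

object K8sEnvVarsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("k8s-env-vars-example")
      // Exported into every executor pod's environment.
      .set("spark.executorEnv.MY_EXECUTOR_VAR", "some-value")
      // Exported into the driver pod's environment (cluster mode).
      .set("spark.kubernetes.driverEnv.MY_DRIVER_VAR", "another-value")
    println(conf.toDebugString)
  }
}
```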

pom.xml

Lines changed: 1 addition & 1 deletion
```diff
@@ -2358,7 +2358,7 @@
       <reportsDirectory>${project.build.directory}/surefire-reports</reportsDirectory>
       <junitxml>.</junitxml>
       <filereports>SparkTestSuite.txt</filereports>
-      <argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize} ${extraScalaTestArgs}</argLine>
+      <argLine>-ea -Xmx3g -XX:ReservedCodeCacheSize=${CodeCacheSize} ${extraScalaTestArgs}</argLine>
       <stderr/>
       <environmentVariables>
         <!--
```

resource-managers/kubernetes/README.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -64,10 +64,12 @@ build/mvn integration-test \
 # Running against an arbitrary cluster
 
 In order to run against any cluster, use the following:
+```sh
 build/mvn integration-test \
   -Pkubernetes -Pkubernetes-integration-tests \
   -pl resource-managers/kubernetes/integration-tests -am
   -DextraScalaTestArgs="-Dspark.kubernetes.test.master=k8s://https://<master> -Dspark.docker.test.driverImage=<driver-image> -Dspark.docker.test.executorImage=<executor-image>"
+```
 
 # Preserve the Minikube VM
 
```

Lines changed: 29 additions & 0 deletions
---
layout: global
title: Kubernetes Implementation of the External Shuffle Service
---

# External Shuffle Service

The `KubernetesExternalShuffleService` was added to allow Spark to use dynamic allocation when
running in Kubernetes. The shuffle service is responsible for persisting shuffle files beyond the
lifetime of the executors, allowing the number of executors to scale up and down without losing
computation.

The chosen implementation is a DaemonSet that runs a shuffle-service pod on each node.
Shuffle-service pods and executor pods that land on the same node share disk using hostPath
volumes. Each executor must know the IP address of the shuffle-service pod that shares disk
with it.

The user specifies which shuffle-service pods the executors of a particular Spark job should use
through two new properties:

* `spark.kubernetes.shuffle.service.labels`
* `spark.kubernetes.shuffle.namespace`

`KubernetesClusterSchedulerBackend` is aware of the shuffle-service pods in a particular namespace
and the nodes they run on. It uses this data to configure the executor pods to connect with the
shuffle services that are co-located with them on the same node.

There is additional logic in the `KubernetesExternalShuffleService` to watch the Kubernetes API,
detect failures, and proactively clean up files in those error cases.
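As a rough illustration, a job opting into this mode might set the two properties above alongside the standard dynamic-allocation switches; the namespace and label-selector values below are made-up examples, not documented defaults:

```scala
import org.apache.spark.SparkConf

object ShuffleServiceConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("dynamic-allocation-on-k8s")
      // Standard dynamic-allocation switches (not specific to Kubernetes).
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      // Point executors at the shuffle-service pods described above.
      .set("spark.kubernetes.shuffle.namespace", "default")
      .set("spark.kubernetes.shuffle.service.labels", "app=spark-shuffle-service")
    println(conf.toDebugString)
  }
}
```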
Lines changed: 49 additions & 0 deletions
---
layout: global
title: Kubernetes Implementation of the Spark Scheduler Backend
---

# Scheduler Backend

The general idea is to run Spark drivers and executors inside Kubernetes [Pods](https://kubernetes.io/docs/concepts/workloads/pods/pod/).
Pods are a co-located and co-scheduled group of one or more containers run in a shared context. The main component is `KubernetesClusterSchedulerBackend`,
an implementation of `CoarseGrainedSchedulerBackend`, which manages allocating and destroying executors via the Kubernetes API.
There are auxiliary and optional components: `ResourceStagingServer` and `KubernetesExternalShuffleService`, which serve specific purposes described further below.

The scheduler backend is invoked in the driver associated with a particular job. The driver may run outside the cluster (client mode) or within it (cluster mode).
The scheduler backend manages [pods](http://kubernetes.io/docs/user-guide/pods/) for each executor.
The executor code runs within a Kubernetes pod, but remains unmodified and unaware of the orchestration layer.
When a job is running, the scheduler backend configures and creates executor pods with the following properties:

- The pod's container runs a pre-built Docker image containing a Spark distribution (with Kubernetes integration) and
  invokes the Java runtime with the `CoarseGrainedExecutorBackend` main class.
- The scheduler backend specifies environment variables on the executor pod to configure its runtime,
  particularly its JVM options, number of cores, heap size, and the driver's hostname.
- The executor container has [resource limits and requests](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container)
  that are set in accordance with the resource limits specified in the Spark configuration (executor.cores and executor.memory in the application's SparkConf); a sketch follows this list.
- The executor pods may also be launched into a particular [Kubernetes namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/),
  or target a particular subset of nodes in the Kubernetes cluster, based on the Spark configuration supplied.
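As a sketch of that mapping, the following derives a container memory request from an executor memory setting plus an overhead, assuming the usual max(10% of executor memory, 384 MiB) overhead rule; this is illustrative and not a quote of the backend's actual code:

```scala
object ExecutorPodResourceSketch {
  // Assumed overhead rule: max(10% of executor memory, 384 MiB).
  // The real backend reads these values from SparkConf; the numbers below are examples.
  def podMemoryRequestMiB(executorMemoryMiB: Long): Long = {
    val overhead = math.max((0.10 * executorMemoryMiB).toLong, 384L)
    executorMemoryMiB + overhead
  }

  def main(args: Array[String]): Unit = {
    val executorMemoryMiB = 4096L // e.g. spark.executor.memory=4g
    val executorCores = 2         // e.g. spark.executor.cores=2
    println(s"container cpu request:    $executorCores")
    println(s"container memory request: ${podMemoryRequestMiB(executorMemoryMiB)}Mi")
  }
}
```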
## Requesting Executors

Spark requests new executors through the `doRequestTotalExecutors(numExecutors: Int)` method.
The scheduler backend keeps track of the request made by Spark core for the number of executors.

A separate kubernetes-pod-allocator thread handles the creation of new executor pods with appropriate throttling and monitoring.
This indirection is required because the Kubernetes API server accepts requests for new executor pods optimistically, with the
anticipation of being able to eventually run them. However, it is undesirable to have a very large number of pods that cannot be
scheduled and stay pending within the cluster. Hence, the kubernetes-pod-allocator uses the Kubernetes API to decide whether to
submit new requests for executors based on whether previous pod creation requests have completed. This gives control over how
fast a job scales up (which can be configured), and helps prevent Spark jobs from DOS-ing the Kubernetes API server with pod creation requests.
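A much-simplified sketch of that allocation loop, with hypothetical `countPendingPods`, `countRunningPods`, and `createExecutorPod` helpers standing in for the real backend's Kubernetes API calls and bookkeeping:

```scala
import java.util.concurrent.{Executors, TimeUnit}

object PodAllocatorSketch {
  // Hypothetical stubs standing in for calls to the Kubernetes API.
  def countPendingPods(): Int = 0
  def countRunningPods(): Int = 0
  def createExecutorPod(): Unit = println("requested one executor pod")

  @volatile var requestedTotal = 0 // updated when doRequestTotalExecutors is called

  def main(args: Array[String]): Unit = {
    requestedTotal = 5
    val batchSize = 2 // caps how fast the job scales up per allocation round
    val allocator = Executors.newSingleThreadScheduledExecutor()
    allocator.scheduleWithFixedDelay(new Runnable {
      def run(): Unit = {
        // Only ask for more pods once previously requested pods are no longer pending.
        if (countPendingPods() == 0) {
          val missing = requestedTotal - countRunningPods()
          (1 to math.min(missing, batchSize)).foreach(_ => createExecutorPod())
        }
      }
    }, 0, 1, TimeUnit.SECONDS)
    Thread.sleep(3000)
    allocator.shutdownNow()
  }
}
```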
## Destroying Executors

Spark requests deletion of executors through the `doKillExecutors(executorIds: List[String])` method.

The inverse behavior is required in the implementation of `doKillExecutors()`. When the executor
allocation manager decides to remove executors from the application, the scheduler should find the
pods that are running the corresponding executors and tell the API server to stop those pods.
It's worth noting that this code does not have to decide which executors should be
removed. When `doKillExecutors()` is called, the executors that are to be removed have already been
selected by the `CoarseGrainedSchedulerBackend` and `ExecutorAllocationManager`.
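A minimal sketch of that lookup-and-delete step; the `executorIdToPodName` map and `deletePod` helper are hypothetical stand-ins for the backend's bookkeeping and its Kubernetes client calls, and the return type is simplified:

```scala
import scala.collection.mutable

object KillExecutorsSketch {
  // Hypothetical bookkeeping: executor ID -> pod name, maintained as pods are created.
  val executorIdToPodName = mutable.Map("1" -> "spark-exec-1", "2" -> "spark-exec-2")

  // Hypothetical stand-in for a Kubernetes API call that deletes a pod.
  def deletePod(podName: String): Unit = println(s"deleting pod $podName")

  // Shape of the scheduler-backend hook described above (return type simplified).
  def doKillExecutors(executorIds: List[String]): Boolean = {
    executorIds.foreach { id =>
      executorIdToPodName.remove(id).foreach(deletePod)
    }
    true
  }

  def main(args: Array[String]): Unit = {
    doKillExecutors(List("2")) // prints: deleting pod spark-exec-2
  }
}
```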

0 commit comments
