
Commit 6073346

Store README as site's index
1 parent d7b77fd commit 6073346

1 file changed: +133 -0 lines changed

docs/index.md

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
# Kotlin Spark API

Your next API to work with [Spark](https://spark.apache.org/).

We are looking to have this as a part of the https://github.com/apache/spark repository. Consider this beta-quality software.
## Goal

This project adds a missing layer of compatibility between [Kotlin](https://kotlinlang.org/) and [Spark](https://spark.apache.org/).

Despite Kotlin being fully compatible with the existing API, Kotlin developers may want to use familiar features like data classes and lambda expressions, written as simple expressions in curly braces or as method references.
## Non-goals

There is no goal to replace any currently supported language or to extend the other APIs with functionality for supporting the Kotlin language.
17+
18+
## Installation
19+
20+
Currently, there are no kotlin-spark-api artifacts in maven central, but you can obtain a copy using JitPack here: [![](https://jitpack.io/v/JetBrains/kotlin-spark-api.svg)](https://jitpack.io/#JetBrains/kotlin-spark-api)
21+
22+
There is support for `Maven`, `Gradle`, `SBT`, and `leinengen` on JitPack.
23+
24+
This project does not force you to use any specific version of Spark, but it has only been tested it with spark `3.0.0-preview2`.
25+
We believe it can work with Spark `2.4.5` but we cannot guarantee that.
26+
27+
So if you're using Maven you'll have to add the following into your `pom.xml`:
```xml
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

<dependencies>
    <dependency>
        <groupId>com.github.JetBrains.kotlin-spark-api</groupId>
        <artifactId>kotlin-spark-api</artifactId>
        <version>${kotlin-spark-api.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>2.4.5</version>
    </dependency>
</dependencies>
```
`core` is compiled against Scala version `2.12`, which means you have to use a `2.12` build of Spark if you want to try out this project.
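
Since JitPack also serves Gradle builds (as noted above), the equivalent setup in a `build.gradle.kts` might look roughly like the sketch below. This is only an illustration: the dependency coordinates simply mirror the Maven example, and `<version>` is a placeholder rather than a confirmed release.

```kotlin
// build.gradle.kts: a minimal sketch; coordinates mirror the Maven example above,
// and <version> is a placeholder, not a verified release number.
repositories {
    maven { url = uri("https://jitpack.io") }
}

dependencies {
    implementation("com.github.JetBrains.kotlin-spark-api:kotlin-spark-api:<version>")
    implementation("org.apache.spark:spark-sql_2.12:2.4.5")
}
```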
## Usage

The first (and hopefully last) thing you need to do is add the following import to your Kotlin file:
53+
54+
```kotlin
55+
import org.jetbrains.spark.api.*
56+
```
Then you can create a SparkSession:

```kotlin
val spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Simple Application").orCreate
```
To create a Dataset you can call the `toDS` method:

```kotlin
spark.toDS("a" to 1, "b" to 2)
```
Indeed, this produces `Dataset<Pair<String, Int>>`. There are a couple more `toDS` methods which accept different arguments.
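
As a hedged illustration of that point, a list of data-class instances could presumably be turned into a typed `Dataset` as well; the specific overload used here (a `toDS` that accepts a `List`) is an assumption based on the sentence above, not something confirmed on this page.

```kotlin
// Hypothetical sketch: assumes a toDS overload that accepts a List, as suggested above.
data class Score(val name: String, val value: Int)

val scores = spark.toDS(listOf(Score("a", 1), Score("b", 2)))
// scores is a Dataset<Score>
scores.show()
```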
Also, there are several aliases in the API, like `leftJoin`, `rightJoin`, etc. These are null-safe by design. For example, `leftJoin` is aware of nullability and returns `Dataset<Pair<LEFT, RIGHT?>>`.
Note that we are forcing `RIGHT` to be nullable so that you, as a developer, are able to handle this situation.
We know that `NullPointerException`s are hard to debug in Spark, and we are trying hard to make them as rare as possible.
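
To make the null-safety guarantee concrete, here is a rough sketch of handling the nullable right side of a `leftJoin`. The data classes and the join-condition argument are assumptions for illustration; only the returned `Dataset<Pair<LEFT, RIGHT?>>` shape comes from the text above.

```kotlin
// Illustrative only: the data classes and the exact leftJoin signature are assumptions.
data class Customer(val id: Int, val name: String)
data class Order(val customerId: Int, val amount: Double)

val customers = spark.toDS(Customer(1, "Alice"), Customer(2, "Bob"))
val orders = spark.toDS(Order(1, 9.99))

val joined = customers.leftJoin(orders, customers.col("id").equalTo(orders.col("customerId")))
// joined is a Dataset<Pair<Customer, Order?>>, so unmatched customers must be handled explicitly:
val totals = joined.map { (customer, order) -> customer.name to (order?.amount ?: 0.0) }
```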
## Useful helper methods

### `withSpark`

We provide you with the useful function `withSpark`, which accepts everything that may be needed to run Spark: properties, name, master location, and so on. It also accepts a block of code to execute inside the Spark context.
After the work block ends, `spark.stop()` is called automatically.

```kotlin
withSpark {
    dsOf(1, 2)
        .map { it to it }
        .show()
}
```
`dsOf` is just one more way to create a `Dataset` (here, a `Dataset<Int>`) from varargs.
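
Since `withSpark` is described above as also accepting properties, a name, and a master location, a configured call might look roughly like the sketch below. The parameter names (`props`, `master`, `appName`) are assumptions for illustration, not confirmed from this page.

```kotlin
// Parameter names are assumed; the text above only states that withSpark accepts
// properties, an application name, a master location, and a block of code.
withSpark(
    props = mapOf("spark.sql.shuffle.partitions" to "4"),
    master = "local[2]",
    appName = "Pairs demo"
) {
    dsOf("a", "b", "a")
        .map { it to 1 }
        .show()
}
```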
### `withCached`

It can easily happen that we need to fork our computation into several paths. To compute things only once, we should call the `cache` method. However, it then becomes hard to control when we're using the cached `Dataset` and when not.
It is also easy to forget to unpersist cached data, which can break things unexpectedly or take up more memory than intended.

To solve these problems, we introduce the `withCached` function:
```kotlin
withSpark {
    dsOf(1, 2, 3, 4, 5)
        .map { it to (it + 2) }
        .withCached {
            showDS()

            filter { it.first % 2 == 0 }.showDS()
        }
        .map { c(it.first, it.second, (it.first + it.second) * 2) }
        .show()
}
```
Here we're showing the cached `Dataset` for debugging purposes and then filtering it. The `filter` method returns a filtered `Dataset`, and then the cached `Dataset` is unpersisted, so we have more memory to call the `map` method and collect the resulting `Dataset`.
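
For comparison, here is a rough sketch of the bookkeeping that `withCached` saves you from writing by hand, using Spark's standard `cache()` and `unpersist()` calls; treat it as an illustration of what is being automated rather than an exact equivalent.

```kotlin
// Manual version of the lifetime management that withCached automates (illustrative sketch).
withSpark {
    val cached = dsOf(1, 2, 3, 4, 5)
        .map { it to (it + 2) }
        .cache()                     // mark the Dataset for caching
    try {
        cached.show()                                    // debug output
        cached.filter { it.first % 2 == 0 }.show()       // second path over the same data
    } finally {
        cached.unpersist()           // easy to forget; exactly what withCached handles for you
    }
}
```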
## Examples

For more, check out the [examples](https://github.com/JetBrains/kotlin-spark-api/tree/master/examples/src/main/kotlin/org/jetbrains/spark/api/examples) module.
To get up and running quickly, check out this [tutorial](quick-start-guide.md).
## Issues and feedback

Issues and any feedback are very welcome in the `Issues` section here.
If you find that we have missed some important features, let us know!
