# Kotlin Spark API

Your next API to work with [Spark](https://spark.apache.org/).

We are looking to have this become part of the https://github.com/apache/spark repository. Consider this beta-quality software.

## Goal

This project adds a missing layer of compatibility between [Kotlin](https://kotlinlang.org/) and [Spark](https://spark.apache.org/).

Although Kotlin is already compatible with the existing API, Kotlin developers may want to use familiar features such as data classes and lambda expressions written as simple curly-brace blocks or method references.

## Non-goals

We do not aim to replace any currently supported language or to add Kotlin-specific functionality to the other APIs.

## Installation

Currently, there are no kotlin-spark-api artifacts in Maven Central, but you can obtain a copy from [JitPack](https://jitpack.io/#JetBrains/kotlin-spark-api).

JitPack supports `Maven`, `Gradle`, `SBT`, and `Leiningen`; a Gradle sketch is shown after the Maven snippet below.

This project does not force you to use any specific version of Spark, but it has only been tested with Spark `3.0.0-preview2`.
We believe it can work with Spark `2.4.5`, but we cannot guarantee that.

If you're using Maven, you'll have to add the following to your `pom.xml`:

```xml
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
<dependency>
    <groupId>com.github.JetBrains.kotlin-spark-api</groupId>
    <artifactId>kotlin-spark-api</artifactId>
    <version>${kotlin-spark-api.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>2.4.5</version>
</dependency>
```
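
If you use Gradle instead, the Maven coordinates above translate to roughly the following `build.gradle.kts` snippet. This is only a sketch: the coordinates are copied from the Maven example, and the version placeholder is hypothetical, so check the JitPack page for the exact values.

```kotlin
// build.gradle.kts (Gradle Kotlin DSL): a sketch mirroring the Maven snippet above.
repositories {
    maven("https://jitpack.io")  // JitPack repository, as in the <repositories> block above
}

dependencies {
    // Same coordinates as the Maven example; replace <version> with a release tag from JitPack.
    implementation("com.github.JetBrains.kotlin-spark-api:kotlin-spark-api:<version>")
    implementation("org.apache.spark:spark-sql_2.12:2.4.5")
}
```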

`core` is compiled against Scala `2.12`, which means you have to use a `2.12` build of Spark if you want to try out this project.

## Usage

The first (and hopefully last) thing you need to do is add the following import to your Kotlin file:

```kotlin
import org.jetbrains.spark.api.*
```

Then you can create a `SparkSession`:

```kotlin
val spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Simple Application")
    .orCreate
```

To create a `Dataset` you can call the `toDS` method:

```kotlin
spark.toDS("a" to 1, "b" to 2)
```

This produces a `Dataset<Pair<String, Int>>`. There are a couple more `toDS` methods which accept different arguments.
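
For example, the resulting typed `Dataset` can be transformed with the same `map`, `filter`, and `show` calls used elsewhere in this README; the snippet below is just an illustrative sketch:

```kotlin
// A minimal sketch building on the vararg toDS shown above; nothing beyond
// the toDS, map, filter, and show calls from this README is assumed.
spark.toDS("a" to 1, "b" to 2, "c" to 3)
    .map { it.first to it.second * 10 }  // Dataset<Pair<String, Int>> in, same shape out
    .filter { it.second > 10 }           // keep only pairs whose scaled value exceeds 10
    .show()
```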

The API also provides several aliases, such as `leftJoin`, `rightJoin`, and so on. These are null-safe by design. For example, `leftJoin` is aware of nullability and returns `Dataset<Pair<LEFT, RIGHT?>>`.
Note that we force `RIGHT` to be nullable so that you, as a developer, are able to handle rows with no match.
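
To illustrate, here is a hypothetical sketch of how that nullability shows up in code. Only the `Dataset<Pair<LEFT, RIGHT?>>` result type is documented above; the exact `leftJoin` parameters (the right-hand `Dataset` plus a join `Column`) and the data classes are assumptions made for this example:

```kotlin
// Hypothetical example types; any types encodable as a Dataset would do.
data class Customer(val id: Int, val name: String)
data class Order(val customerId: Int, val total: Double)

withSpark {
    val customers = dsOf(Customer(1, "Alice"), Customer(2, "Bob"))
    val orders = dsOf(Order(1, 9.99))

    // Assumption: leftJoin takes the right-hand Dataset and a join condition Column.
    customers.leftJoin(orders, customers.col("id").equalTo(orders.col("customerId")))
        .map { it.first.name to (it.second?.total ?: 0.0) }  // RIGHT is nullable, so handle the missing match
        .show()
}
```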

We know that `NullPointerException`s are hard to debug in Spark, and we are trying hard to make them as rare as possible.

## Useful helper methods

### `withSpark`

We provide a useful function, `withSpark`, which accepts everything needed to run Spark (properties, name, master location, and so on) together with a block of code to execute inside the Spark context.

After the work block ends, `spark.stop()` is called automatically.

```kotlin
withSpark {
    dsOf(1, 2)
        .map { it to it }
        .show()
}
```

`dsOf` is just one more way to create a `Dataset` (here, a `Dataset<Int>`) from varargs.

### `withCached`

It can easily happen that we need to fork our computation into several paths. To compute things only once, we should call the `cache` method. However, it is then hard to keep track of when we are working with the cached `Dataset` and when we are not.
It is also easy to forget to unpersist cached data, which can break things unexpectedly or take more memory than intended.

To solve these problems, we introduce the `withCached` function:

```kotlin
withSpark {
    dsOf(1, 2, 3, 4, 5)
        .map { it to (it + 2) }
        .withCached {
            showDS()

            filter { it.first % 2 == 0 }.showDS()
        }
        .map { c(it.first, it.second, (it.first + it.second) * 2) }
        .show()
}
```

Here we show the cached `Dataset` for debugging purposes and then filter it. The `filter` method returns the filtered `Dataset`, after which the cached `Dataset` is unpersisted, so we have more memory available for calling the `map` method and collecting the resulting `Dataset`.

## Examples

For more, check out the [examples](https://github.com/JetBrains/kotlin-spark-api/tree/master/examples/src/main/kotlin/org/jetbrains/spark/api/examples) module.
To get up and running quickly, check out this [tutorial](quick-start-guide.md).

## Issues and feedback

Issues and any feedback are very welcome in the `Issues` section of this repository.

If you find that we have missed some important features, let us know!