# Kotlin for Apache® Spark™

Your next API to work with [Apache Spark](https://spark.apache.org/).

This project adds a missing layer of compatibility between [Kotlin](https://kotlinlang.org/) and [Apache Spark](https://spark.apache.org/).
It allows Kotlin developers to use familiar language features such as data classes and lambda expressions as simple expressions in curly braces or method references.

We have opened a Spark Project Improvement Proposal, [Kotlin support for Apache Spark](http://issues.apache.org/jira/browse/SPARK-32530), to work with the community towards getting Kotlin support as a first-class citizen in Apache Spark. We encourage you to voice your opinions and participate in the discussion.

## Table of Contents

- [Supported versions of Apache Spark](#supported-apache-spark)
- [Releases](#releases)
- [How to configure Kotlin for Apache Spark in your project](#how-to-configure-kotlin-for-apache-spark-in-your-project)
- [Kotlin for Apache Spark features](#kotlin-for-apache-spark-features)
  - [Creating a SparkSession in Kotlin](#creating-a-sparksession-in-kotlin)
  - [Creating a Dataset in Kotlin](#creating-a-dataset-in-kotlin)
  - [Null safety](#null-safety)
  - [withSpark function](#withspark-function)
  - [withCached function](#withcached-function)
  - [toList and toArray](#tolist-and-toarray-methods)

## How to configure Kotlin for Apache Spark in your project

You can add Kotlin for Apache Spark as a dependency to your project: `Maven`, `Gradle`, `SBT`, and `Leiningen` are supported.

Here's an example `pom.xml` (the coordinates and version below follow JitPack's conventions and are illustrative; see [JitPack](https://jitpack.io/#JetBrains/kotlin-spark-api) for the published artifacts):

```xml
<repositories>
    <repository>
        <id>jitpack</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

<dependency>
    <!-- illustrative JitPack coordinates; pick the release you need -->
    <groupId>com.github.JetBrains.kotlin-spark-api</groupId>
    <artifactId>core</artifactId>
    <version>${kotlin-spark-api.version}</version>
</dependency>
```
Note that `core` is compiled against Scala version `2.12`.
You can find a complete example with `pom.xml` and `build.gradle` in the [Quick Start Guide](docs/quick-start-guide.md).

Once you have configured the dependency, you only need to add the following import to your Kotlin file:

```kotlin
import org.jetbrains.spark.api.*
```

## Kotlin for Apache Spark features

### Creating a SparkSession in Kotlin

```kotlin
val spark = SparkSession
    .builder()
    .master("local[2]") // illustrative: any master URL and application name will do
    .appName("Simple Application")
    .getOrCreate()
```

### Creating a Dataset in Kotlin

```kotlin
spark.toDS("a" to 1, "b" to 2)
```

The example above produces `Dataset<Pair<String, Int>>`.

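There are a couple more `toDS` methods that accept different arguments, and since the API works with Kotlin data classes, you can create typed `Dataset`s from them as well. A minimal sketch, assuming a list-accepting overload:

```kotlin
data class Person(val name: String, val age: Int)

// hypothetical list-based overload of toDS, alongside the vararg version shown above
val people = spark.toDS(listOf(Person("Ann", 24), Person("Bob", 31)))
```
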
### Null safety

There are several aliases in the API, like `leftJoin`, `rightJoin`, etc. These are null-safe by design.
For example, `leftJoin` is aware of nullability and returns `Dataset<Pair<LEFT, RIGHT?>>`.
Note that we force `RIGHT` to be nullable so that you, as a developer, are able to handle this situation.
`NullPointerException`s are hard to debug in Spark, and we're doing our best to make them as rare as possible.
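
As a sketch of what this looks like in practice (the exact `leftJoin` signature is an assumption here: a right-hand `Dataset` plus a join column):

```kotlin
data class Employee(val name: String, val deptId: Int)
data class Dept(val id: Int, val title: String)

withSpark {
    val employees = dsOf(Employee("Ann", 1), Employee("Bob", 2))
    val depts = dsOf(Dept(1, "Engineering"))

    // the result is Dataset<Pair<Employee, Dept?>>: the right side is nullable,
    // so an unmatched row ("Bob") has to be handled explicitly
    employees.leftJoin(depts, employees.col("deptId").equalTo(depts.col("id")))
        .map { (employee, dept) -> employee.name to (dept?.title ?: "unassigned") }
        .show()
}
```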
### withSpark function

We provide the useful function `withSpark`, which accepts everything that may be needed to run Spark: properties, name, master location, and so on. It also accepts a block of code to execute inside the Spark context.
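
For example, a minimal sketch of a job that creates a `Dataset` from varargs and prints it:

```kotlin
withSpark {
    dsOf(1, 2)
        .map { it to it }
        .show()
}
```
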
`dsOf` is just one more way to create a `Dataset` (here, `Dataset<Int>`) from varargs.

### withCached function

It can easily happen that we need to fork our computation into several paths. To compute things only once, we should call the `cache` method. However, it becomes difficult to control when we're using the cached `Dataset` and when not. It is also easy to forget to unpersist cached data, which can break things unexpectedly or take up more memory than intended.

To solve these problems we've added the `withCached` function:

```kotlin
// a sketch of typical usage: show the cached Dataset for debugging, then filter it
withSpark {
    dsOf(1, 2, 3, 4, 5)
        .map { it to (it + 2) }
        .withCached {
            // the Dataset is cached for the duration of this block
            // and unpersisted once the block returns
            show()
            filter { it.first % 2 == 0 }
        }
        .map { it.first + it.second }
        .show()
}
```

Here we're showing the cached `Dataset` for debugging purposes and then filtering it. The `filter` method returns a filtered `Dataset`, after which the cached `Dataset` is unpersisted, so we have more memory to call the `map` method and collect the resulting `Dataset`.

### toList and toArray methods

For more idiomatic Kotlin code we've added `toList` and `toArray` methods to this API. You can still use the `collect` method as in the Scala API; however, the result should be cast to `Array`. This is because `collect` returns a Scala array, which is not the same as a Java/Kotlin one.

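A short sketch of the difference (the exact signatures are assumptions based on the description above):

```kotlin
withSpark {
    val ds = dsOf(1, 2, 3)

    val list = ds.toList()   // idiomatic Kotlin List
    val array = ds.toArray() // Java/Kotlin array

    @Suppress("UNCHECKED_CAST")
    val collected = ds.collect() as Array<Int> // collect() returns a Scala array, hence the cast
}
```
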
## Examples

For more, check out the [examples](https://github.com/JetBrains/kotlin-spark-api/tree/master/examples/src/main/kotlin/org/jetbrains/spark/api/examples) module.
To get up and running quickly, check out this [tutorial](docs/quick-start-guide.md).

## Reporting issues/Support

Please use [GitHub issues](https://github.com/JetBrains/kotlin-spark-api/issues) for filing feature requests and bug reports.
You are also welcome to join the [kotlin-spark channel](https://kotlinlang.slack.com/archives/C015B9ZRGJF) in the Kotlin Slack.

## Code of Conduct

This project and the corresponding community are governed by the [JetBrains Open Source and Community Code of Conduct](https://confluence.jetbrains.com/display/ALL/JetBrains+Open+Source+and+Community+Code+of+Conduct). Please make sure you read it.

## License
Kotlin for Apache Spark is licensed under the [Apache 2.0 License](LICENSE).