
Commit d27d8a8

[DOP-16941] Add example for onETL
1 parent 976fe41 commit d27d8a8

2 files changed: +57 -54 lines changed

docs/data_type_mappings.md

Lines changed: 3 additions & 4 deletions
@@ -4,6 +4,8 @@
 
 #### Customized Type Mappings with Spark Dialect Extension
 
+Primitive types:
+
 | ClickHouse Type (Read) | Spark Type      | ClickHouse Type (Write) | ClickHouse Type (Create)                        |
 |------------------------|-----------------|-------------------------|-------------------------------------------------|
 | `Bool`                 | `BooleanType`   | `Bool`                  | `Bool (Spark's default is UInt64)`              |
@@ -26,7 +28,7 @@
 | `DateTime64(6)`        | `TimestampType` | `DateTime64(6)`         | `DateTime64(6) (Spark's default is DateTime32)` |
 
 
-``Array(T)`` `->` ``ArrayType(T)``:
+``Array(T)`` `->` ``ArrayType(T)`` (without this extension, Spark does not support Arrays for the generic JDBC dialect):
 
 | ClickHouse Type (Read) | Spark Type              | ClickHouse Type (Write) | ClickHouse Type (Create) |
 |------------------------|-------------------------|-------------------------|--------------------------|
@@ -39,6 +41,3 @@
 | unsupported            | `ArrayType(Date)`       | `Array(Date)`           | `Array(Date)`            |
 | unsupported            | `ArrayType(FloatType)`  | `Array(Float32)`        | `Array(Float32)`         |
 | unsupported            | `ArrayType(DoubleType)` | unsupported             | unsupported              |
-
-
-#### By Default Array Type Mappings without Spark Dialect Extension is not supported
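To make the Array rows above concrete, here is a minimal PySpark round-trip sketch, assuming the dialect is registered as described in docs/using_the_dialect.md below; the URL, credentials, and table name are hypothetical placeholders:

```python
from datetime import date

from pyspark.sql import SparkSession

# Assumes the dialect JAR is on the classpath and
# ClickhouseDialectRegistry.register() has already been called.
spark = SparkSession.builder.getOrCreate()

# ArrayType(DateType) -> Array(Date), per the table above
df = spark.createDataFrame([([date(2024, 1, 1), date(2024, 6, 1)],)], "dates array<date>")

jdbc_options = {
    # hypothetical connection settings
    "url": "jdbc:clickhouse://my.clickhouse.hostname.or.ip:8123/default",
    "dbtable": "array_demo",
    "user": "someuser",
    "password": "******",
}

# Without the extension, Spark's generic JDBC dialect rejects ArrayType on write.
(
    df.write.format("jdbc")
    .options(**jdbc_options)
    .option("createTableOptions", "ENGINE = MergeTree() ORDER BY tuple()")
    .mode("overwrite")
    .save()
)

# Reading back maps Array(Date) -> ArrayType(DateType).
spark.read.format("jdbc").options(**jdbc_options).load().printSchema()
```

The `createTableOptions` value is there because ClickHouse requires an explicit table engine when Spark creates the table.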

docs/using_the_dialect.md

Lines changed: 54 additions & 50 deletions
@@ -2,62 +2,50 @@
 
 This section provides instructions on how to configure Apache Spark to use the Spark Dialect Extension, enabling custom handling of JDBC data types.
 
-### Add the JAR to Spark
+### Using onETL with PySpark
 
-#### Using release version
-
-##### Using SparkConf
-
-For PySpark:
+See the [onETL documentation](https://onetl.readthedocs.io) for installation instructions.
 
 ```python
 from pyspark.sql import SparkSession
+from onetl.connection import Clickhouse
 
+# describe packages that should be loaded by Spark
+maven_packages = [
+    "io.github.mtsongithub.doetl:spark-dialect-extension_2.12:0.0.1",
+    *Clickhouse.get_packages(),
+]
+
+# Create Spark session
 spark = (
     SparkSession.builder
     .appName("My Spark App")
-    .config("spark.jars.packages", "io.github.mtsongithub.doetl:spark-dialect-extension_2.12:0.0.1")
+    .config("spark.jars.packages", ",".join(maven_packages))
    .getOrCreate()
 )
-```
-
-For Spark on Scala:
-
-```scala
-import org.apache.spark.sql.SparkSession
-
-val spark = SparkSession.builder()
-  .appName("My Spark App")
-  .config("spark.jars.packages", "io.github.mtsongithub.doetl:spark-dialect-extension_2.12:0.0.1")
-  .getOrCreate()
-```
 
-##### Using Spark Submit
-
-```bash
-spark-submit --conf spark.jars.packages=io.github.mtsongithub.doetl:spark-dialect-extension_2.12:0.0.1
-```
-
-#### Compile from source
-
-##### Build .jar file
-
-See [CONTRIBUTING.md](../CONTRIBUTING.md) for build instructions.
+# Register custom Clickhouse dialect
+ClickhouseDialectRegistry = spark._jvm.io.github.mtsongithub.doetl.sparkdialectextensions.clickhouse.ClickhouseDialectRegistry
+ClickhouseDialectRegistry.register()
 
-After build you'll have a file `/path/to/cloned-repo/target/scala_2.12/spark-dialect-extension_2.12-0.0.1.jar`
 
-##### Using SparkConf
+# Use onETL to interact with Clickhouse
+clickhouse = Clickhouse(
+    host="my.clickhouse.hostname.or.ip",
+    port=9000,
+    user="someuser",
+    password="******",
+    spark=spark,
+)
 
-For PySpark:
+from onetl.db import DBReader, DBWriter
 
-```python
-from pyspark.sql import SparkSession
+# onETL can now properly read some Clickhouse types
+reader = DBReader(connection=clickhouse, source="mytable")
+df = reader.run()
 
-spark = (
-    SparkSession.builder
-    .appName("My Spark App")
-    .config("spark.jars", "/path/to/cloned-repo/target/scala_2.12/spark-dialect-extension_2.12-0.0.1.jar")
-    .getOrCreate()
-)
+# onETL can now properly write some Clickhouse types
+writer = DBWriter(connection=clickhouse, target="anothertable")
+writer.run(df)
 ```
 
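A note on the example above: `Clickhouse.get_packages()` returns the Maven coordinates of the JDBC driver packages onETL needs for ClickHouse, so only the dialect extension coordinate has to be pinned by hand; the combined list is passed to Spark as a single comma-separated `spark.jars.packages` value.
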
@@ -64,11 +52,28 @@
-For Spark on Scala:
+### Using Spark on Scala
 
 ```scala
 import org.apache.spark.sql.SparkSession
 
+// describe packages that should be loaded by Spark
+val mavenPackages = Array(
+  "io.github.mtsongithub.doetl:spark-dialect-extension_2.12:0.0.1",
+  "com.clickhouse:clickhouse-jdbc:0.6.5",
+  "com.clickhouse:clickhouse-http-client:0.6.5",
+  "org.apache.httpcomponents.client5:httpclient5:5.3.1",
+)
+
 val spark = SparkSession.builder()
   .appName("My Spark App")
-  .config("spark.jars", "/path/to/cloned-repo/target/scala_2.12/spark-dialect-extension_2.12-0.0.1.jar")
+  .config("spark.jars.packages", mavenPackages.mkString(","))
   .getOrCreate()
+
+// Register custom Clickhouse dialect
+import io.github.mtsongithub.doetl.sparkdialectextensions.clickhouse.ClickhouseDialectRegistry
+
+ClickhouseDialectRegistry.register()
+
+// Now Spark can properly handle some Clickhouse types during read & write
+val df = spark.read.format("jdbc").options(...).load()
+df.write.format("jdbc").options(...).option("dbtable", "anothertable").save()
 ```
 
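In the Scala sketch above, `options(...)` elides the usual JDBC options (`url`, `dbtable`, user and password), which a real read or write must supply.
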
@@ -75,22 +80,21 @@
-##### Using Spark Submit
+### Using Spark Submit
+
+Start a Spark session with the downloaded packages:
 
 ```bash
-spark-submit --jars /path/to/cloned-repo/target/scala_2.12/spark-dialect-extension_2.12-0.0.1.jar
+spark-submit --conf spark.jars.packages=io.github.mtsongithub.doetl:spark-dialect-extension_2.12:0.0.1,com.clickhouse:clickhouse-jdbc:0.6.5,com.clickhouse:clickhouse-http-client:0.6.5,org.apache.httpcomponents.client5:httpclient5:5.3.1 ...
 ```
 
-### Register a dialect
-
-To integrate the Spark Dialect Extension into your Spark application, you need to use ``<DBMS>DialectRegistry`` classes, which dynamically detect the Spark version and register the corresponding dialect.
+Then register the custom dialect in the started session.
 
 For PySpark:
-
 ```python
 # Register custom Clickhouse dialect
 ClickhouseDialectRegistry = spark._jvm.io.github.mtsongithub.doetl.sparkdialectextensions.clickhouse.ClickhouseDialectRegistry
 ClickhouseDialectRegistry.register()
 ```
 
-For Spark on Scala:
+For Scala:
 ```scala
 // Register custom Clickhouse dialect
 import io.github.mtsongithub.doetl.sparkdialectextensions.clickhouse.ClickhouseDialectRegistry
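As a quick sanity check (not part of this commit), reading a table and inspecting the inferred schema shows whether the registration took effect; a minimal sketch, with hypothetical connection settings and table name:

```python
# Assumes the SparkSession and dialect registration from above.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:clickhouse://my.clickhouse.hostname.or.ip:8123/default")
    .option("dbtable", "mytable")
    .option("user", "someuser")
    .option("password", "******")
    .load()
)

# With the dialect registered, ClickHouse Bool columns arrive as BooleanType
# and DateTime64(6) as TimestampType (see docs/data_type_mappings.md).
df.printSchema()
```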
