Commit 3b7444a

GCP variant of Snowflake Loader and Stream Transformer (close #1091)
1 parent 94e3509 commit 3b7444a

File tree: 190 files changed (+4054, -2568 lines)


.github/workflows/ci.yml

Lines changed: 11 additions & 18 deletions
```diff
@@ -67,6 +67,7 @@ jobs:
           - snowflakeLoader
           - databricksLoader
           - transformerKinesis
+          - transformerPubsub
     steps:
       - name: Checkout Github
         uses: actions/checkout@v2
@@ -171,26 +172,17 @@ jobs:
           curl https://databricks-bi-artifacts.s3.us-east-2.amazonaws.com/simbaspark-drivers/jdbc/2.6.25/DatabricksJDBC42-2.6.25.1044.zip --output DatabricksJDBC42.jar.zip
           unzip DatabricksJDBC42.jar.zip
           cp DatabricksJDBC42-2.6.25.1044/DatabricksJDBC42.jar DatabricksJDBC42.jar
-      - name: Build redshift-loader jar
+      - name: Build jar files
         env:
           SKIP_TEST: true
-        run: sbt 'project redshiftLoader' assembly
-      - name: Build snowflake-loader jar
-        env:
-          SKIP_TEST: true
-        run: sbt 'project snowflakeLoader' assembly
-      - name: Build databricks-loader jar
-        env:
-          SKIP_TEST: true
-        run: sbt 'project databricksLoader' assembly
-      - name: Build transformer-batch jar
-        env:
-          SKIP_TEST: true
-        run: sbt 'project transformerBatch' assembly
-      - name: Build transformer-kinesis jar
-        env:
-          SKIP_TEST: true
-        run: sbt 'project transformerKinesis' assembly
+        run: |
+          sbt \
+            'project redshiftLoader; assembly' \
+            'project snowflakeLoader; assembly' \
+            'project databricksLoader; assembly' \
+            'project transformerBatch; assembly' \
+            'project transformerKinesis; assembly' \
+            'project transformerPubsub; assembly'
       - name: Get current version
         id: ver
         run: echo "::set-output name=project_version::${GITHUB_REF#refs/tags/}"
@@ -207,5 +199,6 @@ jobs:
             modules/databricks-loader/target/scala-2.12/snowplow-databricks-loader-${{ steps.ver.outputs.project_version }}.jar
             modules/transformer-batch/target/scala-2.12/snowplow-transformer-batch-${{ steps.ver.outputs.project_version }}.jar
             modules/transformer-kinesis/target/scala-2.12/snowplow-transformer-kinesis-${{ steps.ver.outputs.project_version }}.jar
+            modules/transformer-pubsub/target/scala-2.12/snowplow-transformer-pubsub-${{ steps.ver.outputs.project_version }}.jar
         env:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
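This step consolidation replaces five separate sbt invocations, each paying JVM and sbt startup cost, with a single session that runs all assemblies. The same effect could be had from the sbt shell with a command alias; the sketch below is a hypothetical illustration, not part of this commit:

```scala
// build.sbt sketch (hypothetical): an alias that assembles every published
// module in one sbt session, mirroring the consolidated CI step above.
addCommandAlias(
  "assembleAll",
  ";project redshiftLoader; assembly" +
    ";project snowflakeLoader; assembly" +
    ";project databricksLoader; assembly" +
    ";project transformerBatch; assembly" +
    ";project transformerKinesis; assembly" +
    ";project transformerPubsub; assembly"
)
```

With that alias in place, `sbt assembleAll` would build all six jars in one go.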

README.md

Lines changed: 27 additions & 22 deletions
```diff
@@ -8,29 +8,38 @@
 
 ## Introduction
 
-This project contains applications required to load Snowplow data into relational databases.
+This project contains applications required to load Snowplow data into various data warehouses.
 
-### RDB Shredder
+It consists of two types of applications: Transformers and Loaders.
 
-RDB Shredder is a [Spark][spark] job which:
+### Transformers
 
-1. Reads Snowplow enriched events from S3
-2. Extracts any unstructured event JSONs and context JSONs found
-3. Validates that these JSONs conform to schema
-4. Adds metadata to these JSONs to track their origins
-5. Writes these JSONs out to nested folders dependent on their schema
+Transformers read Snowplow enriched events, transform them to a format ready to be loaded into a data warehouse, then write them to the respective blob storage.
 
-It is designed to be run downstream of the [Enrich][enrich] job.
+There are two types of Transformers: Batch and Streaming.
 
-### RDB Loader
+#### Stream Transformer
 
-RDB Loader (previously known as StorageLoader) is a Scala application that runs in background, discovering [data][shred], produced by RDB Shredder from SQS queue and loading it into one of possible [storage targets][targets].
+Stream Transformers read enriched events from the respective stream service, transform them, then write the transformed events to a specified blob storage path.
+They write transformed events in periodic windows.
 
-### RDB Stream Shredder (experimental)
+There are two Stream Transformer applications: Transformer Kinesis and Transformer Pubsub, the AWS and GCP variants respectively.
 
-An application similar to RDB Shredder, but working without Apache Spark or EMR
-and reading directly from Kinesis Stream. Only Shredder or Stream Shredder
-should be used.
+
+#### Batch Transformer
+
+Batch Transformer is a [Spark][spark] job that works only with AWS services. It reads enriched events from a given S3 path, transforms them, then writes the transformed events to a specified S3 path.
+
+
+### Loaders
+
+Transformers send a message to a message queue after they finish transforming a batch and writing it to blob storage.
+This message contains information about the transformed data, such as where it is stored and what it looks like.
+
+Loaders subscribe to the message queue. After a message is received, it is parsed, and the necessary bits are extracted to load the transformed events into the destination.
+Loaders construct the SQL statements needed to load the transformed events, then send them to the specified destination.
+
+At the moment, we have loader applications for Redshift, Databricks and Snowflake.
 
 ## Find out more
 
@@ -41,7 +50,7 @@ should be used.
 
 ## Copyright and License
 
-Snowplow Relational Database Loader is copyright 2012-2021 Snowplow Analytics Ltd.
+Snowplow Relational Database Loader is copyright 2012-2022 Snowplow Analytics Ltd.
 
 Licensed under the **[Apache License, Version 2.0][license]** (the "License");
 you may not use this software except in compliance with the License.
@@ -56,15 +65,11 @@ limitations under the License.
 [techdocs-image]: https://d3i6fms1cm1j0i.cloudfront.net/github/images/techdocs.png
 [setup-image]: https://d3i6fms1cm1j0i.cloudfront.net/github/images/setup.png
 [roadmap-image]: https://d3i6fms1cm1j0i.cloudfront.net/github/images/roadmap.png
-[setup]: https://docs.snowplowanalytics.com/docs/getting-started-on-snowplow-open-source/setup-snowplow-on-aws/setup-destinations/setup-redshift/
-[techdocs]: https://docs.snowplowanalytics.com/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-rdb-loader/
+[setup]: https://docs.snowplow.io/docs/getting-started-on-snowplow-open-source/
+[techdocs]: https://docs.snowplow.io/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-rdb-loader/
 [roadmap]: https://github.com/snowplow/snowplow/projects/7
 
 [spark]: http://spark.apache.org/
-[enrich]: https://github.com/snowplow/snowplow/enrich
-
-[targets]: https://github.com/snowplow/snowplow/wiki/Configuring-storage-targets
-[shred]: https://docs.snowplowanalytics.com/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-rdb-loader/rdb-shredder/
 
 [build-image]: https://github.com/snowplow/snowplow-rdb-loader/workflows/Test%20and%20deploy/badge.svg
 [build]: https://github.com/snowplow/snowplow-rdb-loader/actions?query=workflow%3A%22Test%22
```
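To make the Transformer/Loader handshake described in the updated README concrete, here is a minimal Scala sketch. Everything in it is hypothetical: the message fields, the queue, and the SQL shape are illustrative stand-ins, not the project's actual wire format or API.

```scala
import java.net.URI

// Hypothetical message a transformer might publish after closing a window:
// where the transformed batch lives and which event types it contains.
final case class BatchCompleted(base: URI, eventTypes: List[String])

object LoaderSketch {
  // Loader side: derive a COPY-style load statement from the message.
  // The statement shape is illustrative only.
  def toLoadStatement(msg: BatchCompleted): String =
    s"COPY INTO atomic.events FROM '${msg.base}' FILE_FORMAT = (TYPE = JSON)"

  def main(args: Array[String]): Unit = {
    // In the real pipeline this message travels over SQS or Pub/Sub;
    // here we hand it straight to the loader logic.
    val msg = BatchCompleted(
      base = URI.create("gs://bucket/transformed/run=2022-01-01-00-00-00/"),
      eventTypes = List("page_view", "link_click")
    )
    println(toLoadStatement(msg))
  }
}
```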

build.sbt

Lines changed: 61 additions & 20 deletions
```diff
@@ -12,20 +12,50 @@
 */
 
 lazy val root = project.in(file("."))
-  .aggregate(common, aws, loader, databricksLoader, redshiftLoader, snowflakeLoader, transformerBatch, transformerKinesis)
+  .aggregate(
+    aws,
+    gcp,
+    common,
+    commonTransformerStream,
+    loader,
+    databricksLoader,
+    redshiftLoader,
+    snowflakeLoader,
+    transformerBatch,
+    transformerKinesis,
+    transformerPubsub,
+  )
 
-lazy val aws = project
+lazy val common: Project = project
+  .in(file("modules/common"))
+  .settings(BuildSettings.commonBuildSettings)
+  .settings(libraryDependencies ++= Dependencies.commonDependencies)
+  .settings(excludeDependencies ++= Dependencies.exclusions)
+  .enablePlugins(BuildInfoPlugin)
+
+lazy val aws: Project = project
   .in(file("modules/aws"))
   .settings(BuildSettings.awsBuildSettings)
-  .settings(addCompilerPlugin(Dependencies.betterMonadicFor))
   .settings(libraryDependencies ++= Dependencies.awsDependencies)
+  .settings(excludeDependencies ++= Dependencies.exclusions)
+  .dependsOn(common % "compile->compile;test->test")
   .enablePlugins(BuildInfoPlugin)
 
-lazy val common: Project = project
-  .in(file("modules/common"))
-  .settings(BuildSettings.commonBuildSettings)
-  .settings(libraryDependencies ++= Dependencies.commonDependencies)
+lazy val gcp: Project = project
+  .in(file("modules/gcp"))
+  .settings(BuildSettings.gcpBuildSettings)
+  .settings(libraryDependencies ++= Dependencies.gcpDependencies)
   .settings(excludeDependencies ++= Dependencies.exclusions)
+  .dependsOn(common % "compile->compile;test->test")
+  .enablePlugins(BuildInfoPlugin)
+
+lazy val commonTransformerStream = project
+  .in(file("modules/common-transformer-stream"))
+  .settings(BuildSettings.commonStreamTransformerBuildSettings)
+  .settings(addCompilerPlugin(Dependencies.betterMonadicFor))
+  .settings(libraryDependencies ++= Dependencies.commonStreamTransformerDependencies)
+  .settings(excludeDependencies ++= Dependencies.commonStreamTransformerExclusions)
+  .dependsOn(common % "compile->compile;test->test")
   .enablePlugins(BuildInfoPlugin)
 
 lazy val loader = project
@@ -34,15 +64,15 @@ lazy val loader = project
   .settings(addCompilerPlugin(Dependencies.betterMonadicFor))
   .settings(libraryDependencies ++= Dependencies.loaderDependencies)
   .settings(excludeDependencies ++= Dependencies.exclusions)
-  .dependsOn(common % "compile->compile;test->test", aws)
+  .dependsOn(aws % "compile->compile;test->test", gcp % "compile->compile;test->test")
   .enablePlugins(JavaAppPackaging, DockerPlugin, BuildInfoPlugin)
 
 lazy val redshiftLoader = project
   .in(file("modules/redshift-loader"))
   .settings(BuildSettings.redshiftBuildSettings)
   .settings(addCompilerPlugin(Dependencies.betterMonadicFor))
   .settings(libraryDependencies ++= Dependencies.redshiftDependencies)
-  .dependsOn(common % "compile->compile;test->test", aws, loader % "compile->compile;test->test")
+  .dependsOn(loader % "compile->compile;test->test")
   .enablePlugins(JavaAppPackaging, DockerPlugin, BuildInfoPlugin)
 
 lazy val redshiftLoaderDistroless = project
@@ -51,15 +81,15 @@ lazy val redshiftLoaderDistroless = project
   .settings(BuildSettings.redshiftDistrolessBuildSettings)
   .settings(addCompilerPlugin(Dependencies.betterMonadicFor))
   .settings(libraryDependencies ++= Dependencies.redshiftDependencies)
-  .dependsOn(common % "compile->compile;test->test", aws, loader % "compile->compile;test->test")
+  .dependsOn(loader % "compile->compile;test->test")
   .enablePlugins(JavaAppPackaging, DockerPlugin, BuildInfoPlugin, LauncherJarPlugin)
 
 lazy val snowflakeLoader = project
   .in(file("modules/snowflake-loader"))
   .settings(BuildSettings.snowflakeBuildSettings)
   .settings(addCompilerPlugin(Dependencies.betterMonadicFor))
   .settings(libraryDependencies ++= Dependencies.snowflakeDependencies)
-  .dependsOn(common % "compile->compile;test->test", aws, loader % "compile->compile;test->test")
+  .dependsOn(common % "compile->compile;test->test", loader % "compile->compile;test->test")
   .enablePlugins(JavaAppPackaging, DockerPlugin, BuildInfoPlugin)
 
 lazy val snowflakeLoaderDistroless = project
@@ -68,22 +98,22 @@ lazy val snowflakeLoaderDistroless = project
   .settings(BuildSettings.snowflakeDistrolessBuildSettings)
   .settings(addCompilerPlugin(Dependencies.betterMonadicFor))
   .settings(libraryDependencies ++= Dependencies.snowflakeDependencies)
-  .dependsOn(common % "compile->compile;test->test", aws, loader % "compile->compile;test->test")
+  .dependsOn(loader % "compile->compile;test->test")
   .enablePlugins(JavaAppPackaging, DockerPlugin, BuildInfoPlugin, LauncherJarPlugin)
 
 lazy val databricksLoader = project
   .in(file("modules/databricks-loader"))
   .settings(BuildSettings.databricksBuildSettings)
   .settings(addCompilerPlugin(Dependencies.betterMonadicFor))
-  .dependsOn(common % "compile->compile;test->test", aws, loader % "compile->compile;test->test")
+  .dependsOn(loader % "compile->compile;test->test")
   .enablePlugins(JavaAppPackaging, DockerPlugin, BuildInfoPlugin)
 
 lazy val databricksLoaderDistroless = project
   .in(file("modules/distroless/databricks-loader"))
   .settings(sourceDirectory := (databricksLoader / sourceDirectory).value)
   .settings(BuildSettings.databricksDistrolessBuildSettings)
   .settings(addCompilerPlugin(Dependencies.betterMonadicFor))
-  .dependsOn(common % "compile->compile;test->test", aws, loader % "compile->compile;test->test")
+  .dependsOn(loader % "compile->compile;test->test")
   .enablePlugins(JavaAppPackaging, DockerPlugin, BuildInfoPlugin, LauncherJarPlugin)
 
 lazy val transformerBatch = project
@@ -98,17 +128,28 @@ lazy val transformerKinesis = project
   .in(file("modules/transformer-kinesis"))
   .settings(BuildSettings.transformerKinesisBuildSettings)
   .settings(addCompilerPlugin(Dependencies.betterMonadicFor))
-  .settings(libraryDependencies ++= Dependencies.transformerKinesisDependencies)
-  .settings(excludeDependencies ++= Dependencies.transformerKinesisExclusions)
-  .dependsOn(common % "compile->compile;test->test", aws)
+  .dependsOn(commonTransformerStream % "compile->compile;test->test", aws % "compile->compile;test->test")
   .enablePlugins(JavaAppPackaging, DockerPlugin, BuildInfoPlugin)
 
 lazy val transformerKinesisDistroless = project
   .in(file("modules/distroless/transformer-kinesis"))
   .settings(sourceDirectory := (transformerKinesis / sourceDirectory).value)
   .settings(BuildSettings.transformerKinesisDistrolessBuildSettings)
   .settings(addCompilerPlugin(Dependencies.betterMonadicFor))
-  .settings(libraryDependencies ++= Dependencies.transformerKinesisDependencies)
-  .settings(excludeDependencies ++= Dependencies.transformerKinesisExclusions)
-  .dependsOn(common, aws)
+  .dependsOn(commonTransformerStream, aws)
+  .enablePlugins(JavaAppPackaging, DockerPlugin, BuildInfoPlugin, LauncherJarPlugin)
+
+lazy val transformerPubsub = project
+  .in(file("modules/transformer-pubsub"))
+  .settings(BuildSettings.transformerPubsubBuildSettings)
+  .settings(addCompilerPlugin(Dependencies.betterMonadicFor))
+  .dependsOn(commonTransformerStream % "compile->compile;test->test", gcp % "compile->compile;test->test")
+  .enablePlugins(JavaAppPackaging, DockerPlugin, BuildInfoPlugin)
+
+lazy val transformerPubsubDistroless = project
+  .in(file("modules/distroless/transformer-pubsub"))
+  .settings(sourceDirectory := (transformerPubsub / sourceDirectory).value)
+  .settings(BuildSettings.transformerPubsubDistrolessBuildSettings)
+  .settings(addCompilerPlugin(Dependencies.betterMonadicFor))
+  .dependsOn(commonTransformerStream, gcp)
   .enablePlugins(JavaAppPackaging, DockerPlugin, BuildInfoPlugin, LauncherJarPlugin)
```
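A pattern worth noting in the reworked module graph above is the `"compile->compile;test->test"` configuration mapping: besides the usual compile-scope dependency, it lets the downstream module reuse the upstream module's test sources (shared fixtures and helpers) in its own tests. A generic sbt sketch of the idea, not this repository's actual build:

```scala
// Generic build.sbt sketch of the configuration mapping used above.
lazy val core = project.in(file("core"))

lazy val app = project
  .in(file("app"))
  // compile->compile: app's main sources see core's main sources (the default).
  // test->test: app's tests also see core's test sources, so fixtures
  // defined under core/src/test can be reused in app's tests.
  .dependsOn(core % "compile->compile;test->test")
```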

config/databricks.config.reference.hocon renamed to config/loader/aws/databricks.config.reference.hocon

Lines changed: 2 additions & 3 deletions
```diff
@@ -17,9 +17,8 @@
   "host": "abc.cloud.databricks.com",
   # DB password
   "password": {
-    # A password can be placed in EC2 parameter store or be a plain text
-    "ec2ParameterStore": {
-      # EC2 parameter name
+    # A password can be placed in EC2 parameter store, GCP Secret Manager or be plain text
+    "secretStore": {
       "parameterName": "snowplow.databricks.password"
     }
   },
```
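The rename from `ec2ParameterStore` to `secretStore` points at a single config key that dispatches to whichever secret backend the target cloud offers (AWS SSM Parameter Store or GCP Secret Manager). A hypothetical Scala sketch of such an abstraction; the trait, the class names, and the in-memory lookups are illustrative, not the loader's real code:

```scala
// Hypothetical abstraction implied by the "secretStore" rename above.
trait SecretStore {
  def lookup(parameterName: String): String
}

// In-memory stubs standing in for the AWS SSM and GCP Secret Manager clients.
final class Ec2ParameterStore(values: Map[String, String]) extends SecretStore {
  def lookup(parameterName: String): String = values(parameterName)
}

final class GcpSecretManager(values: Map[String, String]) extends SecretStore {
  def lookup(parameterName: String): String = values(parameterName)
}

object SecretStoreSketch {
  def main(args: Array[String]): Unit = {
    // The same "parameterName" key works regardless of the backend.
    val store: SecretStore =
      new GcpSecretManager(Map("snowplow.databricks.password" -> "example"))
    println(store.lookup("snowplow.databricks.password"))
  }
}
```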

config/redshift.config.reference.hocon renamed to config/loader/aws/redshift.config.reference.hocon

Lines changed: 2 additions & 0 deletions
```diff
@@ -1,4 +1,6 @@
 {
+  # Specifies the cloud provider that the application will be deployed into
+  "cloud": "aws"
   # Data Lake (S3) region
   # This field is optional if it can be resolved with AWS region provider chain.
   # It checks places like env variables, system properties, AWS profile file.
```

config/snowflake.config.reference.hocon renamed to config/loader/aws/snowflake.config.reference.hocon

Lines changed: 2 additions & 3 deletions
```diff
@@ -22,9 +22,8 @@
   "username": "admin",
   # DB password
   "password": {
-    # A password can be placed in EC2 parameter store or be a plain text
-    "ec2ParameterStore": {
-      # EC2 parameter name
+    # A password can be placed in EC2 parameter store, GCP Secret Manager or be plain text
+    "secretStore": {
       "parameterName": "snowplow.snowflake.password"
     }
   },
```
Lines changed: 22 additions & 0 deletions

```diff
@@ -0,0 +1,22 @@
+{
+  "messageQueue": {
+    "type": "pubsub"
+    "subscription": "projects/project-id/subscriptions/subscription-id"
+  },
+  "storage" : {
+    "type": "snowflake",
+
+    "snowflakeRegion": "us-west-2",
+    "username": "admin",
+    "password": "Supersecret1",
+    "account": "acme",
+    "warehouse": "wh",
+    "schema": "atomic",
+    "database": "snowplow",
+
+    "transformedStage": {
+      "name": "snowplow_stage"
+      "location": "gs://bucket/transformed/"
+    }
+  }
+}
```
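The `transformedStage` location above is where the Stream Transformer's periodic windows land before the loader ingests them. As a hedged illustration of how a window start time might map to a blob-storage prefix (the `run=` naming is an assumption for this sketch, not taken from this commit):

```scala
import java.time.format.DateTimeFormatter
import java.time.{Instant, ZoneOffset}

object WindowPathSketch {
  // Formatter for window-start timestamps; UTC keeps prefixes unambiguous.
  private val fmt =
    DateTimeFormatter.ofPattern("yyyy-MM-dd-HH-mm-ss").withZone(ZoneOffset.UTC)

  // Hypothetical: derive the output prefix for one transformed window.
  def windowPath(base: String, windowStart: Instant): String =
    s"$base/run=${fmt.format(windowStart)}/"

  def main(args: Array[String]): Unit =
    // Prints gs://bucket/transformed/run=2022-04-01-10-20-00/
    println(windowPath("gs://bucket/transformed", Instant.parse("2022-04-01T10:20:00Z")))
}
```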
