Add injection iceberg usecase #2

base: main

Conversation
@ajantha-bhat can you please take a look?
* `benchmark` contains use cases benchmark
* `iceberg/datasets` contains scripts to retrieve ready to use data.
* `iceberg/usecases` contains samples and examples using the datasets.
* `iceberg/benchmark` contains use cases benchmark
nit: this and other statements can also end with a dot.
Good catch, I will fix that.
@@ -30,25 +30,60 @@ Daily, a zip file is created, containing a CSV file with all events using the fo

The format is described here: http://data.gdeltproject.org/documentation/GDELT-Data_Format_Codebook.pdf

### TPCDS
Should we add a GDELT title?
Yes, good point as we might have other usecases/datasets later.
This query extracts all events for a specific location, using the Spark engine.

```
mvn clean install
```
nit: maybe better to make it a Gradle project instead.
for (List<String> line : lines) {
  StringBuilder builder = new StringBuilder();
  builder.append("INSERT INTO iceland.gdelt.events VALUES(");
should this be `local` instead of `iceland`?
Yeah, we can use `local`, I will fix that.
Or what about prefixing the catalog name with its type, like `hadoop_catalog`, `nessie_catalog`, etc.?
}

try (SparkSession spark = SparkProvider.get()) {
  List<List<String>> lines = parse(file.toFile());
maybe just create a temp view for the CSV and insert directly, no need for custom parser code: https://stackoverflow.com/questions/56572853/how-to-sparksql-load-csv-with-header-on-from-statement
I remember we are able to parse with Spark, but I'm not sure we can parse GDELT directly as it's not a CSV but a TSV (tab-separated format); let me check. Thanks for pointing it out.
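(Editor's note: a minimal sketch of the temp-view approach, assuming a `SparkSession` already configured with an Iceberg catalog named `local` as discussed above; the class and method names, the view name, and the `sep` option handling the tab-separated GDELT export are illustrative, not part of this PR.)

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public final class DirectTsvLoad {

  // Loads one GDELT export file into the Iceberg events table without a hand-written parser.
  static void load(SparkSession spark, String path) {
    // GDELT daily exports are tab separated, so override the default comma delimiter.
    Dataset<Row> events = spark.read()
        .option("sep", "\t")
        .option("header", "false")
        .csv(path);

    // Expose the file as a temporary view and insert it in one SQL statement.
    // This assumes the column order matches the events table definition; in practice
    // the columns may need casting, since everything is read as strings here.
    events.createOrReplaceTempView("gdelt_raw");
    spark.sql("INSERT INTO local.gdelt.events SELECT * FROM gdelt_raw");
  }
}
```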
.master("local[2]")
.appName("simple")
.config("spark.sql.extensions", IcebergSparkSessionExtensions.class.getName())
.config("spark.sql.catalog.spark_catalog", SparkSessionCatalog.class.getName())
Maybe we don't need to configure `spark_catalog`; there is no HMS, so having a Hive catalog is dummy here.
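(Editor's note: one way to read that suggestion, as a minimal sketch: declare only the named Iceberg catalog and drop `spark_catalog` entirely. The provider class name, the catalog name `local`, and the warehouse path are illustrative assumptions.)

```java
import org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions;
import org.apache.spark.sql.SparkSession;

public final class LocalSparkProvider {

  static SparkSession get() {
    return SparkSession.builder()
        .master("local[2]")
        .appName("simple")
        .config("spark.sql.extensions", IcebergSparkSessionExtensions.class.getName())
        // Only the "local" Hadoop catalog is declared; no spark_catalog / Hive
        // catalog is registered, since there is no HMS in this setup.
        .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.local.type", "hadoop")
        .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
        .getOrCreate();
  }
}
```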
I just reviewed at a high level. I will install and try it out later on.
@@ -0,0 +1,15 @@

#!/bin/bash
Should we add validation for the expected number of arguments and the value format? Like in this case: "Expecting one argument with epoch or date".
<dependency>
  <groupId>org.apache.iceberg</groupId>
  <artifactId>iceberg-spark-3.5_2.13</artifactId>
  <version>1.4.3</version>
IMO, let's move the version to properties. WDYT? It should be easy to upgrade if needed.
```
mvn clean install
```

You now have the `iceberg/usecases/injection/target/injection-1.0-SNAPSHOT.jar` uber jar.
Suggested change:
- You now have the `iceberg/usecases/injection/target/injection-1.0-SNAPSHOT.jar` uber jar.
+ You now have the `iceberg/usecases/injection/target/injection-<version>-SNAPSHOT.jar` uber jar.
You now have the `iceberg/usecases/injection/target/injection-1.0-SNAPSHOT.jar` uber jar.
Should we add a driver with all the available commands? Either we can prompt all the available commands, or we can define command options to be selected. Thoughts?
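(Editor's note: as an illustration only, a hedged sketch of such a driver, dispatching on the first argument to the PR's `CreateTable`, `DataInjection`, and `Analytics` entry points described in the summary below; the command names and the assumption that each class exposes a `main(String[])` method are not part of this PR.)

```java
import java.util.Arrays;

public final class Driver {

  public static void main(String[] args) throws Exception {
    if (args.length == 0) {
      System.err.println("Usage: driver <create-table|inject|analytics> [options]");
      System.exit(1);
    }
    // Everything after the command name is passed through to the selected entry point.
    String[] rest = Arrays.copyOfRange(args, 1, args.length);
    switch (args[0]) {
      case "create-table":
        CreateTable.main(rest);
        break;
      case "inject":
        DataInjection.main(rest);
        break;
      case "analytics":
        Analytics.main(rest);
        break;
      default:
        System.err.println("Unknown command: " + args[0]);
        System.exit(1);
    }
  }
}
```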
This first PR creates the skeleton and adds a first `injection` use case.

This `injection` use case:

* `CreateTable` creates an Iceberg table to store GDELT events
* `DataInjection` parses the GDELT CSV file and inserts data in the Iceberg table
* `Analytics` provides the first simple queries

NB: when merged, I will create a blog post based on this PR.