
Add injection iceberg usecase #2

Open · wants to merge 1 commit into main

Conversation

@jbonofre (Collaborator) commented Feb 5, 2024

This first PR creates the skeleton and adds a first injection use case.

This injection use case:

  1. provides a script to fetch GDELT data as a CSV file, ready to be parsed and inserted into the Iceberg table
  2. CreateTable creates an Iceberg table to store GDELT events (a sketch follows this list)
  3. DataInjection parses the GDELT CSV file and inserts the data into the Iceberg table
  4. Analytics provides the first simple queries
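
As a minimal sketch (not the actual PR code), the CreateTable step could boil down to a single Spark SQL statement; the `local.gdelt.events` table name follows the catalog naming discussed later in this review, and only a handful of GDELT 1.0 event columns are shown:

```
import org.apache.spark.sql.SparkSession;

public class CreateTable {
    public static void main(String[] args) {
        // SparkProvider is the session factory used elsewhere in this PR.
        try (SparkSession spark = SparkProvider.get()) {
            // Create an Iceberg table for GDELT events; only a few of the
            // ~58 GDELT 1.0 columns are listed here for brevity.
            spark.sql("CREATE TABLE IF NOT EXISTS local.gdelt.events ("
                + "GLOBALEVENTID BIGINT, "
                + "SQLDATE INT, "
                + "Actor1Name STRING, "
                + "Actor2Name STRING, "
                + "EventCode STRING, "
                + "ActionGeo_FullName STRING) "
                + "USING iceberg");
        }
    }
}
```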

NB: when merged, I will create a blog post based on this PR.

@jbonofre self-assigned this Feb 5, 2024

@jbonofre (Collaborator, Author) commented Feb 5, 2024

@ajantha-bhat can you please take a look?

* `benchmark` contains use cases benchmark
* `iceberg/datasets` contains scripts to retrieve ready to use data.
* `iceberg/usecases` contains samples and examples using the datasets.
* `iceberg/benchmark` contains use cases benchmark

@ajantha-bhat commented:

nit: this and the other statements could also end with a dot.

@jbonofre (Collaborator, Author) replied:

Good catch, I will fix that.

@@ -30,25 +30,60 @@ Daily, a zip file is created, containing a CSV file with all events using the fo

The format is described here: http://data.gdeltproject.org/documentation/GDELT-Data_Format_Codebook.pdf

### TPCDS

@ajantha-bhat commented:

Should we add a GDELT title?

@jbonofre (Collaborator, Author) replied:

Yes, good point, as we might have other use cases/datasets later.


This query extracts all events for a specific location, using the Spark engine.
```
mvn clean install
```

@ajantha-bhat commented:

nit: maybe it would be better to make this a Gradle project instead.
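
(As an aside, the location query mentioned in the excerpt above might look like the following hypothetical sketch, given an existing SparkSession `spark`; the table and column names are assumptions, not taken from the PR:)

```
// Hypothetical location query; ActionGeo_FullName is a standard GDELT 1.0
// column, but the PR's actual schema and query may differ.
spark.sql("SELECT * FROM local.gdelt.events "
    + "WHERE ActionGeo_FullName = 'Paris, France'")
    .show();
```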


```
for (List<String> line : lines) {
    StringBuilder builder = new StringBuilder();
    builder.append("INSERT INTO iceland.gdelt.events VALUES(");
```

@ajantha-bhat commented:

Should this be `local` instead of `iceland`?

@jbonofre (Collaborator, Author) replied:

Yeah, we can use `local`; I will fix that.

@ajantha-bhat replied:

Or what about a catalog name prefixed with the type, like `hadoop_catalog`, `nessie_catalog`, etc.?

```
}

try (SparkSession spark = SparkProvider.get()) {
    List<List<String>> lines = parse(file.toFile());
```

@ajantha-bhat commented:

Maybe just create a temp view for the CSV and insert directly; there would be no need for custom parser code.
https://stackoverflow.com/questions/56572853/how-to-sparksql-load-csv-with-header-on-from-statement

@jbonofre (Collaborator, Author) replied:

I remember we are able to parse with Spark, but I'm not sure we can parse GDELT directly, as it's not comma-separated but tab-separated (TSV); let me check. Thanks for pointing it out.
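
(For illustration, a minimal sketch of the suggested direct approach, assuming the `local.gdelt.events` table name from the other thread and a hypothetical file path; Spark's CSV reader handles tab-separated files via the `sep` option:)

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Let Spark parse the GDELT file directly instead of a custom parser.
try (SparkSession spark = SparkProvider.get()) { // SparkProvider as used in this PR
    Dataset<Row> events = spark.read()
        .option("sep", "\t")               // GDELT event files are tab-separated
        .option("inferSchema", "true")     // or supply an explicit schema
        .csv("/path/to/gdelt-events.csv"); // hypothetical path

    // Insert through a temp view, as suggested above.
    events.createOrReplaceTempView("gdelt_raw");
    spark.sql("INSERT INTO local.gdelt.events SELECT * FROM gdelt_raw");
}
```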

```
    .master("local[2]")
    .appName("simple")
    .config("spark.sql.extensions", IcebergSparkSessionExtensions.class.getName())
    .config("spark.sql.catalog.spark_catalog", SparkSessionCatalog.class.getName())
```

@ajantha-bhat commented:

Maybe we don't need to configure `spark_catalog`; there is no HMS, so having a Hive catalog here is a dummy.
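
(For example, a minimal sketch of a Hadoop-catalog-only configuration; the catalog name `local` and the warehouse path are assumptions for illustration:)

```
import org.apache.iceberg.spark.IcebergSparkSessionExtensions;
import org.apache.iceberg.spark.SparkCatalog;
import org.apache.spark.sql.SparkSession;

// Configure only a named Iceberg catalog backed by the local filesystem;
// spark_catalog is left untouched since no Hive Metastore is involved.
SparkSession spark = SparkSession.builder()
    .master("local[2]")
    .appName("simple")
    .config("spark.sql.extensions", IcebergSparkSessionExtensions.class.getName())
    .config("spark.sql.catalog.local", SparkCatalog.class.getName())
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/warehouse") // hypothetical path
    .getOrCreate();
```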

@ajantha-bhat commented:

I just reviewed at a high level. I will install and try it out later.

@@ -0,0 +1,15 @@
```
#!/bin/bash
```

@ajantha-bhat commented:

Should we add validation for the expected number of arguments and the value format? In this case, expecting one argument with an epoch or date.
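
(A minimal sketch of such a check for the one-argument case; the usage message is an assumption, not taken from the PR script:)

```
#!/bin/bash

# Validate that exactly one argument (an epoch or a date) was provided.
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <epoch|date>" >&2
    exit 1
fi
```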

```
<dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-spark-3.5_2.13</artifactId>
    <version>1.4.3</version>
```

@ajantha-bhat commented:

IMO, let's move the versions to properties. WDYT? It would make upgrades easier if needed.
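
(For example, a minimal sketch of moving the version into a POM property; the property name is an assumption:)

```
<properties>
    <iceberg.version>1.4.3</iceberg.version>
</properties>
...
<dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-spark-3.5_2.13</artifactId>
    <version>${iceberg.version}</version>
</dependency>
```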

```
mvn clean install
```

You now have the `iceberg/usecases/injection/target/injection-1.0-SNAPSHOT.jar` uber jar.
@ajantha-bhat commented:

Suggested change:
- You now have the `iceberg/usecases/injection/target/injection-1.0-SNAPSHOT.jar` uber jar.
+ You now have the `iceberg/usecases/injection/target/injection-<version>-SNAPSHOT.jar` uber jar.

You now have the `iceberg/usecases/injection/target/injection-1.0-SNAPSHOT.jar` uber jar.

@ajantha-bhat commented:

Should we add a driver with all the available commands? We could either prompt with all the available commands or define command options to be selected. Thoughts?
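
(A minimal sketch of such a driver; the command names are assumptions, and CreateTable, DataInjection, and Analytics are the classes named in the PR description:)

```
import java.util.Arrays;

// Hypothetical single entry point dispatching to the use case classes.
public class Driver {
    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("Usage: driver <create-table|data-injection|analytics> [args...]");
            System.exit(1);
        }
        String[] rest = Arrays.copyOfRange(args, 1, args.length);
        switch (args[0]) {
            case "create-table":   CreateTable.main(rest);   break;
            case "data-injection": DataInjection.main(rest); break;
            case "analytics":      Analytics.main(rest);     break;
            default:
                System.err.println("Unknown command: " + args[0]);
                System.exit(1);
        }
    }
}
```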
