
Add injection iceberg usecase #2

Open · wants to merge 1 commit into main

Conversation

@jbonofre (Collaborator) commented Feb 5, 2024

This first PR creates the skeleton and adds a first injection use case.

This injection use case:

  1. provides a script to fetch GDELT data as a CSV file, ready to be parsed and inserted into the Iceberg table
  2. CreateTable creates an Iceberg table to store GDELT events (a sketch follows this list)
  3. DataInjection parses the GDELT CSV file and inserts the data into the Iceberg table
  4. Analytics provides the first simple queries
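
As a minimal sketch (not the actual PR code), the CreateTable step could boil down to a single Spark SQL statement; the `local.gdelt.events` table name follows the catalog naming discussed later in this review, and only a handful of GDELT 1.0 event columns are shown:

```
import org.apache.spark.sql.SparkSession;

public class CreateTable {
    public static void main(String[] args) {
        // SparkProvider is the session factory used elsewhere in this PR.
        try (SparkSession spark = SparkProvider.get()) {
            // Create an Iceberg table for GDELT events; only a few of the
            // ~58 GDELT 1.0 columns are listed here for brevity.
            spark.sql("CREATE TABLE IF NOT EXISTS local.gdelt.events ("
                + "GLOBALEVENTID BIGINT, "
                + "SQLDATE INT, "
                + "Actor1Name STRING, "
                + "Actor2Name STRING, "
                + "EventCode STRING, "
                + "ActionGeo_FullName STRING) "
                + "USING iceberg");
        }
    }
}
```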

NB: when merged, I will create a blog post based on this PR.

@jbonofre self-assigned this Feb 5, 2024

@jbonofre (Collaborator, Author) commented Feb 5, 2024

@ajantha-bhat can you please take a look?

* `benchmark` contains use cases benchmark
* `iceberg/datasets` contains scripts to retrieve ready to use data.
* `iceberg/usecases` contains samples and examples using the datasets.
* `iceberg/benchmark` contains use cases benchmark

@ajantha-bhat commented:

nit: this and the other statements could also end with a dot.

@jbonofre (Collaborator, Author) replied:

Good catch, I will fix that.

@@ -30,25 +30,60 @@ Daily, a zip file is created, containing a CSV file with all events using the fo

The format is described here: http://data.gdeltproject.org/documentation/GDELT-Data_Format_Codebook.pdf

### TPCDS

@ajantha-bhat commented:

Should we add a GDELT title?

@jbonofre (Collaborator, Author) replied:

Yes, good point, as we might have other use cases/datasets later.


This query extracts all events for a specific location, using the Spark engine.
```
mvn clean install
```

@ajantha-bhat commented:

nit: maybe it would be better to make this a Gradle project instead.
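
(As an aside, the location query mentioned in the excerpt above might look like the following hypothetical sketch, given an existing SparkSession `spark`; the table and column names are assumptions, not taken from the PR:)

```
// Hypothetical location query; ActionGeo_FullName is a standard GDELT 1.0
// column, but the PR's actual schema and query may differ.
spark.sql("SELECT * FROM local.gdelt.events "
    + "WHERE ActionGeo_FullName = 'Paris, France'")
    .show();
```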


```
for (List<String> line : lines) {
    StringBuilder builder = new StringBuilder();
    builder.append("INSERT INTO iceland.gdelt.events VALUES(");
```

@ajantha-bhat commented:

Should this be `local` instead of `iceland`?

@jbonofre (Collaborator, Author) replied:

Yeah, we can use `local`; I will fix that.

@ajantha-bhat replied:

Or what about a catalog name prefixed with the type, like `hadoop_catalog`, `nessie_catalog`, etc.?

```
}

try (SparkSession spark = SparkProvider.get()) {
    List<List<String>> lines = parse(file.toFile());
```

@ajantha-bhat commented:

Maybe just create a temp view for the CSV and insert directly; there would be no need for custom parser code.
https://stackoverflow.com/questions/56572853/how-to-sparksql-load-csv-with-header-on-from-statement

@jbonofre (Collaborator, Author) replied:

I remember we are able to parse with Spark, but I'm not sure we can parse GDELT directly, as it's not comma-separated but tab-separated (TSV); let me check. Thanks for pointing it out.
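
(For illustration, a minimal sketch of the suggested direct approach, assuming the `local.gdelt.events` table name from the other thread and a hypothetical file path; Spark's CSV reader handles tab-separated files via the `sep` option:)

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Let Spark parse the GDELT file directly instead of a custom parser.
try (SparkSession spark = SparkProvider.get()) { // SparkProvider as used in this PR
    Dataset<Row> events = spark.read()
        .option("sep", "\t")               // GDELT event files are tab-separated
        .option("inferSchema", "true")     // or supply an explicit schema
        .csv("/path/to/gdelt-events.csv"); // hypothetical path

    // Insert through a temp view, as suggested above.
    events.createOrReplaceTempView("gdelt_raw");
    spark.sql("INSERT INTO local.gdelt.events SELECT * FROM gdelt_raw");
}
```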

```
    .master("local[2]")
    .appName("simple")
    .config("spark.sql.extensions", IcebergSparkSessionExtensions.class.getName())
    .config("spark.sql.catalog.spark_catalog", SparkSessionCatalog.class.getName())
```

@ajantha-bhat commented:

Maybe we don't need to configure `spark_catalog`; there is no HMS, so having a Hive catalog here is a dummy.
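
(For example, a minimal sketch of a Hadoop-catalog-only configuration; the catalog name `local` and the warehouse path are assumptions for illustration:)

```
import org.apache.iceberg.spark.IcebergSparkSessionExtensions;
import org.apache.iceberg.spark.SparkCatalog;
import org.apache.spark.sql.SparkSession;

// Configure only a named Iceberg catalog backed by the local filesystem;
// spark_catalog is left untouched since no Hive Metastore is involved.
SparkSession spark = SparkSession.builder()
    .master("local[2]")
    .appName("simple")
    .config("spark.sql.extensions", IcebergSparkSessionExtensions.class.getName())
    .config("spark.sql.catalog.local", SparkCatalog.class.getName())
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/warehouse") // hypothetical path
    .getOrCreate();
```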

@ajantha-bhat commented:

I just reviewed at a high level. I will install and try it out later.

@@ -0,0 +1,15 @@
```
#!/bin/bash
```

@ajantha-bhat commented:

Should we add validation for the expected number of arguments and the value format? In this case, expecting one argument with an epoch or date.
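
(A minimal sketch of such a check for the one-argument case; the usage message is an assumption, not taken from the PR script:)

```
#!/bin/bash

# Validate that exactly one argument (an epoch or a date) was provided.
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <epoch|date>" >&2
    exit 1
fi
```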

```
<dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-spark-3.5_2.13</artifactId>
    <version>1.4.3</version>
```

@ajantha-bhat commented:

IMO, let's move the versions to properties. WDYT? It would make upgrades easier if needed.
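
(For example, a minimal sketch of moving the version into a POM property; the property name is an assumption:)

```
<properties>
    <iceberg.version>1.4.3</iceberg.version>
</properties>
...
<dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-spark-3.5_2.13</artifactId>
    <version>${iceberg.version}</version>
</dependency>
```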

```
mvn clean install
```

You now have the `iceberg/usecases/injection/target/injection-1.0-SNAPSHOT.jar` uber jar.
@ajantha-bhat commented:

Suggested change:
- You now have the `iceberg/usecases/injection/target/injection-1.0-SNAPSHOT.jar` uber jar.
+ You now have the `iceberg/usecases/injection/target/injection-<version>-SNAPSHOT.jar` uber jar.

You now have the `iceberg/usecases/injection/target/injection-1.0-SNAPSHOT.jar` uber jar.

@ajantha-bhat commented:

Should we add a driver with all the available commands? We could either prompt with all the available commands or define command options to be selected. Thoughts?
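
(A minimal sketch of such a driver; the command names are assumptions, and CreateTable, DataInjection, and Analytics are the classes named in the PR description:)

```
import java.util.Arrays;

// Hypothetical single entry point dispatching to the use case classes.
public class Driver {
    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("Usage: driver <create-table|data-injection|analytics> [args...]");
            System.exit(1);
        }
        String[] rest = Arrays.copyOfRange(args, 1, args.length);
        switch (args[0]) {
            case "create-table":   CreateTable.main(rest);   break;
            case "data-injection": DataInjection.main(rest); break;
            case "analytics":      Analytics.main(rest);     break;
            default:
                System.err.println("Unknown command: " + args[0]);
                System.exit(1);
        }
    }
}
```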
