
S3-Trino-Data-Lakehouse Architecture 🌉

Overview 📖

This repository provides an implementation of a modern Data Lakehouse architecture. It integrates Trino (formerly PrestoSQL) for querying data stored in S3-compatible storage, Apache Hive Metastore for metadata management, and Apache Spark for distributed data processing.


Architecture Components 🏗️

Core Services ⚙️

  • PostgreSQL (13): Database for storing Hive Metastore metadata.
  • Hive Metastore (3.1.3): Metadata service for table and schema management.
  • Trino (426): High-performance distributed SQL query engine.
  • Apache Spark (3.4.1): Framework for distributed data processing.
    • Spark Master: Cluster manager.
    • Spark Worker: Execution node.
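
The repository's `docker-compose.yaml` wires these services together. A minimal sketch of what that might look like — the service names (`postgres`, `hive-metastore`, `trinodb`, `spark`) follow the container names used elsewhere in this README, while the images and port mappings are assumptions:

services:
  postgres:
    image: postgres:13
    volumes:
      - ./data:/var/lib/postgresql/data   # persist metastore metadata
  hive-metastore:
    build: .                               # built from the repository Dockerfile
    ports:
      - "9083:9083"
    depends_on:
      - postgres
  trinodb:
    image: trinodb/trino:426
    ports:
      - "9090:8080"                        # expose the Trino Web UI on localhost:9090
    volumes:
      - ./etc/catalog:/etc/trino/catalog
  spark:
    image: bitnami/spark:3.4.1
    ports:
      - "7077:7077"
      - "8080:8080"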

Storage 📦

  • S3-Compatible Object Storage: Primary data lake storage with the following features:
    • Path-style access.
    • SSL/TLS enabled.
    • Compatible with MinIO, AWS S3, and other S3-compatible storage systems.
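
These storage features map onto standard Hadoop S3A client options. A sketch of how they might be set in `conf/spark-defaults.conf` (the endpoint value is a placeholder):

# S3-compatible endpoint (placeholder URL)
spark.hadoop.fs.s3a.endpoint                  https://s3.example.com
# Path-style access, required by MinIO and many S3-compatible stores
spark.hadoop.fs.s3a.path.style.access         true
# Enforce SSL/TLS on all connections
spark.hadoop.fs.s3a.connection.ssl.enabled    true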

Port Configuration 🔌

| Service | Port | Details |
| --- | --- | --- |
| PostgreSQL | 5432 | Metadata database |
| Hive Metastore | 9083 | Metadata service |
| Trino | 9090 | Web UI: http://localhost:9090 |
| Spark Master | 7077 | Cluster manager |
| Spark Web UI | 8080 | Web UI: http://localhost:8080 |
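
Once the stack is running, a quick way to confirm these ports respond (assumes the default mappings above):

# Trino's REST API answers on its HTTP port
curl http://localhost:9090/v1/info

# Spark Web UI
curl -I http://localhost:8080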

Prerequisites 📋

  • Docker and Docker Compose installed.
  • Access to S3-compatible storage (access key and secret key).
  • Sufficient disk space for PostgreSQL data.
  • Network connectivity between the Docker containers and to your S3 endpoint.

Quick Start 🚀

  1. Clone the repository:

    git clone https://github.com/KatGlo/s3-trino-data-lakehouse.git
    cd s3-trino-data-lakehouse
  2. Set up environment variables:

    cp .env.example .env
    # Edit .env with your S3 credentials and endpoints (see the example after these steps)
  3. Start the services:

    docker compose build
    docker compose up -d
  4. Verify the setup:

    # Check running containers
    docker ps
    
    # Connect to Trino
    docker exec -it trinodb trino
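
A sketch of what `.env` might contain — the variable names here are illustrative, so use the keys actually defined in `.env.example`:

# S3-compatible storage connection (placeholder values)
S3_ENDPOINT=https://s3.example.com
S3_ACCESS_KEY=your-access-key
S3_SECRET_KEY=your-secret-key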

Usage Examples 📊

Trino Queries 🛠️

-- Register a Delta Lake table
CALL delta.system.register_table(
    schema_name => 'default',
    table_name => 'mytable',
    table_location => 's3a://my-bucket/path/'
);

-- Query data
SELECT * FROM delta.default.mytable LIMIT 5;
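
Once registered, the table behaves like any other table in the `delta` catalog. A few follow-up queries (the `category` column is hypothetical):

-- List tables in the schema
SHOW TABLES FROM delta.default;

-- Inspect column names and types
DESCRIBE delta.default.mytable;

-- A simple aggregation
SELECT category, count(*) AS cnt
FROM delta.default.mytable
GROUP BY category;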

Spark Processing ⚡

# Connect to the Spark shell from the host
docker exec -it spark /opt/bitnami/spark/bin/spark-shell

// Inside spark-shell, read a Delta table from S3 (Scala)
val df = spark.read.format("delta").load("s3a://my-bucket/path/")
df.show(5)
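
Writing results back to the lake follows the same pattern. A short sketch (the output path is a placeholder):

// Write the DataFrame back as a Delta table, replacing any existing data
df.write
  .format("delta")
  .mode("overwrite")
  .save("s3a://my-bucket/output-path/")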

Directory Structure 📂

├── conf/                  # Configuration files
├── data/                  # PostgreSQL data directory
├── etc/
│   └── catalog/           # Trino catalog configurations
├── jars/                  # Additional JAR files
├── docker-compose.yaml    # Service definitions
├── Dockerfile             # Hive Metastore container definition
└── README.md

Configuration Files 🛠️

Hive Metastore

  • conf/hive-site.xml: Hive configuration.
  • entrypoint.sh: Initialization script.
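
A sketch of the properties `conf/hive-site.xml` typically carries in this setup — the `postgres` and `hive-metastore` host names match the service names used elsewhere in this README, and the `metastore` database name is an assumption:

<configuration>
  <!-- JDBC connection to the PostgreSQL metadata database -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://postgres:5432/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <!-- Thrift endpoint that Trino and Spark connect to -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hive-metastore:9083</value>
  </property>
</configuration>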

Trino

  • etc/catalog/delta.properties: Delta Lake connector configuration.
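
A sketch of what `etc/catalog/delta.properties` might look like for Trino 426 — the endpoint is a placeholder, and the `S3_ACCESS_KEY`/`S3_SECRET_KEY` environment variable names are the illustrative ones from the Quick Start:

connector.name=delta_lake
hive.metastore.uri=thrift://hive-metastore:9083
# S3-compatible storage access (placeholder endpoint, credentials from the environment)
hive.s3.endpoint=https://s3.example.com
hive.s3.path-style-access=true
hive.s3.ssl.enabled=true
hive.s3.aws-access-key=${ENV:S3_ACCESS_KEY}
hive.s3.aws-secret-key=${ENV:S3_SECRET_KEY}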

Spark

  • conf/spark-defaults.conf: Spark configuration.
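
Beyond the S3A options shown under Storage, `conf/spark-defaults.conf` also needs Delta Lake enabled for the `spark.read.format("delta")` calls above to work. A sketch — the package coordinates are an assumption (Delta 2.4.x matches Spark 3.4.x; the repository may instead ship these jars in `jars/`):

# Enable Delta Lake SQL support
spark.sql.extensions                io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog     org.apache.spark.sql.delta.catalog.DeltaCatalog
# Pull the Delta jars at startup if they are not already on the classpath
spark.jars.packages                 io.delta:delta-core_2.12:2.4.0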
