This project implements an ETL (Extract, Transform, Load) pipeline using Python and PostgreSQL for a music streaming app, Sparkify. It processes song and user activity data, transforming it into a set of dimensional tables for easier querying and analysis.
etl.py
: Contains the ETL processes to read JSON logs on user activity and JSON metadata on songs, and load the data into PostgreSQL tables.create_tables.py
: Drops and creates the database and tables.sql_queries.py
: Contains all the SQL queries used in the ETL process.
The database uses a star schema optimized for queries on song play analysis. This includes the following tables:
- songplays - records in log data associated with song plays
- users - users in the app
- songs - songs in music database
- artists - artists in music database
- time - timestamps of records in songplays broken down into specific units
- Python 3.x
- PostgreSQL
- psycopg2
- pandas
The project uses two datasets:
- Song Dataset: JSON files containing metadata about songs and artists.
- Log Dataset: JSON files containing user activity logs from the music streaming app.
- Process song data to populate the songs and artists tables.
- Process log data to populate the time and users tables.
- Use data from both song and log datasets to populate the songplays fact table.