This project demonstrates a modern, end-to-end data pipeline using AWS services with Spotify data. It simulates how a sample dataset from Spotify can be ingested, processed, queried, and visualized, leveraging:
- AWS S3 for storing raw and transformed data
- AWS Glue for building scalable ETL jobs (using Python)
- AWS Athena for serverless querying of large datasets
- AWS QuickSight for intuitive and powerful dashboarding
The focus is on utilizing serverless, pay-as-you-go AWS services to minimize cost and maximize scalability, an ideal approach for real-world data engineering pipelines.
| Tool/Service | Purpose |
|---|---|
| AWS S3 | Raw & curated zones for the data lake |
| AWS Glue | ETL jobs to clean/transform data |
| AWS Athena | Querying data with SQL |
| AWS QuickSight | Interactive dashboards & charts |
| Python | Automation & scripting |
- 🚀 Automated ETL pipeline with AWS Glue (PySpark script)
- 🧹 Data cleaning & transformation from raw Spotify CSV files
- 🗂️ Raw + curated zone architecture on S3
- 🕵️ Schema-on-read via Athena
- 📊 Comprehensive dashboard using QuickSight
- ☁️ Fully serverless for scalability and cost-efficiency
Ever wondered what stories your Spotify listening habits could tell? This project transforms raw Spotify data into powerful insights. We'll construct a modern data pipeline on AWS, using S3 for robust storage, Glue for smart data processing, Athena for querying our music universe, and finally, QuickSight to bring those musical stories to life through vibrant visualizations. Get ready to unlock the power of your playlists!
The overall architecture illustrates the flow of data from ingestion to visualization using integrated AWS services:
The data engineering pipeline processes Spotify data through the following key stages:
- 📥 Ingestion & Raw Storage (Amazon S3):
  - Raw Spotify data (e.g., CSV files for Albums, Artists, Tracks) is ingested and stored in a designated Amazon S3 bucket, serving as the "raw zone" of the data lake.
- ⚙️ ETL Processing (AWS Glue):
  - An AWS Glue ETL job (developed using the AWS Glue Visual ETL interface and detailed in the Python script) extracts data from the raw S3 locations.
  - It performs essential transformations such as:
    - Joining datasets (e.g., Albums with Artists, then with Tracks).
    - Cleaning data (e.g., dropping unnecessary fields, handling nulls).
    - Structuring data for analytics.
  - The detailed visual workflow for this Glue job is shown below:
    Figure: The AWS Glue job graph showing S3 data sources (Albums, Artists, Tracks), join operations, the DropFields transformation, and the final S3 data target.
  - The transformed, curated data is then loaded into a separate Amazon S3 bucket/prefix, designated as the "data warehouse zone."
- 🗺️ Schema Discovery & Cataloging (AWS Glue Crawler):
  - An AWS Glue Crawler runs against the S3 data warehouse zone.
  - It infers the schema of the processed data and creates/updates table definitions in the AWS Glue Data Catalog.
- 🔍 Querying & Analysis (Amazon Athena):
  - Amazon Athena uses metadata from the AWS Glue Data Catalog to execute standard SQL queries directly on the data stored in the S3 data warehouse zone (see the sketch after this list).
  - This enables ad-hoc analysis and data exploration without loading data into a traditional database. Query results can also be saved back to S3.
- 📈 Visualization & Reporting (Amazon QuickSight):
  - Amazon QuickSight connects to Amazon Athena as its data source.
  - It fetches query results to build interactive dashboards, reports, and visualizations, surfacing valuable insights 💡 from the Spotify data.
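To make the cataloging and querying stages concrete, here is a minimal boto3 sketch that starts the Glue Crawler and then runs an ad-hoc SQL query through the Athena API. The crawler name, database, table, and bucket names are placeholders for illustration, not values from this project:

```python
import time
import boto3

# Hypothetical names -- substitute your own crawler, database, table, and buckets.
CRAWLER = "spotify-dwh-crawler"
DATABASE = "spotify_db"
OUTPUT = "s3://your-athena-query-results-bucket/"

glue = boto3.client("glue")
athena = boto3.client("athena")

# Refresh the Data Catalog so Athena sees the latest schema.
glue.start_crawler(Name=CRAWLER)

# Kick off a SQL query against the cataloged table.
qid = athena.start_query_execution(
    QueryString="SELECT COUNT(*) AS track_count FROM spotify_curated",
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT},
)["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```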
The ETL jobs are developed and managed within the AWS Glue Studio environment, which provides a comprehensive visual interface for designing and monitoring data integration workflows.
Figure: The AWS Glue console displaying the "Spotify Project." The "Visual" tab shows the ETL job graph, and the left-hand navigation provides access to various Glue features.
The engine driving the data transformation in this project is the Python script `Spotify Project.py`. This script is designed for execution as an AWS Glue ETL job, leveraging the capabilities of PySpark for efficient, distributed data processing.
📍 Script Location:
- Access the complete script here: `Spotify Project.py`
🚀 Core Responsibilities & Functionality: This script translates the visual ETL design into executable code, performing the following critical operations:
- 🎬 Initialization: Establishes the Spark and AWS Glue execution environment.
- 📥 Data Extraction (Extract): Reads `albums.csv`, `artists.csv`, and `tracks.csv` from S3.
- 🛠️ Data Transformation (Transform): Executes joins, cleaning, and structuring of the data.
- 💾 Data Loading (Load): Writes the final, unified dataset to the S3 data warehouse zone, typically in Parquet format.
💻 Technologies Leveraged within the Script:
- Python 🐍: Primary language for the ETL logic.
- PySpark ✨: Framework for distributed data processing.
- AWS Glue Libraries: For seamless integration with the Glue environment (`GlueContext`, `DynamicFrame`).
💡 Execution Note: This script is intended for deployment as an AWS Glue ETL job, requiring proper IAM permissions and S3 path configurations.
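For orientation, here is a minimal sketch of the kind of Glue job the script implements: read the three CSVs, join them, drop surplus fields, and write Parquet. The bucket paths, join keys, and dropped columns are assumptions for illustration; consult `Spotify Project.py` for the actual logic:

```python
from awsglue.transforms import Join, DropFields
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Boilerplate Glue job setup.
glue_context = GlueContext(SparkContext.getOrCreate())

def read_csv(path):
    """Load a CSV from S3 into a Glue DynamicFrame (header row assumed)."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [path]},
        format="csv",
        format_options={"withHeader": True},
    )

# Extract: raw-zone paths are placeholders.
albums = read_csv("s3://your-raw-bucket/albums.csv")
artists = read_csv("s3://your-raw-bucket/artists.csv")
tracks = read_csv("s3://your-raw-bucket/tracks.csv")

# Transform: join albums to artists, then to tracks (keys are illustrative).
joined = Join.apply(albums, artists, ["artists_id_array"], ["artist_id"])
joined = Join.apply(joined, tracks, ["album_id"], ["album_id"])
curated = DropFields.apply(frame=joined, paths=["artists_id_array"])

# Load: write the curated dataset as Parquet to the data warehouse zone.
glue_context.write_dynamic_frame.from_options(
    frame=curated,
    connection_type="s3",
    connection_options={"path": "s3://your-datawarehouse-bucket/spotify/"},
    format="parquet",
)
```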
This project uses a modified version of the "Spotify Dataset 2023," structured into three separate CSV files for processing. The original dataset is dedicated to the public domain.
- Platform: Kaggle
- Dataset Name: Spotify Dataset 2023
- Uploader: Tony Gordon Jr.
- Kaggle Link: https://www.kaggle.com/datasets/tonygordonjr/spotify-dataset-2023
- License: CC0: Public Domain (The creator has waived all copyright and related rights to this work.)
The dataset for this project is organized into the following three CSV files, ingested into the `raw/` S3 zone:
- 💿 `albums.csv`: Information on Spotify albums.
  - Source File: `data/albums.csv`
  - Key columns: `album_id`, `name`, `release_date`, `total_tracks`, `artists_id_array`
- 🎤 `artists.csv`: Details about Spotify artists.
  - Source File: `data/artists.csv`
  - Key columns: `artist_id`, `name`, `genres`, `popularity`
- 🎶 `tracks.csv`: Attributes for individual Spotify tracks.
  - Source File: `data/tracks.csv`
  - Key columns: `track_id`, `name`, `album_id`, `artists_id_array`, `duration_ms`, `explicit`, `popularity`
Note: Inspect the files via the GitHub links for full column details. These files are joined in AWS Glue.
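If you want to verify the columns locally before wiring up the pipeline, a quick pandas inspection works; the paths assume the repo's `data/` folder:

```python
import pandas as pd

# Print the column names and a sample row for each source file.
for name in ("albums", "artists", "tracks"):
    df = pd.read_csv(f"data/{name}.csv")
    print(f"{name}: {list(df.columns)}")
    print(df.head(1), end="\n\n")
```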
To provide a clearer understanding of the project's components and workflow in action, short video demonstrations have been recorded:
- AWS Glue - ETL Process:
  - 🎬 Watch Demo: [AWS GLUE Overview.mp4](https://github.com/Subhajit-Chowdhury/Spotify_Data-Engineering-using-AWS/blob/main/AWS%20GLUE%20Overview.mp4)
  - Description: Showcases AWS Glue job execution and data transformation.
- Amazon Athena - Querying Data:
  - 🔍 Watch Demo: [Athena Overview.mp4](https://github.com/Subhajit-Chowdhury/Spotify_Data-Engineering-using-AWS/blob/main/Athena%20Overview.mp4)
  - Description: Demonstrates querying the processed data in S3 using Athena.
- Amazon QuickSight - Data Visualization:
  - 📊 Watch Demo: [QuickSight_Dashboard.mp4](https://github.com/Subhajit-Chowdhury/Spotify_Data-Engineering-using-AWS/blob/main/QuickSight_Dashboard.mp4)
  - Description: Walkthrough of the interactive QuickSight dashboard.
The culmination of this data pipeline is an interactive dashboard built in Amazon QuickSight, providing insights into the Spotify dataset.
Key Visualizations Uncover:
- How album popularity is distributed across the music catalog
- Which artists have the highest track counts, and how that aligns with their popularity
- The ebb and flow of track releases across different eras
Dashboard Snapshot:
(Caption: Overview of the main Spotify analytics dashboard, revealing interesting patterns in your music world!)
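To give a flavor of the SQL behind such visuals, a query like the one below could feed the "artists by track count" chart. The table and column names here are illustrative assumptions; match them to what your Glue Crawler registered in the Data Catalog. You can run it with the boto3 helper shown earlier, or paste it straight into the Athena console:

```python
# Illustrative Athena SQL for the "artists by track count" visual.
# Table and column names are assumptions -- check your Glue Data Catalog.
TOP_ARTISTS_SQL = """
SELECT artist_name,
       COUNT(track_id) AS track_count,
       AVG(popularity) AS avg_popularity
FROM spotify_curated
GROUP BY artist_name
ORDER BY track_count DESC
LIMIT 10
"""
```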
To replicate this project, you would typically follow these steps:
- Prerequisites:
  - An AWS account with appropriate permissions.
  - Download the Spotify dataset from Kaggle and split it into `albums.csv`, `artists.csv`, and `tracks.csv`.
- S3 Setup (a boto3 sketch follows this list):
  - Create S3 buckets for the different zones: `your-raw-bucket`, `your-staging-bucket`, `your-datawarehouse-bucket`, `your-athena-query-results-bucket`.
  - Upload the raw CSV files to `your-raw-bucket`.
- AWS Glue Setup:
  - Create an AWS Glue ETL job, providing the `Spotify Project.py` script.
  - Configure the job with appropriate IAM roles, data source paths (pointing to the raw S3 data), and data target paths (pointing to the S3 data warehouse zone).
  - Create an AWS Glue Crawler to scan the S3 data warehouse zone and populate the Glue Data Catalog.
- Amazon Athena Setup:
- Ensure a database is created in Athena (the Crawler usually handles this).
- Verify that tables corresponding to your processed data are available and queryable.
- Amazon QuickSight Setup:
- Connect QuickSight to Athena as a data source.
- Create a new dataset in QuickSight based on your Athena table(s).
- Build your dashboard using the created dataset.
- IAM Roles & Permissions: Ensure all services (Glue, Athena, QuickSight) have the necessary IAM permissions to access S3 and other required resources.
(This is a high-level guide. Specific configurations might vary.)
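For the S3 setup step, a boto3 sketch like the following creates the buckets and lands the raw files. The bucket names are the placeholders from the steps above; real S3 bucket names must be globally unique:

```python
import boto3

s3 = boto3.client("s3")

# Placeholder names from the steps above -- S3 bucket names must be globally unique.
# (Outside us-east-1, pass a CreateBucketConfiguration with your region.)
buckets = [
    "your-raw-bucket",
    "your-staging-bucket",
    "your-datawarehouse-bucket",
    "your-athena-query-results-bucket",
]
for bucket in buckets:
    s3.create_bucket(Bucket=bucket)

# Upload the three raw CSVs into the raw zone.
for name in ("albums", "artists", "tracks"):
    s3.upload_file(f"data/{name}.csv", "your-raw-bucket", f"raw/{name}.csv")
```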
The development of this project was significantly informed and inspired by the invaluable educational content from the DateWithData YouTube channel. Their comprehensive playlists and practical demonstrations provided a strong foundation and clear learning path for building modern data pipelines.
Explore their excellent resources for further learning:
- 📺 YouTube Channel: DateWithData
- ▶️ Relevant Playlist: Real-Time DE Project: AWS to Snowflake
Sincere gratitude to the DateWithData team for generously sharing their expertise and fostering learning within the data engineering community.
This project successfully demonstrates the construction of an end-to-end, serverless data engineering pipeline on AWS for Spotify data analytics. By ingesting raw data, performing robust ETL transformations with AWS Glue and PySpark, enabling ad-hoc querying with Amazon Athena, and delivering insights through Amazon QuickSight, we've showcased a practical approach to turning data into valuable knowledge. This pipeline not only provides a framework for analyzing music trends but also serves as a testament to the power and flexibility of cloud-based data solutions.
This project is licensed under the MIT License. Please see the `LICENSE` file for more details.
The dataset used is under the CC0: Public Domain license, as specified by its Kaggle source.
Contributions, issues, and feature requests are welcome!
Feel free to check the issues page or submit a PR.
- Fork/Clone the Repo
- Replace the sample data with your own Spotify export or streaming history
- Upload it to your own AWS S3 bucket
- Deploy the Glue ETL script (from `/scripts/`)
- Run the Athena queries from the provided notebook
- Connect QuickSight to Athena and build your dashboard!
- Build and deploy your own modern data lake
- Understand AWS Glue, S3, Athena, and QuickSight
- Learn how to design ETL pipelines for analytics
- Practice serverless architecture and data modeling
Created with ❤️ by Subhajit Chowdhury
📧 Email: [email protected]
🔗 LinkedIn: @subhajit-chowdhury
⭐ Give this repo a star if it helped you learn something new!