
🎧 Turning Spotify Data into Insights: Building a Modern Data Pipeline on AWS 🚀

📌 Project Overview

This project demonstrates a modern, end-to-end data pipeline using AWS services with Spotify data. It simulates how a sample dataset from Spotify can be ingested, processed, queried, and visualized, leveraging:

  • Amazon S3 for storing raw and transformed data
  • AWS Glue for building scalable ETL jobs (using Python)
  • Amazon Athena for serverless querying of large datasets
  • Amazon QuickSight for intuitive and powerful dashboarding

The focus is on serverless, pay-as-you-go AWS services that minimize cost and maximize scalability: an ideal approach for real-world data engineering pipelines.

βš™οΈ Tech Stack & Tools

| Tool/Service | Purpose |
| --- | --- |
| Amazon S3 | Raw & curated zones for the data lake |
| AWS Glue | ETL jobs to clean/transform data |
| Amazon Athena | Querying data with SQL |
| Amazon QuickSight | Interactive dashboards & charts |
| Python | Automation & scripting |

📊 Key Features

  • 🔄 Automated ETL Pipeline with AWS Glue (PySpark script)
  • 🧹 Data Cleaning & Transformation of raw Spotify CSV files
  • 📁 Raw + Curated Zone Architecture on S3
  • 🕵️‍♂️ Schema-on-Read via Athena
  • 📈 Comprehensive Dashboard using QuickSight
  • 🔒 Fully serverless for scalability and cost-efficiency

📸 QuickSight Dashboard

Dashboard

πŸ“ Introduction

Ever wondered what stories your Spotify listening habits could tell? This project transforms raw Spotify data into powerful insights. We'll construct a modern data pipeline on AWS, using S3 for robust storage, Glue for smart data processing, Athena for querying our music universe, and finally, QuickSight to bring those musical stories to life through vibrant visualizations. Get ready to unlock the power of your playlists!

πŸ—οΈ Project Architecture

The overall architecture illustrates the flow of data from ingestion to visualization using integrated AWS services:

Project Architecture

🌊 Pipeline Flow & Data Journey

The data engineering pipeline processes Spotify data through the following key stages:

  1. 📥 Ingestion & Raw Storage (Amazon S3):

    • Raw Spotify data (e.g., CSV files for Albums, Artists, Tracks) is ingested and stored in a designated Amazon S3 bucket, serving as the "raw zone" of the data lake.
  2. βš™οΈ ETL Processing (AWS Glue):

    • An AWS Glue ETL job (developed using the AWS Glue Visual ETL interface and detailed in the Python script) extracts data from the raw S3 locations.
    • It performs essential transformations such as:
      • Joining datasets (e.g., Albums with Artists, then with Tracks).
      • Cleaning data (e.g., dropping unnecessary fields, handling nulls).
      • Structuring data for analytics.
    • The detailed visual workflow for this Glue job is shown below:

    AWS Glue ETL Job Diagram
    Figure: The AWS Glue job graph showing S3 data sources (Albums, Artists, Tracks), join operations, the 'DropFields' transformation, and the final S3 data target.

    • The transformed and curated data is then loaded into a different Amazon S3 bucket/prefix, designated as the "data warehouse zone."
  3. πŸ—ΊοΈ Schema Discovery & Cataloging (AWS Glue Crawler):

    • An AWS Glue Crawler runs against the S3 data warehouse zone.
    • It infers the schema of the processed data and creates/updates table definitions in the AWS Glue Data Catalog.
  4. πŸ” Querying & Analysis (Amazon Athena):

    • Amazon Athena utilizes metadata from the AWS Glue Data Catalog to execute standard SQL queries directly on the data stored in the S3 data warehouse zone.
    • This enables ad-hoc analysis and data exploration without the need to load data into a traditional database. Query results can also be saved back to S3 (a minimal query sketch follows this list).
  5. 📊 Visualization & Reporting (Amazon QuickSight):

    • Amazon QuickSight connects to Amazon Athena as its data source.
    • It fetches query results to build interactive dashboards, reports, and visualizations, facilitating valuable insights 💡 into the Spotify data.
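
To make stage 4 concrete, here is a minimal boto3 sketch that submits a SQL query to Athena and directs the results to S3. The region, database, table, and column names (spotify_db, spotify_datawarehouse, artist_name, track_popularity) are illustrative placeholders, not names taken from this repository:

```python
import boto3

# All names below are placeholders: adjust the region, database,
# table, columns, and output bucket to match your own deployment.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="""
        SELECT artist_name, AVG(track_popularity) AS avg_popularity
        FROM spotify_datawarehouse
        GROUP BY artist_name
        ORDER BY avg_popularity DESC
        LIMIT 10
    """,
    QueryExecutionContext={"Database": "spotify_db"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-query-results-bucket/"},
)
print("Query execution ID:", response["QueryExecutionId"])
```

Because Athena is serverless, a query like this runs directly against the files in the S3 data warehouse zone; you pay only for the data scanned.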

🎨 AWS Glue Studio Interface

The ETL jobs are developed and managed within the AWS Glue Studio environment, which provides a comprehensive visual interface for designing and monitoring data integration workflows.

AWS Glue Visual ETL interface for Spotify Project

Figure: The AWS Glue console displaying the "Spotify Project." The "Visual" tab shows the ETL job graph, and the left-hand navigation provides access to various Glue features.

🐍✨ Core ETL Script (Spotify Project.py)

The engine driving the data transformation in this project is the Python script Spotify Project.py. This script is designed for execution as an AWS Glue ETL job, leveraging the capabilities of PySpark for efficient, distributed data processing.

🔗 Script Location:

🚀 Core Responsibilities & Functionality: This script translates the visual ETL design into executable code, performing the following critical operations (a simplified sketch follows the list):

  1. 🎬 Initialization: Establishes the Spark and AWS Glue execution environment.
  2. 📥 Data Extraction (Extract): Reads albums.csv, artists.csv, and tracks.csv from S3.
  3. 🛠️ Data Transformation (Transform): Executes joins, cleaning, and structuring of the data.
  4. 💾 Data Loading (Load): Writes the final, unified dataset to the S3 data warehouse zone, typically in Parquet format.
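
The actual implementation lives in Spotify Project.py; the following is only a minimal sketch of such a job, assuming placeholder bucket names (your-raw-bucket, your-datawarehouse-bucket) and a simplified single-id join between albums and artists. The column renames avoid duplicate name/popularity fields after the joins:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# 1. Initialization: set up the Spark and Glue execution environment.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# 2. Extract: read the raw CSVs from S3 (bucket names are placeholders).
albums = (spark.read.csv("s3://your-raw-bucket/albums.csv", header=True, inferSchema=True)
          .withColumnRenamed("name", "album_name"))
artists = (spark.read.csv("s3://your-raw-bucket/artists.csv", header=True, inferSchema=True)
           .withColumnRenamed("name", "artist_name")
           .withColumnRenamed("popularity", "artist_popularity"))
tracks = (spark.read.csv("s3://your-raw-bucket/tracks.csv", header=True, inferSchema=True)
          .withColumnRenamed("name", "track_name")
          .withColumnRenamed("popularity", "track_popularity"))

# 3. Transform: join the three datasets and drop fields not needed downstream.
#    The real script's join condition may differ; this assumes one artist id per album.
joined = (albums
          .join(artists, albums["artists_id_array"] == artists["artist_id"], "inner")
          .join(tracks.drop("artists_id_array"), "album_id", "inner")
          .drop("artists_id_array"))

# 4. Load: write the curated dataset to the data warehouse zone as Parquet.
joined.write.mode("overwrite").parquet("s3://your-datawarehouse-bucket/spotify/")

job.commit()
```

In the visual job, these same steps correspond to the two join nodes, the 'DropFields' transform, and the S3 target shown in the Glue job graph above.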

💻 Technologies Leveraged within the Script:

  • Python (🐍): Primary language for ETL logic.
  • PySpark (✨): Framework for distributed data processing.
  • AWS Glue Libraries: For seamless integration with the Glue environment (GlueContext, DynamicFrame).

💡 Execution Note: This script is intended for deployment as an AWS Glue ETL job, requiring proper IAM permissions and S3 path configurations.

🧾 Dataset Utilized

This project uses a modified version of the "Spotify Dataset 2023," structured into three separate CSV files for processing. The original dataset is dedicated to the public domain.

🔗 Original Source & Inspiration

📂 Project Dataset Structure

The dataset for this project is organized into the following three CSV files, ingested into the raw/ S3 zone:

  1. 💿 albums.csv: Information on Spotify albums.
    • Source File: data/albums.csv
    • (Key columns: album_id, name, release_date, total_tracks, artists_id_array)
  2. 🎤 artists.csv: Details about Spotify artists.
    • Source File: data/artists.csv
    • (Key columns: artist_id, name, genres, popularity)
  3. 🎶 tracks.csv: Attributes for individual Spotify tracks.
    • Source File: data/tracks.csv
    • (Key columns: track_id, name, album_id, artists_id_array, duration_ms, explicit, popularity)

Note: Inspect files via GitHub links for full column details. These are joined in AWS Glue.
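
Before uploading the CSVs to S3, a quick local sanity check of the key columns can catch referential gaps. This pandas sketch assumes the column names listed above and that artists_id_array holds a single artist id per row (if it holds a list, the check would need to explode it first):

```python
import pandas as pd

# Load the three CSVs using the paths from the dataset section.
albums = pd.read_csv("data/albums.csv")
artists = pd.read_csv("data/artists.csv")
tracks = pd.read_csv("data/tracks.csv")

# Every track should reference a known album, and every album a known artist.
orphan_tracks = ~tracks["album_id"].isin(albums["album_id"])
orphan_albums = ~albums["artists_id_array"].isin(artists["artist_id"])
print(f"Tracks with unknown album_id: {orphan_tracks.sum()}")
print(f"Albums with unknown artist id: {orphan_albums.sum()}")
```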

🚀 Project Demonstrations

To provide a clearer understanding of the project's components and workflow in action, short video demonstrations have been recorded:

📈 QuickSight Dashboard - Spotify Insights

The culmination of this data pipeline is an interactive dashboard built in Amazon QuickSight, providing insights into the Spotify dataset.

Key Visualizations Uncover:

  • How album popularity is distributed across the music catalog!
  • Which artists command the highest track counts, ranked by popularity!
  • The ebb and flow of track releases across different eras!

Dashboard Snapshot:

QuickSight Dashboard - Overview (Caption: Overview of the main Spotify analytics dashboard, revealing interesting patterns in your music world!)

βš™οΈ How to Run / Setup (High-Level)

To replicate this project, you would typically follow these steps:

  1. Prerequisites:
    • An AWS Account with appropriate permissions.
    • Download the Spotify dataset from Kaggle and split it into albums.csv, artists.csv, and tracks.csv.
  2. S3 Setup:
    • Create S3 buckets for different zones: your-raw-bucket, your-staging-bucket, your-datawarehouse-bucket, your-athena-query-results-bucket.
    • Upload the raw CSV files to the your-raw-bucket.
  3. AWS Glue Setup:
    • Create an AWS Glue ETL job, providing the Spotify Project.py script.
    • Configure the job with appropriate IAM roles, data source paths (pointing to raw S3 data), and data target paths (pointing to the S3 data warehouse zone).
    • Create an AWS Glue Crawler to scan the S3 data warehouse zone and populate the Glue Data Catalog (a scripted example appears after this section).
  4. Amazon Athena Setup:
    • Ensure a database is created in Athena (the Crawler usually handles this).
    • Verify that tables corresponding to your processed data are available and queryable.
  5. Amazon QuickSight Setup:
    • Connect QuickSight to Athena as a data source.
    • Create a new dataset in QuickSight based on your Athena table(s).
    • Build your dashboard using the created dataset.
  6. IAM Roles & Permissions: Ensure all services (Glue, Athena, QuickSight) have the necessary IAM permissions to access S3 and other required resources.

(This is a high-level guide. Specific configurations might vary.)
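
As one example of scripting these steps, the crawler from step 3 could be created and started with boto3. The names, role ARN, region, and S3 path below are all placeholders for your own setup:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is a placeholder

# Create a crawler that scans the curated zone and writes table
# definitions into the Glue Data Catalog database used by Athena.
glue.create_crawler(
    Name="spotify-datawarehouse-crawler",
    Role="arn:aws:iam::123456789012:role/YourGlueServiceRole",  # placeholder role ARN
    DatabaseName="spotify_db",
    Targets={"S3Targets": [{"Path": "s3://your-datawarehouse-bucket/spotify/"}]},
)

# Run it once now; re-run after each ETL job to pick up schema changes.
glue.start_crawler(Name="spotify-datawarehouse-crawler")
```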

πŸ™ Acknowledgements & Educational Foundation

The development of this project was significantly informed and inspired by the invaluable educational content from the DateWithData YouTube channel. Their comprehensive playlists and practical demonstrations provided a strong foundation and clear learning path for building modern data pipelines.

Explore their excellent resources for further learning:

Sincere gratitude to the DateWithData team for generously sharing their expertise and fostering learning within the data engineering community.

🏁 Conclusion

This project successfully demonstrates the construction of an end-to-end, serverless data engineering pipeline on AWS for Spotify data analytics. By ingesting raw data, performing robust ETL transformations with AWS Glue and PySpark, enabling ad-hoc querying with Amazon Athena, and delivering insights through Amazon QuickSight, we've showcased a practical approach to turning data into valuable knowledge. This pipeline not only provides a framework for analyzing music trends but also serves as a testament to the power and flexibility of cloud-based data solutions.

📜 License

This project is licensed under the MIT License. Please see the LICENSE file for more details. The dataset used is under the CC0: Public Domain license, as specified by its Kaggle source.

🤝 Contributing

Contributions, issues, and feature requests are welcome!
Feel free to check the issues page or submit a PR.

🛠 How to Use This Project

  1. Fork/Clone the Repo
  2. Replace the sample data with your own Spotify export or streaming history (converted to the same CSV layout)
  3. Upload it to your own AWS S3 bucket
  4. Deploy the Glue ETL script (from /scripts/)
  5. Run Athena queries from the provided notebook
  6. Connect QuickSight to Athena and build your dashboard!

💡 Learning Outcomes

  • Build and deploy your own modern data lake
  • Understand AWS Glue, S3, Athena, and QuickSight
  • Learn how to design ETL pipelines for analytics
  • Practice serverless architecture and data modeling

📬 Contact

Created with ❤️ by Subhajit Chowdhury
📧 Email: [email protected]
πŸ”— LinkedIn: @subhajit-chowdhury

⭐️ Give this repo a star if it helped you learn something new!
