CSV Data Pipeline using AWS (S3, Lambda, Glue, QuickSight)

This project demonstrates a data pipeline built on AWS services to process a movie dataset stored in CSV format. The raw data is uploaded to Amazon S3, preprocessed with AWS Lambda, transformed with AWS Glue, and visualized in Amazon QuickSight. The dataset contains movie attributes such as genre, title, and year, which supports analysis of movie trends and genre distribution.

Architecture Diagram

Technology Used

Programming Language - Python
Scripting Language - SQL
AWS
- Lambda
- S3
- AWS Glue
- IAM
- QuickSight

Project Workflow

Source CSV File

The process begins when I manually upload a CSV file to the source S3 bucket, named "csv-raw-data-bucket". This bucket acts as the initial storage point for raw, unprocessed data.

Lambda Trigger & Preprocessing

The S3 upload event triggers an AWS Lambda function, which:
- Reads the raw CSV file.
- Performs necessary preprocessing.
- Saves the processed output to another S3 bucket, "csv-processed-data-bucket".
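The three steps above could be sketched as a Lambda handler like the one below. This is a minimal illustration, not the repository's actual lambda.py: the README does not say which preprocessing is performed, so trimming whitespace and dropping empty rows are assumptions, as is reusing the input key for the output object.

```python
import csv
import io
from urllib.parse import unquote_plus

# Bucket name taken from the workflow description above.
PROCESSED_BUCKET = "csv-processed-data-bucket"


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def preprocess_csv(text):
    """Assumed cleanup: trim whitespace in every cell and drop empty rows."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        cleaned = [cell.strip() for cell in row]
        if any(cleaned):  # skip blank lines and rows with only empty cells
            writer.writerow(cleaned)
    return out.getvalue()


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above run without AWS access

    s3 = boto3.client("s3")
    for bucket, key in extract_s3_objects(event):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        processed = preprocess_csv(body.decode("utf-8"))
        # Assumption: the processed file keeps the same key as the raw file.
        s3.put_object(
            Bucket=PROCESSED_BUCKET, Key=key, Body=processed.encode("utf-8")
        )
    return {"statusCode": 200}
```

Keeping the parsing and cleanup in plain functions means they can be exercised locally without AWS credentials; only `lambda_handler` touches S3.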

Glue Crawler & ETL Job

Glue Crawler

An AWS Glue Crawler is configured to scan the "csv-processed-data-bucket". When run, it:
- Automatically detects the schema of the processed CSV data.
- Creates or updates a table in the AWS Glue Data Catalog, making the data queryable using services like Glue, Athena, or Redshift Spectrum.
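The README does not say how the crawler run and the ETL job are chained. One possible way to sequence them from a script is sketched below, using the boto3 Glue client calls `start_crawler`, `get_crawler`, and `start_job_run`. The function name, the polling interval, and the crawler and job names in the usage note are hypothetical.

```python
import time


def run_crawler_then_job(glue, crawler_name, job_name, poll_seconds=15, max_polls=80):
    """Start the crawler, wait until it is READY again, then start the ETL job.

    `glue` is expected to behave like boto3.client("glue").
    """
    glue.start_crawler(Name=crawler_name)
    for _ in range(max_polls):
        state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":  # crawler has finished and is idle again
            break
        time.sleep(poll_seconds)
    else:
        # The loop ended without ever seeing READY.
        raise TimeoutError(f"crawler {crawler_name!r} did not finish in time")
    run = glue.start_job_run(JobName=job_name)
    return run["JobRunId"]
```

A call might look like `run_crawler_then_job(boto3.client("glue"), "csv-processed-crawler", "csv-transform-job")`, where both names are placeholders. Passing the client in as an argument keeps the function easy to test with a stub.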

Glue ETL Job

After the crawler finishes, a Glue ETL job is triggered. The job is built in the Visual ETL editor and applies only SQL-based transformations; no PySpark or Python scripts are involved.

Within this job, SQL queries can:
- Run SELECT queries with filters or joins.
- Aggregate data (e.g., GROUP BY operations).
- Create derived columns or remove unnecessary ones.

The transformed dataset is then saved into a third S3 bucket named "csv-final-data-bucket", which stores the final, analysis-ready data.
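To make the kind of SQL transformation described above concrete, the sketch below runs a filter plus GROUP BY query locally with Python's built-in sqlite3 module. It is a stand-in only: Glue runs SQL on Spark rather than SQLite, the query is not the one in Glue-transformation-query.sql, and the sample rows and the `movies` table name are made up for illustration. Only the column names genre, title, and year come from the dataset description.

```python
import sqlite3

# Illustrative sample rows (title, genre, year); not taken from the project dataset.
rows = [
    ("Heat", "Crime", 1995),
    ("Alien", "Sci-Fi", 1979),
    ("Arrival", "Sci-Fi", 2016),
    ("Se7en", "Crime", 1995),
    ("Up", "Animation", 2009),
]

# Filter by year, then count movies per genre.
QUERY = """
SELECT genre, COUNT(*) AS movie_count
FROM movies
WHERE year >= 1990
GROUP BY genre
ORDER BY movie_count DESC, genre
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, genre TEXT, year INTEGER)")
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", rows)
result = conn.execute(QUERY).fetchall()
conn.close()

print(result)  # [('Crime', 2), ('Animation', 1), ('Sci-Fi', 1)]
```

A per-genre count like this is the shape of data a genre bar chart needs: one row per category with a single numeric measure.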

Visualization in QuickSight

Finally, the clean and transformed data in "csv-final-data-bucket" is connected to Amazon QuickSight, AWS’s business intelligence tool.

Using QuickSight, you can:
- Build interactive dashboards and visualizations (bar charts, tables, pie charts, maps, etc.).
- Share insights with stakeholders.
- Enable data-driven decision-making using dashboards built from your CSV input.

The bar chart built for this project shows the number of movies in each genre, allowing easy comparison of genre popularity within the dataset. Its horizontal layout and clear labeling suit this kind of categorical data.

IAM Roles and Policies Used

Service: Lambda

IAM Role Name: LambdaExecutionRole

Key Permissions
- AmazonS3FullAccess
- AWSLambdaBasicExecutionRole

Service: Glue

IAM Role Name: GlueServiceRole

Key Permissions
- AWSGlueServiceRole
- AmazonS3FullAccess
- AWSGlueConsoleFullAccess

Service: QuickSight

IAM Role Name: QuickSightAccessRole

Key Permissions
- AmazonS3FullAccess

Troubles Faced & Solutions

Lambda Function Errors
- Timeout error: Increased the function timeout to 30 seconds in the Lambda configuration.
- No record error: Created a proper test event that mimics an actual S3 event.
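A test event that mimics an S3 upload needs a top-level "Records" list with the bucket name and object key nested under "s3". The helper below builds a minimal event of that shape. The function name and the "movies.csv" key are hypothetical, and a real S3 notification carries more fields than shown here.

```python
import json


def make_s3_put_event(bucket, key, size=0, region="us-east-1"):
    """Build a minimal dict shaped like an S3 ObjectCreated:Put notification."""
    return {
        "Records": [
            {
                "eventVersion": "2.1",
                "eventSource": "aws:s3",
                "awsRegion": region,
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {"name": bucket, "arn": f"arn:aws:s3:::{bucket}"},
                    "object": {"key": key, "size": size},
                },
            }
        ]
    }


# "movies.csv" is a placeholder file name.
event = make_s3_put_event("csv-raw-data-bucket", "movies.csv")
print(json.dumps(event, indent=2))
```

The printed JSON can be pasted into the Lambda console as a test event. An event without a "Records" list is a likely cause of the "no record" error described above.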

Glue Job Errors
- Visual ETL job failed: Caused by missing IAM policies on the Glue role; attaching the required policies resolved it.
- Data not appearing in final bucket: Configured the output path explicitly in the Glue job's "S3 Target" settings.

QuickSight Errors
- QuickSight connection issue: Created and uploaded a "manifest.json" file to the final S3 bucket.
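A QuickSight S3 manifest tells QuickSight which objects to load and how to parse them. The sketch below generates one for the final bucket. It assumes the output files are comma-delimited with a header row and that the whole bucket should be read; the helper name and the empty default prefix are illustrative, and the project's actual manifest.json may differ.

```python
import json


def build_manifest(bucket, prefix=""):
    """Return a QuickSight S3 manifest for headered, comma-delimited CSV files."""
    return {
        "fileLocations": [{"URIPrefixes": [f"s3://{bucket}/{prefix}"]}],
        "globalUploadSettings": {
            "format": "CSV",
            "delimiter": ",",
            "containsHeader": "true",
        },
    }


manifest_text = json.dumps(build_manifest("csv-final-data-bucket"), indent=2)
print(manifest_text)
```

Saving the printed text as manifest.json and pointing the QuickSight S3 data source at it is the step described in the fix above.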

(Note: All AWS resources (S3, Lambda, Glue, QuickSight, etc.) used in this project have been deleted post-completion to prevent unnecessary billing. The project remains available for review via this GitHub repository.)

Hridya2001/aws-csv-data-pipeline