Hometask - ETL Pipeline

Getting Started

Please follow the fork and pull request workflow:

Fork the repository.
Create a new branch for your solution.
Create your pipeline code and documentation.
- Use uv to manage the project. Your submission must include pyproject.toml and uv.lock.
- Use Python 3.13+.
- Follow requirements stated below.
Send a pull request.
- In your PR description, include any notes about your approach, assumptions, or design decisions.

Question: Build a Data Integration Pipeline

Context

You need to build a data pipeline that:

Fetches data from the Wikipedia article on Toronto and follows links to a limited depth (2 levels at most: the starting URL, the URLs it links to, and the URLs those pages link to)
Transforms and validates the data
Loads it into a staging area
Moves clean data to a final destination
Includes basic error handling

Requirements

Part 1: Data Extraction

Create a script that fetches data from the Wikipedia article on Toronto and follows links to a limited depth (2 levels)
Handle rate limiting and retries
Implement basic error handling
Consider circular reference handling for link traversal
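
One possible shape for the traversal described above is a breadth-first crawl with a visited map and a retry wrapper. This is a minimal sketch, not a required design. The names `crawl`, `with_retries`, and `fetch_links` are hypothetical, and `fetch_links` stands in for whatever function downloads a page and returns its outgoing URLs.

```python
import logging
import time
from collections import deque
from typing import Callable

logger = logging.getLogger("pipeline.extract")

LinkFetcher = Callable[[str], list[str]]


def with_retries(func: LinkFetcher, attempts: int = 3, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> LinkFetcher:
    """Wrap func so that failures are retried with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def wrapper(url: str) -> list[str]:
        for attempt in range(attempts):
            try:
                return func(url)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of attempts: let the caller decide
                # Wait base_delay, then 2x, 4x, ... before the next try.
                sleep(base_delay * 2 ** attempt)
        raise RuntimeError("unreachable")

    return wrapper


def crawl(start_url: str, fetch_links: LinkFetcher, max_depth: int = 2) -> dict[str, int]:
    """Breadth-first traversal. Returns a mapping of URL to discovery depth."""
    visited = {start_url: 0}  # doubles as the circular-reference guard
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = visited[url]
        if depth >= max_depth:
            continue  # pages at the depth limit are recorded but not expanded
        try:
            links = fetch_links(url)
        except Exception:
            logger.warning("skipping %s after fetch failure", url)
            continue
        for link in links:
            if link not in visited:  # never enqueue a URL twice
                visited[link] = depth + 1
                queue.append(link)
    return visited
```

Passing the fetcher in as a parameter keeps the traversal testable without network access. In a real run it might be called as `crawl(start, with_retries(fetch_links))`, with a fixed delay between requests inside the fetcher to respect rate limits.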

Part 2: Data Transformation

Clean and transform the data (handle nulls, format dates, validate schemas)

Part 3: Data Loading

Load data to a staging location (can be CSV, JSON, or a local database)
Create a final "production" table/view
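
If a local database is chosen, the staging and production split could look like the following sketch, which uses SQLite from the Python standard library. The table name `staging_pages`, the view name `pages`, and the column set are hypothetical; deduplicating by URL in the view is one design choice among several.

```python
import sqlite3
from collections.abc import Iterable


def load(conn: sqlite3.Connection, rows: Iterable[tuple]) -> None:
    """Append rows to the staging table and expose a deduplicated view."""
    # Staging keeps every row as loaded, including duplicates and nulls.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging_pages "
        "(url TEXT, title TEXT, depth INTEGER, fetched_at TEXT)"
    )
    conn.executemany("INSERT INTO staging_pages VALUES (?, ?, ?, ?)", rows)

    # The "production" view shows one clean row per URL: the shallowest
    # depth at which it was seen and the most recent fetch time.
    conn.execute(
        """
        CREATE VIEW IF NOT EXISTS pages AS
        SELECT url,
               MIN(title) AS title,
               MIN(depth) AS depth,
               MAX(fetched_at) AS fetched_at
        FROM staging_pages
        WHERE url IS NOT NULL AND title IS NOT NULL
        GROUP BY url
        """
    )
    conn.commit()
```

For example, `load(sqlite3.connect("pipeline.db"), rows)` would write to a local file, where `pipeline.db` is a placeholder name. A view keeps the production data derived from staging, so reloading staging never leaves the two out of sync.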

Part 4: Documentation

Add basic logging throughout the pipeline
Create a README with setup instructions
Document data schema and transformations
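
For the logging requirement, one lightweight approach is a context manager that logs the start, end, and duration of each stage. This is a sketch using only the standard `logging` module; the logger name `pipeline` and the helper `log_stage` are hypothetical.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("pipeline")


@contextmanager
def log_stage(name: str):
    """Log when a pipeline stage starts, finishes, or fails."""
    start = time.perf_counter()
    logger.info("stage %s started", name)
    try:
        yield
    except Exception:
        # logger.exception records the traceback along with the message.
        logger.exception("stage %s failed", name)
        raise
    else:
        elapsed = time.perf_counter() - start
        logger.info("stage %s finished in %.2fs", name, elapsed)
```

At the entry point, `logging.basicConfig(level=logging.INFO)` would turn the output on, and each stage could then be wrapped as `with log_stage("extract"): ...`.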

Tech Stack (use what you're comfortable with)

Python (required)
SQL (if using a database)
Any libraries you prefer (pandas, sqlalchemy, requests, etc.)

Deliverables

Your pull request should include:

Working, tested code with all code files
Project configuration files:
- pyproject.toml (managed with uv)
- uv.lock
README.md with:
- Setup instructions
- How to run the pipeline
- Data schema documentation
- Assumptions and design decisions
PR description with a brief explanation of your approach

Performance Metrics

In our production environment, we process data for thousands of creators daily, requiring efficient data extraction pipelines. While you don't need to process at that scale for this assignment, we'd like to understand your pipeline's performance characteristics.

Please include in your PR description:

Links processed per minute: What throughput (links/minute) can your solution achieve?
Brief performance notes: Any observations about bottlenecks or optimization opportunities you identified
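
The throughput figure can be derived from a link count and a wall-clock duration. The helper below is a hypothetical example of that arithmetic, not a prescribed measurement method.

```python
def links_per_minute(links_processed: int, elapsed_seconds: float) -> float:
    """Convert a link count and elapsed wall-clock time into links per minute."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return links_processed * 60.0 / elapsed_seconds
```

For instance, 150 links processed in 30 seconds corresponds to `links_per_minute(150, 30.0)`, which is 300 links per minute. Measuring the elapsed time with `time.perf_counter()` around the whole run keeps waits for rate limiting in the figure.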
