Please follow the fork and pull request workflow:
- Fork the repository.
- Create a new branch for your solution.
- Create your pipeline code and documentation.
- Use `uv` to manage the project; your submission is expected to include `pyproject.toml` and `uv.lock`.
- Use Python 3.13+.
- Follow the requirements stated below.
- Send a pull request.
- In your PR description, include any notes about your approach, assumptions, or design decisions.
You need to build a data pipeline that:
- Fetches data from Wikipedia - Toronto and follows links to a limited depth (2 levels max, i.e. starting URL + linked URLs + their linked URLs)
- Transforms and validates the data
- Loads it into a staging area
- Moves clean data to a final destination
- Includes basic error handling
- Create a script that fetches data from Wikipedia - Toronto and follows links to a limited depth (2 levels)
- Handle rate limiting and retries
- Implement basic error handling
- Consider circular reference handling for link traversal
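
The extraction requirements above (depth-limited link following, retries, rate limiting, and circular reference handling) could be approached along the lines of the sketch below. It is only one possible shape, not a reference solution. The start URL, the one-second delay, the `User-Agent` string, the article-only link filter, and all function names are assumptions made for illustration; only the standard library is used.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import Request, urlopen

WIKI_PREFIX = "https://en.wikipedia.org/wiki/"
START_URL = WIKI_PREFIX + "Toronto"  # assumed starting page
MAX_DEPTH = 2        # start page is depth 0; links are followed 2 levels
DELAY_SECONDS = 1.0  # assumed politeness delay, not a documented limit


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(base_url, html):
    """Return absolute article URLs found in html, with fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        url, _fragment = urldefrag(urljoin(base_url, href))
        # Assumed filter: keep articles, skip namespaces such as "File:".
        if url.startswith(WIKI_PREFIX) and ":" not in url[len(WIKI_PREFIX):]:
            links.append(url)
    return links


def fetch_html(url, retries=3, backoff=1.0, sleep=time.sleep):
    """Fetch a page, retrying with exponential backoff on network errors."""
    headers = {"User-Agent": "toronto-pipeline-example/0.1"}  # hypothetical
    for attempt in range(retries):
        try:
            with urlopen(Request(url, headers=headers), timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except OSError:  # URLError, HTTPError and timeouts are OSErrors
            if attempt == retries - 1:
                raise
            sleep(backoff * 2 ** attempt)


def crawl(start_url, max_depth=MAX_DEPTH, fetch=fetch_html,
          delay=DELAY_SECONDS, sleep=time.sleep):
    """Breadth-first crawl; the visited set prevents circular revisits."""
    visited = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # basic error handling: skip pages that keep failing
        pages[url] = html
        if depth < max_depth:
            for link in extract_links(url, html):
                if link not in visited:
                    visited.add(link)
                    queue.append((link, depth + 1))
        sleep(delay)  # simple rate limiting between requests
    return pages
```

Calling `crawl(START_URL)` would hit the network, so `fetch` and `sleep` are injectable to keep the traversal logic testable offline.
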
- Clean and transform the data (handle nulls, format dates, validate schemas)
- Load data to a staging location (can be CSV, JSON, or a local database)
- Create a final "production" table/view
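
As a rough illustration of the transformation and loading steps above, the sketch below cleans records, loads them into a SQLite staging table, and exposes a deduplicated "production" view. The record fields (`url`, `title`, `fetched_at`, `link_count`), the table and view names, and the rule that naive timestamps are treated as UTC are all hypothetical choices, not part of the assignment.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS staging_pages (
    url TEXT NOT NULL,
    title TEXT NOT NULL,
    fetched_at TEXT NOT NULL,
    link_count INTEGER NOT NULL
);
-- Production view: the most recent staged row for each URL.
CREATE VIEW IF NOT EXISTS production_pages AS
    SELECT s.url, s.title, s.fetched_at, s.link_count
    FROM staging_pages AS s
    WHERE s.fetched_at = (
        SELECT MAX(fetched_at) FROM staging_pages WHERE url = s.url
    );
"""


def clean_record(raw):
    """Return a normalized record, or None if validation fails."""
    url = (raw.get("url") or "").strip()
    title = (raw.get("title") or "").strip()
    if not url or not title:
        return None  # required fields must not be null or empty
    try:
        ts = datetime.fromisoformat(raw.get("fetched_at"))
        link_count = int(raw.get("link_count") or 0)  # null becomes 0
    except (TypeError, ValueError):
        return None  # unparseable date or count
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return {
        "url": url,
        "title": title,
        "fetched_at": ts.astimezone(timezone.utc).isoformat(),
        "link_count": link_count,
    }


def load(conn, raw_records):
    """Stage valid records; return (loaded_count, rejected_count)."""
    conn.executescript(SCHEMA)
    cleaned = [c for c in map(clean_record, raw_records) if c is not None]
    conn.executemany(
        "INSERT INTO staging_pages "
        "VALUES (:url, :title, :fetched_at, :link_count)",
        cleaned,
    )
    conn.commit()
    return len(cleaned), len(raw_records) - len(cleaned)
```

Using a view for the final destination keeps staging append-only; a real submission might instead copy rows into a separate table and record why each rejected row failed.
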
- Add basic logging throughout the pipeline
- Create a README with setup instructions
- Document data schema and transformations
- Python (required)
- SQL (if using a database)
- Any libraries you prefer (pandas, sqlalchemy, requests, etc.)
Your pull request should include:
- Working, tested code with all code files
- Project configuration files:
  - `pyproject.toml` (managed with `uv`)
  - `uv.lock`
- README.md with:
  - Setup instructions
  - How to run the pipeline
  - Data schema documentation
  - Assumptions and design decisions
- PR description with a brief explanation of your approach
In our production environment, we process data for thousands of creators daily, requiring efficient data extraction pipelines. While you don't need to process at that scale for this assignment, we'd like to understand your pipeline's performance characteristics.
Please include in your PR description:
- Links processed per minute: What throughput (links/minute) can your solution achieve?
- Brief performance notes: Any observations about bottlenecks or optimization opportunities you identified
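
One simple way to arrive at the links-per-minute figure is to time a full run and divide, as in the sketch below. The helper names are hypothetical, and `run` stands for any callable that executes the pipeline and returns the number of links it processed.

```python
import time


def links_per_minute(links_processed, elapsed_seconds):
    """Convert a link count and a wall-clock duration into links/minute."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return links_processed * 60.0 / elapsed_seconds


def timed(run):
    """Call run() and return (count, elapsed_seconds, links_per_minute)."""
    start = time.perf_counter()
    count = run()
    elapsed = time.perf_counter() - start
    return count, elapsed, links_per_minute(count, elapsed)
```

For example, 150 links processed in 90 seconds works out to 100 links per minute. Any deliberate delay between requests is included in the elapsed time, so it is worth stating the delay alongside the reported number.
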