Version: 0.1.0
Author: Julian Whiting (j2whitin@gmail.com)
Goviq is an application for scraping Canadian government documents. It was created with downstream RAG pipelines in mind.
- Asynchronous Crawling: Uses `aiohttp` for efficient, non-blocking I/O.
- HTML Parsing: Leverages `BeautifulSoup4` for HTML content extraction.
- Configurable: Define your own crawler subclasses to handle specific sources or data formats.
- Local Caching: Store the fetched or parsed data to JSON files for offline analysis.
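The features above can be pictured with a small, standard-library-only sketch. It is not Goviq's actual code: `BaseCrawler`, `TitleCrawler`, `fake_fetch`, `run_demo`, and the cache file layout are hypothetical names chosen for illustration. The stub fetch coroutine stands in for a real `aiohttp` request, and `html.parser` stands in for BeautifulSoup4.

```python
import asyncio
import json
import tempfile
from html.parser import HTMLParser
from pathlib import Path

# Hypothetical stand-in for the network; a real crawler would issue an
# aiohttp request instead of reading from this dict.
FAKE_PAGES = {
    "https://example.org/a": "<html><title>Bill A</title></html>",
    "https://example.org/b": "<html><title>Bill B</title></html>",
}


async def fake_fetch(url: str) -> str:
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return FAKE_PAGES[url]


class _TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class BaseCrawler:
    """Hypothetical base class: fetch concurrently, parse, cache as JSON."""

    def __init__(self, urls, cache_dir):
        self.urls = list(urls)
        self.cache_dir = Path(cache_dir)

    def parse(self, url: str, html: str) -> dict:
        raise NotImplementedError  # subclasses decide what to extract

    async def crawl(self) -> list:
        # Fetch all pages concurrently; gather preserves input order.
        pages = await asyncio.gather(*(fake_fetch(u) for u in self.urls))
        records = [self.parse(u, h) for u, h in zip(self.urls, pages)]
        # Cache the parsed records to a JSON file for offline analysis.
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        out = self.cache_dir / f"{type(self).__name__}.json"
        out.write_text(json.dumps(records, indent=2), encoding="utf-8")
        return records


class TitleCrawler(BaseCrawler):
    """Source-specific subclass: keeps only each page's title."""

    def parse(self, url: str, html: str) -> dict:
        parser = _TitleParser()
        parser.feed(html)
        return {"url": url, "title": parser.title}


def run_demo() -> list:
    """Crawl the fake pages into a temporary cache directory."""
    with tempfile.TemporaryDirectory() as tmp:
        crawler = TitleCrawler(FAKE_PAGES, tmp)
        return asyncio.run(crawler.crawl())
```

The split between a shared `crawl()` and a per-source `parse()` is one common way to make crawlers configurable through subclassing; Goviq's own class layout may differ.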
- Clone this repository (or download the source).
- Make sure you have a modern version of `pip` and `setuptools`: `pip install --upgrade pip setuptools`
- Install Goviq. If you have a `pyproject.toml`, you can do: `pip install .` or, if you prefer the “development/editable” mode: `pip install -e .`
After installing, the primary way to build the dataset is by running:

    python goviq/crawler_poc.py --output_dir .

- `goviq/crawler_poc.py` is a script that orchestrates the various crawlers to fetch, parse, and save the data.
- `--output_dir .` tells the script to store the resulting data in the current directory.
- You can change the output directory path as needed.
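How a script might accept an `--output_dir` flag can be sketched with `argparse`. This is an assumption about the general shape, not the actual contents of `crawler_poc.py`: `parse_args`, `save_records`, the default value, and the `bills.json` file name are all hypothetical.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse command-line options; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(description="Build the dataset (sketch).")
    parser.add_argument(
        "--output_dir",
        type=Path,
        default=Path("."),  # assumed default; the real script may differ
        help="Directory where the resulting data is stored.",
    )
    return parser.parse_args(argv)


def save_records(records: list, output_dir: Path, name: str = "bills.json") -> Path:
    """Write records as JSON under output_dir, creating it if needed."""
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / name
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return path
```

For example, `parse_args(["--output_dir", "data"]).output_dir` evaluates to `Path("data")`, which can then be passed to `save_records`.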
If you want to call individual crawlers rather than the main script:

    python -m goviq.scrapers.parl_ca

Or programmatically in Python:

    from goviq.scrapers.parl_ca import BillCrawler

    crawler = BillCrawler()
    crawler.crawl()

- Parliament sessions are hardcoded somewhere. It ought to be possible to pass a date range or a list of sessions to parse.
- How to handle different versions of acts?
- I don't know if the local cache env var is still needed. I took a long break from developing this :)
- Update README.md with some info about runtime, dataset size, provenance, etc.
- Fork this repository.
- Create a feature branch (`git checkout -b feature/my-new-feature`).
- Commit your changes (`git commit -am 'Add new feature'`).
- Push to your branch (`git push origin feature/my-new-feature`).
- Create a pull request.