Advanced Scrapy scraper library, built for production scale. Automated using AWS and Celery under the hood.
- Scalable data extraction using Scrapy framework
- Automated task scheduling with Celery
- Distributed computing using AWS infrastructure
- Support for handling JavaScript-rendered websites (see the sketch after this list)
- Customizable scraping pipelines for data processing and storage
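
For the JavaScript-rendered sites mentioned above, one common approach is to render pages through a headless browser before parsing. The snippet below is a minimal sketch assuming the scrapy-playwright plugin; this repository may wire JavaScript support differently.

```python
# settings.py -- route requests through Playwright (assumes scrapy-playwright
# and the Playwright browsers are installed; this plugin is an illustration,
# not a confirmed dependency of this repository).
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

```python
# In a spider: opt a request into browser rendering via its meta dict.
import scrapy


class JsExampleSpider(scrapy.Spider):
    name = "js_example"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com",  # placeholder URL
            meta={"playwright": True},  # render with Playwright before parse()
        )

    def parse(self, response):
        yield {"title": response.css("title::text").get()}
```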
- Clone the repository:

  ```bash
  git clone https://github.com/chibuezedev/Scraprite.git
  ```
- Install the dependencies using pip:

  ```bash
  cd Scraprite
  pip install -r requirements.txt
  ```
- Configure the AWS credentials in `settings.py`:

  ```python
  AWS_ACCESS_KEY_ID = '<your-access-key-id>'
  AWS_SECRET_ACCESS_KEY = '<your-secret-access-key>'
  ```
- Set up the Celery task broker and result backend in `settings.py`:

  ```python
  CELERY_BROKER_URL = 'redis://localhost:6379/0'
  CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'
  ```
- Start the Celery worker:

  ```bash
  celery -A scraper worker --loglevel=info
  ```
- Create a new spider by defining the scraping rules in `spiders/my_spider.py`. Refer to the Scrapy documentation for more information on defining spiders.
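  A minimal sketch of what such a spider can look like; the start URL, CSS selectors, and item fields below are placeholders for illustration, not part of this repository:

  ```python
  # spiders/my_spider.py -- illustrative example; adapt the URL, selectors,
  # and fields to the site you are scraping.
  import scrapy


  class MySpider(scrapy.Spider):
      name = "my_spider"
      start_urls = ["https://quotes.toscrape.com"]

      def parse(self, response):
          # Yield one structured item per quote block on the page.
          for quote in response.css("div.quote"):
              yield {
                  "text": quote.css("span.text::text").get(),
                  "author": quote.css("small.author::text").get(),
              }
          # Follow pagination links, if any.
          next_page = response.css("li.next a::attr(href)").get()
          if next_page:
              yield response.follow(next_page, callback=self.parse)
  ```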
- Customize the data processing and storage pipelines in `pipelines.py` according to your requirements.
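  As one possible storage pipeline, here is a sketch that uploads scraped items to S3 with boto3, reusing the AWS credentials from `settings.py`; the bucket name, JSON layout, and boto3 dependency are assumptions for illustration, not necessarily how this repository stores data:

  ```python
  # pipelines.py -- illustrative S3 storage pipeline (assumes boto3 is available).
  import json

  import boto3
  from itemadapter import ItemAdapter


  class S3StoragePipeline:
      @classmethod
      def from_crawler(cls, crawler):
          # Read the AWS credentials configured in settings.py.
          return cls(
              aws_key=crawler.settings.get("AWS_ACCESS_KEY_ID"),
              aws_secret=crawler.settings.get("AWS_SECRET_ACCESS_KEY"),
          )

      def __init__(self, aws_key, aws_secret):
          self.aws_key = aws_key
          self.aws_secret = aws_secret
          self.items = []

      def open_spider(self, spider):
          self.s3 = boto3.client(
              "s3",
              aws_access_key_id=self.aws_key,
              aws_secret_access_key=self.aws_secret,
          )

      def process_item(self, item, spider):
          # Collect items in memory; they are flushed when the spider closes.
          self.items.append(ItemAdapter(item).asdict())
          return item

      def close_spider(self, spider):
          self.s3.put_object(
              Bucket="my-scraper-bucket",  # placeholder bucket name
              Key=f"{spider.name}/items.json",
              Body=json.dumps(self.items),
          )
  ```

  Enable whichever pipelines you define through the `ITEM_PIPELINES` setting in `settings.py`.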
- Run the scraper using the following command, replacing `my_spider` with the name of your spider:

  ```bash
  scrapy crawl my_spider
  ```
- To schedule tasks automatically, use Celery's task scheduling mechanism (Celery beat). Refer to the Celery documentation for more information on scheduling tasks.
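  As a sketch, assuming the Celery app lives in `scraper/celery.py` (to match the `-A scraper` worker invocation above) and that crawls are triggered through a `run_spider` task; both of these are assumptions for this example rather than guarantees about the repository:

  ```python
  # scraper/celery.py -- illustrative periodic schedule using Celery beat.
  import subprocess

  from celery import Celery
  from celery.schedules import crontab

  app = Celery("scraper", broker="redis://localhost:6379/0")


  @app.task
  def run_spider(spider_name):
      # Run the crawl in a separate process so Scrapy's reactor does not
      # block (or clash with) the Celery worker process.
      subprocess.run(["scrapy", "crawl", spider_name], check=True)


  app.conf.beat_schedule = {
      "crawl-my-spider-hourly": {
          "task": "scraper.celery.run_spider",
          "schedule": crontab(minute=0),  # top of every hour
          "args": ("my_spider",),
      },
  }
  ```

  Run the scheduler alongside the worker with `celery -A scraper beat --loglevel=info`.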
Contributions are welcome! If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request. Please make sure to follow the code of conduct.
This project is licensed under the MIT License. See the LICENSE file for more details.