Scraplite


An advanced Scrapy scraper library built for production scale, automated with AWS and Celery under the hood.

Features

  • Scalable data extraction using the Scrapy framework
  • Automated task scheduling with Celery
  • Distributed computing on AWS infrastructure
  • Support for JavaScript-rendered websites
  • Customizable scraping pipelines for data processing and storage

Installation

  1. Clone the repository:

    git clone https://github.com/chibuezedev/Scraprite.git
  2. Install the dependencies using pip:

    cd Scraprite
    pip install -r requirements.txt
  3. Configure the AWS credentials in settings.py:

    AWS_ACCESS_KEY_ID = '<your-access-key-id>'
    AWS_SECRET_ACCESS_KEY = '<your-secret-access-key>'
  4. Set up the Celery task broker and result backend in settings.py:

    CELERY_BROKER_URL = 'redis://localhost:6379/0'
    CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'
  5. Start the Celery worker:

    celery -A scraper worker --loglevel=info

Usage

  1. Create a new spider by defining its scraping rules in spiders/my_spider.py; a minimal example spider is sketched after this list. You can refer to the Scrapy documentation for more information on defining spiders.

  2. Customize the data processing and storage pipelines in pipelines.py according to your requirements; a minimal pipeline sketch also follows this list.

  3. Run the scraper using the following command:

    scrapy crawl my_spider

    Replace my_spider with the name of your spider.

  4. To schedule tasks automatically, use Celery's task scheduling mechanism; a Celery beat sketch is shown below. Refer to the Celery documentation for more information on scheduling tasks.
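
A minimal spider sketch for spiders/my_spider.py, in case a concrete starting point helps. The start URL and CSS selectors below are illustrative placeholders, not part of this repository, so adapt them to the site you are targeting:

    # spiders/my_spider.py -- minimal sketch; URL and selectors are placeholders.
    import scrapy


    class MySpider(scrapy.Spider):
        name = "my_spider"                    # name used with `scrapy crawl my_spider`
        start_urls = ["https://example.com"]  # placeholder start URL

        def parse(self, response):
            # Extract each item on the page and yield it to the pipelines.
            for row in response.css("div.item"):
                yield {
                    "title": row.css("h2::text").get(),
                    "link": row.css("a::attr(href)").get(),
                }

            # Follow pagination if a "next" link exists.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)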
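
A minimal pipeline sketch for pipelines.py, assuming you want to drop incomplete items and write the rest to a local JSON Lines file; the field name, output path, and class names are assumptions for illustration:

    # pipelines.py -- minimal sketch; the "title" field and items.jl path are
    # illustrative, swap in your own validation and storage (e.g. S3, a database).
    import json

    from scrapy.exceptions import DropItem


    class ValidationPipeline:
        def process_item(self, item, spider):
            # Drop items that are missing the fields we care about.
            if not item.get("title"):
                raise DropItem("missing title")
            return item


    class JsonWriterPipeline:
        def open_spider(self, spider):
            self.file = open("items.jl", "w", encoding="utf-8")

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            self.file.write(json.dumps(dict(item)) + "\n")
            return item

Assuming the project module is named scraper (as in the celery -A scraper command above), these pipelines would be enabled in settings.py with something like:

    ITEM_PIPELINES = {
        "scraper.pipelines.ValidationPipeline": 100,
        "scraper.pipelines.JsonWriterPipeline": 200,
    }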
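
A sketch of automated scheduling with Celery beat, assuming a hypothetical scraper/tasks.py module that wraps the crawl in a Celery task; the task name, module path, and one-hour interval are assumptions, not part of this repository:

    # scraper/tasks.py -- hypothetical task wrapper; module path, task name,
    # and interval are assumptions for illustration.
    import subprocess

    from celery import Celery

    app = Celery("scraper", broker="redis://localhost:6379/0")


    @app.task
    def run_spider(spider_name="my_spider"):
        # Launch the crawl as a subprocess so each run gets a fresh reactor.
        subprocess.run(["scrapy", "crawl", spider_name], check=True)


    # Trigger the crawl every hour via Celery beat.
    app.conf.beat_schedule = {
        "hourly-crawl": {
            "task": "scraper.tasks.run_spider",
            "schedule": 3600.0,
            "args": ("my_spider",),
        },
    }

With this in place, running celery -A scraper beat --loglevel=info alongside the worker from the installation steps would trigger the crawl on the configured schedule.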

Contributing

Contributions are welcome! If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request, and make sure to follow the code of conduct.

License

This project is licensed under the MIT License. See the LICENSE file for more details.