Author: Emily Quinn Finney
Version: 2.0
In this package I implemented an asynchronous, breadth-first web crawler/scraper as part of my time at the Recurse Center (https://www.recurse.com/). The package relies primarily on Python's asyncio and Beautiful Soup libraries.
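As a rough illustration (not this package's actual API), an asynchronous breadth-first crawl might look like the sketch below. The aiohttp client and the function names are assumptions; this README only names asyncio and Beautiful Soup.

```python
import asyncio

import aiohttp
from bs4 import BeautifulSoup


async def fetch_links(session, url):
    """Download one page and return the hyperlinks found on it."""
    async with session.get(url) as response:
        html = await response.text()
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


async def crawl(start_url, max_depth=2):
    """Breadth-first crawl: fetch each depth level concurrently."""
    seen = {start_url}
    frontier = [start_url]
    async with aiohttp.ClientSession() as session:
        for _ in range(max_depth):
            results = await asyncio.gather(
                *(fetch_links(session, url) for url in frontier),
                return_exceptions=True,
            )
            next_frontier = []
            for links in results:
                if isinstance(links, Exception):
                    continue  # skip pages that failed to load or parse
                for link in links:
                    if link.startswith("http") and link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier  # move on to the next depth level
    return seen


if __name__ == "__main__":
    print(asyncio.run(crawl("https://www.recurse.com/")))
```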
While writing this code, I tested two different approaches to object-oriented software design:
- web_crawler_dependency_injection.py: Creates a page-loading object (URLoader) and injects it into the object that scrapes a web page (PageScraper).
- web_crawler_main_class.py: Creates a page-loading object (URLoader) and a page-scraping object (PageScraper), and integrates both into a main class (MainScraper) that controls all high-level logic (both designs are sketched after this list).
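The skeletons below are a hedged sketch of how the two designs differ in shape; the actual URLoader, PageScraper, and MainScraper interfaces in this package may use different method names.

```python
# Hypothetical class skeletons; real method names in this package may differ.


class URLoader:
    """Loads a web page and returns its HTML."""

    async def load(self, url):
        ...  # e.g. an asyncio-based HTTP request


# Design 1 (web_crawler_dependency_injection.py):
# the loader is injected into the scraper, which owns the high-level logic.
class PageScraper:
    def __init__(self, loader):
        self.loader = loader  # dependency injection

    async def scrape(self, url):
        html = await self.loader.load(url)
        ...  # parse html with Beautiful Soup, collect links, keep crawling


# Design 2 (web_crawler_main_class.py):
# loader and scraper are simpler components; a main class wires them
# together and controls the crawl.
class MainScraper:
    def __init__(self, loader, scraper):
        self.loader = loader
        self.scraper = scraper

    async def crawl(self, start_url):
        html = await self.loader.load(start_url)
        links = self.scraper.parse(html)  # here the scraper only parses HTML
        ...  # track visited URLs, enqueue new links, continue breadth-first
```

In both cases the page-loading and page-scraping responsibilities live in separate classes; the designs differ in which object owns the high-level crawl logic.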
Version 1.0 was written recursively and was not implemented asynchronously.
To test the code, you will need pytest installed. Then type the following into the terminal, replacing ${TEST_MODULE} with the name of the module under test:
pytest -v -s test_web_crawler_${TEST_MODULE}.py
Last edited 04/05/2017