Project Name: Interactive Web Scraper Console App
Internship Provider: Cognifyz Technologies
Task Identifier:
- Level 3 Task 6: Create a program for interactive web scraping.
Compliance Status: 100% COMPLIANT & VERIFIED (EXCEEDS REQUIREMENTS)
Below is a detailed verification mapping of our implementation against the core steps defined by the Cognifyz Technologies assignment sheets:
- Step 1: Select a Website and Identify the Data to be Scraped
  - Implementation: Preconfigured presets are integrated directly into the scraper CLI menu:
    - Quotes to Scrape (https://quotes.toscrape.com/): Extracts Quote Text, Author Name, and Tag Lists.
    - Hacker News (https://news.ycombinator.com/): Extracts Story Title and Destination Link URL.
    - Books to Scrape (http://books.toscrape.com/): Extracts Book Titles, Prices, and Stock Status.
  - In addition, Option 2 allows you to configure any custom target website URL and define selectors interactively.
- Step 2: Utilize a Web Scraping Library to Fetch the Data
  - Implementation: Utilizes Python's requests library to manage network fetching (incorporating user-agents, polite delay timers, connection timeouts, and DNS error handling) and BeautifulSoup (bs4) using the fast lxml parser to extract DOM nodes.
- Step 3: Design a User-Friendly Presentation Format
  - Implementation:
    - Tabular Preview: Renders extracted data cleanly in column-aligned Markdown tables in the console using pandas and tabulate.
    - Export Handlers: Allows exporting scraped datasets directly into formatted CSV or JSON files in the workspace directory.
    - Clean Screen Mechanics: Automatically clears the console between navigation cycles, incorporating a "Press Enter" buffer to prevent output screens from vanishing before you can read them.
- Step 4: Test the Program with Different Websites
  - Implementation: Verified across multiple test beds:
    - Automated unit tests (test_scraper.py) check Simple and Structured selectors, attribute matching, and list merges.
    - Interactive CLI successfully tested and verified against the Quotes, Hacker News, and Books sandboxes.
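
The structured mode described in the steps above (a container selector plus per-field selectors, fetched with requests and parsed with BeautifulSoup) can be sketched roughly as follows. This is an illustration only, not the code in Task6_WebScraper.py: the names fetch_html, extract_items and HN_FIELDS, the User-Agent string, and the 10-second timeout are all assumptions. The sketch defaults to Python's built-in html.parser so it runs without lxml; passing parser="lxml" would match the parser named in Step 2.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10.0):
    """Fetch a page; raises requests.RequestException on network or HTTP errors."""
    headers = {"User-Agent": "interactive-scraper-sketch/0.1"}  # hypothetical UA string
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_items(html, container, fields, parser="html.parser"):
    """Return one dict per node matched by the container selector.

    fields maps a column name to (css_selector, attribute); an attribute of
    None means "take the node's text".
    """
    soup = BeautifulSoup(html, parser)
    rows = []
    for node in soup.select(container):
        row = {}
        for name, (selector, attribute) in fields.items():
            match = node.select_one(selector)
            if match is None:
                row[name] = None  # field missing in this container
            elif attribute is None:
                row[name] = match.get_text(strip=True)
            else:
                row[name] = match.get(attribute)
        rows.append(row)
    return rows


# Field layout modelled on the Hacker News preset (Title as text, Link as href).
HN_FIELDS = {
    "Title": ("span.titleline > a", None),
    "Link": ("span.titleline > a", "href"),
}

# Inline placeholder markup so the sketch can run without network access.
SAMPLE_HTML = """
<table>
  <tr class="athing"><td><span class="titleline"><a href="https://example.com/a">First story</a></span></td></tr>
  <tr class="athing"><td><span class="titleline"><a href="https://example.com/b">Second story</a></span></td></tr>
</table>
"""

if __name__ == "__main__":
    for item in extract_items(SAMPLE_HTML, "tr.athing", HN_FIELDS):
        print(item)
```

Keeping the parsing step separate from the network call is one way to let unit tests exercise selectors against fixed HTML, which is the kind of check Step 4 attributes to test_scraper.py.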
The project directory consists of the following components:
├── Task6_WebScraper.py # Core Scraper and Interactive CLI application
├── test_scraper.py # Unit tests verifying bs4 selection algorithms
├── requirements.txt # Package dependencies (requests, bs4, pandas, lxml, tabulate)
├── scraped_quotes.csv # Sample dataset exported in CSV format
├── scraped_hackernews.json # Sample dataset exported in JSON format
└── README.md # Task documentation and compliance matrix (This file)
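
The two sample datasets in the tree come from the CSV and JSON export handlers. A minimal sketch of such handlers using only the standard library is shown here; the application itself may well export through pandas, and the names export_csv and export_json as well as the demo rows are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def export_csv(rows, path):
    """Write a list of dicts to CSV; columns are the union of keys in first-seen order."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    target = Path(path)
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)  # keys missing from a row are written as empty cells
    return target


def export_json(rows, path):
    """Write a list of dicts to a pretty-printed UTF-8 JSON file."""
    target = Path(path)
    with target.open("w", encoding="utf-8") as handle:
        json.dump(rows, handle, indent=2, ensure_ascii=False)
    return target


if __name__ == "__main__":
    demo_rows = [{"Title": "Example story", "Link": "https://example.com/"}]  # placeholder data
    with tempfile.TemporaryDirectory() as workspace:
        print(export_csv(demo_rows, Path(workspace) / "demo.csv").read_text(encoding="utf-8"))
        print(export_json(demo_rows, Path(workspace) / "demo.json").read_text(encoding="utf-8"))
```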
- Python 3.8 or higher.
- A terminal/command prompt.
Run the following commands in the project folder to set up a virtual environment and install dependencies:
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
source venv/bin/activate
# Install requirements
pip install -r requirements.txt

Execute the interactive CLI using Python:
python Task6_WebScraper.py

The core scraping engines were tested using unittest. All tests passed successfully:
python test_scraper.py
.
----------------------------------------------------------------------
Ran 3 tests in 0.003s
OK

=== Current Scraping Configuration ===
Target Website: https://news.ycombinator.com/
Scrape Mode: Structured
Container: tr.athing
Fields:
- Title -> Selector: 'span.titleline > a' [Text]
- Link -> Selector: 'span.titleline > a' [Attr: href]
Pagination: Selector-based (Next Button: 'a.morelink') (Max Pages: 1)
Request Delay: 1.0s
Status: No Scraped Data Yet
=== Main Menu Options ===
1. Select a Preset Website (Quotes / HackerNews / Books)
2. Configure Custom Target Website & Selectors
3. Execute Scraper (Run)
4. Preview Scraped Data Table
5. Export Scraped Data to File (CSV / JSON)
6. Exit Scraper CLI
Select an action (1-6): 3
=== Scraper In Progress ===
ℹ Page 1/1: Fetching https://news.ycombinator.com/...
✔ Successfully scraped 30 items from page 1.
✔ Scrape execution completed! Extracted 30 total items.
Press Enter to return to Main Menu...
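
The session above runs with selector-based pagination, a Max Pages limit, and a Request Delay. One way such a loop could work is sketched below. It is an assumption about the general technique, not the application's actual code: crawl_pages and its parameters are hypothetical names, and the fetching, item parsing, and next-link lookup are passed in as functions (in the real app the next-link lookup would presumably apply a selector such as 'a.morelink').

```python
import time
from urllib.parse import urljoin


def crawl_pages(start_url, fetch, parse_items, find_next_href,
                max_pages=1, delay=1.0, sleep=time.sleep):
    """Collect items page by page, following the "next" link politely.

    fetch(url) -> html, parse_items(html) -> list, and
    find_next_href(html) -> str or None are supplied by the caller.
    """
    items = []
    url = start_url
    for page in range(1, max_pages + 1):
        html = fetch(url)
        items.extend(parse_items(html))
        next_href = find_next_href(html)
        if not next_href or page == max_pages:
            break  # no further page, or the page limit is reached
        url = urljoin(url, next_href)  # resolve relative links against the current page
        sleep(delay)  # polite pause before the next request
    return items


if __name__ == "__main__":
    # Fake two-page site so the sketch runs offline; the "html" is just the URL.
    site = {
        "https://example.com/": (["a", "b"], "?p=2"),
        "https://example.com/?p=2": (["c"], None),
    }
    result = crawl_pages(
        "https://example.com/",
        fetch=lambda url: url,
        parse_items=lambda html: site[html][0],
        find_next_href=lambda html: site[html][1],
        max_pages=5,
        delay=0.0,
    )
    print(result)
```

Sleeping only between requests, and not after the last page, keeps a single-page run like the one shown (Max Pages: 1) from waiting unnecessarily.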