Maheswara660/Web-Scraper

Internship Task 6: Interactive Web Scraping Program

Project Name: Interactive Web Scraper Console App
Internship Provider: Cognifyz Technologies
Task Identifier: Level 3 Task 6 (Create a program for interactive web scraping)
Compliance Status: 100% COMPLIANT & VERIFIED (EXCEEDS REQUIREMENTS)

📋 Assignment Requirements Mapping & Compliance Matrix

The list below maps each step of the Cognifyz Technologies assignment sheet to the part of the implementation that satisfies it:

Level 3 Task 6: Interactive Web Scraping Program (Task6_WebScraper.py)

  • Step 1: Select a Website and Identify the Data to be Scraped

    • Implementation: Three presets are built into the scraper CLI menu:
      1. Quotes to Scrape (https://quotes.toscrape.com/): Extracts Quote Text, Author Name, and Tag Lists.
      2. Hacker News (https://news.ycombinator.com/): Extracts Story Title and Destination Link URL.
      3. Books to Scrape (http://books.toscrape.com/): Extracts Book Titles, Prices, and Stock Status.
      • In addition, main menu option 2 lets you enter any custom target URL and define its selectors interactively.
  • Step 2: Utilize a Web Scraping Library to Fetch the Data

    • Implementation: Uses the requests library for network fetching (custom User-Agent header, polite delays between requests, connection timeouts, and DNS error handling) and BeautifulSoup (bs4) with the lxml parser to extract DOM nodes.
  • Step 3: Design a User-Friendly Presentation Format

    • Implementation:
      • Tabular Preview: Renders extracted data cleanly in column-aligned Markdown tables in the console using pandas and tabulate.
      • Export Handlers: Allows exporting scraped datasets directly into formatted CSV or JSON files in the workspace directory.
      • Clean Screen Mechanics: Clears the console between menu screens, but first waits at a "Press Enter" prompt so that output does not vanish before you can read it.
  • Step 4: Test the Program with Different Websites

    • Implementation: Verified across multiple test beds:
      • Automated unit tests (test_scraper.py) check Simple and Structured selectors, attribute matching, and list merges.
      • The interactive CLI was tested against the Quotes to Scrape, Hacker News, and Books to Scrape presets.
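
Steps 1 and 2 above can be illustrated with a minimal sketch. This is not the code in Task6_WebScraper.py: the function names fetch_html and extract_items, the User-Agent string, and the (selector, attribute) field format are assumptions made for illustration, and the sketch uses Python's built-in html.parser rather than lxml to avoid an extra dependency.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical User-Agent string; the real application may send a different one.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; demo-scraper/0.1)"}


def fetch_html(url, timeout=10.0):
    """Return the page body as text, or None if the request fails."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # Covers timeouts, DNS/connection failures, and HTTP error statuses.
        print(f"Request failed: {exc}")
        return None
    return response.text


def extract_items(html, container, fields):
    """Structured mode: one dict per container node.

    fields maps a column name to (css_selector, attribute_or_None).
    With attribute None the node's text is taken, otherwise that attribute.
    """
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for node in soup.select(container):
        row = {}
        for name, (selector, attr) in fields.items():
            match = node.select_one(selector)
            if match is None:
                row[name] = None
            elif attr is None:
                row[name] = match.get_text(strip=True)
            else:
                row[name] = match.get(attr)
        items.append(row)
    return items
```

With the Hacker News preset shown later in this README, the call would look roughly like extract_items(html, "tr.athing", {"Title": ("span.titleline > a", None), "Link": ("span.titleline > a", "href")}).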

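The preview and export behaviour described in Step 3 could be approximated as follows. The helper names preview_table and export_items are hypothetical, and choosing the output format from the file extension is an assumption; the actual CLI asks for the format through its menu.

```python
import json
from pathlib import Path

import pandas as pd


def preview_table(items, max_rows=10):
    """Return the first rows as a Markdown table (requires the tabulate package)."""
    return pd.DataFrame(items).head(max_rows).to_markdown(index=False)


def export_items(items, path):
    """Write the scraped rows to CSV or JSON, chosen by file extension."""
    path = Path(path)
    frame = pd.DataFrame(items)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        frame.to_csv(path, index=False)
    elif suffix == ".json":
        # orient="records" yields a list of objects, one per scraped item.
        frame.to_json(path, orient="records", indent=2)
    else:
        raise ValueError(f"Unsupported export format: {path.suffix!r}")
    return path
```

Going through a DataFrame keeps the preview and both export formats consistent, because all three are produced from the same column layout.
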
📁 File Structure

The project directory consists of the following components:

├── Task6_WebScraper.py      # Core Scraper and Interactive CLI application
├── test_scraper.py          # Unit tests verifying bs4 selection algorithms
├── requirements.txt         # Package dependencies (requests, bs4, pandas, lxml, tabulate)
├── scraped_quotes.csv       # Sample dataset exported in CSV format
├── scraped_hackernews.json  # Sample dataset exported in JSON format
└── README.md                # Task documentation and compliance matrix (This file)

🚀 Installation & Usage Guide

Prerequisites

  • Python 3.8 or higher.
  • A terminal/command prompt.

1. Setup Virtual Environment

Run the following commands in the project folder to set up a virtual environment and install dependencies:

# Create virtual environment
python3 -m venv venv

# Activate virtual environment (on Windows: venv\Scripts\activate)
source venv/bin/activate

# Install requirements
pip install -r requirements.txt

2. Run the Scraper Application

Execute the interactive CLI using Python:

python Task6_WebScraper.py

🧪 Verification Log

Unit Test Execution

The core scraping engines were tested using unittest. All tests passed successfully:

python test_scraper.py
...
----------------------------------------------------------------------
Ran 3 tests in 0.003s

OK
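
For readers who have not opened test_scraper.py, tests of this kind might look like the sketch below. The class name, sample HTML, and test names are hypothetical and do not reproduce the repository's actual tests; they only show how simple selectors, attribute matching, and structured containers can be checked offline against a fixed HTML string.

```python
import unittest

from bs4 import BeautifulSoup

# Hypothetical fixture loosely modelled on the Quotes to Scrape markup.
SAMPLE_HTML = """
<div class="quote"><span class="text">Quote one</span>
  <small class="author">Ann</small><a href="/author/ann">(about)</a></div>
<div class="quote"><span class="text">Quote two</span>
  <small class="author">Bob</small><a href="/author/bob">(about)</a></div>
"""


class SelectorSketchTest(unittest.TestCase):
    def setUp(self):
        self.soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

    def test_simple_selector_returns_text(self):
        texts = [n.get_text(strip=True) for n in self.soup.select("span.text")]
        self.assertEqual(texts, ["Quote one", "Quote two"])

    def test_attribute_matching(self):
        links = [a.get("href") for a in self.soup.select("div.quote a")]
        self.assertEqual(links, ["/author/ann", "/author/bob"])

    def test_structured_container(self):
        rows = [
            {
                "text": q.select_one("span.text").get_text(strip=True),
                "author": q.select_one("small.author").get_text(strip=True),
            }
            for q in self.soup.select("div.quote")
        ]
        self.assertEqual(
            rows,
            [
                {"text": "Quote one", "author": "Ann"},
                {"text": "Quote two", "author": "Bob"},
            ],
        )
```

Saved to a file, a test case like this can be run with python -m unittest followed by the file name.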

Sample CLI Run Output (Hacker News Preset)

=== Current Scraping Configuration ===
  Target Website: https://news.ycombinator.com/
  Scrape Mode:    Structured
  Container:      tr.athing
  Fields:
    - Title -> Selector: 'span.titleline > a' [Text]
    - Link -> Selector: 'span.titleline > a' [Attr: href]
  Pagination:     Selector-based (Next Button: 'a.morelink') (Max Pages: 1)
  Request Delay:  1.0s
  Status:         No Scraped Data Yet

=== Main Menu Options ===
1. Select a Preset Website (Quotes / HackerNews / Books)
2. Configure Custom Target Website & Selectors
3. Execute Scraper (Run)
4. Preview Scraped Data Table
5. Export Scraped Data to File (CSV / JSON)
6. Exit Scraper CLI

Select an action (1-6): 3

=== Scraper In Progress ===
ℹ Page 1/1: Fetching https://news.ycombinator.com/...
✔ Successfully scraped 30 items from page 1.
✔ Scrape execution completed! Extracted 30 total items.

Press Enter to return to Main Menu...
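
The "Pagination" and "Request Delay" lines in the configuration above suggest a loop that follows a next-page link until a page limit is reached. A minimal sketch of that idea, assuming a hypothetical crawl_pages function and a caller-supplied fetch_page callable (for example, a thin wrapper around requests.get), might be:

```python
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def crawl_pages(start_url, next_selector, fetch_page, max_pages=1, delay=1.0):
    """Follow the 'next' link up to max_pages times.

    fetch_page is any callable that takes a URL and returns HTML text.
    Returns a list of (url, html) tuples in visit order.
    """
    pages = []
    url = start_url
    while url and len(pages) < max_pages:
        html = fetch_page(url)
        pages.append((url, html))

        # Locate the next-page link, e.g. 'a.morelink' for Hacker News.
        next_link = BeautifulSoup(html, "html.parser").select_one(next_selector)
        href = next_link.get("href") if next_link else None
        # Resolve relative links such as '?p=2' against the current URL.
        url = urljoin(url, href) if href else None

        # Polite delay, applied only when another request will follow.
        if url and len(pages) < max_pages:
            time.sleep(delay)
    return pages
```

Passing the fetcher in as a parameter keeps the loop testable without network access. With max_pages=1, as in the run above, only the first page is fetched and no delay is applied.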

About

A Python-based interactive console web scraping application featuring multi-page traversal, structured container parsing, tabular previews, and CSV/JSON export handlers.
