Internship Task 6: Interactive Web Scraping Program

Project Name: Interactive Web Scraper Console App
Internship Provider: Cognifyz Technologies
Task Identifier: Level 3 Task 6: Create a program for interactive web scraping.
Compliance Status: 100% COMPLIANT & VERIFIED (EXCEEDS REQUIREMENTS)

📋 Assignment Requirements Mapping & Compliance Matrix

The following maps each core step in the Cognifyz Technologies assignment sheet to its implementation:

Level 3 Task 6: Interactive Web Scraping Program (`Task6_WebScraper.py`)

Step 1: Select a Website and Identify the Data to be Scraped
- Implementation: Preconfigured presets are integrated directly into the scraper CLI menu:
  1. Quotes to Scrape (https://quotes.toscrape.com/): Extracts Quote Text, Author Name, and Tag Lists.
  2. Hacker News (https://news.ycombinator.com/): Extracts Story Title and Destination Link URL.
  3. Books to Scrape (http://books.toscrape.com/): Extracts Book Titles, Prices, and Stock Status.
- In addition, main menu option 2 lets you configure any custom target website URL and define selectors interactively.
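
One way to picture how such presets could be stored is a small registry keyed by name. This is a minimal sketch, not the code in `Task6_WebScraper.py`: the names `FieldSpec`, `Preset`, `PRESETS`, and `get_preset` are hypothetical. The Hacker News selectors are taken from the sample configuration in the Verification Log; the Quotes selectors are assumptions about that site's markup.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class FieldSpec:
    """One column to extract: a CSS selector plus an optional attribute."""
    selector: str
    attr: Optional[str] = None  # None means "use the element's text"


@dataclass(frozen=True)
class Preset:
    """Hypothetical container for one preset's scraping configuration."""
    url: str
    container: str
    fields: Dict[str, FieldSpec] = field(default_factory=dict)


PRESETS: Dict[str, Preset] = {
    # Selectors match the sample CLI configuration shown in this README.
    "hackernews": Preset(
        url="https://news.ycombinator.com/",
        container="tr.athing",
        fields={
            "Title": FieldSpec("span.titleline > a"),
            "Link": FieldSpec("span.titleline > a", attr="href"),
        },
    ),
    # Assumed selectors for quotes.toscrape.com; verify against the live page.
    "quotes": Preset(
        url="https://quotes.toscrape.com/",
        container="div.quote",
        fields={
            "Quote": FieldSpec("span.text"),
            "Author": FieldSpec("small.author"),
            "Tags": FieldSpec("a.tag"),
        },
    ),
}


def get_preset(name: str) -> Preset:
    """Look up a preset by name, ignoring case and surrounding spaces."""
    return PRESETS[name.strip().lower()]
```

A custom configuration entered through the menu could be represented by building a `Preset` from the values typed in, so that presets and custom targets share one code path.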
Step 2: Utilize a Web Scraping Library to Fetch the Data
- Implementation: Uses Python's requests library for network fetching (custom User-Agent header, polite delays between requests, connection timeouts, and DNS error handling) and BeautifulSoup (bs4) with the lxml parser to extract DOM nodes.
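
The fetching behaviour described above might look roughly like the sketch below. The function names `fetch_html` and `parse_html`, the User-Agent string, and the default delay and timeout are assumptions for illustration, not the project's actual code. The sketch uses Python's built-in `html.parser` so it runs without extra packages; the project itself uses `lxml`, which can be swapped in by passing `"lxml"` instead.

```python
import time
from typing import Optional

import requests
from bs4 import BeautifulSoup

# Assumed header value; any descriptive User-Agent string works here.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; InteractiveScraper/1.0)"}


def fetch_html(url: str, delay: float = 1.0, timeout: float = 10.0) -> Optional[str]:
    """Return the page body as text, or None if the request fails."""
    time.sleep(delay)  # polite pause before every request
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.Timeout:
        print(f"Timed out after {timeout}s: {url}")
    except requests.exceptions.ConnectionError:
        # Raised for DNS lookup failures and refused connections.
        print(f"Could not connect to {url}")
    except requests.exceptions.HTTPError as exc:
        print(f"HTTP error: {exc}")
    else:
        return response.text
    return None


def parse_html(html: str) -> BeautifulSoup:
    """Parse markup into a searchable tree (swap in "lxml" if installed)."""
    return BeautifulSoup(html, "html.parser")
```

Returning `None` on failure keeps the interactive menu alive after a bad URL; the caller can report the error and return to the main menu instead of crashing.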
Step 3: Design a User-Friendly Presentation Format
- Implementation:
  - Tabular Preview: Renders extracted data cleanly in column-aligned Markdown tables in the console using pandas and tabulate.
  - Export Handlers: Allows exporting scraped datasets directly into formatted CSV or JSON files in the workspace directory.
  - Clean Screen Mechanics: Automatically clears the console between navigation cycles, incorporating a "Press Enter" buffer to prevent output screens from vanishing before you can read them.
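
The presentation layer can be sketched with pandas as follows. The helper names (`to_frame`, `preview`, `export`, `pause_and_clear`) are hypothetical and chosen for illustration. `DataFrame.to_markdown` relies on the `tabulate` package being installed, which matches the dependencies listed in `requirements.txt`.

```python
import os
from typing import Dict, List

import pandas as pd


def to_frame(rows: List[Dict[str, object]]) -> pd.DataFrame:
    """Turn a list of scraped row dictionaries into a DataFrame."""
    return pd.DataFrame(rows)


def preview(df: pd.DataFrame, limit: int = 10) -> str:
    """Render the first rows as a Markdown table (requires tabulate)."""
    return df.head(limit).to_markdown(index=False)


def export(df: pd.DataFrame, path: str) -> None:
    """Write the data as CSV or JSON, chosen by the file extension."""
    lower = path.lower()
    if lower.endswith(".csv"):
        df.to_csv(path, index=False)
    elif lower.endswith(".json"):
        df.to_json(path, orient="records", indent=2, force_ascii=False)
    else:
        raise ValueError("Unsupported format: use a .csv or .json file name")


def pause_and_clear() -> None:
    """Wait for Enter, then clear the console on Windows or Unix."""
    input("Press Enter to return to Main Menu...")
    os.system("cls" if os.name == "nt" else "clear")
```

For example, `print(preview(to_frame(rows)))` would show the table in the console, and `export(to_frame(rows), "scraped_quotes.csv")` would write a file like the sample dataset in this repository.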
Step 4: Test the Program with Different Websites
- Implementation: Verified across multiple test beds:
  - Automated unit tests (test_scraper.py) check Simple and Structured selectors, attribute matching, and list merges.
  - The interactive CLI was tested against the Quotes, Hacker News, and Books presets.
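
Tests of this kind can run entirely offline against a small HTML fixture. The sketch below is an illustration of the three areas named above (simple selection, attribute matching, and list merging); it is not the content of `test_scraper.py`, and the fixture, class name, and `run_tests` helper are assumptions.

```python
import unittest

from bs4 import BeautifulSoup

# Hypothetical fixture that imitates one quote card.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">Hello</span>
  <small class="author">Ada</small>
  <a class="tag" href="/tag/a">a</a>
  <a class="tag" href="/tag/b">b</a>
</div>
"""


class SelectorTests(unittest.TestCase):
    def setUp(self):
        self.soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

    def test_simple_selector_returns_text(self):
        texts = [el.get_text(strip=True) for el in self.soup.select("span.text")]
        self.assertEqual(texts, ["Hello"])

    def test_attribute_matching(self):
        hrefs = [a["href"] for a in self.soup.select("a.tag")]
        self.assertEqual(hrefs, ["/tag/a", "/tag/b"])

    def test_list_merge(self):
        # Several matches inside one container collapse into a single cell.
        container = self.soup.select_one("div.quote")
        tags = ", ".join(a.get_text(strip=True) for a in container.select("a.tag"))
        self.assertEqual(tags, "a, b")


def run_tests() -> unittest.TestResult:
    """Run the suite programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SelectorTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the fixture is an inline string, these tests need no network access and stay fast and repeatable.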

📁 File Structure

The project directory consists of the following components:

├── Task6_WebScraper.py      # Core Scraper and Interactive CLI application
├── test_scraper.py          # Unit tests verifying bs4 selection algorithms
├── requirements.txt         # Package dependencies (requests, bs4, pandas, lxml, tabulate)
├── scraped_quotes.csv       # Sample dataset exported in CSV format
├── scraped_hackernews.json  # Sample dataset exported in JSON format
└── README.md                # Task documentation and compliance matrix (This file)

🚀 Installation & Usage Guide

Prerequisites

- Python 3.8 or higher.
- A terminal/command prompt.

1. Setup Virtual Environment

Run the following commands in the project folder to set up a virtual environment and install dependencies:

# Create virtual environment
python3 -m venv venv

# Activate virtual environment (macOS/Linux)
source venv/bin/activate
# On Windows, run instead: venv\Scripts\activate

# Install requirements
pip install -r requirements.txt

2. Run the Scraper Application

Execute the interactive CLI using Python:

python Task6_WebScraper.py

🧪 Verification Log

Unit Test Execution

The core scraping engines were tested using unittest. All tests passed successfully:

python test_scraper.py
...
----------------------------------------------------------------------
Ran 3 tests in 0.003s

OK

Sample CLI Run Output (Hacker News Preset)

=== Current Scraping Configuration ===
  Target Website: https://news.ycombinator.com/
  Scrape Mode:    Structured
  Container:      tr.athing
  Fields:
    - Title -> Selector: 'span.titleline > a' [Text]
    - Link -> Selector: 'span.titleline > a' [Attr: href]
  Pagination:     Selector-based (Next Button: 'a.morelink') (Max Pages: 1)
  Request Delay:  1.0s
  Status:         No Scraped Data Yet

=== Main Menu Options ===
1. Select a Preset Website (Quotes / HackerNews / Books)
2. Configure Custom Target Website & Selectors
3. Execute Scraper (Run)
4. Preview Scraped Data Table
5. Export Scraped Data to File (CSV / JSON)
6. Exit Scraper CLI

Select an action (1-6): 3

=== Scraper In Progress ===
ℹ Page 1/1: Fetching https://news.ycombinator.com/...
✔ Successfully scraped 30 items from page 1.
✔ Scrape execution completed! Extracted 30 total items.

Press Enter to return to Main Menu...
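
The Structured mode shown in this run pairs a container selector with per-field selectors. A minimal sketch of that idea follows, assuming hypothetical helper names `extract_items` and `next_page_url`; the selectors mirror the configuration printed above, and `html.parser` stands in for `lxml` so the snippet has no extra dependency.

```python
from typing import Dict, List, Optional, Tuple
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Field name -> (CSS selector, attribute); attribute None means element text.
FieldMap = Dict[str, Tuple[str, Optional[str]]]

HN_FIELDS: FieldMap = {
    "Title": ("span.titleline > a", None),
    "Link": ("span.titleline > a", "href"),
}


def extract_items(html: str, container: str, fields: FieldMap) -> List[dict]:
    """Build one row per container element, one column per field."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for node in soup.select(container):
        row = {}
        for name, (selector, attr) in fields.items():
            match = node.select_one(selector)
            if match is None:
                row[name] = None  # field missing in this container
            elif attr:
                row[name] = match.get(attr)
            else:
                row[name] = match.get_text(strip=True)
        items.append(row)
    return items


def next_page_url(html: str, base_url: str, next_selector: str) -> Optional[str]:
    """Resolve the "next" link to an absolute URL, or None on the last page."""
    link = BeautifulSoup(html, "html.parser").select_one(next_selector)
    if link is None or not link.get("href"):
        return None
    return urljoin(base_url, link["href"])
```

A pagination loop would call `extract_items` on each page, then follow `next_page_url(html, url, "a.morelink")` until it returns `None` or the configured maximum page count is reached.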
