This project demonstrates a multi-stage Docker setup where:
- A Node.js script uses Puppeteer and Chromium to scrape the title and first `<h1>` of any website.
- A Python Flask server then serves the scraped data as a JSON API.
Scraper Stage (Node.js + Puppeteer):
- Accepts a URL from the `SCRAPE_URL` environment variable
- Launches headless Chromium
- Extracts the page title and first heading
- Saves the result as `scraped_data.json`
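The scraper itself is `scrape.js` (Node.js + Puppeteer) and is not reproduced here. As a rough Python analogue of the same steps (read `SCRAPE_URL`, fetch the page, extract the title and first `<h1>`, write `scraped_data.json`), using only the standard library and without the JavaScript rendering that headless Chromium provides:

```python
# Hypothetical Python analogue of scrape.js; standard library only.
# The real scraper uses Puppeteer + headless Chromium and renders JS.
import json
import os
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleH1Parser(HTMLParser):
    """Collect the <title> text and the text of the first <h1>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.heading = ""
        self._open = []  # stack of the tags we care about

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        if not self._open:
            return
        text = data.strip()
        if self._open[-1] == "title" and not self.title:
            self.title = text
        elif self._open[-1] == "h1" and not self.heading:
            self.heading = text


def extract(html):
    """Return {'title': ..., 'heading': ...} for an HTML document."""
    parser = TitleH1Parser()
    parser.feed(html)
    return {"title": parser.title, "heading": parser.heading}


if __name__ == "__main__":
    url = os.environ.get("SCRAPE_URL")
    if url:  # only fetch when a URL was actually provided
        html = urlopen(url).read().decode("utf-8", errors="replace")
        with open("scraped_data.json", "w") as f:
            json.dump(extract(html), f)
```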
Web Server Stage (Python Flask):
- Reads `scraped_data.json`
- Serves it on port `5000` as a JSON response
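The description above can be sketched as a minimal `server.py` (the route path is an assumption based on this README, not the project's actual source):

```python
# Minimal sketch of the web server stage (route path is an assumption).
import json

from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/")
def scraped():
    # Serve the JSON file written by the scraper stage.
    with open("scraped_data.json") as f:
        return jsonify(json.load(f))
```

In the container the app would then be started listening on all interfaces, e.g. `app.run(host="0.0.0.0", port=5000)`, which is what makes the `-p 5000:5000` port mapping below reachable from the host.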
In the root project directory (where your Dockerfile is), run:
```bash
docker build -t web-scraper .
```
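The Dockerfile itself is not reproduced in this README. One possible shape for the two stages, consistent with the requirements summary below (stage names, package setup, and the combined start command are assumptions, not the project's actual file), is:

```dockerfile
# Sketch only -- the project's actual Dockerfile may differ.

# Stage 1: install Node dependencies (Puppeteer pulls in Chromium here)
FROM node:18-slim AS scraper
WORKDIR /app
COPY package.json scrape.js ./
RUN npm install

# Stage 2: Python runtime that also carries the scraper.
# Because SCRAPE_URL is only known at `docker run` time, the scrape must
# happen at container start, so the final image needs Node and a browser too.
FROM python:3.10-slim
RUN apt-get update && apt-get install -y nodejs chromium \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY --from=scraper /app /app
COPY requirements.txt server.py ./
RUN pip install -r requirements.txt
EXPOSE 5000
# Run the scraper first, then serve the result.
CMD ["sh", "-c", "node scrape.js && python server.py"]
```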
To run the container and scrape a URL, use:

```bash
docker run -p 5000:5000 -e SCRAPE_URL="https://www.wikipedia.org" web-scraper
```

You can replace the URL with any valid webpage.
This will start a Flask server that hosts the scraped output.
Access the Scraped Data
Once the container is running, open your browser and visit:
http://localhost:5000

You'll see output like:

```json
{
  "title": "Wikipedia",
  "heading": "Wikipedia The Free Encyclopedia"
}
```

Stopping the Container
If running interactively: press Ctrl + C
Or list and stop it manually:
```bash
docker ps
docker stop <container_id>
```

Project Structure
```
.
├── Dockerfile
├── scrape.js
├── server.py
├── scraped_data.json   (auto-generated)
├── package.json
├── requirements.txt
└── README.md
```
Requirements Summary
- Node.js 18-slim + Puppeteer + Chromium ✅
- Python 3.10-slim + Flask ✅
- Multi-stage Docker build ✅
- Accepts dynamic input via environment variable ✅
- Serves data over HTTP as JSON ✅
Done by:
Jahnavi Veliganti
DevOps Assignment | ExactSpace Technologies