This project demonstrates a multi-stage Docker setup where:
- A Node.js script uses Puppeteer and Chromium to scrape the title and first `<h1>` of any website.
- A Python Flask server then serves the scraped data as a JSON API.
Scraper Stage (Node.js + Puppeteer):
- Accepts a URL from the SCRAPE_URL environment variable
- Launches headless Chromium
- Extracts the page title and first heading
- Saves the result as scraped_data.json
Web Server Stage (Python Flask):
- Reads scraped_data.json
- Serves it on port 5000 as a JSON response
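A minimal sketch of what the Flask side might look like (an assumed structure, not necessarily the repo's actual server.py — the error payload for a missing file is an assumption):

```python
# Sketch of the server stage: read scraped_data.json and serve it as JSON.
import json
import os

DATA_FILE = "scraped_data.json"

def load_scraped_data(path=DATA_FILE):
    """Return the scraped result, or an error payload if the file is missing."""
    if not os.path.exists(path):
        return {"error": "no scraped data found"}
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def create_app():
    # Imported lazily so load_scraped_data() is usable without Flask installed.
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/")
    def index():
        return jsonify(load_scraped_data())

    return app

# Inside the container this would be started with something like:
#   create_app().run(host="0.0.0.0", port=5000)
# (0.0.0.0 so the server is reachable through the published port.)
```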
🏗️ Build the Image
In the root project directory (where your Dockerfile is), run:
```
docker build -t web-scraper .
```
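For reference, the multi-stage Dockerfile could be shaped roughly like the sketch below. This is an illustration, not the project's actual Dockerfile: the apt package names (nodejs, chromium) and the PUPPETEER_* settings are assumptions that vary by base image. Since SCRAPE_URL is only known at run time, the scrape runs at container start, so the final stage carries both runtimes.

```dockerfile
# Stage 1: install the scraper's Node dependencies
FROM node:18-slim AS scraper
WORKDIR /app
COPY package.json ./
# Skip Puppeteer's bundled Chromium; the final stage installs a system one
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true
RUN npm install
COPY scrape.js ./

# Stage 2: Python image that also carries Node + Chromium, because
# scraping happens at container start (SCRAPE_URL is a runtime env var)
FROM python:3.10-slim
WORKDIR /app
RUN apt-get update \
    && apt-get install -y --no-install-recommends nodejs chromium \
    && rm -rf /var/lib/apt/lists/*
ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY --from=scraper /app /app
COPY server.py ./
EXPOSE 5000
# Scrape first, then serve the result
CMD ["sh", "-c", "node scrape.js && python server.py"]
```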
🚀 Run the Container
To run the container and scrape a URL, use:
```
docker run -p 5000:5000 -e SCRAPE_URL="https://www.wikipedia.org" web-scraper
```
You can replace the URL with any valid webpage.
This will start a Flask server that hosts the scraped output.
🌐 Access the Scraped Data
Once the container is running, open your browser and visit:
http://localhost:5000
You'll see output like:
```
{
  "title": "Wikipedia",
  "heading": "Wikipedia The Free Encyclopedia"
}
```

🧼 Stopping the Container
If running interactively: press Ctrl + C
Or list and stop it manually:
```
docker ps
docker stop <container_id>
```

📁 Project Structure
```
.
├── Dockerfile
├── scrape.js
├── server.py
├── scraped_data.json   (auto-generated)
├── package.json
├── requirements.txt
└── README.md
```
✅ Requirements Summary
- Node.js 18-slim + Puppeteer + Chromium ✅
- Python 3.10-slim + Flask ✅
- Multi-stage Docker build ✅
- Accepts dynamic input via env variable ✅
- Serves data over HTTP as JSON ✅
✨ Done by:
Jahnavi Veliganti
DevOps Assignment | ExactSpace Technologies