This project demonstrates a multi-stage Docker setup where:
- A Node.js script uses Puppeteer and Chromium to scrape the title and first `<h1>` of any website.
- A Python Flask server then serves the scraped data as a JSON API.
Scraper Stage (Node.js + Puppeteer):
- Accepts a URL from the SCRAPE_URL environment variable
- Launches headless Chromium
- Extracts the page title and first heading
- Saves the result as scraped_data.json
Web Server Stage (Python Flask):
- Reads scraped_data.json
- Serves it on port 5000 as a JSON response
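A minimal sketch of what the Flask side might look like (an assumed structure, not necessarily the repo's actual server.py — the error payload for a missing file is an assumption):

```python
# Sketch of the server stage: read scraped_data.json and serve it as JSON.
import json
import os

DATA_FILE = "scraped_data.json"

def load_scraped_data(path=DATA_FILE):
    """Return the scraped result, or an error payload if the file is missing."""
    if not os.path.exists(path):
        return {"error": "no scraped data found"}
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def create_app():
    # Imported lazily so load_scraped_data() is usable without Flask installed.
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/")
    def index():
        return jsonify(load_scraped_data())

    return app

# Inside the container this would be started with something like:
#   create_app().run(host="0.0.0.0", port=5000)
# (0.0.0.0 so the server is reachable through the published port.)
```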
🏗️ Build the Image
In the root project directory (where your Dockerfile is), run:
```
docker build -t web-scraper .
```
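For reference, the multi-stage Dockerfile could be shaped roughly like the sketch below. This is an illustration, not the project's actual Dockerfile: the apt package names (nodejs, chromium) and the PUPPETEER_* settings are assumptions that vary by base image. Since SCRAPE_URL is only known at run time, the scrape runs at container start, so the final stage carries both runtimes.

```dockerfile
# Stage 1: install the scraper's Node dependencies
FROM node:18-slim AS scraper
WORKDIR /app
COPY package.json ./
# Skip Puppeteer's bundled Chromium; the final stage installs a system one
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true
RUN npm install
COPY scrape.js ./

# Stage 2: Python image that also carries Node + Chromium, because
# scraping happens at container start (SCRAPE_URL is a runtime env var)
FROM python:3.10-slim
WORKDIR /app
RUN apt-get update \
    && apt-get install -y --no-install-recommends nodejs chromium \
    && rm -rf /var/lib/apt/lists/*
ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY --from=scraper /app /app
COPY server.py ./
EXPOSE 5000
# Scrape first, then serve the result
CMD ["sh", "-c", "node scrape.js && python server.py"]
```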
🚀 Run the Container
To run the container and scrape a URL, use:
```
docker run -p 5000:5000 -e SCRAPE_URL="https://www.wikipedia.org" web-scraper
```
You can replace the URL with any valid webpage.
This will start a Flask server that hosts the scraped output.
🌐 Access the Scraped Data
Once the container is running, open your browser and visit:
http://localhost:5000
You'll see output like:
```
{
  "title": "Wikipedia",
  "heading": "Wikipedia The Free Encyclopedia"
}
```

🧼 Stopping the Container
If running interactively: press Ctrl + C
Or list and stop it manually:
```
docker ps
docker stop <container_id>
```

📁 Project Structure
```
.
├── Dockerfile
├── scrape.js
├── server.py
├── scraped_data.json   (auto-generated)
├── package.json
├── requirements.txt
└── README.md
```
✅ Requirements Summary
- Node.js 18-slim + Puppeteer + Chromium ✅
- Python 3.10-slim + Flask ✅
- Multi-stage Docker build ✅
- Accepts dynamic input via env variable ✅
- Serves data over HTTP as JSON ✅
✨ Done by:
Jahnavi Veliganti
DevOps Assignment | ExactSpace Technologies