This project demonstrates a multi-stage Docker setup where:
- A Node.js script uses Puppeteer and Chromium to scrape the title and first `<h1>` of any website.
- A Python Flask server then serves the scraped data as a JSON API.
Scraper Stage (Node.js + Puppeteer):
- Accepts a URL from the `SCRAPE_URL` environment variable
- Launches headless Chromium
- Extracts the page title and first heading
- Saves the result as `scraped_data.json`
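The scraper itself is `scrape.js` (Node.js + Puppeteer) and is not reproduced here. As a rough Python analogue of the same steps (read `SCRAPE_URL`, fetch the page, extract the title and first `<h1>`, write `scraped_data.json`), using only the standard library and without the JavaScript rendering that headless Chromium provides:

```python
# Hypothetical Python analogue of scrape.js; standard library only.
# The real scraper uses Puppeteer + headless Chromium and renders JS.
import json
import os
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleH1Parser(HTMLParser):
    """Collect the <title> text and the text of the first <h1>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.heading = ""
        self._open = []  # stack of the tags we care about

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        if not self._open:
            return
        text = data.strip()
        if self._open[-1] == "title" and not self.title:
            self.title = text
        elif self._open[-1] == "h1" and not self.heading:
            self.heading = text


def extract(html):
    """Return {'title': ..., 'heading': ...} for an HTML document."""
    parser = TitleH1Parser()
    parser.feed(html)
    return {"title": parser.title, "heading": parser.heading}


if __name__ == "__main__":
    url = os.environ.get("SCRAPE_URL")
    if url:  # only fetch when a URL was actually provided
        html = urlopen(url).read().decode("utf-8", errors="replace")
        with open("scraped_data.json", "w") as f:
            json.dump(extract(html), f)
```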
Web Server Stage (Python Flask):
- Reads `scraped_data.json`
- Serves it on port `5000` as a JSON response
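The description above can be sketched as a minimal `server.py` (the route path is an assumption based on this README, not the project's actual source):

```python
# Minimal sketch of the web server stage (route path is an assumption).
import json

from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/")
def scraped():
    # Serve the JSON file written by the scraper stage.
    with open("scraped_data.json") as f:
        return jsonify(json.load(f))
```

In the container the app would then be started listening on all interfaces, e.g. `app.run(host="0.0.0.0", port=5000)`, which is what makes the `-p 5000:5000` port mapping below reachable from the host.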
In the root project directory (where your Dockerfile is), run:
```bash
docker build -t web-scraper .
```
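The Dockerfile itself is not reproduced in this README. One possible shape for the two stages, consistent with the requirements summary below (stage names, package setup, and the combined start command are assumptions, not the project's actual file), is:

```dockerfile
# Sketch only -- the project's actual Dockerfile may differ.

# Stage 1: install Node dependencies (Puppeteer pulls in Chromium here)
FROM node:18-slim AS scraper
WORKDIR /app
COPY package.json scrape.js ./
RUN npm install

# Stage 2: Python runtime that also carries the scraper.
# Because SCRAPE_URL is only known at `docker run` time, the scrape must
# happen at container start, so the final image needs Node and a browser too.
FROM python:3.10-slim
RUN apt-get update && apt-get install -y nodejs chromium \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY --from=scraper /app /app
COPY requirements.txt server.py ./
RUN pip install -r requirements.txt
EXPOSE 5000
# Run the scraper first, then serve the result.
CMD ["sh", "-c", "node scrape.js && python server.py"]
```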
To run the container and scrape a URL, use:

```bash
docker run -p 5000:5000 -e SCRAPE_URL="https://www.wikipedia.org" web-scraper
```

You can replace the URL with any valid webpage.
This will start a Flask server that hosts the scraped output.
Access the Scraped Data
Once the container is running, open your browser and visit:
http://localhost:5000

You'll see output like:

```json
{
  "title": "Wikipedia",
  "heading": "Wikipedia The Free Encyclopedia"
}
```

Stopping the Container
If running interactively: press Ctrl + C
Or list and stop it manually:
```bash
docker ps
docker stop <container_id>
```

Project Structure
```
.
├── Dockerfile
├── scrape.js
├── server.py
├── scraped_data.json   (auto-generated)
├── package.json
├── requirements.txt
└── README.md
```
Requirements Summary
- Node.js 18-slim + Puppeteer + Chromium ✅
- Python 3.10-slim + Flask ✅
- Multi-stage Docker build ✅
- Accepts dynamic input via environment variable ✅
- Serves data over HTTP as JSON ✅
Done by:
Jahnavi Veliganti
DevOps Assignment | ExactSpace Technologies