Puppeteer Amazon Scrapers (Node.js)

Production-ready Amazon scraping scripts in Node.js using Puppeteer for extracting structured ecommerce data.
Includes scrapers for product pages (ASIN /dp/), search results (SERP /s?k=...), and category/browse pages (node=...).

All scrapers in this directory are built for reliability at scale, with built-in proxy rotation, retries, and anti-bot handling via ScrapeOps. They are generated automatically from real Amazon URLs using AI, designed to survive common Amazon anti-bot defenses, and intended as reference-quality scrapers, not demos or proofs of concept.

See also: ../README.md (Node.js overview) and ../../../README.md (Amazon scrapers overview).

✅ Choose the right scraper

| Scraper | Best for | Start here |
| --- | --- | --- |
| Product pages (`/dp/{ASIN}`) | Full product details (price, rating, images, seller, etc.) | product/product_data/README.md |
| Search results (`/s?k=...`) | SERP analysis + many products per page | product/product_search/README.md |
| Category pages (`node=` / browse) | Category/browse listing extraction | product/product_category/README.md |
| Reviews | Review extraction | Coming soon |
| Sellers | Seller profile extraction | Coming soon |
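
The three URL shapes above can be told apart with a small router. A minimal sketch (the `classifyAmazonUrl` helper is illustrative, not part of the scrapers themselves):

```javascript
// Classify an Amazon URL as a product, search, or category page,
// matching the three scraper types in this directory.
function classifyAmazonUrl(rawUrl) {
  const url = new URL(rawUrl);
  const dpMatch = url.pathname.match(/\/dp\/([A-Z0-9]{10})/);
  if (dpMatch) return { type: "product", asin: dpMatch[1] };
  if (url.pathname.startsWith("/s") && url.searchParams.has("k")) {
    return { type: "search", keyword: url.searchParams.get("k") };
  }
  if (url.searchParams.has("node")) {
    return { type: "category", nodeId: url.searchParams.get("node") };
  }
  return { type: "unknown" };
}

console.log(classifyAmazonUrl("https://www.amazon.com/dp/B08N5WRWNW"));
// → { type: 'product', asin: 'B08N5WRWNW' }
```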

📦 Available Scrapers

This directory contains the following Amazon scrapers:

✅ Product Scrapers

  • Product Data Scraper — Extract product data from Amazon product pages

    • Individual product pages (ASIN-based) with full details, reviews, ratings, specifications
    • Uses Puppeteer with stealth plugin for browser automation
    • Handles dynamic content and JavaScript-rendered pages
  • Product Search Scraper — Extract search results from Amazon SERP

    • Search results pages (SERP) with multiple products, pagination, sponsored ads
    • Extracts organic products, sponsored placements, search metadata, and related searches
    • Supports pagination for scraping multiple pages of results
  • Product Category Scraper — Extract category/browse page data

    • Category/browse pages (node ID-based) with category information, products, and navigation links
    • Extracts category metadata, products listed on category pages, subcategories, and filters
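
Search-result pagination works by incrementing the `page` query parameter. A minimal sketch of generating the page URLs a SERP scraper would iterate over (the `searchPageUrls` helper is illustrative, not the exact function used in the scripts):

```javascript
// Generate Amazon search-result page URLs for a keyword, one per page.
function searchPageUrls(keyword, pages) {
  const urls = [];
  for (let page = 1; page <= pages; page++) {
    const url = new URL("https://www.amazon.com/s");
    url.searchParams.set("k", keyword);
    if (page > 1) url.searchParams.set("page", String(page));
    urls.push(url.toString());
  }
  return urls;
}

console.log(searchPageUrls("wireless mouse", 3));
```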

📋 Reviews Scraper

  • Reviews Scraper — Extract product reviews and ratings
    • Coming soon — Scraper implementation in development

👥 Sellers Scraper

  • Sellers Scraper — Extract seller information and profiles
    • Coming soon — Scraper implementation in development

📁 Directory layout

  • product/ — Product page + search results + category scrapers (code + examples)
    • product_data/ — Product page scraper (ASIN-based)
    • product_search/ — Search results scraper (SERP)
    • product_category/ — Category page scraper (node ID-based)
  • reviews/ — Reviews documentation (scraper implementation coming soon)
  • sellers/ — Sellers documentation (scraper implementation coming soon)

🚀 Quick Start

What You'll Need

  • ✅ Node.js 14+
  • ✅ ScrapeOps API key (Get one free)
  • ✅ Dependencies: puppeteer-extra, puppeteer-extra-plugin-stealth, cheerio

Setup Steps

  1. Install dependencies:

    npm install puppeteer-extra puppeteer-extra-plugin-stealth cheerio

    Note: Puppeteer automatically downloads Chromium when installed, so no separate browser installation step is needed.

  2. Get your ScrapeOps API key:

    • Sign up at ScrapeOps (free account)
    • Copy your API key from the dashboard
    • Set it as an environment variable:
      # macOS/Linux
      export SCRAPEOPS_API_KEY="your-api-key"
      
      # Windows PowerShell
      $env:SCRAPEOPS_API_KEY="your-api-key"
    • Or edit the API_KEY variable in the scraper file directly
  3. Run a scraper:

    # Product Page Scraper
    node product/product_data/amazon.com_scraper_product_v1.js
    
    # Search Results Scraper
    node product/product_search/amazon.com_scraper_product_search_v1.js
    
    # Category Page Scraper
    node product/product_category/amazon.com_scraper_product_category_v1.js

👉 Start with product/product_data/README.md for usage, schemas, and examples.
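
Under the hood, the scrapers route page loads through the ScrapeOps proxy rather than hitting Amazon directly. A minimal sketch of how such a proxy URL is typically assembled (the exact parameters may differ per script; check the scraper source):

```javascript
// Build a ScrapeOps proxy URL that wraps the target Amazon URL.
// A scraper would pass a URL like this to page.goto() instead of the raw URL.
function buildProxyUrl(apiKey, targetUrl) {
  const proxy = new URL("https://proxy.scrapeops.io/v1/");
  proxy.searchParams.set("api_key", apiKey);
  proxy.searchParams.set("url", targetUrl);
  return proxy.toString();
}

const proxied = buildProxyUrl("your-api-key", "https://www.amazon.com/dp/B08N5WRWNW");
console.log(proxied);
```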


🛠️ Why Puppeteer?

Puppeteer is an excellent choice for Amazon scraping because:

  • Browser Automation — Full browser automation with headless Chrome/Chromium
  • JavaScript Support — Handles JavaScript-heavy sites and dynamic content
  • Stealth Mode — Built-in stealth plugin to avoid detection
  • Realistic Behavior — Mimics real browser behavior for better success rates
  • Auto-Download — Automatically downloads Chromium, no manual setup needed
  • Production-Ready — Robust error handling and retry mechanisms

Puppeteer vs. Other Frameworks

| Framework | Best For | Complexity |
| --- | --- | --- |
| Cheerio & Axios | Simple HTML parsing, quick scripts | ⭐ Low |
| Playwright | Modern browser automation, multi-browser support | ⭐⭐⭐ High |
| Puppeteer | Browser automation, headless Chrome | ⭐⭐⭐ High |

Puppeteer is ideal when you need browser automation with headless Chrome for JavaScript-heavy pages or dynamic content.

All scrapers in this directory use Puppeteer with stealth plugin and output structured JSONL (see per-scraper docs).


📋 Common Use Cases

  • Amazon price monitoring and tracking
  • Product catalog ingestion
  • Competitive pricing analysis
  • Review and rating aggregation
  • Search results analysis (SERP)
  • Product discovery and catalog building
  • Category hierarchy mapping and navigation
  • Market research and trend analysis
  • Ecommerce data pipelines
  • Category-based product listings
  • Dynamic content extraction (JavaScript-rendered pages)
  • Browser-based scraping for anti-bot protected sites

🔑 Get ScrapeOps API Key

All Puppeteer scrapers require a ScrapeOps API key to access the proxy service.

Register for Free Account

  1. Visit the ScrapeOps registration page
  2. Sign up for a free account
  3. Navigate to your dashboard to retrieve your API key

Add API Key to Your Code

Method 1: Direct Assignment (Quick Start)

  1. Open the scraper file you want to use
  2. Locate the API_KEY variable near the top of the file
  3. Replace the placeholder with your actual ScrapeOps API key:
    const API_KEY = "your-actual-api-key-here";

Method 2: Environment Variable (Recommended for Production)

For better security, use environment variables:

  1. Set the environment variable:

    # macOS/Linux
    export SCRAPEOPS_API_KEY="your-actual-api-key-here"
    
    # Windows PowerShell
    $env:SCRAPEOPS_API_KEY="your-actual-api-key-here"
  2. Modify the code to read from environment:

    const API_KEY = process.env.SCRAPEOPS_API_KEY || "your-default-key";

Note: Some v1 scripts read from a hardcoded API_KEY constant. If so, either edit API_KEY directly or update the script to use process.env.SCRAPEOPS_API_KEY.
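
The two methods above can be combined into a single fail-fast lookup. A sketch (the `resolveApiKey` helper and placeholder value are illustrative, not part of the v1 scripts):

```javascript
// Prefer the SCRAPEOPS_API_KEY environment variable, fall back to an
// inline constant, and fail fast if neither has been configured.
const PLACEHOLDER = "YOUR_API_KEY_HERE";

function resolveApiKey(inlineKey = PLACEHOLDER) {
  const key = process.env.SCRAPEOPS_API_KEY || inlineKey;
  if (!key || key === PLACEHOLDER) {
    throw new Error("Set SCRAPEOPS_API_KEY or edit API_KEY in the scraper file.");
  }
  return key;
}
```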


⚙️ Dependencies & Setup

Required Dependencies

All Puppeteer scrapers require the following npm packages:

npm install puppeteer-extra puppeteer-extra-plugin-stealth cheerio

Package Details

  • puppeteer-extra — Enhanced Puppeteer with plugin support
  • puppeteer-extra-plugin-stealth — Stealth plugin to avoid detection by anti-bot systems
  • cheerio — Fast, flexible, and lean implementation of core jQuery for server-side HTML parsing

Browser Installation

Puppeteer automatically downloads Chromium when installed, so no separate browser installation step is needed. The first time you run a scraper, it will automatically download the appropriate Chromium version.

Installation Options

Using npm:

npm install puppeteer-extra puppeteer-extra-plugin-stealth cheerio

Using package.json:

npm install

(If a package.json file is available in the scraper directory)


📚 Scraper Documentation

Product Scrapers

Comprehensive documentation for product page, search results, and category page scrapers:

  • Product Data Scraper README — Complete guide for product page scraping

    • Product page scraping (ASIN-based)
    • Full product details extraction
    • Output schemas and examples
    • Configuration and usage examples
    • Browser automation setup
  • Product Search Scraper README — Complete guide for search results scraping

    • Search results scraping (SERP)
    • Multiple products per page extraction
    • Pagination strategies
    • Sponsored products and related searches
    • Output schemas and examples
  • Product Category Scraper README — Complete guide for category page scraping

    • Category page scraping (node ID-based)
    • Category information and metadata
    • Products listed on category pages
    • Subcategories and navigation links
    • Output schemas and examples

Reviews Scraper

  • Reviews Scraper — Documentation for product reviews scraping
    • Implementation in development

Sellers Scraper

  • Sellers Scraper — Documentation for seller information scraping
    • Implementation in development

🔄 Alternative Implementations

These Amazon scrapers are available in multiple Node.js frameworks. Explore alternative implementations that may better suit your needs:

Node.js Framework Options

  • [Cheerio & Axios Framework](<../cheerio & Axios/README.md>) — Simple HTML parsing, fast and lightweight
  • Playwright Framework — Browser automation for JavaScript-heavy sites
  • Puppeteer (This directory) — Browser automation with Puppeteer

Other Language Implementations

Website-Level Documentation


Legal Notice

These scrapers are provided for educational and research purposes.
You are responsible for ensuring your use complies with Amazon's terms of service and applicable laws in your jurisdiction.