ClaudiuIO/wayback-api-downloader

Wayback Machine Lost Articles Downloader

A CLI tool to download, parse, and present archived articles about the TV series "Lost" from TVBlog.ro via the Wayback Machine.

Features

  • Search the Wayback Machine CDX API for all archived Lost articles
  • Download HTML pages with rate limiting to respect archive.org resources
  • Parse articles to extract title, author, date, content, and images
  • Download images locally for offline viewing
  • Generate a static presentation page with season/episode navigation
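As a rough illustration of the search step, the sketch below builds a CDX query URL and converts the JSON response rows into capture records. It is not taken from src/search.ts: the Capture interface, the function names, and the choice of query parameters (fl, filter, collapse) are assumptions made for illustration, and the URL pattern to search for is left to the caller.

```typescript
// Hypothetical record for one archived capture; not the project's own type.
interface Capture {
  timestamp: string; // 14-digit Wayback timestamp, e.g. "20071007031805"
  original: string;  // originally archived URL
}

// Build a CDX query URL for a URL pattern supplied by the caller.
// The parameter choices here are assumptions, not the tool's actual query.
function buildCdxUrl(urlPattern: string): string {
  const params = new URLSearchParams({
    url: urlPattern,
    output: "json",             // ask for JSON instead of plain text
    fl: "timestamp,original",   // only return the fields we need
    filter: "statuscode:200",   // skip redirects and error captures
    collapse: "urlkey",         // collapse adjacent captures of the same URL
  });
  return `https://web.archive.org/cdx/search/cdx?${params.toString()}`;
}

// With output=json the response is an array of rows whose first row
// holds the field names; the remaining rows hold the values.
function parseCdxRows(rows: string[][]): Capture[] {
  if (rows.length === 0) return [];
  const [header, ...data] = rows;
  const tsIdx = header.indexOf("timestamp");
  const urlIdx = header.indexOf("original");
  if (tsIdx < 0 || urlIdx < 0) {
    throw new Error("CDX response is missing the timestamp or original field");
  }
  return data.map((row) => ({ timestamp: row[tsIdx], original: row[urlIdx] }));
}

// Replay URL for a capture: https://web.archive.org/web/<timestamp>/<original>
function toWaybackUrl(capture: Capture): string {
  return `https://web.archive.org/web/${capture.timestamp}/${capture.original}`;
}
```

Keeping URL construction and row parsing as pure functions means both can be checked without touching the network; the actual request would be a single fetch of the URL returned by buildCdxUrl.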

Requirements

  • Node.js 18+ (uses native fetch)

Installation

npm install

Usage

Individual Commands

# Search Wayback Machine for all Lost articles
npm run dev -- search

# Download HTML pages (with rate limiting)
npm run dev -- download                 # download all pending
npm run dev -- download --limit 10      # download only 10 pages
npm run dev -- download --delay 2000    # 2 second delay between requests
npm run dev -- download --retry-failed  # retry previously failed downloads

# Parse downloaded pages to extract article content
npm run dev -- parse
npm run dev -- parse --force            # re-parse already parsed pages

# Download images from articles
npm run dev -- images
npm run dev -- images --delay 2000      # 2 second delay between requests

# Build the presentation page
npm run dev -- build

# Check current status
npm run dev -- status

Full Pipeline

Run all steps in sequence:

npm run dev -- all
npm run dev -- all --delay 2000         # with custom delay

Project Structure

wayback-api-downloader/
├── src/
│   ├── cli.ts              # CLI entry point
│   ├── index.ts            # Main exports
│   ├── search.ts           # Wayback CDX API search
│   ├── download.ts         # Page downloading
│   ├── parser.ts           # HTML content extraction
│   ├── imageDownloader.ts  # Image downloading
│   ├── builder.ts          # Presentation page generator
│   ├── types.ts            # TypeScript interfaces
│   └── utils.ts            # Helper functions
├── data/
│   ├── pages/              # Downloaded raw HTML files
│   ├── articles/           # Parsed JSON files (one per article)
│   ├── images/             # Downloaded images
│   └── status.json         # Progress tracking file
├── presentation/
│   ├── index.html          # Generated presentation page
│   ├── style.css           # Styling
│   └── script.js           # Navigation logic
├── package.json
└── tsconfig.json

Data Format

status.json

Tracks the status of each page:

{
  "lastSearchDate": "2024-01-20T10:00:00Z",
  "totalPages": 639,
  "pages": [
    {
      "url": "http://www.tvblog.ro/lost-3x17-catch-22",
      "timestamp": "20071007031805",
      "status": "parsed",
      "downloadedFile": "20071007031805_lost-3x17-catch-22.html",
      "parsedFile": "lost-3x17-catch-22.json",
      "error": null
    }
  ]
}

Page statuses: pending, downloaded, parsed, failed, not_found
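The status file maps naturally onto a few TypeScript types. The sketch below is an assumption about how they might look, not a copy of src/types.ts: the interface names, the nullable fields, and the selectForDownload helper are hypothetical. The helper shows how the --limit and --retry-failed options could be applied to the page list.

```typescript
// The five statuses listed above.
type PageStatus = "pending" | "downloaded" | "parsed" | "failed" | "not_found";

// Hypothetical shape of one entry in status.json.
interface PageEntry {
  url: string;
  timestamp: string;
  status: PageStatus;
  downloadedFile: string | null; // assumed null until the page is downloaded
  parsedFile: string | null;     // assumed null until the page is parsed
  error: string | null;
}

interface StatusFile {
  lastSearchDate: string;
  totalPages: number;
  pages: PageEntry[];
}

// Pick the pages a download run should process: pending pages, plus failed
// ones when retryFailed is set, capped at `limit` if one is given.
function selectForDownload(
  status: StatusFile,
  opts: { retryFailed?: boolean; limit?: number } = {},
): PageEntry[] {
  const wanted = status.pages.filter(
    (p) => p.status === "pending" || (opts.retryFailed === true && p.status === "failed"),
  );
  return opts.limit === undefined ? wanted : wanted.slice(0, opts.limit);
}
```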

Article JSON

Each parsed article is saved as a JSON file:

{
  "id": "lost-3x17-catch-22",
  "title": "Lost - 3×17 \"Catch-22\"",
  "author": {
    "name": "biotudor",
    "url": "http://www.tvblog.ro/author/biotudor/"
  },
  "date": "April 19, 2007",
  "dateISO": "2007-04-19",
  "episode": {
    "season": 3,
    "episode": 17,
    "title": "Catch-22"
  },
  "content": {
    "html": "<p>...</p>",
    "text": "Plain text version..."
  },
  "images": [
    {
      "originalUrl": "https://web.archive.org/web/.../3x17.jpg",
      "localPath": "images/lost-3x17-catch-22_3x17.jpg",
      "alt": "Lost",
      "downloaded": true
    }
  ],
  "sourceUrl": "http://www.tvblog.ro/lost-3x17-catch-22/",
  "waybackUrl": "https://web.archive.org/web/20071007031805/...",
  "waybackTimestamp": "20071007031805"
}
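Two of the fields above can be derived mechanically from values the tool already has. The sketch below assumes article slugs follow the lost-<season>x<episode>-... pattern seen in the example and that Wayback timestamps are 14-digit UTC values; both function names are hypothetical and the project's parser may work differently (for instance by reading the page title instead of the slug).

```typescript
interface EpisodeRef {
  season: number;
  episode: number;
}

// Extract season and episode numbers from a slug such as "lost-3x17-catch-22".
// Returns null when the slug does not match the assumed pattern.
function parseEpisodeFromSlug(slug: string): EpisodeRef | null {
  const match = /^lost-(\d+)x(\d+)(?:-|$)/.exec(slug);
  if (match === null) return null;
  return { season: Number(match[1]), episode: Number(match[2]) };
}

// Convert a Wayback timestamp (YYYYMMDDhhmmss, UTC) to an ISO 8601 string.
// Returns null for anything that is not exactly 14 digits.
function waybackTimestampToISO(ts: string): string | null {
  const m = /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/.exec(ts);
  if (m === null) return null;
  const [, year, month, day, hour, minute, second] = m;
  return `${year}-${month}-${day}T${hour}:${minute}:${second}Z`;
}
```

The episode title is deliberately not derived from the slug, since the slug loses capitalization and punctuation.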

Presentation

After running build, open presentation/index.html in a browser. The presentation features:

  • Dark theme optimized for reading
  • Sidebar navigation organized by season
  • Episode list with season/episode numbers
  • Article display with author, date, and content
  • Links to original Wayback Machine pages
  • Responsive design for mobile viewing

Rate Limiting

The tool throttles its requests to avoid overloading the Wayback Machine:

  • Default 1-second delay between requests (configurable with --delay, in milliseconds)
  • Exponential backoff on 429/503 responses
  • Maximum 3 retries per request
  • Progress saved after each download (resumable)
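The retry behavior listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in src/download.ts: the function and option names are hypothetical, the base backoff delay is assumed to equal the 1-second default, and the doubling schedule (1 s, 2 s, 4 s) is one common reading of "exponential backoff". The request itself is passed in as a function so the sketch works with native fetch or any client whose response exposes a status.

```typescript
// Minimal response shape, so the sketch does not depend on a specific client.
interface HasStatus {
  status: number;
}

interface RetryOptions {
  maxRetries?: number;  // retries after the first attempt (3 in the list above)
  baseDelayMs?: number; // first backoff delay; assumed to be 1000 ms
  sleep?: (ms: number) => Promise<void>; // injectable, e.g. for tests
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Delay before retry number `attempt` (0-based): base, 2*base, 4*base, ...
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** attempt;
}

// Run a request, retrying on 429/503 with exponentially growing delays.
async function requestWithBackoff<T extends HasStatus>(
  doRequest: () => Promise<T>,
  options: RetryOptions = {},
): Promise<T> {
  const maxRetries = options.maxRetries ?? 3;
  const baseDelayMs = options.baseDelayMs ?? 1000;
  const sleep = options.sleep ?? defaultSleep;

  for (let attempt = 0; ; attempt++) {
    const response = await doRequest();
    const retryable = response.status === 429 || response.status === 503;
    if (!retryable || attempt >= maxRetries) {
      // Success, a non-retryable status, or retries exhausted:
      // hand the last response back to the caller.
      return response;
    }
    await sleep(backoffDelayMs(attempt, baseDelayMs));
  }
}
```

With native fetch a call would look like requestWithBackoff(() => fetch(url)). The caller still has to inspect the returned status, because a response that is still 429 or 503 after the last retry is returned rather than thrown.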

Building for Production

npm run build

This compiles the TypeScript sources into the dist/ directory.

License

MIT
