A CLI tool to download, parse, and present archived articles about the TV series "Lost" from TVBlog.ro via the Wayback Machine.
- Search the Wayback Machine CDX API for all archived Lost articles
- Download HTML pages with rate limiting to respect archive.org resources
- Parse articles to extract title, author, date, content, and images
- Download images locally for offline viewing
- Generate a static presentation page with season/episode navigation
- Node.js 18+ (uses native fetch)
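The search step can be pictured as a query against the Wayback CDX endpoint followed by decoding its JSON rows. The following is a minimal sketch, not the tool's actual `search.ts`: the helper names `buildCdxUrl` and `parseCdxRows`, the `CdxRecord` shape, and the chosen query parameters (prefix match, 200-only filter) are assumptions made for illustration.

```typescript
// Hypothetical helpers; the real search.ts may be structured differently.
interface CdxRecord {
  timestamp: string;
  original: string;
  statuscode: string;
}

// Build a CDX query URL for all captures whose URL starts with `prefix`.
function buildCdxUrl(prefix: string): string {
  const params = new URLSearchParams({
    url: prefix,
    matchType: "prefix",
    output: "json",
    fl: "timestamp,original,statuscode",
    filter: "statuscode:200",
  });
  return `https://web.archive.org/cdx/search/cdx?${params.toString()}`;
}

// With output=json, the first row is a header that names the columns.
function parseCdxRows(rows: string[][]): CdxRecord[] {
  const [header, ...data] = rows;
  if (!header) return [];
  return data.map((row) => {
    const get = (name: string): string => row[header.indexOf(name)] ?? "";
    return {
      timestamp: get("timestamp"),
      original: get("original"),
      statuscode: get("statuscode"),
    };
  });
}
```

Reading columns by header name rather than by fixed position keeps the parser working if the `fl` field list changes.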
npm install
# Search Wayback Machine for all Lost articles
npm run dev -- search
# Download HTML pages (with rate limiting)
npm run dev -- download # download all pending
npm run dev -- download --limit 10 # download only 10 pages
npm run dev -- download --delay 2000 # 2 second delay between requests
npm run dev -- download --retry-failed # retry previously failed downloads
# Parse downloaded pages to extract article content
npm run dev -- parse
npm run dev -- parse --force # re-parse already parsed pages
# Download images from articles
npm run dev -- images
npm run dev -- images --delay 2000 # 2 second delay between requests
# Build the presentation page
npm run dev -- build
# Check current status
npm run dev -- status
Run all steps in sequence:
npm run dev -- all
npm run dev -- all --delay 2000 # with custom delay
wayback-api-downloader/
├── src/
│ ├── cli.ts # CLI entry point
│ ├── index.ts # Main exports
│ ├── search.ts # Wayback CDX API search
│ ├── download.ts # Page downloading
│ ├── parser.ts # HTML content extraction
│ ├── imageDownloader.ts # Image downloading
│ ├── builder.ts # Presentation page generator
│ ├── types.ts # TypeScript interfaces
│ └── utils.ts # Helper functions
├── data/
│ ├── pages/ # Downloaded raw HTML files
│ ├── articles/ # Parsed JSON files (one per article)
│ ├── images/ # Downloaded images
│ └── status.json # Progress tracking file
├── presentation/
│ ├── index.html # Generated presentation page
│ ├── style.css # Styling
│ └── script.js # Navigation logic
├── package.json
└── tsconfig.json
data/status.json tracks the status of each page:
{
"lastSearchDate": "2024-01-20T10:00:00Z",
"totalPages": 639,
"pages": [
{
"url": "http://www.tvblog.ro/lost-3x17-catch-22",
"timestamp": "20071007031805",
"status": "parsed",
"downloadedFile": "20071007031805_lost-3x17-catch-22.html",
"parsedFile": "lost-3x17-catch-22.json",
"error": null
}
]
}
Page statuses: pending, downloaded, parsed, failed, not_found
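The status file maps naturally onto a small set of TypeScript types. This sketch is an assumption based on the sample above, not a copy of the project's `types.ts`: the names `PageEntry`, `pagesToDownload`, and `waybackUrl` are hypothetical, and treating `--retry-failed` as "pending plus failed" is a guess at the intended behavior.

```typescript
type PageStatus = "pending" | "downloaded" | "parsed" | "failed" | "not_found";

// Shape inferred from the sample status.json entry (hypothetical name).
interface PageEntry {
  url: string;
  timestamp: string;
  status: PageStatus;
  downloadedFile?: string | null;
  parsedFile?: string | null;
  error: string | null;
}

// Pages the download step would fetch; failed ones only when retrying.
function pagesToDownload(pages: PageEntry[], retryFailed = false): PageEntry[] {
  return pages.filter(
    (p) => p.status === "pending" || (retryFailed && p.status === "failed"),
  );
}

// Wayback Machine capture URL: /web/<timestamp>/<original URL>.
function waybackUrl(page: PageEntry): string {
  return `https://web.archive.org/web/${page.timestamp}/${page.url}`;
}
```

Modeling the status as a string-literal union lets the compiler reject misspelled states such as `"notfound"`.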
Each parsed article is saved as a JSON file:
{
"id": "lost-3x17-catch-22",
"title": "Lost - 3×17 \"Catch-22\"",
"author": {
"name": "biotudor",
"url": "http://www.tvblog.ro/author/biotudor/"
},
"date": "April 19, 2007",
"dateISO": "2007-04-19",
"episode": {
"season": 3,
"episode": 17,
"title": "Catch-22"
},
"content": {
"html": "<p>...</p>",
"text": "Plain text version..."
},
"images": [
{
"originalUrl": "https://web.archive.org/web/.../3x17.jpg",
"localPath": "images/lost-3x17-catch-22_3x17.jpg",
"alt": "Lost",
"downloaded": true
}
],
"sourceUrl": "http://www.tvblog.ro/lost-3x17-catch-22/",
"waybackUrl": "https://web.archive.org/web/20071007031805/...",
"waybackTimestamp": "20071007031805"
}
After running build, open presentation/index.html in a browser. The presentation features:
- Dark theme optimized for reading
- Sidebar navigation organized by season
- Episode list with season/episode numbers
- Article display with author, date, and content
- Links to original Wayback Machine pages
- Responsive design for mobile viewing
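The season-based sidebar implies grouping parsed articles by their `episode.season` field. A minimal sketch of that grouping follows; `ArticleSummary` and `groupBySeason` are hypothetical names, and skipping articles without episode data is an assumption rather than documented behavior of `builder.ts`.

```typescript
// Subset of the article JSON needed for navigation (hypothetical type).
interface ArticleSummary {
  id: string;
  title: string;
  episode: { season: number; episode: number; title: string } | null;
}

// Group articles by season, with seasons and episodes in ascending order.
function groupBySeason(articles: ArticleSummary[]): Map<number, ArticleSummary[]> {
  const groups = new Map<number, ArticleSummary[]>();
  for (const article of articles) {
    if (!article.episode) continue; // assumption: non-episode posts are handled elsewhere
    const list = groups.get(article.episode.season) ?? [];
    list.push(article);
    groups.set(article.episode.season, list);
  }
  groups.forEach((list) => {
    list.sort((a, b) => (a.episode?.episode ?? 0) - (b.episode?.episode ?? 0));
  });
  // Rebuild the map so iteration order follows the season number.
  return new Map(Array.from(groups.entries()).sort(([a], [b]) => a - b));
}
```

Because a `Map` iterates in insertion order, rebuilding it from sorted entries gives the sidebar a stable season order.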
The tool respects Wayback Machine resources:
- Default 1 second delay between requests
- Exponential backoff on 429/503 responses
- Maximum 3 retries per request
- Progress saved after each download (resumable)
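The retry policy above can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual `download.ts`: `backoffDelay` and `fetchWithRetry` are hypothetical names, and doubling the delay from the 1-second default on each retry is one common reading of "exponential backoff".

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
function backoffDelay(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** attempt;
}

// Fetch a URL, retrying up to `maxRetries` times on 429/503 responses.
// `fetchImpl` is injectable so the function can be tested without network access.
async function fetchWithRetry(
  url: string,
  maxRetries = 3,
  baseMs = 1000,
  fetchImpl: typeof fetch = fetch,
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetchImpl(url);
    const retryable = res.status === 429 || res.status === 503;
    // Return on success, on a non-retryable status, or when retries are exhausted.
    if (!retryable || attempt >= maxRetries) return res;
    await sleep(backoffDelay(attempt, baseMs));
  }
}
```

With the defaults, a request that keeps receiving 429 is attempted four times in total, waiting 1, 2, and 4 seconds between attempts.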
npm run build
This compiles TypeScript to the dist/ directory.
MIT