Handle external URL references in archives to enable offline use in Kolibri #233

@ivanistheone

Description

Archive-based content types (H5P, HTML5 zip, IMSCP) may contain references to external URLs — images, videos, fonts, stylesheets, scripts — that won't be available in offline Kolibri deployments. The pipeline conversion handlers need to scan archive contents for these references, download the resources, bundle them into the archive, and rewrite the references to point to the local copies.

Context

Existing logic in downloader.py

We already have most of this URL extraction and rewriting logic in downloader.py. download_static_assets() and its inner functions extract URLs from:

  • HTML attributes: img[src], link[href], script[src], source[src], img[srcset], [style*="background-image"]
  • CSS: url() references for fonts, images, etc.
  • Recursive resource following (CSS that references fonts, etc.)

And ArchiveDownloader downloads pages with all their resources, rewrites paths, and creates ZIP archives.

However, this logic is not reusable by the pipeline as-is: it lives in inner functions of download_static_assets() and is coupled to a live HTTP session (see Phase 2 below).
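The HTML attribute cases listed above could be covered by a standalone extractor along these lines (a sketch using only the stdlib html.parser; all names here are hypothetical, not the actual downloader.py API):

```python
import re
from html.parser import HTMLParser

# Matches url(...) references inside inline style attributes.
STYLE_URL_RE = re.compile(r"url\(['\"]?([^'\")]+)['\"]?\)")

class RefExtractor(HTMLParser):
    """Collects candidate URLs from src, href, srcset, and style attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name in ("src", "href"):
                self.urls.append(value)
            elif name == "srcset":
                # srcset is a comma-separated list of "URL [descriptor]" pairs
                for candidate in value.split(","):
                    self.urls.append(candidate.strip().split()[0])
            elif name == "style":
                self.urls.extend(STYLE_URL_RE.findall(value))

def extract_html_urls(html):
    parser = RefExtractor()
    parser.feed(html)
    return parser.urls
```

A real implementation would also need to descend into `<style>` blocks (feeding their text to the CSS extractor), but the attribute handling above is the bulk of the cases.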

kolibri-zip as reference spec

Kolibri's kolibri-zip package handles the runtime side of this: when rendering ZIP-based content, it extracts files, resolves internal path references, and rewrites them to blob URLs. Its fileUtils.js provides a comprehensive spec for which reference types need handling:

HTML/XML files (src, href, srcset, inline style, <style> blocks):

<img src="images/photo.jpg">
<link href="styles/main.css">
<script src="https://cdn.example.com/lib.js"></script>
<img srcset="img-300.jpg 300w, img-600.jpg 600w">
<div style="background: url('bg.png')">

CSS files (url(), @import — both url() and bare string forms):

@import 'fonts/custom.css';
background-image: url('../images/bg.png');
@font-face { src: url('https://fonts.example.com/font.woff2'); }
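Both forms can be captured with two small regexes. This is a sketch only; the actual _CSS_URL_RE in downloader.py and the _CSS_IMPORT_RE from PR #639 may differ in detail:

```python
import re

# url(...) references: fonts, images, and the url() form of @import.
CSS_URL_RE = re.compile(r"url\(\s*['\"]?([^'\")\s]+)['\"]?\s*\)")

# Bare-string form: @import 'fonts/custom.css';
# (the url() form is already caught by CSS_URL_RE above)
CSS_IMPORT_RE = re.compile(r"@import\s+['\"]([^'\"]+)['\"]")

def extract_css_urls(css_text):
    """Return all url() and bare @import references found in a stylesheet."""
    return CSS_URL_RE.findall(css_text) + CSS_IMPORT_RE.findall(css_text)
```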

H5P JSON (path attributes in content/content.json):

{
  "video": {
    "files": [
      { "path": "https://h5p.org/sites/default/files/h5p/iv.mp4", "mime": "video/mp4" }
    ]
  }
}

kolibri-zip handles internal references at runtime, but cannot fetch external URLs (especially offline). That's ricecooker's job at import time.
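The H5P case reduces to a recursive walk over the parsed content.json, collecting every "path" value that is an absolute URL (a sketch; the function name is hypothetical):

```python
def extract_h5p_external_paths(node, found=None):
    """Recursively collect external "path" values from parsed H5P JSON."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "path" and isinstance(value, str) and value.startswith(
                ("http://", "https://")
            ):
                found.append(value)
            else:
                extract_h5p_external_paths(value, found)
    elif isinstance(node, list):
        for item in node:
            extract_h5p_external_paths(item, found)
    return found
```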

Key architectural constraint

External resource downloading and reference rewriting must happen before create_predictable_zip is called in ArchiveProcessingBaseHandler.handle_file() (convert.py). Once create_predictable_zip runs, the archive is sealed — it iterates existing files and can only transform them via file_converter (currently used for media compression), not add new ones.

The flow in handle_file() would become:

validate_archive(path)
path = download_and_rewrite_external_refs(path)   # <-- new step
create_predictable_zip(path, file_converter=...)   # existing step

Where download_and_rewrite_external_refs would:

  1. Extract the archive to a temp directory
  2. Scan text-based files for external URL references
  3. Download those resources into the temp directory
  4. Rewrite references in the source files to point to local copies
  5. Return the temp directory path (which create_predictable_zip already accepts — it handles both directories and zip files)
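The five steps might look roughly like this (a sketch, not the proposed implementation: the fetch callable is injected so the sketch stays testable offline, and the flat naming/rewrite scheme is a deliberate simplification):

```python
import os
import re
import tempfile
import zipfile

EXTERNAL_URL_RE = re.compile(r"https?://[^\s\"')]+")
TEXT_EXTENSIONS = (".html", ".css", ".json", ".xml")

def download_and_rewrite_external_refs(path, fetch):
    """Hypothetical pre-create_predictable_zip step.

    Extracts the archive, scans text-based files for external URLs,
    downloads each via the injected fetch(url) -> bytes callable, and
    rewrites references to the local copies. Returns the temp directory,
    which create_predictable_zip accepts directly.
    """
    tmpdir = tempfile.mkdtemp()
    with zipfile.ZipFile(path) as zf:
        zf.extractall(tmpdir)
    for root, _dirs, files in os.walk(tmpdir):
        for name in files:
            if not name.endswith(TEXT_EXTENSIONS):
                continue
            filepath = os.path.join(root, name)
            with open(filepath, encoding="utf-8") as f:
                content = f.read()
            for url in sorted(set(EXTERNAL_URL_RE.findall(content))):
                # Simplistic naming: last URL segment. A real version needs
                # collision handling and subdirectory-aware relative paths.
                local_name = url.rstrip("/").rsplit("/", 1)[-1]
                with open(os.path.join(root, local_name), "wb") as out:
                    out.write(fetch(url))
                content = content.replace(url, local_name)
            with open(filepath, "w", encoding="utf-8") as f:
                f.write(content)
    return tmpdir
```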

Related issues and PRs

Approach

Phase 1: Land bug fixes from #636 and #639

Merge the open PRs first to preserve @jaltekruse's contributor attribution in git history before the refactor changes the code structure. These fix real bugs in the URL extraction logic that the shared utilities will need to carry forward.

Phase 2: Extract shared utilities from downloader.py (supersedes #303)

Extract the URL extraction and rewriting logic from download_static_assets() inner functions into standalone, testable utility functions that operate on file contents (strings/bytes) rather than requiring a live HTTP session:

  • URL extraction: Given HTML/CSS/JSON content, return a list of referenced URLs
  • URL rewriting: Given content and a URL mapping (old → new), return rewritten content
  • External URL filtering: Distinguish external (http/https) from internal (relative paths already in archive) references

This directly addresses #303's concern about untestable inner functions. The extracted functions can be unit tested with plain strings — no HTTP server, no filesystem, no platform-specific path issues. This also resolves the Windows test failures in #636, since the core logic tests won't depend on filesystem paths.
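For example, the filtering and rewriting pieces reduce to pure functions over strings (a sketch; these signatures are assumptions, not the proposed API):

```python
from urllib.parse import urlparse

def is_external(url):
    """True for absolute http(s) URLs; relative paths already inside the
    archive count as internal. Scheme-relative "//host/..." URLs would
    need separate handling and are treated as internal here."""
    return urlparse(url).scheme in ("http", "https")

def rewrite_urls(content, url_mapping):
    """Pure string rewrite: replace each old (external) URL with its new
    local path, per the given old -> new mapping."""
    for old, new in url_mapping.items():
        content = content.replace(old, new)
    return content
```

Because neither function touches the network or the filesystem, both can be exercised with plain-string unit tests, which is exactly the testability property #303 asks for.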

Phase 3: Create archive processing utility

Build on the Phase 2 utilities to create an archive-level processor:

  • Open an archive and iterate its text-based files (HTML, CSS, JSON, XML)
  • Use Phase 2 extractors to find external URL references
  • Download external resources into the extracted archive directory
  • Use Phase 2 rewriters to update references to local paths
  • Loop detection for recursive references (à la kolibri-zip's visitedPaths)
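The loop-detection point can be illustrated as a visited-set traversal over a reference graph (a sketch; in the real processor the edges would come from the Phase 2 extractors rather than a precomputed dict):

```python
def walk_refs(start, refs_by_file, visited=None):
    """Depth-first traversal of a file-reference graph, skipping files
    already seen (a la kolibri-zip's visitedPaths) so circular
    references, e.g. two CSS files @importing each other, terminate."""
    if visited is None:
        visited = set()
    if start in visited:
        return visited
    visited.add(start)
    for ref in refs_by_file.get(start, []):
        walk_refs(ref, refs_by_file, visited)
    return visited
```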

Phase 4: Integrate into pipeline conversion handlers

Wire the archive processor into the existing handlers, running before create_predictable_zip:

  • H5PConversionHandler: Scan content/content.json for external path values, plus HTML/CSS in content (highest priority — videos and images commonly external)
  • HTML5ConversionHandler: Scan HTML/CSS files for external references
  • IMSCPConversionHandler: Scan entry point HTML files and their CSS for external references

Reference types to handle

From kolibri-zip's fileUtils.js and existing downloader.py logic:

| File type | Reference patterns | Source |
| --- | --- | --- |
| HTML/XML | src, href, srcset attributes; inline style; <style> blocks | kolibri-zip DOMMapper, downloader.py download_static_assets() |
| CSS | url(), @import (both url() and bare string forms) | kolibri-zip CSSMapper, downloader.py _CSS_URL_RE, PR #639 _CSS_IMPORT_RE |
| H5P JSON | path attributes in content/content.json | H5P-specific |
