Description
Archive-based content types (H5P, HTML5 zip, IMSCP) may contain references to external URLs — images, videos, fonts, stylesheets, scripts — that won't be available in offline Kolibri deployments. The pipeline conversion handlers need to scan archive contents for these references, download the resources, bundle them into the archive, and rewrite the references to point to the local copies.
Context
Existing logic in downloader.py
We already have most of this URL extraction and rewriting logic in downloader.py. download_static_assets() and its inner functions extract URLs from:
- HTML attributes: `img[src]`, `link[href]`, `script[src]`, `source[src]`, `img[srcset]`, `[style*="background-image"]`
- CSS: `url()` references for fonts, images, etc.
- Recursive resource following (CSS that references fonts, etc.)
And ArchiveDownloader downloads pages with all their resources, rewrites paths, and creates ZIP archives.
However, this logic is not reusable by the pipeline because:
- It's tightly coupled to web scraping (takes URLs, not archive contents)
- The extraction/rewriting logic is buried in inner functions that can't be unit tested (see Refactor downloader.py to make more functions unit testable #303)
- It has known bugs being fixed in Fix downloading a linked css resource with no extension #636 and Fix rewriting of CSS @import statements that use strings instead of url() #639
kolibri-zip as reference spec
Kolibri's kolibri-zip package handles the runtime side of this: when rendering ZIP-based content, it extracts files, resolves internal path references, and rewrites them to blob URLs. Its fileUtils.js provides a comprehensive spec for which reference types need handling:
HTML/XML files (`src`, `href`, `srcset` attributes, inline `style`, `<style>` blocks):

```html
<img src="images/photo.jpg">
<link href="styles/main.css">
<script src="https://cdn.example.com/lib.js"></script>
<img srcset="img-300.jpg 300w, img-600.jpg 600w">
<div style="background: url('bg.png')">
```

CSS files (`url()`, `@import` — both `url()` and bare string forms):

```css
@import 'fonts/custom.css';
background-image: url('../images/bg.png');
@font-face { src: url('https://fonts.example.com/font.woff2'); }
```

H5P JSON (`path` attributes in `content/content.json`):

```json
{
  "video": {
    "files": [
      { "path": "https://h5p.org/sites/default/files/h5p/iv.mp4", "mime": "video/mp4" }
    ]
  }
}
```

kolibri-zip handles internal references at runtime, but cannot fetch external URLs (especially offline). That's ricecooker's job at import time.
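The H5P JSON case above amounts to a recursive walk over `content/content.json` looking for external `path` values. A minimal illustrative sketch (the helper name `find_external_paths` is hypothetical, not part of ricecooker or kolibri-zip):

```python
import json

def find_external_paths(node, found=None):
    """Recursively collect "path" values that point at external URLs.

    H5P nests file references arbitrarily deep inside content.json,
    so we walk the whole JSON structure rather than fixed keys.
    """
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "path" and isinstance(value, str) and value.startswith(("http://", "https://")):
                found.append(value)
            else:
                find_external_paths(value, found)
    elif isinstance(node, list):
        for item in node:
            find_external_paths(item, found)
    return found

content = json.loads("""
{"video": {"files": [
  {"path": "https://h5p.org/sites/default/files/h5p/iv.mp4", "mime": "video/mp4"},
  {"path": "images/local.png", "mime": "image/png"}
]}}
""")
```

Internal references like `images/local.png` are skipped; they are already in the archive and are kolibri-zip's job at runtime.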
Key architectural constraint
External resource downloading and reference rewriting must happen before create_predictable_zip is called in ArchiveProcessingBaseHandler.handle_file() (convert.py). Once create_predictable_zip runs, the archive is sealed — it iterates existing files and can only transform them via file_converter (currently used for media compression), not add new ones.
The flow in handle_file() would become:
```python
validate_archive(path)
path = download_and_rewrite_external_refs(path)  # <-- new step
create_predictable_zip(path, file_converter=...)  # existing step
```
Where download_and_rewrite_external_refs would:
- Extract the archive to a temp directory
- Scan text-based files for external URL references
- Download those resources into the temp directory
- Rewrite references in the source files to point to local copies
- Return the temp directory path (which `create_predictable_zip` already accepts — it handles both directories and zip files)
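A minimal sketch of those five steps. To stay testable without committing to an API, the Phase 2/3 utilities are passed in as callables; `extract_urls`, `fetch`, and `rewrite` are assumed names, not existing ricecooker functions:

```python
import os
import tempfile
import zipfile

def download_and_rewrite_external_refs(archive_path, extract_urls, fetch, rewrite):
    """Extract the archive, scan text files, download external resources,
    rewrite references, and return a directory for create_predictable_zip.

    extract_urls(content, filename) -> list of external URLs found
    fetch(url, dest_dir) -> archive-relative path of the local copy
    rewrite(content, {old_url: local_path}) -> rewritten content
    """
    extract_dir = tempfile.mkdtemp()
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(extract_dir)
    for root, _dirs, files in os.walk(extract_dir):
        for name in files:
            if not name.endswith((".html", ".htm", ".css", ".json", ".xml")):
                continue  # only text-based files carry references
            file_path = os.path.join(root, name)
            with open(file_path, encoding="utf-8") as f:
                content = f.read()
            # Download each external resource, then rewrite in one pass
            url_map = {url: fetch(url, extract_dir) for url in extract_urls(content, name)}
            if url_map:
                with open(file_path, "w", encoding="utf-8") as f:
                    f.write(rewrite(content, url_map))
    return extract_dir
```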
Related issues and PRs
- Refactor downloader.py to make more functions unit testable #303 — make URL detection and rewriting unit-testable. This issue supersedes #303: the extraction of URL logic into shared utilities serves both the testability goal and the pipeline integration goal.
- Fix downloading a linked css resource with no extension #636 — fix CSS `link` tag filtering (`"rel" in node` vs `"rel" in node.attrs`) and extensionless URL path placement. By @jaltekruse.
- Fix rewriting of CSS @import statements that use strings instead of url() #639 — fix CSS `@import` with bare strings (not wrapped in `url()`). By @jaltekruse.
Approach
Phase 1: Land bug fixes from #636 and #639
Merge the open PRs first to preserve @jaltekruse's contributor attribution in git history before the refactor changes the code structure. These fix real bugs in the URL extraction logic that the shared utilities will need to carry forward.
Phase 2: Extract shared utilities from downloader.py (supersedes #303)
Extract the URL extraction and rewriting logic from download_static_assets() inner functions into standalone, testable utility functions that operate on file contents (strings/bytes) rather than requiring a live HTTP session:
- URL extraction: Given HTML/CSS/JSON content, return a list of referenced URLs
- URL rewriting: Given content and a URL mapping (old → new), return rewritten content
- External URL filtering: Distinguish external (http/https) from internal (relative paths already in archive) references
This directly addresses #303's concern about untestable inner functions. The extracted functions can be unit tested with plain strings — no HTTP server, no filesystem, no platform-specific path issues. This also resolves the Windows test failures in #636, since the core logic tests won't depend on filesystem paths.
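For illustration, such string-level utilities might look like the following. These regexes are simplified stand-ins, not downloader.py's actual `_CSS_URL_RE`, and the function names are hypothetical:

```python
import re

# Illustrative patterns for the two CSS reference forms:
# url(...) with optional quotes, and bare-string @import.
CSS_URL_RE = re.compile(r"""url\(\s*['"]?([^'")]+)['"]?\s*\)""")
CSS_IMPORT_RE = re.compile(r"""@import\s+['"]([^'"]+)['"]""")

def extract_css_urls(css_text):
    """Return every reference in url(...) or bare-string @import form."""
    return CSS_URL_RE.findall(css_text) + CSS_IMPORT_RE.findall(css_text)

def is_external(url):
    """External (http/https) vs. internal (relative path already in archive)."""
    return url.startswith(("http://", "https://", "//"))

def rewrite_urls(text, url_map):
    """Replace each old URL with its local path, longest first so a URL
    that is a prefix of another cannot clobber the longer one."""
    for old in sorted(url_map, key=len, reverse=True):
        text = text.replace(old, url_map[old])
    return text
```

Because the functions take and return plain strings, the unit tests reduce to assertions on string inputs and outputs.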
Phase 3: Create archive processing utility
Build on the Phase 2 utilities to create an archive-level processor:
- Open an archive and iterate its text-based files (HTML, CSS, JSON, XML)
- Use Phase 2 extractors to find external URL references
- Download external resources into the extracted archive directory
- Use Phase 2 rewriters to update references to local paths
- Loop detection for recursive references (à la kolibri-zip's `visitedPaths`)
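The loop-detection step might be sketched as a worklist with a visited set, so circular `@import` chains (a.css imports b.css imports a.css) terminate. `read_file` and `extract_refs` are assumed stand-ins for the Phase 2 utilities:

```python
def collect_references(start_file, read_file, extract_refs):
    """Follow references recursively (CSS -> fonts, CSS -> @import'd CSS)
    and return the set of files scanned, guarding against cycles.

    read_file(path) -> file content as a string
    extract_refs(content) -> list of referenced paths
    """
    visited = set()
    queue = [start_file]
    while queue:
        path = queue.pop()
        if path in visited:
            continue  # loop guard: skip files we've already scanned
        visited.add(path)
        for ref in extract_refs(read_file(path)):
            queue.append(ref)
    return visited
```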
Phase 4: Integrate into pipeline conversion handlers
Wire the archive processor into the existing handlers, running before create_predictable_zip:
- H5PConversionHandler: Scan `content/content.json` for external `path` values, plus HTML/CSS in content (highest priority — videos and images commonly external)
- HTML5ConversionHandler: Scan HTML/CSS files for external references
- IMSCPConversionHandler: Scan entry point HTML files and their CSS for external references
Reference types to handle
From kolibri-zip's fileUtils.js and existing downloader.py logic:
| File type | Reference patterns | Source |
|---|---|---|
| HTML/XML | `src`, `href`, `srcset` attributes; inline `style`; `<style>` blocks | kolibri-zip DOMMapper, downloader.py `download_static_assets()` |
| CSS | `url()`, `@import` (both `url()` and bare string forms) | kolibri-zip CSSMapper, downloader.py `_CSS_URL_RE`, PR #639 `_CSS_IMPORT_RE` |
| H5P JSON | `path` attributes in `content/content.json` | H5P-specific |
References
- kolibri-zip source: https://github.com/learningequality/kolibri/tree/release-v0.19.x/packages/kolibri-zip/src
- Current pipeline handlers: ricecooker/utils/pipeline/convert.py
- Existing download logic: ricecooker/utils/downloader.py