Handle external URL references in archives to enable offline use in Kolibri #233

@ivanistheone

Description

Archive-based content types (H5P, HTML5 zip, IMSCP) may contain references to external URLs — images, videos, fonts, stylesheets, scripts — that won't be available in offline Kolibri deployments. The pipeline conversion handlers need to scan archive contents for these references, download the resources, bundle them into the archive, and rewrite the references to point to the local copies.

Context

Existing logic in downloader.py

We already have most of this URL extraction and rewriting logic in downloader.py. download_static_assets() and its inner functions extract URLs from:

  • HTML attributes: img[src], link[href], script[src], source[src], img[srcset], [style*="background-image"]
  • CSS: url() references for fonts, images, etc.
  • Recursive resource following (CSS that references fonts, etc.)

And ArchiveDownloader downloads pages with all their resources, rewrites paths, and creates ZIP archives.

However, this logic is not reusable by the pipeline as-is: it lives in inner functions of download_static_assets() and is coupled to a live HTTP session (see Phase 2 below).
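The HTML attribute cases listed above could be covered by a standalone extractor along these lines (a sketch using only the stdlib html.parser; all names here are hypothetical, not the actual downloader.py API):

```python
import re
from html.parser import HTMLParser

# Matches url(...) references inside inline style attributes.
STYLE_URL_RE = re.compile(r"url\(['\"]?([^'\")]+)['\"]?\)")

class RefExtractor(HTMLParser):
    """Collects candidate URLs from src, href, srcset, and style attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name in ("src", "href"):
                self.urls.append(value)
            elif name == "srcset":
                # srcset is a comma-separated list of "URL [descriptor]" pairs
                for candidate in value.split(","):
                    self.urls.append(candidate.strip().split()[0])
            elif name == "style":
                self.urls.extend(STYLE_URL_RE.findall(value))

def extract_html_urls(html):
    parser = RefExtractor()
    parser.feed(html)
    return parser.urls
```

A real implementation would also need to descend into `<style>` blocks (feeding their text to the CSS extractor), but the attribute handling above is the bulk of the cases.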

kolibri-zip as reference spec

Kolibri's kolibri-zip package handles the runtime side of this: when rendering ZIP-based content, it extracts files, resolves internal path references, and rewrites them to blob URLs. Its fileUtils.js provides a comprehensive spec for which reference types need handling:

HTML/XML files (src, href, srcset, inline style, <style> blocks):

<img src="images/photo.jpg">
<link href="styles/main.css">
<script src="https://cdn.example.com/lib.js"></script>
<img srcset="img-300.jpg 300w, img-600.jpg 600w">
<div style="background: url('bg.png')">

CSS files (url(), @import — both url() and bare string forms):

@import 'fonts/custom.css';
background-image: url('../images/bg.png');
@font-face { src: url('https://fonts.example.com/font.woff2'); }
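Both forms can be captured with two small regexes. This is a sketch only; the actual _CSS_URL_RE in downloader.py and the _CSS_IMPORT_RE from PR #639 may differ in detail:

```python
import re

# url(...) references: fonts, images, and the url() form of @import.
CSS_URL_RE = re.compile(r"url\(\s*['\"]?([^'\")\s]+)['\"]?\s*\)")

# Bare-string form: @import 'fonts/custom.css';
# (the url() form is already caught by CSS_URL_RE above)
CSS_IMPORT_RE = re.compile(r"@import\s+['\"]([^'\"]+)['\"]")

def extract_css_urls(css_text):
    """Return all url() and bare @import references found in a stylesheet."""
    return CSS_URL_RE.findall(css_text) + CSS_IMPORT_RE.findall(css_text)
```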

H5P JSON (path attributes in content/content.json):

{
  "video": {
    "files": [
      { "path": "https://h5p.org/sites/default/files/h5p/iv.mp4", "mime": "video/mp4" }
    ]
  }
}

kolibri-zip handles internal references at runtime, but cannot fetch external URLs (especially offline). That's ricecooker's job at import time.
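The H5P case reduces to a recursive walk over the parsed content.json, collecting every "path" value that is an absolute URL (a sketch; the function name is hypothetical):

```python
def extract_h5p_external_paths(node, found=None):
    """Recursively collect external "path" values from parsed H5P JSON."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "path" and isinstance(value, str) and value.startswith(
                ("http://", "https://")
            ):
                found.append(value)
            else:
                extract_h5p_external_paths(value, found)
    elif isinstance(node, list):
        for item in node:
            extract_h5p_external_paths(item, found)
    return found
```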

Key architectural constraint

External resource downloading and reference rewriting must happen before create_predictable_zip is called in ArchiveProcessingBaseHandler.handle_file() (convert.py). Once create_predictable_zip runs, the archive is sealed — it iterates existing files and can only transform them via file_converter (currently used for media compression), not add new ones.

The flow in handle_file() would become:

validate_archive(path)
path = download_and_rewrite_external_refs(path)   # <-- new step
create_predictable_zip(path, file_converter=...)   # existing step

Where download_and_rewrite_external_refs would:

  1. Extract the archive to a temp directory
  2. Scan text-based files for external URL references
  3. Download those resources into the temp directory
  4. Rewrite references in the source files to point to local copies
  5. Return the temp directory path (which create_predictable_zip already accepts — it handles both directories and zip files)
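The five steps might look roughly like this (a sketch, not the proposed implementation: the fetch callable is injected so the sketch stays testable offline, and the flat naming/rewrite scheme is a deliberate simplification):

```python
import os
import re
import tempfile
import zipfile

EXTERNAL_URL_RE = re.compile(r"https?://[^\s\"')]+")
TEXT_EXTENSIONS = (".html", ".css", ".json", ".xml")

def download_and_rewrite_external_refs(path, fetch):
    """Hypothetical pre-create_predictable_zip step.

    Extracts the archive, scans text-based files for external URLs,
    downloads each via the injected fetch(url) -> bytes callable, and
    rewrites references to the local copies. Returns the temp directory,
    which create_predictable_zip accepts directly.
    """
    tmpdir = tempfile.mkdtemp()
    with zipfile.ZipFile(path) as zf:
        zf.extractall(tmpdir)
    for root, _dirs, files in os.walk(tmpdir):
        for name in files:
            if not name.endswith(TEXT_EXTENSIONS):
                continue
            filepath = os.path.join(root, name)
            with open(filepath, encoding="utf-8") as f:
                content = f.read()
            for url in sorted(set(EXTERNAL_URL_RE.findall(content))):
                # Simplistic naming: last URL segment. A real version needs
                # collision handling and subdirectory-aware relative paths.
                local_name = url.rstrip("/").rsplit("/", 1)[-1]
                with open(os.path.join(root, local_name), "wb") as out:
                    out.write(fetch(url))
                content = content.replace(url, local_name)
            with open(filepath, "w", encoding="utf-8") as f:
                f.write(content)
    return tmpdir
```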

Related issues and PRs

Approach

Phase 1: Land bug fixes from #636 and #639

Merge the open PRs first to preserve @jaltekruse's contributor attribution in git history before the refactor changes the code structure. These fix real bugs in the URL extraction logic that the shared utilities will need to carry forward.

Phase 2: Extract shared utilities from downloader.py (supersedes #303)

Extract the URL extraction and rewriting logic from download_static_assets() inner functions into standalone, testable utility functions that operate on file contents (strings/bytes) rather than requiring a live HTTP session:

  • URL extraction: Given HTML/CSS/JSON content, return a list of referenced URLs
  • URL rewriting: Given content and a URL mapping (old → new), return rewritten content
  • External URL filtering: Distinguish external (http/https) from internal (relative paths already in archive) references

This directly addresses #303's concern about untestable inner functions. The extracted functions can be unit tested with plain strings — no HTTP server, no filesystem, no platform-specific path issues. This also resolves the Windows test failures in #636, since the core logic tests won't depend on filesystem paths.
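For example, the filtering and rewriting pieces reduce to pure functions over strings (a sketch; these signatures are assumptions, not the proposed API):

```python
from urllib.parse import urlparse

def is_external(url):
    """True for absolute http(s) URLs; relative paths already inside the
    archive count as internal. Scheme-relative "//host/..." URLs would
    need separate handling and are treated as internal here."""
    return urlparse(url).scheme in ("http", "https")

def rewrite_urls(content, url_mapping):
    """Pure string rewrite: replace each old (external) URL with its new
    local path, per the given old -> new mapping."""
    for old, new in url_mapping.items():
        content = content.replace(old, new)
    return content
```

Because neither function touches the network or the filesystem, both can be exercised with plain-string unit tests, which is exactly the testability property #303 asks for.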

Phase 3: Create archive processing utility

Build on the Phase 2 utilities to create an archive-level processor:

  • Open an archive and iterate its text-based files (HTML, CSS, JSON, XML)
  • Use Phase 2 extractors to find external URL references
  • Download external resources into the extracted archive directory
  • Use Phase 2 rewriters to update references to local paths
  • Loop detection for recursive references (à la kolibri-zip's visitedPaths)
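The loop-detection point can be illustrated as a visited-set traversal over a reference graph (a sketch; in the real processor the edges would come from the Phase 2 extractors rather than a precomputed dict):

```python
def walk_refs(start, refs_by_file, visited=None):
    """Depth-first traversal of a file-reference graph, skipping files
    already seen (a la kolibri-zip's visitedPaths) so circular
    references, e.g. two CSS files @importing each other, terminate."""
    if visited is None:
        visited = set()
    if start in visited:
        return visited
    visited.add(start)
    for ref in refs_by_file.get(start, []):
        walk_refs(ref, refs_by_file, visited)
    return visited
```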

Phase 4: Integrate into pipeline conversion handlers

Wire the archive processor into the existing handlers, running before create_predictable_zip:

  • H5PConversionHandler: Scan content/content.json for external path values, plus HTML/CSS in content (highest priority — videos and images commonly external)
  • HTML5ConversionHandler: Scan HTML/CSS files for external references
  • IMSCPConversionHandler: Scan entry point HTML files and their CSS for external references

Reference types to handle

From kolibri-zip's fileUtils.js and existing downloader.py logic:

| File type | Reference patterns | Source |
| --- | --- | --- |
| HTML/XML | src, href, srcset attributes; inline style; <style> blocks | kolibri-zip DOMMapper, downloader.py download_static_assets() |
| CSS | url(), @import (both url() and bare string forms) | kolibri-zip CSSMapper, downloader.py _CSS_URL_RE, PR #639 _CSS_IMPORT_RE |
| H5P JSON | path attributes in content/content.json | H5P-specific |
