feat(search): Scalable Local Search Architecture (Worker offload + Binary Indexing) for large documentation sites #5077

@seekskyworld

Description

Is your feature request related to a problem? Please describe.

Currently, the built-in local search (based on MiniSearch) faces significant performance bottlenecks when scaling to larger documentation sites (e.g., > 16,000 pages or large content volume).

Key Pain Points:

1. Main Thread Blocking: Index construction and query execution both run on the main thread, causing UI freezes/jank during initialization and while typing.

2. High Memory Usage: The index is loaded as one large JSON object. For large sites this can consume 1 GB+ of RAM, leading to browser crashes on mobile devices.

3. Slow Initialization: Parsing huge JSON files and building the index at runtime results in a noticeable delay before the search becomes usable.

4. Network Overhead: Transferring full-text data as JSON is inefficient compared to optimized binary formats.

Describe the solution you'd like

I propose (and have implemented a proof-of-concept for) an optimized local search architecture designed for high-performance scenarios. This could be an advanced configuration option or a potential future replacement for the current implementation.

Proposed Architecture:

1. Web Worker Offloading: Move the heavy lifting (index parsing and fuzzy matching) into a Web Worker so the UI thread stays responsive at 60 fps regardless of dataset size.

2. Static Pre-indexing: Instead of building the index at runtime in the browser, generate the index (e.g., using FlexSearch) during the Node.js build process.

3. Compact Binary/Array Format: Replace verbose JSON objects with a row-based "array of arrays" structure plus dictionary encoding for URLs. In my benchmark, this reduced memory usage from 1.7 GB to ~510 MB for the same dataset.

4. Artifact Splitting: Split the index into a core part (titles/headers) and a content part (full text). Load the core immediately for instant interactivity, and lazy-load the content in the background.

5. Native Intl.Segmenter: Use the browser's native Intl.Segmenter for CJK (Chinese/Japanese/Korean) tokenization, removing the dependency on heavy third-party libraries like jieba.

I have implemented this architecture in a private project with excellent results (≈0 ms search latency, < 50 ms TTI). I am willing to share more technical details or contribute a PR if the team is interested in this direction. Rough sketches of the main pieces are included below.
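
To make the Web Worker offloading concrete, here is a minimal sketch. The file name `search.worker.ts`, the `init`/`query` message shapes, and `renderResults` are illustrative assumptions, not existing VitePress APIs; MiniSearch appears here only because the built-in search already ships it, and FlexSearch could slot in the same way.

```ts
// search.worker.ts — hypothetical worker entry; message shapes are assumptions.
import MiniSearch from 'minisearch'

let index: MiniSearch | null = null

self.onmessage = async (e: MessageEvent) => {
  const msg = e.data
  if (msg.type === 'init') {
    // Download, JSON parsing and index deserialization all happen off the main thread.
    const raw = await (await fetch(msg.indexUrl)).text()
    index = MiniSearch.loadJSON(raw, { fields: ['title', 'text'], storeFields: ['title', 'url'] })
    self.postMessage({ type: 'ready' })
  } else if (msg.type === 'query' && index) {
    // Fuzzy matching runs here too; only a small result list crosses the thread boundary.
    const results = index.search(msg.text, { prefix: true, fuzzy: 0.2 }).slice(0, 20)
    self.postMessage({ type: 'results', id: msg.id, results })
  }
}
```

```ts
// main thread — thin proxy around the worker (sketch)
const worker = new Worker(new URL('./search.worker.ts', import.meta.url), { type: 'module' })
worker.postMessage({ type: 'init', indexUrl: '/search-index.json' })

const renderResults = (results: unknown) => console.log(results) // app-specific in practice
worker.onmessage = (e) => {
  if (e.data.type === 'results') renderResults(e.data.results)
}

export function search(text: string) {
  worker.postMessage({ type: 'query', id: Date.now(), text })
}
```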
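
A rough sketch of the build-time side (items 2 and 3), assuming a hypothetical `collectSections()` extraction step and placeholder output paths. The point is the row-based "array of arrays" layout plus a URL dictionary: object keys are never repeated in the artifact, and each page URL is stored exactly once.

```ts
// scripts/build-search-index.ts — hypothetical Node build step.
import { mkdir, writeFile } from 'node:fs/promises'

interface Section { url: string; anchor: string; title: string; text: string }

// Placeholder for the existing markdown extraction; real data comes from the site build.
async function collectSections(): Promise<Section[]> {
  return [{ url: '/guide/intro', anchor: '#setup', title: 'Setup', text: 'Install and configure…' }]
}

async function main() {
  const sections = await collectSections()

  // Dictionary encoding: each page URL is stored once and referenced by index.
  const urls: string[] = []
  const urlId = new Map<string, number>()

  // Row layout: [urlIndex, anchor, title, text] — no repeated object keys.
  const rows: [number, string, string, string][] = []
  for (const s of sections) {
    let id = urlId.get(s.url)
    if (id === undefined) {
      id = urls.push(s.url) - 1
      urlId.set(s.url, id)
    }
    rows.push([id, s.anchor, s.title, s.text])
  }

  // Split artifacts: "core" (titles/headers) loads first, "content" lazily.
  const core = rows.map(([u, a, t]) => [u, a, t])
  const content = rows.map((r) => r[3])

  await mkdir('dist', { recursive: true })
  await writeFile('dist/search-core.json', JSON.stringify({ urls, rows: core }))
  await writeFile('dist/search-content.json', JSON.stringify(content))
}

main()
```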
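
For artifact splitting (item 4), the worker could index the small core artifact first and upgrade documents with full text afterwards. This sketch assumes the two files produced above and relies on MiniSearch's `addAll`/`replace` methods; it is not the current implementation.

```ts
// worker side (sketch): become interactive on the core artifact, then enrich in place.
import MiniSearch from 'minisearch'

const index = new MiniSearch({ fields: ['title', 'text'], storeFields: ['url', 'title'] })

async function init() {
  // Core (titles/headers) is small: index it immediately so the first keystrokes work.
  const core = await (await fetch('/search-core.json')).json()
  const docs = core.rows.map(([u, anchor, title]: [number, string, string], i: number) => ({
    id: i,
    title,
    text: '',
    url: core.urls[u] + anchor
  }))
  index.addAll(docs)
  self.postMessage({ type: 'ready' })

  // Full text streams in afterwards and upgrades existing documents.
  const content: string[] = await (await fetch('/search-content.json')).json()
  for (let i = 0; i < content.length; i++) {
    index.replace({ ...docs[i], text: content[i] })
  }
  self.postMessage({ type: 'full-index-ready' })
}

init()
```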
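
Finally, a tokenizer sketch for item 5 using the standard `Intl.Segmenter` API, with a plain whitespace/punctuation fallback where the API is unavailable; hooking it into the indexer via a `tokenize` option is shown only as a comment and depends on the engine chosen.

```ts
// CJK-aware tokenizer using the native segmenter; no third-party dictionary needed.
const segmenter =
  typeof Intl !== 'undefined' && 'Segmenter' in Intl
    ? new Intl.Segmenter(['zh', 'ja', 'ko', 'en'], { granularity: 'word' })
    : null

export function tokenize(text: string): string[] {
  // Fallback: split on whitespace and punctuation (works for space-delimited languages only).
  if (!segmenter) return text.split(/[\s\p{P}]+/u).filter(Boolean)

  const tokens: string[] = []
  for (const seg of segmenter.segment(text)) {
    // Keep only word-like segments; whitespace and punctuation are dropped.
    if (seg.isWordLike) tokens.push(seg.segment.toLowerCase())
  }
  return tokens
}

// Usage idea: pass as the indexer's tokenizer, e.g. new MiniSearch({ fields: [...], tokenize }).
```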

Describe alternatives you've considered

1. Algolia / DocSearch: Excellent performance, but it requires a paid subscription for commercial closed-source projects and raises data-privacy concerns (content must be uploaded to a third party).

2. Server-side Search (Elasticsearch/Meilisearch): Requires deploying and maintaining a backend service, losing the simplicity of "static site" deployment.

3. Tuning MiniSearch: I tried tuning MiniSearch options, but the fundamental bottleneck of main-thread JSON parsing remains hard to overcome for very large datasets.

Additional context

No response
