feat: introduce parallelization worker for chunked generation processing #507
Conversation
Codecov Report ❌ (patch coverage details and impacted files)

@@ Coverage Diff @@
## main #507 +/- ##
==========================================
+ Coverage 74.35% 75.71% +1.36%
==========================================
Files 112 115 +3
Lines 10747 11181 +434
Branches 722 756 +34
==========================================
+ Hits 7991 8466 +475
+ Misses 2753 2711 -42
- Partials 3 4 +1
Pull request overview
This PR introduces a comprehensive parallelization infrastructure for processing documentation generation tasks across multiple worker threads. The implementation enables both generator-level and chunk-level parallelization to improve performance when processing large documentation sets.
Key changes:
- Refactored WorkerPool to support configurable worker scripts and improved queue processing
- Introduced ParallelWorker abstraction for chunk-level parallelization within generators
- Updated generator interface with optional processChunk method for parallel processing (a rough sketch follows this list)
- Migrated multiple generators (metadata, legacy-json, legacy-html, jsx-ast, ast-js) to support chunked processing
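For orientation, here is a minimal sketch of what a generator with an optional per-chunk hook could look like. This is not code from the PR; the method signature, field names, and return shape are assumptions made for illustration:

```js
// Hypothetical generator shape - illustrative only, not the PR's actual interface.
export default {
  name: 'example-generator',

  // Assumed optional hook: receives one chunk of input entries plus the
  // generator options and returns the processed results for that chunk.
  // Because it only needs its own chunk, several chunks can run in parallel.
  async processChunk(chunk, options) {
    return chunk.map(entry => ({ entry, output: options?.output }));
  },

  // Assumed full-input entry point, used when chunked processing is not enabled.
  async generate(entries, options) {
    return this.processChunk(entries, options);
  },
};
```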
Reviewed changes
Copilot reviewed 18 out of 20 changed files in this pull request and generated 16 comments.
Summary per file:
| File | Description |
|---|---|
| src/threading/index.mjs | Refactored WorkerPool to accept configurable worker scripts and simplified API by moving thread count to constructor |
| src/threading/parallel.mjs | New module providing ParallelWorker abstraction for distributing work across chunk workers with optimal chunking strategy (see the sketch after this table) |
| src/threading/generator-worker.mjs | New worker script for executing full generators in worker threads, replacing the old worker.mjs |
| src/threading/chunk-worker.mjs | New worker script for processing individual chunks via generator's processChunk method |
| src/threading/worker.mjs | Removed - replaced by generator-worker.mjs and chunk-worker.mjs for more flexible parallelization |
| src/generators/types.d.ts | Added ParallelWorker interface, chunkSize option, and optional processChunk method to generator metadata |
| src/utils/generators.mjs | Extracted reusable utility functions (getHeadNodes, getSortedHeadNodes, buildDocPages) from individual generators |
| src/generators/metadata/index.mjs | Implemented processChunk for parallel processing of API doc files |
| src/generators/legacy-json/index.mjs | Refactored to support chunk-based parallel processing with worker threads |
| src/generators/legacy-html/index.mjs | Significant refactor introducing processNode helper, template utilities, and chunk-based parallelization |
| src/generators/legacy-html/utils/template.mjs | New utility module for template caching and replacement logic extraction |
| src/generators/jsx-ast/index.mjs | Refactored to support chunk processing and moved shared utilities to common module |
| src/generators/ast-js/index.mjs | Added chunk processing support for parallel JavaScript file parsing |
| src/generators/web/index.mjs | Updated with better comments explaining why it doesn't implement processChunk (requires all entries for bundling) |
| src/generators/web/utils/processing.mjs | Fixed JSDoc parameter documentation for options parameter |
| src/generators/web/utils/bundle.mjs | Added jsx-runtime aliases and node_modules resolution configuration for external execution |
| src/generators/legacy-html-all/index.mjs | Updated to use getRemarkRehype instead of getRemarkRehypeWithShiki |
| src/generators/orama-db/index.mjs | Minor formatting improvements (blank lines) |
| src/generators.mjs | Updated to create separate worker pools for generators and chunks, with ParallelWorker instances |
| bin/commands/generate.mjs | Added chunkSize parameter and improved thread count calculation for optimal performance |
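To make the chunk-level flow concrete, here is a rough sketch of splitting input entries into chunks and dispatching them to a pool of chunk workers. The helper names, the `pool.run` call, and the default chunk size are assumptions for illustration; the PR's actual `ParallelWorker` and `WorkerPool` APIs may differ:

```js
// Illustrative sketch only - not the PR's parallel.mjs implementation.

// Split the input entries into fixed-size chunks.
const toChunks = (entries, chunkSize) => {
  const chunks = [];
  for (let i = 0; i < entries.length; i += chunkSize) {
    chunks.push(entries.slice(i, i + chunkSize));
  }
  return chunks;
};

// Hypothetical dispatch: each chunk is handed to the pool, which forwards it
// to a chunk worker that calls the generator's processChunk; the per-chunk
// results are then flattened back into a single array.
export const runChunked = async (pool, generatorName, entries, options) => {
  const chunks = toChunks(entries, options.chunkSize ?? 50);
  const results = await Promise.all(
    chunks.map(chunk => pool.run({ generatorName, chunk, options }))
  );
  return results.flat();
};
```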
(Blocking) This PR contains a lot of unrelated changes, which makes it hard to identify what's part of the chunking and what's not. While I'm admittedly a victim of making this same mistake in my PRs, it would be nice to isolate this to parallelization.
(Non-blocking, at the moment) Additionally, I feel like this creates a lot of workers,¹ and in the current format (if I understand correctly), every worker receives the entire input's AST structure and options. That's a lot of serialization.²
It might be more memory-efficient to use a batch promise for the chunks (i.e. p-limit), which keeps them on the same worker and avoids the serialization; a rough sketch follows after the footnotes.
Footnotes
1. According to the current configuration, on my machine, up to 100 workers. ↩
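For comparison, here is a minimal sketch of the p-limit approach suggested above. It is not code from the PR; the helper name and the `generator.processChunk` call are assumptions:

```js
import os from 'node:os';
import pLimit from 'p-limit';

// Hypothetical helper: process all chunks on the main thread with bounded
// concurrency, so the AST and options never cross a worker-thread boundary.
export const processChunksInProcess = async (generator, chunks, options) => {
  const limit = pLimit(os.availableParallelism());

  return Promise.all(
    chunks.map(chunk => limit(() => generator.processChunk(chunk, options)))
  );
};
```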
Could you point out what these unrelated changes are? All changes here are purely related to the parallelization.
That's exactly what this PR is doing: it has batch sizes. Also, the parallelization will never run more workers than the allowed number of threads. I have no idea where you got 100 "workers".
Appreciate you pointing these out, and I acknowledge that. That's one of the issues with asking AI to simplify the changes: it touches unrelated things. Let me work on that.
@avivkeller re-review? 🙇
There are still a fair number of unrelated changes, but regarding the related changes: looks good, with a few non-blocking notes.
Almost there!
If and when I use AI, I usually tell it that if a simplification change can stand on its own, it's likely unrelated, and the model should reconsider whether it's needed.
Yeah, on my work MacBook I have system prompts and other settings for that; this work was done on my personal desktop. Gotta configure ol' VS Code to not do unrelated things.
This PR introduces a new implementation for parallelization of input processing per worker:
This PR is still missing unit tests as I believe there will be a few rounds of feedback before we reach a final version.
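For readers less familiar with the underlying mechanics, here is a minimal sketch of running a whole generator in a worker thread with node:worker_threads. It is not the PR's generator-worker.mjs; file names, the message shape, and the generator lookup are assumptions:

```js
// Illustrative sketch only - not the PR's actual generator-worker.mjs.

// --- main thread (e.g. a worker pool's dispatch function) ---
import { Worker } from 'node:worker_threads';

export const runGeneratorInWorker = (generatorName, input, options) =>
  new Promise((resolve, reject) => {
    const worker = new Worker(
      new URL('./generator-worker.mjs', import.meta.url),
      // workerData is structured-cloned into the worker, which is the
      // serialization cost discussed in the review above.
      { workerData: { generatorName, input, options } }
    );
    worker.once('message', resolve);
    worker.once('error', reject);
  });

// --- worker thread (generator-worker.mjs), shown here as comments ---
// import { parentPort, workerData } from 'node:worker_threads';
//
// const { generatorName, input, options } = workerData;
// const { default: generator } = await import(`./generators/${generatorName}/index.mjs`);
// parentPort.postMessage(await generator.generate(input, options));
```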