@Henrrypg

No description provided.

…actors

- Introduced `component_extractors.py` with dedicated extractors for HTML, video, problem, discussion, and generic blocks.
- Refactored `get_unit_content` to use the new extractor layer for cleaner, more consistent unit serialization.
- Refined the HTML extractor for better handling of embedded content.
- Added dedicated helper functions for HTML extraction.
- Improved the `_truncate_unit_text` logic to safely respect `char_limit`.
- Fixed import ordering and added comprehensive docstrings.
- Introduced configurable `AI_EXTENSIONS_FIELD_FILTERS` to define allowed fields.
- Added `DEFAULT_FIELD_FILTERS` in `common.py`.
- Refined the `extract_generic_info` logic for safer and more flexible field extraction.
@Henrrypg force-pushed the `feature/improve-unit-content-readability` branch from a711367 to 0ac8872 on November 28, 2025, 19:16.

3 participants