diff --git a/openspec/changes/archive/2026-03-05-inline-section-context-in-chunks/.openspec.yaml b/openspec/changes/archive/2026-03-05-inline-section-context-in-chunks/.openspec.yaml new file mode 100644 index 0000000..8f0b869 --- /dev/null +++ b/openspec/changes/archive/2026-03-05-inline-section-context-in-chunks/.openspec.yaml @@ -0,0 +1,2 @@ +schema: spec-driven +created: 2026-03-05 diff --git a/openspec/changes/archive/2026-03-05-inline-section-context-in-chunks/design.md b/openspec/changes/archive/2026-03-05-inline-section-context-in-chunks/design.md new file mode 100644 index 0000000..2be4b55 --- /dev/null +++ b/openspec/changes/archive/2026-03-05-inline-section-context-in-chunks/design.md @@ -0,0 +1,76 @@ +## Context + +`packages/search-core/src/grogbot_search/chunking.py` currently chunks markdown by section and paragraph, then emits plaintext chunks used by both FTS (`chunks_fts`) and vector embeddings (`chunks_vec`). Current behavior can emit heading-only or weakly contextual chunks, which reduces relevance for section-oriented queries. We want to inline section context directly into chunk text while preserving existing ingestion and search architecture (same schema, same rank-fusion pipeline). + +Constraints: +- Keep chunking deterministic and inexpensive. +- Keep compatibility with `SearchService._insert_plaintext_chunks`, which stores a single `content_text` field. +- Maintain existing chunk-size tuning semantics (`TARGET_WORDS`, `MAX_WORDS`) for body text. +- Assume fresh database ingestion (no backfill/migration of existing chunk rows required). + +## Goals / Non-Goals + +**Goals:** +- Add section-aware context to each emitted chunk by prepending a plain heading path. +- Limit context to top two heading levels to prevent noisy prefixes. +- Keep chunk budget decisions based on body text only. +- Avoid heading-only chunks and avoid mixing multiple section contexts in one chunk. 
+- Preserve oversized paragraph fallback behavior, with context retained on sentence-based splits. + +**Non-Goals:** +- No database schema changes (no separate context column). +- No ranking formula changes in `SearchService.search`. +- No migration workflow for legacy databases. +- No overlap-window chunking redesign in this change. + +## Decisions + +1. **Inline context in `content_text` with no marker** + - Decision: Prepend plain context text in `H1 > H2` form (or single heading when only one level exists), followed by body content. + - Rationale: Works immediately with existing FTS + embedding pipeline and avoids schema/search changes. + - Alternatives considered: + - Context marker like `[CTX] ...`: rejected to reduce lexical noise and formatting overhead. + - Separate context column: rejected for this change due to schema and query complexity. + +2. **Context source is heading stack truncated to top 2 levels** + - Decision: Build context from active markdown heading hierarchy, keeping only first two levels. + - Rationale: Keeps high-signal topic labels while avoiding long/deep heading paths. + - Alternatives considered: + - Deepest two levels: can drop broad domain context. + - Full hierarchy: higher noise and repeated text. + +3. **Chunk budgeting excludes context words** + - Decision: `TARGET_WORDS`/`MAX_WORDS` calculations use body words only. + - Rationale: Preserves current chunk-size behavior and avoids shrinking body payload for long headings. + - Alternatives considered: + - Include context in budget: simpler accounting but unstable body capacity. + +4. **Flush on context change and suppress heading-only output** + - Decision: When heading path changes, flush active body chunk before accumulating blocks under the new context. Do not emit chunks containing only headings. + - Rationale: Prevents mixed-topic chunks and removes low-value fragments. 
+ - Alternatives considered: + - Allow mixed-context chunks until size threshold: risks ambiguous retrieval matches. + +5. **Sentence fallback keeps inherited context** + - Decision: If a body block exceeds `MAX_WORDS`, split by sentence groups as today and prepend the same context to each emitted chunk. + - Rationale: Preserves oversized-content handling while maintaining topic cues. + - Alternatives considered: + - Drop context for fallback chunks: creates inconsistent retrieval behavior. + +## Risks / Trade-offs + +- **[Risk] Context text dominates short-body chunks** → **Mitigation:** cap depth at two levels and avoid heading-only chunks. +- **[Risk] Plain prefix may alter lexical scoring distribution** → **Mitigation:** validate with focused retrieval tests (heading-term and body-term queries). +- **[Risk] Regex sentence splitting remains imperfect for abbreviations/edge punctuation** → **Mitigation:** preserve existing behavior in this change and isolate improvements for a follow-up. +- **[Trade-off] No marker means less explicit machine parsing** → **Mitigation:** deterministic prefix format (`H1 > H2`) keeps behavior predictable for tests. + +## Migration Plan + +- Fresh-ingest assumption means no data migration is needed. +- Rollout consists of deploying updated chunking logic, then ingesting documents into a new database. +- Rollback is straightforward: deploy previous chunking logic and re-ingest into a fresh database. + +## Open Questions + +- Should extremely generic headings (for example, "Overview") be filtered from context in a follow-up? +- Should future work add optional overlap windows after context inlining quality is measured? 
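The heading-path rule in decision 2 can be sketched as a standalone helper. It mirrors the `_normalize_context_path` helper added to `chunking.py` later in this diff; the top-level function name here is local to the sketch, and the snippet is illustrative rather than normative:

```python
from typing import List, Optional


def normalize_context_path(heading_stack: List[Optional[str]]) -> str:
    # Keep only the first two non-empty levels of the active heading stack;
    # deeper levels (H3 and below) never appear in the context prefix.
    top_two = [heading for heading in heading_stack if heading][:2]
    return " > ".join(top_two)


print(normalize_context_path(["API", "Auth", "Refresh"]))  # API > Auth
print(normalize_context_path(["API"]))                     # API
```

Note that gaps in the stack (a skipped heading level) simply collapse: the first two non-empty entries are used, whatever their nominal levels.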
diff --git a/openspec/changes/archive/2026-03-05-inline-section-context-in-chunks/proposal.md b/openspec/changes/archive/2026-03-05-inline-section-context-in-chunks/proposal.md new file mode 100644 index 0000000..9da49c5 --- /dev/null +++ b/openspec/changes/archive/2026-03-05-inline-section-context-in-chunks/proposal.md @@ -0,0 +1,26 @@ +## Why + +Search chunks currently rely on paragraph text alone and can emit heading-only fragments, which weakens retrieval for section-oriented queries and introduces low-value noise. We want each body chunk to carry stable section context so both FTS and vector ranking can use topical signals without changing the search schema. + +## What Changes + +- Inline section context into each chunk’s `content_text` as a plain prefix using the top two heading levels (for example, `API > Auth`), with no context marker. +- Exclude inline context words from chunk size budgeting; `TARGET_WORDS` and `MAX_WORDS` continue to apply to body text only. +- Stop emitting heading-only chunks and flush active chunks when section context changes to avoid mixed-topic chunks. +- Preserve oversized-block sentence fallback behavior, while carrying the same section context into each emitted sentence-group chunk. +- Add/adjust tests to lock formatting, budget rules, context transitions, and oversized split behavior. + +## Capabilities + +### New Capabilities +- `search-chunk-context`: Produces context-aware plaintext chunks that prepend top-level section path information to body content during ingestion. + +### Modified Capabilities +- None. 
+ +## Impact + +- Affected code: `packages/search-core/src/grogbot_search/chunking.py` +- Affected tests: `packages/search-core/tests/test_chunking.py` and relevant retrieval assertions in `packages/search-core/tests/test_service.py` +- Data/index impact: newly ingested documents will store context-enriched chunk text in `chunks.content_text`, which feeds both FTS and vector embedding generation +- No API shape changes expected; behavior change is in ingestion/chunk text composition diff --git a/openspec/changes/archive/2026-03-05-inline-section-context-in-chunks/specs/search-chunk-context/spec.md b/openspec/changes/archive/2026-03-05-inline-section-context-in-chunks/specs/search-chunk-context/spec.md new file mode 100644 index 0000000..364efc1 --- /dev/null +++ b/openspec/changes/archive/2026-03-05-inline-section-context-in-chunks/specs/search-chunk-context/spec.md @@ -0,0 +1,45 @@ +## ADDED Requirements + +### Requirement: Chunks SHALL inline top-level section context into emitted text +The chunking pipeline SHALL prepend section context to each emitted body chunk using the active heading hierarchy, formatted as plain text with ` > ` separators and no marker token. The context SHALL include at most the top two heading levels from the current heading stack. 
+ +#### Scenario: Chunk includes two-level heading context +- **WHEN** body content is chunked under an active `H1` and `H2` +- **THEN** each emitted chunk text begins with `H1 > H2` followed by the chunk body text + +#### Scenario: Deep headings are truncated to top two levels +- **WHEN** body content appears under `H1`, `H2`, and deeper headings such as `H3` +- **THEN** emitted chunk context includes only `H1 > H2` + +#### Scenario: Single-level context is preserved +- **WHEN** body content has only an active `H1` heading +- **THEN** emitted chunk context begins with only the `H1` text + +### Requirement: Chunk size budgeting SHALL be based on body words only +`TARGET_WORDS` and `MAX_WORDS` enforcement SHALL use body-text word counts only and SHALL exclude inline context words from budget accounting. + +#### Scenario: Context does not force early chunk split +- **WHEN** body text remains within `MAX_WORDS` but context text is long +- **THEN** the chunk is not split due to context word count + +#### Scenario: Prepended context may push total words above the configured max +- **WHEN** body text exactly fits the configured limit and context is prepended +- **THEN** emitted chunk text may exceed `MAX_WORDS` in total words while remaining valid + +### Requirement: Chunker SHALL avoid heading-only output and mixed-context chunks +The chunker SHALL emit chunks only for body content and SHALL flush active chunk accumulation when heading context changes so one chunk does not mix body from different contexts.
+ +#### Scenario: Heading-only sections produce no standalone chunk +- **WHEN** a heading has no body text before the next heading or document end +- **THEN** no chunk is emitted containing only heading text + +#### Scenario: Context transition flushes current chunk +- **WHEN** body has been accumulated under one heading context and a new heading context begins +- **THEN** the current chunk is finalized before accumulating body under the new context + +### Requirement: Oversized body fallback SHALL preserve section context +When a body block exceeds `MAX_WORDS` and sentence-group fallback is used, each emitted sentence-group chunk SHALL include the same inherited section context prefix. + +#### Scenario: Sentence-split chunks keep identical context +- **WHEN** a large body block is split into multiple chunks by sentence grouping +- **THEN** each emitted chunk begins with the same context path that applied to the oversized block diff --git a/openspec/changes/archive/2026-03-05-inline-section-context-in-chunks/tasks.md b/openspec/changes/archive/2026-03-05-inline-section-context-in-chunks/tasks.md new file mode 100644 index 0000000..c0e3f04 --- /dev/null +++ b/openspec/changes/archive/2026-03-05-inline-section-context-in-chunks/tasks.md @@ -0,0 +1,22 @@ +## 1. Context-aware block modeling + +- [x] 1.1 Update `chunking.py` parsing flow to track active heading hierarchy and associate each body block with its heading context. +- [x] 1.2 Implement heading-path normalization that emits at most the top two heading levels and supports single-level context. +- [x] 1.3 Ensure heading-only segments do not produce body blocks eligible for chunk emission. + +## 2. Chunk emission and sizing behavior + +- [x] 2.1 Update chunk accumulation logic to flush when context changes so one chunk does not mix body from multiple contexts. 
+- [x] 2.2 Prepend plain context text (`H1 > H2` with no marker) to emitted chunk text while keeping body-only word budgeting for `TARGET_WORDS` and `MAX_WORDS`. +- [x] 2.3 Preserve oversized-block sentence fallback and ensure each sentence-group chunk inherits the same context prefix. + +## 3. Test coverage updates + +- [x] 3.1 Extend `test_chunking.py` with golden-output tests for top-two context formatting, context-change flush behavior, and heading-only suppression. +- [x] 3.2 Add tests proving context is excluded from budget calculations and may increase total emitted words above `MAX_WORDS`. +- [x] 3.3 Update oversized-block tests to assert context preservation across sentence-split outputs. + +## 4. Retrieval validation + +- [x] 4.1 Add or adjust `test_service.py` assertions to verify ingested `chunks.content_text` includes inlined context for headed markdown content. +- [x] 4.2 Run targeted test suites (`test_chunking.py`, relevant `test_service.py` cases) and resolve regressions. 
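Taken together, tasks 2.1–2.2 reduce to a small composition rule: context is a plain prefix, and the word budget sees body text only. A minimal sketch, with an illustrative constant standing in for the module's own tuning value:

```python
MAX_WORDS = 1024  # illustrative; chunking.py defines its own tuning constants


def word_count(text: str) -> int:
    return len(text.split())


def compose_chunk_text(context: str, body: str) -> str:
    # Plain prefix with no marker token; empty context leaves the body untouched.
    return f"{context} {body}".strip() if context else body.strip()


context = "API > Auth"
body = "token flow details"

# Budgeting looks at body words only, so a long heading path can never
# force an early split or shrink the body payload.
assert word_count(body) <= MAX_WORDS
print(compose_chunk_text(context, body))  # API > Auth token flow details
```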
diff --git a/packages/search-core/src/grogbot_search/chunking.py b/packages/search-core/src/grogbot_search/chunking.py index 1bb90cc..1fdf101 100644 --- a/packages/search-core/src/grogbot_search/chunking.py +++ b/packages/search-core/src/grogbot_search/chunking.py @@ -1,7 +1,8 @@ from __future__ import annotations +from dataclasses import dataclass import re -from typing import Iterable, List +from typing import List, Optional from bs4 import BeautifulSoup import markdown as markdown_lib @@ -10,6 +11,13 @@ MAX_WORDS = 1024 +@dataclass +class BodyBlock: + text: str + words: int + context: str + + def markdown_to_text(markdown: str) -> str: html = markdown_lib.markdown(markdown) soup = BeautifulSoup(html, "html.parser") @@ -43,56 +51,137 @@ def _word_count(text: str) -> int: return len(text.split()) +def _parse_heading_line(line: str) -> Optional[tuple[int, str]]: + match = re.match(r"^\s*(#{1,6})\s+(.*?)\s*$", line) + if not match: + return None + + level = len(match.group(1)) + heading_raw = re.sub(r"\s+#+\s*$", "", match.group(2)).strip() + heading_text = markdown_to_text(heading_raw) + if not heading_text: + return None + return level, heading_text + + +def _normalize_context_path(heading_stack: List[Optional[str]]) -> str: + top_two = [heading for heading in heading_stack if heading][:2] + return " > ".join(top_two) + + +def _parse_body_blocks(markdown: str) -> List[BodyBlock]: + blocks: List[BodyBlock] = [] + heading_stack: List[Optional[str]] = [] + paragraph_lines: List[str] = [] + + def flush_paragraph() -> None: + nonlocal paragraph_lines + if not paragraph_lines: + return + + paragraph = "\n".join(paragraph_lines).strip() + paragraph_lines = [] + if not paragraph: + return + + text = markdown_to_text(paragraph) + if not text: + return + + blocks.append( + BodyBlock( + text=text, + words=_word_count(text), + context=_normalize_context_path(heading_stack), + ) + ) + + for line in markdown.splitlines(): + heading = _parse_heading_line(line) + if heading is 
not None: + flush_paragraph() + level, heading_text = heading + + while len(heading_stack) < level: + heading_stack.append(None) + heading_stack = heading_stack[:level] + heading_stack[level - 1] = heading_text + continue + + if not line.strip(): + flush_paragraph() + continue + + paragraph_lines.append(line) + + flush_paragraph() + return blocks + + +def _compose_chunk_text(*, context: str, body: str) -> str: + if context: + return f"{context} {body}".strip() + return body.strip() + + def chunk_markdown(markdown: str) -> List[str]: - sections = _split_sections(markdown) - blocks: List[str] = [] - for section in sections: - blocks.extend(_split_paragraphs(section)) + blocks = _parse_body_blocks(markdown) chunks: List[str] = [] current: List[str] = [] current_words = 0 + current_context: Optional[str] = None + + def emit(body: str, context: str) -> None: + body = body.strip() + if not body: + return + chunks.append(_compose_chunk_text(context=context, body=body)) def flush_current() -> None: - nonlocal current, current_words + nonlocal current, current_words, current_context if current: - chunks.append("\n\n".join(current).strip()) + emit(" ".join(current), current_context or "") current = [] current_words = 0 + current_context = None for block in blocks: - block_text = markdown_to_text(block) - block_words = _word_count(block_text) - if block_words > MAX_WORDS: - if current: - flush_current() - sentences = _split_sentences(block_text) + if block.words > MAX_WORDS: + flush_current() + sentences = _split_sentences(block.text) sentence_group: List[str] = [] sentence_words = 0 for sentence in sentences: word_count = _word_count(sentence) if sentence_words + word_count > MAX_WORDS and sentence_group: - chunks.append(" ".join(sentence_group).strip()) + emit(" ".join(sentence_group), block.context) sentence_group = [] sentence_words = 0 sentence_group.append(sentence) sentence_words += word_count if sentence_words >= TARGET_WORDS: - chunks.append(" 
".join(sentence_group).strip()) + emit(" ".join(sentence_group), block.context) sentence_group = [] sentence_words = 0 if sentence_group: - chunks.append(" ".join(sentence_group).strip()) + emit(" ".join(sentence_group), block.context) continue - if current_words + block_words > MAX_WORDS and current: + if current and block.context != current_context: flush_current() - current.append(block) - current_words += block_words + if current_words + block.words > MAX_WORDS and current: + flush_current() + + if not current: + current_context = block.context + + current.append(block.text) + current_words += block.words if current_words >= TARGET_WORDS: flush_current() flush_current() - return [markdown_to_text(chunk) for chunk in chunks if chunk] + return chunks diff --git a/packages/search-core/tests/test_chunking.py b/packages/search-core/tests/test_chunking.py index cb33961..db35981 100644 --- a/packages/search-core/tests/test_chunking.py +++ b/packages/search-core/tests/test_chunking.py @@ -31,7 +31,86 @@ def test_split_sentences_strips_whitespace_and_empties(): assert sentences == ["First sentence.", "Second sentence!", "Third?"] -def test_chunk_markdown_splits_oversized_block_by_sentences(monkeypatch): +def test_chunk_markdown_formats_top_two_context_and_truncates_deeper_levels(monkeypatch): + monkeypatch.setattr(chunking, "TARGET_WORDS", 100) + monkeypatch.setattr(chunking, "MAX_WORDS", 100) + + markdown = """# API + +intro text + +## Auth + +token flow + +### Refresh + +refresh tokens +""" + + chunks = chunking.chunk_markdown(markdown) + + assert chunks == [ + "API intro text", + "API > Auth token flow refresh tokens", + ] + + +def test_chunk_markdown_flushes_when_context_changes(monkeypatch): + monkeypatch.setattr(chunking, "TARGET_WORDS", 100) + monkeypatch.setattr(chunking, "MAX_WORDS", 100) + + markdown = """# Alpha + +first block + +## Beta + +second block + +# Gamma + +third block +""" + + chunks = chunking.chunk_markdown(markdown) + + assert chunks == [ + 
"Alpha first block", + "Alpha > Beta second block", + "Gamma third block", + ] + + +def test_chunk_markdown_skips_heading_only_sections(): + markdown = """# One + +## Two + +### Three +""" + + chunks = chunking.chunk_markdown(markdown) + + assert chunks == [] + + +def test_chunk_markdown_context_words_do_not_count_toward_max(monkeypatch): + monkeypatch.setattr(chunking, "TARGET_WORDS", 100) + monkeypatch.setattr(chunking, "MAX_WORDS", 4) + + markdown = """# very long heading context for this section + +one two three four +""" + + chunks = chunking.chunk_markdown(markdown) + + assert chunks == ["very long heading context for this section one two three four"] + assert chunking._word_count(chunks[0]) > chunking.MAX_WORDS + + +def test_chunk_markdown_splits_oversized_block_by_sentences_with_context(monkeypatch): monkeypatch.setattr(chunking, "TARGET_WORDS", 4) monkeypatch.setattr(chunking, "MAX_WORDS", 6) @@ -46,8 +125,8 @@ def test_chunk_markdown_splits_oversized_block_by_sentences(monkeypatch): assert chunks == [ "Heading preface words", - "one two three. four five six.", - "seven eight nine.", + "Heading one two three. 
four five six.", + "Heading seven eight nine.", ] @@ -65,7 +144,7 @@ def test_chunk_markdown_flushes_when_next_block_would_exceed_max(monkeypatch): assert chunks == ["one two three", "four five six"] -def test_chunk_markdown_sentence_group_flushes_on_max_overflow(monkeypatch): +def test_chunk_markdown_sentence_group_flushes_on_max_overflow_with_context(monkeypatch): monkeypatch.setattr(chunking, "TARGET_WORDS", 100) monkeypatch.setattr(chunking, "MAX_WORDS", 5) @@ -76,7 +155,7 @@ def test_chunk_markdown_sentence_group_flushes_on_max_overflow(monkeypatch): chunks = chunking.chunk_markdown(markdown) - assert chunks == ["Heading", "one two three.", "four five six."] + assert chunks == ["Heading one two three.", "Heading four five six."] def test_chunk_markdown_flushes_when_target_is_reached(monkeypatch): diff --git a/packages/search-core/tests/test_service.py b/packages/search-core/tests/test_service.py index 4a00242..c7ebbc0 100644 --- a/packages/search-core/tests/test_service.py +++ b/packages/search-core/tests/test_service.py @@ -135,6 +135,24 @@ def test_upsert_document_without_content_change_preserves_existing_links(service assert [row["to_document_id"] for row in updated_links] == [row["to_document_id"] for row in original_links] +def test_upsert_document_chunks_inline_heading_context(service: SearchService): + source = service.upsert_source("example.com", name="Example") + document = service.upsert_document( + source_id=source.id, + canonical_url="https://example.com/context", + title="Context", + published_at=None, + content_markdown="""# API + +## Auth + +token details +""", + ) + + assert _chunk_texts(service, document.id) == ["API > Auth token details"] + + def test_upsert_document_rejects_empty_content(service: SearchService): source = service.upsert_source("example.com", name="Example")
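The sentence-fallback behavior these tests lock down can be reproduced standalone. The splitting regex below is a plausible stand-in for the module's `_split_sentences` (whose exact pattern this diff does not change), and the tiny limits match the monkeypatched test values:

```python
import re
from typing import List

TARGET_WORDS = 4  # matches the monkeypatched values in the tests above
MAX_WORDS = 6


def split_sentences(text: str) -> List[str]:
    # Plausible stand-in for _split_sentences; imperfect for abbreviations,
    # which the design deliberately leaves for a follow-up change.
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]


def split_oversized(body: str, context: str) -> List[str]:
    chunks: List[str] = []
    group: List[str] = []
    words = 0

    def emit() -> None:
        nonlocal group, words
        chunks.append(f"{context} {' '.join(group)}".strip())
        group = []
        words = 0

    for sentence in split_sentences(body):
        count = len(sentence.split())
        if words + count > MAX_WORDS and group:
            emit()  # adding this sentence would overflow MAX_WORDS: flush first
        group.append(sentence)
        words += count
        if words >= TARGET_WORDS:
            emit()  # target reached: flush, keeping the inherited context
    if group:
        emit()
    return chunks


print(split_oversized("one two three. four five six. seven eight nine.", "Heading"))
# ['Heading one two three. four five six.', 'Heading seven eight nine.']
```

Every emitted group carries the same inherited context prefix, which is the property the context-preservation assertions above verify.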