@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-05
@@ -0,0 +1,76 @@
## Context

`packages/search-core/src/grogbot_search/chunking.py` currently chunks markdown by section and paragraph, then emits plaintext chunks used by both FTS (`chunks_fts`) and vector embeddings (`chunks_vec`). Current behavior can emit heading-only or weakly contextual chunks, which reduces relevance for section-oriented queries. We want to inline section context directly into chunk text while preserving existing ingestion and search architecture (same schema, same rank-fusion pipeline).

Constraints:
- Keep chunking deterministic and inexpensive.
- Keep compatibility with `SearchService._insert_plaintext_chunks`, which stores a single `content_text` field.
- Maintain existing chunk-size tuning semantics (`TARGET_WORDS`, `MAX_WORDS`) for body text.
- Assume fresh database ingestion (no backfill/migration of existing chunk rows required).

## Goals / Non-Goals

**Goals:**
- Add section-aware context to each emitted chunk by prepending a plain heading path.
- Limit context to top two heading levels to prevent noisy prefixes.
- Keep chunk budget decisions based on body text only.
- Avoid heading-only chunks and avoid mixing multiple section contexts in one chunk.
- Preserve oversized paragraph fallback behavior, with context retained on sentence-based splits.

**Non-Goals:**
- No database schema changes (no separate context column).
- No ranking formula changes in `SearchService.search`.
- No migration workflow for legacy databases.
- No overlap-window chunking redesign in this change.

## Decisions

1. **Inline context in `content_text` with no marker**
- Decision: Prepend plain context text in `H1 > H2` form (or single heading when only one level exists), followed by body content.
- Rationale: Works immediately with existing FTS + embedding pipeline and avoids schema/search changes.
- Alternatives considered:
- Context marker like `[CTX] ...`: rejected to reduce lexical noise and formatting overhead.
- Separate context column: rejected for this change due to schema and query complexity.
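The composition rule is small enough to sketch directly. The helper below is a standalone mirror of the `_compose_chunk_text` function added in the diff; the sample heading and body text are made up.

```python
# Sketch of the no-marker composition rule: plain "H1 > H2" prefix,
# a single space, then the body text.
def compose_chunk_text(context: str, body: str) -> str:
    if context:
        return f"{context} {body}".strip()
    # With no active headings, the body is emitted unchanged.
    return body.strip()

print(compose_chunk_text("API > Auth", "Tokens expire after one hour."))
# → API > Auth Tokens expire after one hour.
```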

2. **Context source is heading stack truncated to top 2 levels**
- Decision: Build context from active markdown heading hierarchy, keeping only first two levels.
- Rationale: Keeps high-signal topic labels while avoiding long/deep heading paths.
- Alternatives considered:
- Deepest two levels: can drop broad domain context.
- Full hierarchy: higher noise and repeated text.
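The truncation rule can be sketched as a standalone mirror of the `_normalize_context_path` helper in the diff; `None` entries represent heading levels that were never filled (for example, an `H3` appearing without an `H2`).

```python
from typing import List, Optional

# Keep the first two non-empty levels from the top of the heading stack,
# skipping unfilled (None) slots.
def normalize_context_path(heading_stack: List[Optional[str]]) -> str:
    top_two = [heading for heading in heading_stack if heading][:2]
    return " > ".join(top_two)

print(normalize_context_path(["API", "Auth", "Token Rotation"]))
# → API > Auth
```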

3. **Chunk budgeting excludes context words**
- Decision: `TARGET_WORDS`/`MAX_WORDS` calculations use body words only.
- Rationale: Preserves current chunk-size behavior and avoids shrinking body payload for long headings.
- Alternatives considered:
- Include context in budget: simpler accounting but unstable body capacity.
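A minimal sketch of the budgeting consequence, using a hypothetical `within_budget` helper (not a function in `chunking.py`): a body exactly at the limit still passes, even though the emitted text ends up longer.

```python
MAX_WORDS = 1024  # matches the constant in chunking.py

def within_budget(body: str) -> bool:
    # The budget check counts body words only; context words are excluded.
    return len(body.split()) <= MAX_WORDS

context = "API > Auth"
body = " ".join(["word"] * MAX_WORDS)  # body exactly at the limit
emitted = f"{context} {body}"
print(within_budget(body), len(emitted.split()))
# → True 1027  ("API", ">", "Auth" add three whitespace-delimited tokens)
```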

4. **Flush on context change and suppress heading-only output**
- Decision: When heading path changes, flush active body chunk before accumulating blocks under the new context. Do not emit chunks containing only headings.
- Rationale: Prevents mixed-topic chunks and removes low-value fragments.
- Alternatives considered:
- Allow mixed-context chunks until size threshold: risks ambiguous retrieval matches.
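The flush rule can be illustrated with a toy accumulator over `(context, body)` pairs (size budgeting omitted for brevity). Headings without body text never produce such a pair, which is what rules out heading-only chunks.

```python
# Toy accumulator: finalize the active chunk whenever the context changes,
# so no emitted chunk mixes body text from two sections.
def accumulate(blocks):
    chunks, current, current_ctx = [], [], None
    for ctx, body in blocks:
        if current and ctx != current_ctx:
            chunks.append(f"{current_ctx} {' '.join(current)}")
            current = []
        if not current:
            current_ctx = ctx
        current.append(body)
    if current:
        chunks.append(f"{current_ctx} {' '.join(current)}")
    return chunks

print(accumulate([
    ("API > Auth", "Body one."),
    ("API > Auth", "Body two."),
    ("API > Billing", "Body three."),
]))
# → ['API > Auth Body one. Body two.', 'API > Billing Body three.']
```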

5. **Sentence fallback keeps inherited context**
- Decision: If a body block exceeds `MAX_WORDS`, split by sentence groups as today and prepend the same context to each emitted chunk.
- Rationale: Preserves oversized-content handling while maintaining topic cues.
- Alternatives considered:
- Drop context for fallback chunks: creates inconsistent retrieval behavior.
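The fallback can be sketched as follows; the sentence regex and the tiny `TARGET_WORDS` value are for illustration only and do not reflect the real constants or the exact splitter in `chunking.py`.

```python
import re

TARGET_WORDS = 6  # tiny limit purely for illustration; real values differ

# Split an oversized body into sentence groups and prepend the same
# inherited context to every emitted group.
def split_oversized(context: str, body: str):
    sentences = re.split(r"(?<=[.!?])\s+", body.strip())
    chunks, group, words = [], [], 0
    for sentence in sentences:
        group.append(sentence)
        words += len(sentence.split())
        if words >= TARGET_WORDS:
            chunks.append(f"{context} {' '.join(group)}")
            group, words = [], 0
    if group:
        chunks.append(f"{context} {' '.join(group)}")
    return chunks

print(split_oversized("API > Auth",
                      "One two three four. Five six seven. Eight nine."))
# → ['API > Auth One two three four. Five six seven.', 'API > Auth Eight nine.']
```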

## Risks / Trade-offs

- **[Risk] Context text dominates short-body chunks** → **Mitigation:** cap depth at two levels and avoid heading-only chunks.
- **[Risk] Plain prefix may alter lexical scoring distribution** → **Mitigation:** validate with focused retrieval tests (heading-term and body-term queries).
- **[Risk] Regex sentence splitting remains imperfect for abbreviations/edge punctuation** → **Mitigation:** preserve existing behavior in this change and isolate improvements for a follow-up.
- **[Trade-off] No marker means less explicit machine parsing** → **Mitigation:** deterministic prefix format (`H1 > H2`) keeps behavior predictable for tests.

## Migration Plan

- Fresh-ingest assumption means no data migration is needed.
- Rollout consists of deploying updated chunking logic, then ingesting documents into a new database.
- Rollback is straightforward: deploy previous chunking logic and re-ingest into a fresh database.

## Open Questions

- Should extremely generic headings (for example, "Overview") be filtered from context in a follow-up?
- Should future work add optional overlap windows after context inlining quality is measured?
@@ -0,0 +1,26 @@
## Why

Search chunks currently rely on paragraph text alone and can emit heading-only fragments, which weakens retrieval for section-oriented queries and introduces low-value noise. We want each body chunk to carry stable section context so both FTS and vector ranking can use topical signals without changing the search schema.

## What Changes

- Inline section context into each chunk’s `content_text` as a plain prefix using the top two heading levels (for example, `API > Auth`), with no context marker.
- Exclude inline context words from chunk size budgeting; `TARGET_WORDS` and `MAX_WORDS` continue to apply to body text only.
- Stop emitting heading-only chunks and flush active chunks when section context changes to avoid mixed-topic chunks.
- Preserve oversized-block sentence fallback behavior, while carrying the same section context into each emitted sentence-group chunk.
- Add/adjust tests to lock formatting, budget rules, context transitions, and oversized split behavior.
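Putting the rules together, the expected output shape looks like the following. The document text is made up, and this does not call the real `chunk_markdown`; it only records the chunk text the new rules should produce.

```python
# Made-up input and the chunk text the new rules should yield:
# two-level context cap, flush on context change, no heading-only chunks.
doc = """# API
## Auth
Tokens expire after one hour.
## Billing
Invoices are issued monthly.
"""

expected_chunks = [
    "API > Auth Tokens expire after one hour.",
    "API > Billing Invoices are issued monthly.",
]
print(expected_chunks)
```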

## Capabilities

### New Capabilities
- `search-chunk-context`: Produces context-aware plaintext chunks that prepend top-level section path information to body content during ingestion.

### Modified Capabilities
- None.

## Impact

- Affected code: `packages/search-core/src/grogbot_search/chunking.py`
- Affected tests: `packages/search-core/tests/test_chunking.py` and relevant retrieval assertions in `packages/search-core/tests/test_service.py`
- Data/index impact: newly ingested documents will store context-enriched chunk text in `chunks.content_text`, which feeds both FTS and vector embedding generation
- No API shape changes expected; behavior change is in ingestion/chunk text composition
@@ -0,0 +1,45 @@
## ADDED Requirements

### Requirement: Chunks SHALL inline top-level section context into emitted text
The chunking pipeline SHALL prepend section context to each emitted body chunk using the active heading hierarchy, formatted as plain text with ` > ` separators and no marker token. The context SHALL include at most the top two heading levels from the current heading stack.

#### Scenario: Chunk includes two-level heading context
- **WHEN** body content is chunked under an active `H1` and `H2`
- **THEN** each emitted chunk text begins with `H1 > H2` followed by the chunk body text

#### Scenario: Deep headings are truncated to top two levels
- **WHEN** body content appears under `H1`, `H2`, and deeper headings such as `H3`
- **THEN** emitted chunk context includes only `H1 > H2`

#### Scenario: Single-level context is preserved
- **WHEN** body content has only an active `H1` heading
- **THEN** emitted chunk context begins with only the `H1` text

### Requirement: Chunk size budgeting SHALL be based on body words only
`TARGET_WORDS` and `MAX_WORDS` enforcement SHALL use body-text word counts only and SHALL exclude inline context words from budget accounting.

#### Scenario: Context does not force early chunk split
- **WHEN** body text remains within `MAX_WORDS` but context text is long
- **THEN** the chunk is not split due to context word count

#### Scenario: Prepended context may push total words above the max
- **WHEN** body text exactly fits the configured limit and context is prepended
- **THEN** emitted chunk text may exceed `MAX_WORDS` in total words while remaining valid

### Requirement: Chunker SHALL avoid heading-only output and mixed-context chunks
The chunker SHALL emit chunks only for body content and SHALL flush active chunk accumulation when heading context changes so one chunk does not mix body from different contexts.

#### Scenario: Heading-only sections produce no standalone chunk
- **WHEN** a heading has no body text before the next heading or document end
- **THEN** no chunk is emitted containing only heading text

#### Scenario: Context transition flushes current chunk
- **WHEN** body has been accumulated under one heading context and a new heading context begins
- **THEN** the current chunk is finalized before accumulating body under the new context

### Requirement: Oversized body fallback SHALL preserve section context
When a body block exceeds `MAX_WORDS` and sentence-group fallback is used, each emitted sentence-group chunk SHALL include the same inherited section context prefix.

#### Scenario: Sentence-split chunks keep identical context
- **WHEN** a large body block is split into multiple chunks by sentence grouping
- **THEN** each emitted chunk begins with the same context path that applied to the oversized block
@@ -0,0 +1,22 @@
## 1. Context-aware block modeling

- [x] 1.1 Update `chunking.py` parsing flow to track active heading hierarchy and associate each body block with its heading context.
- [x] 1.2 Implement heading-path normalization that emits at most the top two heading levels and supports single-level context.
- [x] 1.3 Ensure heading-only segments do not produce body blocks eligible for chunk emission.

## 2. Chunk emission and sizing behavior

- [x] 2.1 Update chunk accumulation logic to flush when context changes so one chunk does not mix body from multiple contexts.
- [x] 2.2 Prepend plain context text (`H1 > H2` with no marker) to emitted chunk text while keeping body-only word budgeting for `TARGET_WORDS` and `MAX_WORDS`.
- [x] 2.3 Preserve oversized-block sentence fallback and ensure each sentence-group chunk inherits the same context prefix.

## 3. Test coverage updates

- [x] 3.1 Extend `test_chunking.py` with golden-output tests for top-two context formatting, context-change flush behavior, and heading-only suppression.
- [x] 3.2 Add tests proving context is excluded from budget calculations and may increase total emitted words above `MAX_WORDS`.
- [x] 3.3 Update oversized-block tests to assert context preservation across sentence-split outputs.

## 4. Retrieval validation

- [x] 4.1 Add or adjust `test_service.py` assertions to verify ingested `chunks.content_text` includes inlined context for headed markdown content.
- [x] 4.2 Run targeted test suites (`test_chunking.py`, relevant `test_service.py` cases) and resolve regressions.
129 changes: 109 additions & 20 deletions packages/search-core/src/grogbot_search/chunking.py
@@ -1,7 +1,8 @@
from __future__ import annotations

from dataclasses import dataclass
import re
from typing import Iterable, List
from typing import List, Optional

from bs4 import BeautifulSoup
import markdown as markdown_lib
@@ -10,6 +11,13 @@
MAX_WORDS = 1024


@dataclass
class BodyBlock:
text: str
words: int
context: str


def markdown_to_text(markdown: str) -> str:
html = markdown_lib.markdown(markdown)
soup = BeautifulSoup(html, "html.parser")
@@ -43,56 +51,137 @@ def _word_count(text: str) -> int:
return len(text.split())


def _parse_heading_line(line: str) -> Optional[tuple[int, str]]:
match = re.match(r"^\s*(#{1,6})\s+(.*?)\s*$", line)
if not match:
return None

level = len(match.group(1))
heading_raw = re.sub(r"\s+#+\s*$", "", match.group(2)).strip()
heading_text = markdown_to_text(heading_raw)
if not heading_text:
return None
return level, heading_text


def _normalize_context_path(heading_stack: List[Optional[str]]) -> str:
top_two = [heading for heading in heading_stack if heading][:2]
return " > ".join(top_two)


def _parse_body_blocks(markdown: str) -> List[BodyBlock]:
blocks: List[BodyBlock] = []
heading_stack: List[Optional[str]] = []
paragraph_lines: List[str] = []

def flush_paragraph() -> None:
nonlocal paragraph_lines
if not paragraph_lines:
return

paragraph = "\n".join(paragraph_lines).strip()
paragraph_lines = []
if not paragraph:
return

text = markdown_to_text(paragraph)
if not text:
return

blocks.append(
BodyBlock(
text=text,
words=_word_count(text),
context=_normalize_context_path(heading_stack),
)
)

for line in markdown.splitlines():
heading = _parse_heading_line(line)
if heading is not None:
flush_paragraph()
level, heading_text = heading

while len(heading_stack) < level:
heading_stack.append(None)
heading_stack = heading_stack[:level]
heading_stack[level - 1] = heading_text
continue

if not line.strip():
flush_paragraph()
continue

paragraph_lines.append(line)

flush_paragraph()
return blocks


def _compose_chunk_text(*, context: str, body: str) -> str:
if context:
return f"{context} {body}".strip()
return body.strip()


def chunk_markdown(markdown: str) -> List[str]:
sections = _split_sections(markdown)
blocks: List[str] = []
for section in sections:
blocks.extend(_split_paragraphs(section))
blocks = _parse_body_blocks(markdown)

chunks: List[str] = []
current: List[str] = []
current_words = 0
current_context: Optional[str] = None

def emit(body: str, context: str) -> None:
body = body.strip()
if not body:
return
chunks.append(_compose_chunk_text(context=context, body=body))

def flush_current() -> None:
nonlocal current, current_words
nonlocal current, current_words, current_context
if current:
chunks.append("\n\n".join(current).strip())
emit(" ".join(current), current_context or "")
current = []
current_words = 0
current_context = None

for block in blocks:
block_text = markdown_to_text(block)
block_words = _word_count(block_text)
if block_words > MAX_WORDS:
if current:
flush_current()
sentences = _split_sentences(block_text)
if block.words > MAX_WORDS:
flush_current()
sentences = _split_sentences(block.text)
sentence_group: List[str] = []
sentence_words = 0
for sentence in sentences:
word_count = _word_count(sentence)
if sentence_words + word_count > MAX_WORDS and sentence_group:
chunks.append(" ".join(sentence_group).strip())
emit(" ".join(sentence_group), block.context)
sentence_group = []
sentence_words = 0
sentence_group.append(sentence)
sentence_words += word_count
if sentence_words >= TARGET_WORDS:
chunks.append(" ".join(sentence_group).strip())
emit(" ".join(sentence_group), block.context)
sentence_group = []
sentence_words = 0
if sentence_group:
chunks.append(" ".join(sentence_group).strip())
emit(" ".join(sentence_group), block.context)
continue

if current_words + block_words > MAX_WORDS and current:
if current and block.context != current_context:
flush_current()

current.append(block)
current_words += block_words
if current_words + block.words > MAX_WORDS and current:
flush_current()

if not current:
current_context = block.context

current.append(block.text)
current_words += block.words

if current_words >= TARGET_WORDS:
flush_current()

flush_current()
return [markdown_to_text(chunk) for chunk in chunks if chunk]
return chunks