@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-05
@@ -0,0 +1,76 @@
## Context

`packages/search-core/src/grogbot_search/chunking.py` currently chunks markdown by section and paragraph, then emits plaintext chunks used by both FTS (`chunks_fts`) and vector embeddings (`chunks_vec`). Current behavior can emit heading-only or weakly contextual chunks, which reduces relevance for section-oriented queries. We want to inline section context directly into chunk text while preserving existing ingestion and search architecture (same schema, same rank-fusion pipeline).

Constraints:
- Keep chunking deterministic and inexpensive.
- Keep compatibility with `SearchService._insert_plaintext_chunks`, which stores a single `content_text` field.
- Maintain existing chunk-size tuning semantics (`TARGET_WORDS`, `MAX_WORDS`) for body text.
- Assume fresh database ingestion (no backfill/migration of existing chunk rows required).

## Goals / Non-Goals

**Goals:**
- Add section-aware context to each emitted chunk by prepending a plain heading path.
- Limit context to top two heading levels to prevent noisy prefixes.
- Keep chunk budget decisions based on body text only.
- Avoid heading-only chunks and avoid mixing multiple section contexts in one chunk.
- Preserve oversized paragraph fallback behavior, with context retained on sentence-based splits.

**Non-Goals:**
- No database schema changes (no separate context column).
- No ranking formula changes in `SearchService.search`.
- No migration workflow for legacy databases.
- No overlap-window chunking redesign in this change.

## Decisions

1. **Inline context in `content_text` with no marker**
- Decision: Prepend plain context text in `H1 > H2` form (or single heading when only one level exists), followed by body content.
- Rationale: Works immediately with existing FTS + embedding pipeline and avoids schema/search changes.
- Alternatives considered:
- Context marker like `[CTX] ...`: rejected to reduce lexical noise and formatting overhead.
- Separate context column: rejected for this change due to schema and query complexity.
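The composition rule is small enough to sketch directly. The helper below is a standalone mirror of the `_compose_chunk_text` function added in the diff; the sample heading and body text are made up.

```python
# Sketch of the no-marker composition rule: plain "H1 > H2" prefix,
# a single space, then the body text.
def compose_chunk_text(context: str, body: str) -> str:
    if context:
        return f"{context} {body}".strip()
    # With no active headings, the body is emitted unchanged.
    return body.strip()

print(compose_chunk_text("API > Auth", "Tokens expire after one hour."))
# → API > Auth Tokens expire after one hour.
```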

2. **Context source is heading stack truncated to top 2 levels**
- Decision: Build context from active markdown heading hierarchy, keeping only first two levels.
- Rationale: Keeps high-signal topic labels while avoiding long/deep heading paths.
- Alternatives considered:
- Deepest two levels: can drop broad domain context.
- Full hierarchy: higher noise and repeated text.
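The truncation rule can be sketched as a standalone mirror of the `_normalize_context_path` helper in the diff; `None` entries represent heading levels that were never filled (for example, an `H3` appearing without an `H2`).

```python
from typing import List, Optional

# Keep the first two non-empty levels from the top of the heading stack,
# skipping unfilled (None) slots.
def normalize_context_path(heading_stack: List[Optional[str]]) -> str:
    top_two = [heading for heading in heading_stack if heading][:2]
    return " > ".join(top_two)

print(normalize_context_path(["API", "Auth", "Token Rotation"]))
# → API > Auth
```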

3. **Chunk budgeting excludes context words**
- Decision: `TARGET_WORDS`/`MAX_WORDS` calculations use body words only.
- Rationale: Preserves current chunk-size behavior and avoids shrinking body payload for long headings.
- Alternatives considered:
- Include context in budget: simpler accounting but unstable body capacity.
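A minimal sketch of the budgeting consequence, using a hypothetical `within_budget` helper (not a function in `chunking.py`): a body exactly at the limit still passes, even though the emitted text ends up longer.

```python
MAX_WORDS = 1024  # matches the constant in chunking.py

def within_budget(body: str) -> bool:
    # The budget check counts body words only; context words are excluded.
    return len(body.split()) <= MAX_WORDS

context = "API > Auth"
body = " ".join(["word"] * MAX_WORDS)  # body exactly at the limit
emitted = f"{context} {body}"
print(within_budget(body), len(emitted.split()))
# → True 1027  ("API", ">", "Auth" add three whitespace-delimited tokens)
```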

4. **Flush on context change and suppress heading-only output**
- Decision: When heading path changes, flush active body chunk before accumulating blocks under the new context. Do not emit chunks containing only headings.
- Rationale: Prevents mixed-topic chunks and removes low-value fragments.
- Alternatives considered:
- Allow mixed-context chunks until size threshold: risks ambiguous retrieval matches.
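The flush rule can be illustrated with a toy accumulator over `(context, body)` pairs (size budgeting omitted for brevity). Headings without body text never produce such a pair, which is what rules out heading-only chunks.

```python
# Toy accumulator: finalize the active chunk whenever the context changes,
# so no emitted chunk mixes body text from two sections.
def accumulate(blocks):
    chunks, current, current_ctx = [], [], None
    for ctx, body in blocks:
        if current and ctx != current_ctx:
            chunks.append(f"{current_ctx} {' '.join(current)}")
            current = []
        if not current:
            current_ctx = ctx
        current.append(body)
    if current:
        chunks.append(f"{current_ctx} {' '.join(current)}")
    return chunks

print(accumulate([
    ("API > Auth", "Body one."),
    ("API > Auth", "Body two."),
    ("API > Billing", "Body three."),
]))
# → ['API > Auth Body one. Body two.', 'API > Billing Body three.']
```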

5. **Sentence fallback keeps inherited context**
- Decision: If a body block exceeds `MAX_WORDS`, split by sentence groups as today and prepend the same context to each emitted chunk.
- Rationale: Preserves oversized-content handling while maintaining topic cues.
- Alternatives considered:
- Drop context for fallback chunks: creates inconsistent retrieval behavior.
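The fallback can be sketched as follows; the sentence regex and the tiny `TARGET_WORDS` value are for illustration only and do not reflect the real constants or the exact splitter in `chunking.py`.

```python
import re

TARGET_WORDS = 6  # tiny limit purely for illustration; real values differ

# Split an oversized body into sentence groups and prepend the same
# inherited context to every emitted group.
def split_oversized(context: str, body: str):
    sentences = re.split(r"(?<=[.!?])\s+", body.strip())
    chunks, group, words = [], [], 0
    for sentence in sentences:
        group.append(sentence)
        words += len(sentence.split())
        if words >= TARGET_WORDS:
            chunks.append(f"{context} {' '.join(group)}")
            group, words = [], 0
    if group:
        chunks.append(f"{context} {' '.join(group)}")
    return chunks

print(split_oversized("API > Auth",
                      "One two three four. Five six seven. Eight nine."))
# → ['API > Auth One two three four. Five six seven.', 'API > Auth Eight nine.']
```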

## Risks / Trade-offs

- **[Risk] Context text dominates short-body chunks** → **Mitigation:** cap depth at two levels and avoid heading-only chunks.
- **[Risk] Plain prefix may alter lexical scoring distribution** → **Mitigation:** validate with focused retrieval tests (heading-term and body-term queries).
- **[Risk] Regex sentence splitting remains imperfect for abbreviations/edge punctuation** → **Mitigation:** preserve existing behavior in this change and isolate improvements for a follow-up.
- **[Trade-off] No marker means less explicit machine parsing** → **Mitigation:** deterministic prefix format (`H1 > H2`) keeps behavior predictable for tests.

## Migration Plan

- Fresh-ingest assumption means no data migration is needed.
- Rollout consists of deploying updated chunking logic, then ingesting documents into a new database.
- Rollback is straightforward: deploy previous chunking logic and re-ingest into a fresh database.

## Open Questions

- Should extremely generic headings (for example, "Overview") be filtered from context in a follow-up?
- Should future work add optional overlap windows after context inlining quality is measured?
@@ -0,0 +1,26 @@
## Why

Search chunks currently rely on paragraph text alone and can emit heading-only fragments, which weakens retrieval for section-oriented queries and introduces low-value noise. We want each body chunk to carry stable section context so both FTS and vector ranking can use topical signals without changing the search schema.

## What Changes

- Inline section context into each chunk’s `content_text` as a plain prefix using the top two heading levels (for example, `API > Auth`), with no context marker.
- Exclude inline context words from chunk size budgeting; `TARGET_WORDS` and `MAX_WORDS` continue to apply to body text only.
- Stop emitting heading-only chunks and flush active chunks when section context changes to avoid mixed-topic chunks.
- Preserve oversized-block sentence fallback behavior, while carrying the same section context into each emitted sentence-group chunk.
- Add/adjust tests to lock formatting, budget rules, context transitions, and oversized split behavior.
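Putting the rules together, the expected output shape looks like the following. The document text is made up, and this does not call the real `chunk_markdown`; it only records the chunk text the new rules should produce.

```python
# Made-up input and the chunk text the new rules should yield:
# two-level context cap, flush on context change, no heading-only chunks.
doc = """# API
## Auth
Tokens expire after one hour.
## Billing
Invoices are issued monthly.
"""

expected_chunks = [
    "API > Auth Tokens expire after one hour.",
    "API > Billing Invoices are issued monthly.",
]
print(expected_chunks)
```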

## Capabilities

### New Capabilities
- `search-chunk-context`: Produces context-aware plaintext chunks that prepend top-level section path information to body content during ingestion.

### Modified Capabilities
- None.

## Impact

- Affected code: `packages/search-core/src/grogbot_search/chunking.py`
- Affected tests: `packages/search-core/tests/test_chunking.py` and relevant retrieval assertions in `packages/search-core/tests/test_service.py`
- Data/index impact: newly ingested documents will store context-enriched chunk text in `chunks.content_text`, which feeds both FTS and vector embedding generation
- No API shape changes expected; behavior change is in ingestion/chunk text composition
@@ -0,0 +1,45 @@
## ADDED Requirements

### Requirement: Chunks SHALL inline top-level section context into emitted text
The chunking pipeline SHALL prepend section context to each emitted body chunk using the active heading hierarchy, formatted as plain text with ` > ` separators and no marker token. The context SHALL include at most the top two heading levels from the current heading stack.

#### Scenario: Chunk includes two-level heading context
- **WHEN** body content is chunked under an active `H1` and `H2`
- **THEN** each emitted chunk text begins with `H1 > H2` followed by the chunk body text

#### Scenario: Deep headings are truncated to top two levels
- **WHEN** body content appears under `H1`, `H2`, and deeper headings such as `H3`
- **THEN** emitted chunk context includes only `H1 > H2`

#### Scenario: Single-level context is preserved
- **WHEN** body content has only an active `H1` heading
- **THEN** emitted chunk context begins with only the `H1` text

### Requirement: Chunk size budgeting SHALL be based on body words only
`TARGET_WORDS` and `MAX_WORDS` enforcement SHALL use body-text word counts only and SHALL exclude inline context words from budget accounting.

#### Scenario: Context does not force early chunk split
- **WHEN** body text remains within `MAX_WORDS` but context text is long
- **THEN** the chunk is not split due to context word count

#### Scenario: Prepended context may push total words above the max
- **WHEN** body text exactly fits the configured limit and context is prepended
- **THEN** emitted chunk text may exceed `MAX_WORDS` in total words while remaining valid

### Requirement: Chunker SHALL avoid heading-only output and mixed-context chunks
The chunker SHALL emit chunks only for body content and SHALL flush active chunk accumulation when heading context changes so one chunk does not mix body from different contexts.

#### Scenario: Heading-only sections produce no standalone chunk
- **WHEN** a heading has no body text before the next heading or document end
- **THEN** no chunk is emitted containing only heading text

#### Scenario: Context transition flushes current chunk
- **WHEN** body has been accumulated under one heading context and a new heading context begins
- **THEN** the current chunk is finalized before accumulating body under the new context

### Requirement: Oversized body fallback SHALL preserve section context
When a body block exceeds `MAX_WORDS` and sentence-group fallback is used, each emitted sentence-group chunk SHALL include the same inherited section context prefix.

#### Scenario: Sentence-split chunks keep identical context
- **WHEN** a large body block is split into multiple chunks by sentence grouping
- **THEN** each emitted chunk begins with the same context path that applied to the oversized block
@@ -0,0 +1,22 @@
## 1. Context-aware block modeling

- [x] 1.1 Update `chunking.py` parsing flow to track active heading hierarchy and associate each body block with its heading context.
- [x] 1.2 Implement heading-path normalization that emits at most the top two heading levels and supports single-level context.
- [x] 1.3 Ensure heading-only segments do not produce body blocks eligible for chunk emission.

## 2. Chunk emission and sizing behavior

- [x] 2.1 Update chunk accumulation logic to flush when context changes so one chunk does not mix body from multiple contexts.
- [x] 2.2 Prepend plain context text (`H1 > H2` with no marker) to emitted chunk text while keeping body-only word budgeting for `TARGET_WORDS` and `MAX_WORDS`.
- [x] 2.3 Preserve oversized-block sentence fallback and ensure each sentence-group chunk inherits the same context prefix.

## 3. Test coverage updates

- [x] 3.1 Extend `test_chunking.py` with golden-output tests for top-two context formatting, context-change flush behavior, and heading-only suppression.
- [x] 3.2 Add tests proving context is excluded from budget calculations and may increase total emitted words above `MAX_WORDS`.
- [x] 3.3 Update oversized-block tests to assert context preservation across sentence-split outputs.

## 4. Retrieval validation

- [x] 4.1 Add or adjust `test_service.py` assertions to verify ingested `chunks.content_text` includes inlined context for headed markdown content.
- [x] 4.2 Run targeted test suites (`test_chunking.py`, relevant `test_service.py` cases) and resolve regressions.
129 changes: 109 additions & 20 deletions packages/search-core/src/grogbot_search/chunking.py
@@ -1,7 +1,8 @@
from __future__ import annotations

from dataclasses import dataclass
import re
from typing import Iterable, List
from typing import List, Optional

from bs4 import BeautifulSoup
import markdown as markdown_lib
@@ -10,6 +11,13 @@
MAX_WORDS = 1024


@dataclass
class BodyBlock:
text: str
words: int
context: str


def markdown_to_text(markdown: str) -> str:
html = markdown_lib.markdown(markdown)
soup = BeautifulSoup(html, "html.parser")
@@ -43,56 +51,137 @@ def _word_count(text: str) -> int:
return len(text.split())


def _parse_heading_line(line: str) -> Optional[tuple[int, str]]:
match = re.match(r"^\s*(#{1,6})\s+(.*?)\s*$", line)
if not match:
return None

level = len(match.group(1))
heading_raw = re.sub(r"\s+#+\s*$", "", match.group(2)).strip()
heading_text = markdown_to_text(heading_raw)
if not heading_text:
return None
return level, heading_text


def _normalize_context_path(heading_stack: List[Optional[str]]) -> str:
top_two = [heading for heading in heading_stack if heading][:2]
return " > ".join(top_two)


def _parse_body_blocks(markdown: str) -> List[BodyBlock]:
blocks: List[BodyBlock] = []
heading_stack: List[Optional[str]] = []
paragraph_lines: List[str] = []

def flush_paragraph() -> None:
nonlocal paragraph_lines
if not paragraph_lines:
return

paragraph = "\n".join(paragraph_lines).strip()
paragraph_lines = []
if not paragraph:
return

text = markdown_to_text(paragraph)
if not text:
return

blocks.append(
BodyBlock(
text=text,
words=_word_count(text),
context=_normalize_context_path(heading_stack),
)
)

for line in markdown.splitlines():
heading = _parse_heading_line(line)
if heading is not None:
flush_paragraph()
level, heading_text = heading

while len(heading_stack) < level:
heading_stack.append(None)
heading_stack = heading_stack[:level]
heading_stack[level - 1] = heading_text
continue

if not line.strip():
flush_paragraph()
continue

paragraph_lines.append(line)

flush_paragraph()
return blocks


def _compose_chunk_text(*, context: str, body: str) -> str:
if context:
return f"{context} {body}".strip()
return body.strip()


def chunk_markdown(markdown: str) -> List[str]:
sections = _split_sections(markdown)
blocks: List[str] = []
for section in sections:
blocks.extend(_split_paragraphs(section))
blocks = _parse_body_blocks(markdown)

chunks: List[str] = []
current: List[str] = []
current_words = 0
current_context: Optional[str] = None

def emit(body: str, context: str) -> None:
body = body.strip()
if not body:
return
chunks.append(_compose_chunk_text(context=context, body=body))

def flush_current() -> None:
nonlocal current, current_words
nonlocal current, current_words, current_context
if current:
chunks.append("\n\n".join(current).strip())
emit(" ".join(current), current_context or "")
current = []
current_words = 0
current_context = None

for block in blocks:
block_text = markdown_to_text(block)
block_words = _word_count(block_text)
if block_words > MAX_WORDS:
if current:
flush_current()
sentences = _split_sentences(block_text)
if block.words > MAX_WORDS:
flush_current()
sentences = _split_sentences(block.text)
sentence_group: List[str] = []
sentence_words = 0
for sentence in sentences:
word_count = _word_count(sentence)
if sentence_words + word_count > MAX_WORDS and sentence_group:
chunks.append(" ".join(sentence_group).strip())
emit(" ".join(sentence_group), block.context)
sentence_group = []
sentence_words = 0
sentence_group.append(sentence)
sentence_words += word_count
if sentence_words >= TARGET_WORDS:
chunks.append(" ".join(sentence_group).strip())
emit(" ".join(sentence_group), block.context)
sentence_group = []
sentence_words = 0
if sentence_group:
chunks.append(" ".join(sentence_group).strip())
emit(" ".join(sentence_group), block.context)
continue

if current_words + block_words > MAX_WORDS and current:
if current and block.context != current_context:
flush_current()

current.append(block)
current_words += block_words
if current_words + block.words > MAX_WORDS and current:
flush_current()

if not current:
current_context = block.context

current.append(block.text)
current_words += block.words

if current_words >= TARGET_WORDS:
flush_current()

flush_current()
return [markdown_to_text(chunk) for chunk in chunks if chunk]
return chunks