13 changes: 13 additions & 0 deletions README.md
@@ -38,9 +38,22 @@ GET /search/sources
POST /search/sources
GET /search/documents/{document_id}
POST /search/ingest/url
POST /search/documents/embed
POST /search/documents/embed/sync
GET /search/query?q=hello+world
```

## Document storage and embedding workflow

- `content_markdown` is accepted on upsert/ingest inputs, but it is **not persisted** in the `documents` table.
- Documents now persist a compact `content_hash` (6-character lowercase hex digest).
- Upsert/ingestion regenerates plaintext chunks and outbound links when content changes.
- Embeddings are generated explicitly:
- CLI: `grogbot search document embed <document_id>`
- CLI (bulk): `grogbot search document embed-sync --maximum 100`
- API: `POST /search/documents/embed`
- API (bulk): `POST /search/documents/embed/sync`

## Development

Install test dependencies for the search core and run pytest:
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-04
@@ -0,0 +1,67 @@
## Context

The current search-core persistence model stores full markdown in `documents.content_markdown` and also stores plaintext chunk content in `chunks.content_text`. This duplicates large payloads and increases SQLite file size over time. The current `chunk_document` and `synchronize_document_chunks` operations also couple chunk regeneration with embedding generation, which makes expensive vector work harder to control independently.

This change needs to preserve ingestion behavior (document metadata + link graph + chunk plaintext) while moving vector creation into a separate explicit phase. Existing API/CLI consumers currently receive `Document.content_markdown`, so this is an intentional breaking contract change.

## Goals / Non-Goals

**Goals:**
- Remove persisted markdown copies from documents.
- Add compact content change detection with `content_hash` as a 6-character lowercase hex digest.
- Make document upsert/ingest persist metadata and regenerate chunks/links without running embeddings.
- Add explicit embedding operations (single-document and bulk sync) that mirror current chunk sync ergonomics.
- Keep search behavior stable by continuing to read plaintext chunks from `chunks` and vectors from `chunks_vec`.

**Non-Goals:**
- Changing chunking heuristics, embedding model, or rank-fusion formulas.
- Introducing async/background workers.
- Adding new ingestion source types.

## Decisions

1. **Replace `documents.content_markdown` with `documents.content_hash`**
- Decision: Store a 6-character lowercase hex digest derived from incoming markdown content.
- Rationale: Enables fast change detection without storing full markdown text.
- Constraint: Enforce shape at the DB layer (`length = 6`, lowercase hex only).
- Alternative considered: keep full markdown plus hash; rejected because it does not reduce primary storage pressure.
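The digest derivation in this decision — confirmed by the migration plan's `sha256(markdown)[:6]` — can be sketched as follows (the function name is illustrative, not the actual service API):

```python
import hashlib


def compute_content_hash(markdown: str) -> str:
    """Return the compact 6-character lowercase hex digest used for change detection."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()[:6]


digest = compute_content_hash("# Example document\n\nHello world.")
```

Because `hexdigest()` already yields lowercase hex, the DB-layer shape constraint is satisfied by construction.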

2. **Regenerate chunks and links inside document upsert when content changes**
- Decision: `upsert_document` receives markdown input, computes hash, and if changed (or new doc), clears existing chunks/links and recreates plaintext chunks and outbound links immediately.
- Rationale: Keeps chunk/link data current without requiring a follow-up operation and aligns with requested document creation behavior.
- Alternative considered: leave chunk regeneration as a separate call; rejected because requested flow requires document + links + plaintext chunks at creation/upsert time.

3. **Split embeddings into dedicated operations**
- Decision: Add explicit methods for embedding generation (single document + bulk sync for missing vectors).
- Rationale: Embeddings are resource intensive and should be independently controllable like current chunk sync workflows.
- Alternative considered: keep embeddings in `chunk_document`; rejected because it preserves current coupling and does not satisfy resource-control goals.

4. **Define embedding sync by missing vector rows, not by chunkless documents**
- Decision: Bulk embedding sync selects chunks/documents where chunks exist but corresponding `chunks_vec` rows are missing.
- Rationale: Embedding state is now independent; chunk existence alone is insufficient.
- Alternative considered: always re-embed all chunks for selected documents; rejected due to unnecessary compute and cost.
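The missing-vector selection can be expressed as a `LEFT JOIN` anti-join; a minimal sketch with simplified table shapes (column names here are assumptions, not the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chunks (id INTEGER PRIMARY KEY, document_id TEXT, content_text TEXT);
CREATE TABLE chunks_vec (chunk_id INTEGER PRIMARY KEY, embedding BLOB);
INSERT INTO chunks VALUES (1, 'doc-a', 'alpha'), (2, 'doc-a', 'beta'), (3, 'doc-b', 'gamma');
INSERT INTO chunks_vec (chunk_id, embedding) VALUES (1, x'00');
""")

# Chunks that exist but have no corresponding vector row yet.
missing = conn.execute("""
    SELECT c.id, c.document_id
    FROM chunks AS c
    LEFT JOIN chunks_vec AS v ON v.chunk_id = c.id
    WHERE v.chunk_id IS NULL
    ORDER BY c.document_id, c.id  -- stable processing order
""").fetchall()
# → chunk 2 of doc-a and chunk 3 of doc-b still need embeddings
```

Selecting by absent `chunks_vec` rows (rather than by chunkless documents) is what makes re-runs idempotent: already-embedded chunks simply fall out of the result set.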

5. **Migrate existing databases via table rebuild + backfill hash**
- Decision: Add a schema migration path that rebuilds `documents` to new columns and backfills `content_hash` from old `content_markdown` values before dropping markdown storage.
- Rationale: Preserves existing corpus while transitioning storage layout.
- Alternative considered: destructive reset; rejected because it would force complete re-ingestion.

## Risks / Trade-offs

- **[Risk] 6-hex hash collisions can suppress needed chunk/link refreshes** → **Mitigation:** accept as an explicit compactness trade-off. A refresh is suppressed only when a document's old and new content happen to share a 24-bit digest (roughly 1 in 16.7 million per update); the scope is a local corpus, and operators can force a refresh by re-ingesting the source content if it ever occurs.
- **[Risk] No stored markdown prevents offline re-chunking from DB alone** → **Mitigation:** accepted product trade-off; re-chunking requires re-ingestion from source content.
- **[Risk] Breaking API/CLI contract for `Document.content_markdown`** → **Mitigation:** mark as BREAKING in proposal/specs and update response models/docs in the same change.
- **[Risk] Migration complexity around legacy schema** → **Mitigation:** implement deterministic startup migration with transactional table swap and tests over pre-migration fixtures.

## Migration Plan

1. Detect legacy `documents` schema containing `content_markdown` and missing `content_hash`.
2. Create replacement `documents` table with new shape and constraints.
3. Copy rows from old table into new table while computing `content_hash` from legacy markdown (`sha256(markdown)[:6]`).
4. Swap tables atomically in a transaction and recreate dependent indexes/foreign keys.
5. Keep existing chunks/links/vectors intact; future upserts maintain hash/chunks/links and embedding sync handles vectors.
6. Rollback strategy: restore from SQLite backup file if migration fails before commit.
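Steps 1–4 above can be sketched against a simplified legacy schema (the real `documents` table has more columns; this is an illustration of the detect/rebuild/backfill/swap shape, not the production migration):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for a legacy database that still stores full markdown.
conn.executescript("""
CREATE TABLE documents (id TEXT PRIMARY KEY, url TEXT, content_markdown TEXT);
INSERT INTO documents VALUES ('d1', 'https://example.com/a', '# Alpha');
""")

# Step 1: detect the legacy shape.
cols = {row[1] for row in conn.execute("PRAGMA table_info(documents)")}
if "content_markdown" in cols and "content_hash" not in cols:
    with conn:  # steps 2-4 commit together; failure rolls the swap back
        conn.execute("""
            CREATE TABLE documents_new (
                id TEXT PRIMARY KEY,
                url TEXT,
                content_hash TEXT CHECK (
                    length(content_hash) = 6
                    AND content_hash GLOB '[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]'
                )
            )
        """)
        rows = conn.execute("SELECT id, url, content_markdown FROM documents").fetchall()
        for doc_id, url, markdown in rows:
            # Step 3: backfill the hash from the legacy markdown column.
            digest = hashlib.sha256((markdown or "").encode("utf-8")).hexdigest()[:6]
            conn.execute("INSERT INTO documents_new VALUES (?, ?, ?)", (doc_id, url, digest))
        conn.execute("DROP TABLE documents")
        conn.execute("ALTER TABLE documents_new RENAME TO documents")

migrated = conn.execute("SELECT content_hash FROM documents WHERE id = 'd1'").fetchone()
```

Step 4 in the real migration would also recreate dependent indexes and re-enable foreign-key checks around the swap, which this sketch omits.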

## Open Questions

- None currently.
@@ -0,0 +1,30 @@
## Why

The SQLite database currently stores full document markdown alongside plaintext chunks, which duplicates large content and increases storage pressure as the corpus grows. We need to keep ingestion efficient while reducing persisted payload size and preserving explicit control over expensive embedding generation.

## What Changes

- **BREAKING** Remove persisted markdown from the `documents` table and `Document` model responses.
- Add compact document change detection via a new `content_hash` field (6-character lowercase hex digest).
- Update document creation/upsert flows to persist document metadata, regenerate outbound links, and regenerate plaintext chunks using the incoming markdown payload, without storing markdown.
- Split vector generation into explicit embedding operations so chunk plaintext creation and embedding persistence are independently triggerable.
- Add single-document and bulk synchronization operations for missing chunk embeddings, mirroring the existing chunk synchronization pattern.
- Update API and CLI surfaces to expose the new embedding operations and return payloads that no longer include full document markdown.

## Capabilities

### New Capabilities
- `document-compact-storage`: Persist document metadata plus a short content hash while dropping stored markdown and regenerating chunk/link data from ingestion inputs.
- `chunk-embedding-sync`: Provide explicit single-document and bulk operations to generate vector rows for chunk plaintext independently of chunk creation.

### Modified Capabilities
- None.

## Impact

- `packages/search-core/src/grogbot_search/service.py` (schema, upsert flow, chunk/link lifecycle, embedding sync methods).
- `packages/search-core/src/grogbot_search/models.py` (document fields).
- `packages/api/src/grogbot_api/app.py` (request/response contracts, new embedding endpoints).
- `packages/cli/src/grogbot_cli/app.py` (new embedding commands, output contract changes).
- Search result serialization and tests that currently rely on `document.content_markdown`.
- Backward compatibility: existing consumers expecting `content_markdown` will need to migrate.
@@ -0,0 +1,41 @@
## ADDED Requirements

### Requirement: Embedding generation is explicit and separate from chunk creation
The system SHALL provide embedding operations that generate `chunks_vec` rows from existing plaintext chunks, independent of document upsert and chunk/link generation.

#### Scenario: Upsert does not automatically embed chunks
- **WHEN** a document upsert creates or refreshes plaintext chunks
- **THEN** embedding rows are not created unless an embedding operation is invoked

### Requirement: Embed a single document's chunks
The system SHALL provide a single-document embedding operation that processes one document id and creates missing vector rows for that document's chunks.

#### Scenario: Embedding an existing document
- **WHEN** embedding is requested for a document that exists and has chunks missing vectors
- **THEN** the system creates vector rows for missing chunks and returns the number of vectors created

#### Scenario: Embedding a missing document
- **WHEN** embedding is requested for a document id that does not exist
- **THEN** the system returns a not-found error and creates no vectors

### Requirement: Synchronize embeddings in bulk
The system SHALL provide a bulk embedding synchronization operation that processes documents with missing chunk vectors in stable order and supports an optional maximum document count.

#### Scenario: Bulk embedding sync without maximum
- **WHEN** bulk embedding synchronization runs without a maximum
- **THEN** the system processes all documents with at least one chunk missing a vector and returns total vectors created

#### Scenario: Bulk embedding sync with maximum
- **WHEN** bulk embedding synchronization runs with a maximum of N
- **THEN** the system processes at most N eligible documents in stable order and returns total vectors created

### Requirement: Embedding operations are exposed through API and CLI
The system SHALL expose single-document and bulk embedding synchronization through API and CLI interfaces.

#### Scenario: API embedding request
- **WHEN** a client calls the single-document or bulk embedding API endpoint
- **THEN** the response includes the number of vectors created

#### Scenario: CLI embedding command
- **WHEN** a user runs the single-document or bulk embedding CLI command
- **THEN** the command outputs the number of vectors created
@@ -0,0 +1,35 @@
## ADDED Requirements

### Requirement: Documents store compact content metadata
The system SHALL persist documents without storing full markdown content and MUST store a `content_hash` field as a 6-character lowercase hexadecimal digest.

#### Scenario: Document is persisted without markdown copy
- **WHEN** a document is created or updated
- **THEN** the persisted document record includes metadata fields and `content_hash`, and does not include a full markdown-content column

#### Scenario: Content hash format is enforced
- **WHEN** a document record is persisted
- **THEN** `content_hash` is exactly 6 characters and contains only lowercase hexadecimal characters (`0-9`, `a-f`)
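The shape rule in this scenario amounts to a simple pattern check; a sketch of application-side validation (names are illustrative — the spec requires enforcement at the DB layer, which this would only mirror):

```python
import re

# Exactly 6 lowercase hexadecimal characters, per the scenario above.
CONTENT_HASH_RE = re.compile(r"\A[0-9a-f]{6}\Z")


def is_valid_content_hash(value: str) -> bool:
    return CONTENT_HASH_RE.fullmatch(value) is not None
```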

### Requirement: Upsert refreshes chunks and links based on content hash changes
The system SHALL compute `content_hash` from incoming markdown during upsert and MUST use hash changes to control chunk/link regeneration.

#### Scenario: New document upsert generates chunks and links
- **WHEN** a document is upserted for a canonical URL that does not yet exist
- **THEN** the system stores the document with `content_hash`, creates plaintext chunks, and inserts outbound links derived from the provided markdown

#### Scenario: Changed content hash triggers refresh
- **WHEN** an existing document is upserted and the computed `content_hash` differs from the stored hash
- **THEN** existing chunks and outbound links for that document are deleted and regenerated from the incoming markdown

#### Scenario: Unchanged content hash preserves chunks and links
- **WHEN** an existing document is upserted and the computed `content_hash` matches the stored hash
- **THEN** existing chunks and outbound links are retained and only document metadata updates are applied
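The three scenarios above reduce to one hash comparison on upsert; a minimal in-memory sketch (dicts stand in for the SQLite tables, and link extraction is omitted):

```python
import hashlib

documents: dict = {}  # canonical URL -> stored content_hash
chunks: dict = {}     # canonical URL -> plaintext chunks

def upsert_document(url: str, markdown: str) -> bool:
    """Return True when chunks were regenerated, False when content was unchanged."""
    incoming = hashlib.sha256(markdown.encode("utf-8")).hexdigest()[:6]
    if documents.get(url) == incoming:
        return False  # unchanged hash: keep existing chunks and outbound links
    documents[url] = incoming
    chunks[url] = [p for p in markdown.split("\n\n") if p]  # stand-in for real chunking
    return True

first = upsert_document("https://example.com/a", "para one\n\npara two")    # new document
second = upsert_document("https://example.com/a", "para one\n\npara two")   # unchanged
changed = upsert_document("https://example.com/a", "para one\n\npara three")  # hash differs
```

Note that no embeddings are produced anywhere in this flow — vector creation stays behind the explicit embedding operations.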

### Requirement: Ingestion paths produce chunk-ready documents without embeddings
All ingestion flows that create or update documents SHALL leave the document with current plaintext chunks and links while deferring vector generation to explicit embedding operations.

#### Scenario: URL or feed ingestion creates chunk-ready document
- **WHEN** ingestion creates or updates a document from URL/feed/opml/sitemap content
- **THEN** document metadata, content hash, plaintext chunks, and links are updated in the same ingestion call
- **AND** no new vector rows are required to be created by that ingestion call
@@ -0,0 +1,25 @@
## 1. Schema and model updates

- [x] 1.1 Replace `documents.content_markdown` with `documents.content_hash` in schema initialization and add DB-level 6-char lowercase-hex constraints.
- [x] 1.2 Implement startup migration for legacy databases that rebuilds `documents` and backfills `content_hash` from prior markdown values.
- [x] 1.3 Update `Document` model and related serialization/query mapping to remove `content_markdown` and include `content_hash`.

## 2. Document upsert and ingestion flow refactor

- [x] 2.1 Update `upsert_document` to compute `content_hash` from incoming markdown and persist metadata + hash.
- [x] 2.2 Regenerate plaintext chunks and outbound links during upsert for new documents or changed hashes, while preserving existing chunks/links on unchanged hashes.
- [x] 2.3 Ensure ingestion helpers (`create_document_from_url`, feed/opml/sitemap ingestion) continue to pass markdown input but no longer rely on persisted markdown.

## 3. Separate embedding lifecycle operations

- [x] 3.1 Refactor chunk creation helpers to support plaintext chunk insertion without immediate vector writes.
- [x] 3.2 Add single-document embedding operation that generates missing vector rows for one document and returns vectors created.
- [x] 3.3 Add bulk embedding synchronization operation over documents with missing vectors, with optional `maximum` and stable ordering.
- [x] 3.4 Expose embedding operations through API routes and CLI commands with consistent response payloads (`vectors_created`).

## 4. Search path, compatibility, and validation

- [x] 4.1 Update search result hydration and any document fetch/list paths to align with the new document shape (no markdown field).
- [x] 4.2 Add/adjust tests for schema migration, hash-based upsert behavior, chunk/link regeneration rules, and embedding sync operations.
- [x] 4.3 Add/adjust API and CLI tests for embedding endpoints/commands and breaking output contract changes.
- [x] 4.4 Update README or user-facing docs to describe the new embedding workflow and removed `content_markdown` field.
47 changes: 22 additions & 25 deletions packages/api/src/grogbot_api/app.py
@@ -25,34 +25,34 @@ class DocumentUpsertRequest(BaseModel):
     published_at: Optional[datetime] = None
 
 
-class ChunkDocumentRequest(BaseModel):
+class EmbedDocumentRequest(BaseModel):
     document_id: str
 
 
-class SyncChunkDocumentsRequest(BaseModel):
+class SyncEmbedDocumentsRequest(BaseModel):
     maximum: Optional[int] = None
 
 
 class IngestUrlRequest(BaseModel):
     url: str
-    chunk: bool = False
+    embed: bool = False
 
 
 class IngestFeedRequest(BaseModel):
     feed_url: str
     paginate: bool = False
-    chunk: bool = False
+    embed: bool = False
 
 
 class IngestOpmlRequest(BaseModel):
     opml_url: str
     paginate: bool = False
-    chunk: bool = False
+    embed: bool = False
 
 
 class IngestSitemapRequest(BaseModel):
     sitemap_url: str
-    chunk: bool = False
+    embed: bool = False
 
 
 def get_service():
@@ -123,19 +123,19 @@ def delete_document(document_id: str, service: SearchService = Depends(get_service)):
     return {"deleted": service.delete_document(document_id)}
 
 
-@app.post("/search/documents/chunk")
-def chunk_document(payload: ChunkDocumentRequest, service: SearchService = Depends(get_service)):
+@app.post("/search/documents/embed")
+def embed_document(payload: EmbedDocumentRequest, service: SearchService = Depends(get_service)):
     try:
-        count = service.chunk_document(payload.document_id)
+        count = service.embed_document_chunks(payload.document_id)
     except DocumentNotFoundError as exc:
         raise HTTPException(status_code=404, detail="Document not found") from exc
-    return {"chunks_created": count}
+    return {"vectors_created": count}
 
 
-@app.post("/search/documents/chunk/sync")
-def chunk_documents(payload: SyncChunkDocumentsRequest, service: SearchService = Depends(get_service)):
-    count = service.synchronize_document_chunks(maximum=payload.maximum)
-    return {"chunks_created": count}
+@app.post("/search/documents/embed/sync")
+def embed_documents(payload: SyncEmbedDocumentsRequest, service: SearchService = Depends(get_service)):
+    count = service.synchronize_document_embeddings(maximum=payload.maximum)
+    return {"vectors_created": count}
 
 
 @app.post("/search/ingest/url")
@@ -144,38 +144,35 @@ def ingest_url(payload: IngestUrlRequest, service: SearchService = Depends(get_service)):
         document = service.create_document_from_url(payload.url)
     except ValueError as exc:
         raise HTTPException(status_code=422, detail=str(exc)) from exc
-    if payload.chunk and not service.document_has_chunks(document.id):
-        service.chunk_document(document.id)
+    if payload.embed:
+        service.embed_document_chunks(document.id)
     return document
 
 
 @app.post("/search/ingest/feed")
 def ingest_feed(payload: IngestFeedRequest, service: SearchService = Depends(get_service)):
     documents = service.create_documents_from_feed(payload.feed_url, paginate=payload.paginate)
-    if payload.chunk:
+    if payload.embed:
         for document in documents:
-            if not service.document_has_chunks(document.id):
-                service.chunk_document(document.id)
+            service.embed_document_chunks(document.id)
     return documents
 
 
 @app.post("/search/ingest/opml")
 def ingest_opml(payload: IngestOpmlRequest, service: SearchService = Depends(get_service)):
     documents = service.create_documents_from_opml(payload.opml_url, paginate=payload.paginate)
-    if payload.chunk:
+    if payload.embed:
         for document in documents:
-            if not service.document_has_chunks(document.id):
-                service.chunk_document(document.id)
+            service.embed_document_chunks(document.id)
     return documents
 
 
 @app.post("/search/ingest/sitemap")
 def ingest_sitemap(payload: IngestSitemapRequest, service: SearchService = Depends(get_service)):
     documents = service.create_documents_from_sitemap(payload.sitemap_url)
-    if payload.chunk:
+    if payload.embed:
         for document in documents:
-            if not service.document_has_chunks(document.id):
-                service.chunk_document(document.id)
+            service.embed_document_chunks(document.id)
     return documents