Merged
@@ -1,4 +1,4 @@
name: Tests
name: Verify PR

on:
pull_request:
4 changes: 2 additions & 2 deletions README.md
@@ -1,10 +1,10 @@
# Grogbot

Grogbot is a uv-based Python monorepo for multiple systems. The first system, **search**, provides local storage and rank-fused search over markdown documents, exposed through both a CLI and a FastAPI service.
Grogbot is a uv-based Python monorepo for multiple systems. The first system, **search**, provides local storage and rank-fused search over markdown documents using FTS, vector, and link authority signals, exposed through both a CLI and a FastAPI service.

## Packages

- **`grogbot-search-core`** (`packages/search-core`): Core models, SQLite persistence, ingestion, chunking, embeddings, and rank-fused search.
- **`grogbot-search-core`** (`packages/search-core`): Core models, SQLite persistence, ingestion, chunking, embeddings, document-link graph storage, and three-signal rank-fused search.
- **`grogbot-cli`** (`packages/cli`): Typer-powered CLI (`grogbot`) that surfaces search functionality.
- **`grogbot-api`** (`packages/api`): FastAPI app exposing the search system over HTTP.

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-04
90 changes: 90 additions & 0 deletions openspec/changes/archive/2026-03-04-add-link-rank-signal/design.md
@@ -0,0 +1,90 @@
## Context

Search ranking currently fuses two chunk-level retrieval streams (FTS and vector) using reciprocal row-number scoring. The system has no graph signal for document authority, even though ingested markdown often contains outbound links. We want a lightweight PageRank-style signal in which documents linked by more distinct source documents receive a higher rank.

Existing behavior already separates document upsert from chunking: content changes delete chunks, and `chunk_document(document_id)` regenerates chunks. This change should align link lifecycle with that same pattern.

Constraints:
- Link identity is exactly `(from_document_id, to_document_id)` with no per-link count.
- Multiple links from one document to the same target collapse to one edge.
- Outbound links from a document must be cleared when its content changes or when the document is deleted.
- Link extraction should run as part of `chunk_document`.
- Search must expose `link_score` and treat it as equal-weight to FTS and vector scores.
- Documents with zero inbound links must have `link_score = 0.0`.

## Goals / Non-Goals

**Goals:**
- Add persistent link graph storage with uniqueness by `(from,to)`.
- Keep links in sync with document lifecycle (content change/delete/chunk regenerate).
- Derive `to_document_id` for unknown targets using `_canonicalize_url` + `document_id_for_url`.
- Add a third reciprocal-rank search signal (`link_score`) with equal additive weight.
- Return `link_score` in `SearchResult` payloads.

**Non-Goals:**
- Implement iterative/global PageRank or damping-factor graph algorithms.
- Track per-link multiplicity within one source document.
- Add new API endpoints for direct link CRUD.
- Change query endpoint/CLI shape beyond additional `link_score` field in results.

## Decisions

1. **Store links in a dedicated `links` table keyed by `(from_document_id, to_document_id)`**
- Decision: Add table:
- `from_document_id TEXT NOT NULL`
- `to_document_id TEXT NOT NULL`
- `PRIMARY KEY (from_document_id, to_document_id)`
- `FOREIGN KEY (from_document_id) REFERENCES documents(id) ON DELETE CASCADE`
- index on `to_document_id`
- Rationale: Enforces one edge per source-target pair and supports efficient inbound counting.
- Alternative considered: `id` surrogate plus unique index; rejected as unnecessary complexity.
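The DDL implied by this decision can be sketched as follows (the helper name `init_links_schema` is illustrative; in the real service this runs inside `SearchService._init_schema`):

```python
import sqlite3


def init_links_schema(conn: sqlite3.Connection) -> None:
    # Idempotent DDL: the composite primary key enforces one edge per
    # (from, to) pair, and the index supports inbound-count lookups.
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS links (
            from_document_id TEXT NOT NULL,
            to_document_id TEXT NOT NULL,
            PRIMARY KEY (from_document_id, to_document_id),
            FOREIGN KEY (from_document_id)
                REFERENCES documents(id) ON DELETE CASCADE
        );
        CREATE INDEX IF NOT EXISTS idx_links_to_document_id
            ON links(to_document_id);
        """
    )
```

With this shape, repeated `INSERT OR IGNORE` of the same pair is a no-op, which is what makes the full-refresh extraction path safe to rerun.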

2. **Extract outbound links during `chunk_document` and fully refresh per source document**
- Decision: In `chunk_document(document_id)`, delete existing outbound links for `document_id`, then extract links from `document.content_markdown`, dedupe targets, ignore self-links, and insert with `INSERT OR IGNORE`.
- Rationale: Mirrors chunk regeneration semantics and guarantees graph consistency with current content.
- Alternative considered: extract during `upsert_document`; rejected because chunking is already the indexing boundary.
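A minimal sketch of the refresh path described above. The markdown-link regex is a simplification, and `_canonicalize_url` / `document_id_for_url` are passed in as callables because their implementations are not shown in this change:

```python
import re
import sqlite3

# Simplified matcher for markdown links of the form [text](href).
_MD_LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def refresh_outbound_links(
    conn: sqlite3.Connection,
    document_id: str,
    content_markdown: str,
    canonicalize_url,
    document_id_for_url,
) -> None:
    # Full refresh: drop existing edges for this source document, then
    # re-insert deduped, non-self targets extracted from current content.
    conn.execute("DELETE FROM links WHERE from_document_id = ?", (document_id,))
    targets = set()
    for href in _MD_LINK.findall(content_markdown):
        to_id = document_id_for_url(canonicalize_url(href))
        if to_id != document_id:  # ignore self-links
            targets.add(to_id)
    conn.executemany(
        "INSERT OR IGNORE INTO links (from_document_id, to_document_id)"
        " VALUES (?, ?)",
        [(document_id, target) for target in sorted(targets)],
    )
```

Because the delete-then-insert runs inside `chunk_document`, the link graph always reflects the content that produced the current chunks.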

3. **Clear outbound links when document content changes**
- Decision: In `upsert_document`, when `content_changed` is true, delete both chunks and outbound links for that document before commit.
- Rationale: Prevents stale link edges between upsert and next chunk run.
- Alternative considered: only clear on next `chunk_document`; rejected because stale links would affect ranking in the interim.

4. **Treat unknown targets as first-class link destinations**
- Decision: For each extracted href, compute `target_url = _canonicalize_url(href)` and `to_document_id = document_id_for_url(target_url)` even if no matching `documents` row exists.
- Rationale: Preserves graph evidence ahead of ingestion order and matches requested behavior.
- Alternative considered: only store links to known documents; rejected by requirement.

5. **Compute link signal as query-time reciprocal rank over candidate documents with inbound links**
- Decision: In `search`, add CTEs to:
- map candidate chunks to candidate documents,
- count inbound edges per candidate document (`COUNT(*)` on distinct `(from,to)` table rows),
- rank only documents with inbound count > 0 by `inbound_count DESC, document_id ASC`,
- compute `link_score = 1.0 / (1 + row_number)`,
- `COALESCE(link_score, 0.0)` for documents with zero inbound links.
- Rationale: Keeps scoring query-local, deterministic, and directly combinable with existing reciprocal FTS/vector channels.
- Alternative considered: global precomputed link ranks; rejected for additional complexity and staleness management.
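The link-score CTEs can be sketched in isolation like this. The literal `VALUES` candidate list is a stand-in; in the real query the candidate documents come from the FTS/vector candidate CTEs:

```python
import sqlite3

# Query-local link scoring: count inbound edges per candidate document,
# rank only documents with inbound_count > 0 (deterministic tie-break on
# document_id), and zero-fill everything else via COALESCE.
LINK_SCORE_SQL = """
WITH candidates(document_id) AS (
    VALUES ('a'), ('b'), ('c')
),
inbound AS (
    SELECT c.document_id, COUNT(*) AS inbound_count
    FROM candidates c
    JOIN links l ON l.to_document_id = c.document_id
    GROUP BY c.document_id
),
ranked AS (
    SELECT document_id,
           1.0 / (1 + ROW_NUMBER() OVER (
               ORDER BY inbound_count DESC, document_id ASC)) AS link_score
    FROM inbound
)
SELECT c.document_id, COALESCE(r.link_score, 0.0) AS link_score
FROM candidates c
LEFT JOIN ranked r ON r.document_id = c.document_id
ORDER BY c.document_id
"""
```

Given edges `x→a`, `y→a`, and `x→b`, document `a` ranks first (`link_score = 0.5`), `b` second (`1/3`), and `c`, with no inbound links, gets `0.0`. Note this requires SQLite 3.25+ for window functions.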

6. **Expose `link_score` in the public search model**
- Decision: Extend `SearchResult` with a required `link_score: float` and populate it in `SearchService.search`.
- Rationale: Required for transparency and downstream tuning/inspection.
- Alternative considered: keep link score internal to `score`; rejected by requirement.

## Risks / Trade-offs

- **[Risk] Relative/fragment links may map to hashed IDs that never resolve cleanly** → **Mitigation:** canonicalize uniformly and accept unknown targets; behavior remains deterministic and non-blocking.
- **[Risk] Additional SQL CTE/join complexity may affect query latency** → **Mitigation:** keep candidate pool bounded (`limit * 10`), index `links.to_document_id`, and validate via test coverage.
- **[Risk] Query-local link ranking means scores are relative to candidate set, not global authority** → **Mitigation:** intentional for equal-weight reciprocal fusion; revisit with global precompute only if relevance data demands it.
- **[Risk] Stale links if ingestion writes content but chunking is deferred indefinitely** → **Mitigation:** content-change path clears outbound links immediately; synchronization jobs can rebuild when needed.

## Migration Plan

1. Add `links` table/index in `_init_schema` (idempotent `CREATE TABLE/INDEX IF NOT EXISTS`).
2. Update link lifecycle in `upsert_document`, `chunk_document`, and delete behavior (via FK cascade on `from_document_id`).
3. Add outbound link extraction + insertion helper(s) used by `chunk_document`.
4. Extend search SQL and `SearchResult` model with `link_score`.
5. Update tests for link persistence/lifecycle and three-signal fusion semantics.
6. Rollback path: remove link-score CTE integration and `link_score` model field while leaving table unused (safe backward rollback without destructive migration).

## Open Questions

- None.
@@ -0,0 +1,28 @@
## Why

Search currently ranks chunks using only FTS and vector signals, so it misses an important authority cue: documents that are frequently linked by other documents should generally rank higher. Adding a link-based rank signal now improves relevance using data we already ingest in document content.

## What Changes

- Add persistent document-to-document link storage using `from_document_id` and `to_document_id` pairs, with uniqueness per pair (no intra-document duplicate counts).
- Generate and refresh outbound links for a document during `chunk_document(document_id)` by parsing links from the document content.
- Remove outbound links from a document whenever its content changes or the document is deleted, matching chunk lifecycle semantics.
- Store links to not-yet-ingested targets by canonicalizing target URLs and deriving `to_document_id` via `document_id_for_url`.
- Extend search rank fusion with a third, equal-weight signal (`link_score`) based on inbound-link rank (in-degree), where documents with zero inbound links receive `link_score = 0.0`.
- Expose `link_score` in search results alongside `fts_score`, `vector_score`, and total `score`.

## Capabilities

### New Capabilities
- `document-link-graph`: Manage outbound document links derived from document content and kept in sync with document chunking/deletion lifecycle.
- `search-link-rank-fusion`: Add a link-based reciprocal-rank signal to search scoring and expose per-result link scoring metadata.

### Modified Capabilities
- None.

## Impact

- Affected code: `packages/search-core/src/grogbot_search/service.py`, `packages/search-core/src/grogbot_search/models.py`, `packages/search-core/src/grogbot_search/__init__.py`.
- Affected tests: `packages/search-core/tests/test_service.py` (link persistence/lifecycle, scoring, and result payload assertions).
- API/CLI contracts: search response payload includes new `link_score` field; query endpoints/commands remain unchanged.
- Database/schema: new `links` table and related indexes/constraints/triggers as needed.
@@ -0,0 +1,34 @@
## ADDED Requirements

### Requirement: Outbound document links SHALL be stored as unique directed pairs
The system SHALL persist document links as directed `(from_document_id, to_document_id)` pairs, and it MUST prevent duplicate pairs from the same source document to the same target document.

#### Scenario: Multiple links to the same target in one document
- **WHEN** a document contains multiple outbound links that resolve to the same target URL
- **THEN** the system stores exactly one link pair for that `(from_document_id, to_document_id)` relationship

#### Scenario: Links to different targets from one document
- **WHEN** a document contains outbound links that resolve to different target URLs
- **THEN** the system stores one link pair per unique target document id

### Requirement: Link targets SHALL be derived even when target documents are not ingested
For each outbound link extracted from document content, the system SHALL canonicalize the URL and MUST derive `to_document_id` via `document_id_for_url` regardless of whether a corresponding `documents` row exists.

#### Scenario: Outbound link points to unknown target URL
- **WHEN** a chunked document links to a URL that has not been ingested as a document
- **THEN** the system stores a link pair using `to_document_id = document_id_for_url(_canonicalize_url(url))`
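The actual `document_id_for_url` implementation is not shown in this change; a hypothetical content-addressed sketch illustrates why unknown targets are safe to store ahead of ingestion (ids are deterministic functions of the canonical URL):

```python
import hashlib


def document_id_for_url(url: str) -> str:
    # Hypothetical: derive a stable document id from the canonical URL,
    # so the same URL always maps to the same id whether or not the
    # target document has been ingested yet.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
```

When the target is later ingested under the same canonical URL, its row id matches the ids already stored in `links`, and the inbound edges start counting toward its rank with no backfill step.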

### Requirement: Outbound links SHALL follow chunk lifecycle and ignore self-links
The system SHALL refresh outbound links for a document during `chunk_document(document_id)` by deleting existing links from that document and inserting links extracted from current content. The system MUST delete outbound links from a document when its content changes or the document is deleted. The system MUST ignore self-links where `from_document_id == to_document_id`.

#### Scenario: Chunking regenerates outbound links from current content
- **WHEN** `chunk_document(document_id)` is invoked for a document with previously stored outbound links
- **THEN** existing links from that document are deleted before new outbound links are inserted

#### Scenario: Content change clears stale outbound links before re-chunking
- **WHEN** `upsert_document` updates an existing document with changed `content_markdown`
- **THEN** all links where `from_document_id` equals that document id are deleted

#### Scenario: Self-links are excluded
- **WHEN** an outbound link resolves to the same document id as the source document
- **THEN** the system does not store that link pair
@@ -0,0 +1,26 @@
## ADDED Requirements

### Requirement: Search SHALL include a link-based reciprocal-rank signal
The search ranking pipeline SHALL compute a `link_score` channel from inbound-link counts and MUST add it to existing reciprocal FTS and vector scores with equal weight.

#### Scenario: Final score includes all three signals
- **WHEN** a query returns ranked chunk candidates
- **THEN** each result score is computed as `score = fts_score + vector_score + link_score`
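The additive fusion in this scenario can be sketched with hypothetical helpers (the real computation happens inside the search SQL):

```python
def reciprocal_scores(ranked_ids: list[str]) -> dict[str, float]:
    # row_number starts at 1, so the top-ranked candidate in any channel
    # gets 1.0 / (1 + 1) = 0.5, the second 1/3, and so on.
    return {doc_id: 1.0 / (1 + row) for row, doc_id in enumerate(ranked_ids, start=1)}


def fused_score(fts_score: float, vector_score: float, link_score: float) -> float:
    # Equal-weight additive fusion of the three reciprocal-rank channels.
    return fts_score + vector_score + link_score
```

Because each channel is bounded by 0.5, no single signal can dominate the fused score, which is what "equal weight" means here.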

### Requirement: Link score SHALL rank by inbound links and zero-fill missing link authority
For ranked candidate documents with one or more inbound links, the system SHALL assign link row numbers ordered by inbound link count descending and document id ascending for deterministic ties, and MUST compute `link_score = 1.0 / (1 + row_number)`. Documents with zero inbound links MUST receive `link_score = 0.0`.

#### Scenario: Document with inbound links gets reciprocal link score
- **WHEN** a candidate document has at least one inbound link and ranks first among candidate documents by inbound count
- **THEN** its `link_score` is `1.0 / (1 + 1)`

#### Scenario: Document without inbound links gets zero link score
- **WHEN** a candidate document has zero inbound links
- **THEN** its `link_score` is `0.0`

### Requirement: Search results SHALL expose link_score
The search result model SHALL include `link_score` for every returned result, alongside `fts_score`, `vector_score`, and final `score`.

#### Scenario: Query response contains link_score field
- **WHEN** search results are returned from the service
- **THEN** each result includes a numeric `link_score` field
32 changes: 32 additions & 0 deletions openspec/changes/archive/2026-03-04-add-link-rank-signal/tasks.md
@@ -0,0 +1,32 @@
## 1. Schema and model updates

- [x] 1.1 Add `links` table creation to `SearchService._init_schema` with `(from_document_id, to_document_id)` primary key and FK cascade on `from_document_id`
- [x] 1.2 Add an index on `links.to_document_id` for inbound count lookups
- [x] 1.3 Extend `SearchResult` in `packages/search-core/src/grogbot_search/models.py` with required `link_score: float`

## 2. Link extraction and lifecycle behavior

- [x] 2.1 Implement markdown outbound-link extraction helper(s) in `service.py` and normalize each href with `_canonicalize_url`
- [x] 2.2 Derive `to_document_id` via `document_id_for_url` for every extracted target and dedupe pairs per source document
- [x] 2.3 Update `upsert_document` to delete outbound links for a document when `content_markdown` changes
- [x] 2.4 Update `chunk_document` to clear existing outbound links for the source document, ignore self-links, and insert refreshed links alongside chunk regeneration

## 3. Search rank fusion integration

- [x] 3.1 Extend the `search` SQL CTE pipeline to compute candidate-document inbound-link counts from `links`
- [x] 3.2 Add reciprocal `link_score` ranking (`1.0 / (1 + row_number)`) ordered by inbound count DESC then document id ASC
- [x] 3.3 Ensure documents with zero inbound links receive `link_score = 0.0` via `COALESCE`
- [x] 3.4 Update final score computation to `fts_score + vector_score + link_score` and map `link_score` into returned `SearchResult` objects

## 4. Test coverage

- [x] 4.1 Add tests verifying unique `(from,to)` storage and collapse of multiple same-target links within one source document
- [x] 4.2 Add tests verifying unknown target URLs are stored via `document_id_for_url(_canonicalize_url(url))`
- [x] 4.3 Add tests verifying self-links are ignored and outbound links are cleared on content change/document delete/chunk refresh
- [x] 4.4 Add ranking tests verifying three-signal additive scoring, deterministic tie handling, and `link_score = 0.0` for zero-inbound documents
- [x] 4.5 Add result-shape tests verifying `link_score` is present in query results (service/API/CLI JSON output paths as applicable)

## 5. Validation and readiness

- [x] 5.1 Run `uv run pytest packages/search-core/tests` and confirm passing coverage for updated behavior
- [x] 5.2 Update any remaining user/developer-facing wording that describes search scoring to include the new `link_score` signal
20 changes: 10 additions & 10 deletions packages/cli/src/grogbot_cli/app.py
@@ -236,16 +236,6 @@ def bootstrap(

sources_list = list(sources)
with _service() as service:
if not skip_sitemaps:
for source in sources_list:
sitemap = source.get("sitemap")
if not sitemap:
continue
typer.echo(f"Scraping sitemap {sitemap}")
try:
service.create_documents_from_sitemap(sitemap, bootstrap=True)
except Exception as exc:
print(f"Bootstrap failed for sitemap {sitemap}: {exc}", file=sys.stderr)
if not skip_feeds:
for source in sources_list:
feed = source.get("feed")
@@ -256,6 +246,16 @@
service.create_documents_from_feed(feed, paginate=True)
except Exception as exc:
print(f"Bootstrap failed for feed {feed}: {exc}", file=sys.stderr)
if not skip_sitemaps:
for source in sources_list:
sitemap = source.get("sitemap")
if not sitemap:
continue
typer.echo(f"Scraping sitemap {sitemap}")
try:
service.create_documents_from_sitemap(sitemap, bootstrap=True)
except Exception as exc:
print(f"Bootstrap failed for sitemap {sitemap}: {exc}", file=sys.stderr)


@search_app.command("query")
11 changes: 10 additions & 1 deletion packages/search-core/src/grogbot_search/embeddings.py
@@ -5,13 +5,22 @@

from sentence_transformers import SentenceTransformer

_EMBEDDING_BATCH_SIZE = 8


@lru_cache(maxsize=1)
def _load_model() -> SentenceTransformer:
return SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)


def embed_texts(texts: Iterable[str], *, prompt: str) -> List[list[float]]:
text_list = list(texts)
if not text_list:
return []

model = _load_model()
embeddings = model.encode(list(texts), normalize_embeddings=True, prompt=prompt)
embeddings = []
for start in range(0, len(text_list), _EMBEDDING_BATCH_SIZE):
batch = text_list[start : start + _EMBEDDING_BATCH_SIZE]
embeddings.extend(model.encode(batch, normalize_embeddings=True, prompt=prompt))
return [embedding.tolist() for embedding in embeddings]
3 changes: 2 additions & 1 deletion packages/search-core/src/grogbot_search/models.py
@@ -32,6 +32,7 @@ class Chunk(BaseModel):
class SearchResult(BaseModel):
chunk: Chunk
document: Document
score: float = Field(..., description="Final rank-fusion score combining FTS and vector rankings")
score: float = Field(..., description="Final rank-fusion score combining FTS, vector, and link rankings")
fts_score: float
vector_score: float
link_score: float