Merged
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-04
@@ -0,0 +1,59 @@
## Context

The link graph is populated during `chunk_document(document_id)` by extracting markdown links and persisting unique `(from_document_id, to_document_id)` edges. Current filtering excludes self-links but still records links between different documents on the same canonical domain, which can over-amplify intra-site structures in the link-based ranking signal.

This change is constrained to search-core link extraction and tests. No schema change is required.

## Goals / Non-Goals

**Goals:**
- Persist outbound edges only when source and target canonical domains differ.
- Treat relative outbound links as same-domain by resolving against the source document canonical URL before filtering.
- Preserve existing behavior for dedupe, unknown target handling (`document_id_for_url`), and lifecycle refresh semantics.
- Update tests so link graph and ranking assertions reflect cross-domain-only edge persistence.

**Non-Goals:**
- Redefining canonical domain normalization (`netloc` remains the canonical domain key).
- Changing DB schema or link table shape.
- Introducing weighting by domain category or any other ranking-formula change beyond narrowing the input edge set.

## Decisions

1. **Apply domain filtering in URL-to-target derivation before insertion**
- Decision: Extend the outbound link derivation helper to accept the source document canonical URL and skip targets whose normalized domain equals the source domain.
- Rationale: Keeps filtering in one place with self-link exclusion and dedupe, minimizing lifecycle-path drift.
- Alternative considered: Filter at SQL insert time by joining source document data per edge; rejected as more complex and less explicit for relative-link handling.

2. **Resolve relative links against source canonical URL for domain checks**
- Decision: Use `urljoin(source_canonical_url, href)` prior to canonicalization/domain extraction when deriving targets.
- Rationale: Ensures internal relative links are consistently treated as same-domain and skipped.
- Alternative considered: Ignore relative links entirely; rejected because it changes existing link extraction semantics and may drop legitimate cross-domain absolute forms in mixed content.

3. **Keep unknown-target behavior unchanged after filtering**
- Decision: After passing domain filter, derive `to_document_id` via `document_id_for_url(_canonicalize_url(resolved_url))` even if no target document exists.
- Rationale: Preserves current contract and avoids coupling to ingestion state.
- Alternative considered: Require target documents to exist before storing edges; rejected as scope creep and behavior regression.

4. **Refactor tests to explicit multi-domain fixtures**
- Decision: Update link graph and link-score tests that currently rely on same-domain edges to use at least two domains.
- Rationale: Makes expectations align with the new rule and prevents accidental reliance on intra-domain edge persistence.
- Alternative considered: Keep same fixtures and loosen assertions; rejected because it obscures intended behavior.

## Risks / Trade-offs

- **[Risk] `netloc`-based equality treats `www.example.com` and `example.com` as different domains** → **Mitigation:** Preserve current canonical-domain contract for now; document behavior in spec scenarios and revisit in a separate normalization change if needed.
- **[Risk] Relative-link resolution may expose malformed markdown href values** → **Mitigation:** Continue canonicalization and skip empty/unusable URLs; maintain deterministic helper-level filtering.
- **[Risk] Fewer stored edges may reduce link-score differentiation on single-site corpora** → **Mitigation:** Intentional trade-off to avoid intra-site self-reinforcement; ranking still uses FTS/vector signals.
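The first risk can be observed directly with `urlparse`: a bare `netloc` comparison keeps the `www.`-prefixed host and the apex host distinct, so under the current canonical-domain contract a `www` → apex link would still be persisted.

```python
from urllib.parse import urlparse

# netloc equality, as used for the canonical-domain key, treats the
# www-prefixed host and the apex host as different domains.
www_host = urlparse("https://www.example.com/a").netloc   # "www.example.com"
apex_host = urlparse("https://example.com/b").netloc      # "example.com"
print(www_host == apex_host)  # False: the edge would be stored
```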

## Migration Plan

1. Update link-derivation helper signature and call sites to include source canonical URL.
2. Implement same-domain skip with relative-link resolution.
3. Update unit tests for link persistence/lifecycle and ranking expectations.
4. Run test suite and verify no API/CLI contract changes.

Rollback: revert helper/filter changes and corresponding test updates; schema remains unchanged so rollback is code-only.

## Open Questions

- None for this iteration.
@@ -0,0 +1,26 @@
## Why

Link-based ranking currently counts links between documents on the same canonical domain. This lets intra-site link structures inflate authority and can drown out cross-site signals, reducing relevance quality.

## What Changes

- Update outbound link extraction to skip link creation when source and target resolve to the same canonical domain.
- Keep existing self-link exclusion and duplicate `(from_document_id, to_document_id)` collapse behavior.
- Resolve relative links against the source document canonical URL before domain comparison so internal relative links are also skipped.
- Apply the same-domain skip rule even when the target document has not been ingested yet (compare using canonicalized target URL).
- Update link-graph and ranking tests to use multi-domain fixtures and validate same-domain exclusion.

## Capabilities

### New Capabilities
- `document-link-domain-filtering`: Filters outbound link graph edges so only cross-domain links are persisted for ranking.

### Modified Capabilities
- *(none)*

## Impact

- Affected code: `packages/search-core/src/grogbot_search/service.py` link extraction/insertion helpers.
- Affected tests: `packages/search-core/tests/test_service.py` link-graph lifecycle tests and link-score ranking fixtures/assertions.
- API/CLI shape: no contract changes expected; link persistence and derived `link_score` values change.
- Dependencies/systems: no new dependencies; SQLite schema remains unchanged.
@@ -0,0 +1,30 @@
## ADDED Requirements

### Requirement: Outbound link graph SHALL exclude same-canonical-domain targets
When generating outbound document links, the system SHALL compare canonical domains for source and target URLs and MUST skip persistence when both domains are equal.

#### Scenario: Absolute same-domain target is skipped
- **WHEN** a chunked document at `https://example.com/a` contains an outbound link to `https://example.com/b`
- **THEN** no `(from_document_id, to_document_id)` edge is stored for that link

#### Scenario: Cross-domain target is persisted
- **WHEN** a chunked document at `https://example.com/a` contains an outbound link to `https://other.example/b`
- **THEN** one directed edge is stored for the source document and resolved target document id

### Requirement: Relative outbound links MUST be resolved before domain filtering
For outbound links extracted from markdown, the system MUST resolve relative href values against the source document canonical URL before canonicalization and domain comparison.

#### Scenario: Relative internal path is treated as same-domain
- **WHEN** a chunked document at `https://example.com/posts/1` contains `[x](/about)`
- **THEN** the target resolves to `https://example.com/about` for domain comparison and no edge is stored

#### Scenario: Relative traversal path is treated as same-domain
- **WHEN** a chunked document at `https://example.com/posts/1` contains `[x](../archive)`
- **THEN** the resolved target domain matches the source domain and no edge is stored
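Both relative-link scenarios follow from standard `urljoin` resolution and can be checked in isolation:

```python
from urllib.parse import urljoin, urlparse

base = "https://example.com/posts/1"
# Root-relative and traversal hrefs both resolve onto the source host.
about = urljoin(base, "/about")        # "https://example.com/about"
archive = urljoin(base, "../archive")  # "https://example.com/archive"
print(urlparse(about).netloc == urlparse(base).netloc)    # True -> skipped
print(urlparse(archive).netloc == urlparse(base).netloc)  # True -> skipped
```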

### Requirement: Cross-domain unknown targets SHALL still derive target ids
After applying same-domain filtering, the system MUST derive `to_document_id` from the canonicalized target URL even when no target document is currently ingested.

#### Scenario: Unknown cross-domain URL stores derived target id
- **WHEN** a chunked document links to `https://external.site/not-ingested` and no document exists for that URL
- **THEN** the system stores the edge with `to_document_id = document_id_for_url(_canonicalize_url("https://external.site/not-ingested"))`
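The implementation of `document_id_for_url` is not shown in this spec; the only property the requirement relies on is that the id is a pure function of the canonical URL, so an edge can be stored before the target is ingested. A hypothetical stand-in illustrating that contract (the hashing scheme is invented for illustration, not the project's actual id derivation):

```python
import hashlib

def hypothetical_document_id_for_url(canonical_url: str) -> str:
    # Stand-in for the real document_id_for_url: any deterministic
    # URL-derived id satisfies the requirement, because the edge row can
    # be written before (or without) the target document being ingested.
    return hashlib.sha256(canonical_url.encode("utf-8")).hexdigest()[:16]

edge_target = hypothetical_document_id_for_url("https://external.site/not-ingested")
# Re-deriving the id later (e.g. at ingestion time) yields the same value,
# so the pre-stored edge lines up with the eventual document row.
assert edge_target == hypothetical_document_id_for_url("https://external.site/not-ingested")
```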
@@ -0,0 +1,23 @@
## 1. Link derivation updates

- [x] 1.1 Update outbound link target derivation helper(s) to accept the source document canonical URL.
- [x] 1.2 Resolve markdown href values with `urljoin(source_canonical_url, href)` before canonicalization/domain comparison.
- [x] 1.3 Skip target IDs whose normalized domain matches the source canonical domain while preserving self-link and dedupe behavior.
- [x] 1.4 Keep unknown cross-domain target handling unchanged (`to_document_id = document_id_for_url(_canonicalize_url(resolved_url))`).

## 2. Service integration

- [x] 2.1 Update `chunk_document` link insertion flow to provide source canonical URL to link-derivation logic.
- [x] 2.2 Verify no schema or API contract changes are introduced by the filtering update.

## 3. Test updates

- [x] 3.1 Update link-graph persistence tests to assert same-domain absolute links are skipped and cross-domain links persist.
- [x] 3.2 Add/adjust tests for relative-link resolution (`/path`, `../path`) being treated as same-domain and skipped.
- [x] 3.3 Update unknown-target tests to validate cross-domain unknown URLs still store derived `to_document_id`.
- [x] 3.4 Refactor link-score ranking fixtures/assertions to multi-domain documents so expected inbound-link ordering remains deterministic.

## 4. Verification

- [x] 4.1 Run `packages/search-core` tests and confirm all link graph/ranking assertions pass with same-domain filtering enabled.
- [x] 4.2 Manually sanity-check that link rows now represent cross-domain edges only for updated fixtures.
164 changes: 91 additions & 73 deletions packages/search-core/src/grogbot_search/service.py
@@ -34,23 +34,18 @@ class SearchScores:

_BACKOFF_STATUS_CODES = {401, 403, 429, 503}
_CAPTCHA_MARKERS = (
"captcha",
"cf-chl",
"recaptcha",
"attention required",
"verify you are human",
)

_DEFAULT_HEADERS = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:147.0) Gecko/20100101 Firefox/147.0",
"User-Agent": "Mozilla/5.0 (compatible; Grogbot/1.0; +https://www.hauntedspice.com)",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br, zstd",
"Connection": "keep-alive",
"Accept-Encoding": "gzip, deflate",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Priority": "u=0, i",
}


@@ -151,10 +146,21 @@ def _extract_markdown_links(content_markdown: str) -> List[str]:
return links


def _to_document_ids_from_markdown(*, source_document_id: str, content_markdown: str) -> set[str]:
def _to_document_ids_from_markdown(
*,
source_document_id: str,
source_canonical_url: str,
content_markdown: str,
) -> set[str]:
to_document_ids: set[str] = set()
source_domain = _normalize_domain(_canonicalize_url(source_canonical_url))
for href in _extract_markdown_links(content_markdown):
to_document_id = document_id_for_url(_canonicalize_url(href))
resolved_url = _canonicalize_url(urljoin(source_canonical_url, href))
if not resolved_url:
continue
if _normalize_domain(resolved_url) == source_domain:
continue
to_document_id = document_id_for_url(_canonicalize_url(resolved_url))
if to_document_id == source_document_id:
continue
to_document_ids.add(to_document_id)
@@ -467,7 +473,11 @@ def chunk_document(self, document_id: str) -> int:
self.connection.execute("DELETE FROM chunks WHERE document_id = ?", (document_id,))
self.connection.execute("DELETE FROM links WHERE from_document_id = ?", (document_id,))
created = self._create_chunks(document_id, document.content_markdown)
self._insert_document_links(document_id=document_id, content_markdown=document.content_markdown)
self._insert_document_links(
document_id=document_id,
source_canonical_url=document.canonical_url,
content_markdown=document.content_markdown,
)
self.connection.commit()
return len(created)

@@ -491,9 +501,10 @@ def synchronize_document_chunks(self, maximum: Optional[int] = None) -> int:
total_created += self.chunk_document(row["id"])
return total_created

def _insert_document_links(self, *, document_id: str, content_markdown: str) -> None:
def _insert_document_links(self, *, document_id: str, source_canonical_url: str, content_markdown: str) -> None:
to_document_ids = _to_document_ids_from_markdown(
source_document_id=document_id,
source_canonical_url=source_canonical_url,
content_markdown=content_markdown,
)
for to_document_id in sorted(to_document_ids):
@@ -595,76 +606,83 @@ def _next_wordpress_url(base_url: str) -> Optional[str]:
seen_feed_urls.add(normalized_url)
pages_processed += 1

start_time = time.monotonic() if paginate else None
try:
feed = feedparser.parse(current_url)
except Exception:
if pages_processed == 1:
raise
break

if pages_processed > 1:
status = getattr(feed, "status", None)
if status is not None and status >= 400:
break
if getattr(feed, "bozo", 0) and not feed.entries:
try:
feed = feedparser.parse(current_url)
except Exception:
if pages_processed == 1:
raise
break

page_feed_name = feed.feed.get("title")
if page_feed_name:
feed_name = feed_name or page_feed_name

for entry in feed.entries:
entry_url = entry.get("link") or entry.get("id")
if not entry_url:
continue
canonical_url = _canonicalize_url(entry_url)
canonical_domain = _normalize_domain(canonical_url)
source = self._get_source_by_domain(canonical_domain)
if not source:
source = self.upsert_source(
canonical_domain=canonical_domain,
name=feed_name,
rss_feed=feed_url,
)
else:
updated_name = source.name or feed_name
updated_rss_feed = source.rss_feed or feed_url
if updated_name != source.name or updated_rss_feed != source.rss_feed:
if pages_processed > 1:
status = getattr(feed, "status", None)
if status is not None and status >= 400:
break
if getattr(feed, "bozo", 0) and not feed.entries:
break

page_feed_name = feed.feed.get("title")
if page_feed_name:
feed_name = feed_name or page_feed_name

for entry in feed.entries:
entry_url = entry.get("link") or entry.get("id")
if not entry_url:
continue
canonical_url = _canonicalize_url(entry_url)
canonical_domain = _normalize_domain(canonical_url)
source = self._get_source_by_domain(canonical_domain)
if not source:
source = self.upsert_source(
canonical_domain=canonical_domain,
name=updated_name,
rss_feed=updated_rss_feed,
name=feed_name,
rss_feed=feed_url,
)
else:
updated_name = source.name or feed_name
updated_rss_feed = source.rss_feed or feed_url
if updated_name != source.name or updated_rss_feed != source.rss_feed:
source = self.upsert_source(
canonical_domain=canonical_domain,
name=updated_name,
rss_feed=updated_rss_feed,
)
content = None
if entry.get("content"):
content = entry.content[0].value
content = content or entry.get("summary") or ""
content_markdown = html_to_markdown(content)
if not content_markdown or not content_markdown.strip():
continue
title = entry.get("title")
published_at = _parse_datetime(entry.get("published") or entry.get("updated"))
documents.append(
self.upsert_document(
source_id=source.id,
canonical_url=canonical_url,
title=title,
published_at=published_at,
content_markdown=content_markdown,
)
content = None
if entry.get("content"):
content = entry.content[0].value
content = content or entry.get("summary") or ""
content_markdown = html_to_markdown(content)
if not content_markdown or not content_markdown.strip():
continue
title = entry.get("title")
published_at = _parse_datetime(entry.get("published") or entry.get("updated"))
documents.append(
self.upsert_document(
source_id=source.id,
canonical_url=canonical_url,
title=title,
published_at=published_at,
content_markdown=content_markdown,
)
)

if not paginate:
break
if pages_processed >= 100:
break
if not paginate:
break
if pages_processed >= 100:
break

next_url = _next_feed_url(feed, current_url)
if not next_url and _is_wordpress_feed(feed):
next_url = _next_wordpress_url(current_url)
if not next_url:
break
current_url = next_url
next_url = _next_feed_url(feed, current_url)
if not next_url and _is_wordpress_feed(feed):
next_url = _next_wordpress_url(current_url)
if not next_url:
break
current_url = next_url
finally:
if start_time is not None:
elapsed = time.monotonic() - start_time
if elapsed < 1.0:
time.sleep(1.0 - elapsed)

return documents

2 changes: 1 addition & 1 deletion packages/search-core/tests/conftest.py
@@ -391,7 +391,7 @@ def log_message(self, format, *args):  # noqa: A003 - match base signature
"body": """
<html>
<head><title>Attention Required</title></head>
<body>Please verify you are human (captcha challenge)</body>
<body>Please verify you are human (reCAPTCHA challenge)</body>
</html>
""",
}