diff --git a/openspec/changes/archive/2026-03-04-skip-same-canonical-domain-links/.openspec.yaml b/openspec/changes/archive/2026-03-04-skip-same-canonical-domain-links/.openspec.yaml new file mode 100644 index 0000000..5aae5cf --- /dev/null +++ b/openspec/changes/archive/2026-03-04-skip-same-canonical-domain-links/.openspec.yaml @@ -0,0 +1,2 @@ +schema: spec-driven +created: 2026-03-04 diff --git a/openspec/changes/archive/2026-03-04-skip-same-canonical-domain-links/design.md b/openspec/changes/archive/2026-03-04-skip-same-canonical-domain-links/design.md new file mode 100644 index 0000000..41581c2 --- /dev/null +++ b/openspec/changes/archive/2026-03-04-skip-same-canonical-domain-links/design.md @@ -0,0 +1,59 @@ +## Context + +The link graph is populated during `chunk_document(document_id)` by extracting markdown links and persisting unique `(from_document_id, to_document_id)` edges. Current filtering excludes self-links but still records links between different documents on the same canonical domain, which can over-amplify intra-site structures in the link-based ranking signal. + +This change is constrained to search-core link extraction and tests. No schema change is required. + +## Goals / Non-Goals + +**Goals:** +- Persist outbound edges only when source and target canonical domains differ. +- Treat relative outbound links as same-domain by resolving against the source document canonical URL before filtering. +- Preserve existing behavior for dedupe, unknown target handling (`document_id_for_url`), and lifecycle refresh semantics. +- Update tests so link graph and ranking assertions reflect cross-domain-only edge persistence. + +**Non-Goals:** +- Redefining canonical domain normalization (`netloc` remains the canonical domain key). +- Changing DB schema or link table shape. +- Introducing weighting by domain category or any other ranking formula changes beyond input edge set. + +## Decisions + +1. 
**Apply domain filtering in URL-to-target derivation before insertion** + - Decision: Extend the outbound link derivation helper to accept the source document canonical URL and skip targets whose normalized domain equals the source domain. + - Rationale: Keeps filtering in one place with self-link exclusion and dedupe, minimizing lifecycle-path drift. + - Alternative considered: Filter at SQL insert time by joining source document data per edge; rejected as more complex and less explicit for relative-link handling. + +2. **Resolve relative links against source canonical URL for domain checks** + - Decision: Use `urljoin(source_canonical_url, href)` prior to canonicalization/domain extraction when deriving targets. + - Rationale: Ensures internal relative links are consistently treated as same-domain and skipped. + - Alternative considered: Ignore relative links entirely; rejected because it changes existing link extraction semantics and may drop legitimate cross-domain absolute forms in mixed content. + +3. **Keep unknown-target behavior unchanged after filtering** + - Decision: After passing domain filter, derive `to_document_id` via `document_id_for_url(_canonicalize_url(resolved_url))` even if no target document exists. + - Rationale: Preserves current contract and avoids coupling to ingestion state. + - Alternative considered: Require target documents to exist before storing edges; rejected as scope creep and behavior regression. + +4. **Refactor tests to explicit multi-domain fixtures** + - Decision: Update link graph and link-score tests that currently rely on same-domain edges to use at least two domains. + - Rationale: Makes expectations align with the new rule and prevents accidental reliance on intra-domain edge persistence. + - Alternative considered: Keep same fixtures and loosen assertions; rejected because it obscures intended behavior. 
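Decisions 1 and 2 combine into one derivation pass. The sketch below is a minimal standalone version, assuming simplified stand-ins for the service's real helpers: `_normalize_domain` is reduced to `urlparse(...).netloc` (the canonical-domain key per Non-Goals), `document_id_for_url` is stubbed as a hash, and `_canonicalize_url` is omitted — only the names come from this design, the bodies are assumptions:

```python
# Sketch of the filtering in Decisions 1-2. The real helpers live in
# grogbot_search.service; the simplified bodies below are assumptions.
import hashlib
from urllib.parse import urljoin, urlparse


def _normalize_domain(url: str) -> str:
    # netloc is the canonical-domain key (see Non-Goals).
    return urlparse(url).netloc


def document_id_for_url(url: str) -> str:
    # Stand-in: deterministic id derived from the canonical URL.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]


def to_document_ids(
    source_document_id: str,
    source_canonical_url: str,
    hrefs: list[str],
) -> set[str]:
    source_domain = _normalize_domain(source_canonical_url)
    targets: set[str] = set()
    for href in hrefs:
        # Decision 2: resolve relative hrefs against the source URL first.
        resolved = urljoin(source_canonical_url, href)
        # Decision 1: skip targets whose domain equals the source domain.
        if _normalize_domain(resolved) == source_domain:
            continue
        target_id = document_id_for_url(resolved)
        if target_id == source_document_id:  # preserve self-link exclusion
            continue
        targets.add(target_id)  # the set preserves the dedupe contract
    return targets
```

Relative hrefs such as `/about` resolve to the source origin before the comparison, so they fall out as same-domain exactly as Decision 2 requires, while duplicate cross-domain hrefs still collapse to one target id.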
+ +## Risks / Trade-offs + +- **[Risk] `netloc`-based equality treats `www.example.com` and `example.com` as different domains** → **Mitigation:** Preserve current canonical-domain contract for now; document behavior in spec scenarios and revisit in a separate normalization change if needed. +- **[Risk] Relative-link resolution may expose malformed markdown href values** → **Mitigation:** Continue canonicalization and skip empty/unusable URLs; maintain deterministic helper-level filtering. +- **[Risk] Fewer stored edges may reduce link-score differentiation on single-site corpora** → **Mitigation:** Intentional trade-off to avoid intra-site self-reinforcement; ranking still uses FTS/vector signals. + +## Migration Plan + +1. Update link-derivation helper signature and call sites to include source canonical URL. +2. Implement same-domain skip with relative-link resolution. +3. Update unit tests for link persistence/lifecycle and ranking expectations. +4. Run test suite and verify no API/CLI contract changes. + +Rollback: revert helper/filter changes and corresponding test updates; schema remains unchanged so rollback is code-only. + +## Open Questions + +- None for this iteration. diff --git a/openspec/changes/archive/2026-03-04-skip-same-canonical-domain-links/proposal.md b/openspec/changes/archive/2026-03-04-skip-same-canonical-domain-links/proposal.md new file mode 100644 index 0000000..f1f28ec --- /dev/null +++ b/openspec/changes/archive/2026-03-04-skip-same-canonical-domain-links/proposal.md @@ -0,0 +1,26 @@ +## Why + +Link-based ranking currently counts links between documents on the same canonical domain. This lets intra-site link structures inflate authority and can drown out cross-site signals, reducing relevance quality. + +## What Changes + +- Update outbound link extraction to skip link creation when source and target resolve to the same canonical domain. 
+- Keep existing self-link exclusion and duplicate `(from_document_id, to_document_id)` collapse behavior. +- Resolve relative links against the source document canonical URL before domain comparison so internal relative links are also skipped. +- Apply the same-domain skip rule even when the target document has not been ingested yet (compare using canonicalized target URL). +- Update link-graph and ranking tests to use multi-domain fixtures and validate same-domain exclusion. + +## Capabilities + +### New Capabilities +- `document-link-domain-filtering`: Filters outbound link graph edges so only cross-domain links are persisted for ranking. + +### Modified Capabilities +- *(none)* + +## Impact + +- Affected code: `packages/search-core/src/grogbot_search/service.py` link extraction/insertion helpers. +- Affected tests: `packages/search-core/tests/test_service.py` link-graph lifecycle tests and link-score ranking fixtures/assertions. +- API/CLI shape: no contract changes expected; link persistence and derived `link_score` values change. +- Dependencies/systems: no new dependencies; SQLite schema remains unchanged. diff --git a/openspec/changes/archive/2026-03-04-skip-same-canonical-domain-links/specs/document-link-domain-filtering/spec.md b/openspec/changes/archive/2026-03-04-skip-same-canonical-domain-links/specs/document-link-domain-filtering/spec.md new file mode 100644 index 0000000..3914461 --- /dev/null +++ b/openspec/changes/archive/2026-03-04-skip-same-canonical-domain-links/specs/document-link-domain-filtering/spec.md @@ -0,0 +1,30 @@ +## ADDED Requirements + +### Requirement: Outbound link graph SHALL exclude same-canonical-domain targets +When generating outbound document links, the system SHALL compare canonical domains for source and target URLs and MUST skip persistence when both domains are equal. 
+ +#### Scenario: Absolute same-domain target is skipped +- **WHEN** a chunked document at `https://example.com/a` contains an outbound link to `https://example.com/b` +- **THEN** no `(from_document_id, to_document_id)` edge is stored for that link + +#### Scenario: Cross-domain target is persisted +- **WHEN** a chunked document at `https://example.com/a` contains an outbound link to `https://other.example/b` +- **THEN** one directed edge is stored for the source document and resolved target document id + +### Requirement: Relative outbound links MUST be resolved before domain filtering +For outbound links extracted from markdown, the system MUST resolve relative href values against the source document canonical URL before canonicalization and domain comparison. + +#### Scenario: Relative internal path is treated as same-domain +- **WHEN** a chunked document at `https://example.com/posts/1` contains `[x](/about)` +- **THEN** the target resolves to `https://example.com/about` for domain comparison and no edge is stored + +#### Scenario: Relative traversal path is treated as same-domain +- **WHEN** a chunked document at `https://example.com/posts/1` contains `[x](../archive)` +- **THEN** the resolved target domain matches the source domain and no edge is stored + +### Requirement: Cross-domain unknown targets SHALL still derive target ids +After applying same-domain filtering, the system MUST derive `to_document_id` from the canonicalized target URL even when no target document is currently ingested. 
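The relative-resolution rule required here maps directly onto `urllib.parse.urljoin`; a quick standalone check of the two relative scenario shapes (`/about`, `../archive`) against the scenario source URL:

```python
from urllib.parse import urljoin, urlparse

source = "https://example.com/posts/1"

# Root-relative href resolves against the source origin.
assert urljoin(source, "/about") == "https://example.com/about"
# Traversal href resolves against the source path's parent directory.
assert urljoin(source, "../archive") == "https://example.com/archive"
# Both resolved targets share the source netloc, so no edge is stored.
assert urlparse(urljoin(source, "/about")).netloc == urlparse(source).netloc
assert urlparse(urljoin(source, "../archive")).netloc == urlparse(source).netloc
```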
+ +#### Scenario: Unknown cross-domain URL stores derived target id +- **WHEN** a chunked document links to `https://external.site/not-ingested` and no document exists for that URL +- **THEN** the system stores the edge with `to_document_id = document_id_for_url(_canonicalize_url("https://external.site/not-ingested"))` diff --git a/openspec/changes/archive/2026-03-04-skip-same-canonical-domain-links/tasks.md b/openspec/changes/archive/2026-03-04-skip-same-canonical-domain-links/tasks.md new file mode 100644 index 0000000..fea9f5e --- /dev/null +++ b/openspec/changes/archive/2026-03-04-skip-same-canonical-domain-links/tasks.md @@ -0,0 +1,23 @@ +## 1. Link derivation updates + +- [x] 1.1 Update outbound link target derivation helper(s) to accept the source document canonical URL. +- [x] 1.2 Resolve markdown href values with `urljoin(source_canonical_url, href)` before canonicalization/domain comparison. +- [x] 1.3 Skip target IDs whose normalized domain matches the source canonical domain while preserving self-link and dedupe behavior. +- [x] 1.4 Keep unknown cross-domain target handling unchanged (`to_document_id = document_id_for_url(_canonicalize_url(resolved_url))`). + +## 2. Service integration + +- [x] 2.1 Update `chunk_document` link insertion flow to provide source canonical URL to link-derivation logic. +- [x] 2.2 Verify no schema or API contract changes are introduced by the filtering update. + +## 3. Test updates + +- [x] 3.1 Update link-graph persistence tests to assert same-domain absolute links are skipped and cross-domain links persist. +- [x] 3.2 Add/adjust tests for relative-link resolution (`/path`, `../path`) being treated as same-domain and skipped. +- [x] 3.3 Update unknown-target tests to validate cross-domain unknown URLs still store derived `to_document_id`. +- [x] 3.4 Refactor link-score ranking fixtures/assertions to multi-domain documents so expected inbound-link ordering remains deterministic. + +## 4. 
Verification + +- [x] 4.1 Run `packages/search-core` tests and confirm all link graph/ranking assertions pass with same-domain filtering enabled. +- [x] 4.2 Manually sanity-check that link rows now represent cross-domain edges only for updated fixtures. diff --git a/packages/search-core/src/grogbot_search/service.py b/packages/search-core/src/grogbot_search/service.py index 6e0217c..db25bb3 100644 --- a/packages/search-core/src/grogbot_search/service.py +++ b/packages/search-core/src/grogbot_search/service.py @@ -34,23 +34,18 @@ class SearchScores: _BACKOFF_STATUS_CODES = {401, 403, 429, 503} _CAPTCHA_MARKERS = ( - "captcha", "cf-chl", + "recaptcha", "attention required", "verify you are human", ) _DEFAULT_HEADERS = { - "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:147.0) Gecko/20100101 Firefox/147.0", + "User-Agent": "Mozilla/5.0 (compatible; Grogbot/1.0; +https://www.hauntedspice.com)", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en-US,en;q=0.9", - "Accept-Encoding": "gzip, deflate, br, zstd", - "Connection": "keep-alive", + "Accept-Encoding": "gzip, deflate", "Upgrade-Insecure-Requests": "1", - "Sec-Fetch-Dest": "document", - "Sec-Fetch-Mode": "navigate", - "Sec-Fetch-Site": "none", - "Priority": "u=0, i", } @@ -151,10 +146,21 @@ def _extract_markdown_links(content_markdown: str) -> List[str]: return links -def _to_document_ids_from_markdown(*, source_document_id: str, content_markdown: str) -> set[str]: +def _to_document_ids_from_markdown( + *, + source_document_id: str, + source_canonical_url: str, + content_markdown: str, +) -> set[str]: to_document_ids: set[str] = set() + source_domain = _normalize_domain(_canonicalize_url(source_canonical_url)) for href in _extract_markdown_links(content_markdown): - to_document_id = document_id_for_url(_canonicalize_url(href)) + resolved_url = _canonicalize_url(urljoin(source_canonical_url, href)) + if not resolved_url: + continue + if 
_normalize_domain(resolved_url) == source_domain: + continue + to_document_id = document_id_for_url(_canonicalize_url(resolved_url)) if to_document_id == source_document_id: continue to_document_ids.add(to_document_id) @@ -467,7 +473,11 @@ def chunk_document(self, document_id: str) -> int: self.connection.execute("DELETE FROM chunks WHERE document_id = ?", (document_id,)) self.connection.execute("DELETE FROM links WHERE from_document_id = ?", (document_id,)) created = self._create_chunks(document_id, document.content_markdown) - self._insert_document_links(document_id=document_id, content_markdown=document.content_markdown) + self._insert_document_links( + document_id=document_id, + source_canonical_url=document.canonical_url, + content_markdown=document.content_markdown, + ) self.connection.commit() return len(created) @@ -491,9 +501,10 @@ def synchronize_document_chunks(self, maximum: Optional[int] = None) -> int: total_created += self.chunk_document(row["id"]) return total_created - def _insert_document_links(self, *, document_id: str, content_markdown: str) -> None: + def _insert_document_links(self, *, document_id: str, source_canonical_url: str, content_markdown: str) -> None: to_document_ids = _to_document_ids_from_markdown( source_document_id=document_id, + source_canonical_url=source_canonical_url, content_markdown=content_markdown, ) for to_document_id in sorted(to_document_ids): @@ -595,76 +606,83 @@ def _next_wordpress_url(base_url: str) -> Optional[str]: seen_feed_urls.add(normalized_url) pages_processed += 1 + start_time = time.monotonic() if paginate else None try: - feed = feedparser.parse(current_url) - except Exception: - if pages_processed == 1: - raise - break - - if pages_processed > 1: - status = getattr(feed, "status", None) - if status is not None and status >= 400: - break - if getattr(feed, "bozo", 0) and not feed.entries: + try: + feed = feedparser.parse(current_url) + except Exception: + if pages_processed == 1: + raise break - 
page_feed_name = feed.feed.get("title") - if page_feed_name: - feed_name = feed_name or page_feed_name - - for entry in feed.entries: - entry_url = entry.get("link") or entry.get("id") - if not entry_url: - continue - canonical_url = _canonicalize_url(entry_url) - canonical_domain = _normalize_domain(canonical_url) - source = self._get_source_by_domain(canonical_domain) - if not source: - source = self.upsert_source( - canonical_domain=canonical_domain, - name=feed_name, - rss_feed=feed_url, - ) - else: - updated_name = source.name or feed_name - updated_rss_feed = source.rss_feed or feed_url - if updated_name != source.name or updated_rss_feed != source.rss_feed: + if pages_processed > 1: + status = getattr(feed, "status", None) + if status is not None and status >= 400: + break + if getattr(feed, "bozo", 0) and not feed.entries: + break + + page_feed_name = feed.feed.get("title") + if page_feed_name: + feed_name = feed_name or page_feed_name + + for entry in feed.entries: + entry_url = entry.get("link") or entry.get("id") + if not entry_url: + continue + canonical_url = _canonicalize_url(entry_url) + canonical_domain = _normalize_domain(canonical_url) + source = self._get_source_by_domain(canonical_domain) + if not source: source = self.upsert_source( canonical_domain=canonical_domain, - name=updated_name, - rss_feed=updated_rss_feed, + name=feed_name, + rss_feed=feed_url, + ) + else: + updated_name = source.name or feed_name + updated_rss_feed = source.rss_feed or feed_url + if updated_name != source.name or updated_rss_feed != source.rss_feed: + source = self.upsert_source( + canonical_domain=canonical_domain, + name=updated_name, + rss_feed=updated_rss_feed, + ) + content = None + if entry.get("content"): + content = entry.content[0].value + content = content or entry.get("summary") or "" + content_markdown = html_to_markdown(content) + if not content_markdown or not content_markdown.strip(): + continue + title = entry.get("title") + published_at = 
_parse_datetime(entry.get("published") or entry.get("updated")) + documents.append( + self.upsert_document( + source_id=source.id, + canonical_url=canonical_url, + title=title, + published_at=published_at, + content_markdown=content_markdown, ) - content = None - if entry.get("content"): - content = entry.content[0].value - content = content or entry.get("summary") or "" - content_markdown = html_to_markdown(content) - if not content_markdown or not content_markdown.strip(): - continue - title = entry.get("title") - published_at = _parse_datetime(entry.get("published") or entry.get("updated")) - documents.append( - self.upsert_document( - source_id=source.id, - canonical_url=canonical_url, - title=title, - published_at=published_at, - content_markdown=content_markdown, ) - ) - if not paginate: - break - if pages_processed >= 100: - break + if not paginate: + break + if pages_processed >= 100: + break - next_url = _next_feed_url(feed, current_url) - if not next_url and _is_wordpress_feed(feed): - next_url = _next_wordpress_url(current_url) - if not next_url: - break - current_url = next_url + next_url = _next_feed_url(feed, current_url) + if not next_url and _is_wordpress_feed(feed): + next_url = _next_wordpress_url(current_url) + if not next_url: + break + current_url = next_url + finally: + if start_time is not None: + elapsed = time.monotonic() - start_time + if elapsed < 1.0: + time.sleep(1.0 - elapsed) return documents diff --git a/packages/search-core/tests/conftest.py b/packages/search-core/tests/conftest.py index 6aff749..415ee87 100644 --- a/packages/search-core/tests/conftest.py +++ b/packages/search-core/tests/conftest.py @@ -391,7 +391,7 @@ def log_message(self, format, *args): # noqa: A003 - match base signature "body": """ Attention Required - Please verify you are human (captcha challenge) + Please verify you are human (reCAPTCHA challenge) """, } diff --git a/packages/search-core/tests/test_service.py b/packages/search-core/tests/test_service.py 
index 5761428..71fb9bd 100644 --- a/packages/search-core/tests/test_service.py +++ b/packages/search-core/tests/test_service.py @@ -238,7 +238,7 @@ def test_synchronize_document_chunks_non_positive_maximum_is_noop(service: Searc # Link graph behavior -def test_chunk_document_stores_unique_outbound_links_per_target(service: SearchService): +def test_chunk_document_skips_same_domain_links_and_dedupes_cross_domain_targets(service: SearchService): source = service.upsert_source("example.com", name="Example") document = service.upsert_document( source_id=source.id, @@ -246,9 +246,10 @@ def test_chunk_document_stores_unique_outbound_links_per_target(service: SearchS title="Source", published_at=None, content_markdown=( - "[one](https://example.com/target) " - "[two](https://example.com/target) " - "[three](https://example.com/other-target)" + "[same](https://example.com/target) " + "[cross-one](https://other.example/target) " + "[cross-one-duplicate](https://other.example/target) " + "[cross-two](https://third.example/other-target)" ), ) @@ -267,15 +268,41 @@ def test_chunk_document_stores_unique_outbound_links_per_target(service: SearchS assert len(links) == 2 assert [row["to_document_id"] for row in links] == sorted( [ - service_module.document_id_for_url(service_module._canonicalize_url("https://example.com/target")), - service_module.document_id_for_url(service_module._canonicalize_url("https://example.com/other-target")), + service_module.document_id_for_url(service_module._canonicalize_url("https://other.example/target")), + service_module.document_id_for_url(service_module._canonicalize_url("https://third.example/other-target")), ] ) +def test_chunk_document_resolves_relative_links_before_domain_filtering(service: SearchService): + source = service.upsert_source("example.com", name="Example") + document = service.upsert_document( + source_id=source.id, + canonical_url="https://example.com/posts/entry", + title="Entry", + published_at=None, + content_markdown=( + 
"[root](/about) " + "[parent](../archive) " + "[cross](https://external.example/outbound)" + ), + ) + + service.chunk_document(document.id) + + links = service.connection.execute( + "SELECT to_document_id FROM links WHERE from_document_id = ? ORDER BY to_document_id", + (document.id,), + ).fetchall() + + assert [row["to_document_id"] for row in links] == [ + service_module.document_id_for_url(service_module._canonicalize_url("https://external.example/outbound")) + ] + + def test_chunk_document_stores_unknown_targets_by_canonicalized_url(service: SearchService): source = service.upsert_source("example.com", name="Example") - target_url = "https://example.com/not-ingested" + target_url = "https://external.site/not-ingested" document = service.upsert_document( source_id=source.id, canonical_url="https://example.com/source", @@ -305,7 +332,7 @@ def test_outbound_links_ignore_self_and_follow_content_delete_and_refresh_lifecy canonical_url=canonical_url, title="Lifecycle", published_at=None, - content_markdown=f"[self]({canonical_url}) [other](https://example.com/other)", + content_markdown=f"[self]({canonical_url}) [other](https://other.example/other)", ) service.chunk_document(document.id) @@ -315,7 +342,7 @@ def test_outbound_links_ignore_self_and_follow_content_delete_and_refresh_lifecy (document.id,), ).fetchall() assert [row["to_document_id"] for row in links] == [ - service_module.document_id_for_url(service_module._canonicalize_url("https://example.com/other")) + service_module.document_id_for_url(service_module._canonicalize_url("https://other.example/other")) ] updated = service.upsert_document( @@ -334,7 +361,7 @@ def test_outbound_links_ignore_self_and_follow_content_delete_and_refresh_lifecy service.connection.execute( "UPDATE documents SET content_markdown = ? 
WHERE id = ?", - ("[refreshed](https://example.com/refreshed)", updated.id), + ("[refreshed](https://external.example/refreshed)", updated.id), ) service.connection.commit() service.chunk_document(updated.id) @@ -344,7 +371,7 @@ def test_outbound_links_ignore_self_and_follow_content_delete_and_refresh_lifecy (updated.id,), ).fetchall() assert [row["to_document_id"] for row in refreshed_links] == [ - service_module.document_id_for_url(service_module._canonicalize_url("https://example.com/refreshed")) + service_module.document_id_for_url(service_module._canonicalize_url("https://external.example/refreshed")) ] assert service.delete_document(updated.id) is True @@ -456,32 +483,35 @@ def test_search_returns_empty_for_blank_query_or_non_positive_limit(service: Sea def test_search_includes_link_score_with_deterministic_ties_and_zero_fill(service: SearchService): - source = service.upsert_source("example.com", name="Example") + source_a = service.upsert_source("alpha.example", name="Alpha") + source_b = service.upsert_source("beta.example", name="Beta") + source_c = service.upsert_source("gamma.example", name="Gamma") + source_d = service.upsert_source("delta.example", name="Delta") doc_a = service.upsert_document( - source_id=source.id, - canonical_url="https://example.com/a", + source_id=source_a.id, + canonical_url="https://alpha.example/a", title="A", published_at=None, content_markdown="alpha", ) doc_b = service.upsert_document( - source_id=source.id, - canonical_url="https://example.com/b", + source_id=source_b.id, + canonical_url="https://beta.example/b", title="B", published_at=None, content_markdown=f"alpha [a]({doc_a.canonical_url})", ) doc_c = service.upsert_document( - source_id=source.id, - canonical_url="https://example.com/c", + source_id=source_c.id, + canonical_url="https://gamma.example/c", title="C", published_at=None, content_markdown=f"alpha [a]({doc_a.canonical_url})", ) doc_d = service.upsert_document( - source_id=source.id, - 
canonical_url="https://example.com/d", + source_id=source_d.id, + canonical_url="https://delta.example/d", title="D", published_at=None, content_markdown=f"alpha [b]({doc_b.canonical_url}) [c]({doc_c.canonical_url})", @@ -618,7 +648,7 @@ def test_create_document_from_url_raises_on_backoff_signals(service: SearchServi ("backoff-429", "status_code=429"), ("backoff-503", "status_code=503"), ("backoff-retry-after", "retry-after-header"), - ("backoff-captcha", "body-marker=captcha"), + ("backoff-captcha", "body-marker=recaptcha"), ], ) def test_create_document_from_url_backoff_error_includes_reason( @@ -684,6 +714,23 @@ def test_create_documents_from_feed_pagination_enabled(service: SearchService, h assert f"{http_server}/feed-paginated-entry-2" in urls +def test_create_documents_from_feed_pagination_applies_minimum_one_second_interval( + service: SearchService, + http_server, + monkeypatch, +): + monotonic_values = iter([0.0, 0.2, 1.0, 1.8]) + sleep_calls: list[float] = [] + + monkeypatch.setattr(service_module.time, "monotonic", lambda: next(monotonic_values)) + monkeypatch.setattr(service_module.time, "sleep", lambda seconds: sleep_calls.append(seconds)) + + documents = service.create_documents_from_feed(f"{http_server}/feed-paginated", paginate=True) + + assert len(documents) == 2 + assert sleep_calls == pytest.approx([0.8, 0.2]) + + def test_create_documents_from_feed_wordpress_pagination(service: SearchService, http_server): documents = service.create_documents_from_feed(f"{http_server}/wp-feed", paginate=True)
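The paginated-fetch pacing introduced in `create_documents_from_feed` (the `time.monotonic()` capture plus the `finally` sleep) follows a standard minimum-interval pattern. A standalone sketch of the same logic, with `fetch_page` and `paced_pages` as illustrative names rather than the service's API:

```python
import time


def paced_pages(fetch_page, urls, minimum_interval: float = 1.0):
    """Fetch each URL, sleeping so consecutive fetches start at least
    `minimum_interval` seconds apart. Mirrors the try/finally pacing in
    the diff: the sleep runs even when fetching raises, and also after
    the final page. `fetch_page` is an illustrative callable."""
    results = []
    for url in urls:
        start = time.monotonic()
        try:
            results.append(fetch_page(url))
        finally:
            elapsed = time.monotonic() - start
            if elapsed < minimum_interval:
                time.sleep(minimum_interval - elapsed)
    return results
```

With the clock sequence used in the new test (0.0, 0.2, 1.0, 1.8) this sleeps 0.8 s and then 0.2 s, the same values asserted via `pytest.approx([0.8, 0.2])`.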