Merged
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-04
@@ -0,0 +1,59 @@
## Context

The link graph is populated during `chunk_document(document_id)` by extracting markdown links and persisting unique `(from_document_id, to_document_id)` edges. Current filtering excludes self-links but still records links between different documents on the same canonical domain, which can over-amplify intra-site structures in the link-based ranking signal.

This change is constrained to search-core link extraction and tests. No schema change is required.

## Goals / Non-Goals

**Goals:**
- Persist outbound edges only when source and target canonical domains differ.
- Treat relative outbound links as same-domain by resolving against the source document canonical URL before filtering.
- Preserve existing behavior for dedupe, unknown target handling (`document_id_for_url`), and lifecycle refresh semantics.
- Update tests so link graph and ranking assertions reflect cross-domain-only edge persistence.

**Non-Goals:**
- Redefining canonical domain normalization (`netloc` remains the canonical domain key).
- Changing DB schema or link table shape.
- Introducing weighting by domain category or any other ranking-formula change beyond narrowing the input edge set.

## Decisions

1. **Apply domain filtering in URL-to-target derivation before insertion**
- Decision: Extend the outbound link derivation helper to accept the source document canonical URL and skip targets whose normalized domain equals the source domain.
- Rationale: Keeps filtering in one place with self-link exclusion and dedupe, minimizing lifecycle-path drift.
- Alternative considered: Filter at SQL insert time by joining source document data per edge; rejected as more complex and less explicit for relative-link handling.

2. **Resolve relative links against source canonical URL for domain checks**
- Decision: Use `urljoin(source_canonical_url, href)` prior to canonicalization/domain extraction when deriving targets.
- Rationale: Ensures internal relative links are consistently treated as same-domain and skipped.
- Alternative considered: Ignore relative links entirely; rejected because it changes existing link extraction semantics and may drop legitimate cross-domain absolute forms in mixed content.

3. **Keep unknown-target behavior unchanged after filtering**
- Decision: After passing domain filter, derive `to_document_id` via `document_id_for_url(_canonicalize_url(resolved_url))` even if no target document exists.
- Rationale: Preserves current contract and avoids coupling to ingestion state.
- Alternative considered: Require target documents to exist before storing edges; rejected as scope creep and behavior regression.

4. **Refactor tests to explicit multi-domain fixtures**
- Decision: Update link graph and link-score tests that currently rely on same-domain edges to use at least two domains.
- Rationale: Makes expectations align with the new rule and prevents accidental reliance on intra-domain edge persistence.
- Alternative considered: Keep same fixtures and loosen assertions; rejected because it obscures intended behavior.

## Risks / Trade-offs

- **[Risk] `netloc`-based equality treats `www.example.com` and `example.com` as different domains** → **Mitigation:** Preserve current canonical-domain contract for now; document behavior in spec scenarios and revisit in a separate normalization change if needed.
- **[Risk] Relative-link resolution may expose malformed markdown href values** → **Mitigation:** Continue canonicalization and skip empty/unusable URLs; maintain deterministic helper-level filtering.
- **[Risk] Fewer stored edges may reduce link-score differentiation on single-site corpora** → **Mitigation:** Intentional trade-off to avoid intra-site self-reinforcement; ranking still uses FTS/vector signals.
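The first risk can be observed directly with `urlparse`: a bare `netloc` comparison keeps the `www.`-prefixed host and the apex host distinct, so under the current canonical-domain contract a `www` → apex link would still be persisted.

```python
from urllib.parse import urlparse

# netloc equality, as used for the canonical-domain key, treats the
# www-prefixed host and the apex host as different domains.
www_host = urlparse("https://www.example.com/a").netloc   # "www.example.com"
apex_host = urlparse("https://example.com/b").netloc      # "example.com"
print(www_host == apex_host)  # False: the edge would be stored
```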

## Migration Plan

1. Update link-derivation helper signature and call sites to include source canonical URL.
2. Implement same-domain skip with relative-link resolution.
3. Update unit tests for link persistence/lifecycle and ranking expectations.
4. Run test suite and verify no API/CLI contract changes.

Rollback: revert helper/filter changes and corresponding test updates; schema remains unchanged so rollback is code-only.

## Open Questions

- None for this iteration.
@@ -0,0 +1,26 @@
## Why

Link-based ranking currently counts links between documents on the same canonical domain. This lets intra-site link structures inflate authority and can drown out cross-site signals, reducing relevance quality.

## What Changes

- Update outbound link extraction to skip link creation when source and target resolve to the same canonical domain.
- Keep existing self-link exclusion and duplicate `(from_document_id, to_document_id)` collapse behavior.
- Resolve relative links against the source document canonical URL before domain comparison so internal relative links are also skipped.
- Apply the same-domain skip rule even when the target document has not been ingested yet (compare using canonicalized target URL).
- Update link-graph and ranking tests to use multi-domain fixtures and validate same-domain exclusion.

## Capabilities

### New Capabilities
- `document-link-domain-filtering`: Filters outbound link graph edges so only cross-domain links are persisted for ranking.

### Modified Capabilities
- *(none)*

## Impact

- Affected code: `packages/search-core/src/grogbot_search/service.py` link extraction/insertion helpers.
- Affected tests: `packages/search-core/tests/test_service.py` link-graph lifecycle tests and link-score ranking fixtures/assertions.
- API/CLI shape: no contract changes expected; link persistence and derived `link_score` values change.
- Dependencies/systems: no new dependencies; SQLite schema remains unchanged.
@@ -0,0 +1,30 @@
## ADDED Requirements

### Requirement: Outbound link graph SHALL exclude same-canonical-domain targets
When generating outbound document links, the system SHALL compare canonical domains for source and target URLs and MUST skip persistence when both domains are equal.

#### Scenario: Absolute same-domain target is skipped
- **WHEN** a chunked document at `https://example.com/a` contains an outbound link to `https://example.com/b`
- **THEN** no `(from_document_id, to_document_id)` edge is stored for that link

#### Scenario: Cross-domain target is persisted
- **WHEN** a chunked document at `https://example.com/a` contains an outbound link to `https://other.example/b`
- **THEN** one directed edge is stored for the source document and resolved target document id

### Requirement: Relative outbound links MUST be resolved before domain filtering
For outbound links extracted from markdown, the system MUST resolve relative href values against the source document canonical URL before canonicalization and domain comparison.

#### Scenario: Relative internal path is treated as same-domain
- **WHEN** a chunked document at `https://example.com/posts/1` contains `[x](/about)`
- **THEN** the target resolves to `https://example.com/about` for domain comparison and no edge is stored

#### Scenario: Relative traversal path is treated as same-domain
- **WHEN** a chunked document at `https://example.com/posts/1` contains `[x](../archive)`
- **THEN** the resolved target domain matches the source domain and no edge is stored
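Both relative-link scenarios follow from standard `urljoin` resolution and can be checked in isolation:

```python
from urllib.parse import urljoin, urlparse

base = "https://example.com/posts/1"
# Root-relative and traversal hrefs both resolve onto the source host.
about = urljoin(base, "/about")        # "https://example.com/about"
archive = urljoin(base, "../archive")  # "https://example.com/archive"
print(urlparse(about).netloc == urlparse(base).netloc)    # True -> skipped
print(urlparse(archive).netloc == urlparse(base).netloc)  # True -> skipped
```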

### Requirement: Cross-domain unknown targets SHALL still derive target ids
After applying same-domain filtering, the system MUST derive `to_document_id` from the canonicalized target URL even when no target document is currently ingested.

#### Scenario: Unknown cross-domain URL stores derived target id
- **WHEN** a chunked document links to `https://external.site/not-ingested` and no document exists for that URL
- **THEN** the system stores the edge with `to_document_id = document_id_for_url(_canonicalize_url("https://external.site/not-ingested"))`
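The implementation of `document_id_for_url` is not shown in this spec; the only property the requirement relies on is that the id is a pure function of the canonical URL, so an edge can be stored before the target is ingested. A hypothetical stand-in illustrating that contract (the hashing scheme is invented for illustration, not the project's actual id derivation):

```python
import hashlib

def hypothetical_document_id_for_url(canonical_url: str) -> str:
    # Stand-in for the real document_id_for_url: any deterministic
    # URL-derived id satisfies the requirement, because the edge row can
    # be written before (or without) the target document being ingested.
    return hashlib.sha256(canonical_url.encode("utf-8")).hexdigest()[:16]

edge_target = hypothetical_document_id_for_url("https://external.site/not-ingested")
# Re-deriving the id later (e.g. at ingestion time) yields the same value,
# so the pre-stored edge lines up with the eventual document row.
assert edge_target == hypothetical_document_id_for_url("https://external.site/not-ingested")
```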
@@ -0,0 +1,23 @@
## 1. Link derivation updates

- [x] 1.1 Update outbound link target derivation helper(s) to accept the source document canonical URL.
- [x] 1.2 Resolve markdown href values with `urljoin(source_canonical_url, href)` before canonicalization/domain comparison.
- [x] 1.3 Skip target IDs whose normalized domain matches the source canonical domain while preserving self-link and dedupe behavior.
- [x] 1.4 Keep unknown cross-domain target handling unchanged (`to_document_id = document_id_for_url(_canonicalize_url(resolved_url))`).

## 2. Service integration

- [x] 2.1 Update `chunk_document` link insertion flow to provide source canonical URL to link-derivation logic.
- [x] 2.2 Verify no schema or API contract changes are introduced by the filtering update.

## 3. Test updates

- [x] 3.1 Update link-graph persistence tests to assert same-domain absolute links are skipped and cross-domain links persist.
- [x] 3.2 Add/adjust tests for relative-link resolution (`/path`, `../path`) being treated as same-domain and skipped.
- [x] 3.3 Update unknown-target tests to validate cross-domain unknown URLs still store derived `to_document_id`.
- [x] 3.4 Refactor link-score ranking fixtures/assertions to multi-domain documents so expected inbound-link ordering remains deterministic.

## 4. Verification

- [x] 4.1 Run `packages/search-core` tests and confirm all link graph/ranking assertions pass with same-domain filtering enabled.
- [x] 4.2 Manually sanity-check that link rows now represent cross-domain edges only for updated fixtures.
164 changes: 91 additions & 73 deletions packages/search-core/src/grogbot_search/service.py
@@ -34,23 +34,18 @@ class SearchScores:

_BACKOFF_STATUS_CODES = {401, 403, 429, 503}
_CAPTCHA_MARKERS = (
"captcha",
"cf-chl",
"recaptcha",
"attention required",
"verify you are human",
)

_DEFAULT_HEADERS = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:147.0) Gecko/20100101 Firefox/147.0",
"User-Agent": "Mozilla/5.0 (compatible; Grogbot/1.0; +https://www.hauntedspice.com)",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br, zstd",
"Connection": "keep-alive",
"Accept-Encoding": "gzip, deflate",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Priority": "u=0, i",
}


@@ -151,10 +146,21 @@ def _extract_markdown_links(content_markdown: str) -> List[str]:
return links


def _to_document_ids_from_markdown(*, source_document_id: str, content_markdown: str) -> set[str]:
def _to_document_ids_from_markdown(
*,
source_document_id: str,
source_canonical_url: str,
content_markdown: str,
) -> set[str]:
to_document_ids: set[str] = set()
source_domain = _normalize_domain(_canonicalize_url(source_canonical_url))
for href in _extract_markdown_links(content_markdown):
to_document_id = document_id_for_url(_canonicalize_url(href))
resolved_url = _canonicalize_url(urljoin(source_canonical_url, href))
if not resolved_url:
continue
if _normalize_domain(resolved_url) == source_domain:
continue
to_document_id = document_id_for_url(_canonicalize_url(resolved_url))
if to_document_id == source_document_id:
continue
to_document_ids.add(to_document_id)
@@ -467,7 +473,11 @@ def chunk_document(self, document_id: str) -> int:
self.connection.execute("DELETE FROM chunks WHERE document_id = ?", (document_id,))
self.connection.execute("DELETE FROM links WHERE from_document_id = ?", (document_id,))
created = self._create_chunks(document_id, document.content_markdown)
self._insert_document_links(document_id=document_id, content_markdown=document.content_markdown)
self._insert_document_links(
document_id=document_id,
source_canonical_url=document.canonical_url,
content_markdown=document.content_markdown,
)
self.connection.commit()
return len(created)

@@ -491,9 +501,10 @@ def synchronize_document_chunks(self, maximum: Optional[int] = None) -> int:
total_created += self.chunk_document(row["id"])
return total_created

def _insert_document_links(self, *, document_id: str, content_markdown: str) -> None:
def _insert_document_links(self, *, document_id: str, source_canonical_url: str, content_markdown: str) -> None:
to_document_ids = _to_document_ids_from_markdown(
source_document_id=document_id,
source_canonical_url=source_canonical_url,
content_markdown=content_markdown,
)
for to_document_id in sorted(to_document_ids):
@@ -595,76 +606,83 @@ def _next_wordpress_url(base_url: str) -> Optional[str]:
seen_feed_urls.add(normalized_url)
pages_processed += 1

start_time = time.monotonic() if paginate else None
try:
feed = feedparser.parse(current_url)
except Exception:
if pages_processed == 1:
raise
break

if pages_processed > 1:
status = getattr(feed, "status", None)
if status is not None and status >= 400:
break
if getattr(feed, "bozo", 0) and not feed.entries:
try:
feed = feedparser.parse(current_url)
except Exception:
if pages_processed == 1:
raise
break

page_feed_name = feed.feed.get("title")
if page_feed_name:
feed_name = feed_name or page_feed_name

for entry in feed.entries:
entry_url = entry.get("link") or entry.get("id")
if not entry_url:
continue
canonical_url = _canonicalize_url(entry_url)
canonical_domain = _normalize_domain(canonical_url)
source = self._get_source_by_domain(canonical_domain)
if not source:
source = self.upsert_source(
canonical_domain=canonical_domain,
name=feed_name,
rss_feed=feed_url,
)
else:
updated_name = source.name or feed_name
updated_rss_feed = source.rss_feed or feed_url
if updated_name != source.name or updated_rss_feed != source.rss_feed:
if pages_processed > 1:
status = getattr(feed, "status", None)
if status is not None and status >= 400:
break
if getattr(feed, "bozo", 0) and not feed.entries:
break

page_feed_name = feed.feed.get("title")
if page_feed_name:
feed_name = feed_name or page_feed_name

for entry in feed.entries:
entry_url = entry.get("link") or entry.get("id")
if not entry_url:
continue
canonical_url = _canonicalize_url(entry_url)
canonical_domain = _normalize_domain(canonical_url)
source = self._get_source_by_domain(canonical_domain)
if not source:
source = self.upsert_source(
canonical_domain=canonical_domain,
name=updated_name,
rss_feed=updated_rss_feed,
name=feed_name,
rss_feed=feed_url,
)
else:
updated_name = source.name or feed_name
updated_rss_feed = source.rss_feed or feed_url
if updated_name != source.name or updated_rss_feed != source.rss_feed:
source = self.upsert_source(
canonical_domain=canonical_domain,
name=updated_name,
rss_feed=updated_rss_feed,
)
content = None
if entry.get("content"):
content = entry.content[0].value
content = content or entry.get("summary") or ""
content_markdown = html_to_markdown(content)
if not content_markdown or not content_markdown.strip():
continue
title = entry.get("title")
published_at = _parse_datetime(entry.get("published") or entry.get("updated"))
documents.append(
self.upsert_document(
source_id=source.id,
canonical_url=canonical_url,
title=title,
published_at=published_at,
content_markdown=content_markdown,
)
content = None
if entry.get("content"):
content = entry.content[0].value
content = content or entry.get("summary") or ""
content_markdown = html_to_markdown(content)
if not content_markdown or not content_markdown.strip():
continue
title = entry.get("title")
published_at = _parse_datetime(entry.get("published") or entry.get("updated"))
documents.append(
self.upsert_document(
source_id=source.id,
canonical_url=canonical_url,
title=title,
published_at=published_at,
content_markdown=content_markdown,
)
)

if not paginate:
break
if pages_processed >= 100:
break
if not paginate:
break
if pages_processed >= 100:
break

next_url = _next_feed_url(feed, current_url)
if not next_url and _is_wordpress_feed(feed):
next_url = _next_wordpress_url(current_url)
if not next_url:
break
current_url = next_url
next_url = _next_feed_url(feed, current_url)
if not next_url and _is_wordpress_feed(feed):
next_url = _next_wordpress_url(current_url)
if not next_url:
break
current_url = next_url
finally:
if start_time is not None:
elapsed = time.monotonic() - start_time
if elapsed < 1.0:
time.sleep(1.0 - elapsed)

return documents

2 changes: 1 addition & 1 deletion packages/search-core/tests/conftest.py
@@ -391,7 +391,7 @@ def log_message(self, format, *args):  # noqa: A003 - match base signature
"body": """
<html>
<head><title>Attention Required</title></head>
<body>Please verify you are human (captcha challenge)</body>
<body>Please verify you are human (reCAPTCHA challenge)</body>
</html>
""",
}