A pi extension that gives your agent a proper web browser. Fetches pages via headless Chrome, extracts clean content via trafilatura, and optionally distills it with an LLM.
Claude Code's WebFetch uses Turndown to convert HTML to markdown — a simple regex-based converter that can't handle JavaScript-rendered pages, doesn't strip boilerplate well, and has no extensibility. pi-web-fetch improves on this in several ways:
| | Claude Code WebFetch | pi-web-fetch |
|---|---|---|
| Rendering | Static HTTP fetch | Headless Chrome (handles SPAs, JS-rendered content) |
| Extraction | Turndown (regex HTML→md) | trafilatura (ML-based boilerplate removal) |
| Raw content | Never — prompt is mandatory | Optional — omit prompt to get full markdown |
| Batch fetching | One URL at a time | Up to 10 URLs concurrently with per-URL progress |
| Concurrency | Sequential | Browser pool with 6 parallel tabs |
| Extensibility | None | Hook system for site-specific handling |
| Smart redirects | Generic | Context-aware (e.g. GitHub URLs → gh CLI suggestions) |
- `web_fetch` tool — registered in pi's tool system, callable by the LLM
- Headless Chrome via puppeteer — handles JavaScript-rendered pages, SPAs, and pages behind cookie banners
- Content extraction — strips boilerplate (nav, ads, footers) using trafilatura's ML-based extraction, outputs clean markdown
- LLM processing — optionally distills page content to answer a specific question via a pi sub-agent
- Batch fetching — fetch up to 10 pages concurrently in a single tool call with per-URL status in the UI
- Browser pool — reuses a single Chrome instance with up to 6 parallel tabs, avoiding repeated browser startup overhead
- Smart large-page handling — when content exceeds ~50KB and no prompt is given, automatically generates a structured summary
- 15-minute cache — avoids redundant fetches; enables summarize-then-drill-down workflows
- Cross-host redirect detection — reported to the LLM for explicit follow-up rather than silently following
- HTTP→HTTPS auto-upgrade
- Extension hooks — customize fetch behavior for specific sites (redirect, replace HTML, transform markdown, override summarization)
- Built-in site handlers — GitHub URLs redirect to the `gh` CLI, Google Docs redirect to workspace tools
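As an illustration of the caching behavior above, the 15-minute cache could be sketched as a TTL map keyed by URL. This is a hypothetical sketch, not the extension's actual implementation:

```typescript
// Sketch: a 15-minute TTL cache keyed by normalized URL.
// Names and eviction strategy are illustrative only.
type CacheEntry = { markdown: string; fetchedAt: number };

const TTL_MS = 15 * 60 * 1000;
const cache = new Map<string, CacheEntry>();

function getCached(url: string): string | undefined {
  const entry = cache.get(url);
  if (!entry) return undefined;
  if (Date.now() - entry.fetchedAt > TTL_MS) {
    cache.delete(url); // expired: drop lazily on lookup
    return undefined;
  }
  return entry.markdown;
}

function storeCached(url: string, markdown: string): void {
  cache.set(url, { markdown, fetchedAt: Date.now() });
}
```

Lazy eviction on lookup is enough for a short TTL like this; it is what enables the summarize-then-drill-down workflow, since a follow-up prompt against the same URL hits the cache instead of refetching.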
- pi coding agent
- A Python tool runner for trafilatura (auto-detected in priority order)
- Node.js 18+
```shell
npm install pi-web-fetch
```

Or clone for development:
```shell
git clone https://github.com/georgebashi/pi-web-fetch
cd pi-web-fetch
npm install
```

Note: `npm install` will download puppeteer's bundled Chromium (~300MB). If you already have Chrome/Chromium installed, you can set `PUPPETEER_EXECUTABLE_PATH` to skip the download:

```shell
export PUPPETEER_EXECUTABLE_PATH=/path/to/chrome
```
Note: The first time `web_fetch` runs, `uvx` will download the trafilatura package (~10MB). Subsequent runs use the cached environment and are fast.
Add the package to your pi settings (`~/.pi/agent/settings.json`):
```json
{
  "extensions": [
    "pi-web-fetch"
  ]
}
```

Or point to a local checkout:

```json
{
  "extensions": [
    "/path/to/pi-web-fetch"
  ]
}
```

Or use the `-e` flag for quick testing:

```shell
pi -e pi-web-fetch
```

The extension registers a `web_fetch` tool. The LLM will use it automatically when it needs to fetch web content.
With a prompt (recommended):
Fetch https://docs.example.com/api and tell me what authentication methods are supported.
Without a prompt (full content):
Fetch the content of https://example.com/changelog
Fetch multiple pages concurrently by asking for several URLs at once:
Read these three pages and compare their approaches to error handling:
The agent will use the `pages` parameter to fetch all URLs in parallel (up to 10 per call). The UI shows live per-URL progress:

```
● docs.python.org/3/tutorial/errors.html
◐ go.dev/blog/error-handling-and-go · fetching
◑ doc.rust-lang.org/book/ch09-00-error-... · extracting
```
Each URL independently transitions through: pending → fetching → extracting → summarizing → done/error. The browser pool manages concurrency automatically (6 tabs max).
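The 6-tab cap can be pictured as a counting semaphore that parks callers until a tab frees up. This is a minimal sketch of the pattern, not pi-web-fetch's actual pool (which also manages the Chrome process lifecycle):

```typescript
// Sketch: limit concurrent fetch work to N "tabs" with a counting semaphore.
// Illustrative only; names are hypothetical.
class Semaphore {
  private waiters: Array<() => void> = [];
  constructor(private slots: number) {}

  async acquire(): Promise<void> {
    if (this.slots > 0) {
      this.slots--;
      return;
    }
    // No free slot: park until release() hands one over.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the slot directly to a waiter
    else this.slots++;
  }
}

const tabs = new Semaphore(6); // 6 parallel tabs max

async function fetchWithTab<T>(work: () => Promise<T>): Promise<T> {
  await tabs.acquire();
  try {
    return await work();
  } finally {
    tabs.release(); // always return the tab, even on error
  }
}
```

Because `acquire` resolves only when a slot is free, callers beyond the cap simply wait, which is the backpressure behavior described above.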
pi-web-fetch has a hook system for site-specific fetch behavior. Extensions can intercept any stage of the pipeline: before fetch, after fetch (HTML), after extraction (markdown), or at summarization time.
- GitHub redirect — matches `github.com/**`, redirects to the `gh` CLI with context-aware suggestions (e.g. `gh issue view 123`)
- Google Docs redirect — matches `docs.google.com/**`, redirects to Google Workspace MCP tools
Extensions are TypeScript modules with a factory function default export:
```typescript
import type { WebFetchExtension } from "pi-web-fetch/types";

export default function (): WebFetchExtension {
  return {
    name: "my-handler",
    matches: ["example.com/**"],
    async beforeFetch(ctx) {
      // Return a HookResult to short-circuit, or void to continue
      return ctx.redirect("Use a different tool for this site.");
    },
    async afterFetch(ctx) {
      // ctx.html — replace HTML or short-circuit
    },
    async afterExtract(ctx) {
      // ctx.markdown — replace markdown or short-circuit
    },
    async summarize(ctx) {
      // Override default LLM summarization
    },
  };
}
```

Extensions are discovered from three sources:

- Event bus — other pi extensions can register handlers via `pi.events.emit("web-fetch:register", extension)`
- Local — TypeScript/JS files in `~/.pi/extensions/web-fetch/` (or the configured `extensionsDir`)
- Built-in — shipped with pi-web-fetch in the `extensions/` directory
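The `matches` globs could be resolved against a URL's host and path along these lines. This is an illustrative sketch; the extension's real matching rules may differ:

```typescript
// Sketch: match a URL's host+path against simple "host/**"-style globs.
// The translation to a regex here is an assumption, not pi-web-fetch's code.
function matchesPattern(url: string, pattern: string): boolean {
  const { host, pathname } = new URL(url);
  const target = host + pathname;
  const regex = new RegExp(
    "^" +
      pattern
        // Escape regex metacharacters (but leave "*" for glob expansion):
        .replace(/[.+^${}()|[\]\\]/g, "\\$&")
        // "**" matches any number of path segments:
        .replace(/\*\*/g, ".*")
        // A lone "*" matches within a single segment:
        .replace(/(?<!\.)\*(?!\*)/g, "[^/]*") +
      "$"
  );
  return regex.test(target);
}
```

With this reading, `github.com/**` matches any path on that host, which is why the built-in GitHub handler fires for issues, PRs, and file views alike.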
Create `~/.pi/agent/web-fetch.json` to override defaults:
```json
{
  "model": "provider/model-id",
  "thinkingLevel": "medium",
  "extensionsDir": "~/.pi/extensions/web-fetch"
}
```

| Key | Default | Description |
|---|---|---|
| `model` | Current session model | Model for LLM content processing |
| `thinkingLevel` | Current session thinking level | Thinking level for the sub-agent |
| `extensionsDir` | `~/.pi/extensions/web-fetch/` | Directory for local extensions |
Without a config file, the extension uses whatever model and thinking level the current session is using.
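That fallback can be sketched as a small resolver. Names here are illustrative; the real loader is internal to the extension:

```typescript
// Sketch: resolve sub-agent settings from an optional config file,
// falling back to the current session's values. Hypothetical shapes.
interface WebFetchConfig {
  model?: string;
  thinkingLevel?: string;
  extensionsDir?: string;
}

interface Session {
  model: string;
  thinkingLevel: string;
}

function resolveConfig(file: WebFetchConfig | undefined, session: Session) {
  return {
    model: file?.model ?? session.model,
    thinkingLevel: file?.thinkingLevel ?? session.thinkingLevel,
    extensionsDir: file?.extensionsDir ?? "~/.pi/extensions/web-fetch/",
  };
}
```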
```
web_fetch(url, prompt?)
│
├─ URL validation & normalization (http→https, scheme check)
├─ Cache check (15-min TTL)
├─ Extension: beforeFetch hook
├─ Browser pool → Puppeteer page (networkidle2, 30s timeout)
├─ Cross-host redirect detection
├─ Extension: afterFetch hook
├─ trafilatura extraction (HTML → clean markdown)
├─ Extension: afterExtract hook
├─ Cache store
└─ Content processing:
   ├─ With prompt → pi sub-agent (focused extraction)
   ├─ Small content, no prompt → return raw markdown
   └─ Large content, no prompt → pi sub-agent (structured summary)
```
For batch mode, each URL runs through this pipeline independently with its own browser tab. The browser pool (6 tabs max, 60s idle timeout) provides backpressure.
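The final content-processing branch can be sketched as a simple dispatch. The ~50KB threshold comes from the description above; the function names and the character-count size proxy are hypothetical:

```typescript
// Sketch of the content-processing branch at the end of the pipeline.
// Threshold from the docs (~50KB); everything else is illustrative.
const LARGE_PAGE_CHARS = 50 * 1024; // characters as a rough byte proxy

type Processing =
  | { kind: "sub-agent"; task: "focused-extraction" | "structured-summary" }
  | { kind: "raw-markdown" };

function chooseProcessing(markdown: string, prompt?: string): Processing {
  if (prompt) {
    // A prompt always routes through the pi sub-agent.
    return { kind: "sub-agent", task: "focused-extraction" };
  }
  if (markdown.length > LARGE_PAGE_CHARS) {
    // Large page, no prompt: auto-generate a structured summary.
    return { kind: "sub-agent", task: "structured-summary" };
  }
  // Small page, no prompt: hand back the full markdown.
  return { kind: "raw-markdown" };
}
```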
MIT