Releases: D4Vinci/Scrapling
Release v0.4.9
A maintenance update packed with community-reported fixes
New Stuff and quality of life changes
- Updated all browsers and fingerprints. Run `scrapling install --force` after updating to refresh them.
- Added a `--version` flag to the CLI by @ETM-Code in #303 (Solves #299)
Bug Fixes
- Fixed the session-level `proxy` argument being silently ignored in HTTP sessions, which could leak your real IP (Solves #295). Note that mixing a session-level `proxy` with a per-request `proxies` argument (or vice versa) now raises an error instead of one being silently dropped.
- Fixed browser navigations failing when combining `init_script` with `user_data_dir` (Solves #294).
- Fixed encoding detection when websites quote the charset value in the `Content-Type` header by @Bortlesboat in #323.
- Fixed an `IndexError` in adaptive element relocation when `auto_save` is enabled by @Mubashirrrr in #340.
- Fixed spiders' checkpoint and cache saving crashing on Windows by @MrStarkEG in #344.
- Fixed incorrect similarity scoring in `find_similar` for elements with mismatched attribute counts (Solves #322).
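The quoted-charset fix concerns headers such as `Content-Type: text/html; charset="utf-8"`. The following is a minimal standard-library sketch of the idea, not Scrapling's implementation; `charset_from_content_type` is a hypothetical helper name used only for illustration.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Return the charset declared in a Content-Type header value.

    Hypothetical helper for illustration; not part of Scrapling.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_param() strips surrounding double quotes from parameter values by default
    charset = msg.get_param("charset", failobj=default)
    return str(charset).strip().lower()


quoted = charset_from_content_type('text/html; charset="ISO-8859-1"')
bare = charset_from_content_type("text/html; charset=utf-8")
missing = charset_from_content_type("text/html")
```

Quoted and unquoted values resolve to the same normalized name, and a header without a charset falls back to the default.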
Docs
- Clarified that the default installation includes the parser engine only, and the fetchers/spiders need the extras (Solves #343).
- Fixed the Docker image name in the remaining examples by @evanclan in #315.
- Fixed a broken link in the contribution guide by @Bortlesboat in #320.
Special thanks to the community for all the continuous testing and feedback
Big shoutout to our Platinum Sponsors
Release v0.4.8
A big spider update that takes the crawling framework to the next level
New Stuff and quality of life changes
- Added a `LinkExtractor` primitive in `scrapling.spiders.LinkExtractor` to pull URLs out of a `Response`. There are a lot of controls (Check the docs)
  ```python
  from scrapling.spiders import LinkExtractor

  extractor = LinkExtractor(allow=r"/posts/", deny_domains=["ads.example.com"])
  ```
- Added `CrawlSpider` and `CrawlRule` generic spider templates so you no longer have to hand-write the same "follow links matching this pattern" boilerplate. Override `rules()` to return a list of `CrawlRule` objects, each pairing a `LinkExtractor` with an optional callback. (Check the docs)
  ```python
  from scrapling.spiders import CrawlSpider, CrawlRule, LinkExtractor

  class QuotesSpider(CrawlSpider):
      name = "blog"
      start_urls = ["https://quotes.toscrape.com/"]

      def rules(self):
          return [
              CrawlRule(LinkExtractor(allow=r"/author/"), callback=self.parse_author),
              CrawlRule(LinkExtractor(allow=r"/page/\d+/")),  # pagination, no callback
          ]

      async def parse_author(self, response):
          yield {
              "name": response.css(".author-title::text").get(),
              "birthday": response.css(".author-born-date::text").get(),
              "url": response.url,
          }
  ```
- Added a `SitemapSpider` template that seeds a crawl directly from sitemap or `robots.txt` URLs. It handles gzip-compressed sitemaps and offers many controls and options. URLs are dispatched via the crawl rules, as shown above for `CrawlSpider`. (Check the docs)
  ```python
  from scrapling.spiders import SitemapSpider, CrawlRule, LinkExtractor

  class NewsSitemap(SitemapSpider):
      name = "news"
      sitemap_urls = ["https://example.com/robots.txt"]

      def rules(self):
          return [
              CrawlRule(LinkExtractor(allow=r"/articles/"), callback=self.parse_article),
          ]

      async def parse_article(self, response):
          yield {"url": response.url, "title": response.css("h1::text").get()}
  ```
- Adaptive relocation now defaults to a 40% similarity threshold instead of `0` across all methods, which makes the adaptive feature more reliable. When nothing crosses the threshold, a warning now tells you the top score it did see, so you can lower `percentage` deliberately if needed.
- Updated all browsers and fingerprints. Run `scrapling install --force` after updating to refresh them.
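The threshold behavior described above can be pictured with a small sketch. This is not Scrapling's relocation code: `pick_relocated`, the score dictionary, and the choice to treat a score equal to the threshold as a match are all assumptions made for illustration.

```python
import warnings
from typing import Optional


def pick_relocated(scores: dict[str, float], percentage: float = 40.0) -> Optional[str]:
    """Return the best-scoring candidate, or None if none reaches `percentage`.

    Hypothetical helper; the real engine scores elements, not plain strings.
    """
    if not scores:
        return None
    best, top = max(scores.items(), key=lambda kv: kv[1])
    if top < percentage:
        # Report the best score seen so the caller can lower the threshold on purpose
        warnings.warn(f"No candidate reached {percentage}%; top score was {top}%")
        return None
    return best


strong = pick_relocated({"div.price": 72.5, "span.old": 10.0})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    weak = pick_relocated({"div.price": 35.0})
```

With a threshold of `0`, the weak candidate would have been returned as a "best guess"; with 40%, the caller gets `None` plus a warning that names the top score.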
Bug Fixes
- Fixed `Fetcher.configure(...)` not applying to per-request calls. Same fix applied to `AsyncFetcher`.
- Fixed incorrect request fingerprinting that caused duplicate requests in spiders by @yetval in #255.
- Fixed the Adaptive scraping engine staying silent on weak matches. Combined with the threshold change above, you now get a warning instead of a misleading "best guess" element when relocation fails.
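Duplicate filtering in a crawler generally depends on a stable fingerprint per request. The sketch below shows one common way to build such a fingerprint; it is not the algorithm Scrapling uses, and `request_fingerprint` is a hypothetical function name.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def request_fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Hash the method, a canonicalized URL, and the body into one hex digest.

    Hypothetical helper for illustration; not Scrapling's fingerprinting.
    """
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 hash identically
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Lowercase scheme and host, default an empty path to "/", drop the fragment
    canonical = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, "")
    )
    digest = hashlib.sha256()
    for chunk in (method.upper().encode(), canonical.encode(), body):
        digest.update(chunk)
        digest.update(b"\x00")  # separator keeps field boundaries unambiguous
    return digest.hexdigest()


seen: set[str] = set()
for candidate in ("https://example.com/?a=1&b=2", "https://example.com/?b=2&a=1"):
    seen.add(request_fingerprint("GET", candidate))
```

Both URLs collapse to a single entry in `seen`, while a different method or body would produce a different digest.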
Docs
- Refreshed older code examples across the documentation to match the current version.
- Improved the code copy-paste experience on the docs site and trimmed the agent skill so it uses fewer tokens per invocation.
Special thanks to the community for all the continuous testing and feedback
Release v0.4.7
A focused update bringing eyes to your AI agents
New Stuff and quality of life changes
- Added a `screenshot` MCP tool that captures a page and returns it as a real MCP `ImageContent` block so the model can actually see it. The tool requires an open browser session, so you call `open_session` first (either `dynamic` or `stealthy`) and pass the `session_id` here. Supports PNG and JPEG, full-page captures, JPEG quality, and the usual readiness controls (`wait`, `wait_selector`, `network_idle`, `timeout`). (Implements #244)
- Added a custom `session_id` parameter to `open_session` so you can name sessions meaningfully (`"search"`, `"checkout"`) instead of the random 12-character hex default. By @hauntedhost in #243
Bug Fixes
- Fixed `FetcherSession` state corruption and a lazy session close crash. By @yetval in #245
- Fixed `TypeError: Session.request() got an unexpected keyword argument 'block_ads'` when using the CLI's `--ai-targeted` flag with HTTP commands. By @voidborne-d in #249 (Fixes #247)
Translations
Special thanks to the community for all the continuous testing and feedback
Release v0.4.6
A focused update on browser stealth, privacy, and developer experience
New Stuff and quality of life changes
- Added built-in ad blocking for browser fetchers. Pass `block_ads=True` to block requests to ~3,500 known ad and tracker domains at the route interception level: no DNS, no TCP, instant abort. Can be combined with `blocked_domains` for custom lists. The MCP server and CLI `--ai-targeted` mode enable this automatically to save tokens and speed up page loads.
  ```python
  from scrapling.fetchers import StealthyFetcher
  page = StealthyFetcher.fetch('https://example.com', block_ads=True)
  ```
- Added DNS-over-HTTPS support to prevent DNS leaks when using proxies. Pass `dns_over_https=True` to route DNS queries through Cloudflare's DoH, so your real location isn't exposed through DNS resolution even when your HTTP traffic goes through a proxy.
  ```python
  from scrapling.fetchers import StealthyFetcher
  page = StealthyFetcher.fetch('https://example.com', proxy='http://proxy:8080', dns_over_https=True)
  ```
- Added a `page_setup` callback for browser fetchers: a function that runs before `page.goto()`, letting you register event listeners, routes, or scripts that must be set up before the page navigates. Pairs with `page_action` (which runs after navigation). (Solves #237)
  ```python
  from scrapling.fetchers import DynamicFetcher

  def capture_websockets(page):
      # Registered before navigation, so sockets opened during page load are seen
      page.on("websocket", lambda ws: print(f"WS: {ws.url}"))

  page = DynamicFetcher.fetch('https://example.com', page_setup=capture_websockets)
  ```
- Added `--block-ads` and `--dns-over-https` CLI options to both `fetch` and `stealthy-fetch` commands.
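To make the domain-blocking idea concrete, here is a small sketch of how a request URL could be checked against a blocklist before it is aborted. It assumes that a blocked domain also covers its subdomains; `is_blocked` and the sample domains are hypothetical and do not reflect Scrapling's internal matching.

```python
from urllib.parse import urlsplit


def is_blocked(url: str, blocked_domains: set[str]) -> bool:
    """True if the URL's host equals a blocked domain or is a subdomain of one.

    Hypothetical helper for illustration; not Scrapling's matcher.
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # For "cdn.tracker.net", check "cdn.tracker.net", "tracker.net", then "net"
    return any(".".join(labels[i:]) in blocked_domains for i in range(len(labels)))


blocklist = {"ads.example.com", "tracker.net"}
results = {
    url: is_blocked(url, blocklist)
    for url in (
        "https://ads.example.com/banner.js",
        "https://cdn.tracker.net/pixel",
        "https://example.com/",
        "https://nottracker.net/",
    )
}
```

Matching on whole labels means `nottracker.net` is not caught by a `tracker.net` entry, which a plain suffix check would get wrong.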
Bug Fixes
- Fixed `Seconds` type alias rejecting float values. Passing `wait=1.5` or `timeout=500.0` to browser fetchers would fail with a type error because the type alias incorrectly treated `float` as metadata instead of a type. By @kuishou68 in #240
- Fixed duplicate ID segments in full-path selector generation. Elements with `id` attributes had their selector appended twice when generating full CSS/XPath paths, producing selectors like `body > #main > #main > #target > #target`. Also fixed full-path XPath emitting bare `[@id='x']` predicates (invalid XPath) instead of `*[@id='x']`. By @sjhddh in #241
- Fixed missing shell signature parameters. The interactive shell was missing `blocked_domains`, `block_ads`, `retries`, `retry_delay`, `capture_xhr`, `executable_path`, and `dns_over_https` from its function signatures.
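The `Seconds` fix is easier to see with a toy alias. In `typing.Annotated`, only the first argument is the type; everything after it is metadata. The two aliases below are illustrative assumptions and are not Scrapling's actual definitions.

```python
from typing import Annotated, Union, get_args

# Buggy shape (assumed for illustration): `float` follows the first argument,
# so it is stored as metadata and the annotated type is just `int`
BuggySeconds = Annotated[int, float, "seconds"]

# Fixed shape: `float` is part of the annotated type itself
FixedSeconds = Annotated[Union[int, float], "seconds"]

# get_args() on an Annotated alias returns (type, *metadata)
buggy_type, *buggy_meta = get_args(BuggySeconds)
fixed_type, *fixed_meta = get_args(FixedSeconds)
```

A validator that reads `buggy_type` only ever sees `int`, which is why a value such as `1.5` would be rejected.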
Special thanks to the community for all the continuous testing and feedback
Release v0.4.5
A focused update with one big quality-of-life feature for spider developers and a couple of important fixes
New Stuff and quality of life changes
- Spider Development Mode: Iterating on a spider's `parse()` logic used to mean re-hitting the target servers on every run, which is slow, noisy, and a great way to get rate-limited while you're still figuring out your selectors. The new development mode caches every response to disk on the first run and replays them from disk on every subsequent run, so you can tweak your callbacks and re-run as many times as you want without making a single network request. Enable it with one class attribute:
  ```python
  from scrapling.spiders import Spider

  class MySpider(Spider):
      name = "my_spider"
      start_urls = ["https://example.com"]
      development_mode = True

      async def parse(self, response):
          yield {"title": response.css("title::text").get("")}
  ```
  The cache lives in `.scrapling_cache/{spider.name}/` by default and can be redirected anywhere with `development_cache_dir`. Two new stat counters, `cache_hits` and `cache_misses`, let you see how the cache performed. Cache replay bypasses `download_delay`, rate limiting, and the blocked-request retry path so iteration is as fast as the disk allows. Don't ship a spider with `development_mode = True`: it's a development tool, not a production cache. See the docs for the full story.
- Safer redirects by default: `follow_redirects` now defaults to `"safe"` across all HTTP fetchers, the MCP server, and the shell. Redirects are still followed, but ones targeting internal/private IPs (loopback, private networks, link-local) are rejected. This protects you from SSRF when scraping user-supplied URLs. Pass `follow_redirects="all"` to get the old behavior, or `False` to disable redirects entirely.
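The "safe" redirect rule can be approximated with the standard `ipaddress` module. The sketch below only inspects literal IP hosts and is an assumption about the general technique, not Scrapling's implementation; `is_internal_target` is a hypothetical name.

```python
import ipaddress
from urllib.parse import urlsplit


def is_internal_target(url: str) -> bool:
    """True if the URL's host is a literal loopback, private, or link-local IP.

    Hypothetical helper; a complete check would also resolve hostnames first.
    """
    host = urlsplit(url).hostname or ""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so this simplified check lets it through
        return False
    return ip.is_loopback or ip.is_private or ip.is_link_local


redirect_targets = [
    "http://127.0.0.1/admin",
    "http://10.0.0.5/",
    "http://169.254.169.254/latest/meta-data",
    "http://[::1]/",
    "https://example.com/",
]
rejected = [url for url in redirect_targets if is_internal_target(url)]
```

Only the public hostname survives the filter; the loopback, private-network, and link-local targets are the ones an SSRF attempt would typically aim at.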
Bug Fixes
- Force-stop no longer loses your checkpoint: Pressing Ctrl+C twice (force-stop) on a spider with `crawldir` enabled used to race against the checkpoint write. The cancel scope would tear down the task before the pickle finished, leaving `paused=False` and triggering the cleanup path that deletes the previous checkpoint. The result was that force-stopping a long crawl could lose all the progress you were trying to save. The engine now writes the checkpoint before calling `cancel_scope.cancel()`, so a force-stop always preserves the latest pending state. By @voidborne-d in #230.
Special thanks to the community for all the continuous testing and feedback
Release v0.4.4
A new update with important spider improvements and bug fixes
New Stuff and quality of life changes
- Added robots.txt compliance to the Spider framework with a new `robots_txt_obey` option. When enabled, the spider will automatically fetch and respect robots.txt rules before crawling, including `Disallow`, `Crawl-delay`, and `Request-rate` directives. Robots.txt files are fetched concurrently and cached per domain for the entire crawl. By @AbdullahY36 in #226
- Added robots.txt cache pre-warming so all `start_urls` domains have their robots.txt fetched and parsed before the crawl loop begins, avoiding delays on the first request to each domain.
- Added a new `robots_disallowed_count` stat to `CrawlStats` to track how many requests were blocked by robots.txt rules during a crawl.
Check it out on the website from here
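For readers unfamiliar with the three directives mentioned above, here is a short sketch of what they mean in practice. It uses the standard library's `urllib.robotparser` as a stand-in (this release itself depends on `protego`), and the robots.txt content and the `my-bot` user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample rules, invented for this example
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
Request-rate: 5/10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("my-bot", "https://example.com/public/page")
blocked = parser.can_fetch("my-bot", "https://example.com/private/page")
delay = parser.crawl_delay("my-bot")  # seconds to wait between requests
rate = parser.request_rate("my-bot")  # named tuple with .requests and .seconds
```

A compliant crawler would skip the `/private/` URL, wait at least `delay` seconds between requests, and stay under `rate.requests` requests per `rate.seconds` seconds.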
Bug Fixes
- Fixed a critical MRO issue with `ProxyRotator` where the `_build_context_with_proxy` stub was shadowing the real implementation from child classes, causing proxy rotation to always raise `NotImplementedError` (Fixes #215). Thanks @yetval
- Fixed a page pool leak when using per-request proxy rotation with browser sessions. Pages created inside temporary contexts were not removed from the pool on cleanup, leading to stale references accumulating over time. By @yetval in #223
- Fixed a missing type assertion in the static fetcher where `curl_cffi` could return `None` from `session.request()`, causing downstream errors.
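The MRO problem above is a general Python pitfall: when a class inherits from both a stub and a real implementation, the one that appears first in the method resolution order wins. The class names below are hypothetical and do not mirror Scrapling's actual hierarchy; the sketch only shows how a stub can shadow a working method.

```python
class RealImplementation:
    """Provides the working method."""

    def build_context(self) -> str:
        return "context with proxy"


class StubMixin:
    """Declares a placeholder that another class is expected to provide."""

    def build_context(self) -> str:
        raise NotImplementedError


class Broken(StubMixin, RealImplementation):
    pass  # StubMixin is first in the MRO, so its stub shadows the real method


class Working(RealImplementation, StubMixin):
    pass  # the real method is found before the stub


try:
    Broken().build_context()
    broken_raised = False
except NotImplementedError:
    broken_raised = True

working_result = Working().build_context()
```

Inspecting `Broken.__mro__` shows `StubMixin` ahead of `RealImplementation`, which is exactly the ordering that makes every call raise.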
Other
- Updated dependencies, so expect the latest fingerprints and other stuff.
- Added `protego` as a new dependency under the `fetchers` optional group for robots.txt parsing.
Special thanks to the community for all the continuous testing and feedback
Release v0.4.3
A new update with many important changes
New Stuff and quality of life changes
- Added a new MCP tool to open a persistent normal/stealthy browser to keep using it with the rest of the tools, and another new tool to close it. (Examples)
- Added a new MCP tool to list all existing browser sessions. Aimed to be used with the new tools.
- Added a new option to browser sessions to automatically collect all background requests that happen during a request (Solves #159) [Examples].
- Added a new sanitizer to protect the MCP server from common Prompt Injection attacks by removing hidden/invisible content.
- Added a new command-line option called `--ai-targeted` to the web scraping commands. It makes the output AI-targeted and safe against common prompt injection attacks, as the MCP server does.
- Added a new option to browser sessions called `executable_path` to allow setting a custom browser path (Solves #202)
- Refactored the MCP server code to be easier to maintain and unified all tools to be async.
- Refactored the CLI commands code to be easier to maintain and shorter by 210 lines.
Bug Fixes
- Preserved the HTTP method across retries in spider sessions by @karesansui-u in #201
- Added a max retry limit to getting page content to prevent an infinite loop by @haosenwang1018 & @D4Vinci in #197
- Replaced a bare `raise` with `return False` in `_restore_from_checkpoint` by @haosenwang1018 in #196
- Replaced `get_all` with `getall` in `TextHandler` to match the Selector class.
Coverage/tests improvement
- Added `_normalize_credentials` edge case coverage tests by @Bortlesboat in #192
- Added save/retrieve round-trip and core storage coverage tests by @haosenwang1018 in #193
- Added coverage for `TextHandler` regex paths and `TextHandlers.re()` by @haosenwang1018 in #194
- Added edge case tests for `filter`, `iterancestors`, and `find_similar` by @awanawana in #200
Agent Skill improvement
- Fixed broken markdown links in skill references by @yetval in #204
- Improved the skill structure so it passes Clawhub validation more reliably.
- Forced the skill to use the `--ai-targeted` command-line option when scraping through command-line commands.
Docs improvement
- Added Korean README translation by @greatsk55 in #187
- CJK Latin spacing fixes for the Chinese and Japanese READMEs.
- Fixed broken links from the old website design.
Special thanks to the community for all the continuous testing and feedback
Release v0.4.2
A new maintenance update with important changes
Bug fixes
- The function `get_all_text()` now captures tail text nodes. This will make the MCP server and commands see text that was missed before (#168). Thanks @mhillebrand
- Referer now returns a bare Google URL instead of a Google search URL. The previous logic was incorrect and may have produced a fingerprinting signal (#179). Thanks @Bortlesboat
- Fixed an issue with extra flags concatenation in all browsers. Thanks @rostchri
- Fixed a type hints issue with Python versions below 3.12 that caused it to crash. (Solves #163)
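"Tail text" in the first fix refers to text that follows a child element's closing tag but still belongs to the parent. The sketch below illustrates the concept with the standard library's `xml.etree.ElementTree`, which shares the `.text`/`.tail` model; it is not Scrapling's `get_all_text()` code, and the sample markup is invented.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<p>Hello <b>bold</b> and tail</p>")
bold = root.find("b")

# Text directly inside each element, ignoring tails: the trailing words are lost
without_tails = (root.text or "") + (bold.text or "")

# itertext() walks the tree and also yields each child's tail text
with_tails = "".join(root.itertext())
```

Here `bold.tail` holds `" and tail"`, which is the kind of text a collector misses if it only reads `.text` on each element.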
Other
- Added an Agent Skill for Claude Code / OpenClaw and other AI agentic tools.
- Added the Agent Skill to Clawhub.
- Updated all browsers and Playwright versions to the latest.
- Added a French translation to the main README file.
Special thanks to the community for all the continuous testing and feedback
Release v0.4.1
A new update with many important changes
π New Stuff and quality of life changes
- Improved regex precision for Cloudflare challenge detection (Thanks to @RinZ27 #133)
- Improved the speed and efficiency of the Cloudflare solver. Now it is nearly twice as fast.
- Improved the Cloudflare solver to handle the case where websites sometimes show the Cloudflare page twice before redirecting to the main website.
- Improved the stealthy browser's stealth mode and speed by removing the injected JS files.
- Improved the MCP schema so it is accepted by OpenCode (Thanks to @robin-ede #137)
- Made the MCP schema even more MCP-friendly so it is accepted by VS Code Copilot and other strict tools. (Solves #150)
- Reduced the MCP server's token consumption by a large margin by stripping useless HTML tags while the `main_content_only` option is activated.
- Fixed the PyPI page and added the files to register the MCP server to the MCP servers registry.
- Added a new code snippet to show how to install the browser dependencies through code instead of the command line, to allow easier automation.
- Improved all workflows by using the latest actions versions (Thanks to @salmanmkc #143/#144)
Special thanks to the community for all the continuous testing and feedback
Release v0.4
The biggest release of Scrapling yet: introducing the Spider framework, proxy rotation, and major parser improvements
This release brings a fully async spider/crawling framework, intelligent proxy management, and significant API changes that make Scrapling more powerful and consistent. Please review the breaking changes section carefully before upgrading.
Spider Framework
A new async crawling framework built on top of anyio for structured, large-scale scraping:
```python
from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}

MySpider().start()
```
- Scrapy-like Spider API: Define spiders with `start_urls`, async `parse` callbacks, `Request`/`Response` objects, and priority queue.
- Concurrent Crawling: Configurable concurrency limits, per-domain throttling, and download delays.
- Multi-Session Support: Unified interface for HTTP requests, and stealthy headless browsers in a single spider - route requests to different sessions by ID. Supports lazy session initialization.
- Pause & Resume: Checkpoint-based crawl persistence. Press Ctrl+C to gracefully shut down; then restart to resume from where you left off.
- Streaming Mode: Stream scraped items as they arrive via `async for item in spider.stream()` with real-time stats - ideal for UI, pipelines, and long-running crawls.
- Blocked Request Detection: Automatic detection and retry of blocked requests with customizable logic.
- Built-in Export: Export results through hooks and your own pipeline or the built-in JSON/JSONL with `result.items.to_json()`/`result.items.to_jsonl()` respectively.
- Lifecycle hooks: `on_start()`, `on_close()`, `on_error()`, `on_scraped_item()`, and more hooks for full control over the crawl lifecycle.
- Detailed crawl stats: track requests, responses, bytes, status codes, proxies, per-domain/session breakdowns, log level counts, and more.
- uvloop support: Pass `use_uvloop=True` to `spider.start()` for faster async execution when available.
A new section has been added to the website with the full details.
Proxy Rotation
- New `ProxyRotator` class with thread-safe rotation. Works with all fetchers and sessions:
  ```python
  from scrapling import ProxyRotator
  from scrapling.fetchers import Fetcher

  url = "https://example.com"  # placeholder target
  rotator = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])
  Fetcher.get(url, proxy_rotator=rotator)
  ```
- Custom rotation strategies: Make your own proxy rotation logic
- Per-request proxy override: Pass `proxy=` to any individual `get()`/`post()`/`fetch()` call to override the session proxy for that request.
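"Thread-safe rotation" can be pictured with a plain round-robin sketch. The class below is hypothetical and is not Scrapling's `ProxyRotator`; it only shows one way to guard a shared iterator with a lock so concurrent callers never corrupt the rotation.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor


class RoundRobinRotator:
    """Hypothetical rotator: hands out proxies in order, safely across threads."""

    def __init__(self, proxies: list[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._lock = threading.Lock()

    def next_proxy(self) -> str:
        # The lock ensures two threads never advance the iterator at the same time
        with self._lock:
            return next(self._cycle)


rotator = RoundRobinRotator(["http://proxy1:8080", "http://proxy2:8080"])
with ThreadPoolExecutor(max_workers=4) as pool:
    picked = list(pool.map(lambda _: rotator.next_proxy(), range(100)))
```

Across 100 concurrent calls, each of the two proxies is handed out exactly 50 times, regardless of which thread asked.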
Browser Fetcher Improvements
- Domain blocking: New `blocked_domains` parameter on `DynamicFetcher`/`StealthyFetcher` to block requests to specific domains (subdomains matched automatically).
- Automatic retries: Browser fetchers now retry on failure with `retries` (default: 3) and `retry_delay` (default: 1s) parameters. Includes proxy-aware error detection.
- Response metadata: `Response.meta` dict automatically stores the proxy used, and merges request metadata.
- Response.follow(): Create follow-up `Request` objects with automatic referer flow, designed for the spider system.
- No autoplay: Browser sessions now block autoplay content, which caused issues before.
- Speed: Improved stealth and speed by adjusting browser flags.
Bug Fixes & Improvements
- Parser optimization: Optimized the parser for repeated operations, improving performance.
- Errored pages: Fixed a bug that caused the browser to not close when pages gave errors.
- Empty body: Responses with an empty body are now handled.
- Playwright loop: Fixed an issue that left the Playwright loop open when the CDP connection fails.
- Type safety: Fixed all mypy errors and added type hints across untyped function bodies. Added mypy and pyright to the CI workflow.
Breaking Changes
- `css_first`/`xpath_first` removed: Use `css('.selector').first`, `css('.selector')[0]`, or `css('.selector').get()` instead.
- All selection now returns `Selectors`: `css('::text')`, `xpath('//text()')`, `css('::attr(href)')`, and `xpath('//@href')` now return `Selectors` (wrapping text nodes in `Selector` objects with `tag="#text"`) instead of `TextHandlers`. This makes the API consistent across all selection methods and the type hints.
- `Response.body` is always `bytes`: Previously could be `str` or `bytes`, now always returns `bytes`.
- `get()`/`getall()` behavior: On `Selector`: `get()` returns `TextHandler` (serialized HTML or text value), `getall()` returns `TextHandlers`. Aliases: `extract_first = get`, `extract = getall`. Old `get_all()` on `Selectors` is removed.
- `Selectors.first`/`.last`: Safe accessors that return `Selector | None` instead of raising `IndexError`.
- Internal constants renamed: `DEFAULT_FLAGS` → `DEFAULT_ARGS`, `DEFAULT_STEALTH_FLAGS` → `STEALTH_ARGS`, `HARMFUL_DEFAULT_ARGS` → `HARMFUL_ARGS`, `DEFAULT_DISABLED_RESOURCES` → `EXTRA_RESOURCES`.
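The "safe accessor" idea behind `.first`/`.last` is simple to sketch with a plain list subclass. `SafeList` below is a hypothetical stand-in, not Scrapling's `Selectors` class; it only shows the difference between indexing an empty result and using an accessor that returns `None`.

```python
from typing import Any, Optional


class SafeList(list):
    """Hypothetical list subclass with None-returning `first`/`last` accessors."""

    @property
    def first(self) -> Optional[Any]:
        # Empty lists are falsy, so this never indexes out of range
        return self[0] if self else None

    @property
    def last(self) -> Optional[Any]:
        return self[-1] if self else None


empty = SafeList()
filled = SafeList(["a", "b", "c"])

try:
    empty[0]
    index_error = False
except IndexError:
    index_error = True
```

Code migrating away from the removed `css_first`-style helpers can check the accessor's result for `None` instead of wrapping an index lookup in `try`/`except`.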
Other Changes
- Dependency changes: Replaced `tldextract` with `tld`, removed internal `_html_utils.py` in favor of `w3lib.html.replace_entities`, added `typing_extensions` as a hard requirement.
- Docs overhaul: Full switch from MkDocs to Zensical, new spider documentation section, updated all existing pages, and added new API references.
Special thanks to our Discord community for all the continuous testing and feedback




