07 Jun 18:42

github-actions

1490506

Release v0.4.9 (Latest)

A maintenance update packed with community-reported fixes 🛠️

Note

Follow us on X for daily tips and tricks

🚀 New Stuff and quality of life changes

Updated all browsers and fingerprints. Run scrapling install --force after updating to refresh them.
Added a --version flag to the CLI by @ETM-Code in #303 (Solves #299)

🐛 Bug Fixes

Fixed the session-level proxy argument being silently ignored in HTTP sessions, which could leak your real IP (Solves #295). Note that mixing a session-level proxy with a per-request proxies argument (or vice versa) now raises an error instead of one being silently dropped.
Fixed browser navigations failing when combining init_script with user_data_dir (Solves #294).
Fixed encoding detection when websites quote the charset value in the Content-Type header by @Bortlesboat in #323.
Fixed an IndexError in adaptive element relocation when auto_save is enabled by @Mubashirrrr in #340.
Fixed spiders' checkpoint and cache saving crashing on Windows by @MrStarkEG in #344.
Fixed incorrect similarity scoring in find_similar for elements with mismatched attribute counts (Solves #322).
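
The quoted-charset fix in this list concerns headers such as Content-Type: text/html; charset="utf-8". As a rough illustration only, the sketch below uses Python's standard email.message.Message parser rather than Scrapling's internal detection code, and the helper name charset_from_content_type is hypothetical. It shows quoted and unquoted values resolving to the same charset:

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Hypothetical helper: extract the charset from a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_param() unquotes parameter values by default, so "utf-8" and utf-8 match.
    charset = msg.get_param("charset", failobj=default)
    return str(charset).strip().lower()


quoted = charset_from_content_type('text/html; charset="ISO-8859-1"')
unquoted = charset_from_content_type("text/html; charset=ISO-8859-1")
missing = charset_from_content_type("text/html")
print(quoted, unquoted, missing)
```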

Docs

Clarified that the default installation includes the parser engine only, and the fetchers/spiders need the extras (Solves #343).
Fixed the Docker image name in the remaining examples by @evanclan in #315.
Fixed a broken link in the contribution guide by @Bortlesboat in #320.

🙏 Special thanks to the community for all the continuous testing and feedback

Big shoutout to our Platinum Sponsors

Contributors

MrStarkEG, evanclan, and 3 other contributors

11 May 02:00

github-actions

v0.4.8

9ee3501

Release v0.4.8

A big spider update that takes the crawling framework to the next level 🕷️

🚀 New Stuff and quality of life changes

Added a LinkExtractor primitive (scrapling.spiders.LinkExtractor) to pull URLs out of a Response. It exposes many filtering controls (check the docs).
```
from scrapling.spiders import LinkExtractor

extractor = LinkExtractor(allow=r"/posts/", deny_domains=["ads.example.com"])
```

Added CrawlSpider and CrawlRule generic spider templates so you no longer have to hand-write the same "follow links matching this pattern" boilerplate. Override rules() to return a list of CrawlRule objects, each pairing a LinkExtractor with an optional callback. (Check the docs)

```
from scrapling.spiders import CrawlSpider, CrawlRule, LinkExtractor

class QuotesSpider(CrawlSpider):
    name = "blog"
    start_urls = ["https://quotes.toscrape.com/"]

    def rules(self):
        return [
            CrawlRule(LinkExtractor(allow=r"/author/"), callback=self.parse_author),
            CrawlRule(LinkExtractor(allow=r"/page/\d+/")),  # pagination, no callback
        ]

    async def parse_author(self, response):
        yield {
            "name": response.css(".author-title::text").get(),
            "birthday": response.css(".author-born-date::text").get(),
            "url": response.url,
        }
```

Added a SitemapSpider template that seeds a crawl directly from sitemap or robots.txt URLs. It handles gzip-compressed sitemaps and offers many controls and options. URLs are dispatched via crawl rules, as shown above for CrawlSpider. (Check the docs)

```
from scrapling.spiders import SitemapSpider, CrawlRule, LinkExtractor

class NewsSitemap(SitemapSpider):
    name = "news"
    sitemap_urls = ["https://example.com/robots.txt"]

    def rules(self):
        return [
            CrawlRule(LinkExtractor(allow=r"/articles/"), callback=self.parse_article),
        ]

    async def parse_article(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}
```

Adaptive relocation now defaults to a 40% similarity threshold instead of 0 across all methods, so weak matches are no longer returned as results. When nothing crosses the threshold, a warning now tells you the top score it did see, so you can lower percentage deliberately if needed.
Updated all browsers and fingerprints. Run scrapling install --force after updating to refresh them.
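
The threshold behaviour described in this list can be sketched in a few lines. This is an illustration under stated assumptions, not Scrapling's relocation code: difflib.SequenceMatcher stands in for the real similarity score, and the function name relocate and its arguments are hypothetical.

```python
import warnings
from difflib import SequenceMatcher


def relocate(target, candidates, percentage=40):
    """Return candidates scoring at least `percentage`, best first."""
    scored = sorted(
        ((SequenceMatcher(None, target, c).ratio() * 100, c) for c in candidates),
        reverse=True,
    )
    matches = [c for score, c in scored if score >= percentage]
    if not matches and scored:
        top = scored[0][0]
        # Report the best score seen so the caller can lower the threshold on purpose.
        warnings.warn(f"No candidate reached {percentage}%; top score was {top:.1f}%")
    return matches


good = relocate("div.product-title", ["div.product-name", "footer"])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    none_found = relocate("div.product-title", ["zzz"])
```

With a threshold of 0, the unrelated "zzz" candidate would have been returned as a match; with 40 it is rejected and the warning reports the score it reached.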

🐛 Bug Fixes

Fixed Fetcher.configure(...) not applying to per-request calls. Same fix applied to AsyncFetcher.
Fixed incorrect request fingerprinting that caused duplicate requests in spiders by @yetval in #255.
Fixed the Adaptive scraping engine staying silent on weak matches. Combined with the threshold change above, you now get a warning instead of a misleading "best guess" element when relocation fails.
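
For context on the fingerprinting fix above: a spider deduplicates requests by hashing a canonical form of each one. The notes do not describe the scheme Scrapling uses, so the following is only a generic sketch. The function name request_fingerprint and the canonicalization rules (sorted query parameters, lowercased host, dropped fragment) are assumptions.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def request_fingerprint(method, url, body=b""):
    """Hypothetical fingerprint: equal for requests that differ only cosmetically."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 hash identically.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Lowercase scheme and host, and drop the fragment (it is never sent to the server).
    canonical = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, "")
    )
    digest = hashlib.sha256()
    for piece in (method.upper().encode(), canonical.encode(), body):
        digest.update(piece)
        digest.update(b"\x00")  # separator so adjacent fields cannot run together
    return digest.hexdigest()


seen = set()
unique = []
urls = [
    "https://example.com/?a=1&b=2",
    "https://EXAMPLE.com/?b=2&a=1#top",  # same request as the first one
    "https://example.com/?a=2",
]
for u in urls:
    fp = request_fingerprint("GET", u)
    if fp not in seen:
        seen.add(fp)
        unique.append(u)
```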

Docs

Refreshed older code examples across the documentation to match the current version.
Improved the code copy-paste experience on the docs site and trimmed the agent skill so it uses fewer tokens per invocation.

🙏 Special thanks to the community for all the continuous testing and feedback

Contributors

yetval

17 Apr 21:09

github-actions

v0.4.7

54a080a

Release v0.4.7

A focused update bringing eyes to your AI agents 📸

🚀 New Stuff and quality of life changes

Added a screenshot MCP tool that captures a page and returns it as a real MCP ImageContent block so the model can actually see it. The tool requires an open browser session, so you call open_session first (either dynamic or stealthy) and pass the session_id here. Supports PNG and JPEG, full-page captures, JPEG quality, and the usual readiness controls (wait, wait_selector, network_idle, timeout). (implements #244)
Added a custom session_id parameter to open_session so you can name sessions meaningfully ("search", "checkout") instead of the random 12-character hex default. By @hauntedhost in #243
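
The naming behaviour can be pictured with a small stand-in. This is not the MCP server's code: the SessionRegistry class is hypothetical, and rejecting duplicate names is an assumption made for the sketch. Only the "custom name, else random 12-character hex" rule comes from the note above.

```python
import secrets


class SessionRegistry:
    """Hypothetical registry; not the MCP server's actual implementation."""

    def __init__(self):
        self._sessions = {}

    def open_session(self, session_type, session_id=None):
        # Fall back to a random 12-character hex ID when no name is given.
        session_id = session_id or secrets.token_hex(6)
        if session_id in self._sessions:
            # Assumption: duplicate names are rejected rather than overwritten.
            raise ValueError(f"Session {session_id!r} already exists")
        self._sessions[session_id] = session_type
        return session_id


registry = SessionRegistry()
named = registry.open_session("stealthy", session_id="checkout")
generated = registry.open_session("dynamic")
```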

🐛 Bug Fixes

Fixed FetcherSession state corruption and a lazy session close crash. By @yetval in #245
Fixed TypeError: Session.request() got an unexpected keyword argument 'block_ads' when using the CLI's --ai-targeted flag with HTTP commands. By @voidborne-d in #249 (Fixes #247)

Translations

Added a Brazilian Portuguese README translation By @rgomids in #250

🙏 Special thanks to the community for all the continuous testing and feedback

Contributors

hauntedhost, rgomids, and 2 other contributors

13 Apr 13:36

github-actions

v0.4.6

ced9a8d

Release v0.4.6

A focused update on browser stealth, privacy, and developer experience 🔒

🚀 New Stuff and quality of life changes

Added built-in ad blocking for browser fetchers. Pass block_ads=True to block requests to ~3,500 known ad and tracker domains at the route interception level -- no DNS, no TCP, instant abort. Can be combined with blocked_domains for custom lists. The MCP server and CLI --ai-targeted mode enable this automatically to save tokens and speed up page loads.
```
from scrapling.fetchers import StealthyFetcher
page = StealthyFetcher.fetch('https://example.com', block_ads=True)
```
Added DNS-over-HTTPS support to prevent DNS leaks when using proxies. Pass dns_over_https=True to route DNS queries through Cloudflare's DoH, so your real location isn't exposed through DNS resolution even when your HTTP traffic goes through a proxy.
```
from scrapling.fetchers import StealthyFetcher
page = StealthyFetcher.fetch('https://example.com', proxy='http://proxy:8080', dns_over_https=True)
```
Added page_setup callback for browser fetchers. A function that runs before page.goto(), letting you register event listeners, routes, or scripts that must be set up before the page navigates. Pairs with page_action (which runs after navigation). (Solves #237)
```
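from scrapling.fetchers import DynamicFetcher

def capture_websockets(page):
    # Runs before page.goto(), so the listener is in place when navigation starts
    page.on("websocket", lambda ws: print(f"WS: {ws.url}"))

page = DynamicFetcher.fetch('https://example.com', page_setup=capture_websockets)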
```
Added --block-ads and --dns-over-https CLI options to both fetch and stealthy-fetch commands.

🐛 Bug Fixes

Fixed Seconds type alias rejecting float values. Passing wait=1.5 or timeout=500.0 to browser fetchers would fail with a type error because the type alias incorrectly treated float as metadata instead of a type. by @kuishou68 in #240
Fixed duplicate ID segments in full-path selector generation. Elements with id attributes had their selector appended twice when generating full CSS/XPath paths, producing selectors like body > #main > #main > #target > #target. Also fixed full-path XPath emitting bare [@id='x'] predicates (invalid XPath) instead of *[@id='x']. by @sjhddh in #241
Fixed missing shell signature parameters. The interactive shell was missing blocked_domains, block_ads, retries, retry_delay, capture_xhr, executable_path, and dns_over_https from its function signatures.
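
The Seconds fix above turns on how Annotated separates the type from its metadata. The exact alias in Scrapling is not shown in these notes, so the two aliases below are assumed shapes that only illustrate the mechanism with the standard typing module:

```python
from typing import Annotated, Union, get_args

# Broken shape: only the first argument is the type; `float` lands in the metadata.
BrokenSeconds = Annotated[int, float]
# Fixed shape: both numeric types are part of the type itself.
FixedSeconds = Annotated[Union[int, float], "seconds"]

broken_type, *broken_meta = get_args(BrokenSeconds)
fixed_type, *fixed_meta = get_args(FixedSeconds)

print(broken_type, broken_meta)          # the type is int alone; float is metadata
print(get_args(fixed_type), fixed_meta)  # the type covers both int and float
```

A validator that reads only the type part of the broken alias would accept 1 but reject 1.5, which matches the symptom described in the fix.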

🙏 Special thanks to the community for all the continuous testing and feedback

Contributors

sjhddh and kuishou68

07 Apr 04:22

github-actions

v0.4.5

cb449af

Release v0.4.5

A focused update with one big quality-of-life feature for spider developers and a couple of important fixes 🎉

🚀 New Stuff and quality of life changes

Spider Development Mode: Iterating on a spider's parse() logic used to mean re-hitting the target servers on every run, which is slow, noisy, and a great way to get rate-limited while you're still figuring out your selectors. The new development mode caches every response to disk on the first run and replays them from disk on every subsequent run, so you can tweak your callbacks and re-run as many times as you want without making a single network request. Enable it with one class attribute:
```
from scrapling.spiders import Spider

class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]
    development_mode = True

    async def parse(self, response):
        yield {"title": response.css("title::text").get("")}
```
The cache lives in .scrapling_cache/{spider.name}/ by default and can be redirected anywhere with development_cache_dir. Two new stat counters, cache_hits and cache_misses, let you see how the cache performed. Cache replay bypasses download_delay, rate limiting, and the blocked-request retry path so iteration is as fast as the disk allows. Don't ship a spider with development_mode = True -- it's a development tool, not a production cache. See the docs for the full story.
Safer redirects by default: follow_redirects now defaults to "safe" across all HTTP fetchers, the MCP server, and the shell. Redirects are still followed, but ones targeting internal/private IPs (loopback, private networks, link-local) are rejected. This protects you from SSRF when scraping user-supplied URLs. Pass follow_redirects="all" to get the old behavior, or False to disable redirects entirely.
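
To make the "safe" redirect rule concrete, here is a minimal sketch of the kind of check involved, using the standard ipaddress module. It is not Scrapling's implementation: the function name is_safe_redirect is hypothetical, and the sketch inspects IP literals only, whereas a real check would also need to consider what a hostname resolves to.

```python
import ipaddress
from urllib.parse import urlsplit


def is_safe_redirect(location):
    """Hypothetical check: reject redirects whose host is an internal IP literal."""
    host = urlsplit(location).hostname
    if host is None:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A real implementation would also resolve the
        # hostname and check the resulting addresses; this sketch does not.
        return host.lower() != "localhost"
    return not (ip.is_loopback or ip.is_private or ip.is_link_local)


for target in (
    "https://example.com/next",
    "http://127.0.0.1/admin",
    "http://10.0.0.5/",
    "http://169.254.169.254/latest/meta-data/",
    "http://[::1]/",
):
    print(target, is_safe_redirect(target))
```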

🐛 Bug Fixes

Force-stop no longer loses your checkpoint: Pressing Ctrl+C twice (force-stop) on a spider with crawldir enabled used to race against the checkpoint write -- the cancel scope would tear down the task before the pickle finished, leaving paused=False and triggering the cleanup path that deletes the previous checkpoint. The result was that force-stopping a long crawl could lose all the progress you were trying to save. The engine now writes the checkpoint before calling cancel_scope.cancel(), so a force-stop always preserves the latest pending state. By @voidborne-d in #230.

🙏 Special thanks to the community for all the continuous testing and feedback

Contributors

voidborne-d

05 Apr 03:37

github-actions

v0.4.4

065bf1c

Release v0.4.4

A new update with important spider improvements and bug fixes 🎉

🚀 New Stuff and quality of life changes

Added robots.txt compliance to the Spider framework with a new robots_txt_obey option. When enabled, the spider will automatically fetch and respect robots.txt rules before crawling, including Disallow, Crawl-delay, and Request-rate directives. Robots.txt files are fetched concurrently and cached per domain for the entire crawl. By @AbdullahY36 in #226
Added robots.txt cache pre-warming so all start_urls domains have their robots.txt fetched and parsed before the crawl loop begins, avoiding delays on the first request to each domain.
Added a new robots_disallowed_count stat to CrawlStats to track how many requests were blocked by robots.txt rules during a crawl.
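
For readers unfamiliar with the directives mentioned above, the sketch below shows what Disallow, Crawl-delay, and Request-rate look like once parsed. It uses the standard library's urllib.robotparser as a stand-in; Scrapling itself parses robots.txt with protego, and the robots.txt content and user agent name here are made up for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
Request-rate: 5/10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("my-spider", "https://example.com/posts/1")
blocked = parser.can_fetch("my-spider", "https://example.com/private/data")
delay = parser.crawl_delay("my-spider")   # seconds to wait between requests
rate = parser.request_rate("my-spider")   # named tuple: requests per seconds

print(allowed, blocked, delay, rate.requests, rate.seconds)
```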

See the website for details.

🐛 Bug Fixes

Fixed a critical MRO issue with ProxyRotator where the _build_context_with_proxy stub was shadowing the real implementation from child classes, causing proxy rotation to always raise NotImplementedError (Fixes #215). Thanks @yetval
Fixed a page pool leak when using per-request proxy rotation with browser sessions. Pages created inside temporary contexts were not removed from the pool on cleanup, leading to stale references accumulating over time. By @yetval in #223
Fixed a missing type assertion in the static fetcher where curl_cffi could return None from session.request(), causing downstream errors.
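
The ProxyRotator fix above is an instance of a general Python pitfall: with multiple inheritance, whichever class comes first in the method resolution order (MRO) wins. The class names below are hypothetical and the real hierarchy is not described in these notes; the sketch only shows how a stub can shadow a working method.

```python
class RotationStub:
    """Placeholder mixin whose method is meant to be overridden."""

    def build_context(self):
        raise NotImplementedError


class RealImplementation:
    def build_context(self):
        return "context-with-proxy"


class Shadowed(RotationStub, RealImplementation):
    pass  # the stub comes first in the MRO, so it wins


class Working(RealImplementation, RotationStub):
    pass  # the real implementation comes first


try:
    Shadowed().build_context()
    shadowed_result = "ok"
except NotImplementedError:
    shadowed_result = "NotImplementedError"

working_result = Working().build_context()
mro_names = [cls.__name__ for cls in Shadowed.__mro__]
```

Inspecting a class's __mro__, as the last line does, is a quick way to confirm which definition will actually be called.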

Other

Updated dependencies, so expect the latest fingerprints and other stuff.
Added protego as a new dependency under the fetchers optional group for robots.txt parsing.

🙏 Special thanks to the community for all the continuous testing and feedback

Contributors

AbdullahY36 and yetval

30 Mar 03:50

github-actions

v0.4.3

e173f81

Release v0.4.3

A new update with many important changes 🎉

🚀 New Stuff and quality of life changes

Added a new MCP tool that opens a persistent normal/stealthy browser for the rest of the tools to keep using, and another new tool to close it. (Examples)
Added a new MCP tool to list all existing browser sessions, intended for use with the new tools.
Added a new option to browser sessions to automatically collect all background requests that happen during a request (Solves #159) [Examples].
Added a new sanitizer to protect the MCP server from common Prompt Injection attacks by removing hidden/invisible content.
Added a new command-line option called --ai-targeted to the web scraping commands. It tailors the content for AI use and guards against common Prompt Injection attacks, as the MCP server does.
Added a new option to browser sessions called executable_path to allow setting a custom browser path (Solves #202)
Refactored the MCP server code to be easily maintained and unified all tools to be async.
Refactored the CLI commands code to be easily maintained and shorter by 210 lines.

🐛 Bug Fixes

Fixed the HTTP method not being preserved across retries in spider sessions by @karesansui-u in #201
Added a max retry limit when getting page content to prevent an infinite loop by @haosenwang1018 & @D4Vinci in #197
Replaced a bare raise with return False in _restore_from_checkpoint by @haosenwang1018 in #196
Replaced get_all with getall in Texthandler to match the Selector class.

Coverage/tests improvement

Added _normalize_credentials edge case coverage tests by @Bortlesboat in #192
Added save/retrieve round-trip and core storage coverage tests by @haosenwang1018 in #193
Added coverage for TextHandler regex paths and TextHandlers.re() by @haosenwang1018 in #194
Added edge case tests for filter, iterancestors, and find_similar by @awanawana in #200

Agent Skill improvement

Fixed broken markdown links in skill references by @yetval in #204
Improved the skill structure to better satisfy Clawhub validation.
Forced the skill to use the --ai-targeted command-line option when scraping through CLI commands.

Docs improvement

Added Korean README translation by @greatsk55 in #187
CJK Latin spacing fixes for the Chinese and Japanese READMEs.
Fixed broken links from the old website design.

🙏 Special thanks to the community for all the continuous testing and feedback

Contributors

greatsk55, D4Vinci, and 5 other contributors

08 Mar 23:37

github-actions

v0.4.2

19c21c1

Release v0.4.2

A new maintenance update with important changes

Bug fixes

The function get_all_text() now captures tail text nodes. This will make the MCP server and commands see text that was missed before (#168). Thanks @mhillebrand
Referer now returns a bare Google URL instead of a Google search URL. The previous logic was incorrect and may have produced a fingerprinting signal (#179). Thanks @Bortlesboat
Fixed an issue with extra flags concatenation in all browsers. Thanks @rostchri
Fixed a type hints issue with Python versions below 3.12 that caused it to crash. (Solves #163)
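
The get_all_text() fix above hinges on "tail" text, meaning the text that follows an element's closing tag. The sketch below demonstrates the concept with the standard library's xml.etree.ElementTree, which uses the same .text/.tail model as lxml. It is not Scrapling's code, and both helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<p>Hello <b>bold</b> world</p>")


def text_without_tails(element):
    """Naive walk that reads only .text and so drops tail text."""
    return (element.text or "") + "".join(text_without_tails(child) for child in element)


def text_with_tails(element):
    """Walk that also appends each child's .tail (text after its closing tag)."""
    parts = [element.text or ""]
    for child in element:
        parts.append(text_with_tails(child))
        parts.append(child.tail or "")
    return "".join(parts)


print(repr(text_without_tails(root)))
print(repr(text_with_tails(root)))
```

Here " world" is stored as the tail of the b element, so a walk that ignores tails silently loses it.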

Other

Added an Agent Skill for Claude Code / OpenClaw and other AI agentic tools.
Added the Agent Skill to Clawhub.
Updated all browsers and Playwright to the latest versions.
Added a French translation to the main README file.

🙏 Special thanks to the community for all the continuous testing and feedback

Contributors

mhillebrand, rostchri, and Bortlesboat

27 Feb 04:12

github-actions

v0.4.1

2b50123

Release v0.4.1

A new update with many important changes

🚀 New Stuff and quality of life changes

Improved regex precision for Cloudflare challenge detection (Thanks to @RinZ27 #133)
Improved the speed and efficiency of the Cloudflare solver. Now it is nearly twice as fast.
Improved the Cloudflare solver to handle the case where websites sometimes show the Cloudflare page twice before redirecting to the main website.
Improved the stealthy browser's stealth mode and speed by removing the injected JS files.
Improved the MCP schema so that OpenCode accepts it (Thanks to @robin-ede #137)
Made the MCP schema even more MCP-friendly so that VS Code Copilot and other strict tools accept it. (Solves #150)
Reduced the MCP server's token consumption by a large margin by stripping useless HTML tags when the main_content_only option is enabled.
Fixed the PyPI page and added the files to register the MCP server to the MCP servers registry.
Added a new code snippet showing how to install the browser dependencies through code instead of the command line, for easier automation.
Improved all workflows by using the latest actions versions (Thanks to @salmanmkc #143/#144)

🙏 Special thanks to the community for all the continuous testing and feedback

Contributors

salmanmkc, robin-ede, and RinZ27

15 Feb 05:13

github-actions

v0.4

04d796b

Release v0.4

The biggest release of Scrapling yet — introducing the Spider framework, proxy rotation, and major parser improvements

This release brings a fully async spider/crawling framework, intelligent proxy management, and significant API changes that make Scrapling more powerful and consistent. Please review the breaking changes section carefully before upgrading.

🕷️ Spider Framework

A new async crawling framework built on top of anyio for structured, large-scale scraping:

```
from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}

MySpider().start()
```

Scrapy-like Spider API: Define spiders with start_urls, async parse callbacks, Request/Response objects, and priority queue.
Concurrent Crawling: Configurable concurrency limits, per-domain throttling, and download delays.
Multi-Session Support: Unified interface for HTTP requests and stealthy headless browsers in a single spider; route requests to different sessions by ID. Supports lazy session initialization.
Pause & Resume: Checkpoint-based crawl persistence. Press Ctrl+C to gracefully shut down; then restart to resume from where you left off.
Streaming Mode: Stream scraped items as they arrive via async for item in spider.stream() with real-time stats - ideal for UI, pipelines, and long-running crawls.
Blocked Request Detection: Automatic detection and retry of blocked requests with customizable logic.
Built-in Export: Export results through hooks and your own pipeline or the built-in JSON/JSONL with result.items.to_json() / result.items.to_jsonl() respectively.
Lifecycle hooks: on_start(), on_close(), on_error(), on_scraped_item(), and more hooks for full control over the crawl lifecycle.
Detailed crawl stats: track requests, responses, bytes, status codes, proxies, per-domain/session breakdowns, log level counts, and more.
uvloop support: Pass use_uvloop=True to spider.start() for faster async execution when available.

A new section with the full details has been added to the website.

🔄 Proxy Rotation

New ProxyRotator class with thread-safe rotation. Works with all fetchers and sessions:

```
from scrapling import ProxyRotator
from scrapling.fetchers import Fetcher

rotator = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])
Fetcher.get("https://example.com", proxy_rotator=rotator)
```

Custom rotation strategies: Make your own proxy rotation logic
Per-request proxy override: Pass proxy= to any individual get()/post()/fetch() call to override the session proxy for that request.
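
To show what "thread-safe rotation" means in practice, here is a minimal round-robin sketch. It is a stand-in and not the ProxyRotator class itself: the class name RoundRobinRotator and the method get_proxy are hypothetical, and round-robin is only one possible strategy.

```python
import threading
from itertools import cycle


class RoundRobinRotator:
    """Hypothetical stand-in for a thread-safe rotator; not Scrapling's ProxyRotator."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("At least one proxy is required")
        self._cycle = cycle(proxies)
        self._lock = threading.Lock()

    def get_proxy(self):
        # The lock ensures two threads never advance the iterator at once.
        with self._lock:
            return next(self._cycle)


rotator = RoundRobinRotator(["http://proxy1:8080", "http://proxy2:8080"])
picked = [rotator.get_proxy() for _ in range(3)]
print(picked)
```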

🌐 Browser Fetcher Improvements

Domain blocking: New blocked_domains parameter on DynamicFetcher/StealthyFetcher to block requests to specific domains (subdomains matched automatically).
Automatic retries: Browser fetchers now retry on failure with retries (default: 3) and retry_delay (default: 1s) parameters. Includes proxy-aware error detection.
Response metadata: Response.meta dict automatically stores the proxy used, and merges request metadata.
Response.follow(): Create follow-up Request objects with automatic referer flow, designed for the spider system.
No autoplay: Browser sessions now block autoplay content, which previously caused issues.
Speed: Improved stealth and speed by adjusting browser flags.
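
The "subdomains matched automatically" behaviour of blocked_domains can be sketched as a suffix check on the request's host. The function below is hypothetical and only illustrates the matching rule, not how the browser fetchers intercept requests:

```python
from urllib.parse import urlsplit


def is_blocked(url, blocked_domains):
    """Hypothetical matcher: True if the host is a blocked domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    # Require a dot before the suffix so "nottracker.net" does not match "tracker.net".
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


blocked = {"ads.example.com", "tracker.net"}
for url in (
    "https://ads.example.com/pixel.gif",
    "https://cdn.ads.example.com/x.js",
    "https://example.com/",
    "https://nottracker.net/",
):
    print(url, is_blocked(url, blocked))
```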

🔧 Bug Fixes & Improvements

Parser optimization: Optimized the parser for repeated operations, improving performance.
Errored pages: Fixed a bug that caused the browser to not close when pages gave errors.
Empty body: Handle responses with empty body.
Playwright loop: Fixed the Playwright loop being left open when a CDP connection fails.
Type safety: Fixed all mypy errors and added type hints across untyped function bodies. Added mypy and pyright to the CI workflow.

⚠️ Breaking Changes

css_first/xpath_first removed: Use css('.selector').first, css('.selector')[0], or css('.selector').get() instead.
All selection now returns Selectors: css('::text'), xpath('//text()'), css('::attr(href)'), and xpath('//@href') now return Selectors (wrapping text nodes in Selector objects with tag="#text") instead of TextHandlers. This makes the API consistent across all selection methods and the type hints.
Response.body is always bytes: Previously could be str or bytes, now always returns bytes.
get()/getall() behavior: On Selector: get() returns TextHandler (serialized HTML or text value), getall() returns TextHandlers. Aliases: extract_first = get, extract = getall. Old get_all() on Selectors is removed.
Selectors.first/.last: Safe accessors that return Selector | None instead of raising IndexError.
Internal constants renamed: DEFAULT_FLAGS → DEFAULT_ARGS, DEFAULT_STEALTH_FLAGS → STEALTH_ARGS, HARMFUL_DEFAULT_ARGS → HARMFUL_ARGS, DEFAULT_DISABLED_RESOURCES → EXTRA_RESOURCES.
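
The safe .first/.last accessors listed above follow a common pattern: return None for an empty collection instead of raising IndexError. The list subclass below is a hypothetical illustration of that pattern and not the Selectors class itself.

```python
class SafeList(list):
    """Hypothetical list subclass mimicking safe .first/.last accessors."""

    @property
    def first(self):
        return self[0] if self else None

    @property
    def last(self):
        return self[-1] if self else None


empty = SafeList()
items = SafeList(["a", "b", "c"])

print(empty.first, empty.last)   # None for both, no exception
print(items.first, items.last)

try:
    empty[0]
    index_error_raised = False
except IndexError:
    index_error_raised = True    # plain indexing still raises on an empty list
```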

🔨 Other Changes

Dependency changes: Replaced tldextract with tld, removed internal _html_utils.py in favor of w3lib.html.replace_entities, added typing_extensions as a hard requirement.
Docs overhaul: Full switch from MkDocs to Zensical, new spider documentation section, updated all existing pages, and added new API references.

🙏 Special thanks to our Discord community for all the continuous testing and feedback

Big shoutout to our biggest Sponsors

Releases: D4Vinci/Scrapling
