Releases: D4Vinci/Scrapling

Release v0.4.9

07 Jun 18:42
1490506

A maintenance update packed with community-reported fixes 🛠️

🚀 New Stuff and quality of life changes

  • Updated all browsers and fingerprints. Run scrapling install --force after updating to refresh them.
  • Added a --version flag to the CLI by @ETM-Code in #303 (Solves #299)

🐛 Bug Fixes

  • Fixed the session-level proxy argument being silently ignored in HTTP sessions, which could leak your real IP (Solves #295). Note that mixing a session-level proxy with a per-request proxies argument (or vice versa) now raises an error instead of one being silently dropped.
  • Fixed browser navigations failing when combining init_script with user_data_dir (Solves #294).
  • Fixed encoding detection when websites quote the charset value in the Content-Type header by @Bortlesboat in #323.
  • Fixed an IndexError in adaptive element relocation when auto_save is enabled by @Mubashirrrr in #340.
  • Fixed spiders' checkpoint and cache saving crashing on Windows by @MrStarkEG in #344.
  • Fixed incorrect similarity scoring in find_similar for elements with mismatched attribute counts (Solves #322).
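
The quoted-charset fix above can be illustrated with a small standard-library sketch. This is not Scrapling's detection code: the helper name charset_from_content_type and the utf-8 fallback are assumptions made for illustration only.

```python
import re

# Matches charset=utf-8, charset="utf-8" and charset='utf-8' (case-insensitive).
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)[\"']?", re.IGNORECASE)


def charset_from_content_type(header, default="utf-8"):
    """Return the lowercased charset named in a Content-Type header value."""
    match = _CHARSET_RE.search(header or "")
    return match.group(1).lower() if match else default
```

The optional quote on each side of the capture group is what lets a quoted value resolve to the same charset as an unquoted one.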

Docs

  • Clarified that the default installation includes the parser engine only, and the fetchers/spiders need the extras (Solves #343).
  • Fixed the Docker image name in the remaining examples by @evanclan in #315.
  • Fixed a broken link in the contribution guide by @Bortlesboat in #320.

🙏 Special thanks to the community for all the continuous testing and feedback


Big shoutout to our Platinum Sponsors

Release v0.4.8

11 May 02:00
9ee3501

A big spider update that takes the crawling framework to the next level 🕷️

🚀 New Stuff and quality of life changes

  • Added a LinkExtractor primitive (scrapling.spiders.LinkExtractor) that pulls URLs out of a Response. It offers many filtering controls (check the docs).

    from scrapling.spiders import LinkExtractor
    
    extractor = LinkExtractor(allow=r"/posts/", deny_domains=["ads.example.com"])
  • Added CrawlSpider and CrawlRule generic spider templates so you no longer have to hand-write the same "follow links matching this pattern" boilerplate. Override rules() to return a list of CrawlRule objects, each pairing a LinkExtractor with an optional callback. (Check the docs)

    from scrapling.spiders import CrawlSpider, CrawlRule, LinkExtractor
    
    class QuotesSpider(CrawlSpider):
        name = "blog"
        start_urls = ["https://quotes.toscrape.com/"]
    
        def rules(self):
            return [
                CrawlRule(LinkExtractor(allow=r"/author/"), callback=self.parse_author),
                CrawlRule(LinkExtractor(allow=r"/page/\d+/")),  # pagination, no callback
            ]
    
        async def parse_author(self, response):
            yield {
                "name": response.css(".author-title::text").get(),
                "birthday": response.css(".author-born-date::text").get(),
                "url": response.url,
            }
  • Added a SitemapSpider template that seeds a crawl directly from sitemap or robots.txt URLs. It handles gzip-compressed sitemaps and exposes many controls and options. URLs are dispatched via crawl rules, as shown above for CrawlSpider. (Check the docs)

    from scrapling.spiders import SitemapSpider, CrawlRule, LinkExtractor
    
    class NewsSitemap(SitemapSpider):
        name = "news"
        sitemap_urls = ["https://example.com/robots.txt"]
    
        def rules(self):
            return [
                CrawlRule(LinkExtractor(allow=r"/articles/"), callback=self.parse_article),
            ]
    
        async def parse_article(self, response):
            yield {"url": response.url, "title": response.css("h1::text").get()}
  • Adaptive relocation now defaults to a 40% similarity threshold instead of 0 across all methods, so weak matches are no longer returned. When nothing crosses the threshold, a warning now reports the top score it did see, so you can lower the percentage deliberately if needed.

  • Updated all browsers and fingerprints. Run scrapling install --force again after updating to refresh them.
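
The threshold behaviour described above can be sketched in a few lines. This is a toy model, not the adaptive engine itself: the function name pick_relocated, the score dictionary, and the "at or above" comparison are all assumptions.

```python
import warnings


def pick_relocated(scores, percentage=40):
    """Return the best-scoring candidate at or above `percentage`, else None.

    `scores` maps a candidate element (any hashable stand-in) to a 0-100 score.
    """
    if not scores:
        return None
    best, top = max(scores.items(), key=lambda pair: pair[1])
    if top < percentage:
        # Report the best score seen so the caller can lower the threshold on purpose.
        warnings.warn(f"No candidate reached {percentage}%; top score was {top}%")
        return None
    return best
```

With the old default of 0, the highest-scoring candidate would always come back, however weak the match.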

🐛 Bug Fixes

  • Fixed Fetcher.configure(...) not applying to per-request calls. Same fix applied to AsyncFetcher.
  • Fixed incorrect request fingerprinting that caused duplicate requests in spiders by @yetval in #255.
  • Fixed the Adaptive scraping engine staying silent on weak matches. Combined with the threshold change above, you now get a warning instead of a misleading "best guess" element when relocation fails.
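
For context on the fingerprinting fix, here is a minimal sketch of what a request fingerprint is for. It is not the algorithm Scrapling uses: the canonicalisation rules (lowercased host, sorted query, dropped fragment) and the names request_fingerprint, is_duplicate, and seen are assumptions.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def request_fingerprint(method, url, body=b""):
    """Hash the parts of a request that decide whether two requests are the same."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Lowercase scheme and host, sort the query string, and drop the fragment.
    canonical = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, "")
    )
    digest = hashlib.sha256()
    for chunk in (method.upper().encode(), canonical.encode(), body):
        digest.update(chunk + b"\x00")
    return digest.hexdigest()


seen = set()


def is_duplicate(method, url, body=b""):
    """Return True if an equivalent request was already recorded in `seen`."""
    fingerprint = request_fingerprint(method, url, body)
    if fingerprint in seen:
        return True
    seen.add(fingerprint)
    return False
```

If two equivalent requests hash differently, the scheduler cannot tell they are the same and fetches both.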

Docs

  • Refreshed older code examples across the documentation to match the current version.
  • Improved the code copy-paste experience on the docs site and trimmed the agent skill so it uses fewer tokens per invocation.

🙏 Special thanks to the community for all the continuous testing and feedback


Release v0.4.7

17 Apr 21:09
54a080a

A focused update bringing eyes to your AI agents 📸

🚀 New Stuff and quality of life changes

  • Added a screenshot MCP tool that captures a page and returns it as a real MCP ImageContent block so the model can actually see it. The tool requires an open browser session, so you call open_session first (either dynamic or stealthy) and pass the session_id here. Supports PNG and JPEG, full-page captures, JPEG quality, and the usual readiness controls (wait, wait_selector, network_idle, timeout). (implements #244)
  • Added a custom session_id parameter to open_session so you can name sessions meaningfully ("search", "checkout") instead of the random 12-character hex default. By @hauntedhost in #243
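
The naming behaviour of open_session can be pictured with a toy registry. This is only a sketch: SessionRegistry is a hypothetical class, and rejecting an ID that is already open is an assumption rather than documented behaviour.

```python
import secrets


class SessionRegistry:
    """Toy registry; a real server would hold live browser sessions, not dicts."""

    def __init__(self):
        self._sessions = {}

    def open_session(self, session_type, session_id=None):
        # Fall back to 6 random bytes, which render as 12 hex characters.
        session_id = session_id or secrets.token_hex(6)
        if session_id in self._sessions:
            raise ValueError(f"session {session_id!r} is already open")
        self._sessions[session_id] = {"type": session_type}
        return session_id

    def get(self, session_id):
        return self._sessions[session_id]
```

A caller would open a session once, keep the returned ID, and pass it to later tools such as screenshot.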

🐛 Bug Fixes

  • Fixed FetcherSession state corruption and a lazy session close crash. By @yetval in #245
  • Fixed TypeError: Session.request() got an unexpected keyword argument 'block_ads' when using the CLI's --ai-targeted flag with HTTP commands. By @voidborne-d in #249 (Fixes #247)

Translations

  • Added a Brazilian Portuguese README translation by @rgomids in #250

🙏 Special thanks to the community for all the continuous testing and feedback


Release v0.4.6

13 Apr 13:36
ced9a8d

A focused update on browser stealth, privacy, and developer experience 🔒

🚀 New Stuff and quality of life changes

  • Added built-in ad blocking for browser fetchers. Pass block_ads=True to block requests to ~3,500 known ad and tracker domains at the route interception level -- no DNS, no TCP, instant abort. Can be combined with blocked_domains for custom lists. The MCP server and CLI --ai-targeted mode enable this automatically to save tokens and speed up page loads.
    from scrapling.fetchers import StealthyFetcher
    page = StealthyFetcher.fetch('https://example.com', block_ads=True)
  • Added DNS-over-HTTPS support to prevent DNS leaks when using proxies. Pass dns_over_https=True to route DNS queries through Cloudflare's DoH, so your real location isn't exposed through DNS resolution even when your HTTP traffic goes through a proxy.
    from scrapling.fetchers import StealthyFetcher
    page = StealthyFetcher.fetch('https://example.com', proxy='http://proxy:8080', dns_over_https=True)
  • Added page_setup callback for browser fetchers. A function that runs before page.goto(), letting you register event listeners, routes, or scripts that must be set up before the page navigates. Pairs with page_action (which runs after navigation). (Solves #237)
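    from scrapling.fetchers import DynamicFetcher
    
    def capture_websockets(page):
        # Runs before navigation, so WebSockets opened during page load are logged too.
        page.on("websocket", lambda ws: print(f"WS: {ws.url}"))
    
    page = DynamicFetcher.fetch('https://example.com', page_setup=capture_websockets)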
  • Added --block-ads and --dns-over-https CLI options to both fetch and stealthy-fetch commands.

🐛 Bug Fixes

  • Fixed Seconds type alias rejecting float values. Passing wait=1.5 or timeout=500.0 to browser fetchers would fail with a type error because the type alias incorrectly treated float as metadata instead of a type. by @kuishou68 in #240
  • Fixed duplicate ID segments in full-path selector generation. Elements with id attributes had their selector appended twice when generating full CSS/XPath paths, producing selectors like body > #main > #main > #target > #target. Also fixed full-path XPath emitting bare [@id='x'] predicates (invalid XPath) instead of *[@id='x']. by @sjhddh in #241
  • Fixed missing shell signature parameters. The interactive shell was missing blocked_domains, block_ads, retries, retry_delay, capture_xhr, executable_path, and dns_over_https from its function signatures.
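
The selector-generation fix can be pictured with a small sketch that emits exactly one segment per element. It is not Scrapling's generator: the (tag, id) input format and both function names are assumptions for illustration.

```python
def full_css_path(chain):
    """Build a full CSS path from (tag, id) pairs ordered root to target."""
    # One segment per element: "#id" when the element has an id, else its tag name.
    return " > ".join(f"#{el_id}" if el_id else tag for tag, el_id in chain)


def full_xpath(chain):
    """Build a full XPath; id predicates need a node test, hence the leading '*'."""
    return "//" + "/".join(f"*[@id='{el_id}']" if el_id else tag for tag, el_id in chain)


chain = [("body", None), ("div", "main"), ("span", "target")]
```

A bare [@id='main'] step has no node test and is rejected by XPath parsers, which is why each predicate is attached to *.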

🙏 Special thanks to the community for all the continuous testing and feedback


Release v0.4.5

07 Apr 04:22
cb449af

A focused update with one big quality-of-life feature for spider developers and a couple of important fixes 🎉

🚀 New Stuff and quality of life changes

  • Spider Development Mode: Iterating on a spider's parse() logic used to mean re-hitting the target servers on every run, which is slow, noisy, and a great way to get rate-limited while you're still figuring out your selectors. The new development mode caches every response to disk on the first run and replays them from disk on every subsequent run, so you can tweak your callbacks and re-run as many times as you want without making a single network request. Enable it with one class attribute:

    from scrapling.spiders import Spider
    
    class MySpider(Spider):
        name = "my_spider"
        start_urls = ["https://example.com"]
        development_mode = True
    
        async def parse(self, response):
            yield {"title": response.css("title::text").get("")}

    The cache lives in .scrapling_cache/{spider.name}/ by default and can be redirected anywhere with development_cache_dir. Two new stat counters, cache_hits and cache_misses, let you see how the cache performed. Cache replay bypasses download_delay, rate limiting, and the blocked-request retry path so iteration is as fast as the disk allows. Don't ship a spider with development_mode = True -- it's a development tool, not a production cache. See the docs for the full story.

  • Safer redirects by default: follow_redirects now defaults to "safe" across all HTTP fetchers, the MCP server, and the shell. Redirects are still followed, but ones targeting internal/private IPs (loopback, private networks, link-local) are rejected. This protects you from SSRF when scraping user-supplied URLs. Pass follow_redirects="all" to get the old behavior, or False to disable redirects entirely.
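
The idea behind "safe" redirects can be sketched with the standard library's ipaddress module. This is a simplified illustration, not the library's check: is_safe_redirect is a hypothetical helper, and it only inspects IP literals, whereas a real implementation would also need to resolve hostnames.

```python
import ipaddress
from urllib.parse import urlsplit


def is_safe_redirect(location):
    """Reject redirect targets whose host is a loopback, private, or link-local IP."""
    host = urlsplit(location).hostname
    if host is None:
        # Relative redirect: it stays on the host that was already accepted.
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A hostname, not an IP literal. A real check would resolve it first;
        # this sketch treats hostnames as allowed.
        return True
    return not (address.is_loopback or address.is_private or address.is_link_local)
```

The link-local case matters in practice because cloud metadata endpoints commonly live at 169.254.169.254.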

🐛 Bug Fixes

  • Force-stop no longer loses your checkpoint: Pressing Ctrl+C twice (force-stop) on a spider with crawldir enabled used to race against the checkpoint write -- the cancel scope would tear down the task before the pickle finished, leaving paused=False and triggering the cleanup path that deletes the previous checkpoint. The result was that force-stopping a long crawl could lose all the progress you were trying to save. The engine now writes the checkpoint before calling cancel_scope.cancel(), so a force-stop always preserves the latest pending state. By @voidborne-d in #230.
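
The ordering that the fix relies on (persist first, cancel second) looks roughly like this in a synchronous toy. The Engine class, the checkpoint.pkl file name, and the temp-file-then-rename step are assumptions for illustration; the real engine is asynchronous and uses a cancel scope.

```python
import pickle
import tempfile
from pathlib import Path


class Engine:
    """Toy engine: `pending` stands in for the queue of requests not yet crawled."""

    def __init__(self, crawldir):
        self.checkpoint = Path(crawldir) / "checkpoint.pkl"
        self.pending = []
        self.cancelled = False

    def save_checkpoint(self):
        # Write to a temp file, then rename, so readers never see a partial pickle.
        tmp = self.checkpoint.with_suffix(".tmp")
        tmp.write_bytes(pickle.dumps(self.pending))
        tmp.replace(self.checkpoint)

    def force_stop(self):
        self.save_checkpoint()  # persist the pending state first...
        self.cancelled = True   # ...and only then tear the crawl down


with tempfile.TemporaryDirectory() as crawldir:
    engine = Engine(crawldir)
    engine.pending = ["https://example.com/page/2"]
    engine.force_stop()
    restored = pickle.loads(engine.checkpoint.read_bytes())
```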

🙏 Special thanks to the community for all the continuous testing and feedback


Release v0.4.4

05 Apr 03:37
065bf1c

A new update with important spider improvements and bug fixes 🎉

🚀 New Stuff and quality of life changes

  • Added robots.txt compliance to the Spider framework with a new robots_txt_obey option. When enabled, the spider will automatically fetch and respect robots.txt rules before crawling, including Disallow, Crawl-delay, and Request-rate directives. Robots.txt files are fetched concurrently and cached per domain for the entire crawl. By @AbdullahY36 in #226
  • Added robots.txt cache pre-warming so all start_urls domains have their robots.txt fetched and parsed before the crawl loop begins, avoiding delays on the first request to each domain.
  • Added a new robots_disallowed_count stat to CrawlStats to track how many requests were blocked by robots.txt rules during a crawl.
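
To see what the three directives mean in practice, here is a sketch using Python's built-in urllib.robotparser. Scrapling itself parses robots.txt with protego (see the dependency note in this release), so this only illustrates the rules, not the library's code path; the robots.txt content and the "my-bot" user agent are made up.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
Request-rate: 5/10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("my-bot", "https://example.com/articles/1")
blocked = parser.can_fetch("my-bot", "https://example.com/private/report")
delay = parser.crawl_delay("my-bot")   # seconds to wait between requests
rate = parser.request_rate("my-bot")   # named tuple: (requests, seconds)
```

Here Request-rate: 5/10 reads as "at most 5 requests every 10 seconds".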

Check out the details on the website.

🐛 Bug Fixes

  • Fixed a critical MRO issue with ProxyRotator where the _build_context_with_proxy stub was shadowing the real implementation from child classes, causing proxy rotation to always raise NotImplementedError (Fixes #215). Thanks @yetval
  • Fixed a page pool leak when using per-request proxy rotation with browser sessions. Pages created inside temporary contexts were not removed from the pool on cleanup, leading to stale references accumulating over time. By @yetval in #223
  • Fixed a missing type assertion in the static fetcher where curl_cffi could return None from session.request(), causing downstream errors.
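
For readers unfamiliar with how a stub can shadow a real method, the sketch below shows one way it can happen through Python's method resolution order (MRO). The class names are hypothetical, and reordering the bases is just one possible remedy; the notes above do not say how the actual fix was made.

```python
class ProxyRotationMixin:
    """Hypothetical mixin: declares a hook that a concrete session must provide."""

    def _build_context_with_proxy(self, proxy):
        raise NotImplementedError

    def fetch_via(self, proxy):
        return self._build_context_with_proxy(proxy)


class BrowserSession:
    """Hypothetical base class holding the real implementation."""

    def _build_context_with_proxy(self, proxy):
        return f"context via {proxy}"


class Shadowed(ProxyRotationMixin, BrowserSession):
    pass  # the mixin comes first in the MRO, so its stub wins


class Resolved(BrowserSession, ProxyRotationMixin):
    pass  # the real implementation is found before the stub
```

Inspecting Shadowed.__mro__ shows the lookup order directly, which is a quick way to diagnose this class of bug.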

Other

  • Updated dependencies, which brings in the latest fingerprints along with other upstream changes.
  • Added protego as a new dependency under the fetchers optional group for robots.txt parsing.

🙏 Special thanks to the community for all the continuous testing and feedback


Release v0.4.3

30 Mar 03:50
e173f81

A new update with many important changes 🎉

🚀 New Stuff and quality of life changes

  • Added a new MCP tool to open a persistent normal/stealthy browser to keep using it with the rest of the tools, and another new tool to close it. (Examples)
  • Added a new MCP tool to list all existing browser sessions. Aimed to be used with the new tools.
  • Added a new option to browser sessions to automatically collect all background requests that happen during a request (Solves #159) [Examples].
  • Added a new sanitizer to protect the MCP server from common Prompt Injection attacks by removing hidden/invisible content.
  • Added a new command-line option called --ai-targeted to the web scraping commands. It tailors the output for AI consumption and guards against common prompt injection attacks, as the MCP server does.
  • Added a new option to browser sessions called executable_path to allow setting a custom browser path (Solves #202)
  • Refactored the MCP server code to be easily maintained and unified all tools to be async.
  • Refactored the CLI commands code to be easily maintained and shorter by 210 lines.

🐛 Bug Fixes

  • Fixed the HTTP method not being preserved across retries in the spider session by @karesansui-u in #201
  • Added a max retry limit when getting page content to prevent an infinite loop by @haosenwang1018 & @D4Vinci in #197
  • Replaced a bare raise with return False in _restore_from_checkpoint by @haosenwang1018 in #196
  • Replaced get_all with getall in TextHandler to match the Selector class.
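
Two of the fixes above (a bounded retry count and keeping the HTTP method across retries) can be shown together in one sketch. The helper fetch_with_retries and the send callable are hypothetical; the real retry paths live inside the spider session and the browser fetchers.

```python
def fetch_with_retries(send, method, url, max_retries=3):
    """Call `send(method, url)` until it succeeds or `max_retries` attempts are used.

    `send` is a hypothetical callable that returns a response or raises ConnectionError.
    """
    last_error = None
    for _attempt in range(max_retries):
        try:
            # Reuse the original method on every attempt instead of falling back to GET.
            return send(method, url)
        except ConnectionError as error:
            last_error = error
    raise RuntimeError(f"gave up after {max_retries} attempts") from last_error
```

Bounding the loop with range() guarantees termination even when every attempt fails.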

Coverage/tests improvement

  • Added _normalize_credentials edge case coverage tests by @Bortlesboat in #192
  • Added save/retrieve round-trip and core storage coverage tests by @haosenwang1018 in #193
  • Added coverage for TextHandler regex paths and TextHandlers.re() by @haosenwang1018 in #194
  • Added edge case tests for filter, iterancestors, and find_similar by @awanawana in #200

Agent Skill improvement

  • Fixed broken markdown links in skill references by @yetval in #204
  • Improved the skill structure so it better satisfies Clawhub validation.
  • Forced the skill to use the --ai-targeted command-line option when scraping through CLI commands.

Docs improvement

  • Added Korean README translation by @greatsk55 in #187
  • CJK Latin spacing fixes for the Chinese and Japanese READMEs.
  • Fixed broken links from the old website design.

🙏 Special thanks to the community for all the continuous testing and feedback


Release v0.4.2

08 Mar 23:37
19c21c1

A new maintenance update with important changes

Bug fixes

  • The function get_all_text() now captures tail text nodes. This will make the MCP server and commands see text that was missed before (#168). Thanks @mhillebrand
  • Referer now returns a bare Google URL instead of a Google search URL. The previous logic was incorrect and may have produced a fingerprinting signal (#179). Thanks @Bortlesboat
  • Fixed an issue with extra flags concatenation in all browsers. Thanks @rostchri
  • Fixed a type hints issue with Python versions below 3.12 that caused it to crash. (Solves #163)
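
The "tail text" mentioned in the get_all_text() fix is easy to miss, so here is a sketch of the text/tail model using the standard library's ElementTree. It is not Scrapling's implementation; both helper functions are hypothetical and exist only to contrast the two behaviours.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<p>Hello <b>big</b> world</p>")


def text_without_tails(element):
    """Collect only each element's leading .text, which misses text after child tags."""
    return "".join(node.text or "" for node in element.iter())


def text_with_tails(element):
    """Collect text in document order, including the tail text after each child."""
    return "".join(element.itertext())
```

In this tree, " world" is stored as the tail of the b element rather than as text of p, which is why a collector that reads only .text drops it.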

Other

  • Added an Agent Skill for Claude Code / OpenClaw and other AI agentic tools.
  • Added the Agent Skill to Clawhub.
  • Updated all browsers and Playwright versions to the latest.
  • Added a French translation to the main README file.

🙏 Special thanks to the community for all the continuous testing and feedback


Release v0.4.1

27 Feb 04:12
2b50123

A new update with many important changes

🚀 New Stuff and quality of life changes

  • Improved regex precision for Cloudflare challenge detection (Thanks to @RinZ27 #133)
  • Improved the speed and efficiency of the Cloudflare solver. Now it is nearly twice as fast.
  • Improved the Cloudflare solver to handle the case where websites sometimes show the Cloudflare page twice before redirecting to the main website.
  • Improved the stealthy browser's stealth mode and speed by removing the injected JS files.
  • Improved the MCP schema so it is accepted by OpenCode (Thanks to @robin-ede #137)
  • Made the MCP schema even more MCP-friendly so it is accepted by VS Code Copilot and other strict tools. (Solves #150)
  • Reduced the MCP server's token consumption by a large margin by stripping useless HTML tags when the main_content_only option is activated.
  • Fixed the PyPI page and added the files to register the MCP server to the MCP servers registry.
  • Added a new code snippet showing how to install the browser dependencies through code instead of the command line, which makes automation easier.
  • Improved all workflows by using the latest actions versions (Thanks to @salmanmkc #143/#144)

🙏 Special thanks to the community for all the continuous testing and feedback

Release v0.4

15 Feb 05:13
04d796b

The biggest release of Scrapling yet: introducing the Spider framework, proxy rotation, and major parser improvements

This release brings a fully async spider/crawling framework, intelligent proxy management, and significant API changes that make Scrapling more powerful and consistent. Please review the breaking changes section carefully before upgrading.

🕷️ Spider Framework

A new async crawling framework built on top of anyio for structured, large-scale scraping:

from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}

MySpider().start()
  • Scrapy-like Spider API: Define spiders with start_urls, async parse callbacks, Request/Response objects, and priority queue.
  • Concurrent Crawling: Configurable concurrency limits, per-domain throttling, and download delays.
  • Multi-Session Support: Unified interface for HTTP requests, and stealthy headless browsers in a single spider - route requests to different sessions by ID. Supports lazy session initialization.
  • Pause & Resume: Checkpoint-based crawl persistence. Press Ctrl+C to gracefully shut down; then restart to resume from where you left off.
  • Streaming Mode: Stream scraped items as they arrive via async for item in spider.stream() with real-time stats - ideal for UI, pipelines, and long-running crawls.
  • Blocked Request Detection: Automatic detection and retry of blocked requests with customizable logic.
  • Built-in Export: Export results through hooks and your own pipeline or the built-in JSON/JSONL with result.items.to_json() / result.items.to_jsonl() respectively.
  • Lifecycle hooks: on_start(), on_close(), on_error(), on_scraped_item(), and more hooks for full control over the crawl lifecycle.
  • Detailed crawl stats: track requests, responses, bytes, status codes, proxies, per-domain/session breakdowns, log level counts, and more.
  • uvloop support: Pass use_uvloop=True to spider.start() for faster async execution when available.
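
The streaming and concurrency-limit ideas in the list above can be sketched with plain asyncio. Note the assumptions: the framework is built on anyio, while this toy uses asyncio from the standard library, the network fetch is simulated, and stream() and collect() here are stand-alone functions rather than the spider's own methods.

```python
import asyncio


async def stream(urls, concurrency=2):
    """Yield one item per URL as soon as its (simulated) fetch finishes."""
    queue = asyncio.Queue()
    limit = asyncio.Semaphore(concurrency)  # caps how many fetches run at once

    async def worker(url):
        async with limit:
            await asyncio.sleep(0)  # stand-in for a real network request
            await queue.put({"url": url})

    tasks = [asyncio.create_task(worker(url)) for url in urls]
    for _ in tasks:
        yield await queue.get()
    await asyncio.gather(*tasks)


async def collect(urls):
    return [item async for item in stream(urls)]


items = asyncio.run(
    collect(["https://example.com/1", "https://example.com/2", "https://example.com/3"])
)
```

Consuming items with async for means a UI or pipeline can act on each result without waiting for the whole crawl to finish.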

A new section with the full details has been added to the website.

🔄 Proxy Rotation

  • New ProxyRotator class with thread-safe rotation. Works with all fetchers and sessions:
    from scrapling import ProxyRotator
    from scrapling.fetchers import Fetcher

    url = "https://example.com"  # placeholder target
    rotator = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])
    Fetcher.get(url, proxy_rotator=rotator)
  • Custom rotation strategies: Make your own proxy rotation logic
  • Per-request proxy override: Pass proxy= to any individual get()/post()/fetch() call to override the session proxy for that request.
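
As a rough picture of what thread-safe rotation involves, here is a minimal round-robin sketch. RoundRobinRotator and next_proxy are hypothetical names and do not reflect the ProxyRotator API; the sketch only shows the lock-around-an-iterator idea.

```python
import itertools
import threading


class RoundRobinRotator:
    """Hypothetical rotator: hands out proxies in order, safely across threads."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(list(proxies))
        self._lock = threading.Lock()

    def next_proxy(self):
        # The lock keeps two threads from advancing the iterator at the same time.
        with self._lock:
            return next(self._cycle)


rotator = RoundRobinRotator(["http://proxy1:8080", "http://proxy2:8080"])
picked = [rotator.next_proxy() for _ in range(3)]
```

A custom strategy would replace the cycle with its own selection logic, for example weighting proxies by recent failures.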

🌐 Browser Fetcher Improvements

  • Domain blocking: New blocked_domains parameter on DynamicFetcher/StealthyFetcher to block requests to specific domains (subdomains matched automatically).
  • Automatic retries: Browser fetchers now retry on failure with retries (default: 3) and retry_delay (default: 1s) parameters. Includes proxy-aware error detection.
  • Response metadata: Response.meta dict automatically stores the proxy used, and merges request metadata.
  • Response.follow(): Create follow-up Request objects with automatic referer flow, designed for the spider system.
  • No autoplay: Browser sessions are now blocking autoplay content, which caused issues before.
  • Speed: Improved stealth and speed by adjusting browser flags.
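
The "subdomains matched automatically" rule for blocked_domains can be sketched as a suffix check on the request host. This is an assumption about the matching rule, written as a hypothetical is_blocked helper, not the fetchers' actual route handler.

```python
from urllib.parse import urlsplit


def is_blocked(url, blocked_domains):
    """Return True if the URL's host equals a blocked domain or is a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in blocked_domains)


blocked = {"ads.example.com", "tracker.net"}
```

Matching on "." + domain rather than on the bare suffix keeps nottracker.net from being caught by a rule for tracker.net.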

🔧 Bug Fixes & Improvements

  • Parser optimization: Optimized the parser for repeated operations, improving performance.
  • Errored pages: Fixed a bug that caused the browser to not close when pages gave errors.
  • Empty body: Handle responses with empty body.
  • Playwright loop: Fixed the Playwright loop being left open when a CDP connection fails.
  • Type safety: Fixed all mypy errors and added type hints across untyped function bodies. Added mypy and pyright to the CI workflow.

⚠️ Breaking Changes

  • css_first/xpath_first removed: Use css('.selector').first, css('.selector')[0], or css('.selector').get() instead.
  • All selection now returns Selectors: css('::text'), xpath('//text()'), css('::attr(href)'), and xpath('//@href') now return Selectors (wrapping text nodes in Selector objects with tag="#text") instead of TextHandlers. This makes the API consistent across all selection methods and the type hints.
  • Response.body is always bytes: Previously could be str or bytes, now always returns bytes.
  • get()/getall() behavior: On Selector: get() returns TextHandler (serialized HTML or text value), getall() returns TextHandlers. Aliases: extract_first = get, extract = getall. Old get_all() on Selectors is removed.
  • Selectors.first/.last: Safe accessors that return Selector | None instead of raising IndexError.
  • Internal constants renamed: DEFAULT_FLAGS → DEFAULT_ARGS, DEFAULT_STEALTH_FLAGS → STEALTH_ARGS, HARMFUL_DEFAULT_ARGS → HARMFUL_ARGS, DEFAULT_DISABLED_RESOURCES → EXTRA_RESOURCES.
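
The safe-accessor idea behind Selectors.first and Selectors.last can be illustrated with a plain list subclass. SafeList is a hypothetical stand-in, not the Selectors class, and the string items stand in for Selector objects.

```python
class SafeList(list):
    """Hypothetical list subclass mirroring the .first/.last idea described above."""

    @property
    def first(self):
        return self[0] if self else None

    @property
    def last(self):
        return self[-1] if self else None


matches = SafeList(["h1", "h2"])
empty = SafeList()
```

When migrating away from css_first, note that an empty result with [0] still raises IndexError, while .first gives None.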

🔨 Other Changes

  • Dependency changes: Replaced tldextract with tld, removed internal _html_utils.py in favor of w3lib.html.replace_entities, added typing_extensions as a hard requirement.
  • Docs overhaul: Full switch from MkDocs to Zensical, new spider documentation section, updated all existing pages, and added new API references.

🙏 Special thanks to our Discord community for all the continuous testing and feedback

