Releases: spider-rs/spider
v2.30.23
What's Changed
Chrome now manages document reloading to prevent scripts from triggering infinite page reloads.
The firewall feature flag now also enables firewall protection at the network level in Chrome, improving the blocking of ads, trackers, and malicious websites.
- chore(chrome): add infinite loop document reload protection
- chore(chrome): add to block list
- chore(chrome): add firewall feature flag.
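To illustrate the idea behind the reload protection noted above, here is a minimal sketch that caps how often a single document may reload. ReloadGuard and its methods are hypothetical names used for illustration only; the notes do not describe how spider implements the protection inside Chrome.

```rust
use std::collections::HashMap;

/// Hypothetical guard that caps how many times one document may reload.
/// Illustrative only; not a type from spider.
pub struct ReloadGuard {
    max_reloads: u32,
    counts: HashMap<String, u32>,
}

impl ReloadGuard {
    pub fn new(max_reloads: u32) -> Self {
        Self {
            max_reloads,
            counts: HashMap::new(),
        }
    }

    /// Record a reload request for `url` and report whether it may proceed.
    pub fn allow_reload(&mut self, url: &str) -> bool {
        let count = self.counts.entry(url.to_string()).or_insert(0);
        if *count >= self.max_reloads {
            // Limit reached: treat further requests as a reload loop.
            return false;
        }
        *count += 1;
        true
    }
}
```

With a limit of 2, the first two reloads of a URL are allowed and the third is refused, while other URLs keep their own counters.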
Full Changelog: v2.30.3...v2.30.23
v2.30.3
What's Changed
Use the firewall feature flag to protect against malicious websites. This release also adds lazy loading of Chrome rendering in smart mode.
- feat(firewall): add start of spider_firewall
- chore(smart): fix missing bytes transferred
- feature(smart): add lazy load chrome
- perf(bytes): remove BytesMut
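As a rough picture of what host-based blocking involves, the sketch below checks a host and each of its parent domains against a block list. Blocklist and its entries are hypothetical placeholders; this is not the spider_firewall API or its data.

```rust
use std::collections::HashSet;

/// Hypothetical block list; entries are placeholders, not spider_firewall data.
pub struct Blocklist {
    hosts: HashSet<String>,
}

impl Blocklist {
    pub fn new(hosts: &[&str]) -> Self {
        Self {
            hosts: hosts.iter().map(|h| h.to_ascii_lowercase()).collect(),
        }
    }

    /// True when `host` or any parent domain of it is on the list.
    pub fn is_blocked(&self, host: &str) -> bool {
        let host = host.to_ascii_lowercase();
        let mut candidate = host.as_str();
        loop {
            if self.hosts.contains(candidate) {
                return true;
            }
            // Strip the leftmost label and retry with the parent domain.
            match candidate.find('.') {
                Some(idx) => candidate = &candidate[idx + 1..],
                None => return false,
            }
        }
    }
}
```

Matching on whole labels means a listed "ads.example" also blocks "cdn.ads.example" but leaves "badads.example" alone.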
Full Changelog: v2.27.66...v2.30.3
v2.27.66
What's Changed
- chore(cli): trigger help page on missing arguments by @pwnwriter in #265
- chore(chrome): add connection retry ws
- chore(smart): add initial http fallback
- chore(website): add direct proxy control
- chore(website): fix scrape hang [#268]
New Contributors
- @pwnwriter made their first contribution in #265
Full Changelog: v2.27.50...v2.27.66
v2.27.50
What's Changed
Web page normalization now prevents duplicate content, crawl traps, and similar pages from being crawled repeatedly.
Websites that target ports other than 80 and 443 can now be crawled.
- feat(page): add relative directory url handling
- chore(website): fix relative page merging links
- chore(serde): fix cron compile configuration
- chore(chrome): update [email protected]
- chore(page): add port validation links
- chore(website): fix signature compile non disk feature flag
- chore(rand): update [email protected]
- chore(abs): clear query pairs [#257]
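One way to picture duplicate-content detection is a content signature: hash each page body after collapsing whitespace and skip bodies whose signature has already been seen. The sketch below assumes that approach, which the "signature" entry in the list above hints at. content_signature and SeenContent are hypothetical names, and spider's actual normalization may work differently.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hash page content word by word so that copies differing only in
/// whitespace share one signature. Hypothetical helper, not spider's.
pub fn content_signature(html: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    for word in html.split_whitespace() {
        word.hash(&mut hasher);
    }
    hasher.finish()
}

/// Tracks the signatures of pages that were already processed.
pub struct SeenContent {
    signatures: HashSet<u64>,
}

impl SeenContent {
    pub fn new() -> Self {
        Self {
            signatures: HashSet::new(),
        }
    }

    /// Returns true the first time a given content signature is seen.
    pub fn is_new(&mut self, html: &str) -> bool {
        self.signatures.insert(content_signature(html))
    }
}
```

A crawler built this way would stop following links from a page when is_new returns false, which is what breaks crawl traps that serve the same body under many URLs.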
Full Changelog: v2.26.27...v2.27.50
v2.26.27
What's Changed
- add auto find sitemap url on 404 or network error.
- fix chrome_cache_hybrid compile.
- add cache_chrome_hybrid_mem flag to use memory instead of disk.
- fix q draining across website methods
- fix crawl depth handling
- fix worker init background connect
- add proper status code from errors
Full Changelog: v2.26.1...v2.26.27
v2.26.1
What's Changed
This release improves performance by skipping URL parsing on each page.
You can now also pass a second parameter to the page link methods to collect links against a new domain target.
Targeting the correct root domain when parsing links is now handled across features.
If you used page::Page::take_url directly, you may need to first call page::Page::set_url_parsed_direct_empty() or the page::Page::get_url_parsed() method.
- perf(cli): add page links direct return
- cli(scrape): now outputs full page links
Full Changelog: v2.24.15...v2.26.1
v2.24.15
What's Changed
Add a callback to perform validation using spider::page::Page.
You can now use the basic feature flag to easily disable io_uring on Linux and still get the default features with "default-features = false".
- feat(website): add on_should_crawl_callback [#241]
- feat(page): add blocked_crawl [#242]
- chore(disk): fix cfg aho_corasick
- chore(fs): remove tendril crate
- chore(page): fix crawling initial redirects
- chore(chrome): fix compile fs flag
- feat(cargo): add basic feature flag
- chore(connect): fix compile missing libc
- feat(page): add page_error_status_details
- perf(page): remove parsing url directly
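The validation callback mentioned above can be pictured with the following self-contained sketch. Page and Crawler here are hypothetical stand-ins, not spider::page::Page or spider's Website, and the real on_should_crawl_callback signature should be checked against the crate documentation.

```rust
/// Stand-in for a fetched page; hypothetical, not spider::page::Page.
pub struct Page {
    pub url: String,
    pub status: u16,
}

/// Hypothetical crawler that consults a callback before following a page.
pub struct Crawler {
    should_crawl: Option<fn(&Page) -> bool>,
}

impl Crawler {
    pub fn new() -> Self {
        Self { should_crawl: None }
    }

    /// Register the validation callback (builder style).
    pub fn on_should_crawl(mut self, callback: fn(&Page) -> bool) -> Self {
        self.should_crawl = Some(callback);
        self
    }

    /// Pages are accepted by default when no callback is set.
    pub fn accepts(&self, page: &Page) -> bool {
        self.should_crawl.map_or(true, |cb| cb(page))
    }
}

/// Example rule: only follow successful pages outside /private.
pub fn only_public_ok(page: &Page) -> bool {
    page.status == 200 && !page.url.contains("/private")
}
```

Passing only_public_ok to on_should_crawl makes the crawler reject any page that returned a non-200 status or lives under /private, while a crawler without a callback accepts everything.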
Full Changelog: v2.23.7...v2.24.15
v2.23.7
What's Changed
Linux now uses io_uring for the DNS connect phase.
If you do not have a recent version of Linux installed, disable the io_uring feature flag.
- feat(io_uring): add io_uring for connect_phase linux
- chore(fs): fix feature flag compile fs
Full Changelog: v2.22.19...v2.23.7
v2.22.19
What's Changed
This release brings in SQLite for improved memory handling with the feature flags disk_native_tls, disk, and disk_aws.
SQLite is used in a hybrid manner with memory in order to maintain performance.
With disk handling and our string interning, crawled URLs can enter the billions of resources, or be unbounded with EFS attached.
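For readers unfamiliar with string interning, the sketch below shows the basic technique: each distinct URL is mapped to a small integer id, so repeated URLs cost only an id. Interner is a hypothetical type written for illustration; it is not the interner spider uses and it does not cover the SQLite side.

```rust
use std::collections::HashMap;

/// Minimal string interner: each distinct URL gets one small integer id.
/// Illustrative only; not the interner spider uses.
#[derive(Default)]
pub struct Interner {
    ids: HashMap<String, u32>,
    urls: Vec<String>,
}

impl Interner {
    /// Return the id for `url`, assigning a new one on first sight.
    pub fn intern(&mut self, url: &str) -> u32 {
        if let Some(&id) = self.ids.get(url) {
            return id;
        }
        let id = self.urls.len() as u32;
        self.ids.insert(url.to_string(), id);
        self.urls.push(url.to_string());
        id
    }

    /// Look an id back up; unknown ids yield None.
    pub fn resolve(&self, id: u32) -> Option<&str> {
        self.urls.get(id as usize).map(String::as_str)
    }
}
```

Interning the same URL twice returns the same id, so a visited set can store 4-byte ids instead of full strings.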
Other Changes
- chore(website,page): fix concurrent initial scoped access to lazy_static!
- chore(chrome): add more network block layers for chrome
- chore(chrome): remove smart mode default idle_dom usage
- perf(page): dynamic rewriter chunk size
- chore(website): add connect layer concurrency limit apply
- perf(runtime): add dedicated thread for request connect
Full Changelog: v2.21.33...v2.22.19
v2.21.33
What's Changed
- Fix http crawling past first page
- Fix safe handling abs urls
Full Changelog: v2.21.27...v2.21.33