
New scraper to fetch a website from web.archive.org #1243

Open
vitaly-zdanevich opened this issue Dec 21, 2024 · 2 comments
Labels: Scraper Needed (We need to build a dedicated scraper for this website)

Comments

@vitaly-zdanevich

No description provided.

@benoit74
Contributor

benoit74 commented Jan 6, 2025

This is not the purpose of Zimit, but it is definitely doable. Most probably a different scraper is needed. I will move this issue to the zim-requests repo.

I don't know whether the Internet Archive offers downloading a website as a WARC file; that would simplify things, since we would only have to download the WARC and reuse warc2zim to transform it into a ZIM.
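As a rough illustration of the alternative, here is a minimal sketch (not an existing scraper) that enumerates captures of a site via the Wayback Machine CDX API and fetches the raw archived responses using the `id_` URL modifier; the helper names are hypothetical, and the fetched responses would still need to be written into a WARC (for example with warcio) before warc2zim could turn them into a ZIM:

```python
# Sketch only: list Wayback Machine captures for a site prefix and fetch
# the un-rewritten archived content. Helper names are illustrative.
import requests

CDX_API = "https://web.archive.org/cdx/search/cdx"

def list_captures(url_prefix, limit=50):
    """Return (timestamp, original_url) pairs for captures under url_prefix."""
    params = {
        "url": url_prefix,
        "matchType": "prefix",       # all URLs starting with url_prefix
        "output": "json",
        "filter": "statuscode:200",  # keep only successful captures
        "collapse": "urlkey",        # one capture per distinct URL
        "limit": limit,
    }
    rows = requests.get(CDX_API, params=params, timeout=30).json()
    # First row is the header: ["urlkey", "timestamp", "original", ...]
    return [(row[1], row[2]) for row in rows[1:]]

def fetch_raw_capture(timestamp, original_url):
    """Fetch the archived response without Wayback rewriting ('id_' modifier)."""
    raw_url = f"https://web.archive.org/web/{timestamp}id_/{original_url}"
    return requests.get(raw_url, timeout=30).content

if __name__ == "__main__":
    for ts, url in list_captures("ahlalhdeeth.com/vb/", limit=5):
        body = fetch_raw_capture(ts, url)
        print(ts, url, len(body), "bytes")
```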

Note that financial sponsorship or a contributor will be needed to move this issue forward; Kiwix does not have the resources to implement this in the coming months.

Do you have website(s) in mind that we should use to test the scraper?

@benoit74 benoit74 transferred this issue from openzim/zimit Jan 6, 2025
@benoit74 benoit74 added the Scraper Needed label Jan 6, 2025
@benoit74 benoit74 changed the title Please provide a way to fetch a website from web.archive.org New scraper to fetch a website from web.archive.org Jan 6, 2025
@hamoudak

hamoudak commented Jan 24, 2025

I have some websites that vanished from the web; they were once at the top of their field, but they are in Arabic, if you don't mind.
I tried "Webrecorder ArchiveWeb.page" but failed, so maybe you need to talk to the Internet Archive about this someday.
For my part, I used the "SingleFile" extension to save as many of the most valuable topics as I could. Here are the websites:
https://ahlalhdeeth.com
https://web.archive.org/web/20140122061007/http://ahlalhdeeth.com/vb/index.php

https://www.ahlalloghah.com
https://web.archive.org/web/20111011184930/http://www.ahlalloghah.com/index.php
