Add compatibility so that both URL path types are supported (absolute and relative) #105
Comments
Hello @suppadeliux! Does the following configuration, with several sites listed in `start_urls`, answer your need?

```json
{
  "start_urls": ["https://www.mysite1.com", "https://www.mysite2.com"]
}
```

Sorry if I misunderstood your issue.

PS: I transferred your issue into the docs-scraper repo.
For more documentation, you can check out the README of this repo (docs-scraper).
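[Editorial illustration, not part of the original thread: a minimal docs-scraper configuration along the lines the maintainer suggests might look like the sketch below. The `index_uid` value and the selectors are placeholders, not values taken from this issue.]

```json
{
  "index_uid": "docs",
  "start_urls": [
    "https://www.mysite1.com",
    "https://www.mysite2.com"
  ],
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "text": "p"
  }
}
```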
Hello @curquiza, and thanks for taking the time to answer my question. I have many instances of my documentation website hosted in different places. What I wish I could do is run the scraper only once and have relative URLs in my index instead of absolute ones (like on the Meilisearch docs site). So when I get the response from my Meilisearch instance, I will only have relative paths. I hope it clears things up a little bit 👍
I understand now. Unfortunately, and if I'm not wrong, there is no way to change the URL format (absolute vs relative) that the scraper stores in your documents.
We have clients for many languages here, so you can pick your favorite to update your documents yourself: https://github.com/meilisearch/integration-guides#-sdks-for-meilisearch-api
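[Editorial illustration of that workaround, not from the thread: a small script could fetch the already-indexed documents and rewrite their URL field to a relative path. The index name, API key, auth header format, and the assumption that each document stores its page address in a `url` attribute are all hypothetical; adjust them to your setup.]

```python
# Sketch only: rewrite the `url` field of indexed documents to relative paths.
# "docs", "masterKey", the local instance address and the `url` attribute name
# are assumptions, not values taken from this issue.
from urllib.parse import urlparse

import requests

MEILISEARCH_URL = "http://localhost:7700"
API_KEY = "masterKey"
INDEX_UID = "docs"

# Recent Meilisearch versions use a Bearer token; older ones used X-Meili-API-Key.
headers = {"Authorization": f"Bearer {API_KEY}"}

# Fetch a batch of documents (newer versions wrap them in a "results" key).
resp = requests.get(
    f"{MEILISEARCH_URL}/indexes/{INDEX_UID}/documents",
    params={"limit": 1000},
    headers=headers,
)
resp.raise_for_status()
payload = resp.json()
documents = payload["results"] if isinstance(payload, dict) else payload

for doc in documents:
    parsed = urlparse(doc["url"])
    # Keep only the path, query and fragment; drop the scheme and host.
    doc["url"] = parsed._replace(scheme="", netloc="").geturl()

# Re-send the full documents; Meilisearch replaces documents sharing a primary key.
update = requests.post(
    f"{MEILISEARCH_URL}/indexes/{INDEX_UID}/documents",
    json=documents,
    headers=headers,
)
update.raise_for_status()
```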
We need a PR that adds compatibility with both path techniques. Thanks for raising this 🔥 Feel free to implement it; otherwise we'll wait for a contributor to do so.
@suppadeliux I had this issue as well on my end and ended up writing a really hacky patch that simply makes all URLs relative. This works for my very narrow use case and will very likely break for yours, but in the off chance this patch can help you, here it is:

```diff
diff --git a/scraper/src/documentation_spider.py b/scraper/src/documentation_spider.py
index 88bd125..704b13d 100644
--- a/scraper/src/documentation_spider.py
+++ b/scraper/src/documentation_spider.py
@@ -13,6 +13,8 @@ import os
 # End of import for the sitemap behavior
+from urllib.parse import urlparse
+
 from scrapy.spidermiddlewares.httperror import HttpError
 from scrapy.exceptions import CloseSpider
@@ -148,6 +150,11 @@ class DocumentationSpider(CrawlSpider, SitemapSpider):
         return super()._parse(response, **kwargs)

     def add_records(self, response, from_sitemap):
+
+        parsedURL = urlparse(response.url)
+        response = response.replace(url=parsedURL._replace(scheme="", netloc=None).geturl())
+        print("Changed {} to relative URL {}".format(parsedURL.geturl(), response.url))
+
         records = self.strategy.get_records_from_response(response)
         self.meilisearch_helper.add_records(records, response.url, from_sitemap)
```
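[Editorial note, not part of the thread: the core of that patch is the `urlparse(...)._replace(...)` trick, which strips the scheme and host while keeping the path, query and fragment. A quick standalone check with a made-up URL; the patch passes `netloc=None`, which behaves the same as `netloc=""` here.]

```python
from urllib.parse import urlparse

# Hypothetical absolute URL, just to show the transformation the patch applies.
absolute = "https://www.mysite1.com/docs/getting-started?lang=en#install"
relative = urlparse(absolute)._replace(scheme="", netloc="").geturl()

print(relative)  # -> /docs/getting-started?lang=en#install
```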
Does it still work with absolute paths? It could be an acceptable solution. If you don't have time to try it out, no problem :)
@bidoubiwa This code always changes the URL from absolute to relative. It would need to be adapted to provide an option to enable/disable this feature. Unfortunately, the docs-scraper codebase is a little hard to follow and my Python skills are lacking, so I can't really provide a better solution.
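[Editorial sketch of that adaptation, not an existing docs-scraper option: the relativization could be gated behind an environment variable. Both the variable name and the helper below are invented for illustration.]

```python
# Sketch: make the URL relativization from the patch above opt-in.
# MEILISEARCH_RELATIVE_URLS and maybe_relativize are hypothetical, not part of docs-scraper.
import os
from urllib.parse import urlparse


def maybe_relativize(url: str) -> str:
    """Return `url` with scheme and host stripped when the hypothetical
    MEILISEARCH_RELATIVE_URLS environment variable is set to "true"."""
    if os.environ.get("MEILISEARCH_RELATIVE_URLS", "false").lower() != "true":
        return url
    return urlparse(url)._replace(scheme="", netloc="").geturl()
```

`add_records` could then call `response.replace(url=maybe_relativize(response.url))` before building the records, leaving the default behavior (absolute URLs) untouched.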
As this repo is now low-maintenance, this PR is no longer relevant today. I'm closing all issues that are not bugs. |
Hello,
I am currently working on a documentation website generated with Jekyll, and Meilisearch has been pretty easy to add, with the docs-scraper and the docs-searchbar.
In fact, I have many instances of my documentation website hosted in different places.
That means I have to run docs-scraper for each site (on each update of the repository).
I wish I could run only one scraper for all my sites and be independent of where each documentation site is hosted. So my question is: is there a way to have only relative URLs in my index, so that the scraped documents don't depend on the host?
I guess I could do that by overriding some of the logic in the scraper's source code, but is there another way? (Maybe someone else has already thought about or discussed this.)
Thanks in advance!!