Add compatibility so that both URL path types are supported (absolute and relative) #105


Closed
suppadeliux opened this issue Mar 15, 2021 · 9 comments · May be fixed by #477
Labels
enhancement New feature or request question Further information is requested

Comments

@suppadeliux

Hello,

I am currently working on a documentation website generated with Jekyll, and Meilisearch has been pretty easy to add with the docs-scraper and the docs-searchbar.

In fact, I have many instances of my documentation website hosted in different places.

That means that I have to run the docs-scraper for each site (on every repository update).

I wish I could run only one scraper for all my sites and be independent of where each documentation site is hosted. So my question is:

  • Is it possible to replace absolute URLs with relative ones in docs-scraper?

I guess I could do that by overriding some of the logic in the scraper's source code, but is there another way? (Maybe someone else has already thought about or discussed this.)

Thanks in advance!!

@curquiza
Member

curquiza commented Mar 16, 2021

Hello @suppadeliux!

Does the start_urls option work for your use case?
https://github.com/meilisearch/docs-scraper#start_urls
You should be able to define the absolute URL in the array like:

{
  "start_urls": ["https://www.mysite1.com", "https://www.mysite2.com"]
}

Sorry if I misunderstood your issue.

PS: I transferred your issue to the docs-scraper repo.

@curquiza curquiza transferred this issue from meilisearch/meilisearch Mar 16, 2021
@curquiza
Member

For more documentation, you can check out the README of this repo (docs-scraper).

@curquiza curquiza added the question Further information is requested label Mar 16, 2021
@suppadeliux
Author

suppadeliux commented Mar 16, 2021


Hello @curquiza, and thanks for taking the time to answer my question.

I have many docs-scraper config files, each one containing the URL of one documentation website. Each time I run the scraper, I run it for each site.

What I wish I could do is run the scraper only once and have relative URLs in my index, instead of absolute ones (like in the meilisearch doc site).

Then, when I get the response from my MeiliSearch instance, I will only have relative paths (e.g. /getting-started/introduction or /about-us) and can redirect the user to the right result using those. This way, no documentation website contains the raw URL of another site in the search API response.

I hope that clears it up a little bit 👍

@curquiza
Member

curquiza commented Mar 16, 2021

I understand now. Unfortunately, if I'm not wrong, there is no way to change the url field...
Once your documents are added to MeiliSearch, what you can do is update all the url fields in your documents.

We have clients in many languages you can use to update your documents: https://github.com/meilisearch/integration-guides#-sdks-for-meilisearch-api
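As a minimal sketch of that post-indexing rewrite (the url field name matches what docs-scraper stores; the SDK calls to fetch and push the documents are left out, since they depend on your client), the conversion itself only needs the standard library:

```python
from urllib.parse import urlparse


def to_relative(url: str) -> str:
    """Strip the scheme and host, keeping path, query, and fragment."""
    return urlparse(url)._replace(scheme="", netloc="").geturl()


def relativize_documents(documents: list) -> list:
    """Rewrite the 'url' field of each scraped document in place."""
    for doc in documents:
        doc["url"] = to_relative(doc["url"])
    return documents
```

The rewritten documents could then be pushed back with any of the SDKs, for example `index.update_documents(documents)` in the Python client.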

@bidoubiwa
Contributor

We need a PR that adds compatibility with both path types. Thanks for raising this 🔥 Feel free to implement it; otherwise we'll wait for a contributor to do so.

@huguesalary

huguesalary commented Oct 7, 2021

@suppadeliux I had this issue as well on my end and ended up writing a really hacky patch that simply makes all URLs relative.

This works for my very narrow use case and will very likely break for yours, but on the off chance this patch can help you, here it is:

diff --git a/scraper/src/documentation_spider.py b/scraper/src/documentation_spider.py
index 88bd125..704b13d 100644
--- a/scraper/src/documentation_spider.py
+++ b/scraper/src/documentation_spider.py
@@ -13,6 +13,8 @@ import os

 # End of import for the sitemap behavior

+from urllib.parse import urlparse
+
 from scrapy.spidermiddlewares.httperror import HttpError

 from scrapy.exceptions import CloseSpider
@@ -148,6 +150,11 @@ class DocumentationSpider(CrawlSpider, SitemapSpider):
         return super()._parse(response, **kwargs)

     def add_records(self, response, from_sitemap):
+
+        parsedURL = urlparse(response.url)
+        response = response.replace(url=parsedURL._replace(scheme="",netloc=None).geturl())
+        print("Changed {} to relative URL {}".format(parsedURL.geturl(), response.url))
+
         records = self.strategy.get_records_from_response(response)
         self.meilisearch_helper.add_records(records, response.url, from_sitemap)

@bidoubiwa
Contributor

Does it still work with absolute paths? It could be an acceptable solution. If you don't have time to try it out, no problem :)

@huguesalary

@bidoubiwa This code always changes the URL from absolute to relative. It would need to be adapted to provide the option to enable/disable this feature.

Unfortunately, the docs-scraper codebase is a little hard to follow and my Python skills are lacking, so I can't really provide a better solution.
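A sketch of what such an opt-in could look like (the use_relative_urls flag is hypothetical, not an existing docs-scraper config key):

```python
from urllib.parse import urlparse


def maybe_relativize(url: str, use_relative_urls: bool = False) -> str:
    """Return the URL unchanged unless the hypothetical
    use_relative_urls flag asks for a relative path."""
    if not use_relative_urls:
        return url
    return urlparse(url)._replace(scheme="", netloc="").geturl()
```

With the flag off (the default), existing absolute-URL setups keep working; with it on, the same conversion as the patch above is applied.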

@alallema alallema changed the title How to replace absolute with relative URLs in docs-scraper? Add compatibility so that both URL path types are supported (absolute and relative) Sep 27, 2022
@alallema alallema added enhancement New feature or request hacktoberfest labels Sep 27, 2022
@alallema
Contributor

alallema commented Sep 6, 2023

As this repo is now low-maintenance, this issue is no longer relevant today. I'm closing all issues that are not bugs.
