Add compatibility so that both URL path types are supported (absolute and relative) #105
Comments
Hello @suppadeliux! Does the following configuration, with several sites listed in `start_urls`, answer your need?

```json
{
  "start_urls": ["https://www.mysite1.com", "https://www.mysite2.com"]
}
```

Sorry if I misunderstood your issue.

PS: I transferred your issue into the docs-scraper repo.
For more documentation, you can check out the README of this repo (docs-scraper).
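[Editorial illustration, not part of the original thread: a minimal docs-scraper configuration along the lines the maintainer suggests might look like the sketch below. The `index_uid` value and the selectors are placeholders, not values taken from this issue.]

```json
{
  "index_uid": "docs",
  "start_urls": [
    "https://www.mysite1.com",
    "https://www.mysite2.com"
  ],
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "text": "p"
  }
}
```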
Hello @curquiza, and thanks for taking the time to answer my question. I have many instances of my documentation website hosted in different places. What I wish I could do is run the scraper only once and have relative URLs in my index instead of absolute ones (like on the Meilisearch docs site). So when I get the response from my Meilisearch instance, I will only have relative paths. I hope it clears things up a little bit 👍
I understand now. Unfortunately, and if I'm not wrong, there is no way to change the URL format (absolute vs relative) that the scraper stores in your documents.
We have clients for many languages here, so you can pick your favorite to update your documents yourself: https://github.com/meilisearch/integration-guides#-sdks-for-meilisearch-api
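[Editorial illustration of that workaround, not from the thread: a small script could fetch the already-indexed documents and rewrite their URL field to a relative path. The index name, API key, auth header format, and the assumption that each document stores its page address in a `url` attribute are all hypothetical; adjust them to your setup.]

```python
# Sketch only: rewrite the `url` field of indexed documents to relative paths.
# "docs", "masterKey", the local instance address and the `url` attribute name
# are assumptions, not values taken from this issue.
from urllib.parse import urlparse

import requests

MEILISEARCH_URL = "http://localhost:7700"
API_KEY = "masterKey"
INDEX_UID = "docs"

# Recent Meilisearch versions use a Bearer token; older ones used X-Meili-API-Key.
headers = {"Authorization": f"Bearer {API_KEY}"}

# Fetch a batch of documents (newer versions wrap them in a "results" key).
resp = requests.get(
    f"{MEILISEARCH_URL}/indexes/{INDEX_UID}/documents",
    params={"limit": 1000},
    headers=headers,
)
resp.raise_for_status()
payload = resp.json()
documents = payload["results"] if isinstance(payload, dict) else payload

for doc in documents:
    parsed = urlparse(doc["url"])
    # Keep only the path, query and fragment; drop the scheme and host.
    doc["url"] = parsed._replace(scheme="", netloc="").geturl()

# Re-send the full documents; Meilisearch replaces documents sharing a primary key.
update = requests.post(
    f"{MEILISEARCH_URL}/indexes/{INDEX_UID}/documents",
    json=documents,
    headers=headers,
)
update.raise_for_status()
```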
We need a PR that adds compatibility with both path techniques. Thanks for raising this 🔥 Feel free to implement it; otherwise we'll wait for a contributor to do so.
@suppadeliux I had this issue as well on my end and ended up writing a really hacky patch that simply makes all URLs relative. This works for my very narrow use case and will very likely break for yours, but in the off chance this patch can help you, here it is:

```diff
diff --git a/scraper/src/documentation_spider.py b/scraper/src/documentation_spider.py
index 88bd125..704b13d 100644
--- a/scraper/src/documentation_spider.py
+++ b/scraper/src/documentation_spider.py
@@ -13,6 +13,8 @@ import os
 # End of import for the sitemap behavior
+from urllib.parse import urlparse
+
 from scrapy.spidermiddlewares.httperror import HttpError
 from scrapy.exceptions import CloseSpider
@@ -148,6 +150,11 @@ class DocumentationSpider(CrawlSpider, SitemapSpider):
         return super()._parse(response, **kwargs)

     def add_records(self, response, from_sitemap):
+
+        parsedURL = urlparse(response.url)
+        response = response.replace(url=parsedURL._replace(scheme="", netloc=None).geturl())
+        print("Changed {} to relative URL {}".format(parsedURL.geturl(), response.url))
+
         records = self.strategy.get_records_from_response(response)
         self.meilisearch_helper.add_records(records, response.url, from_sitemap)
```
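[Editorial note, not part of the thread: the core of that patch is the `urlparse(...)._replace(...)` trick, which strips the scheme and host while keeping the path, query and fragment. A quick standalone check with a made-up URL; the patch passes `netloc=None`, which behaves the same as `netloc=""` here.]

```python
from urllib.parse import urlparse

# Hypothetical absolute URL, just to show the transformation the patch applies.
absolute = "https://www.mysite1.com/docs/getting-started?lang=en#install"
relative = urlparse(absolute)._replace(scheme="", netloc="").geturl()

print(relative)  # -> /docs/getting-started?lang=en#install
```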
Does it still work with absolute paths? It could be an acceptable solution. If you don't have time to try it out, no problem :)
@bidoubiwa This code always changes the URL from absolute to relative. It would need to be adapted to provide an option to enable/disable this feature. Unfortunately, the docs-scraper codebase is a little hard to follow and my Python skills are lacking, so I can't really provide a better solution.
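[Editorial sketch of that adaptation, not an existing docs-scraper option: the relativization could be gated behind an environment variable. Both the variable name and the helper below are invented for illustration.]

```python
# Sketch: make the URL relativization from the patch above opt-in.
# MEILISEARCH_RELATIVE_URLS and maybe_relativize are hypothetical, not part of docs-scraper.
import os
from urllib.parse import urlparse


def maybe_relativize(url: str) -> str:
    """Return `url` with scheme and host stripped when the hypothetical
    MEILISEARCH_RELATIVE_URLS environment variable is set to "true"."""
    if os.environ.get("MEILISEARCH_RELATIVE_URLS", "false").lower() != "true":
        return url
    return urlparse(url)._replace(scheme="", netloc="").geturl()
```

`add_records` could then call `response.replace(url=maybe_relativize(response.url))` before building the records, leaving the default behavior (absolute URLs) untouched.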
As this repo is now low-maintenance, this PR is no longer relevant today. I'm closing all issues that are not bugs. |
Hello,
I am currently working on a documentation website generated with Jekyll, and Meilisearch has been pretty easy to add, with the docs-scraper and the docs-searchbar.
In fact, I have many instances of my documentation website hosted in different places.
That means I have to run docs-scraper for each site (on each update of the repository).
I wish I could run only one scraper for all my sites and be independent of where each documentation site is hosted. So my question is: is there a way to have only relative URLs in my index, so that the scraped documents don't depend on the host?
I guess I could do that by overriding some of the logic in the scraper's source code, but is there another way? (Maybe someone else has already thought about or discussed this.)
Thanks in advance!!