Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed-up Xapian searches by preloading indexes #617

Open
kelson42 opened this issue Aug 29, 2021 · 13 comments
Open

Speed-up Xapian searches by preloading indexes #617

kelson42 opened this issue Aug 29, 2021 · 13 comments

Comments

@kelson42
Copy link
Contributor

#418 has shown that the typical steps for a search are:

  1. Read the zim file (to be able to locate the xapian index in it) : Cold : 7.44s | Warm : 0.12s
  2. Open the xapian database (internal xapian code) : Cold : 0.09s | Warm : 0.003s
  3. Set the enquire on the database : Cold : 0.02s | Warm: 0.0004s
  4. Run the enquire and get a set of (ranged) results from the enquire (internal xapian code) : Cold : 3.74s | Warm : 1.5s

Here is when it happens:

  1. Once at file opening
  2. At first search requested and then cached
  3. At each search
  4. At each search

In a attempt to speed-up searches (in particular the first one) the idea would be to have the following workflow:

  1. Once at file opening
  2. Once at file opening (optional?) and then cached
  3. At file opening and then keep one (more?) ready to go all the time
  4. At each search

He would be the related questions on my side:

  • Can we secure that 2. does not slows down the opening of the file (so it should run in an other thread)?
  • Can we secure that 3. does not slows down the searches (so it should run in an other thread)?
  • I guess the whole search system is protected to avoid two search requests to happen at the same time. If this is secure in a multithreaded context. This will be responsible of massive slow downs in many search requests happen at the same time. Would that be possible/reasonable to have a pull of "searcher"?
@kelson42
Copy link
Contributor Author

@mgautierfr @maneeshpm What do you think? Is that a proper approach?

@kelson42 kelson42 changed the title Speed-up Xapian searches Speed-up Xapian searches by preloading Xapian indexes Aug 29, 2021
@kelson42 kelson42 changed the title Speed-up Xapian searches by preloading Xapian indexes Speed-up Xapian searches by preloading indexes Aug 29, 2021
@kelson42 kelson42 pinned this issue Sep 2, 2021
@kelson42
Copy link
Contributor Author

Any update here?

@kelson42 kelson42 added this to the 7.1.0 milestone Sep 27, 2021
@kelson42
Copy link
Contributor Author

kelson42 commented Jan 2, 2022

@mgautierfr @maneeshpm We have started the dev of 7.2.0. Do we agree on this approach?

@kelson42 kelson42 removed their assignment Jan 2, 2022
@kelson42 kelson42 modified the milestones: 8.2.0, 8.3.0 Mar 18, 2023
@Jaifroid
Copy link

Just to add to the documented issue here, Xapian-based search in the WASM version of libzim is basically unusable on Android, due to excessive I/O generated by libzim on startup. See openzim/javascript-libzim#42.

@kelson42 kelson42 modified the milestones: 9.0.0, 9.1.0 Sep 26, 2023
@kelson42 kelson42 modified the milestones: 9.1.0, 10.0.0 Nov 1, 2023
@kelson42 kelson42 unpinned this issue Dec 26, 2023
@kelson42 kelson42 modified the milestones: 10.0.0, 9.3.0 Sep 14, 2024
@mgautierfr
Copy link
Collaborator

mgautierfr commented Jan 16, 2025

When creating a Searcher, a number of initialization steps are done (it is technically done at first search using this searcher):

  • Locate Xapian database dirent in the archive
  • Locate the direct access information for this dirent
  • Instantiate a xapian database using the direct access (open file, seek at position, create xapian db)
  • Then, if it is the first database opened of the searcher:
    . get "valuesmap" (mapping from value name to value index)
    . get language from db
    . create stem for the db's language
    . get stop words from db
  • Add this zim xapian db to the Searcher db (for multizim search)

Note that we open/create a new xapian database per Searcher.
Libkiwix is keeping a cache of Searcher so it is not a problem.

However, in case of multizim search, we will create a new multizim searcher and so reopen a xapian db per zim searched.
So all the pre-opening of xapian db is kind of useless in multizim search.

Could be reuse the same xapiand db instead of reopen it ? Yes we could. However, xapian db is not thread safe.
So instead of locking the current Searcher (and prevent multi searches on the same searcher) we would have to lock ALL searchers using sharing/using the xapian db. It may be more complex and add more slow down than just recreate a db.

Proposition:

  • Introduce a Searcher interface.
  • Split Searcher into SingleSearcher and MultiSearcher, implementing the Searcher interface.
  • Open the xapian database and "parse" its values (valuesmap, language, stopwords) at zim file opening.
    This should be configurable (from "CacheConfig", so depends on Make cache configuration available from cpp api. #946)
    We must be able to open a zim file without any "preload" of data.
  • Keep the opened xapiand db (from previous step) open to allow it to be use in a futur SingleSearcher. After on SingleSearcher is created, the opened xapian db will be transferred to this single searcher and "lost". This is to avoid to SingleSearcher to be created with same xapian db leading to data race.
  • MultiSearcher will reopen a xapian database (to avoid cross locks) but we reuse the values already "parsed"
  • SuggestionSearcher will also preload xapian db (and its value) the same way we do for SingleSearcher.
    When multi zim suggestion searcher will be made (Multizim (suggestions) does not work at all #932), preload of suggestion db will be the same than for MultiSearcher.

Note that created SingleSearcher and MultiSearcher will not be cached by libzim. It is done on libkiwix side (as it is already done)

Also note that it seems that most of the speedup comes from the kernel cache (page cache,..) which avoid IO (both libzim searching dirent and xapian doing it stuff).
Pre-opening the xapian database will populate this cache. However, nothing guaranty that the data in the kernel cache is actually available when we do the search.
This can be easily the case when handling a library with a lot of zim files.

Preloading the data from the zim file will be done in the same/main thread. We assume here that if we are in a use case where user will not search (and know it), this can be deactivated with cache configuration.
Doing dirent lookup to search the xapian database will help populate the dirent cache and so will speed up next lookup (for regular usage). So not all time spend in populating this cache is lost.
Decision to move this in a background thread should be done with some numbers after this improvement is done.

Testing:

As for #946, testing cache is a bit complex.

  • Xapian search must work and give the same result, whatever is the cache strategy

@kelson42
Copy link
Contributor Author

kelson42 commented Jan 27, 2025

@mgautierfr Why the singlesearch is not considered as a special case of the multisearch? And the multisearch locking system made in a way that only needed Xapian searcher (opened xapiand db) instances are locked?

@mgautierfr
Copy link
Collaborator

Why the singlesearch is not considered as a special case of the multisearch?

For now, at least because their constructors don't have the same sementics:

  • SingleSearcher will take ownership of the database (and the database is removed from the "preload cache")
  • MultiSearcher will clone/reopen the database

In such we could only specialize the constructor but I forsee another specialization (in the way we parse db metadata or queries).
We could indeed wait a bit before splitting in two classes.

And the multisearch locking system made in a way that only needed Xapian searcher (opened xapiand db) instances are locked?

I'm not sure to understand what you are suggesting, but if you are proposing to not lock zim archive during the search, it is already the case.

The problem of the lock is not that we are locking all xapian db in the same time. But if we have multizim searcher with db (A, B, C) and another with (C, D, E, F, ...X).

If we have search of first multizim searcher, we have a lock on C.
And so second searcher cannot do a search because it also need a lock on C.

With two searcher it is kind of acceptable but with a lot of multizim searcher, it would be pretty complex to handle correctly. It is really easy to have a dead lock between 2 searchers.

Do not sharing db between searcher avoid the problem.

@kelson42
Copy link
Contributor Author

@mgautierfr Thank you, it seems clearer to me now.

I still don't understand what is the fundamental reason why single db searches should be handled differently as multi db searches. Blocking n DB has potentially the same impact as blocking only one, it really depends what the users do. If we have two users dealing with the same single ZIM, with current system they can achieve to impact each other if they launch single DB searches at the "same" time. The chance that this occurs is just statistically less than if people start multizim searches.

If we want to avoid the chance to have one user impacting an other one in "normal" condition, we should consider all the scenarios. And having a kiwix-hotspot with only one ZIM used by 20 users at the same time is not a rare one!

The challenge we have here seems to me similar to the one that a Web server has. A rare ressource, the HTTP or PHP engine for a Web server, and in our case the DB engine.

What about solving that problem like a Web server does:

  • A pool of Xapian DB handlers for each DB
  • Searches take/clone/release whatever is needed so they can work
  • Limits max/preloaded/... for the pools

@mgautierfr
Copy link
Collaborator

I still don't understand what is the fundamental reason why single db searches should be handled differently as multi db searches. Blocking n DB has potentially the same impact as blocking only one, it really depends what the users do. If we have two users dealing with the same single ZIM, with current system they can achieve to impact each other if they launch single DB searches at the "same" time. The chance that this occurs is just statistically less than if people start multizim searches.

If we want to avoid the chance to have one user impacting an other one in "normal" condition, we should consider all the scenarios. And having a kiwix-hotspot with only one ZIM used by 20 users at the same time is not a rare one!

The issue is not really about one user blocking other searches while it is searching (Even if it is (?) an issue to investigate).
The issue is more about dead locks.

We can have to multisearcher:

  • S with db A, B and C
  • R with db B and A

If we have two search in the same time:

  • S lock A => ok
  • R lock B => ok
  • S lock B => blocked
  • R lock A => blocked

Then you have a dead lock, two threads are locked.
No new search at all can be done on A nor B (and all search trying to do so will be blocked).
Your threadpool in kiwix-serve will lost N threads (and so potentially will not be able to respond any request at a moment)
Threads will lock resources, even if they are removed from cache.

The only way to solve that would be to restart the process.

Designing this correctly is not a easy task.

The challenge we have here seems to me similar to the one that a Web server has. A rare ressource, the HTTP or PHP engine for a Web server, and in our case the DB engine.

What about solving that problem like a Web server does:

  • A pool of Xapian DB handlers for each DB
  • Searches take/clone/release whatever is needed so they can work
  • Limits max/preloaded/... for the pools

We can do that. But:

  • The previous point is still valid
  • It would be a totally another project that what described in this issue.

@kelson42
Copy link
Contributor Author

kelson42 commented Jan 27, 2025

The previous point is still valid

I though "MultiSearcher will clone/reopen the database" (your words) is there exactly for the purpose of not been blocked (and blocked) other (SingleSearcher) searchs? So basically a method to solve the problem of concurrent accesses? I got that wrong?

For the dead lock problem, we should make that part atomic to avoid the problem you describe. Is that requirement problematic to implement?

@mgautierfr
Copy link
Collaborator

I though "MultiSearcher will clone/reopen the database" (your words) is there exactly for the purpose of not been blocked (and blocked) other (SingleSearcher) searchs? So basically a method to solve the problem of concurrent accesses? I got that wrong?

You got that right. Reopen the database avoid any sharing of the database and so not cross-lock at all.

For the dead lock problem, we should make that part atomic to avoid the problem you describe. Is that requirement problematic to implement?

We might yes. But I'm not confident with this. Not about the implementation in itself, but more with the potential conflict which can arise from the cache in libkiwix (no strong arguments to give here, just a feeling)


Saying that, I realize that all this preload can already done on libkiwix side by simply creating a searcher in a background thread and storing it in the cache there.

@kelson42
Copy link
Contributor Author

kelson42 commented Feb 3, 2025

@mgautierfr To conclude:

  • We have identified ways (new Xapian DB handler, proper atomic lock) to secure multizim searches work fine with deadlocks. To me there is a way forward
  • Future of search is multisearch, actually Kiwix Apple and Kiwix Desktop are ready and wait the libzim to complete the feature
  • Single search, in its current form, because of the only one DB handle is a bootlneck in case of high concurrent researches on the same ZIM file. In addition is only a special case of the multisearch. We should have only one search handling all of this.
  • The three points above are out of scope of this issue anyway, but relevant for Multizim (suggestions) does not work at all #932, which will probably require a revamp of the ft search anyway first.
  • Regarding this feature request, like the path lookup prefetch, this should and optionaly requested at the the opening ZIM time via an option (see my comment). Considering this can takes many seconds, this has to be done in a dedicated thread and non blocking manner for the the ZIM opening operation

@mgautierfr
Copy link
Collaborator

We have identified ways (new Xapian DB handler, proper atomic lock) to secure multizim searches work fine with deadlocks. To me there is a way forward

I will do proper atomic lock. But no new xapian db handler.
Only one xapian db will be opened per zim file.

Future of search is multisearch, actually Kiwix Apple and Kiwix Desktop are ready and wait the libzim to complete the feature

This is already done. Multi zim search is working. What is not working is multi suggestion and multi lang. But they will not be handled in this issue.

We should have only one search handling all of this.

I don't understand what you mean here.

The three points above are out of scope of this issue anyway, but relevant for #932, which will probably require a revamp of the ft search anyway first.

The plan here it to imlement multisuggestion the same way as multisearch, so it will not revamp the ft search.

Regarding this feature request, like the path lookup prefetch, this should and optionaly requested at the the opening ZIM time via an option (see #946 (comment)). Considering this can takes many seconds, this has to be done in a dedicated thread and non blocking manner for the the ZIM opening operation

Issue #946 is about making the cache configurable (and we agree on that).
Doing it in a dedicated thread is not part of the issue nor any project as far as I know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants