Speed-up Xapian searches by preloading indexes #617

kelson42 · 2021-08-29T08:01:02Z

#418 has shown that the typical steps for a search are:

Read the zim file (to be able to locate the xapian index in it) : Cold : 7.44s | Warm : 0.12s
Open the xapian database (internal xapian code) : Cold : 0.09s | Warm : 0.003s
Set the enquire on the database : Cold : 0.02s | Warm: 0.0004s
Run the enquire and get a set of (ranged) results from the enquire (internal xapian code) : Cold : 3.74s | Warm : 1.5s

Here is when it happens:

Once at file opening
At first search requested and then cached
At each search
At each search

In a attempt to speed-up searches (in particular the first one) the idea would be to have the following workflow:

Once at file opening
Once at file opening (optional?) and then cached
At file opening and then keep one (more?) ready to go all the time
At each search

He would be the related questions on my side:

Can we secure that 2. does not slows down the opening of the file (so it should run in an other thread)?
Can we secure that 3. does not slows down the searches (so it should run in an other thread)?
I guess the whole search system is protected to avoid two search requests to happen at the same time. If this is secure in a multithreaded context. This will be responsible of massive slow downs in many search requests happen at the same time. Would that be possible/reasonable to have a pull of "searcher"?

kelson42 · 2021-08-29T08:01:46Z

@mgautierfr @maneeshpm What do you think? Is that a proper approach?

kelson42 · 2021-09-27T05:34:56Z

Any update here?

kelson42 · 2022-01-02T22:12:52Z

@mgautierfr @maneeshpm We have started the dev of 7.2.0. Do we agree on this approach?

Jaifroid · 2023-07-30T05:55:59Z

Just to add to the documented issue here, Xapian-based search in the WASM version of libzim is basically unusable on Android, due to excessive I/O generated by libzim on startup. See openzim/javascript-libzim#42.

mgautierfr · 2025-01-16T10:53:03Z

When creating a Searcher, a number of initialization steps are done (it is technically done at first search using this searcher):

Locate Xapian database dirent in the archive
Locate the direct access information for this dirent
Instantiate a xapian database using the direct access (open file, seek at position, create xapian db)
Then, if it is the first database opened of the searcher:
. get "valuesmap" (mapping from value name to value index)
. get language from db
. create stem for the db's language
. get stop words from db
Add this zim xapian db to the Searcher db (for multizim search)

Note that we open/create a new xapian database per Searcher.
Libkiwix is keeping a cache of Searcher so it is not a problem.

However, in case of multizim search, we will create a new multizim searcher and so reopen a xapian db per zim searched.
So all the pre-opening of xapian db is kind of useless in multizim search.

Could be reuse the same xapiand db instead of reopen it ? Yes we could. However, xapian db is not thread safe.
So instead of locking the current Searcher (and prevent multi searches on the same searcher) we would have to lock ALL searchers using sharing/using the xapian db. It may be more complex and add more slow down than just recreate a db.

Proposition:

Introduce a Searcher interface.
Split Searcher into SingleSearcher and MultiSearcher, implementing the Searcher interface.
Open the xapian database and "parse" its values (valuesmap, language, stopwords) at zim file opening.
This should be configurable (from "CacheConfig", so depends on Make cache configuration available from cpp api. #946)
We must be able to open a zim file without any "preload" of data.
Keep the opened xapiand db (from previous step) open to allow it to be use in a futur SingleSearcher. After on SingleSearcher is created, the opened xapian db will be transferred to this single searcher and "lost". This is to avoid to SingleSearcher to be created with same xapian db leading to data race.
MultiSearcher will reopen a xapian database (to avoid cross locks) but we reuse the values already "parsed"
SuggestionSearcher will also preload xapian db (and its value) the same way we do for SingleSearcher.
When multi zim suggestion searcher will be made (Multizim (suggestions) does not work at all #932), preload of suggestion db will be the same than for MultiSearcher.

Note that created SingleSearcher and MultiSearcher will not be cached by libzim. It is done on libkiwix side (as it is already done)

Also note that it seems that most of the speedup comes from the kernel cache (page cache,..) which avoid IO (both libzim searching dirent and xapian doing it stuff).
Pre-opening the xapian database will populate this cache. However, nothing guaranty that the data in the kernel cache is actually available when we do the search.
This can be easily the case when handling a library with a lot of zim files.

Preloading the data from the zim file will be done in the same/main thread. We assume here that if we are in a use case where user will not search (and know it), this can be deactivated with cache configuration.
Doing dirent lookup to search the xapian database will help populate the dirent cache and so will speed up next lookup (for regular usage). So not all time spend in populating this cache is lost.
Decision to move this in a background thread should be done with some numbers after this improvement is done.

Testing:

As for #946, testing cache is a bit complex.

Xapian search must work and give the same result, whatever is the cache strategy

kelson42 · 2025-01-27T11:21:29Z

@mgautierfr Why the singlesearch is not considered as a special case of the multisearch? And the multisearch locking system made in a way that only needed Xapian searcher (opened xapiand db) instances are locked?

mgautierfr · 2025-01-27T13:53:12Z

Why the singlesearch is not considered as a special case of the multisearch?

For now, at least because their constructors don't have the same sementics:

SingleSearcher will take ownership of the database (and the database is removed from the "preload cache")
MultiSearcher will clone/reopen the database

In such we could only specialize the constructor but I forsee another specialization (in the way we parse db metadata or queries).
We could indeed wait a bit before splitting in two classes.

And the multisearch locking system made in a way that only needed Xapian searcher (opened xapiand db) instances are locked?

I'm not sure to understand what you are suggesting, but if you are proposing to not lock zim archive during the search, it is already the case.

The problem of the lock is not that we are locking all xapian db in the same time. But if we have multizim searcher with db (A, B, C) and another with (C, D, E, F, ...X).

If we have search of first multizim searcher, we have a lock on C.
And so second searcher cannot do a search because it also need a lock on C.

With two searcher it is kind of acceptable but with a lot of multizim searcher, it would be pretty complex to handle correctly. It is really easy to have a dead lock between 2 searchers.

Do not sharing db between searcher avoid the problem.

kelson42 · 2025-01-27T14:30:10Z

@mgautierfr Thank you, it seems clearer to me now.

I still don't understand what is the fundamental reason why single db searches should be handled differently as multi db searches. Blocking n DB has potentially the same impact as blocking only one, it really depends what the users do. If we have two users dealing with the same single ZIM, with current system they can achieve to impact each other if they launch single DB searches at the "same" time. The chance that this occurs is just statistically less than if people start multizim searches.

If we want to avoid the chance to have one user impacting an other one in "normal" condition, we should consider all the scenarios. And having a kiwix-hotspot with only one ZIM used by 20 users at the same time is not a rare one!

The challenge we have here seems to me similar to the one that a Web server has. A rare ressource, the HTTP or PHP engine for a Web server, and in our case the DB engine.

What about solving that problem like a Web server does:

A pool of Xapian DB handlers for each DB
Searches take/clone/release whatever is needed so they can work
Limits max/preloaded/... for the pools

mgautierfr · 2025-01-27T16:36:24Z

I still don't understand what is the fundamental reason why single db searches should be handled differently as multi db searches. Blocking n DB has potentially the same impact as blocking only one, it really depends what the users do. If we have two users dealing with the same single ZIM, with current system they can achieve to impact each other if they launch single DB searches at the "same" time. The chance that this occurs is just statistically less than if people start multizim searches.

If we want to avoid the chance to have one user impacting an other one in "normal" condition, we should consider all the scenarios. And having a kiwix-hotspot with only one ZIM used by 20 users at the same time is not a rare one!

The issue is not really about one user blocking other searches while it is searching (Even if it is (?) an issue to investigate).
The issue is more about dead locks.

We can have to multisearcher:

S with db A, B and C
R with db B and A

If we have two search in the same time:

S lock A => ok
R lock B => ok
S lock B => blocked
R lock A => blocked

Then you have a dead lock, two threads are locked.
No new search at all can be done on A nor B (and all search trying to do so will be blocked).
Your threadpool in kiwix-serve will lost N threads (and so potentially will not be able to respond any request at a moment)
Threads will lock resources, even if they are removed from cache.

The only way to solve that would be to restart the process.

Designing this correctly is not a easy task.

The challenge we have here seems to me similar to the one that a Web server has. A rare ressource, the HTTP or PHP engine for a Web server, and in our case the DB engine.

What about solving that problem like a Web server does:

A pool of Xapian DB handlers for each DB

Searches take/clone/release whatever is needed so they can work

Limits max/preloaded/... for the pools

We can do that. But:

The previous point is still valid
It would be a totally another project that what described in this issue.

kelson42 · 2025-01-27T18:22:40Z

The previous point is still valid

I though "MultiSearcher will clone/reopen the database" (your words) is there exactly for the purpose of not been blocked (and blocked) other (SingleSearcher) searchs? So basically a method to solve the problem of concurrent accesses? I got that wrong?

For the dead lock problem, we should make that part atomic to avoid the problem you describe. Is that requirement problematic to implement?

mgautierfr · 2025-01-28T11:17:35Z

I though "MultiSearcher will clone/reopen the database" (your words) is there exactly for the purpose of not been blocked (and blocked) other (SingleSearcher) searchs? So basically a method to solve the problem of concurrent accesses? I got that wrong?

You got that right. Reopen the database avoid any sharing of the database and so not cross-lock at all.

For the dead lock problem, we should make that part atomic to avoid the problem you describe. Is that requirement problematic to implement?

We might yes. But I'm not confident with this. Not about the implementation in itself, but more with the potential conflict which can arise from the cache in libkiwix (no strong arguments to give here, just a feeling)

Saying that, I realize that all this preload can already done on libkiwix side by simply creating a searcher in a background thread and storing it in the cache there.

kelson42 · 2025-02-03T06:45:21Z

@mgautierfr To conclude:

We have identified ways (new Xapian DB handler, proper atomic lock) to secure multizim searches work fine with deadlocks. To me there is a way forward
Future of search is multisearch, actually Kiwix Apple and Kiwix Desktop are ready and wait the libzim to complete the feature
Single search, in its current form, because of the only one DB handle is a bootlneck in case of high concurrent researches on the same ZIM file. In addition is only a special case of the multisearch. We should have only one search handling all of this.
The three points above are out of scope of this issue anyway, but relevant for Multizim (suggestions) does not work at all #932, which will probably require a revamp of the ft search anyway first.
Regarding this feature request, like the path lookup prefetch, this should and optionaly requested at the the opening ZIM time via an option (see my comment). Considering this can takes many seconds, this has to be done in a dedicated thread and non blocking manner for the the ZIM opening operation

mgautierfr · 2025-02-04T09:23:18Z

We have identified ways (new Xapian DB handler, proper atomic lock) to secure multizim searches work fine with deadlocks. To me there is a way forward

I will do proper atomic lock. But no new xapian db handler.
Only one xapian db will be opened per zim file.

Future of search is multisearch, actually Kiwix Apple and Kiwix Desktop are ready and wait the libzim to complete the feature

This is already done. Multi zim search is working. What is not working is multi suggestion and multi lang. But they will not be handled in this issue.

We should have only one search handling all of this.

I don't understand what you mean here.

The three points above are out of scope of this issue anyway, but relevant for #932, which will probably require a revamp of the ft search anyway first.

The plan here it to imlement multisuggestion the same way as multisearch, so it will not revamp the ft search.

Regarding this feature request, like the path lookup prefetch, this should and optionaly requested at the the opening ZIM time via an option (see #946 (comment)). Considering this can takes many seconds, this has to be done in a dedicated thread and non blocking manner for the the ZIM opening operation

Issue #946 is about making the cache configurable (and we agree on that).
Doing it in a dedicated thread is not part of the issue nor any project as far as I know.

kelson42 added question search labels Aug 29, 2021

kelson42 assigned mgautierfr, kelson42 and maneeshpm Aug 29, 2021

kelson42 mentioned this issue Aug 29, 2021

Search suggestion performance analysis #418

Closed

kelson42 changed the title ~~Speed-up Xapian searches~~ Speed-up Xapian searches by preloading Xapian indexes Aug 29, 2021

kelson42 changed the title ~~Speed-up Xapian searches by preloading Xapian indexes~~ Speed-up Xapian searches by preloading indexes Aug 29, 2021

kelson42 pinned this issue Sep 2, 2021

kelson42 added this to the 7.1.0 milestone Sep 27, 2021

kelson42 removed their assignment Jan 2, 2022

kelson42 added the enhancement label Jan 2, 2022

kelson42 unassigned maneeshpm Mar 11, 2023

kelson42 modified the milestones: 8.2.0, 8.3.0 Mar 18, 2023

kelson42 modified the milestones: 9.0.0, 9.1.0 Sep 26, 2023

kelson42 modified the milestones: 9.1.0, 10.0.0 Nov 1, 2023

kelson42 unpinned this issue Dec 26, 2023

kelson42 modified the milestones: 10.0.0, 9.3.0 Sep 14, 2024

kelson42 unassigned mgautierfr Sep 14, 2024

This was referenced Jan 20, 2025

About cache limited by memory #947

Open

Multizim (suggestions) does not work at all #932

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed-up Xapian searches by preloading indexes #617

Speed-up Xapian searches by preloading indexes #617

kelson42 commented Aug 29, 2021

kelson42 commented Aug 29, 2021

kelson42 commented Sep 27, 2021

kelson42 commented Jan 2, 2022

Jaifroid commented Jul 30, 2023

mgautierfr commented Jan 16, 2025 •

edited

Loading

kelson42 commented Jan 27, 2025 •

edited

Loading

mgautierfr commented Jan 27, 2025

kelson42 commented Jan 27, 2025

mgautierfr commented Jan 27, 2025

kelson42 commented Jan 27, 2025 •

edited

Loading

mgautierfr commented Jan 28, 2025

kelson42 commented Feb 3, 2025 •

edited

Loading

mgautierfr commented Feb 4, 2025

Speed-up Xapian searches by preloading indexes #617

Speed-up Xapian searches by preloading indexes #617

Comments

kelson42 commented Aug 29, 2021

kelson42 commented Aug 29, 2021

kelson42 commented Sep 27, 2021

kelson42 commented Jan 2, 2022

Jaifroid commented Jul 30, 2023

mgautierfr commented Jan 16, 2025 • edited Loading

Proposition:

Testing:

kelson42 commented Jan 27, 2025 • edited Loading

mgautierfr commented Jan 27, 2025

kelson42 commented Jan 27, 2025

mgautierfr commented Jan 27, 2025

kelson42 commented Jan 27, 2025 • edited Loading

mgautierfr commented Jan 28, 2025

kelson42 commented Feb 3, 2025 • edited Loading

mgautierfr commented Feb 4, 2025

mgautierfr commented Jan 16, 2025 •

edited

Loading

kelson42 commented Jan 27, 2025 •

edited

Loading

kelson42 commented Jan 27, 2025 •

edited

Loading

kelson42 commented Feb 3, 2025 •

edited

Loading