Search suggestion performance analysis #418

Closed
mgautierfr opened this issue Sep 2, 2020 · 10 comments

@mgautierfr
Collaborator

mgautierfr commented Sep 2, 2020

Following issue kiwix/kiwix-android#2082, I ran some search-suggestion tests on a low-end device.

I'm testing on a Raspberry Pi 3B; the zim file is wikipedia_en_all_maxi_2020-08.zim, stored on an external USB disk.
I'm using the kiwix-search tool to search the zim file (kiwix-search <zimfile> -s <query>), recompiled with some timing traces. This should be roughly equivalent to what happens on the kiwix-android side, where the search thread, to avoid race conditions, creates a new reader and starts a search on it.

I also tried a smaller zim file on an SD card and got much the same results (the numbers differ but the ratios are the same).

Big numbers

A "cold" search for f (kernel page cache cleared using echo 1 > /proc/sys/vm/drop_caches) takes 12 seconds.
However, a "warm" search (rerunning the same command) takes less than 2 seconds.

All the "lost time" is spent on I/O:
[image: trace_cold]
[image: trace_warm]
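A minimal sketch of how the cold/warm comparison could be scripted, assuming kiwix-search is on the PATH and takes the arguments shown above. The helper name time_command is made up for illustration, and the page cache still has to be dropped by hand (as root) before the first run.

```python
import subprocess
import time


def time_command(cmd, runs=2):
    """Run cmd `runs` times and return the wall-clock duration of each run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the tool's output; only the elapsed time matters here.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


# Example (assumed invocation), after `echo 1 > /proc/sys/vm/drop_caches` as root:
#   cold, warm = time_command(
#       ["kiwix-search", "wikipedia_en_all_maxi_2020-08.zim", "-s", "f"])
#   print(f"first run: {cold:.2f}s, second run: {warm:.2f}s")
```

The first run is only "cold" if the cache was cleared just before; the second run then reuses whatever the kernel kept in memory.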

Small numbers

To better understand the problem, we can look at the individual parts. A "full" search is composed of:

  • Reading the zim file (to be able to locate the xapian index in it): Cold: 7.44s | Warm: 0.12s
  • Opening the xapian database (internal xapian code): Cold: 0.09s | Warm: 0.003s
  • Setting up the enquire on the database: Cold: 0.02s | Warm: 0.0004s
  • Running the enquire and getting a (ranged) set of results from it (internal xapian code): Cold: 3.74s | Warm: 1.5s
  • Displaying/using the results: Cold: 0.001s | Warm: 0.001s

The precision of these figures is debatable, but they show clearly where the time is spent.
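As a quick sanity check, the per-phase figures can be added up. The sketch below only uses the numbers listed above; the dictionary keys and function names are illustrative. The phases sum to roughly 11.3s cold and 1.6s warm, which is consistent with the 12 seconds and "less than 2 seconds" measured for the whole command.

```python
# Per-phase timings in seconds, copied from the list above: (cold, warm).
PHASES = {
    "read zim file": (7.44, 0.12),
    "open xapian database": (0.09, 0.003),
    "set up enquire": (0.02, 0.0004),
    "run enquire, get results": (3.74, 1.5),
    "display results": (0.001, 0.001),
}


def totals(phases):
    """Return (cold_total, warm_total) in seconds."""
    cold = sum(c for c, _ in phases.values())
    warm = sum(w for _, w in phases.values())
    return cold, warm


def shares(phases, column):
    """Return each phase's fraction of the total; column 0 is cold, 1 is warm."""
    total = sum(t[column] for t in phases.values())
    return {name: t[column] / total for name, t in phases.items()}


cold_total, warm_total = totals(PHASES)
print(f"cold total: {cold_total:.2f}s, warm total: {warm_total:.2f}s")
for name, share in shares(PHASES, 0).items():
    print(f"cold, {name}: {share:.1%}")
```

Cold, reading the zim file dominates; warm, almost all of the remaining time is in running the enquire.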

What can we do ?

On the raw performance side, I don't think there is a lot we can do.
Most of the real time is spent in xapian code. And even if this part is improved, it will not help much for the first search.
If we don't have the file quickly available, we will have to wait. No choice.
We must be prepared for long searches (ensure the UI is not blocked by a long search, display something useful to the user while the search is ongoing, ...).

In typical usage, the zim file should already be open when we start a search, so reading the zim file should be quick. A cold search is therefore closer to 5s than 12s.

We may try to improve the user's perception by "pre-caching" things where possible before the user does a search.

  • Opening the zim file. Nothing to do here. We will not open every zim file behind the user's back just to be prepared. But when we start the search, locating the xapian index and so on should be quick as the data is already cached.
  • Opening the database and pre-setting up the enquire. Here we can improve things. We can assume that when a user opens a zim, they will search in it, so we can directly open the database and set up an enquire. It would allow us to save 1s.
  • Getting the results. We cannot do much here either. This is where the real work is done, and we cannot do it before the user does the search.
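The second point could look roughly like the sketch below: start opening the database and preparing the enquire in the background as soon as the zim is opened, so that the first search only waits for whatever is left. This is not libzim code. The class name and the two callables (open_database, make_enquire) are hypothetical stand-ins for the real setup steps.

```python
import threading


class SuggestionSearcher:
    """Prepares the search state in the background once a zim file is opened."""

    def __init__(self, open_database, make_enquire):
        # Hypothetical callables: open_database() -> db,
        # make_enquire(db) -> function taking a query and returning results.
        self._open_database = open_database
        self._make_enquire = make_enquire
        self._enquire = None
        self._error = None
        self._thread = None
        self._lock = threading.Lock()
        self._ready = threading.Event()

    def warm_up(self):
        """Start the setup at most once; returns immediately."""
        with self._lock:
            if self._thread is None:
                self._thread = threading.Thread(target=self._prepare, daemon=True)
                self._thread.start()

    def _prepare(self):
        try:
            database = self._open_database()
            self._enquire = self._make_enquire(database)
        except Exception as exc:
            # Keep the failure so search() can report it instead of hanging.
            self._error = exc
        finally:
            self._ready.set()

    def search(self, query):
        self.warm_up()       # no-op if the warm-up was already started
        self._ready.wait()   # blocks only while the setup is still running
        if self._error is not None:
            raise self._error
        return self._enquire(query)
```

Calling warm_up() right after opening the zim moves the setup cost off the first search; running the enquire itself still has to wait for the user's query.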

Get fewer results ?

The time to get the results from the enquire depends on the number of results we retrieve.
However, the relationship is not linear: retrieving half as many results doesn't halve the time.
Running the request and retrieving no results takes 1s (warm or cold), and doing so doesn't speed up retrieving results afterwards.

Async ?

Having an async API would not really help.
It would be difficult to expose intermediate steps: the whole result set is usable only when the search is finished. We can simply run the search in a thread and update the display when the search is finished.
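The "run the search in a thread" approach can be sketched with a worker thread and a completion callback. The function and parameter names below are illustrative; search stands for any blocking search call.

```python
from concurrent.futures import ThreadPoolExecutor

# A single worker keeps searches from piling up in parallel.
_executor = ThreadPoolExecutor(max_workers=1)


def search_in_background(search, query, on_results):
    """Run search(query) off the calling thread, then pass the full result
    list to on_results once the search has finished."""
    future = _executor.submit(search, query)
    # The callback normally runs on the worker thread; a real UI would have
    # to hand the results over to its own UI thread from there.
    future.add_done_callback(lambda done: on_results(done.result()))
    return future
```

There is no intermediate result here: on_results is called once, with everything, which matches the point that partial results are hard to provide.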


Questions ?
Ideas ?
Suggestions ?

@mgautierfr mgautierfr assigned mgautierfr and kelson42 and unassigned mgautierfr and kelson42 Sep 2, 2020
@mgautierfr
Collaborator Author

Ping @macgills (I cannot assign the issue to you :/)

@dbedrenko

Thanks for the detailed analysis. I am wondering why the search was instantaneous in kiwix-android v2.5 (and whatever libzim accompanied it) with the 2018-10 en_all_maxi?

Why was the old version so much faster? What was changed in the software to cause this?

@macgills

macgills commented Sep 3, 2020

On the Android side, I think the coroutines implementation, along with an actual UI state for "search in progress", will go a long way. The API has also been updated to be thread safe, right, so I can avoid the creation of a new reader per search?
There may also be an upside from "multizim tabs": there will be an LRU cache of zim readers, which may keep this from being an issue too often?

@mgautierfr
Collaborator Author

Why was the old version so much faster? What was changed in the software to cause this?

The new libzim uses a xapian database to search for suggestions. In the old version, we were simply searching for article titles starting with the query.
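To illustrate why searching for titles starting with the query is cheap, here is a sketch of a prefix lookup over a sorted list of titles using binary search. It is not the old libzim implementation; it assumes the titles are already sorted and in memory, and it is case-sensitive.

```python
import bisect


def titles_with_prefix(sorted_titles, prefix, limit=10):
    """Return up to `limit` titles starting with `prefix`.

    All titles sharing a prefix are contiguous in a sorted list, so one
    binary search finds where they start.
    """
    start = bisect.bisect_left(sorted_titles, prefix)
    matches = []
    for title in sorted_titles[start:start + limit]:
        if not title.startswith(prefix):
            break
        matches.append(title)
    return matches
```

The lookup cost grows with the logarithm of the number of titles plus the number of suggestions returned, which fits with such a search feeling instantaneous.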

The API has also been updated to be thread safe, right, so I can avoid the creation of a new reader per search?

Not yet. However, creating a new reader should be quick, as you already have one and so the page cache is already populated.

There may also be an upside from "multizim tabs": there will be an LRU cache of zim readers, which may keep this from being an issue too often?

I'm not sure I understand this point.
The page cache is handled by Linux itself. We do nothing specific on our side. If we open several zim readers, data from previous files may or may not be kept in cache, depending on Linux's decisions (available RAM, other applications, ...). An LRU cache will not help here.

@macgills

macgills commented Sep 4, 2020

This might be my lack of knowledge of the internals of kiwixlib, but speaking from the client perspective, the app only keeps one Reader in memory at a time: when we open a new one, we call dispose on the old one. The exceptions are searching (we create a new reader for the duration of the search) and scanning the file system for zims (each file with a zim extension temporarily creates a reader to parse the information needed to create a record in a DB). I was talking about a new app feature that is on the cards to allow multiple open Readers, but it may be irrelevant to the underlying data structures in the native code. The LRU cache was an idea to avoid excessive creation of readers and have X Readers readily available.
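The idea of keeping X Readers readily available could be sketched as a small LRU cache that disposes of the least recently used reader when it is full. The class name, the open_reader factory, and the capacity are assumptions for illustration. Such a cache only avoids re-creating reader objects; it has no control over the kernel's page cache.

```python
from collections import OrderedDict


class ReaderCache:
    """Keeps at most `capacity` readers open, evicting the least recently used."""

    def __init__(self, open_reader, capacity=3):
        self._open_reader = open_reader  # hypothetical factory: path -> reader
        self._capacity = capacity
        self._readers = OrderedDict()

    def get(self, path):
        if path in self._readers:
            # Mark as most recently used and reuse the existing reader.
            self._readers.move_to_end(path)
            return self._readers[path]
        reader = self._open_reader(path)
        self._readers[path] = reader
        if len(self._readers) > self._capacity:
            # Drop the oldest entry and release its resources.
            _, evicted = self._readers.popitem(last=False)
            evicted.dispose()
        return reader
```

Here dispose() mirrors the call the app already makes on the old reader when a new one is opened.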

@kelson42
Contributor

@mgautierfr Would you please be able to turn this ticket into actionable tickets?

@kelson42
Contributor

@veloman-yunkan @mgautierfr @macgills Kiwix Android 3.4.1 has brought significant improvements in terms of suggestion speed. But first reports seem to indicate that this might still be too slow... and we still have a serious performance problem with kiwix-serve. If the first feedback from 3.4.1 is confirmed, this ticket and its follow-up actionable tickets will move to the very top of the TODO list at the libzim level.

@kelson42 kelson42 changed the title Search suggestion performance analysis. Search suggestion performance analysis Sep 30, 2020
@kelson42 kelson42 assigned maneeshpm and unassigned kelson42 and veloman-yunkan Apr 7, 2021
@kelson42
Contributor

  • Opening the database and pre-setting up the enquire. Here we can improve things. We can assume that when a user opens a zim, they will search in it, so we can directly open the database and set up an enquire. It would allow us to save 1s.

@mgautierfr @maneeshpm I’m in favour of closing this ticket and opening a new one requesting a pre-setup of the enquires for both the title and the ft indexes. Good for you?

@maneeshpm
Contributor

@kelson42 I agree. We need to break this into actionable tickets, and the pre-setup of the enquire can be the first of these.

@kelson42
Contributor

@mgautierfr @maneeshpm I have opened #617 to propose pre-loading of the Xapian indexes/enquires. Closing this one.
