make index database "optimization" explicit operation #3983
Conversation
After running the tests without the segment count reduction, [...]

```java
Iterator<T> iterator = SomeType.iterator();
while (iterator.hasNext()) {
    // Move the cursor to the next element.
    T data = iterator.next();
    // ... process data
}
```

or something like:

```java
Iterator<T> iterator = SomeType.iterator();
T data;
while ((data = iterator.next()) != null) {
    // ... process data
}
```

however in this case the pattern was: opengrok/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java, lines 1791 to 1808 in 9bbba82
For some reason, when an index has multiple segments, the first call to [...]. Now, opengrok/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java, lines 634 to 635 in 9bbba82
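For reference, the number of segments in an index can be inspected with plain Lucene by counting the reader's leaves, since each leaf of a DirectoryReader corresponds to one segment. This is only an illustrative sketch; the index path is a placeholder, not OpenGrok's actual layout:

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;

public class SegmentCountSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder path to an existing Lucene index directory.
        try (FSDirectory dir = FSDirectory.open(Paths.get("/var/opengrok/data/index/project"));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            // Each leaf reader corresponds to one index segment.
            System.out.println("segments: " + reader.leaves().size());
        }
    }
}
```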
After the [...]
Looking at the [...]
Also, I noticed that the [...] it does not seem to be used.
Also, actually use the dirty flag to emit a warning message.
This change removes the automatic index "optimization" (a.k.a. reduction of the segment count of a given index to 1) at the end of indexing. The outcome is reduced indexing time: indexing the Linux kernel repository with history off takes 6:49 minutes, of which 50 seconds is the "optimization" step; with the step off, the indexing time is 6:03 minutes, which means a 12% improvement. On the other hand, it might theoretically slow down the search a bit, and the index size might increase: in a multi-segment index, each segment typically contains deleted documents, and these are expunged on the segment merges that happen during indexing. Comparing the size of the Linux kernel repository index without and with the change, there is no significant difference: in both cases the index directory had 1.5 GB (the real difference was some 45 MB), and there were 16 segments at the end of indexing in the changed case. The number of segments is not the same for each reindex of the same data due to the parallel nature of the indexing.
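For context, the removed "optimization" step corresponds to Lucene's forced merge down to a single segment. A minimal sketch of what that step amounts to, assuming plain Lucene access to an index directory; the path and analyzer are placeholders, not OpenGrok's actual wiring:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class ForceMergeSketch {
    public static void main(String[] args) throws Exception {
        // Open an existing index directory (placeholder path).
        try (FSDirectory dir = FSDirectory.open(Paths.get("/var/opengrok/data/index/project"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // The former automatic "optimization": merge all segments into one.
            // This also expunges deleted documents, but can take a long time on large indexes.
            writer.forceMerge(1);
        }
    }
}
```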
While this might seem like a step back to less optimal handling, it merely normalizes the situation: the Lucene documentation considers the reduction to a single segment extreme and useful only for long-term index archival. The other alternative would be to reduce the segments to some other number; however, that would be highly dependent on the given index. To address the search latency, I plan to submit a PR that will create IndexSearcher objects with dedicated executors so that index segments can be searched in parallel.
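As a rough illustration of that plan (not the actual PR), Lucene's IndexSearcher can be constructed with an Executor so that a query is fanned out across the index segments; the index path and thread pool size below are placeholder assumptions:

```java
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class ParallelSearchSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(4); // pool size is a placeholder
        try (FSDirectory dir = FSDirectory.open(Paths.get("/var/opengrok/data/index/project"));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            // The executor lets the searcher work on the index segments concurrently.
            IndexSearcher searcher = new IndexSearcher(reader, executor);
            TopDocs hits = searcher.search(new MatchAllDocsQuery(), 10);
            System.out.println("total hits: " + hits.totalHits);
        } finally {
            executor.shutdown();
        }
    }
}
```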
I chose a somewhat more aggressive approach: instead of setting the default to false, this is now an explicit operation invoked via the --reduceSegmentCount indexer option. That is still a softer approach than ripping the functionality out altogether. A limitation of the way this option is handled is that, when used together with an index database update, it reduces the segment count for all index databases, not just those that changed. It can be run as a standalone operation when combined with the -n/--noIndex option.

That said, I think the reduceSegmentCount functionality should eventually be replaced by tunables that control the Lucene segment merge policy type and its parameters.
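To make that idea concrete, here is a sketch of the kind of tunables meant, assuming they would map onto Lucene's TieredMergePolicy; the specific setters and values shown are illustrative, not existing indexer options:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class MergePolicySketch {
    static IndexWriterConfig configure() {
        TieredMergePolicy mergePolicy = new TieredMergePolicy();
        // Hypothetical tunables the indexer could expose instead of reduceSegmentCount:
        mergePolicy.setSegmentsPerTier(5.0);        // fewer segments per tier means more merging
        mergePolicy.setMaxMergedSegmentMB(2048.0);  // cap the size of merged segments
        mergePolicy.setDeletesPctAllowed(20.0);     // merge more eagerly when deletes pile up

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setMergePolicy(mergePolicy);
        return config;
    }
}
```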