# OpenNTF Hyper Search
Note: this architecture was inspired by [DigestSearch](http://www.ibm.com/developerworks/lotus/library/digestsearch/) by Andrei Kouvchinnikov
Currently in the nathan branch of the OpenNTF API is a new set of classes in the org.openntf.domino.big.impl package. (The .big package is intended for dealing with data sets that span multiple NSFs.) These classes are IndexDatabase, IndexQuery, IndexHit and IndexResults. Here's how it works...
The IndexDatabase is a Java class that holds a reference to an org.openntf.domino.Database object. That Database is presumed to have two basic views that show two kinds of documents: Db documents and Term documents. These views are extremely simple, each having only one column that displays a key value in ascending order. If the views are not present the first time the IndexDatabase is used, they are added automatically.
The IndexDatabase then has a .scanServer(Session session, String serverName) method. When this method is called, the IndexDatabase uses the session to get a DbDirectory on the specified server and then begins iterating over the directory. For each Database, it creates an org.openntf.domino.helpers.DocumentScanner and loops over all the Documents in the Database, processing each through the DocumentScanner.
The DocumentScanner has been enhanced in two areas: first, it has greatly improved text-scanning capabilities to identify alphanumeric term roots; and second, it maintains a new data structure called the TokenLocationMap. The TokenLocationMap is a Map<CaseInsensitiveString, Map<CaseInsensitiveString, Set<String>>>, where the outer Map's key is an individual word found in the Document's contents and the inner Map's key is the name of the Item in which that term was found. The Set<String> values in the inner Map are HashSets of the following concatenated values from the Document: .getUniversalID() + .hasReaders() + .getFormName(). (.hasReaders() is recorded as a String "1" or "0" depending on whether any field on the Document has a ReaderNames flag; .getFormName() returns the value of the "Form" item in the Document.)
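A minimal sketch of that structure may help. This is not the DocumentScanner itself: CaseInsensitiveString is approximated here by lower-casing plain Strings, the tokenizer is a naive split rather than real term-root extraction, and the class and method names are illustrative only.

```java
import java.util.*;

// Simplified sketch of the TokenLocationMap structure:
// term -> (Item name -> set of "UNID + readersFlag + formName" strings)
public class TokenLocationMapSketch {
    public static Map<String, Map<String, Set<String>>> scan(
            String unid, boolean hasReaders, String formName,
            Map<String, String> items) {
        Map<String, Map<String, Set<String>>> tokenMap = new HashMap<>();
        // The overloaded location value: UNID, then "1"/"0", then Form name
        String location = unid + (hasReaders ? "1" : "0") + formName;
        for (Map.Entry<String, String> item : items.entrySet()) {
            for (String token : item.getValue().split("\\W+")) {
                if (token.isEmpty()) continue;
                tokenMap.computeIfAbsent(token.toLowerCase(), t -> new HashMap<>())
                        .computeIfAbsent(item.getKey(), i -> new HashSet<>())
                        .add(location);
            }
        }
        return tokenMap;
    }
}
```

Scanning a Document whose "Subject" item contains "Hyper Search" would produce entries for "hyper" and "search", each pointing back at the Subject item and the packed location value.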
Once the DocumentScanner has built the TokenLocationMap for all the Documents, the IndexDatabase walks the Map, taking each CaseInsensitiveString key and doing .getDocumentByKey(key, true) on the IndexDatabase's storage database. Since .getDocumentByKey converts the key argument to an MD5 hash value, and that hash is used as the UNID of the resulting Document, retrieval times are extremely fast. The resulting Document is known as the TermDoc.
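The key-to-UNID trick works because an MD5 digest is 16 bytes, which renders as exactly 32 hex characters, the same length as a Notes Universal ID. A sketch of that conversion, with an illustrative method name (the API's actual helper may differ):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class KeyHash {
    // Hash a term key to a 32-character hex string usable as a UNID.
    // Lower-casing first makes the lookup case-insensitive.
    public static String toUnid(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.toLowerCase().getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(32);
            for (byte b : digest) sb.append(String.format("%02X", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }
}
```

Because the same term always hashes to the same UNID, the TermDoc can be fetched directly by UNID with no view lookup at all.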
Once we have the TermDoc, we take the inner Map from our TokenLocationMap and write it to the TermDoc as a MIMEEntity using Java serialization. The Item name for the MIMEEntity is "TermMap_" + the replica ID of the Database that was just scanned. Thus a TermDoc can have 1 to n Items, one for each replica ID in which the term was found. Each of those Items can be deserialized back into a Map of the Item names and location values in which the term was found; the first 32 characters of each String in a value Set are the UNID of a matching Document. Each UNID is also overloaded with a flag indicating the presence of a ReaderNames list, and with the Form assigned to that Document in the target Database.
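Unpacking one of those overloaded values is straightforward: the first 32 characters are the UNID, the next character is the ReaderNames flag, and whatever remains is the Form name. A sketch, with an illustrative class name that is not part of the API:

```java
// Decode one overloaded location value from a deserialized TermMap entry.
public class LocationValue {
    public final String unid;
    public final boolean hasReaders;
    public final String form;

    public LocationValue(String packed) {
        this.unid = packed.substring(0, 32);        // 32-char Universal ID
        this.hasReaders = packed.charAt(32) == '1'; // "1" = ReaderNames present
        this.form = packed.substring(33);           // remainder is the Form name
    }
}
```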
As a result, we can do some very interesting things with term searching. Given any single word, we can find every known location of that word by accessing a single Document in a single NSF. Once we've accessed that Document, we can easily identify the Databases in which the word was found, along with the number of Documents containing that term. Once we've deserialized the MIMEEntities, we can identify the unique fields in which the term occurs, along with the number of Documents that match each field. We can also consolidate the Form names to identify the number of Documents that match each type of Form.
To accomplish this, we use the IndexQuery object. An IndexQuery can do two basic types of term search: OR and AND. An AND search means that the results must have ALL the terms being queried, while the OR search means that the results must have ANY of the terms. Currently the term searches cannot process more sophisticated syntaxes, but this is more a matter of developing a parser for the search expression than a limitation of the data structure.
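The two merge strategies reduce to set union and set intersection. A sketch, modeling per-term hit sets as sets of UNID Strings rather than IndexHit objects (class and method names are illustrative):

```java
import java.util.*;

public class QueryMerge {
    // OR: a Document matches if it appears in ANY term's hit set.
    public static Set<String> or(List<Set<String>> perTermHits) {
        Set<String> result = new HashSet<>();
        for (Set<String> hits : perTermHits) result.addAll(hits);
        return result;
    }

    // AND: a Document matches only if it appears in ALL terms' hit sets.
    public static Set<String> and(List<Set<String>> perTermHits) {
        if (perTermHits.isEmpty()) return new HashSet<>();
        Set<String> result = new HashSet<>(perTermHits.get(0));
        for (Set<String> hits : perTermHits.subList(1, perTermHits.size()))
            result.retainAll(hits);
        return result;
    }
}
```

A richer query parser would simply compose these two operations over a parse tree, which is why the limitation is in the parser rather than the data structure.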
When IndexQuery.execute(IndexDatabase db) is called, it returns an IndexResults object. The IndexResults has a few key methods, like .getTerms(), .getDbids(), .getItems() and .getForms(). Each of these returns a Map<String, AtomicInteger> where the String key is the corresponding term, replica ID, Item name or Form name, and the AtomicInteger value is the number of Documents matching that criterion.
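Building those count Maps is a simple tally over the hits. A sketch of the pattern for Form names (the same shape applies to terms, replica IDs and Item names; the class name is illustrative):

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;

public class ResultTally {
    // Count how many hits carry each Form name.
    public static Map<String, AtomicInteger> tally(List<String> formNames) {
        Map<String, AtomicInteger> counts = new HashMap<>();
        for (String form : formNames)
            counts.computeIfAbsent(form, f -> new AtomicInteger()).incrementAndGet();
        return counts;
    }
}
```

AtomicInteger is a convenient mutable counter here; it also stays safe if several search processor threads contribute to the same Map.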
The IndexResults also has a .getHits() method, which returns a List<IndexHit>. An IndexHit is basically a metadata wrapper for a Document, indicating the replica ID, UNID, Form name, Item name and term matched to cause this hit to be included. The IndexHit also has a .getDocument(Session session, String serverName) method that returns the Document corresponding to the hit. In the near future, .getHits() will also have built-in paging options, and will accept arguments to filter out Documents that the current Session cannot access due to ReaderNames items.
Finally, the IndexDatabase object also has a .getTermStarts(String startsWith, int resultCount) method. This method uses the $TermIndex view in the IndexDatabase's storage NSF to return the known terms matching the startsWith string (up to the number in resultCount, of course). Thus we can easily provide typeahead in XPages search inputs for all terms included in the index, which could cover 1 to N servers in a domain.
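The view lookup behaves like a prefix scan over a sorted term list: jump to the first term at or after the prefix, then read until the prefix no longer matches. A sketch using an in-memory sorted set in place of the $TermIndex view (the API's actual implementation reads the view instead):

```java
import java.util.*;

public class TermStarts {
    // Return up to resultCount terms that begin with startsWith.
    public static List<String> getTermStarts(NavigableSet<String> terms,
                                             String startsWith, int resultCount) {
        List<String> result = new ArrayList<>();
        // tailSet jumps straight to the first candidate in sorted order.
        for (String term : terms.tailSet(startsWith, true)) {
            if (!term.startsWith(startsWith) || result.size() >= resultCount) break;
            result.add(term);
        }
        return result;
    }
}
```

Because the scan stops as soon as the prefix fails, the cost depends on resultCount rather than the total number of terms, which is what makes typeahead over tens of thousands of terms practical.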
So at this point, you might be asking "well, that sounds like a lot of work. How long does all this stuff take?" Current benchmarks indicate that the index construction process (mapping out all the terms in an NSF and writing the Maps to the IndexDatabase) takes between 60 and 90 seconds for a database like events4.nsf, depending on CPU and disk I/O contention. Once the IndexDatabase has been built, the $TermIndex view can provide around 30 search term matches out of a set of 32,000 terms in around 100ms. Because the total list of terms appearing in all NSFs across a server or domain is relatively stable, updates to this index are not expected to be frequent.
Once search terms have been identified, getting the actual IndexResults is quite fast. Current benchmarking shows that term searches with more than 1000 hits can be served in about 40ms, and multi-term searches, in which the resulting IndexHit Lists must be merged or intersected, add less than 5ms per transaction. What's more, a term search not only returns the number of overall matches; filtering the results on one or more Databases, Items or Forms is also extremely fast, consistently under 100ms. As a result, it is possible to treat the searching and filtering as real-time transactions against a server, triggered by onChange events rather than explicit search requests, with potential multi-threading of search processors.
These results could be further optimized by scope caching, REST-service-based transactions and XPages lifecycle optimization. The objective of the current capabilities is to provide enough speed to fundamentally change the user's perception of XPages searching across multiple NSFs, and we think this objective has been achieved.