## **Tantivy**

Tantivy is a library for building search engines, with an architecture strongly inspired by Lucene. It focuses on full-text search: given a large set of text documents and a text query, it returns the N most relevant documents in a very efficient way. To execute queries quickly, Tantivy needs to build an index beforehand. The relevance score (from Okapi BM25, a best-matching ranking function) is non-configurable.
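To make this concrete, here is a minimal end-to-end sketch using Tantivy's Rust API. The field names, sample text, and the 50 MB writer heap budget are illustrative, and exact method signatures may vary slightly between tantivy versions:

```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    // Define the schema: an indexed+stored title and an indexed body.
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let body = schema_builder.add_text_field("body", TEXT);
    let index = Index::create_in_ram(schema_builder.build());

    // Build the index: documents are buffered in a 50 MB heap until commit.
    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(
        title => "The Old Man and the Sea",
        body => "He was an old man who fished alone..."
    ))?;
    writer.commit()?;

    // Query the index and fetch the 10 most relevant documents by BM25 score.
    let searcher = index.reader()?.searcher();
    let query_parser = QueryParser::for_index(&index, vec![title, body]);
    let query = query_parser.parse_query("old man")?;
    for (score, _address) in searcher.search(&query, &TopDocs::with_limit(10))? {
        println!("BM25 score: {score}");
    }
    Ok(())
}
```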
-> Schema definition of the index (see the schema sketch after this list).
-> Document addition to the index as per the defined schema.
-> Tokenization and analysis of text fields as per the schema, e.g.:

    doc1 Body: "He was an old man who fished alone..." -> ["he", "was", "an", "old", "man", "who", "fished", "alone"]

-> Creation of the inverted index by mapping terms to the documents they appear in, and writing the .term and .idx files.
-> Creation of the fast fields, if required, and writing the .fast file.
-> Creation of the fieldnorms and writing the .fieldnorm files.
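As a rough illustration of the schema and fast-field steps, a tantivy schema declares per field whether it is tokenized, stored, or fast. The field names below are hypothetical:

```rust
use tantivy::schema::{Schema, FAST, STORED, STRING, TEXT};

fn build_schema() -> Schema {
    let mut builder = Schema::builder();
    // TEXT: tokenized and indexed (feeds the .term/.idx/.pos files);
    // STORED: the raw value is kept in the .store file for retrieval.
    builder.add_text_field("title", TEXT | STORED);
    builder.add_text_field("body", TEXT);
    // STRING: indexed as a single untokenized term (useful as a unique id).
    builder.add_text_field("doc_id", STRING | STORED);
    // FAST: column-oriented fast field, written to the .fast file.
    builder.add_u64_field("timestamp", FAST);
    builder.build()
}
```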
2. Query Parsing
The query parser takes the user's query and converts it into a structured query. This involves several sub-steps:

    a. Tokenization: Breaking the query into individual terms.
    b. Field Mapping: Mapping each term to the relevant fields in the schema (e.g., title, body).
    c. Logical Operators: Determining how the terms should be combined (e.g., AND, OR).

> For example, the query "president obama" might be parsed as:
> (title:president OR body:president) AND (title:obama OR body:obama)
> Here, title and body are fields defined in the schema, and the query parser creates logical combinations of terms across these fields.
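A hedged sketch of this step with tantivy's QueryParser. By default the parser combines terms with OR; `set_conjunction_by_default` switches to AND, mirroring the AND-of-ORs example above. Field names are illustrative:

```rust
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, TEXT};
use tantivy::Index;

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT);
    let body = schema_builder.add_text_field("body", TEXT);
    let index = Index::create_in_ram(schema_builder.build());

    // Search both fields by default; each query term is tokenized and
    // expanded into one clause per field.
    let mut query_parser = QueryParser::for_index(&index, vec![title, body]);
    // Require every term to match (AND instead of the default OR).
    query_parser.set_conjunction_by_default();
    let query = query_parser.parse_query("president obama")?;

    // Debug-printing reveals the boolean structure the parser built.
    println!("{query:?}");
    Ok(())
}
```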

3. Query Execution
This phase involves several steps to retrieve and score the relevant documents:

    a. Document Retrieval
    Using the parsed query, the search engine scans the term dictionaries and postings lists to find documents containing the query terms. This involves:
        i) Term Lookup: For each term in the query, the engine looks it up in the term dictionary (.term file).
        ii) Postings List Access: Retrieves the list of document IDs (DocIDs) that contain the term from the inverted index (.idx file).
        iii) Position Information: Accesses the .pos file to get term positions within the documents, if positional data is relevant for the query (e.g., phrase queries).

    b. Scoring
    Documents are scored based on the relevance of the query terms within them. Tantivy uses the Okapi BM25 ranking function, which considers:
        i) Term Frequency (TF): The number of times a term appears in a document.
        ii) Inverse Document Frequency (IDF): A measure of how common or rare a term is across all documents.
        iii) Field Normalization: Adjusts the term frequency based on the length of the field to prevent longer fields from having an undue advantage.

> For example, for a document containing "president" and "obama" in the title and body fields, the engine calculates the TF and IDF of each term and combines them with the normalization factors to compute a BM25 score for the document.
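For reference, the textbook form of the Okapi BM25 formula combines exactly these three signals (the constants k1 and b are tuning parameters, commonly around 1.2 and 0.75; tantivy's exact defaults are an implementation detail):

```latex
\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D) \cdot (k_1 + 1)}
       {f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```

Here f(q_i, D) is the term frequency of q_i in document D, |D| is the field length in tokens, and avgdl is the average field length across the collection; the (1 - b + b * |D|/avgdl) factor is the field normalization encoded by the fieldnorms.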

4. Document Deserialization
After retrieving the relevant documents, they need to be deserialized, which means converting the stored document data into a usable format.

    a. Stored Fields Retrieval
    Fields marked as stored in the schema (e.g., title, author) are retrieved from the .store file. This involves:

        i) Reading the .store File: Extracting the field values based on document IDs.
        ii) Constructing the Document: Building the full document from the retrieved fields.
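A hedged sketch of this step (the accessor names follow the tantivy 0.21-style API; later releases rename `Document` to `TantivyDocument` and make `Searcher::doc` generic):

```rust
use tantivy::schema::Field;
use tantivy::{DocAddress, Document, Searcher};

// Deserialize a hit's stored fields from the .store file and pull out
// the stored `title` value, if it exists.
fn fetch_title(
    searcher: &Searcher,
    address: DocAddress,
    title: Field,
) -> tantivy::Result<Option<String>> {
    let doc: Document = searcher.doc(address)?;
    Ok(doc
        .get_first(title)
        .and_then(|value| value.as_text())
        .map(str::to_string))
}
```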
5. Result Compilation
Once the documents are retrieved and scored, the search engine compiles the results:

    i) Sorting: Sort the documents based on their BM25 scores, from highest to lowest.
    ii) Filtering: Apply any additional filters specified in the query (e.g., date ranges, specific field values).
    iii) Limiting: Return the top N results as specified by the query.

> For example, if the query requests the top 10 documents for "president obama", the search engine will sort all matched documents by their relevance scores and return the top 10.
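In tantivy, this sort-and-limit step is handled by a collector. A minimal sketch, with an illustrative pagination helper built on `TopDocs` (`and_offset` skips over earlier pages):

```rust
use tantivy::collector::TopDocs;
use tantivy::query::Query;
use tantivy::{DocAddress, Score, Searcher};

// Sort matches by descending BM25 score and keep one page of results.
// `page` and `page_size` are illustrative parameters of this sketch.
fn top_page(
    searcher: &Searcher,
    query: &dyn Query,
    page: usize,
    page_size: usize,
) -> tantivy::Result<Vec<(Score, DocAddress)>> {
    let collector = TopDocs::with_limit(page_size).and_offset(page * page_size);
    searcher.search(query, &collector)
}
```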

**Search Workflow Summary**
1. User Input: The user enters a query.
2. Query Parsing: The query parser converts the user's input into a structured query.
3. Index Scanning: The search engine scans the index to find documents matching the query terms.
4. Scoring: Documents are scored based on relevance using BM25.
5. Deserialization: Relevant documents are deserialized from the stored fields.
6. Result Compilation: The top N results are sorted and returned to the user.

*Example Walkthrough*
Let’s walk through a detailed example of the query "old man":

*Query Construction:*

*Result Compilation:*

> Sort documents by BM25 scores. Return the top N results to the user.

## **Adding Documents to the Index**

1. **Document Structure**:
   - A document in Tantivy is a collection of fields, where each field can hold different types of data (e.g., text, numeric, date).
2. **Schema Definition**:
   - Before adding documents, a schema must be defined. This schema specifies the fields, their types, and how they should be indexed (e.g., whether they are tokenized, stored, etc.).
3. **Creating the Index**:
   - An index is created using the schema. The index acts as a container for the documents and allows for efficient querying.
4. **IndexWriter**:
   - To add documents, an `IndexWriter` is used. This is the component responsible for making changes to the index.
5. **Adding a Document**:
   - A document is created by populating it with fields and their values.
   - The document is then added to the `IndexWriter` using the `add_document` method (see the sketch after this list).
6. **Committing Changes**:
   - Changes made by the `IndexWriter` are buffered in memory.
   - To persist these changes to the index, the `commit` method is called. This writes the buffered documents to disk, making them part of the index.
7. **Segment Creation**:
   - Documents are grouped into segments. A segment is a subset of the index that can be independently searched.
   - When a commit is performed, a new segment is created if there are new documents.
8. **Merge Policy**:
   - Over time, multiple segments are created. Tantivy uses a merge policy to combine smaller segments into larger ones, improving search efficiency.

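A hedged sketch of the add-and-commit flow (the directory path and field names are illustrative; `create_in_dir` assumes the directory already exists and does not yet contain an index):

```rust
use tantivy::schema::{Schema, STORED, STRING, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    // A unique, untokenized id field makes later deletions possible.
    let mut schema_builder = Schema::builder();
    let doc_id = schema_builder.add_text_field("doc_id", STRING | STORED);
    let body = schema_builder.add_text_field("body", TEXT);
    let index = Index::create_in_dir("/tmp/tantivy-demo", schema_builder.build())?;

    let mut writer = index.writer(50_000_000)?; // 50 MB indexing buffer

    // The document is buffered in memory until commit.
    writer.add_document(doc!(
        doc_id => "doc-1",
        body => "He was an old man who fished alone..."
    ))?;

    // commit() flushes the buffered documents to disk as a new segment.
    writer.commit()?;
    Ok(())
}
```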
## Deleting Documents from the Index

1. **Identifying Documents for Deletion**:
   - Documents are identified for deletion using a unique identifier (typically a field marked as the primary key).
2. **Deletion Marker**:
   - Instead of physically removing a document, Tantivy marks it as deleted. This is done using a deletion marker.
3. **Using the IndexWriter**:
   - The `IndexWriter` provides a method to mark documents for deletion. For instance, the `delete_term` method is used to mark all documents containing a specific term (e.g., a unique identifier) as deleted (see the sketch after this list).
4. **Logical Deletion**:
   - Deletions are logical rather than physical, meaning the document is still present in the index files but marked as deleted and excluded from search results.
5. **Commit Changes**:
   - As with adding documents, deletions are also buffered in memory. The `commit` method must be called to persist these deletions to the index.
6. **Garbage Collection**:
   - During segment merges, deleted documents are physically removed from the new merged segments. This process is part of Tantivy’s garbage collection mechanism to clean up deleted documents and optimize storage.

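A hedged sketch of the delete flow, reusing the hypothetical `doc_id` field from the sketch above (the non-generic `IndexWriter` follows the tantivy 0.21-style API):

```rust
use tantivy::schema::Field;
use tantivy::{Index, Term};

// Mark every document whose `doc_id` equals `id` as deleted.
// The deletion is logical: the data stays on disk until segments merge.
fn delete_by_id(index: &Index, doc_id: Field, id: &str) -> tantivy::Result<()> {
    let mut writer = index.writer(50_000_000)?;
    writer.delete_term(Term::from_field_text(doc_id, id));
    writer.commit()?; // deletions are buffered too; commit persists them
    Ok(())
}
```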

## Benchmarks
There are already multiple benchmarks that use Tantivy and compare it against other search engine technologies. One such benchmark, which compares Tantivy with Lucene and PISA, can be found at the link below.

> https://tantivy-search.github.io/bench/

We also conducted our own set of basic benchmarks for some use cases.

| Test/Search | Total Files | Total Size | Total Time |
|--|--|--|--|
| Index files | 1061 | 32.5 GB | 127.7 sec |
| Search term | 1061 | 32.5 GB | 19.7 ms |
| Search phrase | 1061 | 32.5 GB | 8.2 ms |
| Search regex | 1061 | 32.5 GB | 45.67 ms |
| Add document | 1062 | 32.5 GB | 5.7 sec |
| Delete document | 1061 | 32.5 GB | 190.78 ms |
| Index files | 5.1 million | 11 GB | 4:36:35 hrs |
| Search term | 5.1 million | 11 GB | 86.2 ms |
| Search phrase | 5.1 million | 11 GB | 98.6 ms |
| Search regex | 5.1 million | 11 GB | 166.74 ms |