Commit 22f1674

Update tantivy architecture document
Parent: 4cd2e4c

2 files changed: +111 −38 lines

developer/src/tantivy/README.md

Lines changed: 2 additions & 0 deletions
Lines changed: 2 additions & 0 deletions

@@ -1,3 +1,5 @@

# Tantivy Architecture

Tantivy is a high-performance, full-text search engine library written in Rust. It is designed to be fast and reliable, making it suitable for use in search applications that require quick and efficient text indexing and retrieval. Here are some key features and characteristics of Tantivy:

1. **High Performance**: Tantivy is optimized for speed, leveraging Rust's performance characteristics to provide fast search capabilities.

developer/src/tantivy/tantivy_architecture.md

Lines changed: 109 additions & 38 deletions
@@ -1,4 +1,5 @@

## **Tantivy**
![Tantivy's logo](tantivy-logo.png)

Tantivy is a library for building search engines. Its architecture is strongly inspired by Lucene, and it focuses on full-text search: given a large set of text documents and a text query, it returns the K most relevant documents very efficiently. To execute queries quickly, Tantivy must build an index beforehand. The relevance score (from Okapi BM25, a best-match ranking function) is not configurable.
@@ -65,7 +66,7 @@ These normalization factors are typically derived based on the length of the fie

-> Schema definition of the index.
-> Document addition to the index as per the defined schema.
-> Tokenization and analysis of text fields as per the schema.
doc1 Body: "He was an old man who fished alone..." -> ["he", "was", "an", "old", "man", "who", "fished", "alone"]
-> Creation of the inverted index by mapping terms to the documents they appear in. Writing the .term and .idx files.
-> Creation of the fast fields, if required, and writing the .fast file.
-> Creation of the fieldnorms and writing the .fieldnorm files.
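To make the first two pipeline steps concrete, here is a minimal sketch of defining a schema and creating an index with tantivy's public API (the field names are illustrative, and exact signatures vary slightly across tantivy versions):

```rust
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::Index;

fn build_index() -> Index {
    // Schema definition: TEXT tokenizes the field and adds it to the
    // inverted index; STORED also keeps the raw value in the .store file.
    let mut schema_builder = Schema::builder();
    schema_builder.add_text_field("title", TEXT | STORED);
    schema_builder.add_text_field("body", TEXT);
    let schema = schema_builder.build();

    // Create the index from the schema. For an on-disk index, this is
    // where the .term/.idx/.fast/.fieldnorm files described above live.
    Index::create_in_ram(schema)
}
```

The default tokenizer chain lowercases terms, which is why "He was an old man..." is indexed as ["he", "was", "an", "old", "man", ...].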
@@ -96,59 +97,57 @@ When a user submits a query, it needs to be transformed into a format that the s

2. Query Parsing

The query parser takes the user's query and converts it into a structured query. This involves several sub-steps:

a. Tokenization: Breaking the query into individual terms.
b. Field Mapping: Mapping each term to the relevant fields in the schema (e.g., title, body).
c. Logical Operators: Determining how the terms should be combined (e.g., AND, OR).

> For example, the query "president obama" might be parsed as:
> (title:president OR body:president) AND (title:obama OR body:obama)
> Here, title and body are fields defined in the schema, and the query parser creates logical combinations of terms across these fields.
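A hedged sketch of this step using tantivy's `QueryParser` (assuming the `title` and `body` field handles from a schema like the one sketched earlier; whether terms are combined with AND or OR by default is configurable via `set_conjunction_by_default`):

```rust
use tantivy::query::{Query, QueryParser};
use tantivy::schema::Field;
use tantivy::Index;

fn parse_user_query(index: &Index, title: Field, body: Field) -> Box<dyn Query> {
    // Terms with no explicit field are expanded across both default
    // fields, mirroring (title:president OR body:president) etc.
    let parser = QueryParser::for_index(index, vec![title, body]);
    parser
        .parse_query("president obama")
        .expect("query should parse")
}
```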
3. Query Execution

This phase involves several steps to retrieve and score the relevant documents:

a. Document Retrieval

Using the parsed query, the search engine scans the term dictionaries and postings lists to find documents containing the query terms. This involves:

i) Term Lookup: For each term in the query, the engine looks it up in the term dictionary (.term file).
ii) Postings List Access: Retrieves the list of document IDs (DocIDs) that contain the term from the inverted index (.idx file).
iii) Position Information: Accesses the .pos file to get term positions within the documents, if positional data is relevant for the query (e.g., phrase queries).

b. Scoring

Documents are scored based on the relevance of the query terms within them. Tantivy uses the Okapi BM25 ranking function, which considers:

i) Term Frequency (TF): The number of times a term appears in a document.
ii) Inverse Document Frequency (IDF): A measure of how common or rare a term is across all documents.
iii) Field Normalization: Adjusts the term frequency based on the length of the field to prevent longer fields from having an undue advantage.

> For example, for a document containing "president" and "obama" in the title and body fields: calculate the TF and IDF for each term, then combine these with normalization factors to compute a BM25 score for the document.
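For reference, these three quantities combine in the textbook Okapi BM25 formula (the constants are fixed in tantivy, consistent with the score being non-configurable; $k_1 \approx 1.2$ and $b \approx 0.75$ are the usual defaults):

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}$$

where $f(q_i, D)$ is the frequency of term $q_i$ in document $D$, $|D|$ is the field length, and $\mathrm{avgdl}$ is the average field length across the corpus (the fieldnorm data described earlier).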
4. Document Deserialization
After retrieving the relevant documents, they need to be deserialized, which means converting the stored document data into a usable format.

a. Stored Fields Retrieval: Fields marked as stored in the schema (e.g., title, author) are retrieved from the .store file. This involves:

i) Reading the .store File: Extracting the field values based on document IDs.
ii) Constructing the Document: Building the full document from the retrieved fields.
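A minimal sketch of this retrieval step (the `DocAddress` comes from a search, as in the next section; note that the concrete document type is `Document` in older tantivy releases and `TantivyDocument` in newer ones):

```rust
use tantivy::schema::Field;
use tantivy::{DocAddress, Document, Searcher};

fn load_stored_doc(searcher: &Searcher, addr: DocAddress, title: Field) -> tantivy::Result<()> {
    // Read the stored field values for this DocAddress from the .store
    // file and rebuild a document from them.
    let doc: Document = searcher.doc(addr)?;
    if let Some(value) = doc.get_first(title) {
        println!("title: {:?}", value);
    }
    Ok(())
}
```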
5. Result Compilation

Once the documents are retrieved and scored, the search engine compiles the results:

i) Sorting: Sort the documents based on their BM25 scores, from highest to lowest.
ii) Filtering: Apply any additional filters specified in the query (e.g., date ranges, specific field values).
iii) Limiting: Return the top N results as specified by the query.

> Example: If the query requests the top 10 documents for "president obama", the search engine will sort all matched documents by their relevance scores and return the top 10.
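A sketch of this compile-and-limit step with tantivy's `TopDocs` collector, which performs the BM25 sort and top-N truncation in a single pass (reusing the index and parsed query from the sketches above):

```rust
use tantivy::collector::TopDocs;
use tantivy::query::Query;
use tantivy::Index;

fn top_ten(index: &Index, query: &dyn Query) -> tantivy::Result<()> {
    let reader = index.reader()?;
    let searcher = reader.searcher();

    // Retrieve the 10 highest-scoring documents, ordered by BM25 score.
    let top_docs = searcher.search(query, &TopDocs::with_limit(10))?;
    for (score, doc_address) in top_docs {
        println!("{score} -> {doc_address:?}");
    }
    Ok(())
}
```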
**Search Workflow Summary**

1. User Input: The user enters a query.
2. Query Parsing: The query parser converts the user's input into a structured query.
3. Index Scanning: The search engine scans the index to find documents matching the query terms.
4. Scoring: Documents are scored based on relevance using BM25.
5. Deserialization: Relevant documents are deserialized from the stored fields.
6. Result Compilation: The top N results are sorted and returned to the user.

*Example Walkthrough*

Let’s walk through a detailed example of the query "old man":

*Query Construction:*
@@ -186,3 +185,75 @@ Result Compilation:

> Sort documents by BM25 scores. Return the top N results to the user.
## **Adding Documents to the Index**

1. **Document Structure**:
   - A document in Tantivy is a collection of fields, where each field can hold different types of data (e.g., text, numeric, date).
2. **Schema Definition**:
   - Before adding documents, a schema must be defined. This schema specifies the fields, their types, and how they should be indexed (e.g., whether they are tokenized, stored, etc.).
3. **Creating the Index**:
   - An index is created using the schema. The index acts as a container for the documents and allows for efficient querying.
4. **IndexWriter**:
   - To add documents, an `IndexWriter` is used. This is the component responsible for making changes to the index.
5. **Adding a Document**:
   - A document is created by populating it with fields and their values.
   - The document is then added to the `IndexWriter` using the `add_document` method (see the sketch after this list).
6. **Committing Changes**:
   - Changes made by the `IndexWriter` are buffered in memory.
   - To persist these changes to the index, the `commit` method is called. This writes the buffered documents to disk, making them part of the index.
7. **Segment Creation**:
   - Documents are grouped into segments. A segment is a subset of the index that can be independently searched.
   - When a commit is performed, a new segment is created if there are new documents.
8. **Merge Policy**:
   - Over time, multiple segments are created. Tantivy uses a merge policy to combine smaller segments into larger ones, improving search efficiency.
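A minimal end-to-end sketch of steps 2–6 (the 50 MB indexing heap and the field values are illustrative; in recent tantivy versions `add_document` returns a `Result`):

```rust
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index};

fn add_and_commit() -> tantivy::Result<()> {
    // Steps 2-3: define the schema and create the index.
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let body = schema_builder.add_text_field("body", TEXT);
    let index = Index::create_in_ram(schema_builder.build());

    // Step 4: the IndexWriter buffers added documents in memory.
    let mut writer = index.writer(50_000_000)?;

    // Step 5: populate a document and hand it to the writer.
    writer.add_document(doc!(
        title => "The Old Man and the Sea",
        body => "He was an old man who fished alone in a skiff"
    ))?;

    // Step 6: commit flushes the buffer to disk, creating a new segment.
    writer.commit()?;
    Ok(())
}
```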
## Deleting Documents from the Index

1. **Identifying Documents for Deletion**:
   - Documents are identified for deletion using a unique identifier (typically a field marked as the primary key).
2. **Deletion Marker**:
   - Instead of physically removing a document, Tantivy marks it as deleted. This is done using a deletion marker.
3. **Using the IndexWriter**:
   - The `IndexWriter` provides a method to mark documents for deletion. For instance, the `delete_term` method marks all documents containing a specific term (e.g., a unique identifier) as deleted (see the sketch after this list).
4. **Logical Deletion**:
   - Deletions are logical rather than physical, meaning the document is still present in the index files but marked as deleted and excluded from search results.
5. **Commit Changes**:
   - As with adding documents, deletions are buffered in memory. The `commit` method must be called to persist them to the index.
6. **Garbage Collection**:
   - During segment merges, deleted documents are physically removed from the new merged segments. This process is part of Tantivy’s garbage-collection mechanism to clean up deleted documents and optimize storage.
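A sketch of the deletion flow (assuming an `id` field that serves as the unique identifier; deletions become visible to searchers opened after the commit):

```rust
use tantivy::schema::Field;
use tantivy::{IndexWriter, Term};

fn delete_by_id(writer: &mut IndexWriter, id_field: Field, id: &str) -> tantivy::Result<()> {
    // Logical deletion: mark every document whose `id` field contains
    // this term as deleted; the data stays in the segment files until
    // a merge garbage-collects it.
    writer.delete_term(Term::from_field_text(id_field, id));

    // Like an add, the deletion must be committed to take effect.
    writer.commit()?;
    Ok(())
}
```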
## Benchmarks

Multiple benchmarks that use Tantivy and compare it against other search engine technologies are already available. One such benchmark, comparing Tantivy with Lucene and PISA, can be found at:

> https://tantivy-search.github.io/bench/

We conducted our own set of basic benchmarks for some use cases.

| Test/Search | Total Files | Total Size | Total Time |
|--|--|--|--|
| Index files | 1061 | 32.5 GB | 127.7 sec |
| Search term | 1061 | 32.5 GB | 19.7 ms |
| Search phrase | 1061 | 32.5 GB | 8.2 ms |
| Search regex | 1061 | 32.5 GB | 45.67 ms |
| Add document | 1062 | 32.5 GB | 5.7 sec |
| Delete document | 1061 | 32.5 GB | 190.78 ms |
| Index files | 5.1 Million | 11 GB | 04:36:35 hrs |
| Search term | 5.1 Million | 11 GB | 86.2 ms |
| Search phrase | 5.1 Million | 11 GB | 98.6 ms |
| Search regex | 5.1 Million | 11 GB | 166.74 ms |
