Commit 2fda0e0

docs: Ngram Index (#2104)
* ngram ddl
* Ngram Index guide
* Update ngram-index.md

---------

Co-authored-by: z <[email protected]>
1 parent b7edd20 commit 2fda0e0

7 files changed, +279 -4 lines changed

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
---
title: Ngram Index
---

import EEFeature from '@site/src/components/EEFeature';

<EEFeature featureName='NGRAM INDEX'/>

The Ngram Index is a specialized indexing technique that improves the performance of pattern matching queries using the `LIKE` operator with the `%` wildcard. These queries are common in applications that require substring or fuzzy matching, such as searching for keywords within product descriptions, user comments, or log data.

Unlike traditional indexes, which are typically ineffective when the search pattern does not have a fixed prefix (e.g., `LIKE '%keyword%'`), the Ngram Index breaks down text into overlapping substrings (n-grams) and indexes them for fast lookup. This allows Databend to narrow down matching rows efficiently, avoiding costly full table scans.

## How Ngram Index Works

Ngram Index in Databend is built using character-level n-grams. When a column is indexed, its text content is treated as a continuous sequence of characters, including letters, spaces, and punctuation. The text is then split into all possible overlapping substrings of a fixed length, defined by the `gram_size` parameter.

For example, with `gram_size = 3`, the string:

```text
The quick brown
```

will be split into the following 3-character substrings:

```text
"The", "he ", "e q", " qu", "qui", "uic", "ick", "ck ", "k b", " br", "bro", "row", "own"
```
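
The splitting step can be sketched in a few lines of Python. This is only an illustration of character-level n-grams as described here, not Databend's implementation; the function name `char_ngrams` is hypothetical.

```python
def char_ngrams(text: str, gram_size: int = 3) -> list[str]:
    """Return every overlapping substring of length gram_size, in order."""
    if gram_size <= 0:
        raise ValueError("gram_size must be positive")
    # A text shorter than gram_size yields an empty list.
    return [text[i:i + gram_size] for i in range(len(text) - gram_size + 1)]


grams = char_ngrams("The quick brown", gram_size=3)
print(len(grams), grams)
```

A string of length `L` yields `L - gram_size + 1` n-grams, so the 15-character example produces 13 trigrams.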

These substrings are stored in the index and used to accelerate pattern matching in queries using the `LIKE` operator.
When a query such as:

```sql
SELECT * FROM t WHERE content LIKE '%quick br%'
```

is issued, the literal part of the pattern, `quick br`, is also tokenized into trigrams: "qui", "uic", "ick", "ck ", "k b", and " br". Databend uses these to filter data blocks via the n-gram index before applying the full `LIKE` filter, so blocks that cannot match are skipped instead of scanned.
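
The filtering step can be sketched as a set-containment check. This is a simplified model under stated assumptions: it keeps an exact set of n-grams per block (a real index would use a compact probabilistic structure), and the names `char_ngrams` and `candidate_blocks` are hypothetical.

```python
def char_ngrams(text: str, gram_size: int = 3) -> set[str]:
    """Set of all overlapping substrings of length gram_size."""
    return {text[i:i + gram_size] for i in range(len(text) - gram_size + 1)}


def candidate_blocks(blocks: list[list[str]], needle: str, gram_size: int = 3) -> list[int]:
    """Return indexes of blocks that may contain `needle` as a substring."""
    needle_grams = char_ngrams(needle, gram_size)
    survivors = []
    for block_id, rows in enumerate(blocks):
        # Union of the n-grams of every row in the block (the per-block "index").
        block_grams = set()
        for row in rows:
            block_grams |= char_ngrams(row, gram_size)
        # A block can only match if it contains every n-gram of the needle.
        if needle_grams <= block_grams:
            survivors.append(block_id)
    return survivors


blocks = [
    ["The quick brown fox", "jumps over"],
    ["Random text number 1", "Random text number 2"],
]
print(candidate_blocks(blocks, "quick br"))  # [0]
```

Surviving blocks are only candidates: the n-grams of the needle may come from different rows or positions, so the full `LIKE` filter still has to run on them.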

:::note
- The index only works when the pattern to be matched is at least as long as `gram_size`. Short patterns (e.g., `'%yo%'` with `gram_size = 3`) won't benefit from the index.

- When using the Ngram index, matches are case-insensitive. For example, searching for "FOO" will match "foo", "Foo", or "fOo".
:::
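
Both points in the note can be illustrated with a short Python sketch. It assumes that case is normalized by lower-casing and that only the `%` wildcard appears in the pattern (`_` and escape characters are ignored); `pattern_grams` is a hypothetical helper, not a Databend function.

```python
def pattern_grams(like_pattern: str, gram_size: int = 3) -> set[str]:
    """N-grams an index lookup could use for a LIKE pattern (empty set: no pruning)."""
    grams = set()
    # Split on the % wildcard: only the literal pieces can be tokenized.
    for literal in like_pattern.split("%"):
        folded = literal.lower()  # assumption: case is normalized by lower-casing
        for i in range(len(folded) - gram_size + 1):
            grams.add(folded[i:i + gram_size])
    return grams


print(pattern_grams("%yo%"))                             # set()
print(pattern_grams("%FOO%") == pattern_grams("%foo%"))  # True
```

A pattern whose literal part is shorter than `gram_size` produces no n-grams, so there is nothing to look up and no block can be ruled out.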

## Managing Ngram Indexes

Databend provides a variety of commands to manage Ngram indexes. For details, see [Ngram Index](/sql/sql-commands/ddl/ngram-index/).

## Usage Examples

To accelerate fuzzy string searches using the `LIKE` operator, you can create an Ngram Index on one or more STRING columns of a table. This example shows how to create a table, define an Ngram Index, insert sample data, and verify that the index is being used in query planning.

First, create a simple table to store text data:

```sql
CREATE TABLE t_articles (
    id INT,
    content STRING
);
```

Next, create an Ngram Index on the `content` column. The `gram_size` parameter defines the number of characters used in each n-gram segment:

```sql
CREATE NGRAM INDEX ngram_idx_content
ON t_articles(content)
gram_size = 3;
```

To show the created index:

```sql
SHOW INDEXES;
```

```sql
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│        name       │  type  │ original │            definition            │         created_on         │      updated_on     │
├───────────────────┼────────┼──────────┼──────────────────────────────────┼────────────────────────────┼─────────────────────┤
│ ngram_idx_content │ NGRAM  │          │ t_articles(content)gram_size='3' │ 2025-05-13 01:02:58.598409 │ NULL                │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
```

Now insert a large number of rows. Most entries contain unrelated text, but a few contain the keyword we want to match later:

```sql
-- Insert 995 irrelevant rows
INSERT INTO t_articles
SELECT number, CONCAT('Random text number ', number)
FROM numbers(995);

-- Insert 5 rows with target keyword
INSERT INTO t_articles VALUES
    (1001, 'The silence was deep and complete'),
    (1002, 'They walked in silence through the woods'),
    (1003, 'Silence fell over the room'),
    (1004, 'A moment of silence was observed'),
    (1005, 'In silence, they understood each other');
```

Now run a query using a `LIKE '%silence%'` pattern. This is where the Ngram Index becomes useful:

```sql
EXPLAIN SELECT id, content FROM t_articles WHERE content LIKE '%silence%';
```

In the `EXPLAIN` output, look for the `bloom pruning` detail in the `pruning stats` line:

```sql
-[ EXPLAIN ]-----------------------------------
TableScan
├── table: default.default.t_articles
├── output columns: [id (#0), content (#1)]
├── read rows: 5
├── read size: < 1 KiB
├── partitions total: 2
├── partitions scanned: 1
├── pruning stats: [segments: <range pruning: 2 to 2>, blocks: <range pruning: 2 to 2, bloom pruning: 2 to 1>]
├── push downs: [filters: [is_true(like(t_articles.content (#1), '%silence%'))], limit: NONE]
└── estimated rows: 15.62
```

Here, `bloom pruning: 2 to 1` shows that the Ngram Index filtered out one of the two data blocks before the scan, so only the block holding the 5 matching rows was read.
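
To make the `bloom pruning: 2 to 1` figure concrete, here is a toy Python model of per-block Bloom filters over trigrams. It is a sketch under assumptions, not Databend's implementation: the hashing scheme (salted SHA-256), the filter size, the number of hash functions, and the lower-casing step are all chosen for illustration only.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 65536, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def ngrams(text: str, n: int = 3) -> set[str]:
    text = text.lower()  # assumption: case-insensitive matching via lower-casing
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_block_filter(rows: list[str], n: int = 3) -> BloomFilter:
    bf = BloomFilter()
    for row in rows:
        for gram in ngrams(row, n):
            bf.add(gram)
    return bf


# Two blocks, loosely mirroring the example data.
blocks = [
    [f"Random text number {i}" for i in range(995)],
    ["The silence was deep and complete", "Silence fell over the room"],
]
filters = [build_block_filter(rows) for rows in blocks]

needle = "silence"
scanned = [i for i, bf in enumerate(filters)
           if all(bf.might_contain(g) for g in ngrams(needle))]
print(f"bloom pruning: {len(blocks)} to {len(scanned)}")
```

A Bloom filter never reports a present n-gram as absent, so the block containing the keyword is always kept; a block without it is skipped unless every trigram of the needle happens to be a false positive.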

docs/en/sql-reference/00-sql-reference/31-system-tables/system-indexes.md

Lines changed: 2 additions & 2 deletions
@@ -6,9 +6,9 @@ import FunctionDescription from '@site/src/components/FunctionDescription';
 
 <FunctionDescription description="Introduced: v1.1.50"/>
 
-Contains information about the created aggregating indexes.
+Contains information about the created indexes.
 
-See also: [SHOW INDEXES](../../10-sql-commands/00-ddl/07-aggregating-index/show-indexes.md)
+See also: [SHOW INDEXES](../../10-sql-commands/50-administration-cmds/show-indexes.md)
 
 ```sql
 CREATE TABLE t1(a int,b int);
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
{
  "label": "Ngram Index",
  "position": 11
}
Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
---
title: CREATE NGRAM INDEX
sidebar_position: 1
---

import FunctionDescription from '@site/src/components/FunctionDescription';

<FunctionDescription description="Introduced or updated: v1.2.726"/>

import EEFeature from '@site/src/components/EEFeature';

<EEFeature featureName='NGRAM INDEX'/>

Creates an Ngram index on one or more columns for a table.

## Syntax

```sql
-- Create an Ngram index on an existing table
CREATE [OR REPLACE] NGRAM INDEX [IF NOT EXISTS] <index_name>
ON [<database>.]<table_name>(<column1> [, <column2>, ...])
[gram_size = <number>] [bloom_size = <number>]

-- Create an Ngram index when creating a table
CREATE [OR REPLACE] TABLE <table_name> (
    <column_definitions>,
    NGRAM INDEX <index_name> (<column1> [, <column2>, ...])
        [gram_size = <number>] [bloom_size = <number>]
)...
```

- `gram_size` (defaults to 3) specifies the length of each character-based substring (n-gram) when the column text is indexed. For example, with `gram_size = 3`, the text "hello world" would be split into overlapping substrings like:

  ```text
  "hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"
  ```

- `bloom_size` specifies the size in bytes of the Bloom filter bitmap used to accelerate string matching within each block of data. It controls the trade-off between index accuracy and memory usage:

  - A larger `bloom_size` reduces false positives in string lookups, improving query precision at the cost of more memory.
  - A smaller `bloom_size` saves memory but may increase false positives.
  - If not explicitly set, the default is 1,048,576 bytes (1 MiB) per indexed column per block. The valid range is from 512 bytes to 10,485,760 bytes (10 MiB).
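
The trade-off between `bloom_size` and false positives can be estimated with the textbook Bloom filter formula `p ≈ (1 - e^(-k·n/m))^k`, where `m` is the filter size in bits, `n` the number of distinct n-grams stored, and `k` the number of hash functions. The Python sketch below is illustrative only: the value of `k` and the n-gram count are assumptions, since neither is stated here, and the function name is hypothetical.

```python
import math


def bloom_false_positive_rate(bloom_size_bytes: int, distinct_ngrams: int, num_hashes: int = 3) -> float:
    """Textbook estimate p = (1 - e^(-k*n/m))^k for an m-bit filter holding n items."""
    m_bits = bloom_size_bytes * 8
    return (1.0 - math.exp(-num_hashes * distinct_ngrams / m_bits)) ** num_hashes


# Minimum, default, and maximum bloom_size, with a hypothetical 100,000 distinct n-grams per block.
for size in (512, 1_048_576, 10_485_760):
    p = bloom_false_positive_rate(size, distinct_ngrams=100_000)
    print(f"bloom_size={size:>10} bytes -> estimated false-positive rate {p:.2e}")
```

The estimate applies to a single n-gram lookup. A pattern that yields several n-grams needs all of them to be false positives before a non-matching block is scanned, so block-level false positives are rarer than this figure suggests.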

## Examples

The following example creates a table `amazon_reviews_ngram` with an Ngram index on the `review_body` column. The index is configured with a `gram_size` of 10 and a `bloom_size` of 2 MiB (2,097,152 bytes) to optimize fuzzy search performance on large text fields such as user reviews.

```sql
CREATE OR REPLACE TABLE amazon_reviews_ngram (
    review_date int(11) NULL,
    marketplace varchar(20) NULL,
    customer_id bigint(20) NULL,
    review_id varchar(40) NULL,
    product_id varchar(10) NULL,
    product_parent bigint(20) NULL,
    product_title varchar(500) NULL,
    product_category varchar(50) NULL,
    star_rating smallint(6) NULL,
    helpful_votes int(11) NULL,
    total_votes int(11) NULL,
    vine boolean NULL,
    verified_purchase boolean NULL,
    review_headline varchar(500) NULL,
    review_body string NULL,
    NGRAM INDEX idx1 (review_body) gram_size = 10 bloom_size = 2097152
) Engine = Fuse bloom_index_columns='review_body';
```

To show the created index, use the [SHOW INDEXES](../../50-administration-cmds/show-indexes.md) command:

```sql
SHOW INDEXES;
```

```sql
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│  name  │  type  │ original │                              definition                              │         created_on         │      updated_on     │
├────────┼────────┼──────────┼──────────────────────────────────────────────────────────────────────┼────────────────────────────┼─────────────────────┤
│ idx1   │ NGRAM  │          │ amazon_reviews_ngram(review_body)bloom_size='2097152' gram_size='10' │ 2025-05-13 01:22:34.123927 │ NULL                │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
```

Alternatively, you can create the table first, then create the Ngram index on the `review_body` column:

```sql
CREATE TABLE amazon_reviews_ngram (
    review_date int(11) NULL,
    marketplace varchar(20) NULL,
    customer_id bigint(20) NULL,
    review_id varchar(40) NULL,
    product_id varchar(10) NULL,
    product_parent bigint(20) NULL,
    product_title varchar(500) NULL,
    product_category varchar(50) NULL,
    star_rating smallint(6) NULL,
    helpful_votes int(11) NULL,
    total_votes int(11) NULL,
    vine boolean NULL,
    verified_purchase boolean NULL,
    review_headline varchar(500) NULL,
    review_body string NULL
);
```

```sql
CREATE NGRAM INDEX idx1
ON amazon_reviews_ngram(review_body)
gram_size = 10 bloom_size = 2097152;
```
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
---
title: DROP NGRAM INDEX
sidebar_position: 4
---

import FunctionDescription from '@site/src/components/FunctionDescription';

<FunctionDescription description="Introduced or updated: v1.2.726"/>

import EEFeature from '@site/src/components/EEFeature';

<EEFeature featureName='NGRAM INDEX'/>

Drops an existing NGRAM index from a table.

## Syntax

```sql
DROP NGRAM INDEX [IF EXISTS] <index_name>
ON [<database>.]<table_name>;
```

## Examples

The following example drops the `idx1` index from the `amazon_reviews_ngram` table:

```sql
DROP NGRAM INDEX idx1 ON amazon_reviews_ngram;
```
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
---
title: NGRAM INDEX
---
import IndexOverviewList from '@site/src/components/IndexOverviewList';
import EEFeature from '@site/src/components/EEFeature';

<EEFeature featureName='NGRAM INDEX'/>

This page provides reference information for the Ngram index-related commands in Databend.

<IndexOverviewList />

docs/en/sql-reference/10-sql-commands/00-ddl/07-aggregating-index/show-indexes.md renamed to docs/en/sql-reference/10-sql-commands/50-administration-cmds/show-indexes.md

Lines changed: 2 additions & 2 deletions
@@ -6,9 +6,9 @@ import FunctionDescription from '@site/src/components/FunctionDescription';
 
 <FunctionDescription description="Introduced or updated: v1.2.190"/>
 
-Shows the created aggregating indexes. Equivalent to `SELECT * FROM system.indexes`.
+Shows the created indexes. Equivalent to `SELECT * FROM system.indexes`.
 
-See also: [system.indexes](../../../00-sql-reference/31-system-tables/system-indexes.md)
+See also: [system.indexes](../../00-sql-reference/31-system-tables/system-indexes.md)
 
 ## Syntax
 