docs: Ngram Index (#2104)

soyeric128 · Chasen-Zhang · web-flow · commit 2fda0e03ce45 · 2025-05-14T08:51:32.000-04:00
* ngram ddl

* Ngram Index guide

* Update ngram-index.md

---------

Co-authored-by: z <787025321@qq.com>
diff --git a/docs/en/guides/55-performance/ngram-index.md b/docs/en/guides/55-performance/ngram-index.md
@@ -0,0 +1,122 @@
+---
+title: Ngram Index
+---
+
+import EEFeature from '@site/src/components/EEFeature';
+
+<EEFeature featureName='NGRAM INDEX'/>
+
+The Ngram Index is a specialized indexing technique that improves the performance of pattern matching queries using the `LIKE` operator with the `%` wildcard. These queries are common in applications that require substring or fuzzy matching, such as searching for keywords within product descriptions, user comments, or log data.
+
+Unlike traditional indexes, which are typically ineffective when the search pattern does not have a fixed prefix (e.g., `LIKE '%keyword%'`), the Ngram Index breaks down text into overlapping substrings (n-grams) and indexes them for fast lookup. This allows Databend to narrow down matching rows efficiently, avoiding costly full table scans.
+
+## How Ngram Index Works
+
+The Ngram Index in Databend is built using character-level n-grams. When a column is indexed, its text content is treated as a continuous sequence of characters, including letters, spaces, and punctuation. The text is then split into all possible overlapping substrings of a fixed length, defined by the `gram_size` parameter.
+
+For example, with `gram_size = 3`, the string:
+
+```text
+The quick brown
+```
+
+will be split into the following 3-character substrings:
+
+```text
+"The", "he ", "e q", " qu", "qui", "uic", "ick", "ck ", "k b", " br", "bro", "row", "own"
+```
+
+These substrings are stored in the index and used to accelerate pattern matching in queries using the `LIKE` operator.
+When a query such as:
+
+```sql
+SELECT * FROM t WHERE content LIKE '%quick br%'
+```
+
+is issued, the literal part of the pattern `%quick br%` is also tokenized into trigrams: "qui", "uic", "ick", "ck ", "k b", and " br". Databend uses these to filter data blocks via the Ngram index before applying the full `LIKE` filter, significantly reducing the amount of data scanned.
+
+:::note
+- The index only works when the literal text in the pattern is at least `gram_size` characters long. Short patterns (e.g., `'%yo%'` with `gram_size = 3`) won't benefit from the index.
+
+- When using the Ngram index, matches are case-insensitive. For example, searching for "FOO" will match "foo", "Foo", or "fOo".
+:::
+
+## Managing Ngram Indexes
+
+Databend provides a variety of commands to manage Ngram indexes. For details, see [Ngram Index](/sql/sql-commands/ddl/ngram-index/).
+
+## Usage Examples
+
+To accelerate fuzzy string searches using the `LIKE` operator, you can create an Ngram Index on one or more STRING columns of a table. This example shows how to create a table, define an Ngram Index, insert sample data, and verify that the index is being used in query planning.
+
+First, create a simple table to store text data:
+
+```sql
+CREATE TABLE t_articles (
+    id INT,
+    content STRING
+);
+```
+
+Next, create an Ngram Index on the `content` column. The `gram_size` parameter defines the number of characters used in each n-gram segment:
+
+```sql
+CREATE NGRAM INDEX ngram_idx_content
+ON t_articles(content)
+gram_size = 3;
+```
+
+To show the created index:
+
+```sql
+SHOW INDEXES;
+```
+
+```sql
+┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
+│        name       │  type  │ original │            definition            │         created_on         │      updated_on     │
+├───────────────────┼────────┼──────────┼──────────────────────────────────┼────────────────────────────┼─────────────────────┤
+│ ngram_idx_content │ NGRAM  │          │ t_articles(content)gram_size='3' │ 2025-05-13 01:02:58.598409 │ NULL                │
+└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
+```
+
+Now insert a large number of rows. Most entries contain unrelated text, but a few contain the keyword we want to match later:
+
+```sql
+-- Insert 995 irrelevant rows
+INSERT INTO t_articles
+SELECT number, CONCAT('Random text number ', number)
+FROM numbers(995);
+
+-- Insert 5 rows with target keyword
+INSERT INTO t_articles VALUES
+    (1001, 'The silence was deep and complete'),
+    (1002, 'They walked in silence through the woods'),
+    (1003, 'Silence fell over the room'),
+    (1004, 'A moment of silence was observed'),
+    (1005, 'In silence, they understood each other');
+```
+
+Now run a query using a `LIKE '%silence%'` pattern. This is where the Ngram Index becomes useful:
+
+```sql
+EXPLAIN SELECT id, content FROM t_articles WHERE content LIKE '%silence%';
+```
+
+In the `EXPLAIN` output, look for the `bloom pruning` detail in the `pruning stats` line:
+
+```sql
+-[ EXPLAIN ]-----------------------------------
+TableScan
+├── table: default.default.t_articles
+├── output columns: [id (#0), content (#1)]
+├── read rows: 5
+├── read size: < 1 KiB
+├── partitions total: 2
+├── partitions scanned: 1
+├── pruning stats: [segments: <range pruning: 2 to 2>, blocks: <range pruning: 2 to 2, bloom pruning: 2 to 1>]
+├── push downs: [filters: [is_true(like(t_articles.content (#1), '%silence%'))], limit: NONE]
+└── estimated rows: 15.62
+```
+
+Here, `bloom pruning: 2 to 1` shows that the Ngram Index successfully filtered out one of the two data blocks before the scan.
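
To make the tokenization and block filtering described under "How Ngram Index Works" concrete, here is a minimal Python sketch. It is not Databend's implementation: it keeps an exact set of n-grams per block where Databend uses a Bloom filter, it lowercases text to mirror the case-insensitivity note, it only handles patterns of the form `%literal%`, and the helper names `ngrams`, `build_block_index`, and `can_skip_block` are hypothetical.

```python
def ngrams(text: str, gram_size: int = 3) -> list[str]:
    """Return all overlapping character n-grams of `text`, in order."""
    if gram_size <= 0:
        raise ValueError("gram_size must be positive")
    # A text shorter than gram_size yields an empty list.
    return [text[i:i + gram_size] for i in range(len(text) - gram_size + 1)]


def build_block_index(rows: list[str], gram_size: int = 3) -> set[str]:
    """Collect the n-grams of every row in one block.

    Assumption: an exact set stands in for the per-block Bloom filter.
    """
    grams: set[str] = set()
    for row in rows:
        grams.update(ngrams(row.lower(), gram_size))
    return grams


def can_skip_block(block_index: set[str], pattern: str, gram_size: int = 3) -> bool:
    """Return True if the block cannot contain a match for '%literal%'."""
    literal = pattern.strip("%").lower()
    pattern_grams = ngrams(literal, gram_size)
    if not pattern_grams:
        # Literal shorter than gram_size: the index cannot help, so scan.
        return False
    # One missing n-gram is enough to rule the whole block out.
    return any(gram not in block_index for gram in pattern_grams)


if __name__ == "__main__":
    unrelated = build_block_index(["Random text number 1", "Random text number 2"])
    relevant = build_block_index(["They walked in silence through the woods"])
    print(ngrams("quick br"))
    print(can_skip_block(unrelated, "%silence%"))
    print(can_skip_block(relevant, "%silence%"))
```

A block is skipped only when at least one n-gram of the pattern is absent from it. A block that passes still goes through the full `LIKE` filter, because containing every n-gram does not guarantee that the substring itself is present.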
diff --git a/docs/en/sql-reference/00-sql-reference/31-system-tables/system-indexes.md b/docs/en/sql-reference/00-sql-reference/31-system-tables/system-indexes.md
@@ -6,9 +6,9 @@ import FunctionDescription from '@site/src/components/FunctionDescription';
 
 <FunctionDescription description="Introduced: v1.1.50"/>
 
-Contains information about the created aggregating indexes.
+Contains information about the created indexes.
 
-See also: [SHOW INDEXES](../../10-sql-commands/00-ddl/07-aggregating-index/show-indexes.md)
+See also: [SHOW INDEXES](../../10-sql-commands/50-administration-cmds/show-indexes.md)
 
 ```sql
 CREATE TABLE t1(a int,b int);
diff --git a/docs/en/sql-reference/10-sql-commands/00-ddl/07-ngram-index/_category_.json b/docs/en/sql-reference/10-sql-commands/00-ddl/07-ngram-index/_category_.json
@@ -0,0 +1,4 @@
+{
+  "label": "Ngram Index",
+  "position": 11
+}
diff --git a/docs/en/sql-reference/10-sql-commands/00-ddl/07-ngram-index/create-ngram-index.md b/docs/en/sql-reference/10-sql-commands/00-ddl/07-ngram-index/create-ngram-index.md
@@ -0,0 +1,109 @@
+---
+title: CREATE NGRAM INDEX
+sidebar_position: 1
+---
+
+import FunctionDescription from '@site/src/components/FunctionDescription';
+
+<FunctionDescription description="Introduced or updated: v1.2.726"/>
+
+import EEFeature from '@site/src/components/EEFeature';
+
+<EEFeature featureName='NGRAM INDEX'/>
+
+Creates an Ngram index on one or more columns for a table.
+
+## Syntax
+
+```sql
+-- Create an Ngram index on an existing table
+CREATE [OR REPLACE] NGRAM INDEX [IF NOT EXISTS] <index_name>
+ON [<database>.]<table_name>(<column1> [, <column2>, ...])
+[gram_size = <number>] [bloom_size = <number>]
+
+-- Create an Ngram index when creating a table
+CREATE [OR REPLACE] TABLE <table_name> (
+    <column_definitions>,
+    NGRAM INDEX <index_name> (<column1> [, <column2>, ...])
+        [gram_size = <number>] [bloom_size = <number>]
+)...
+```
+
+- `gram_size` (defaults to 3) specifies the length of each character-based substring (n-gram) when the column text is indexed. For example, with `gram_size = 3`, the text "hello world" would be split into overlapping substrings like:
+
+  ```text
+  "hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"
+  ```
+
+- `bloom_size` specifies the size in bytes of the Bloom filter bitmap used to accelerate string matching within each block of data. It controls the trade-off between index accuracy and memory usage:
+
+  - A larger `bloom_size` reduces false positives in string lookups, improving query precision at the cost of more memory.
+  - A smaller `bloom_size` saves memory but may increase false positives.
+  - If not explicitly set, the default is 1,048,576 bytes (1 MB) per indexed column per block. The valid range is from 512 bytes to 10,485,760 bytes (10 MB).
+
+## Examples
+
+The following example creates a table `amazon_reviews_ngram` with an Ngram index on the `review_body` column. The index is configured with a `gram_size` of 10 and a `bloom_size` of 2 MB (2,097,152 bytes) to optimize fuzzy search performance on large text fields such as user reviews.
+
+```sql
+CREATE OR REPLACE TABLE amazon_reviews_ngram (
+    review_date   int(11) NULL,
+    marketplace   varchar(20) NULL,
+    customer_id   bigint(20) NULL,
+    review_id   varchar(40) NULL,
+    product_id   varchar(10) NULL,
+    product_parent   bigint(20) NULL,
+    product_title   varchar(500) NULL,
+    product_category   varchar(50) NULL,
+    star_rating   smallint(6) NULL,
+    helpful_votes   int(11) NULL,
+    total_votes   int(11) NULL,
+    vine   boolean NULL,
+    verified_purchase   boolean NULL,
+    review_headline   varchar(500) NULL,
+    review_body   string NULL,
+    NGRAM INDEX idx1 (review_body) gram_size = 10 bloom_size = 2097152
+) Engine = Fuse bloom_index_columns='review_body';
+```
+
+To show the created index, use the [SHOW INDEXES](../../50-administration-cmds/show-indexes.md) command:
+
+```sql
+SHOW INDEXES;
+```
+
+```sql
+┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
+│  name  │  type  │ original │                              definition                              │         created_on         │      updated_on     │
+├────────┼────────┼──────────┼──────────────────────────────────────────────────────────────────────┼────────────────────────────┼─────────────────────┤
+│ idx1   │ NGRAM  │          │ amazon_reviews_ngram(review_body)bloom_size='2097152' gram_size='10' │ 2025-05-13 01:22:34.123927 │ NULL                │
+└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
+```
+
+Alternatively, you can create the table first, then create the Ngram index on the `review_body` column:
+
+```sql
+CREATE TABLE amazon_reviews_ngram (
+    review_date   int(11) NULL,
+    marketplace   varchar(20) NULL,
+    customer_id   bigint(20) NULL,
+    review_id   varchar(40) NULL,
+    product_id   varchar(10) NULL,
+    product_parent   bigint(20) NULL,
+    product_title   varchar(500) NULL,
+    product_category   varchar(50) NULL,
+    star_rating   smallint(6) NULL,
+    helpful_votes   int(11) NULL,
+    total_votes   int(11) NULL,
+    vine   boolean NULL,
+    verified_purchase   boolean NULL,
+    review_headline   varchar(500) NULL,
+    review_body   string NULL
+);
+```
+
+```sql
+CREATE NGRAM INDEX idx1
+ON amazon_reviews_ngram(review_body)
+gram_size = 10 bloom_size = 2097152;
+```
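
The trade-off behind `bloom_size` can be estimated with the standard Bloom filter approximation p ≈ (1 − e^(−kn/m))^k, where m is the filter size in bits, n the number of distinct n-grams inserted, and k the number of hash functions. The Python sketch below is only an illustration: this page does not state how many hash functions Databend uses or how many distinct n-grams a block holds, so `num_hashes` and the n-gram count are assumed values, and `bloom_false_positive_rate` is a hypothetical helper.

```python
import math

MIN_BLOOM_SIZE = 512          # bytes, lower bound of the documented range
MAX_BLOOM_SIZE = 10_485_760   # bytes, upper bound of the documented range


def bloom_false_positive_rate(bloom_size: int, distinct_grams: int, num_hashes: int = 3) -> float:
    """Estimate the false-positive rate of a Bloom filter of `bloom_size` bytes.

    Assumption: num_hashes is illustrative, not Databend's actual setting.
    """
    if not MIN_BLOOM_SIZE <= bloom_size <= MAX_BLOOM_SIZE:
        raise ValueError("bloom_size is outside the documented range")
    if distinct_grams < 0 or num_hashes <= 0:
        raise ValueError("distinct_grams must be >= 0 and num_hashes > 0")
    bits = bloom_size * 8
    return (1.0 - math.exp(-num_hashes * distinct_grams / bits)) ** num_hashes


if __name__ == "__main__":
    # Compare the default size with the 2 MB size used in the example above,
    # for an assumed one million distinct n-grams in a block.
    for size in (1_048_576, 2_097_152):
        rate = bloom_false_positive_rate(size, distinct_grams=1_000_000)
        print(f"bloom_size={size}: estimated false-positive rate {rate:.4f}")
```

Under these assumptions, doubling `bloom_size` lowers the estimated false-positive rate, at the cost of twice the filter size per indexed column per block.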
diff --git a/docs/en/sql-reference/10-sql-commands/00-ddl/07-ngram-index/drop-ngram-index.md b/docs/en/sql-reference/10-sql-commands/00-ddl/07-ngram-index/drop-ngram-index.md
@@ -0,0 +1,29 @@
+---
+title: DROP NGRAM INDEX
+sidebar_position: 4
+---
+
+import FunctionDescription from '@site/src/components/FunctionDescription';
+
+<FunctionDescription description="Introduced or updated: v1.2.726"/>
+
+import EEFeature from '@site/src/components/EEFeature';
+
+<EEFeature featureName='NGRAM INDEX'/>
+
+Drops an existing Ngram index from a table.
+
+## Syntax
+
+```sql
+DROP NGRAM INDEX [IF EXISTS] <index_name>
+ON [<database>.]<table_name>;
+```
+
+## Examples
+
+The following example drops the `idx1` index from the `amazon_reviews_ngram` table:
+
+```sql
+DROP NGRAM INDEX idx1 ON amazon_reviews_ngram;
+```
diff --git a/docs/en/sql-reference/10-sql-commands/00-ddl/07-ngram-index/index.md b/docs/en/sql-reference/10-sql-commands/00-ddl/07-ngram-index/index.md
@@ -0,0 +1,11 @@
+---
+title: NGRAM INDEX
+---
+import IndexOverviewList from '@site/src/components/IndexOverviewList';
+import EEFeature from '@site/src/components/EEFeature';
+
+<EEFeature featureName='NGRAM INDEX'/>
+
+This page provides reference information for the Ngram index-related commands in Databend.
+
+<IndexOverviewList />
diff --git a/docs/en/sql-reference/10-sql-commands/50-administration-cmds/show-indexes.md b/docs/en/sql-reference/10-sql-commands/50-administration-cmds/show-indexes.md
@@ -6,9 +6,9 @@ import FunctionDescription from '@site/src/components/FunctionDescription';
 
 <FunctionDescription description="Introduced or updated: v1.2.190"/>
 
-Shows the created aggregating indexes. Equivalent to `SELECT * FROM system.indexes`.
+Shows the created indexes. Equivalent to `SELECT * FROM system.indexes`.
 
-See also: [system.indexes](../../../00-sql-reference/31-system-tables/system-indexes.md)
+See also: [system.indexes](../../00-sql-reference/31-system-tables/system-indexes.md)
 
 ## Syntax
 
