|
1 | 1 | ---
|
2 | 2 | title: 'COSINE_DISTANCE'
|
3 |
| -description: 'Measuring similarity with the cosine_distance function in Databend' |
| 3 | +description: 'Measuring similarity with the cosine_distance function in Databend' |
4 | 4 | ---
|
5 | 5 |
|
6 |
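| -This document gives an overview of the cosine_distance function in Databend and demonstrates how to use it to measure document similarity. |
| 6 | +Computes the cosine distance between two vectors, a measure of how dissimilar they are. |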
7 | 7 |
|
8 |
| -:::info |
| 8 | +## Syntax |
| 9 | + |
| 10 | +```sql |
| 11 | +COSINE_DISTANCE(vector1, vector2) |
| 12 | +``` |
9 | 13 |
|
10 |
| -The cosine_distance function performs its vector computations inside Databend and does not depend on the (Azure) OpenAI API. |
| 14 | +## Parameters |
11 | 15 |
|
12 |
| -::: |
| 16 | +- `vector1`: The first vector (ARRAY(FLOAT32 NOT NULL)) |
| 17 | +- `vector2`: The second vector (ARRAY(FLOAT32 NOT NULL)) |
| 18 | + |
| 19 | +## Return Value |
13 | 20 |
|
14 |
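| -The cosine_distance function in Databend is a built-in function that computes the cosine distance between two vectors. It is commonly used in natural language processing tasks such as document similarity and recommendation systems. |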
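| 21 | +Returns a FLOAT value. It lies between 0 and 1 when the angle between the vectors is at most 90°; the formula under Description reaches 2 for vectors pointing in opposite directions: |
| 22 | +- 0: identical vectors, or any vectors pointing in the same direction (completely similar) |
| 23 | +- 1: orthogonal vectors (completely dissimilar) |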
| 24 | + |
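| 25 | +## Description |
| 26 | + |
| 27 | +Cosine distance measures the dissimilarity between two vectors based on the angle between them, regardless of their magnitudes. The function: |
| 28 | + |
| 29 | +1. Verifies that the two input vectors have the same length |
| 30 | +2. Computes the sum of the element-wise products of the two vectors (the dot product) |
| 31 | +3. Computes the square root of the sum of squares of each vector (the vector magnitude) |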
| 32 | +4. 返回 `1 - (dot_product / (magnitude1 * magnitude2))` |
| 33 | + |
| 34 | +The mathematical formula implemented is: |
| 35 | + |
| 36 | +``` |
| 37 | +cosine_distance(v1, v2) = 1 - (Σ(v1ᵢ * v2ᵢ) / (√Σ(v1ᵢ²) * √Σ(v2ᵢ²))) |
| 38 | +``` |
| 39 | + |
| 40 | +where v1ᵢ and v2ᵢ are the elements of the input vectors. |
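
The four steps and the formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Databend's implementation: the function name `cosine_distance` mirrors the SQL function, while the error type and the zero-vector behavior are assumptions made for this sketch.

```python
import math


def cosine_distance(v1, v2):
    """Illustrative re-implementation of the documented formula (not Databend code)."""
    # Step 1: both vectors must have the same length.
    # Raising ValueError is an assumption of this sketch.
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    # Step 2: dot product, the sum of element-wise products.
    dot_product = sum(a * b for a, b in zip(v1, v2))
    # Step 3: magnitude of each vector, the square root of its sum of squares.
    magnitude1 = math.sqrt(sum(a * a for a in v1))
    magnitude2 = math.sqrt(sum(b * b for b in v2))
    # Step 4: one minus the cosine of the angle between the vectors.
    # A zero-magnitude vector raises ZeroDivisionError here; how Databend
    # treats that case is not stated in this document.
    return 1 - dot_product / (magnitude1 * magnitude2)


if __name__ == "__main__":
    print(cosine_distance([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
    print(cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction, different magnitude
    print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))           # opposite directions
```

Because only the angle matters, scaling either vector leaves the result unchanged up to floating-point rounding.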
| 41 | + |
| 42 | +:::info |
| 43 | +This function performs its vector computations inside Databend and does not depend on external APIs. |
| 44 | +::: |
15 | 45 |
|
16 |
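| -Cosine distance is a measure of similarity based on the cosine of the angle between two vectors. The function takes two input vectors and returns a value between 0 and 1, where 0 means identical vectors and 1 means orthogonal (completely dissimilar) vectors. |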
17 | 46 |
|
18 | 47 | ## Examples
|
19 | 48 |
|
20 |
| -**Create a table and insert sample data** |
| 49 | +Create a table containing vector data: |
21 | 50 |
|
22 |
| -Let's create a table to store some sample text documents and their corresponding embedding vectors: |
23 | 51 | ```sql
|
24 |
| -CREATE TABLE articles ( |
| 52 | +CREATE OR REPLACE TABLE vectors ( |
25 | 53 | id INT,
|
26 |
| - title VARCHAR, |
27 |
| - content VARCHAR, |
28 |
| - embedding ARRAY(FLOAT32) |
| 54 | + vec ARRAY(FLOAT32 NOT NULL) |
29 | 55 | );
|
30 |
| -``` |
31 | 56 |
|
32 |
| -Now, let's insert some sample documents into the table: |
33 |
| -```sql |
34 |
| -INSERT INTO articles (id, title, content, embedding) |
35 |
| -VALUES |
36 |
| - (1, 'Python for Data Science', 'Python is a versatile programming language widely used in data science...', ai_embedding_vector('Python is a versatile programming language widely used in data science...')), |
37 |
| - (2, 'Introduction to R', 'R is a popular programming language for statistical computing and graphics...', ai_embedding_vector('R is a popular programming language for statistical computing and graphics...')), |
38 |
| - (3, 'Getting Started with SQL', 'Structured Query Language (SQL) is a domain-specific language used for managing relational databases...', ai_embedding_vector('Structured Query Language (SQL) is a domain-specific language used for managing relational databases...')); |
| 57 | +INSERT INTO vectors VALUES |
| 58 | + (1, [1.0000, 2.0000, 3.0000]), |
| 59 | + (2, [1.0000, 2.2000, 3.0000]), |
| 60 | + (3, [4.0000, 5.0000, 6.0000]); |
39 | 61 | ```
|
40 | 62 |
|
41 |
| -**Query similar documents** |
| 63 | +Find the vector most similar to [1, 2, 3]: |
42 | 64 |
|
43 |
| -Now, let's use the cosine_distance function to find the documents most similar to a given query: |
44 | 65 | ```sql
|
45 |
| -SELECT |
46 |
| - id, |
47 |
| - title, |
48 |
| - content, |
49 |
| - cosine_distance(embedding, ai_embedding_vector('How to use Python in data analysis?')) AS similarity |
50 |
| -FROM |
51 |
| - articles |
52 |
| -ORDER BY |
53 |
| - similarity ASC |
54 |
| - LIMIT 3; |
| 66 | +SELECT |
| 67 | + vec, |
| 68 | + COSINE_DISTANCE(vec, [1.0000, 2.0000, 3.0000]) AS distance |
| 69 | +FROM |
| 70 | + vectors |
| 71 | +ORDER BY |
| 72 | + distance ASC |
| 73 | +LIMIT 1; |
55 | 74 | ```
|
56 | 75 |
|
57 |
| -Result: |
58 |
| -```sql |
59 |
| -+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+ |
60 |
| -| id | title | content | similarity | |
61 |
| -+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+ |
62 |
| -| 1 | Python for Data Science | Python is a versatile programming language widely used in data science... | 0.1142081 | |
63 |
| -| 2 | Introduction to R | R is a popular programming language for statistical computing and graphics... | 0.18741018 | |
64 |
| -| 3 | Getting Started with SQL | Structured Query Language (SQL) is a domain-specific language used for managing relational databases... | 0.25137568 | |
65 |
| -+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+ |
| 76 | +``` |
| 77 | ++-------------------------+----------+ |
| 78 | +| vec | distance | |
| 79 | ++-------------------------+----------+ |
| 80 | +| [1.0000,2.0000,3.0000] | 0.0 | |
| 81 | ++-------------------------+----------+ |
66 | 82 | ```
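
As a cross-check of the example, the same ranking can be reproduced outside the database. The sketch below is plain Python rather than Databend code; `rows`, `query`, and the helper are illustrative names, and the arithmetic uses 64-bit floats, so low-order digits may differ from a FLOAT32 computation.

```python
import math


def cosine_distance(v1, v2):
    # 1 - (dot product / product of magnitudes), as in the formula in this document.
    dot_product = sum(a * b for a, b in zip(v1, v2))
    return 1 - dot_product / (math.hypot(*v1) * math.hypot(*v2))


# The three rows inserted into the `vectors` table, keyed by id.
rows = {
    1: [1.0, 2.0, 3.0],
    2: [1.0, 2.2, 3.0],
    3: [4.0, 5.0, 6.0],
}
query = [1.0, 2.0, 3.0]

distances = {row_id: cosine_distance(vec, query) for row_id, vec in rows.items()}
# Equivalent of ORDER BY distance ASC LIMIT 1.
nearest_id = min(distances, key=distances.get)

if __name__ == "__main__":
    for row_id, distance in sorted(distances.items(), key=lambda item: item[1]):
        print(row_id, rows[row_id], distance)
    print("nearest id:", nearest_id)
```

Row 1 equals the query vector, so its distance is zero up to rounding; row 2 comes out at roughly 0.001 and row 3 at roughly 0.025.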
|