Commit 257a60d

💬Generate LLM translations (#2175)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
1 parent c5e9f5f commit 257a60d

1 file changed: 58 additions and 42 deletions
@@ -1,66 +1,82 @@
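 ---
 title: 'COSINE_DISTANCE'
-description: 'Measuring similarity in Databend with the cosine_distance function'
+description: 'Measuring similarity in Databend with the cosine_distance function'
 ---

-This document gives an overview of the cosine_distance function in Databend and demonstrates how to use it to measure document similarity
+Calculates the cosine distance between two vectors, measuring how dissimilar they are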

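-:::info
+## Syntax
+
+```sql
+COSINE_DISTANCE(vector1, vector2)
+```

-The cosine_distance function performs vector computations inside Databend and does not rely on the (Azure) OpenAI API.
+## Arguments

-:::
+- `vector1`: The first vector (ARRAY(FLOAT32 NOT NULL))
+- `vector2`: The second vector (ARRAY(FLOAT32 NOT NULL))
+
+## Returns

-The cosine_distance function in Databend is a built-in function that calculates the cosine distance between two vectors. It is commonly used in natural language processing tasks such as document similarity and recommendation systems.
+Returns a FLOAT value between 0 and 1:
+- 0: identical vectors (completely similar)
+- 1: orthogonal vectors (completely dissimilar)
+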
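+## Description
+
+Cosine distance measures the dissimilarity between two vectors based on the angle between them, regardless of their magnitudes. The function:
+
+1. Verifies that both input vectors have the same length
+2. Computes the sum of the element-wise products of the two vectors (the dot product)
+3. Computes the square root of the sum of squares of each vector (the vector magnitude)
+4. Returns `1 - (dot_product / (magnitude1 * magnitude2))`
+
+The mathematical formula implemented is:
+
+```
+cosine_distance(v1, v2) = 1 - (Σ(v1ᵢ * v2ᵢ) / (√Σ(v1ᵢ²) * √Σ(v2ᵢ²)))
+```
+
+where v1ᵢ and v2ᵢ are the elements of the input vectors.
+
+:::info
+This function performs vector computations within Databend and does not rely on external APIs.
+:::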
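The four steps listed in the description can be sketched in plain Python. This is only an illustration of the documented formula, not Databend's implementation; the function name `cosine_distance` and the choice to raise `ValueError` on mismatched lengths are assumptions made for the example.

```python
import math


def cosine_distance(v1, v2):
    """Illustrative sketch of the documented formula (not Databend's code)."""
    # Step 1: both vectors must have the same length.
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    # Step 2: dot product, the sum of element-wise products.
    dot_product = sum(a * b for a, b in zip(v1, v2))
    # Step 3: magnitude of each vector, the square root of its sum of squares.
    magnitude1 = math.sqrt(sum(a * a for a in v1))
    magnitude2 = math.sqrt(sum(b * b for b in v2))
    # Step 4: one minus the cosine of the angle between the vectors.
    # Assumption: zero-magnitude vectors are not handled and would divide by zero.
    return 1 - dot_product / (magnitude1 * magnitude2)


if __name__ == "__main__":
    print(cosine_distance([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
    print(cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
```

Because only the angle matters, scaling a vector (as in the second call) leaves the distance at approximately 0.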
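
-Cosine distance is a measure of similarity based on the cosine of the angle between two vectors. The function takes two input vectors and returns a value between 0 and 1, where 0 indicates identical vectors and 1 indicates orthogonal (completely dissimilar) vectors.

 ## Examples

-**Creating a Table and Inserting Sample Data**
+Create a table with vector data:

-Let's create a table to store some sample text documents and their corresponding embeddings: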
 ```sql
-CREATE TABLE articles (
+CREATE OR REPLACE TABLE vectors (
     id INT,
-    title VARCHAR,
-    content VARCHAR,
-    embedding ARRAY(FLOAT32)
+    vec ARRAY(FLOAT32 NOT NULL)
 );
-```

-Now, let's insert some sample documents into the table:
-```sql
-INSERT INTO articles (id, title, content, embedding)
-VALUES
-    (1, 'Python for Data Science', 'Python is a versatile programming language widely used in data science...', ai_embedding_vector('Python is a versatile programming language widely used in data science...')),
-    (2, 'Introduction to R', 'R is a popular programming language for statistical computing and graphics...', ai_embedding_vector('R is a popular programming language for statistical computing and graphics...')),
-    (3, 'Getting Started with SQL', 'Structured Query Language (SQL) is a domain-specific language used for managing relational databases...', ai_embedding_vector('Structured Query Language (SQL) is a domain-specific language used for managing relational databases...'));
+INSERT INTO vectors VALUES
+    (1, [1.0000, 2.0000, 3.0000]),
+    (2, [1.0000, 2.2000, 3.0000]),
+    (3, [4.0000, 5.0000, 6.0000]);
 ```

-**Querying Similar Documents**
+Find the vector most similar to [1, 2, 3]:

-Now, let's use the cosine_distance function to find the documents most similar to a given query:
 ```sql
-SELECT
-    id,
-    title,
-    content,
-    cosine_distance(embedding, ai_embedding_vector('How to use Python in data analysis?')) AS similarity
-FROM
-    articles
-ORDER BY
-    similarity ASC
-LIMIT 3;
+SELECT
+    vec,
+    COSINE_DISTANCE(vec, [1.0000, 2.0000, 3.0000]) AS distance
+FROM
+    vectors
+ORDER BY
+    distance ASC
+LIMIT 1;
 ```
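As a rough cross-check of the query, the same ranking can be reproduced outside the database. The sketch below assumes the three rows inserted into `vectors` and the formula from the description; `rank_by_distance` is a hypothetical helper, and Python's 64-bit floats may differ in the last digits from what Databend computes on FLOAT32 values.

```python
import math


def cosine_distance(v1, v2):
    # 1 - (dot product / product of magnitudes), as in the documented formula.
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return 1 - dot / norms


def rank_by_distance(rows, query):
    # Mirrors ORDER BY distance ASC: the smallest distance comes first.
    scored = [(row_id, vec, cosine_distance(vec, query)) for row_id, vec in rows]
    return sorted(scored, key=lambda item: item[2])


# The three rows from the INSERT statement in the example.
rows = [
    (1, [1.0, 2.0, 3.0]),
    (2, [1.0, 2.2, 3.0]),
    (3, [4.0, 5.0, 6.0]),
]

if __name__ == "__main__":
    for row_id, vec, distance in rank_by_distance(rows, [1.0, 2.0, 3.0]):
        print(row_id, vec, round(distance, 6))
```

By this calculation row 1 (the query vector itself) has a distance of about 0, row 2 about 0.001, and row 3 about 0.025, so under these assumptions `LIMIT 1` would select row 1.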

-Result:
-```sql
-+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
-| id | title | content | similarity |
-+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
-| 1 | Python for Data Science | Python is a versatile programming language widely used in data science... | 0.1142081 |
-| 2 | Introduction to R | R is a popular programming language for statistical computing and graphics... | 0.18741018 |
-| 3 | Getting Started with SQL | Structured Query Language (SQL) is a domain-specific language used for managing relational databases... | 0.25137568 |
-+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
+```
++-------------------------+----------+
+| vec                     | distance |
++-------------------------+----------+
+| [1.0000,2.2000,3.0000]  | 0.0      |
++-------------------------+----------+
 ```
