Commit c7fd000

Databend Data Lakehouse and Databend AI&ML (#2207)
1 parent 856681e commit c7fd000

6 files changed: +157 -87 lines changed

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
11
{
2-
"label": "Accessing Data Lake"
2+
"label": "Data Lakehouse"
33
}
Lines changed: 8 additions & 6 deletions
@@ -1,11 +1,13 @@
11
---
2-
title: Accessing Data Lake
2+
title: Databend for Data Lakehouse
33
---
44

5-
Databend presents a seamless integration with three robust Data Lake technologies: [Apache Hive](https://hive.apache.org/), [Apache Iceberg](https://iceberg.apache.org/), and [Delta Lake](https://delta.io/). This integration brings a distinct advantage by supporting multiple facets of Data Lake functionality. Databend offers a versatile and comprehensive platform, empowering users with increased flexibility and efficiency in handling diverse datasets within the Data Lake environment.
5+
Databend integrates with popular data lake technologies to provide a unified lakehouse architecture that combines data lake flexibility with data warehouse performance.
66

7-
Furthermore, the integration of these three technologies within Databend is characterized by varying approaches. While some, like Apache Hive, integrate at the catalog level, others, such as Delta Lake, operate at the table engine level. The catalog-based integration establishes a centralized connection to the Data Lake, streamlining access and management across multiple tables. On the other hand, table engine-level integration provides a more granular control, allowing for tailored optimization and fine-tuning at the individual table level.
7+
| Technology | Integration Type | Key Features | Documentation |
8+
|------------|-----------------|--------------|---------------|
9+
| Apache Hive | Catalog-level | Legacy data lake support, schema registry | [Apache Hive Catalog](01-hive.md) |
10+
| Apache Iceberg™ | Catalog-level | ACID transactions, schema evolution, time travel | [Apache Iceberg™ Catalog](02-iceberg.md) |
11+
| Delta Lake | Table engine-level | ACID transactions, data versioning, schema enforcement | [Delta Lake Table Engine](03-delta.md) |
812

9-
- [Apache Hive Catalog](01-hive.md)
10-
- [Apache Iceberg Catalog](02-iceberg.md)
11-
- [Delta Lake Table Engine](03-delta.md)
13+
These integrations enable Databend users to efficiently query, analyze, and manage diverse datasets across both data lake and data warehouse environments without data duplication.
Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,81 @@
1+
---
2+
title: External Functions for Custom AI/ML
3+
---
4+
5+
# External Functions for Custom AI/ML
6+
7+
For advanced AI/ML scenarios, Databend supports external functions that connect your data with custom AI/ML infrastructure written in languages like Python.
8+
9+
| Feature | Description | Benefits |
10+
|---------|-------------|----------|
11+
| **Model Flexibility** | Use open-source models or your internal AI/ML infrastructure | • Freedom to choose any model<br/>• Leverage existing ML investments<br/>• Stay up-to-date with latest AI advancements |
12+
| **GPU Acceleration** | Deploy external function servers on GPU-equipped machines | • Faster inference for deep learning models<br/>• Handle larger batch sizes<br/>• Support compute-intensive workloads |
13+
| **Custom ML Models** | Deploy and use your own machine learning models | • Proprietary algorithms<br/>• Domain-specific models<br/>• Fine-tuned for your data |
14+
| **Advanced AI Pipelines** | Build complex AI workflows with specialized libraries | • Multi-step processing<br/>• Custom transformations<br/>• Integration with ML frameworks |
15+
| **Scalability** | Handle resource-intensive AI operations outside Databend | • Independent scaling<br/>• Optimized resource allocation<br/>• High-throughput processing |
16+
17+
## Implementation Overview
18+
19+
1. Create an external server with your AI/ML code (Python with [databend-udf](https://pypi.org/project/databend-udf))
20+
2. Register the server with Databend using `CREATE FUNCTION`
21+
3. Call your AI/ML functions directly in SQL queries
22+
23+
## Example: Custom AI Model Integration
24+
25+
```python
26+
# Simple embedding UDF server demo
27+
from databend_udf import udf, UDFServer
28+
from sentence_transformers import SentenceTransformer
29+
30+
# Load pre-trained model
31+
model = SentenceTransformer('all-mpnet-base-v2') # 768-dimensional vectors
32+
33+
@udf(
34+
    input_types=["STRING"],
35+
    result_type="ARRAY(FLOAT)",
36+
)
37+
def ai_embed_768(inputs: list[str], headers) -> list[list[float]]:
38+
    """Generate 768-dimensional embeddings for input texts"""
39+
    try:
40+
        # Process inputs in a single batch
41+
        embeddings = model.encode(inputs)
42+
        # Convert to list format
43+
        return [embedding.tolist() for embedding in embeddings]
44+
    except Exception as e:
45+
        print(f"Error generating embeddings: {e}")
46+
        # Return empty lists in case of error
47+
        return [[] for _ in inputs]
48+
49+
if __name__ == '__main__':
50+
    print("Starting embedding UDF server on port 8815...")
51+
    server = UDFServer("0.0.0.0:8815")
52+
    server.add_function(ai_embed_768)
53+
    server.serve()
54+
```
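The handler's batch contract can be checked locally without starting a server or downloading a model. The sketch below is an illustration under stated assumptions, not part of `databend-udf`: `embed_batch`, `fake_encode`, and `failing_encode` are hypothetical names, and `fake_encode` only imitates the shape of what `model.encode` returns. It exercises the two behaviors the handler above relies on: one vector per input row, and one empty vector per row when encoding fails.

```python
from typing import Callable

Encoder = Callable[[list[str]], list[list[float]]]


def embed_batch(inputs: list[str], encode: Encoder) -> list[list[float]]:
    """Return one vector per input; fall back to empty vectors on any error."""
    try:
        vectors = encode(inputs)
        # Copy each vector into a plain list so the result is easy to compare.
        return [list(vector) for vector in vectors]
    except Exception as exc:
        print(f"Error generating embeddings: {exc}")
        return [[] for _ in inputs]


def fake_encode(texts: list[str]) -> list[list[float]]:
    """Hypothetical stand-in for model.encode: a 2-dimensional toy vector per text."""
    return [[float(len(text)), 1.0] for text in texts]


def failing_encode(texts: list[str]) -> list[list[float]]:
    """Hypothetical encoder that always fails, to exercise the fallback path."""
    raise RuntimeError("model unavailable")
```

Keeping the output length equal to the input length matters because each returned vector is matched to an input row by position.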
55+
56+
```sql
57+
-- Register the external function in Databend
58+
CREATE OR REPLACE FUNCTION ai_embed_768 (STRING)
59+
    RETURNS ARRAY(FLOAT)
60+
    LANGUAGE PYTHON
61+
    HANDLER = 'ai_embed_768'
62+
    ADDRESS = 'https://your-ml-server.example.com';
63+
64+
-- Use the custom embedding in queries
65+
SELECT
66+
    id,
67+
    title,
68+
    cosine_distance(
69+
        ai_embed_768(content),
70+
        ai_embed_768('machine learning techniques')
71+
    ) AS similarity
72+
FROM articles
73+
ORDER BY similarity ASC
74+
LIMIT 5;
75+
```
76+
77+
For detailed instructions on setting up external functions, see [External Functions](/guides/query/external-function).
78+
79+
## Getting Started
80+
81+
Try these AI capabilities on [Databend Cloud](https://databend.com) with a free trial.
Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,60 @@
1+
---
2+
title: Built-in AI Functions
3+
---
4+
5+
# Built-in AI Functions
6+
7+
Databend provides built-in AI functions powered by Azure OpenAI Service for seamless integration of AI capabilities into your SQL workflows.
8+
9+
:::warning
10+
**Data Privacy Notice**: When using built-in AI functions, your data is sent to Azure OpenAI Service. By using these functions, you acknowledge this data transfer and agree to the [Azure OpenAI Data Privacy](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/data-privacy) terms.
11+
:::
12+
13+
| Function | Description | Use Cases |
14+
|----------|-------------|-----------|
15+
| [ai_text_completion](/sql/sql-functions/ai-functions/ai-text-completion) | Generates text based on prompts | • Content generation<br/>• Question answering<br/>• Summarization |
16+
| [ai_embedding_vector](/sql/sql-functions/ai-functions/ai-embedding-vector) | Converts text to vector representations | • Semantic search<br/>• Document similarity<br/>• Content recommendation |
17+
| [cosine_distance](/sql/sql-functions/vector-distance-functions/vector-cosine-distance) | Calculates similarity between vectors | • Finding similar documents<br/>• Ranking search results |
18+
19+
## Vector Storage in Databend
20+
21+
Databend stores embedding vectors using the `ARRAY(FLOAT NOT NULL)` data type, enabling direct similarity calculations with the `cosine_distance` function in SQL.
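To make the distance calculation concrete, here is a plain-Python sketch of cosine distance, assuming the usual definition of one minus cosine similarity. It is an illustration only, not Databend's implementation, and the function below is a local helper that merely shares the SQL function's name.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)
```

Under this definition, vectors pointing in the same direction have a distance of 0 and orthogonal vectors have a distance of 1, so smaller values mean more similar texts.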
22+
23+
## Example: Semantic Search with Embeddings
24+
25+
```sql
26+
-- Create a table for documents with embeddings
27+
CREATE TABLE articles (
28+
    id INT,
29+
    title VARCHAR,
30+
    content VARCHAR,
31+
    embedding ARRAY(FLOAT NOT NULL)
32+
);
33+
34+
-- Store documents with their vector embeddings
35+
INSERT INTO articles (id, title, content, embedding)
36+
VALUES
37+
    (1, 'Python for Data Science', 'Python is a versatile programming language...',
38+
        ai_embedding_vector('Python is a versatile programming language...')),
39+
    (2, 'Introduction to R', 'R is a popular programming language for statistics...',
40+
        ai_embedding_vector('R is a popular programming language for statistics...'));
41+
42+
-- Find semantically similar documents
43+
SELECT
44+
    id, title,
45+
    cosine_distance(embedding, ai_embedding_vector('How to use Python in data analysis?')) AS similarity
46+
FROM articles
47+
ORDER BY similarity ASC
48+
LIMIT 3;
49+
```
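The ranking step of the query above (`ORDER BY similarity ASC LIMIT 3`) can be mirrored in plain Python to see why ascending order returns the best matches first. This is a sketch under assumptions: `rank_by_cosine_distance`, `ARTICLES`, and `QUERY` are hypothetical names, and the 2-dimensional vectors are made-up toy values rather than real embeddings.

```python
import math

Row = tuple[int, str, list[float]]


def _cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; assumes equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_cosine_distance(
    query: list[float], rows: list[Row], k: int = 3
) -> list[tuple[int, str, float]]:
    """Return up to k (id, title, distance) tuples, smallest distance first."""
    scored = [
        (row_id, title, _cosine_distance(embedding, query))
        for row_id, title, embedding in rows
    ]
    scored.sort(key=lambda item: item[2])  # ascending: closest match first
    return scored[:k]


# Toy data standing in for the articles table and the embedded question.
ARTICLES: list[Row] = [
    (1, "Python for Data Science", [1.0, 0.1]),
    (2, "Introduction to R", [0.1, 1.0]),
]
QUERY = [0.9, 0.2]

if __name__ == "__main__":
    for row_id, title, distance in rank_by_cosine_distance(QUERY, ARTICLES):
        print(row_id, title, round(distance, 3))
```

Note that the SQL alias `similarity` actually holds a distance, so lower values indicate closer matches.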
50+
51+
## Example: Text Generation
52+
53+
```sql
54+
-- Generate text based on a prompt
55+
SELECT ai_text_completion('Explain the benefits of cloud data warehouses in three points:') AS completion;
56+
```
57+
58+
## Getting Started
59+
60+
Try these AI capabilities on [Databend Cloud](https://databend.com) with a free trial.
Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,3 @@
11
{
2-
"label": "AI Capabilities"
2+
"label": "Databend AI and ML"
33
}
Lines changed: 6 additions & 79 deletions
Original file line numberDiff line numberDiff line change
@@ -1,81 +1,8 @@
1-
# Databend AI Capabilities
1+
# Databend AI and ML
22

3-
This guide introduces Databend's built-in AI functions that enable natural language processing tasks through SQL queries, including text understanding, generation, and more.
3+
Databend offers two approaches for AI and ML integration:
44

5-
:::warning
6-
Data Privacy and Security
7-
8-
Databend uses Azure OpenAI Service for embeddings and text completions. Your data will be sent to Azure OpenAI when using these functions. These features are available by default on Databend Cloud.
9-
10-
**By using these functions, you acknowledge that your data will be sent to Azure OpenAI Service** and agree to the [Azure OpenAI Data Privacy](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/data-privacy) terms.
11-
:::
12-
13-
## Key AI Functions
14-
15-
| Function | Description | When to Use |
16-
|----------|-------------|------------|
17-
| [ai_text_completion](/sql/sql-functions/ai-functions/ai-text-completion) | Generates text based on a prompt | • Content generation<br/>• Question answering<br/>• Summarization<br/>• Text expansion |
18-
| [ai_embedding_vector](/sql/sql-functions/ai-functions/ai-embedding-vector) | Converts text into vector representations | • Semantic search<br/>• Document similarity<br/>• Content recommendation<br/>• Text classification |
19-
| [cosine_distance](/sql/sql-functions/vector-distance-functions/vector-cosine-distance) | Calculates similarity between vectors | • Finding similar documents<br/>• Ranking search results<br/>• Measuring text similarity |
20-
21-
22-
23-
## What are Embeddings?
24-
25-
Embeddings are vector representations of text that capture semantic meaning. Similar texts have closer vectors in the embedding space, enabling comparison and analysis for tasks like document similarity and clustering.
26-
27-
## Vector Storage in Databend
28-
29-
Databend can store embedding vectors using the `ARRAY(FLOAT NOT NULL)` data type and perform similarity calculations with the cosine_distance function directly in SQL.
30-
31-
## Example: Document Similarity Search
32-
33-
```sql
34-
-- Create a table for documents
35-
CREATE TABLE articles (
36-
id INT,
37-
title VARCHAR,
38-
content VARCHAR,
39-
embedding ARRAY(FLOAT NOT NULL)
40-
);
41-
42-
-- Insert documents with embeddings
43-
INSERT INTO articles (id, title, content, embedding)
44-
VALUES
45-
(1, 'Python for Data Science', 'Python is a versatile programming language...',
46-
ai_embedding_vector('Python is a versatile programming language...')),
47-
(2, 'Introduction to R', 'R is a popular programming language for statistics...',
48-
ai_embedding_vector('R is a popular programming language for statistics...'));
49-
50-
-- Find similar documents to a query
51-
SELECT
52-
id, title, content,
53-
cosine_distance(embedding, ai_embedding_vector('How to use Python in data analysis?')) AS similarity
54-
FROM articles
55-
ORDER BY similarity ASC
56-
LIMIT 3;
57-
```
58-
59-
## Example: Text Completion
60-
61-
```sql
62-
-- Generate a completion for a prompt
63-
SELECT ai_text_completion('Explain the benefits of cloud data warehouses in three points:') AS completion;
64-
65-
-- Result might be:
66-
-- 1. Scalability: Cloud data warehouses can easily scale up or down based on demand,
67-
-- eliminating the need for upfront capacity planning.
68-
-- 2. Cost-efficiency: Pay-as-you-go pricing models reduce capital expenditure and
69-
-- allow businesses to pay only for the resources they use.
70-
-- 3. Accessibility: Cloud data warehouses enable teams to access data from anywhere,
71-
-- facilitating remote work and global collaboration.
72-
```
73-
74-
## Building an AI Q&A System
75-
76-
You can create a simple Q&A system with Databend by:
77-
1. Storing documents with embeddings
78-
2. Finding relevant documents for a question
79-
3. Using text completion to generate answers
80-
81-
Try these AI capabilities on [Databend Cloud](https://databend.com) with a free trial.
5+
| Approach | Features | Use Cases |
6+
|----------|----------|-----------|
7+
| **[External Functions](01-external-functions.md)** *Recommended* | • Custom models<br/>• GPU deployment<br/>• Custom pipelines<br/>• Data privacy | • Specialized domains<br/>• High performance<br/>• Privacy requirements |
8+
| **[Built-in Functions](02-built-in-functions.md)** | • Text completion<br/>• Embeddings<br/>• Vector operations<br/>• Zero setup | • Quick prototyping<br/>• General NLP<br/>• Simple implementation |
