Commit fa88127

🌐 Add LLM Translations (#878)
* 💬 Generate LLM translations
* docs: minor update

Signed-off-by: Chojan Shang <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Chojan Shang <[email protected]>
1 parent a1aa9f4 commit fa88127

File tree

1 file changed: +54 −48 lines

  • docs/cn/guides/40-load-data/04-transform

---
title: Querying and Transforming
slug: querying-stage
---

Databend introduces a transformative approach to data processing with its ELT (Extract, Load, Transform) model. A key aspect of this model is querying data in staged files.

You can query data in staged files using the `SELECT` statement. This feature is available for the following types of stages:

- User stage, internal stage, or external stage.
- Bucket or container created within your object storage, such as Amazon S3, Google Cloud Storage, and Microsoft Azure.
- Remote servers accessible via HTTPS.

This feature is particularly useful for inspecting or viewing the contents of staged files, whether before or after loading data.

## Syntax and Parameters

```sql
SELECT [<alias>.]<column> [, <column> ...] | [<alias>.]$<col_position> [, $<col_position> ...]
FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
[(
  [<connection_parameters>],
  [ PATTERN => '<regex_pattern>'],
  [ FILE_FORMAT => 'CSV | TSV | NDJSON | PARQUET | ORC | <custom_format_name>'],
  [ FILES => ( '<file_name>' [ , '<file_name>' ... ])]
)]
```

:::note
When the stage path contains special characters such as spaces or parentheses, enclose the entire path in single quotes, as demonstrated in the following SQL statements:

```sql
SELECT * FROM 's3://mybucket/dataset(databend)/' ...

SELECT * FROM 's3://mybucket/dataset databend/' ...
```
:::

### FILE_FORMAT

The FILE_FORMAT parameter allows you to specify the format of your file, which can be CSV, TSV, NDJSON, PARQUET, ORC, or a custom format that you've defined using the [CREATE FILE FORMAT](/sql/sql-commands/ddl/file-format/ddl-create-file-format) command. For example:

```sql
CREATE FILE FORMAT my_custom_csv TYPE=CSV FIELD_DELIMITER='\t';

SELECT $1 FROM @my_stage/file (FILE_FORMAT=>'my_custom_csv');
```

Please note that when you query or perform a COPY INTO operation from a staged file, you must explicitly specify the file format when creating the stage. Otherwise, the default format, Parquet, is applied. See the example below:

```sql
CREATE STAGE my_stage FILE_FORMAT = (TYPE = CSV);
```

If you have staged a file in a format different from the stage's specified format, you can explicitly specify the file format within the SELECT or COPY INTO statement. For example:

```sql
SELECT $1 FROM @my_stage (FILE_FORMAT=>'NDJSON');

COPY INTO my_table FROM (SELECT $1 FROM @my_stage t) FILE_FORMAT = (TYPE = NDJSON);
```

### PATTERN

The PATTERN option allows you to specify a [PCRE2](https://www.pcre.org/current/doc/html/)-based regular expression pattern, enclosed in single quotes, to match file names. It is used to filter and select files based on the provided pattern. For example, you can use the pattern '.*parquet' to match all file names ending with "parquet". For detailed information on the PCRE2 syntax, refer to the documentation at http://www.pcre.org/current/doc/html/pcre2syntax.html.

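As a minimal sketch (the stage name `my_stage` is hypothetical, and `SELECT *` assumes the matched files are Parquet):

```sql
-- Match only file names ending in "parquet" within the stage.
-- 'my_stage' is a hypothetical stage name used for illustration.
SELECT * FROM @my_stage (PATTERN => '.*parquet');
```
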
### FILES

The FILES option, on the other hand, enables you to explicitly specify one or more comma-separated file names. This allows you to directly filter and query data from specific files within a folder. For example, if you want to query data from the Parquet files "books-2023.parquet", "books-2022.parquet", and "books-2021.parquet", you can provide these file names within the FILES option.

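A minimal sketch of this, assuming a hypothetical stage named `my_stage` that holds the three Parquet files:

```sql
-- Query only the three named Parquet files in the stage.
-- 'my_stage' is a hypothetical stage name used for illustration.
SELECT * FROM @my_stage (FILES => ('books-2023.parquet', 'books-2022.parquet', 'books-2021.parquet'));
```
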
### table_alias

When working with staged files in a SELECT statement where no table name is available, you can assign an alias to the files. This allows you to treat the files as a table, with its fields serving as columns within the table. This is useful when working with multiple tables within the SELECT statement or when selecting specific columns. Here's an example:

```sql
-- The alias 't1' represents the staged file, while 't2' is a regular table
SELECT t1.$1, t2.$2 FROM @my_stage t1, t2;
```

### $<col_position>

When selecting from a staged file, you can use column positions, which start from 1. At present, using column positions in SELECT operations on staged files is limited to the Parquet, NDJSON, CSV, and TSV formats.

```sql
SELECT $2 FROM @my_stage (FILES=>('sample.csv')) ORDER BY $1;
```

It is important to note that when working with NDJSON, only $1 is allowed, representing the entire row and having the data type Variant. To select a specific field, use `$1:<field_name>`.

```sql
-- Select the entire row using column position:
SELECT $1 FROM @my_stage (FILE_FORMAT=>'NDJSON')

-- Select a specific field named "a" using column position:
SELECT $1:a FROM @my_stage (FILE_FORMAT=>'NDJSON')
```

When using COPY INTO to copy data from a staged file, Databend matches the field names at the top level of the NDJSON file with the column names in the destination table, rather than relying on column positions. In the example below, the table *my_table* should have column definitions identical to the top-level field names in the NDJSON files:

```sql
COPY INTO my_table FROM (SELECT $1 FROM @my_stage t) FILE_FORMAT = (type = NDJSON)
```

### connection_parameters

To query data files in a bucket or container on your storage service, provide the necessary connection parameters. For the available connection parameters for each storage service, refer to [Connection Parameters](/sql/sql-reference/connect-parameters).

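As a hedged sketch only (the bucket path and credential values are placeholders, and the exact parameter names for your storage service should be taken from the Connection Parameters reference), an S3 query might look like:

```sql
-- Hypothetical example: query a file in an S3 bucket with inline connection parameters.
-- The bucket path and the ACCESS_KEY_ID / SECRET_ACCESS_KEY values are placeholders.
SELECT *
FROM 's3://my-bucket/data/'
(
  CONNECTION => (
    ACCESS_KEY_ID => '<your-access-key-id>',
    SECRET_ACCESS_KEY => '<your-secret-access-key>'
  ),
  FILES => ('books.parquet')
);
```
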
### uri

Specify the URI of remote files accessible via HTTPS.

## Limitations

When querying a staged file, the following format-specific limitations apply:

- Selecting all fields with the symbol * is only supported for Parquet files.
- When selecting from a CSV or TSV file, all fields are parsed as strings, and the SELECT statement only allows the use of column positions. Additionally, the number of fields in the file must not exceed max.N+1000. For example, if the statement is `SELECT $1, $2 FROM @my_stage (FILES=>('sample.csv'))`, the sample.csv file can have a maximum of 1,002 fields.

## Tutorials

### Tutorial 1: Querying Data from Stage

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

This example shows how to query data in a Parquet file stored in different locations. Click the tabs below to see details.

<Tabs groupId="query2stage">
<TabItem value="Stages" label="Stages">

Let's assume you have a sample file named [books.parquet](https://datafuse-1253727613.cos.ap-hongkong.myqcloud.com/data/books.parquet) and you have uploaded it to your user stage, an internal stage named *my_internal_stage*, and an external stage named *my_external_stage*. To upload files to a stage, use the [PRESIGN](/sql/sql-commands/ddl/stage/presign) method.

```sql
-- Query the file in the user stage
SELECT * FROM @~/books.parquet;

-- Query the file in the internal stage
SELECT * FROM @my_internal_stage/books.parquet;

-- Query the file in the external stage
SELECT * FROM @my_external_stage/books.parquet;
```

</TabItem>
<TabItem value="Bucket" label="Bucket">

Let's assume you have a sample file named [books.parquet](https://datafuse-1253727613.cos.ap-hongkong.myqcloud.com/data/books.parquet) stored in a bucket named *databend-toronto* on Amazon S3 in the region *us-east-2*. You can query the data by specifying the connection parameters:

```sql
SELECT
...
  FILES => ('books.parquet')
);
```

</TabItem>
<TabItem value="Remote" label="Remote">

Let's assume you have a sample file named [books.parquet](https://datafuse-1253727613.cos.ap-hongkong.myqcloud.com/data/books.parquet) stored on a remote server. You can query the data by specifying the file URI:

```sql
SELECT * FROM 'https://datafuse-1253727613.cos.ap-hongkong.myqcloud.com/data/books.parquet';
```

</TabItem>
</Tabs>

### Tutorial 2: Querying Data with PATTERN

Let's assume you have the following Parquet files with the same schema, as well as some files of other formats, stored in a bucket named *databend-toronto* on Amazon S3 in the region *us-east-2*.

```text
databend-toronto/
...
└── books-2019.parquet
```

To query data from all Parquet files in the folder, you can use the `PATTERN` option:

```sql
SELECT
...
);
```

To query data from the Parquet files "books-2023.parquet", "books-2022.parquet", and "books-2021.parquet" in the folder, you can use the FILES option:

```sql
SELECT
...
    'books-2021.parquet'
  )
);
```
