title: Efficient Data Transformation with Databend
title: Querying and Transforming
slug: querying-stage
---
Databend brings an ELT (Extract, Load, Transform) model to data processing. A key part of this model is querying data directly in staged files.
When the stage path contains special characters such as spaces or parentheses, enclose the entire path in single quotes.
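As a minimal sketch of the quoting rule, the statements below query two locations whose paths contain parentheses and a space. The bucket name `mybucket` and both folder names are hypothetical, and any connection parameters your storage service requires are left out for brevity.

```sql
-- Hypothetical path containing parentheses: quote the whole path.
SELECT * FROM 's3://mybucket/dataset(databend)/';

-- Hypothetical path containing a space: quote the whole path.
SELECT * FROM 's3://mybucket/dataset databend/';
```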
The FILE_FORMAT parameter specifies the format of your file: CSV, TSV, NDJSON, PARQUET, or a custom format that you have defined with the [CREATE FILE FORMAT](/sql/sql-commands/ddl/file-format/ddl-create-file-format) command.
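A short sketch of both cases follows. The stage `my_stage` and the format name `my_custom_csv` are assumptions made for illustration; the custom format here is assumed to be a CSV variant that uses a tab as the field delimiter.

```sql
-- Query staged files using a built-in format.
SELECT $1, $2 FROM @my_stage (FILE_FORMAT => 'CSV');

-- Define a custom format (hypothetical name), then reference it by name.
CREATE FILE FORMAT my_custom_csv TYPE = CSV FIELD_DELIMITER = '\t';
SELECT $1 FROM @my_stage (FILE_FORMAT => 'my_custom_csv');
```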
To query a staged file or run COPY INTO from it, you must explicitly specify the file format when you create the stage. Otherwise, the default format, Parquet, is applied. For example:

```sql
CREATE STAGE my_stage FILE_FORMAT = (TYPE = CSV);
```

If a staged file is in a format different from the one specified for the stage, you can explicitly specify the file format in the SELECT or COPY INTO statement. For example:

```sql
SELECT $1 FROM @my_stage (FILE_FORMAT=>'NDJSON');
```
@@ -60,88 +63,89 @@ COPY INTO my_table FROM (SELECT $1 FROM @my_stage t) FILE_FORMAT = (TYPE = NDJ
### PATTERN

The PATTERN option takes a [PCRE2](https://www.pcre.org/current/doc/html/)-based regular expression, enclosed in single quotes, and selects only the files whose names match it. For example, the pattern '.*parquet' matches all file names ending with "parquet". For details on PCRE2 syntax, see http://www.pcre.org/current/doc/html/pcre2syntax.html.
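As an illustration of the pattern described above, the following sketch queries every matching file in a stage. The stage name `my_stage` is an assumption; the second statement uses a slightly stricter, hypothetical pattern that requires a literal dot before the extension.

```sql
-- Match all staged files whose names end with "parquet".
SELECT * FROM @my_stage (PATTERN => '.*parquet');

-- Stricter variant: [.] matches a literal dot, so only "*.parquet" names match.
SELECT * FROM @my_stage (PATTERN => '.*[.]parquet');
```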
The FILES option, on the other hand, enables you to explicitly specify one or more file names separated by commas. This option allows you to directly filter and query data from specific files within a folder. For example, if you want to query data from the Parquet files "books-2023.parquet", "books-2022.parquet", and "books-2021.parquet", you can provide these file names within the FILES option.
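The three files named above could be queried together as in this sketch, which assumes they sit at the root of a stage named `my_stage`:

```sql
-- Query only the listed files; other files in the stage are ignored.
SELECT * FROM @my_stage (
    FILES => ('books-2023.parquet', 'books-2022.parquet', 'books-2021.parquet')
);
```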
When working with staged files in a SELECT statement where no table name is available, you can assign an alias to the files. This allows you to treat the files as a table, with its fields serving as columns within the table. This is useful when working with multiple tables within the SELECT statement or when selecting specific columns. Here's an example:
```sql
-- The alias 't1' represents the staged file, while 't2' is a regular table
SELECT t1.$1, t2.$2 FROM @my_stage t1, t2;
```
### $<col_position>

When selecting from a staged file, you can refer to columns by position, starting from 1. Selecting by column position is currently supported only for the Parquet, NDJSON, CSV, and TSV formats.
```sql
SELECT $2 FROM @my_stage (FILES=>('sample.csv')) ORDER BY $1;
```

When working with NDJSON, only $1 is allowed. It represents the entire row and has the data type Variant. To select a specific field, use `$1:<field_name>`.
```sql
-- Select a specific field named "a" using column position:
SELECT $1:a FROM @my_stage (FILE_FORMAT=>'NDJSON');
```

When using COPY INTO to copy data from a staged file, Databend matches the top-level field names in the NDJSON file against the column names in the destination table, rather than relying on column positions. In the example below, the column definitions of the table *my_table* should be identical to the top-level field names in the NDJSON files:
```sql
COPY INTO my_table FROM (SELECT $1 FROM @my_stage t) FILE_FORMAT = (type = NDJSON);
```
### connection_parameters
To query data files in a bucket or container on your storage service, provide the necessary connection parameters. For the available connection parameters for each storage service, refer to [Connection Parameters](/sql/sql-reference/connect-parameters).
Specify the URI of remote files accessible via HTTPS.

## Limitations

The following format-specific limitations apply when querying a staged file:

- Selecting all fields with the symbol * is only supported for Parquet files.
- When selecting from a CSV or TSV file, all fields are parsed as strings, and the SELECT statement only allows the use of column positions. The number of fields in the file must also not exceed max.N+1000, where max.N is the highest column position used in the statement. For example, if the statement is `SELECT $1, $2 FROM @my_stage (FILES=>('sample.csv'))`, the sample.csv file can have a maximum of 1,002 fields.
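
Because CSV and TSV fields arrive as strings, a query that needs typed values has to convert them itself. The sketch below is one way this might look; the column meanings (`id`, `name`) and the file layout of sample.csv are assumptions made for illustration.

```sql
-- $1 and $2 are read as strings; CAST converts the first field to an integer.
-- The highest position used is 2, so sample.csv may hold at most 1,002 fields.
SELECT CAST($1 AS INT) AS id, $2 AS name
FROM @my_stage (FILE_FORMAT => 'CSV', FILES => ('sample.csv'));
```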
This example shows how to query data in a Parquet file stored in different locations. Click the tabs below to see details.

<Tabs groupId="query2stage">
<TabItem value="Stages" label="Stages">
<TabItem value="Stages" label="Stage">

Let's assume you have a sample file named [books.parquet](https://datafuse-1253727613.cos.ap-hongkong.myqcloud.com/data/books.parquet) and you have uploaded it to your user stage, an internal stage named *my_internal_stage*, and an external stage named *my_external_stage*. To upload files to a stage, use the [PRESIGN](/sql/sql-commands/ddl/stage/presign) method.
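
Under those assumptions, the queries might look like the following sketch. It assumes the file sits at the root of each stage and that the stages use the default Parquet format.

```sql
-- Query the file in the user stage.
SELECT * FROM @~/books.parquet;

-- Query the file in the internal stage.
SELECT * FROM @my_internal_stage/books.parquet;

-- Query the file in the external stage.
SELECT * FROM @my_external_stage/books.parquet;
```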
Let's assume you have a sample file named [books.parquet](https://datafuse-1253727613.cos.ap-hongkong.myqcloud.com/data/books.parquet) stored in a bucket named *databend-toronto* on Amazon S3 in the region *us-east-2*. You can query the data by specifying the connection parameters:
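
A sketch of such a query is shown here. It assumes the `CONNECTION => (...)` form for passing connection parameters and that the file sits at the bucket root; the two credential values are placeholders to replace with your own.

```sql
SELECT * FROM 's3://databend-toronto' (
    CONNECTION => (
        ACCESS_KEY_ID = '<your-access-key-id>',         -- placeholder
        SECRET_ACCESS_KEY = '<your-secret-access-key>', -- placeholder
        REGION = 'us-east-2'
    ),
    FILES => ('books.parquet')
);
```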
Let's assume you have a sample file named [books.parquet](https://datafuse-1253727613.cos.ap-hongkong.myqcloud.com/data/books.parquet) stored on a remote server. You can query the data by specifying the file URI:
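
As a minimal sketch, the file URI is passed as a quoted string in place of a stage name. The explicit FILE_FORMAT is included here as an assumption, in case no default applies to a plain URI.

```sql
SELECT * FROM 'https://datafuse-1253727613.cos.ap-hongkong.myqcloud.com/data/books.parquet'
    (FILE_FORMAT => 'PARQUET');
```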
Let's assume you have the following Parquet files with the same schema, as well as some files of other formats, stored in a bucket named *databend-toronto* on Amazon S3 in the region *us-east-2*.