
Commit ecaeeba

loading: refine the file format (#2213)

* loading: refine the file format
* remove parquet & orc position visit

1 parent 7e02654 commit ecaeeba

15 files changed: +116 −85 lines changed

docs/en/guides/40-load-data/01-load/index.md

Lines changed: 18 additions & 41 deletions

@@ -2,50 +2,27 @@
title: Loading from Files
---

-import DetailsWrap from '@site/src/components/DetailsWrap';
+Databend offers simple, powerful commands to load data files into tables. Most operations require just a single command. Your data must be in a [supported format](/sql/sql-reference/file-format-options).

-Databend provides a variety of tools and commands that can help you load your data files into a table. Most of them are straightforward, meaning you can load your data with just a single command. Please note that your data files must be in one of the formats supported by Databend. See [Input & Output File Formats](/sql/sql-reference/file-format-options) for a list of supported file formats. The following is an overview of the data loading and unloading flows and their respective methods. Please refer to the topics in this chapter for detailed instructions.
+![Data Loading and Unloading Overview](/img/load/load-unload.jpeg)

-![Alt text](/img/load/load-unload.jpeg)
+## Supported File Formats

-This topic does not cover all of the available data loading methods, but it provides recommendations based on the location where your data files are stored. To find the recommended method and a link to the corresponding details page, toggle the block below:
+| Format | Type | Description |
+|--------|------|-------------|
+| [**CSV**](/guides/load-data/load-semistructured/load-csv), [**TSV**](/guides/load-data/load-semistructured/load-tsv) | Delimited | Text files with customizable delimiters |
+| [**NDJSON**](/guides/load-data/load-semistructured/load-ndjson) | Semi-structured | JSON objects, one per line |
+| [**Parquet**](/guides/load-data/load-semistructured/load-parquet) | Semi-structured | Efficient columnar storage format |
+| [**ORC**](/guides/load-data/load-semistructured/load-orc) | Semi-structured | High-performance columnar format |
+| [**Avro**](/guides/load-data/load-semistructured/load-avro) | Semi-structured | Compact binary format with schema |

-<DetailsWrap>
+## Loading by File Location

-<details>
-<summary>I want to load staged data files ...</summary>
-<div>
-<div>If you have data files in an internal/external stage or the user stage, Databend recommends that you load them using the COPY INTO command. The COPY INTO command is a powerful tool that can load large amounts of data quickly and efficiently.</div>
-<br/>
-<div>To learn more about using the COPY INTO command to load data from a stage, check out the <a href="stage">Loading from Stage</a> page. This page includes detailed tutorials that show you how to use the command to load data from a sample file in an internal/external stage or the user stage.</div>
-</div>
-</details>
+Select the location of your files to find the recommended loading method:

-<details>
-<summary>I want to load data files in a bucket ...</summary>
-<div>
-<div>If you have data files in a bucket or container on your object storage, such as Amazon S3, Google Cloud Storage, and Microsoft Azure, Databend recommends that you load them using the COPY INTO command. The COPY INTO command is a powerful tool that can load large amounts of data quickly and efficiently.</div>
-<br/>
-<div>To learn more about using the COPY INTO command to load data from a bucket or container, check out the <a href="s3">Loading from Bucket</a> page. This page includes a tutorial that shows you how to use the command to load data from a sample file in an Amazon S3 Bucket.</div>
-</div>
-</details>
-
-<details>
-<summary>I want to load local data files ...</summary>
-<div>
-<div>If you have data files in your local system, Databend recommends that you load them using <a href="https://github.com/databendlabs/BendSQL">BendSQL</a>, the Databend native CLI tool, allowing you to establish a connection with Databend and execute queries directly from a CLI window.</div>
-<br/>
-<div>To learn more about using BendSQL to load your local data files, check out the <a href="local">Loading from Local File</a> page. This page includes tutorials that show you how to use the tool to load data from a local sample file.</div>
-</div>
-</details>
-
-<details>
-<summary>I want to load remote data files ...</summary>
-<div>
-<div>If you have remote data files, Databend recommends that you load them using the COPY INTO command. The COPY INTO command is a powerful tool that can load large amounts of data quickly and efficiently.</div>
-<br/>
-<div>To learn more about using the COPY INTO command to load remote data files, check out the <a href="http">Loading from Remote File</a> page. This page includes a tutorial that shows you how to use the command to load data from a remote sample file.</div>
-</div>
-</details>
-
-</DetailsWrap>
+| Data Source | Recommended Tool | Description | Documentation |
+|-------------|-----------------|-------------|---------------|
+| **Staged Data Files** | **COPY INTO** | Fast, efficient loading from internal/external stages or user stage | [Loading from Stage](stage) |
+| **Cloud Storage** | **COPY INTO** | Load from Amazon S3, Google Cloud Storage, Microsoft Azure | [Loading from Bucket](s3) |
+| **Local Files** | [**BendSQL**](https://github.com/databendlabs/BendSQL) | Databend's native CLI tool for local file loading | [Loading from Local File](local) |
+| **Remote Files** | **COPY INTO** | Load data from remote HTTP/HTTPS locations | [Loading from Remote File](http) |
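As a quick illustration of the staged-file flow recommended above, here is a minimal `COPY INTO` sketch (the `books` table and `my_stage` stage are hypothetical names, not part of this commit):

```sql
-- Hypothetical setup: a target table and a staged CSV file
CREATE TABLE books (
    title VARCHAR,
    author VARCHAR,
    year INT
);

-- Load books.csv from the stage, skipping the header row
COPY INTO books
FROM @my_stage/books.csv
FILE_FORMAT = (TYPE = CSV, SKIP_HEADER = 1);
```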

docs/en/guides/40-load-data/03-load-semistructured/00-load-parquet.md

Lines changed: 2 additions & 1 deletion

@@ -20,7 +20,8 @@ COPY INTO [<database>.]<table_name>
FILE_FORMAT = (TYPE = PARQUET)
```

-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more Parquet file format options, refer to [Parquet File Format Options](/sql/sql-reference/file-format-options#parquet-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).

## Tutorial: Loading Data from Parquet Files
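To illustrate the syntax shown in the hunk above, a hedged sketch of loading several Parquet files at once (stage and table names are hypothetical; `PATTERN` is the documented way to match multiple files):

```sql
-- Load every .parquet file found in the stage into one table
COPY INTO ontime
FROM @my_stage
PATTERN = '.*[.]parquet'
FILE_FORMAT = (TYPE = PARQUET);
```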

docs/en/guides/40-load-data/03-load-semistructured/01-load-csv.md

Lines changed: 2 additions & 1 deletion

@@ -31,7 +31,8 @@ FROM { userStage | internalStage | externalStage | externalLocation }
) ]
```

-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more CSV file format options, refer to [CSV File Format Options](/sql/sql-reference/file-format-options#csv-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).

## Tutorial: Loading Data from CSV Files
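A sketch combining the CSV format options that the new link documents (all names hypothetical; `FIELD_DELIMITER`, `RECORD_DELIMITER`, and `SKIP_HEADER` are standard CSV file format options):

```sql
-- Load a comma-delimited file with a single header row
COPY INTO books
FROM @my_stage/books.csv
FILE_FORMAT = (
    TYPE = CSV,
    FIELD_DELIMITER = ',',
    RECORD_DELIMITER = '\n',
    SKIP_HEADER = 1
);
```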

docs/en/guides/40-load-data/03-load-semistructured/02-load-tsv.md

Lines changed: 2 additions & 1 deletion

@@ -28,7 +28,8 @@ FROM { userStage | internalStage | externalStage | externalLocation }
) ]
```

-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more TSV file format options, refer to [TSV File Format Options](/sql/sql-reference/file-format-options#tsv-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).

## Tutorial: Loading Data from TSV Files

docs/en/guides/40-load-data/03-load-semistructured/03-load-ndjson.md

Lines changed: 2 additions & 1 deletion

@@ -28,7 +28,8 @@ FROM { userStage | internalStage | externalStage | externalLocation }
) ]
```

-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more NDJSON file format options, refer to [NDJSON File Format Options](/sql/sql-reference/file-format-options#ndjson-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).

## Tutorial: Loading Data from NDJSON Files

docs/en/guides/40-load-data/03-load-semistructured/04-load-orc.md

Lines changed: 2 additions & 1 deletion

@@ -18,7 +18,8 @@ COPY INTO [<database>.]<table_name>
FILE_FORMAT = (TYPE = ORC)
```

-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more ORC file format options, refer to [ORC File Format Options](/sql/sql-reference/file-format-options#orc-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).

## Tutorial: Loading Data from ORC Files

docs/en/guides/40-load-data/03-load-semistructured/05-load-avro.md

Lines changed: 2 additions & 1 deletion

@@ -18,7 +18,8 @@ COPY INTO [<database>.]<table_name>
FILE_FORMAT = (TYPE = AVRO)
```

-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more Avro file format options, refer to [Avro File Format Options](/sql/sql-reference/file-format-options#avro-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).

## Tutorial: Loading Avro Data into Databend from Remote HTTP URL

docs/en/guides/40-load-data/03-load-semistructured/index.md

Lines changed: 10 additions & 12 deletions

@@ -1,20 +1,18 @@
---
title: Loading Semi-structured Formats
---
-import IndexOverviewList from '@site/src/components/IndexOverviewList';

## What is Semi-structured Data?

-Semi-structured data is a form of data that does not conform to a rigid structure like traditional databases but still contains tags or markers to separate semantic elements and enforce hierarchies of records and fields.
+Semi-structured data contains tags or markers to separate semantic elements while not conforming to rigid database structures. Databend efficiently loads these formats using the `COPY INTO` command, with optional on-the-fly data transformation.

-Databend facilitates the efficient and user-friendly loading of semi-structured data. It supports various formats such as **Parquet**, **CSV**, **TSV**, and **NDJSON**.
+## Supported File Formats

-Additionally, Databend allows for on-the-fly transformation of data during the loading process.
-Copy from semi-structured data format is the most common way to load data into Databend, it is very efficient and easy to use.
-
-## Supported Formats
-
-Databend supports several semi-structured data formats loaded using the `COPY INTO` command:
-
-<IndexOverviewList />
+| File Format | Description | Guide |
+| ----------- | ----------- | ----- |
+| **Parquet** | Efficient columnar storage format | [Loading Parquet](load-parquet) |
+| **CSV** | Comma-separated values | [Loading CSV](load-csv) |
+| **TSV** | Tab-separated values | [Loading TSV](load-tsv) |
+| **NDJSON** | Newline-delimited JSON | [Loading NDJSON](load-ndjson) |
+| **ORC** | Optimized Row Columnar format | [Loading ORC](load-orc) |
+| **Avro** | Row-based format with schema definition | [Loading Avro](load-avro) |
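The rewritten intro mentions on-the-fly transformation during loading; a hedged sketch of what that can look like, assuming COPY-with-transformation is available for the format in question (all identifiers hypothetical):

```sql
-- Transform while loading: extract fields from each NDJSON row
-- and normalize the name before it lands in the table
COPY INTO user_names
FROM (
    SELECT $1:id, UPPER($1:name::STRING)
    FROM @my_stage/users.ndjson
)
FILE_FORMAT = (TYPE = NDJSON);
```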

docs/en/guides/40-load-data/04-transform/00-querying-parquet.md

Lines changed: 10 additions & 2 deletions

@@ -7,7 +7,7 @@ sidebar_label: Parquet

Syntax:
```sql
-SELECT [<alias>.]<column> [, <column> ...] | [<alias>.]$<col_position> [, $<col_position> ...]
+SELECT [<alias>.]<column> [, <column> ...]
FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
[(
[<connection_parameters>],
@@ -19,7 +19,15 @@ FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
```

:::info Tips
-Parquet has schema information, so we can query the columns `<column> [, <column> ...]` directly.
+**Query Return Content Explanation:**
+
+* **Return Format**: Column values in their native data types (not variants)
+* **Access Method**: Directly use column names `column_name`
+* **Example**: `SELECT id, name, age FROM @stage_name`
+* **Key Features**:
+  * No need for path expressions (like `$1:name`)
+  * No type casting required
+  * Parquet files contain embedded schema information
:::

## Tutorial
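A short usage sketch of the direct column access the new tips describe (stage and file names are hypothetical; `file_format =>` mirrors the stage-query syntax used elsewhere in these docs):

```sql
-- Parquet carries its own schema, so columns are addressed by name
SELECT id, name, age
FROM @my_stage/users.parquet
(file_format => 'parquet')
LIMIT 10;
```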

docs/en/guides/40-load-data/04-transform/01-querying-csv.md

Lines changed: 10 additions & 1 deletion

@@ -19,7 +19,16 @@ FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}

:::info Tips
-CSV doesn't have schema information, so we can only query the columns `$<col_position> [, $<col_position> ...]` by position.
+**Query Return Content Explanation:**
+
+* **Return Format**: Individual column values as strings by default
+* **Access Method**: Use positional references `$<col_position>` (e.g., `$1`, `$2`, `$3`)
+* **Example**: `SELECT $1, $2, $3 FROM @stage_name`
+* **Key Features**:
+  * Columns accessed by position, not by name
+  * Each `$<col_position>` refers to a single column, not the whole row
+  * Type casting required for non-string operations (e.g., `CAST($1 AS INT)`)
+  * No embedded schema information in CSV files
:::

## Tutorial
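A matching sketch of the positional access and casting the tips describe (hypothetical names):

```sql
-- CSV columns have no names: $1, $2, ... are positions,
-- and non-string use requires an explicit cast
SELECT $1 AS title, CAST($2 AS INT) AS year
FROM @my_stage/books.csv
(file_format => 'csv');
```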

docs/en/guides/40-load-data/04-transform/02-querying-tsv.md

Lines changed: 10 additions & 1 deletion

@@ -19,7 +19,16 @@ FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}

:::info Tips
-TSV doesn't have schema information, so we can only query the columns `$<col_position> [, $<col_position> ...]` by position.
+**Query Return Content Explanation:**
+
+* **Return Format**: Individual column values as strings by default
+* **Access Method**: Use positional references `$<col_position>` (e.g., `$1`, `$2`, `$3`)
+* **Example**: `SELECT $1, $2, $3 FROM @stage_name`
+* **Key Features**:
+  * Columns accessed by position, not by name
+  * Each `$<col_position>` refers to a single column, not the whole row
+  * Type casting required for non-string operations (e.g., `CAST($1 AS INT)`)
+  * No embedded schema information in TSV files
:::

## Tutorial

docs/en/guides/40-load-data/04-transform/03-querying-ndjson.md

Lines changed: 10 additions & 1 deletion

@@ -19,7 +19,16 @@ FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}

:::info Tips
-NDJSON is a variant for the whole row, the column is `$1:<column> [, $1:<column> ...]`.
+**Query Return Content Explanation:**
+
+* **Return Format**: Each row as a single variant object (referenced as `$1`)
+* **Access Method**: Use path expressions `$1:column_name`
+* **Example**: `SELECT $1:title, $1:author FROM @stage_name`
+* **Key Features**:
+  * Must use path notation to access specific fields
+  * Type casting required for type-specific operations (e.g., `CAST($1:id AS INT)`)
+  * Each NDJSON line is parsed as a complete JSON object
+  * Whole row is represented as a single variant object
:::

## Tutorial
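A sketch of the path-expression access the tips describe (hypothetical names):

```sql
-- Each NDJSON line is one variant row ($1); fields are reached
-- by path and cast when a concrete type is needed
SELECT $1:title AS title, CAST($1:year AS INT) AS year
FROM @my_stage/books.ndjson
(file_format => 'ndjson');
```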

docs/en/guides/40-load-data/04-transform/03-querying-orc.md

Lines changed: 6 additions & 2 deletions

@@ -8,7 +8,7 @@ import StepContent from '@site/src/components/Steps/step-content';

## Syntax

```sql
-SELECT [<alias>.]<column> [, <column> ...] | [<alias>.]$<col_position> [, $<col_position> ...]
+SELECT [<alias>.]<column> [, <column> ...]
FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
[(
[<connection_parameters>],
@@ -18,6 +18,10 @@ FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
)]
```

+:::info Tips
+ORC has schema information, so we can query the columns `<column> [, <column> ...]` directly.
+:::
+
## Tutorial

In this tutorial, we will walk you through the process of downloading the Iris dataset in ORC format, uploading it to an Amazon S3 bucket, creating an external stage, and querying the data directly from the ORC file.

@@ -75,7 +79,7 @@ You can also query the remote ORC file directly:
SELECT
  *
FROM
-  'https://github.com/tensorflow/io/raw/master/tests/test_orc/iris.orc' (file_format = > 'orc');
+  'https://github.com/tensorflow/io/raw/master/tests/test_orc/iris.orc' (file_format => 'orc');
```

</StepContent>

docs/en/guides/40-load-data/04-transform/04-querying-avro.md

Lines changed: 10 additions & 1 deletion

@@ -18,7 +18,16 @@ FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
```

:::info Tips
-Avro files can be queried directly as variants using `$1:<column>`.
+**Query Return Content Explanation:**
+
+* **Return Format**: Each row as a single variant object (referenced as `$1`)
+* **Access Method**: Use path expressions `$1:column_name`
+* **Example**: `SELECT $1:id, $1:name FROM @stage_name`
+* **Key Features**:
+  * Must use path notation to access specific fields
+  * Type casting required for type-specific operations (e.g., `CAST($1:id AS INT)`)
+  * Avro schema is mapped to variant structure
+  * Whole row is represented as a single variant object
:::

## Avro Querying Features Overview

docs/en/guides/40-load-data/04-transform/index.md

Lines changed: 20 additions & 18 deletions

@@ -3,7 +3,7 @@ title: Querying & Transforming
slug: querying-stage
---

-Databend supports querying and transforming data directly from staged files using the `SELECT` statement. This feature is available for user, internal, and external stages, as well as buckets in object storage (e.g., Amazon S3, Google Cloud Storage, Microsoft Azure) and remote servers via HTTPS. It's useful for inspecting staged file contents before or after data loading.
+Databend enables direct querying of staged files without loading data into tables first. Query files from any stage type (user, internal, external) or directly from object storage and HTTPS URLs. Ideal for data inspection, validation, and transformation before or after loading.

## Syntax

@@ -21,24 +21,26 @@ FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}

## Parameters Overview

-The `SELECT` statement for staged files supports various parameters to control data access and parsing. For detailed information and examples on each parameter, please refer to their respective documentation sections:
+Key parameters for controlling data access and parsing:

-- **`FILE_FORMAT`**: Specifies the format of the file (e.g., CSV, TSV, NDJSON, PARQUET, ORC, Avro, or custom formats).
-- **`PATTERN`**: Uses a regular expression to match and filter file names.
-- **`FILES`**: Explicitly lists specific file names to query.
-- **`CASE_SENSITIVE`**: Controls case sensitivity for column names in Parquet files.
-- **`table_alias`**: Assigns an alias to staged files for easier referencing in queries.
-- **`$col_position`**: Selects columns by their positional index (1-based).
-- **`connection_parameters`**: Provides connection details for external storage.
-- **`uri`**: Specifies the URI for remote files.
+| Parameter | Description |
+| --------- | ----------- |
+| `FILE_FORMAT` | File format type (CSV, TSV, NDJSON, PARQUET, ORC, Avro) |
+| `PATTERN` | Regex pattern to filter files |
+| `FILES` | Explicit list of files to query |
+| `CASE_SENSITIVE` | Column name case sensitivity (Parquet) |
+| `table_alias` | Alias for referencing staged files |
+| `$col_position` | Column selection by position (1-based) |
+| `connection_parameters` | External storage connection details |
+| `uri` | URI for remote files |

## Supported File Formats

-| File Format | Guide |
-| ----------- | -------------------------------------------------- |
-| Parquet | [Querying Parquet Files](./00-querying-parquet.md) |
-| CSV | [Querying CSV Files](./01-querying-csv.md) |
-| TSV | [Querying TSV Files](./02-querying-tsv.md) |
-| NDJSON | [Querying NDJSON Files](./03-querying-ndjson.md) |
-| ORC | [Querying ORC Files](./03-querying-orc.md) |
-| Avro | [Querying Avro Files](./04-querying-avro.md) |
+| File Format | Return Format | Access Method | Example | Guide |
+| ----------- | ------------- | ------------- | ------- | ----- |
+| Parquet | Native data types | Direct column names | `SELECT id, name FROM` | [Querying Parquet Files](./00-querying-parquet.md) |
+| ORC | Native data types | Direct column names | `SELECT id, name FROM` | [Querying ORC Files](./03-querying-orc.md) |
+| CSV | String values | Positional references `$<position>` | `SELECT $1, $2 FROM` | [Querying CSV Files](./01-querying-csv.md) |
+| TSV | String values | Positional references `$<position>` | `SELECT $1, $2 FROM` | [Querying TSV Files](./02-querying-tsv.md) |
+| NDJSON | Variant object | Path expressions `$1:<field>` | `SELECT $1:id, $1:name FROM` | [Querying NDJSON Files](./03-querying-ndjson.md) |
+| Avro | Variant object | Path expressions `$1:<field>` | `SELECT $1:id, $1:name FROM` | [Querying Avro Files](./04-querying-avro.md) |
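Tying the parameter table to the format table, a hedged sketch that filters staged files with `PATTERN` while reading by position (hypothetical names):

```sql
-- Scan only the .csv files in the stage, reading two columns by position
SELECT $1, $2
FROM @my_stage
(file_format => 'csv', pattern => '.*[.]csv');
```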
