
Commit ecaeeba

loading: refine the file format (#2213)

* loading: refine the file format
* remove parquet & orc position visit

1 parent 7e02654 commit ecaeeba

15 files changed: +116 −85 lines changed

docs/en/guides/40-load-data/01-load/index.md

Lines changed: 18 additions & 41 deletions

@@ -2,50 +2,27 @@
title: Loading from Files
---

-import DetailsWrap from '@site/src/components/DetailsWrap';
+Databend offers simple, powerful commands to load data files into tables. Most operations require just a single command. Your data must be in a [supported format](/sql/sql-reference/file-format-options).

-Databend provides a variety of tools and commands that can help you load your data files into a table. Most of them are straightforward, meaning you can load your data with just a single command. Please note that your data files must be in one of the formats supported by Databend. See [Input & Output File Formats](/sql/sql-reference/file-format-options) for a list of supported file formats. The following is an overview of the data loading and unloading flows and their respective methods. Please refer to the topics in this chapter for detailed instructions.
+![Data Loading and Unloading Overview](/img/load/load-unload.jpeg)

-![Alt text](/img/load/load-unload.jpeg)
+## Supported File Formats

-This topic does not cover all of the available data loading methods, but it provides recommendations based on the location where your data files are stored. To find the recommended method and a link to the corresponding details page, toggle the block below:
+| Format | Type | Description |
+|--------|------|-------------|
+| [**CSV**](/guides/load-data/load-semistructured/load-csv), [**TSV**](/guides/load-data/load-semistructured/load-tsv) | Delimited | Text files with customizable delimiters |
+| [**NDJSON**](/guides/load-data/load-semistructured/load-ndjson) | Semi-structured | JSON objects, one per line |
+| [**Parquet**](/guides/load-data/load-semistructured/load-parquet) | Semi-structured | Efficient columnar storage format |
+| [**ORC**](/guides/load-data/load-semistructured/load-orc) | Semi-structured | High-performance columnar format |
+| [**Avro**](/guides/load-data/load-semistructured/load-avro) | Semi-structured | Compact binary format with schema |

-<DetailsWrap>
+## Loading by File Location

-<details>
-<summary>I want to load staged data files ...</summary>
-<div>
-<div>If you have data files in an internal/external stage or the user stage, Databend recommends that you load them using the COPY INTO command. The COPY INTO command is a powerful tool that can load large amounts of data quickly and efficiently.</div>
-<br/>
-<div>To learn more about using the COPY INTO command to load data from a stage, check out the <a href="stage">Loading from Stage</a> page. This page includes detailed tutorials that show you how to use the command to load data from a sample file in an internal/external stage or the user stage.</div>
-</div>
-</details>
+Select the location of your files to find the recommended loading method:

-<details>
-<summary>I want to load data files in a bucket ...</summary>
-<div>
-<div>If you have data files in a bucket or container on your object storage, such as Amazon S3, Google Cloud Storage, and Microsoft Azure, Databend recommends that you load them using the COPY INTO command. The COPY INTO command is a powerful tool that can load large amounts of data quickly and efficiently.</div>
-<br/>
-<div>To learn more about using the COPY INTO command to load data from a bucket or container, check out the <a href="s3">Loading from Bucket</a> page. This page includes a tutorial that shows you how to use the command to load data from a sample file in an Amazon S3 Bucket.</div>
-</div>
-</details>
-
-<details>
-<summary>I want to load local data files ...</summary>
-<div>
-<div>If you have data files in your local system, Databend recommends that you load them using <a href="https://github.com/databendlabs/BendSQL">BendSQL</a>, the Databend native CLI tool, allowing you to establish a connection with Databend and execute queries directly from a CLI window.</div>
-<br/>
-<div>To learn more about using BendSQL to load your local data files, check out the <a href="local">Loading from Local File</a> page. This page includes tutorials that show you how to use the tool to load data from a local sample file.</div>
-</div>
-</details>
-
-<details>
-<summary>I want to load remote data files ...</summary>
-<div>
-<div>If you have remote data files, Databend recommends that you load them using the COPY INTO command. The COPY INTO command is a powerful tool that can load large amounts of data quickly and efficiently.</div>
-<br/>
-<div>To learn more about using the COPY INTO command to load remote data files, check out the <a href="http">Loading from Remote File</a> page. This page includes a tutorial that shows you how to use the command to load data from a remote sample file.</div>
-</div>
-</details>
-
-</DetailsWrap>
+| Data Source | Recommended Tool | Description | Documentation |
+|-------------|-----------------|-------------|---------------|
+| **Staged Data Files** | **COPY INTO** | Fast, efficient loading from internal/external stages or user stage | [Loading from Stage](stage) |
+| **Cloud Storage** | **COPY INTO** | Load from Amazon S3, Google Cloud Storage, Microsoft Azure | [Loading from Bucket](s3) |
+| **Local Files** | [**BendSQL**](https://github.com/databendlabs/BendSQL) | Databend's native CLI tool for local file loading | [Loading from Local File](local) |
+| **Remote Files** | **COPY INTO** | Load data from remote HTTP/HTTPS locations | [Loading from Remote File](http) |
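As a quick illustration of the staged-file flow recommended above, here is a minimal `COPY INTO` sketch (the `books` table and `my_stage` stage are hypothetical names, not part of this commit):

```sql
-- Hypothetical setup: a target table and a staged CSV file
CREATE TABLE books (
    title VARCHAR,
    author VARCHAR,
    year INT
);

-- Load books.csv from the stage, skipping the header row
COPY INTO books
FROM @my_stage/books.csv
FILE_FORMAT = (TYPE = CSV, SKIP_HEADER = 1);
```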

docs/en/guides/40-load-data/03-load-semistructured/00-load-parquet.md

Lines changed: 2 additions & 1 deletion

@@ -20,7 +20,8 @@ COPY INTO [<database>.]<table_name>
FILE_FORMAT = (TYPE = PARQUET)
```

-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more Parquet file format options, refer to [Parquet File Format Options](/sql/sql-reference/file-format-options#parquet-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).

## Tutorial: Loading Data from Parquet Files
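To illustrate the syntax shown in the hunk above, a hedged sketch of loading several Parquet files at once (stage and table names are hypothetical; `PATTERN` is the documented way to match multiple files):

```sql
-- Load every .parquet file found in the stage into one table
COPY INTO ontime
FROM @my_stage
PATTERN = '.*[.]parquet'
FILE_FORMAT = (TYPE = PARQUET);
```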

docs/en/guides/40-load-data/03-load-semistructured/01-load-csv.md

Lines changed: 2 additions & 1 deletion

@@ -31,7 +31,8 @@ FROM { userStage | internalStage | externalStage | externalLocation }
) ]
```

-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more CSV file format options, refer to [CSV File Format Options](/sql/sql-reference/file-format-options#csv-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).

## Tutorial: Loading Data from CSV Files
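A sketch combining the CSV format options that the new link documents (all names hypothetical; `FIELD_DELIMITER`, `RECORD_DELIMITER`, and `SKIP_HEADER` are standard CSV file format options):

```sql
-- Load a comma-delimited file with a single header row
COPY INTO books
FROM @my_stage/books.csv
FILE_FORMAT = (
    TYPE = CSV,
    FIELD_DELIMITER = ',',
    RECORD_DELIMITER = '\n',
    SKIP_HEADER = 1
);
```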

docs/en/guides/40-load-data/03-load-semistructured/02-load-tsv.md

Lines changed: 2 additions & 1 deletion

@@ -28,7 +28,8 @@ FROM { userStage | internalStage | externalStage | externalLocation }
) ]
```

-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more TSV file format options, refer to [TSV File Format Options](/sql/sql-reference/file-format-options#tsv-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).

## Tutorial: Loading Data from TSV Files

docs/en/guides/40-load-data/03-load-semistructured/03-load-ndjson.md

Lines changed: 2 additions & 1 deletion

@@ -28,7 +28,8 @@ FROM { userStage | internalStage | externalStage | externalLocation }
) ]
```

-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more NDJSON file format options, refer to [NDJSON File Format Options](/sql/sql-reference/file-format-options#ndjson-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).

## Tutorial: Loading Data from NDJSON Files

docs/en/guides/40-load-data/03-load-semistructured/04-load-orc.md

Lines changed: 2 additions & 1 deletion

@@ -18,7 +18,8 @@ COPY INTO [<database>.]<table_name>
FILE_FORMAT = (TYPE = ORC)
```

-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more ORC file format options, refer to [ORC File Format Options](/sql/sql-reference/file-format-options#orc-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).

## Tutorial: Loading Data from ORC Files

docs/en/guides/40-load-data/03-load-semistructured/05-load-avro.md

Lines changed: 2 additions & 1 deletion

@@ -18,7 +18,8 @@ COPY INTO [<database>.]<table_name>
FILE_FORMAT = (TYPE = AVRO)
```

-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more Avro file format options, refer to [Avro File Format Options](/sql/sql-reference/file-format-options#avro-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).

## Tutorial: Loading Avro Data into Databend from Remote HTTP URL

docs/en/guides/40-load-data/03-load-semistructured/index.md

Lines changed: 10 additions & 12 deletions

@@ -1,20 +1,18 @@
---
title: Loading Semi-structured Formats
---
-import IndexOverviewList from '@site/src/components/IndexOverviewList';

## What is Semi-structured Data?

-Semi-structured data is a form of data that does not conform to a rigid structure like traditional databases but still contains tags or markers to separate semantic elements and enforce hierarchies of records and fields.
+Semi-structured data contains tags or markers to separate semantic elements while not conforming to rigid database structures. Databend efficiently loads these formats using the `COPY INTO` command, with optional on-the-fly data transformation.

-Databend facilitates the efficient and user-friendly loading of semi-structured data. It supports various formats such as **Parquet**, **CSV**, **TSV**, and **NDJSON**.
+## Supported File Formats

-Additionally, Databend allows for on-the-fly transformation of data during the loading process.
-Copy from semi-structured data format is the most common way to load data into Databend, it is very efficient and easy to use.
-
-## Supported Formats
-
-Databend supports several semi-structured data formats loaded using the `COPY INTO` command:
-
-<IndexOverviewList />
+| File Format | Description | Guide |
+| ----------- | ----------- | ----- |
+| **Parquet** | Efficient columnar storage format | [Loading Parquet](load-parquet) |
+| **CSV** | Comma-separated values | [Loading CSV](load-csv) |
+| **TSV** | Tab-separated values | [Loading TSV](load-tsv) |
+| **NDJSON** | Newline-delimited JSON | [Loading NDJSON](load-ndjson) |
+| **ORC** | Optimized Row Columnar format | [Loading ORC](load-orc) |
+| **Avro** | Row-based format with schema definition | [Loading Avro](load-avro) |
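The rewritten intro mentions on-the-fly transformation during loading; a hedged sketch of what that can look like, assuming COPY-with-transformation is available for the format in question (all identifiers hypothetical):

```sql
-- Transform while loading: extract fields from each NDJSON row
-- and normalize the name before it lands in the table
COPY INTO user_names
FROM (
    SELECT $1:id, UPPER($1:name::STRING)
    FROM @my_stage/users.ndjson
)
FILE_FORMAT = (TYPE = NDJSON);
```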

docs/en/guides/40-load-data/04-transform/00-querying-parquet.md

Lines changed: 10 additions & 2 deletions

@@ -7,7 +7,7 @@ sidebar_label: Parquet

Syntax:
```sql
-SELECT [<alias>.]<column> [, <column> ...] | [<alias>.]$<col_position> [, $<col_position> ...]
+SELECT [<alias>.]<column> [, <column> ...]
FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
[(
[<connection_parameters>],
@@ -19,7 +19,15 @@ FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
```

:::info Tips
-Parquet has schema information, so we can query the columns `<column> [, <column> ...]` directly.
+**Query Return Content Explanation:**
+
+* **Return Format**: Column values in their native data types (not variants)
+* **Access Method**: Directly use column names `column_name`
+* **Example**: `SELECT id, name, age FROM @stage_name`
+* **Key Features**:
+  * No need for path expressions (like `$1:name`)
+  * No type casting required
+  * Parquet files contain embedded schema information
:::

## Tutorial
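A short usage sketch of the direct column access the new tips describe (stage and file names are hypothetical; `file_format =>` mirrors the stage-query syntax used elsewhere in these docs):

```sql
-- Parquet carries its own schema, so columns are addressed by name
SELECT id, name, age
FROM @my_stage/users.parquet
(file_format => 'parquet')
LIMIT 10;
```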

docs/en/guides/40-load-data/04-transform/01-querying-csv.md

Lines changed: 10 additions & 1 deletion

@@ -19,7 +19,16 @@ FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}

:::info Tips
-CSV doesn't have schema information, so we can only query the columns `$<col_position> [, $<col_position> ...]` by position.
+**Query Return Content Explanation:**
+
+* **Return Format**: Individual column values as strings by default
+* **Access Method**: Use positional references `$<col_position>` (e.g., `$1`, `$2`, `$3`)
+* **Example**: `SELECT $1, $2, $3 FROM @stage_name`
+* **Key Features**:
+  * Columns accessed by position, not by name
+  * Each `$<col_position>` refers to a single column, not the whole row
+  * Type casting required for non-string operations (e.g., `CAST($1 AS INT)`)
+  * No embedded schema information in CSV files
:::

## Tutorial
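A matching sketch of the positional access and casting the tips describe (hypothetical names):

```sql
-- CSV columns have no names: $1, $2, ... are positions,
-- and non-string use requires an explicit cast
SELECT $1 AS title, CAST($2 AS INT) AS year
FROM @my_stage/books.csv
(file_format => 'csv');
```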

docs/en/guides/40-load-data/04-transform/02-querying-tsv.md

Lines changed: 10 additions & 1 deletion

@@ -19,7 +19,16 @@ FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}

:::info Tips
-TSV doesn't have schema information, so we can only query the columns `$<col_position> [, $<col_position> ...]` by position.
+**Query Return Content Explanation:**
+
+* **Return Format**: Individual column values as strings by default
+* **Access Method**: Use positional references `$<col_position>` (e.g., `$1`, `$2`, `$3`)
+* **Example**: `SELECT $1, $2, $3 FROM @stage_name`
+* **Key Features**:
+  * Columns accessed by position, not by name
+  * Each `$<col_position>` refers to a single column, not the whole row
+  * Type casting required for non-string operations (e.g., `CAST($1 AS INT)`)
+  * No embedded schema information in TSV files
:::

## Tutorial

docs/en/guides/40-load-data/04-transform/03-querying-ndjson.md

Lines changed: 10 additions & 1 deletion

@@ -19,7 +19,16 @@ FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}

:::info Tips
-NDJSON is a variant for the whole row, the column is `$1:<column> [, $1:<column> ...]`.
+**Query Return Content Explanation:**
+
+* **Return Format**: Each row as a single variant object (referenced as `$1`)
+* **Access Method**: Use path expressions `$1:column_name`
+* **Example**: `SELECT $1:title, $1:author FROM @stage_name`
+* **Key Features**:
+  * Must use path notation to access specific fields
+  * Type casting required for type-specific operations (e.g., `CAST($1:id AS INT)`)
+  * Each NDJSON line is parsed as a complete JSON object
+  * Whole row is represented as a single variant object
:::

## Tutorial
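A sketch of the path-expression access the tips describe (hypothetical names):

```sql
-- Each NDJSON line is one variant row ($1); fields are reached
-- by path and cast when a concrete type is needed
SELECT $1:title AS title, CAST($1:year AS INT) AS year
FROM @my_stage/books.ndjson
(file_format => 'ndjson');
```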

docs/en/guides/40-load-data/04-transform/03-querying-orc.md

Lines changed: 6 additions & 2 deletions

@@ -8,7 +8,7 @@ import StepContent from '@site/src/components/Steps/step-content';

## Syntax

```sql
-SELECT [<alias>.]<column> [, <column> ...] | [<alias>.]$<col_position> [, $<col_position> ...]
+SELECT [<alias>.]<column> [, <column> ...]
FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
[(
[<connection_parameters>],
@@ -18,6 +18,10 @@ FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
)]
```

+:::info Tips
+ORC has schema information, so we can query the columns `<column> [, <column> ...]` directly.
+:::
+
## Tutorial

In this tutorial, we will walk you through the process of downloading the Iris dataset in ORC format, uploading it to an Amazon S3 bucket, creating an external stage, and querying the data directly from the ORC file.

@@ -75,7 +79,7 @@ You can also query the remote ORC file directly:
SELECT
  *
FROM
-  'https://github.com/tensorflow/io/raw/master/tests/test_orc/iris.orc' (file_format = > 'orc');
+  'https://github.com/tensorflow/io/raw/master/tests/test_orc/iris.orc' (file_format => 'orc');
```

</StepContent>

docs/en/guides/40-load-data/04-transform/04-querying-avro.md

Lines changed: 10 additions & 1 deletion

@@ -18,7 +18,16 @@ FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
```

:::info Tips
-Avro files can be queried directly as variants using `$1:<column>`.
+**Query Return Content Explanation:**
+
+* **Return Format**: Each row as a single variant object (referenced as `$1`)
+* **Access Method**: Use path expressions `$1:column_name`
+* **Example**: `SELECT $1:id, $1:name FROM @stage_name`
+* **Key Features**:
+  * Must use path notation to access specific fields
+  * Type casting required for type-specific operations (e.g., `CAST($1:id AS INT)`)
+  * Avro schema is mapped to variant structure
+  * Whole row is represented as a single variant object
:::

## Avro Querying Features Overview

docs/en/guides/40-load-data/04-transform/index.md

Lines changed: 20 additions & 18 deletions

@@ -3,7 +3,7 @@ title: Querying & Transforming
slug: querying-stage
---

-Databend supports querying and transforming data directly from staged files using the `SELECT` statement. This feature is available for user, internal, and external stages, as well as buckets in object storage (e.g., Amazon S3, Google Cloud Storage, Microsoft Azure) and remote servers via HTTPS. It's useful for inspecting staged file contents before or after data loading.
+Databend enables direct querying of staged files without loading data into tables first. Query files from any stage type (user, internal, external) or directly from object storage and HTTPS URLs. Ideal for data inspection, validation, and transformation before or after loading.

## Syntax

@@ -21,24 +21,26 @@ FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}

## Parameters Overview

-The `SELECT` statement for staged files supports various parameters to control data access and parsing. For detailed information and examples on each parameter, please refer to their respective documentation sections:
+Key parameters for controlling data access and parsing:

-- **`FILE_FORMAT`**: Specifies the format of the file (e.g., CSV, TSV, NDJSON, PARQUET, ORC, Avro, or custom formats).
-- **`PATTERN`**: Uses a regular expression to match and filter file names.
-- **`FILES`**: Explicitly lists specific file names to query.
-- **`CASE_SENSITIVE`**: Controls case sensitivity for column names in Parquet files.
-- **`table_alias`**: Assigns an alias to staged files for easier referencing in queries.
-- **`$col_position`**: Selects columns by their positional index (1-based).
-- **`connection_parameters`**: Provides connection details for external storage.
-- **`uri`**: Specifies the URI for remote files.
+| Parameter | Description |
+| --------- | ----------- |
+| `FILE_FORMAT` | File format type (CSV, TSV, NDJSON, PARQUET, ORC, Avro) |
+| `PATTERN` | Regex pattern to filter files |
+| `FILES` | Explicit list of files to query |
+| `CASE_SENSITIVE` | Column name case sensitivity (Parquet) |
+| `table_alias` | Alias for referencing staged files |
+| `$col_position` | Column selection by position (1-based) |
+| `connection_parameters` | External storage connection details |
+| `uri` | URI for remote files |

## Supported File Formats

-| File Format | Guide |
-| ----------- | -------------------------------------------------- |
-| Parquet | [Querying Parquet Files](./00-querying-parquet.md) |
-| CSV | [Querying CSV Files](./01-querying-csv.md) |
-| TSV | [Querying TSV Files](./02-querying-tsv.md) |
-| NDJSON | [Querying NDJSON Files](./03-querying-ndjson.md) |
-| ORC | [Querying ORC Files](./03-querying-orc.md) |
-| Avro | [Querying Avro Files](./04-querying-avro.md) |
+| File Format | Return Format | Access Method | Example | Guide |
+| ----------- | ------------- | ------------- | ------- | ----- |
+| Parquet | Native data types | Direct column names | `SELECT id, name FROM` | [Querying Parquet Files](./00-querying-parquet.md) |
+| ORC | Native data types | Direct column names | `SELECT id, name FROM` | [Querying ORC Files](./03-querying-orc.md) |
+| CSV | String values | Positional references `$<position>` | `SELECT $1, $2 FROM` | [Querying CSV Files](./01-querying-csv.md) |
+| TSV | String values | Positional references `$<position>` | `SELECT $1, $2 FROM` | [Querying TSV Files](./02-querying-tsv.md) |
+| NDJSON | Variant object | Path expressions `$1:<field>` | `SELECT $1:id, $1:name FROM` | [Querying NDJSON Files](./03-querying-ndjson.md) |
+| Avro | Variant object | Path expressions `$1:<field>` | `SELECT $1:id, $1:name FROM` | [Querying Avro Files](./04-querying-avro.md) |
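Tying the parameter table to the format table, a hedged sketch that filters staged files with `PATTERN` while reading by position (hypothetical names):

```sql
-- Scan only the .csv files in the stage, reading two columns by position
SELECT $1, $2
FROM @my_stage
(file_format => 'csv', pattern => '.*[.]csv');
```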
