
Commit c1483bc

added (#1838)
1 parent 60e5438 commit c1483bc

File tree

9 files changed: +130 −19 lines changed


docs/en/guides/40-load-data/03-load-semistructured/00-load-parquet.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ---
-title: Loading Parquet File into Databend
+title: Loading Parquet into Databend
 sidebar_label: Parquet
 ---

docs/en/guides/40-load-data/03-load-semistructured/01-load-csv.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ---
-title: Loading CSV File into Databend
+title: Loading CSV into Databend
 sidebar_label: CSV
 ---

docs/en/guides/40-load-data/03-load-semistructured/02-load-tsv.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ---
-title: Loading TSV File into Databend
+title: Loading TSV into Databend
 sidebar_label: TSV
 ---

docs/en/guides/40-load-data/03-load-semistructured/03-load-ndjson.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ---
-title: Loading NDJSON File into Databend
+title: Loading NDJSON into Databend
 sidebar_label: NDJSON
 ---

docs/en/guides/40-load-data/03-load-semistructured/04-load-orc.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ---
-title: Loading ORC File into Databend
+title: Loading ORC into Databend
 sidebar_label: ORC
 ---

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
---
title: Loading Avro into Databend
sidebar_label: Avro
---

## What is Avro?

[Apache Avro™](https://avro.apache.org/) is the leading serialization format for record data and the first choice for streaming data pipelines.
## Loading Avro Files

The common syntax for loading Avro files is as follows:

```sql
COPY INTO [<database>.]<table_name>
FROM { internalStage | externalStage | externalLocation }
[ PATTERN = '<regex_pattern>' ]
FILE_FORMAT = (TYPE = AVRO)
```

More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
## Tutorial: Loading Avro Data into Databend from Remote HTTP URL

In this tutorial, you will create a table in Databend using an Avro schema and load Avro data directly from a GitHub-hosted `.avro` file via HTTPS.

### Step 1: Review the Avro Schema

Before creating a table in Databend, let's take a quick look at the Avro schema we're working with: [userdata.avsc](https://github.com/Teradata/kylo/blob/master/samples/sample-data/avro/userdata.avsc). This schema defines a record named `User` with 13 fields, mostly of type `string`, along with `int` and `float`.
```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "registration_dttm", "type": "string"},
    {"name": "id", "type": "int"},
    {"name": "first_name", "type": "string"},
    {"name": "last_name", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "gender", "type": "string"},
    {"name": "ip_address", "type": "string"},
    {"name": "cc", "type": "string"},
    {"name": "country", "type": "string"},
    {"name": "birthdate", "type": "string"},
    {"name": "salary", "type": "float"},
    {"name": "title", "type": "string"},
    {"name": "comments", "type": "string"}
  ]
}
```
### Step 2: Create a Table in Databend

Create a table that matches the structure defined in the schema:

```sql
CREATE TABLE userdata (
    registration_dttm STRING,
    id INT,
    first_name STRING,
    last_name STRING,
    email STRING,
    gender STRING,
    ip_address STRING,
    cc VARIANT,
    country STRING,
    birthdate STRING,
    salary FLOAT,
    title STRING,
    comments STRING
);
```
### Step 3: Load Data from a Remote HTTPS URL

```sql
COPY INTO userdata
FROM 'https://raw.githubusercontent.com/Teradata/kylo/master/samples/sample-data/avro/userdata1.avro'
FILE_FORMAT = (type = avro);
```
```sql
┌──────────────────────────────────────────────────────────────┬─────────────┬─────────────┬─────────────┬──────────────────┐
│                             File                             │ Rows_loaded │ Errors_seen │ First_error │ First_error_line │
├──────────────────────────────────────────────────────────────┼─────────────┼─────────────┼─────────────┼──────────────────┤
│ Teradata/kylo/master/samples/sample-data/avro/userdata1.avro │        1000 │           0 │ NULL        │             NULL │
└──────────────────────────────────────────────────────────────┴─────────────┴─────────────┴─────────────┴──────────────────┘
```
### Step 4: Query the Data

You can now explore the data you just imported:

```sql
SELECT id, first_name, email, salary FROM userdata LIMIT 5;
```
```sql
┌─────┬────────────┬──────────────────────────┬───────────┐
│ id  │ first_name │ email                    │ salary    │
├─────┼────────────┼──────────────────────────┼───────────┤
│   1 │ Amanda     │ ajordan0@com.com         │  49756.53 │
│   2 │ Albert     │ afreeman1@is.gd          │ 150280.17 │
│   3 │ Evelyn     │ emorgan2@altervista.org  │ 144972.52 │
│   4 │ Denise     │ driley3@gmpg.org         │  90263.05 │
│   5 │ Carlos     │ cburns4@miitbeian.gov.cn │      NULL │
└─────┴────────────┴──────────────────────────┴───────────┘
```
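As a quick sanity check after loading, the row count should match the `Rows_loaded` value reported by `COPY INTO` (a sketch; `userdata` is the table created in Step 2 above):

```sql
-- Should return 1000, matching Rows_loaded from the COPY INTO result
SELECT COUNT(*) FROM userdata;
```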

docs/en/guides/40-load-data/03-load-semistructured/index.md

Lines changed: 0 additions & 10 deletions
@@ -17,14 +17,4 @@ Copy from semi-structured data format is the most common way to load data into D
 
 Databend supports several semi-structured data formats loaded using the `COPY INTO` command:
 
-- **Parquet**: A columnar storage format, ideal for optimizing data storage and retrieval. It is best suited for complex data structures and offers efficient data compression and encoding schemes.
-
-- **CSV (Comma-Separated Values)**: A simple format that is widely used for data exchange. CSV files are easy to read and write but might not be ideal for complex hierarchical data structures.
-
-- **TSV (Tab-Separated Values)**: Similar to CSV, but uses tabs as delimiters. It's often used for data with simple structures that require a delimiter other than a comma.
-
-- **NDJSON (Newline Delimited JSON)**: This format represents JSON data with each JSON object separated by a newline. It is particularly useful for streaming large datasets and handling data that changes frequently. NDJSON facilitates the processing of large volumes of data by breaking it down into manageable, line-delimited chunks.
-
-
-For detailed instructions on how to load semi-structured data, check out the following topics:
 <IndexOverviewList />

docs/en/sql-reference/00-sql-reference/50-file-format-options.md

Lines changed: 14 additions & 2 deletions
@@ -13,13 +13,13 @@ To specify a file format in a statement, use the following syntax:
 
 ```sql
 -- Specify a standard file format
-... FILE_FORMAT = ( TYPE = { CSV | TSV | NDJSON | PARQUET | ORC } [ formatTypeOptions ] )
+... FILE_FORMAT = ( TYPE = { CSV | TSV | NDJSON | PARQUET | ORC | AVRO } [ formatTypeOptions ] )
 
 -- Specify a custom file format
 ... FILE_FORMAT = ( FORMAT_NAME = '<your-custom-format>' )
 ```
 
-- Databend currently supports ORC as a source ONLY. Unloading data into an ORC file is not supported yet.
+- Databend currently supports ORC and AVRO as a source ONLY. Unloading data into an ORC or AVRO file is not supported yet.
 - If you don't specify the FILE_FORMAT when performing a COPY INTO or SELECT operation from a stage, Databend will use the file format that you initially defined for the stage when you created it. In cases where you didn't explicitly specify a file format during the stage creation, Databend defaults to using the PARQUET format. If you specify a different FILE_FORMAT from the one you defined when creating the stage, Databend will prioritize the FILE_FORMAT specified during the operation.
 - For managing custom file formats in Databend, see [File Format](../10-sql-commands/00-ddl/13-file-format/index.md).
 
@@ -251,3 +251,15 @@ Determines the behavior when encountering missing fields during data loading. Refer to the options in the table below for possible configurations.
 |------------------|-----------------------------------------------------------------------------------------------|
 | `ERROR` (Default)| Generates an error if a missing field is encountered. |
 | `FIELD_DEFAULT` | Uses the default value of the field for missing fields. |
+
+## AVRO Options
+
+### MISSING_FIELD_AS (Load Only)
+
+Determines the behavior when encountering missing fields during data loading. Refer to the options in the table below for possible configurations.
+
+| Available Values | Description |
+|------------------|-------------|
+| `ERROR` (Default)| Generates an error if a missing field is encountered. |
+| `FIELD_DEFAULT` | Uses the default value of the field for missing fields. |
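To illustrate how this option might be used in practice (a sketch; `my_table` and `@my_stage` are hypothetical names, not part of the change above):

```sql
-- Hypothetical example: fields absent from the Avro records are filled
-- with the corresponding column defaults instead of raising an error
COPY INTO my_table
FROM @my_stage
FILE_FORMAT = (TYPE = AVRO, MISSING_FIELD_AS = FIELD_DEFAULT);
```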

docs/en/sql-reference/10-sql-commands/10-dml/dml-copy-into-table.md

Lines changed: 2 additions & 2 deletions
@@ -5,7 +5,7 @@ sidebar_label: "COPY INTO <table>"
 
 import FunctionDescription from '@site/src/components/FunctionDescription';
 
-<FunctionDescription description="Introduced or updated: v1.2.666"/>
+<FunctionDescription description="Introduced or updated: v1.2.704"/>
 
 COPY INTO allows you to load data from files located in one of the following locations:
 
@@ -27,7 +27,7 @@ COPY INTO [<database_name>.]<table_name>
 [ PATTERN = '<regex_pattern>' ]
 [ FILE_FORMAT = (
   FORMAT_NAME = '<your-custom-format>'
-  | TYPE = { CSV | TSV | NDJSON | PARQUET | ORC } [ formatTypeOptions ]
+  | TYPE = { CSV | TSV | NDJSON | PARQUET | ORC | AVRO } [ formatTypeOptions ]
 ) ]
 [ copyOptions ]
 ```
