Commit bbac82c

docs: orc file format (#852)

1 parent 3bfb606 commit bbac82c

3 files changed: +128 −4 lines changed
Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
---
title: Querying ORC Files in Stage
sidebar_label: Querying ORC Files
---
import StepsWrap from '@site/src/components/StepsWrap';
import StepContent from '@site/src/components/Steps/step-content';

## Syntax

```sql
SELECT [<alias>.]<column> [, <column> ...] | [<alias>.]$<col_position> [, $<col_position> ...]
FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
[(
    [<connection_parameters>],
    [ PATTERN => '<regex_pattern>'],
    [ FILE_FORMAT => 'ORC | <custom_format_name>'],
    [ FILES => ( '<file_name>' [ , '<file_name>' ] [ , ... ] ) ]
)]
```

## Tutorial

In this tutorial, we will walk you through downloading the Iris dataset in ORC format, uploading it to an Amazon S3 bucket, creating an external stage, and querying the data directly from the ORC file.

<StepsWrap>
<StepContent number="1">

### Download Iris Dataset

Download the Iris dataset from https://github.com/tensorflow/io/raw/master/tests/test_orc/iris.orc, then upload it to your Amazon S3 bucket.

The Iris dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. It has 4 attributes: (1) sepal length, (2) sepal width, (3) petal length, and (4) petal width; the last column contains the class label.

</StepContent>
<StepContent number="2">

### Create External Stage

Create an external stage pointing to the Amazon S3 bucket where your Iris dataset file is stored.

```sql
CREATE STAGE orc_query_stage
URL = 's3://databend-doc'
CONNECTION = (
    AWS_KEY_ID = '<your-key-id>',
    AWS_SECRET_KEY = '<your-secret-key>'
);
```
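
To verify that the stage can see the uploaded file, you can list its contents (an optional check; this assumes the file keeps its original name, `iris.orc`):

```sql
-- List files in the stage; iris.orc should appear in the output.
LIST @orc_query_stage;
```
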
</StepContent>
<StepContent number="3">

### Query ORC File

```sql
SELECT *
FROM @orc_query_stage
(
    FILE_FORMAT => 'orc',
    PATTERN => '.*[.]orc'
);

┌──────────────────────────────────────────────────────────────────────┐
│ sepal_length │ sepal_width │ petal_length │ petal_width │ species    │
├──────────────┼─────────────┼──────────────┼─────────────┼────────────┤
│          5.1 │         3.5 │          1.4 │         0.2 │ setosa     │
│          4.9 │           3 │          1.4 │         0.2 │ setosa     │
│          4.7 │         3.2 │          1.3 │         0.2 │ setosa     │
│          4.6 │         3.1 │          1.5 │         0.2 │ setosa     │
│            5 │         3.6 │          1.4 │         0.2 │ setosa     │
│          5.4 │         3.9 │          1.7 │         0.4 │ setosa     │
│          4.6 │         3.4 │          1.4 │         0.3 │ setosa     │
│            5 │         3.4 │          1.5 │         0.2 │ setosa     │
│          4.4 │         2.9 │          1.4 │         0.2 │ setosa     │
│          4.9 │         3.1 │          1.5 │         0.1 │ setosa     │
│          5.4 │         3.7 │          1.5 │         0.2 │ setosa     │
│          4.8 │         3.4 │          1.6 │         0.2 │ setosa     │
│          4.8 │           3 │          1.4 │         0.1 │ setosa     │
│          4.3 │           3 │          1.1 │         0.1 │ setosa     │
│          5.8 │           4 │          1.2 │         0.2 │ setosa     │
│          5.7 │         4.4 │          1.5 │         0.4 │ setosa     │
│          5.4 │         3.9 │          1.3 │         0.4 │ setosa     │
│          5.1 │         3.5 │          1.4 │         0.3 │ setosa     │
│          5.7 │         3.8 │          1.7 │         0.3 │ setosa     │
│          5.1 │         3.8 │          1.5 │         0.3 │ setosa     │
│            · │           · │            · │           · │ ·          │
│            · │           · │            · │           · │ ·          │
│            · │           · │            · │           · │ ·          │
│          7.4 │         2.8 │          6.1 │         1.9 │ virginica  │
│          7.9 │         3.8 │          6.4 │           2 │ virginica  │
│          6.4 │         2.8 │          5.6 │         2.2 │ virginica  │
│          6.3 │         2.8 │          5.1 │         1.5 │ virginica  │
│          6.1 │         2.6 │          5.6 │         1.4 │ virginica  │
│          7.7 │           3 │          6.1 │         2.3 │ virginica  │
│          6.3 │         3.4 │          5.6 │         2.4 │ virginica  │
│          6.4 │         3.1 │          5.5 │         1.8 │ virginica  │
│            6 │           3 │          4.8 │         1.8 │ virginica  │
│          6.9 │         3.1 │          5.4 │         2.1 │ virginica  │
│          6.7 │         3.1 │          5.6 │         2.4 │ virginica  │
│          6.9 │         3.1 │          5.1 │         2.3 │ virginica  │
│          5.8 │         2.7 │          5.1 │         1.9 │ virginica  │
│          6.8 │         3.2 │          5.9 │         2.3 │ virginica  │
│          6.7 │         3.3 │          5.7 │         2.5 │ virginica  │
│          6.7 │           3 │          5.2 │         2.3 │ virginica  │
│          6.3 │         2.5 │            5 │         1.9 │ virginica  │
│          6.5 │           3 │          5.2 │           2 │ virginica  │
│          6.2 │         3.4 │          5.4 │         2.3 │ virginica  │
│          5.9 │           3 │          5.1 │         1.8 │ virginica  │
│ 150 rows     │             │              │             │            │
│ (40 shown)   │             │              │             │            │
└──────────────────────────────────────────────────────────────────────┘
```
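
Because the ORC file carries its own schema, you can also reference columns by name. For example, to count the rows in each class (a minimal sketch reusing the stage from the previous step):

```sql
-- Count how many rows belong to each iris class.
SELECT species, COUNT(*) AS row_count
FROM @orc_query_stage
(
    FILE_FORMAT => 'orc',
    PATTERN => '.*[.]orc'
)
GROUP BY species;
```
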
You can also query the remote ORC file directly:

```sql
SELECT *
FROM 'https://github.com/tensorflow/io/raw/master/tests/test_orc/iris.orc' (FILE_FORMAT => 'orc');
```
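
If the stage path holds more than one ORC file, the `FILES` parameter shown in the syntax section selects specific files instead of matching a pattern (a sketch, assuming the file is named `iris.orc`):

```sql
SELECT *
FROM @orc_query_stage
(
    FILE_FORMAT => 'orc',
    FILES => ('iris.orc')
);
```
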
</StepContent>
</StepsWrap>

docs/en/guides/40-load-data/04-transform/index.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
 [(
   [<connection_parameters>],
   [ PATTERN => '<regex_pattern>'],
-  [ FILE_FORMAT => 'CSV | TSV | NDJSON | PARQUET | <custom_format_name>'],
+  [ FILE_FORMAT => 'CSV | TSV | NDJSON | PARQUET | ORC | <custom_format_name>'],
   [ FILES => ( '<file_name>' [ , '<file_name>' ... ])]
 )]
 ```

docs/en/sql-reference/00-sql-reference/50-file-format-options.md

Lines changed: 3 additions & 3 deletions
@@ -3,7 +3,7 @@ title: Input & Output File Formats
 ---
 import FunctionDescription from '@site/src/components/FunctionDescription';

-<FunctionDescription description="Introduced or updated: v1.2.216"/>
+<FunctionDescription description="Introduced or updated: v1.2.530"/>

 Databend accepts a variety of file formats both as a source and as a target for data loading or unloading. This page explains the supported file formats and their available options.

@@ -13,13 +13,13 @@ To specify a file format in a statement, use the following syntax:

 ```sql
 -- Specify a standard file format
-... FILE_FORMAT = ( TYPE = { CSV | TSV | NDJSON | PARQUET | XML } [ formatTypeOptions ] )
+... FILE_FORMAT = ( TYPE = { CSV | TSV | NDJSON | PARQUET | ORC } [ formatTypeOptions ] )

 -- Specify a custom file format
 ... FILE_FORMAT = ( FORMAT_NAME = '<your-custom-format>' )
 ```

-- Databend currently supports XML as a source ONLY. Unloading data into an XML file is not supported yet.
+- Databend currently supports ORC as a source ONLY. Unloading data into an ORC file is not supported yet.
 - If you don't specify the FILE_FORMAT when performing a COPY INTO or SELECT operation from a stage, Databend will use the file format that you initially defined for the stage when you created it. In cases where you didn't explicitly specify a file format during the stage creation, Databend defaults to using the PARQUET format. If you specify a different FILE_FORMAT from the one you defined when creating the stage, Databend will prioritize the FILE_FORMAT specified during the operation.
 - For managing custom file formats in Databend, see [File Format](../10-sql-commands/00-ddl/13-file-format/index.md).

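The precedence described in the second bullet above can be illustrated with a short example (a sketch; the stage name is hypothetical):

```sql
-- Stage created with CSV as its default file format.
CREATE STAGE my_csv_stage FILE_FORMAT = (TYPE = CSV);

-- With no FILE_FORMAT given, reads from this stage use CSV (the stage's format).
SELECT * FROM @my_csv_stage;

-- An explicit FILE_FORMAT in the operation overrides the stage's format.
SELECT * FROM @my_csv_stage (FILE_FORMAT => 'PARQUET');
```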