Commit 2eaaf15

Merge pull request #201 from marklogic/feature/json-gzip
Added docs for compression with Spark data sources
2 parents: 37c51aa + 0ddda35

8 files changed, +55 -3 lines changed


docs/common-options.md

Lines changed: 3 additions & 3 deletions

@@ -32,10 +32,10 @@ are followed by a list of options common to every Flux command.
 You can specify a command name without entering its full name, as long as you enter a sufficient number of characters
 such that Flux can uniquely identify the command name.
 
-For example, instead of entering `import-aggregate-xml-files`, you can enter `import-ag` as it is the only command in
-Flux with that sequence of letters:
+For example, instead of entering `import-parquet-files`, you can enter `import-p` as it is the only command in
+Flux beginning with that sequence of letters:
 
-    ./bin/flux import-ag --path path/to/data etc...
+    ./bin/flux import-p --path path/to/data etc...
 
 If Flux cannot uniquely identify the command name, it will print an error and list the command names that match what
 you entered.
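As a sketch of the ambiguous case: `import-a` is a prefix of several commands, including `import-avro-files` and `import-aggregate-json-files`, so an invocation like the one below should trigger that error rather than run a command:

```
./bin/flux import-a --path path/to/data etc...
```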

docs/import/import-files/avro.md

Lines changed: 6 additions & 0 deletions

@@ -86,6 +86,12 @@ it may be important to query for documents that have a particular field with a v
 The `import-avro-files` command supports aggregating related rows together to produce hierarchical documents. See
 [Aggregating rows](../aggregating-rows.md) for more information.
 
+## Reading compressed files
+
+Flux will automatically read files compressed with GZIP when they have a filename ending in `.gz`; you do not need to
+specify a compression option. As noted in the "Advanced options" section below, you can use `-Pcompression=` to
+explicitly specify a compression algorithm if Flux is not able to read your compressed files automatically.
+
 ## Advanced options
 
 The `import-avro-files` command reuses Spark's support for reading Avro files. You can include any of
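Where a compressed Avro file's name does not end in `.gz`, the option mentioned in the added docs can be passed explicitly. A minimal sketch, with a hypothetical path and the GZIP codec these docs discuss:

```
./bin/flux import-avro-files --path path/to/compressed-avro-data -Pcompression=gzip etc...
```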

docs/import/import-files/delimited-text.md

Lines changed: 6 additions & 0 deletions

@@ -106,6 +106,12 @@ the content can be correctly translated to UTF-8 when written to MarkLogic - e.g
 The `import-delimited-files` command supports aggregating related rows together to produce hierarchical documents. See
 [Aggregating rows](../aggregating-rows.md) for more information.
 
+## Reading compressed files
+
+Flux will automatically read files compressed with GZIP when they have a filename ending in `.gz`; you do not need to
+specify a compression option. As noted in the "Advanced options" section below, you can use `-Pcompression=` to
+explicitly specify a compression algorithm if Flux is not able to read your compressed files automatically.
+
 ## Advanced options
 
 The `import-delimited-files` command reuses Spark's support for reading delimited text data. You can include any of
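The same explicit form applies here; a sketch with a hypothetical path:

```
./bin/flux import-delimited-files --path path/to/compressed-delimited-data -Pcompression=gzip etc...
```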

docs/import/import-files/json.md

Lines changed: 6 additions & 0 deletions

@@ -83,6 +83,12 @@ the content can be correctly translated to UTF-8 when written to MarkLogic:
 etc...
 ```
 
+## Reading compressed files
+
+Flux will automatically read files compressed with GZIP when they have a filename ending in `.gz`; you do not need to
+specify a compression option. As noted in the "Advanced options" section below, you can use `-Pcompression=` to
+explicitly specify a compression algorithm if Flux is not able to read your compressed files automatically.
+
 ## Advanced options
 
 The `import-aggregate-json-files` command reuses Spark's support for reading JSON files. You can include any of
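The new test at the end of this diff drives this behavior through the CLI; an equivalent hand-run invocation for a gzipped JSON Lines file might look like the sketch below (hypothetical path, connection options elided):

```
./bin/flux import-aggregate-json-files --path path/to/line-delimited-json.txt.gz --json-lines etc...
```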

docs/import/import-files/orc.md

Lines changed: 6 additions & 0 deletions

@@ -86,6 +86,12 @@ it may be important to query for documents that have a particular field with a v
 The `import-orc-files` command supports aggregating related rows together to produce hierarchical documents. See
 [Aggregating rows](../aggregating-rows.md) for more information.
 
+## Reading compressed files
+
+Flux will automatically read files compressed with GZIP when they have a filename ending in `.gz`; you do not need to
+specify a compression option. As noted in the "Advanced options" section below, you can use `-Pcompression=` to
+explicitly specify a compression algorithm if Flux is not able to read your compressed files automatically.
+
 ## Advanced options
 
 The `import-orc-files` command reuses Spark's support for reading ORC files. You can include any of
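As with the other commands, the codec can be forced when the extension is missing; a sketch with a hypothetical path:

```
./bin/flux import-orc-files --path path/to/compressed-orc-data -Pcompression=gzip etc...
```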

docs/import/import-files/parquet.md

Lines changed: 6 additions & 0 deletions

@@ -86,6 +86,12 @@ it may be important to query for documents that have a particular field with a v
 The `import-parquet-files` command supports aggregating related rows together to produce hierarchical documents. See
 [Aggregating rows](../aggregating-rows.md) for more information.
 
+## Reading compressed files
+
+Flux will automatically read files compressed with GZIP when they have a filename ending in `.gz`; you do not need to
+specify a compression option. As noted in the "Advanced options" section below, you can use `-Pcompression=` to
+explicitly specify a compression algorithm if Flux is not able to read your compressed files automatically.
+
 ## Advanced options
 
 The `import-parquet-files` command reuses Spark's support for reading Parquet files. You can include any of
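And the parallel sketch for Parquet, again with a hypothetical path:

```
./bin/flux import-parquet-files --path path/to/compressed-parquet-data -Pcompression=gzip etc...
```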

flux-cli/src/test/java/com/marklogic/flux/impl/importdata/ImportAggregateJsonFilesTest.java

Lines changed: 22 additions & 0 deletions

@@ -124,6 +124,28 @@ void jsonLines() {
     verifyDoc("/delimited/lastName-3.json", "firstName-3", "lastName-3");
 }
 
+@Test
+void gzippedJsonLines() {
+    run(
+        "import-aggregate-json-files",
+        "--path", "src/test/resources/delimited-files/line-delimited-json.txt.gz",
+        "--json-lines",
+        "--connection-string", makeConnectionString(),
+        "--permissions", DEFAULT_PERMISSIONS,
+        "--collections", "delimited-json-test",
+        "--uri-template", "/delimited/{lastName}.json"
+    );
+
+    assertCollectionSize(
+        "Spark data sources will automatically handle .gz files without -Pcompression=gzip being specified.",
+        "delimited-json-test", 3
+    );
+    verifyDoc("/delimited/lastName-1.json", "firstName-1", "lastName-1");
+    verifyDoc("/delimited/lastName-2.json", "firstName-2", "lastName-2");
+    verifyDoc("/delimited/lastName-3.json", "firstName-3", "lastName-3");
+}
+
+
 @Test
 void jsonRootName() {
     run(
Binary file not shown.
