
Commit c38fb0e

Added docs for --spark-master-url and -C
1 parent ec70e8e commit c38fb0e

9 files changed: +67 -10 lines changed

docs/common-options.md

Lines changed: 26 additions & 0 deletions
@@ -254,3 +254,29 @@ potentially being helpful. The initial error message displayed by Flux is intend
 Flux uses a [Log4J2 properties file](https://logging.apache.org/log4j/2.x/manual/configuration.html#Properties) to
 configure its logging. The file is located in a Flux installation at `./conf/log4j2.properties`. You are free to
 customize this file to meet your needs for logging.
+
+## Advanced Spark options
+
+Flux is built on top of [Apache Spark](https://spark.apache.org/) and provides a number of command line options for
+configuring the underlying Spark runtime environment used by Flux.
+
+### Configuring a Spark URL
+
+By default, Flux creates a Spark session with a master URL of `local[*]`. You can change this via the
+`--spark-master-url` option; please see
+[the Spark documentation](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls) for examples
+of valid values. If you are looking to run a Flux command on a remote Spark cluster, please instead see the
+[Spark Integration guide](spark-integration.md) for details on integrating Flux with `spark-submit`.
+
+### Configuring the Spark runtime
+
+Some Flux commands reuse [Spark data sources](https://spark.apache.org/docs/latest/sql-data-sources.html) that
+accept configuration items via the Spark runtime. You can provide these configuration items via the `-C` option.
+For example, the [Spark Avro data source](https://spark.apache.org/docs/latest/sql-data-sources-avro.html#configuration)
+identifies several configuration items, such as `spark.sql.avro.compression.codec`. You can set this value by
+including `-Cspark.sql.avro.compression.codec=snappy` as a command line option.
+
+Note that the majority of [Spark cluster configuration properties](https://spark.apache.org/docs/latest/configuration.html)
+cannot be set via the `-C` option as those options must be set before a Spark session is created. For further control
+over the Spark session, please see the [Spark Integration guide](spark-integration.md) for details on integrating Flux
+with `spark-submit`.
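As a minimal, Spark-only sketch of the distinction drawn in the docs above (an illustration, not Flux's implementation; the codec value is the example used in the docs), the master URL that `--spark-master-url` controls is fixed when the session is built, while the kind of configuration item that `-C` targets can be set on an already-running session via `spark.conf().set()`:

```java
import org.apache.spark.sql.SparkSession;

public class SparkConfigSketch {
    public static void main(String[] args) {
        // The master URL (what --spark-master-url controls) and anything that must exist
        // before the session does are supplied at construction time or via spark-submit.
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .getOrCreate();

        // Runtime configuration items, the kind of value -C targets, can be set on an
        // existing session. This mirrors -Cspark.sql.avro.compression.codec=snappy.
        spark.conf().set("spark.sql.avro.compression.codec", "snappy");
        System.out.println(spark.conf().get("spark.sql.avro.compression.codec"));

        spark.stop();
    }
}
```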

docs/export/export-rows.md

Lines changed: 12 additions & 3 deletions
@@ -76,9 +76,12 @@ Rows selected via an Optic query can be exported to any of the below file format
 
 The `export-avro-files` command writes one or more Avro files to the directory specified by the `--path` option. This
 command reuses Spark's support for writing Avro files. You can include any of the
-[Spark Avro options](https://spark.apache.org/docs/latest/sql-data-sources-avro.html) via the `-P` option to
+[Spark Avro data source options](https://spark.apache.org/docs/latest/sql-data-sources-avro.html) via the `-P` option to
 control how Avro content is written. These options are expressed as `-PoptionName=optionValue`.
 
+For configuration options listed in the above Spark Avro guide, use the `-C` option instead. For example,
+`-Cspark.sql.avro.compression.codec=deflate` would change the type of compression used for writing Avro files.
+
 ### Delimited text
 
 The `export-delimited-files` command writes one or more delimited text (commonly CSV) files to the directory
@@ -125,16 +128,22 @@ By default, each file will be written using the UTF-8 encoding. You can specify
 
 The `export-orc-files` command writes one or more ORC files to the directory specified by the `--path` option. This
 command reuses Spark's support for writing ORC files. You can include any of the
-[Spark ORC options](https://spark.apache.org/docs/latest/sql-data-sources-orc.html) via the `-P` option to
+[Spark ORC data source options](https://spark.apache.org/docs/latest/sql-data-sources-orc.html) via the `-P` option to
 control how ORC content is written. These options are expressed as `-PoptionName=optionValue`.
 
+For configuration options listed in the above Spark ORC guide, use the `-C` option instead. For example,
+`-Cspark.sql.orc.impl=hive` would change the ORC implementation that is used.
+
 ### Parquet
 
 The `export-parquet-files` command writes one or more Parquet files to the directory specified by the `--path` option. This
 command reuses Spark's support for writing Parquet files. You can include any of the
-[Spark Parquet options](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html) via the `-P` option to
+[Spark Parquet data source options](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html) via the `-P` option to
 control how Parquet content is written. These options are expressed as `-PoptionName=optionValue`.
 
+For configuration options listed in the above Spark Parquet guide, use the `-C` option instead. For example,
+`-Cspark.sql.parquet.compression.codec=gzip` would change the compression codec used for writing Parquet files.
+
 ## Controlling the save mode
 
 Each of the commands for exporting rows to files supports a `--mode` option that controls how data is written to a
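To illustrate the `-P` versus `-C` distinction used throughout this file, here is a hedged, Spark-only sketch of an Avro write: `-P` values correspond to per-write data source options, while `-C` values correspond to session-level runtime configuration. The `recordName` option and the output path are hypothetical examples, the external `spark-avro` package is assumed to be on the classpath, and this is not Flux code.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AvroWriteSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

        // -C equivalent: a runtime configuration item on the session, mirroring
        // -Cspark.sql.avro.compression.codec=deflate from the docs above.
        spark.conf().set("spark.sql.avro.compression.codec", "deflate");

        Dataset<Row> rows = spark.range(3).toDF();

        // -P equivalent: a data source option passed to the individual write
        // (recordName is just an illustrative Avro write option).
        rows.write()
            .format("avro")
            .option("recordName", "exampleRecord")
            .save("/tmp/avro-export-example");

        spark.stop();
    }
}
```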

docs/import/import-files/avro.md

Lines changed: 5 additions & 1 deletion
@@ -95,5 +95,9 @@ explicitly specify a compression algorithm if Flux is not able to read your comp
 ## Advanced options
 
 The `import-avro-files` command reuses Spark's support for reading Avro files. You can include any of
-the [Spark Avro options](https://spark.apache.org/docs/latest/sql-data-sources-avro.html) via the `-P` option
+the [Spark Avro data source options](https://spark.apache.org/docs/latest/sql-data-sources-avro.html) via the `-P` option
 to control how Avro content is read. These options are expressed as `-PoptionName=optionValue`.
+
+For the configuration options listed in the above Spark Avro guide, use the `-C` option instead. For example,
+`-Cspark.sql.avro.filterPushdown.enabled=false` would configure Spark Avro to not push down filters.
+
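As a companion to the option described above, the sketch below shows, in plain Spark terms, where a value like `-Cspark.sql.avro.filterPushdown.enabled=false` lands: it becomes a runtime configuration setting consulted when the Avro data source plans a filtered read. The input path and filter column are hypothetical, the external `spark-avro` package is assumed to be on the classpath, and this is an illustration rather than Flux's implementation.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AvroReadSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

        // Equivalent of -Cspark.sql.avro.filterPushdown.enabled=false: disable filter
        // pushdown for the Avro data source before reading.
        spark.conf().set("spark.sql.avro.filterPushdown.enabled", "false");

        // A filtered read; with pushdown disabled, the filter is applied after rows are
        // read rather than pushed into the data source (the column name is illustrative).
        Dataset<Row> rows = spark.read()
            .format("avro")
            .load("/tmp/avro-input-example")
            .filter("id > 10");

        rows.show();
        spark.stop();
    }
}
```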

docs/import/import-files/orc.md

Lines changed: 4 additions & 2 deletions
@@ -95,6 +95,8 @@ explicitly specify a compression algorithm if Flux is not able to read your comp
 ## Advanced options
 
 The `import-orc-files` command reuses Spark's support for reading ORC files. You can include any of
-the [Spark ORC options](https://spark.apache.org/docs/latest/sql-data-sources-orc.html) via the `-P` option
-to control how Avro content is read. These options are expressed as `-PoptionName=optionValue`.
+the [Spark ORC data source options](https://spark.apache.org/docs/latest/sql-data-sources-orc.html) via the `-P` option
+to control how ORC content is read. These options are expressed as `-PoptionName=optionValue`.
 
+For the configuration options listed in the above Spark ORC guide, use the `-C` option instead. For example,
+`-Cspark.sql.orc.filterPushdown=false` would configure Spark ORC to not push down filters.

docs/import/import-files/parquet.md

Lines changed: 4 additions & 1 deletion
@@ -95,5 +95,8 @@ explicitly specify a compression algorithm if Flux is not able to read your comp
 ## Advanced options
 
 The `import-parquet-files` command reuses Spark's support for reading Parquet files. You can include any of
-the [Spark Parquet options](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html) via the `-P` option
+the [Spark Parquet data source options](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html) via the `-P` option
 to control how Parquet content is read. These options are expressed as `-PoptionName=optionValue`.
+
+For the configuration options listed in the above Spark Parquet guide, use the `-C` option instead. For example,
+`-Cspark.sql.parquet.filterPushdown=false` would configure Spark Parquet to not push down filters.

flux-cli/src/main/java/com/marklogic/flux/impl/SparkUtil.java

Lines changed: 4 additions & 0 deletions
@@ -22,6 +22,10 @@ public static SparkSession buildSparkSession() {
     public static SparkSession buildSparkSession(String masterUrl, boolean showConsoleProgress) {
         SparkSession.Builder builder = SparkSession.builder()
             .master(masterUrl)
+            // Spark config options can be provided now or at runtime via spark.conf().set(). The downside to setting
+            // options now that are defined by the user is that they won't work when used with spark-submit, which
+            // handles constructing a SparkSession. We may eventually provide a feature though for providing options
+            // at this point for local users that want more control over the SparkSession itself.
             .config("spark.sql.session.timeZone", "UTC");
 
         if (showConsoleProgress) {
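For context, below is a hypothetical caller of the `buildSparkSession` method shown above; only the method signature is taken from the diff, and the argument values and follow-up configuration call are assumptions. It illustrates the point made in the new comment: user-supplied config items can still be applied after the session exists via `spark.conf().set()`, which keeps them working even when `spark-submit` owns session construction.

```java
import com.marklogic.flux.impl.SparkUtil;
import org.apache.spark.sql.SparkSession;

public class BuildSessionSketch {
    public static void main(String[] args) {
        // Signature taken from the diff above; a local master URL and disabled console
        // progress are illustrative values.
        SparkSession session = SparkUtil.buildSparkSession("local[*]", false);

        // A -C style value applied at runtime, as the new comment describes.
        session.conf().set("spark.sql.avro.filterPushdown.enabled", "false");

        session.stop();
    }
}
```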

flux-cli/src/test/java/com/marklogic/flux/impl/importdata/ImportAvroFilesTest.java

Lines changed: 4 additions & 1 deletion
@@ -25,7 +25,10 @@ void defaultSettingsMultipleFiles() {
 
             // Including these for manual verification of progress logging.
             "--batch-size", "1",
-            "--log-progress", "2"
+            "--log-progress", "2",
+
+            // Including this to ensure a valid -C option doesn't cause an error.
+            "-Cspark.sql.avro.filterPushdown.enabled=false"
         );
 
         assertCollectionSize("avro-test", 6);

flux-cli/src/test/java/com/marklogic/flux/impl/importdata/ImportOrcFilesTest.java

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -25,7 +25,10 @@ void orcFileTest() {
2525

2626
// Including these for manual verification of progress logging.
2727
"--batch-size", "5",
28-
"--log-progress", "5"
28+
"--log-progress", "5",
29+
30+
// Including this to ensure a valid -C option doesn't cause an error.
31+
"-Cspark.sql.orc.filterPushdown=false"
2932
);
3033

3134
getUrisInCollection("orcFile-test", 15).forEach(this::verifyDocContent);

flux-cli/src/test/java/com/marklogic/flux/impl/importdata/ImportParquetFilesTest.java

Lines changed: 4 additions & 1 deletion
@@ -26,7 +26,10 @@ void defaultSettingsSingleFile() {
 
             // Including these for manual verification of progress logging.
             "--batch-size", "5",
-            "--log-progress", "10"
+            "--log-progress", "10",
+
+            // Including this to ensure a valid -C option doesn't cause an error.
+            "-Cspark.sql.parquet.filterPushdown=false"
         );
 
         assertCollectionSize("parquet-test", 32);
