
Merge pull request #195 from marklogic/feature/export-docs
Fixes for export docs
rjrudin authored Jul 22, 2024
2 parents 58cbd39 + 200d7f2 commit ebca27f
Showing 11 changed files with 181 additions and 127 deletions.
50 changes: 25 additions & 25 deletions docs/common-options.md
All available connection options are shown in the table below:

| Option | Description |
| --- | --- |
| `--auth-type` | Type of authentication to use. Possible values are `BASIC`, `DIGEST`, `CLOUD`, `KERBEROS`, `CERTIFICATE`, and `SAML`.|
| `--base-path` | Path to prepend to each call to a MarkLogic [REST API app server](https://docs.marklogic.com/guide/rest-dev). |
| `--certificate-file` | File path for a keystore to be used for `CERTIFICATE` authentication. |
| `--certificate-password` | Password for the keystore referenced by `--certificate-file`. |
| `--connection-string` | Defines a connection string as user:password@host:port/optionalDatabaseName; only usable when using `DIGEST` or `BASIC` authentication. |
| `--cloud-api-key` | API key for authenticating with a MarkLogic Cloud cluster when authentication type is `CLOUD`. |
| `--connection-type` | Set to `DIRECT` if connections can be made directly to each host in the MarkLogic cluster. Defaults to `GATEWAY`. Possible values are `DIRECT` and `GATEWAY`. |
| `--database` | Name of a database to connect if it differs from the one associated with the app server identified by `--port`. |
| `--disable-gzipped-responses` | If included, responses from MarkLogic will not be gzipped. May improve performance when responses are very small. |
| `--host` | The MarkLogic host to connect to. |
| `--kerberos-principal` | Principal to be used with `KERBEROS` authentication. |
| `--keystore-algorithm` | Algorithm of the keystore identified by `--keystore-path`; defaults to `SunX509`. |
| `--keystore-password` | Password for the keystore identified by `--keystore-path`. |
| `--keystore-path` | File path for a keystore for two-way SSL connections. |
| `--keystore-type` | Type of the keystore identified by `--keystore-path`; defaults to `JKS`. |
| `--password` | Password when using `DIGEST` or `BASIC` authentication. |
| `--port` | Port of the [REST API app server](https://docs.marklogic.com/guide/rest-dev) to connect to. |
| `--saml-token` | Token to be used with `SAML` authentication. |
| `--ssl-hostname-verifier` | Hostname verification strategy when connecting via SSL. Possible values are `ANY`, `COMMON`, and `STRICT`. |
| `--ssl-protocol` | SSL protocol to use when the MarkLogic app server requires an SSL connection. If a keystore or truststore is configured, defaults to `TLSv1.2`. |
| `--truststore-algorithm` | Algorithm of the truststore identified by `--truststore-path`; defaults to `SunX509`. |
| `--truststore-password` | Password for the truststore identified by `--truststore-path`. |
| `--truststore-path` | File path for a truststore for establishing trust with the certificate used by the MarkLogic app server. |
| `--truststore-type` | Type of the truststore identified by `--truststore-path`; defaults to `JKS`. |
| `--username` | Username when using `DIGEST` or `BASIC` authentication. |
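
As an example, here is a sketch of a connection made with individual options instead of `--connection-string` (the command, host, port, and credentials are placeholders):

```
./bin/flux export-files \
--path export \
--host localhost \
--port 8004 \
--auth-type DIGEST \
--username myuser \
--password changeme \
etc...
```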


## Reading options from a file
17 changes: 9 additions & 8 deletions docs/copy.md
The following options control which documents are read from MarkLogic:

| Option | Description |
| --- |--- |
| `--collections` | Comma-delimited sequence of collection names. |
| `--directory` | A database directory for constraining on URIs. |
| `--options` | Name of a REST API search options document; typically used with a string query. |
| `--query` | A structured, serialized CTS, or combined query expressed as JSON or XML. |
| `--string-query` | A string query utilizing MarkLogic's search grammar. |
| `--uris` | Newline-delimited sequence of document URIs to retrieve. |

You must specify at least one of `--collections`, `--directory`, `--query`, `--string-query`, or `--uris`. You may specify any
combination of those options as well, with the exception that `--query` will be ignored if `--uris` is specified.

For examples of what the `--query` option supports, see
[the MarkLogic search documentation](https://docs.marklogic.com/guide/rest-dev/search#id_49329).
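
As a sketch, the following selects documents in the `employee` collection that also match a string query (connection details are placeholders, and the options identifying the target database are omitted):

```
./bin/flux copy \
--connection-string user:password@localhost:8000 \
--collections employee \
--string-query Engineering \
etc...
```
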
19 changes: 11 additions & 8 deletions docs/export/custom-export.md
Flux's custom export commands allow you to read documents or rows from MarkLogic and write the results to a custom target.
## Usage

With the required `--target` option, you can specify
[any Spark data source](https://spark.apache.org/docs/latest/sql-data-sources.html) or the name of a third-party Spark
connector. For a third-party Spark connector, you must include the necessary JAR files for the connector in the
`./ext` directory of your Flux installation. Note that if the connector is not available as a single "uber" jar, you
will need to ensure that the connector and all of its dependencies are added to the `./ext` directory.

As an example, Flux does not provide an out-of-the-box command that uses the Spark `text` data source, but you can use it
via `custom-export-rows`:

```
./bin/flux custom-export-rows \
--target text \
-Ppath=export \
--connection-string user:password@localhost:8000 \
--query "op.fromView('schema', 'view')" etc...
```

## Exporting rows

When using `custom-export-rows` with
[an Optic query](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_46710) to select rows from MarkLogic,
each row sent to the connector or
data source defined by `--target` will have a schema based on the output of the Optic query. You may find the
`--preview` and `--preview-schema` options helpful in understanding what data will be in these rows.
See [Common Options](../common-options.md) for more information.
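
For example, a quick sketch of previewing ten rows without writing anything (connection details and the view name are placeholders):

```
./bin/flux custom-export-rows \
--connection-string user:password@localhost:8000 \
--query "op.fromView('schema', 'view')" \
--preview 10
```
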
Each document read from MarkLogic is represented as a Spark row with the following column definitions:
7. `properties` containing an XML document serialized to a string.
8. `metadataValues` containing a map of string keys and string values.

These are normal Spark rows that can be written via Spark data sources like Parquet and ORC. If using a third-party
Spark connector, you will likely need to understand how that connector will make use of rows defined via the above
schema in order to get your desired results.
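
As an illustration, here is a hedged sketch that writes document rows to Parquet files via Spark's `parquet` data source; the `custom-export-documents` command name and all values shown are assumptions based on the patterns above:

```
./bin/flux custom-export-documents \
--connection-string user:password@localhost:8000 \
--collections employee \
--target parquet \
-Ppath=export \
etc...
```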

30 changes: 18 additions & 12 deletions docs/export/export-archives.md
Flux can export documents and their metadata as archive files, which can later be imported into a MarkLogic database.
The `export-archive-files` command requires a query for selecting documents to export and a directory path for writing
archive files to.

The following options control which documents are selected to be exported:

| Option | Description |
| --- |--- |
| `--collections` | Comma-delimited sequence of collection names. |
| `--directory` | A database directory for constraining on URIs. |
| `--options` | Name of a REST API search options document; typically used with a string query. |
| `--query` | A structured, serialized CTS, or combined query expressed as JSON or XML. |
| `--string-query` | A string query utilizing MarkLogic's search grammar. |
| `--uris` | Newline-delimited sequence of document URIs to retrieve. |

You must specify at least one of `--collections`, `--directory`, `--query`, `--string-query`, or `--uris`. You may specify any
combination of those options as well, with the exception that `--query` will be ignored if `--uris` is specified.

You must then use the `--path` option to specify a directory to write archive files to.
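
For example (connection details and paths are placeholders):

```
./bin/flux export-archive-files \
--connection-string user:password@localhost:8000 \
--collections employee \
--path archives \
etc...
```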

Flux supports applying a MarkLogic REST transform to each document before it is written to an archive. A transform is configured via the following options:

| Option | Description |
| --- | --- |
| `--transform` | Name of a MarkLogic REST transform to apply to the document before writing it. |
| `--transform-params` | Comma-delimited list of transform parameter names and values - e.g. param1,value1,param2,value2. |
| `--transform-params-delimiter` | Delimiter for `--transform-params`; typically set when a value contains a comma. |
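
As a sketch, the following applies a hypothetical transform named `redact-ssn` with a single parameter (the transform name and parameter are illustrative only):

```
./bin/flux export-archive-files \
--connection-string user:password@localhost:8000 \
--collections employee \
--path archives \
--transform redact-ssn \
--transform-params level,full \
etc...
```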

## Specifying an encoding

MarkLogic stores all content [in the UTF-8 encoding](https://docs.marklogic.com/guide/search-dev/encodings_collations#id_87576).
You can specify an alternate encoding when exporting archives via the `--encoding` option - e.g.:

```
./bin/flux export-archive-files \
--path destination \
--encoding ISO-8859-1 \
etc...
```

The encoding will be used for both document and metadata entries in each archive zip file.
64 changes: 42 additions & 22 deletions docs/export/export-documents.md
Flux can export documents to files, with each document being written as a separate file.

## Usage

The `export-files` command selects documents in a MarkLogic database and writes them to a filesystem.
You must specify a `--path` option for where files should be written along with connection information for the
MarkLogic database you wish to read from - for example:

```
./bin/flux export-files \
--path /path/to/files \
--connection-string "user:password@localhost:8000" etc...
```

The following options control which documents are selected to be exported:

| Option | Description |
| --- |--- |
| `--collections` | Comma-delimited sequence of collection names. |
| `--directory` | A database directory for constraining on URIs. |
| `--options` | Name of a REST API search options document; typically used with a string query. |
| `--query` | A structured, serialized CTS, or combined query expressed as JSON or XML. |
| `--string-query` | A string query utilizing MarkLogic's search grammar. |
| `--uris` | Newline-delimited sequence of document URIs to retrieve. |

You must specify at least one of `--collections`, `--directory`, `--query`, `--string-query`, or `--uris`. You may specify any
combination of those options as well, with the exception that `--query` will be ignored if `--uris` is specified.

## Transforming document content

Flux supports applying a MarkLogic REST transform to each document before it is written to a file. A transform is configured via the following options:

| Option | Description |
| --- | --- |
| `--transform` | Name of a MarkLogic REST transform to apply to each document before writing it to its destination. |
| `--transform-params` | Comma-delimited list of transform parameter names and values - e.g. param1,value1,param2,value2. |
| `--transform-params-delimiter` | Delimiter for `--transform-params`; typically set when a value contains a comma. |
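
For instance, here is a sketch using a hypothetical `add-notes` transform whose parameter value contains a comma, which requires a custom delimiter:

```
./bin/flux export-files \
--connection-string user:password@localhost:8000 \
--collections employee \
--path export \
--transform add-notes \
--transform-params "note;first,second" \
--transform-params-delimiter ";" \
etc...
```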

## Compressing content

The `--compression` option is used to write files either as Gzip or ZIP files.
To Gzip each file, include `--compression GZIP`.

To write multiple files to one or more ZIP files, include `--compression ZIP`. A zip file will be created for each
partition that was created when reading data from MarkLogic. You can include `--zip-file-count 1` to force all documents to be
written to a single ZIP file. See the section below on "Understanding partitions" for more information.
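
For example, to Gzip each exported file (connection details are placeholders):

```
./bin/flux export-files \
--connection-string user:password@localhost:8000 \
--collections employee \
--path export \
--compression GZIP
```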

## Specifying an encoding

MarkLogic stores all content [in the UTF-8 encoding](https://docs.marklogic.com/guide/search-dev/encodings_collations#id_87576).
You can specify an alternate encoding when exporting documents to files via the `--encoding` option - e.g.:

```
./bin/flux export-files \
--path destination \
--encoding ISO-8859-1 \
etc...
```

## Understanding partitions

As Flux is built on top of Apache Spark, it is heavily influenced by how Spark
[defines and manages partitions](https://sparkbyexamples.com/spark/spark-partitioning-understanding/). Within the
context of Flux, partitions can be thought of as "workers", with each worker operating in parallel on a different subset
of data. Generally, more partitions allow for more parallel work and improved performance.

When exporting documents to files, the number of partitions impacts how many files will be written. For example, run
the following command from the [Getting Started guide](getting-started.md):

```
rm export/*.zip
./bin/flux export-files \
--connection-string flux-example-user:password@localhost:8004 \
--collections employee \
--path export \
--compression zip
```

The `./export` directory will have 12 zip files in it. This count is due to how Flux reads data from MarkLogic,
which by default uses multiple partitions per forest. You can use the `--partitions-per-forest` option to control
how many partitions are read from each forest in your database:

```
rm export/*.zip
./bin/flux export-files \
--connection-string flux-example-user:password@localhost:8004 \
--collections employee \
--path export \
--compression zip \
--partitions-per-forest 1
```

To force a specific number of ZIP files to be written regardless of how the data was partitioned when read, you can
use the `--zip-file-count` option:

```
rm export/*.zip
./bin/flux export-files \
--connection-string flux-example-user:password@localhost:8004 \
--collections employee \
--path export \
--compression zip \
--zip-file-count 1
```

This approach will produce a single zip file due to the use of a single partition when writing files.
The `--zip-file-count` option is effectively an alias for `--repartition`. Both options produce the same outcome.
`--zip-file-count` is included as a more intuitive option for the common case of configuring how many files should
be written.

Note that Spark's support for repartitioning may negatively impact overall performance due to the need to read all
data from the data source first before writing any data.