
Merge pull request #195 from marklogic/feature/export-docs
Fixes for export docs
rjrudin authored Jul 22, 2024
2 parents 58cbd39 + 200d7f2 commit ebca27f
Showing 11 changed files with 181 additions and 127 deletions.
50 changes: 25 additions & 25 deletions docs/common-options.md
All available connection options are shown in the table below:

| Option | Description |
| --- | --- |
| `--auth-type` | Type of authentication to use. Possible values are `BASIC`, `DIGEST`, `CLOUD`, `KERBEROS`, `CERTIFICATE`, and `SAML`.|
| `--base-path` | Path to prepend to each call to a MarkLogic [REST API app server](https://docs.marklogic.com/guide/rest-dev). |
| `--certificate-file` | File path for a keystore to be used for `CERTIFICATE` authentication. |
| `--certificate-password` | Password for the keystore referenced by `--certificate-file`. |
| `--connection-string` | Defines a connection string as user:password@host:port/optionalDatabaseName; only usable when using `DIGEST` or `BASIC` authentication. |
| `--cloud-api-key` | API key for authenticating with a MarkLogic Cloud cluster when authentication type is `CLOUD`. |
| `--connection-type` | Set to `DIRECT` if connections can be made directly to each host in the MarkLogic cluster. Defaults to `GATEWAY`. Possible values are `DIRECT` and `GATEWAY`. |
| `--database` | Name of a database to connect if it differs from the one associated with the app server identified by `--port`. |
| `--disable-gzipped-responses` | If included, responses from MarkLogic will not be gzipped. May improve performance when responses are very small. |
| `--host` | The MarkLogic host to connect to. |
| `--kerberos-principal` | Principal to be used with `KERBEROS` authentication. |
| `--keystore-algorithm` | Algorithm of the keystore identified by `--keystore-path`; defaults to `SunX509`. |
| `--keystore-password` | Password for the keystore identified by `--keystore-path`. |
| `--keystore-path` | File path for a keystore for two-way SSL connections. |
| `--keystore-type` | Type of the keystore identified by `--keystore-path`; defaults to `JKS`. |
| `--password` | Password when using `DIGEST` or `BASIC` authentication. |
| `--port` | Port of the [REST API app server](https://docs.marklogic.com/guide/rest-dev) to connect to. |
| `--saml-token` | Token to be used with `SAML` authentication. |
| `--ssl-hostname-verifier` | Hostname verification strategy when connecting via SSL. Possible values are `ANY`, `COMMON`, and `STRICT`. |
| `--ssl-protocol` | SSL protocol to use when the MarkLogic app server requires an SSL connection. If a keystore or truststore is configured, defaults to `TLSv1.2`. |
| `--truststore-algorithm` | Algorithm of the truststore identified by `--truststore-path`; defaults to `SunX509`. |
| `--truststore-password` | Password for the truststore identified by `--truststore-path`. |
| `--truststore-path` | File path for a truststore for establishing trust with the certificate used by the MarkLogic app server. |
| `--truststore-type` | Type of the truststore identified by `--truststore-path`; defaults to `JKS`. |
| `--username` | Username when using `DIGEST` or `BASIC` authentication. |
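
As an example, here is a sketch of a connection made with individual options instead of `--connection-string` (the command, host, port, and credentials are placeholders):

```
./bin/flux export-files \
--path export \
--host localhost \
--port 8004 \
--auth-type DIGEST \
--username myuser \
--password changeme \
etc...
```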


## Reading options from a file
17 changes: 9 additions & 8 deletions docs/copy.md
The following options control which documents are read from MarkLogic:

| Option | Description |
| --- |--- |
| `--collections` | Comma-delimited sequence of collection names. |
| `--directory` | A database directory for constraining on URIs. |
| `--options` | Name of a REST API search options document; typically used with a string query. |
| `--query` | A structured, serialized CTS, or combined query expressed as JSON or XML. |
| `--string-query` | A string query utilizing MarkLogic's search grammar. |
| `--uris` | Newline-delimited sequence of document URIs to retrieve. |

You must specify at least one of `--collections`, `--directory`, `--query`, `--string-query`, or `--uris`. You may specify any
combination of those options as well, with the exception that `--query` will be ignored if `--uris` is specified.

For examples of what the `--query` option supports, see
[the MarkLogic search documentation](https://docs.marklogic.com/guide/rest-dev/search#id_49329).
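
As a sketch, the following selects documents in the `employee` collection that also match a string query (connection details are placeholders, and the options identifying the target database are omitted):

```
./bin/flux copy \
--connection-string user:password@localhost:8000 \
--collections employee \
--string-query Engineering \
etc...
```
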
19 changes: 11 additions & 8 deletions docs/export/custom-export.md
Flux's custom export commands allow you to read documents or rows from MarkLogic and write the results to a custom target.
## Usage

With the required `--target` option, you can specify
[any Spark data source](https://spark.apache.org/docs/latest/sql-data-sources.html) or the name of a third-party Spark
connector. For a third-party Spark connector, you must include the necessary JAR files for the connector in the
`./ext` directory of your Flux installation. Note that if the connector is not available as a single "uber" jar, you
will need to ensure that the connector and all of its dependencies are added to the `./ext` directory.

As an example, Flux does not provide an out-of-the-box command that uses the Spark `text` data source, but you can use it
via `custom-export-rows`:

```
./bin/flux custom-export-rows \
--target text \
-Ppath=export \
--connection-string user:password@localhost:8000 \
--query "op.fromView('schema', 'view')" etc...
```

## Exporting rows

When using `custom-export-rows` with
[an Optic query](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_46710) to select rows from MarkLogic,
each row sent to the connector or
data source defined by `--target` will have a schema based on the output of the Optic query. You may find the
`--preview` and `--preview-schema` options helpful in understanding what data will be in these rows.
See [Common Options](../common-options.md) for more information.
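
For example, a quick sketch of previewing ten rows without writing anything (connection details and the view name are placeholders):

```
./bin/flux custom-export-rows \
--connection-string user:password@localhost:8000 \
--query "op.fromView('schema', 'view')" \
--preview 10
```
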
Each document read from MarkLogic is represented as a Spark row with the following column definitions:
7. `properties` containing an XML document serialized to a string.
8. `metadataValues` containing a map of string keys and string values.

These are normal Spark rows that can be written via Spark data sources like Parquet and ORC. If using a third-party
Spark connector, you will likely need to understand how that connector will make use of rows defined via the above
schema in order to get your desired results.
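
As an illustration, here is a hedged sketch that writes document rows to Parquet files via Spark's `parquet` data source; the `custom-export-documents` command name and all values shown are assumptions based on the patterns above:

```
./bin/flux custom-export-documents \
--connection-string user:password@localhost:8000 \
--collections employee \
--target parquet \
-Ppath=export \
etc...
```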

30 changes: 18 additions & 12 deletions docs/export/export-archives.md
Flux can export documents and their metadata as archive files, which can later be imported into a MarkLogic database.
The `export-archive-files` command requires a query for selecting documents to export and a directory path for writing
archive files to.

The following options control which documents are selected to be exported:

| Option | Description |
| --- |--- |
| `--collections` | Comma-delimited sequence of collection names. |
| `--directory` | A database directory for constraining on URIs. |
| `--options` | Name of a REST API search options document; typically used with a string query. |
| `--query` | A structured, serialized CTS, or combined query expressed as JSON or XML. |
| `--string-query` | A string query utilizing MarkLogic's search grammar. |
| `--uris` | Newline-delimited sequence of document URIs to retrieve. |

You must specify at least one of `--collections`, `--directory`, `--query`, `--string-query`, or `--uris`. You may specify any
combination of those options as well, with the exception that `--query` will be ignored if `--uris` is specified.

You must then use the `--path` option to specify a directory to write archive files to.
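
For example (connection details and paths are placeholders):

```
./bin/flux export-archive-files \
--connection-string user:password@localhost:8000 \
--collections employee \
--path archives \
etc...
```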

Flux supports applying a MarkLogic REST transform to each document before it is written to an archive. A transform is configured via the following options:

| Option | Description |
| --- | --- |
| `--transform` | Name of a MarkLogic REST transform to apply to the document before writing it. |
| `--transform-params` | Comma-delimited list of transform parameter names and values - e.g. param1,value1,param2,value2. |
| `--transform-params-delimiter` | Delimiter for `--transform-params`; typically set when a value contains a comma. |
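
As a sketch, the following applies a hypothetical transform named `redact-ssn` with a single parameter (the transform name and parameter are illustrative only):

```
./bin/flux export-archive-files \
--connection-string user:password@localhost:8000 \
--collections employee \
--path archives \
--transform redact-ssn \
--transform-params level,full \
etc...
```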

## Specifying an encoding

MarkLogic stores all content [in the UTF-8 encoding](https://docs.marklogic.com/guide/search-dev/encodings_collations#id_87576).
You can specify an alternate encoding when exporting archives via the `--encoding` option - e.g.:

```
./bin/flux export-archive-files \
--path destination \
--encoding ISO-8859-1 \
etc...
```

The encoding will be used for both document and metadata entries in each archive zip file.
64 changes: 42 additions & 22 deletions docs/export/export-documents.md
Flux can export documents to files, with each document being written as a separate file.

## Usage

The `export-files` command selects documents in a MarkLogic database and writes them to a filesystem.
You must specify a `--path` option for where files should be written along with connection information for the
MarkLogic database you wish to read from - for example:

```
./bin/flux export-files \
--path /path/to/files \
--connection-string "user:password@localhost:8000" etc...
```

The following options control which documents are selected to be exported:

| Option | Description |
| --- |--- |
| `--collections` | Comma-delimited sequence of collection names. |
| `--directory` | A database directory for constraining on URIs. |
| `--options` | Name of a REST API search options document; typically used with a string query. |
| `--query` | A structured, serialized CTS, or combined query expressed as JSON or XML. |
| `--string-query` | A string query utilizing MarkLogic's search grammar. |
| `--uris` | Newline-delimited sequence of document URIs to retrieve. |

You must specify at least one of `--collections`, `--directory`, `--query`, `--string-query`, or `--uris`. You may specify any
combination of those options as well, with the exception that `--query` will be ignored if `--uris` is specified.

## Transforming document content

Flux supports applying a MarkLogic REST transform to each document before it is written to a file. A transform is configured via the following options:

| Option | Description |
| --- | --- |
| `--transform` | Name of a MarkLogic REST transform to apply to each document before writing it to its destination. |
| `--transform-params` | Comma-delimited list of transform parameter names and values - e.g. param1,value1,param2,value2. |
| `--transform-params-delimiter` | Delimiter for `--transform-params`; typically set when a value contains a comma. |
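
For instance, here is a sketch using a hypothetical `add-notes` transform whose parameter value contains a comma, which requires a custom delimiter:

```
./bin/flux export-files \
--connection-string user:password@localhost:8000 \
--collections employee \
--path export \
--transform add-notes \
--transform-params "note;first,second" \
--transform-params-delimiter ";" \
etc...
```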

## Compressing content

The `--compression` option is used to write files either as Gzip or ZIP files.
To Gzip each file, include `--compression GZIP`.

To write multiple files to one or more ZIP files, include `--compression ZIP`. A zip file will be created for each
partition that was created when reading data from MarkLogic. You can include `--zip-file-count 1` to force all documents to be
written to a single ZIP file. See the section below on "Understanding partitions" for more information.
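
For example, to Gzip each exported file (connection details are placeholders):

```
./bin/flux export-files \
--connection-string user:password@localhost:8000 \
--collections employee \
--path export \
--compression GZIP
```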

## Specifying an encoding

MarkLogic stores all content [in the UTF-8 encoding](https://docs.marklogic.com/guide/search-dev/encodings_collations#id_87576).
You can specify an alternate encoding when exporting documents to files via the `--encoding` option - e.g.:

```
./bin/flux export-files \
--path destination \
--encoding ISO-8859-1 \
etc...
```

## Understanding partitions

As Flux is built on top of Apache Spark, it is heavily influenced by how Spark
[defines and manages partitions](https://sparkbyexamples.com/spark/spark-partitioning-understanding/). Within the
context of Flux, partitions can be thought of as "workers", with each worker operating in parallel on a different subset
of data. Generally, more partitions allow for more parallel work and improved performance.

When exporting documents to files, the number of partitions impacts how many files will be written. For example, run
the following command from the [Getting Started guide](getting-started.md):

```
rm export/*.zip
./bin/flux export-files \
--connection-string flux-example-user:password@localhost:8004 \
--collections employee \
--path export \
--compression zip
```

The `./export` directory will have 12 zip files in it. This count is due to how Flux reads data from MarkLogic,
which by default uses multiple partitions per forest. You can use the `--partitions-per-forest` option to control
how many partitions are read from each forest in your database:

```
rm export/*.zip
./bin/flux export-files \
--connection-string flux-example-user:password@localhost:8004 \
--collections employee \
--path export \
--compression zip \
--partitions-per-forest 1
```

To force a specific number of ZIP files to be written regardless of how the data was partitioned when read, you can
use the `--zip-file-count` option:

```
rm export/*.zip
./bin/flux export-files \
--connection-string flux-example-user:password@localhost:8004 \
--collections employee \
--path export \
--compression zip \
--zip-file-count 1
```

This approach will produce a single zip file due to the use of a single partition when writing files.
The `--zip-file-count` option is effectively an alias for `--repartition`. Both options produce the same outcome.
`--zip-file-count` is included as a more intuitive option for the common case of configuring how many files should
be written.

Note that Spark's support for repartitioning may negatively impact overall performance due to the need to read all
data from the data source first before writing any data.