Merge pull request #279 from marklogic/release/1.1.2
Merge 1.1.2 into master
rjrudin authored Oct 17, 2024
2 parents e0796d0 + 1be7440 commit 8916c8d
Showing 29 changed files with 321 additions and 58 deletions.
8 changes: 4 additions & 4 deletions docs/api.md
@@ -22,15 +22,15 @@ To add Flux as a dependency to your application, add the following to your Maven
<dependency>
<groupId>com.marklogic</groupId>
<artifactId>flux-api</artifactId>
-<version>1.1.0</version>
+<version>1.1.2</version>
</dependency>
```

Or if you are using Gradle, add the following to your `build.gradle` file:

```
dependencies {
implementation "com.marklogic:flux-api:1.1.0"
implementation "com.marklogic:flux-api:1.1.2"
}
```

@@ -97,7 +97,7 @@ buildscript {
mavenCentral()
}
dependencies {
classpath "com.marklogic:flux-api:1.1.0"
classpath "com.marklogic:flux-api:1.1.2"
}
}
```
@@ -139,7 +139,7 @@ buildscript {
mavenCentral()
}
dependencies {
classpath "com.marklogic:flux-api:1.1.0"
classpath "com.marklogic:flux-api:1.1.2"
classpath("com.marklogic:ml-gradle:4.8.0") {
exclude group: "com.fasterxml.jackson.databind"
exclude group: "com.fasterxml.jackson.core"
16 changes: 12 additions & 4 deletions docs/export/export-archives.md
@@ -82,9 +82,9 @@ included with the `--categories` option. This option accepts a comma-delimited s

If the option is not included, all metadata will be included.

-## Enabling point-in-time queries
+## Exporting consistent results

-Flux depends on MarkLogic's support for
+By default, Flux uses MarkLogic's support for
[point-in-time queries](https://docs.marklogic.com/11.0/guide/app-dev/point_in_time#id_47946) when querying for
documents, thus ensuring a [consistent snapshot of data](https://docs.marklogic.com/guide/java/data-movement#id_18227).
Point-in-time queries depend on the same MarkLogic system timestamp being used for each query. Because system timestamps
@@ -102,8 +102,16 @@ by configuring the `merge timestamp` setting. The recommended practice is to
that exceeds the expected duration of the export operation. For example, a value of `-864,000,000,000` for the merge
timestamp would give the export operation 24 hours to complete.
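That value works out as follows: merge timestamps are expressed in ten-millionths of a second, so
864,000,000,000 ÷ 10,000,000 = 86,400 seconds, or 24 hours.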

-Flux will soon include an option to not use a snapshot for queries for when the risk of inconsistent results is deemed
-to be acceptable.
Alternatively, you can disable the use of point-in-time queries by including the following option:

```
--no-snapshot
```

With this option, Flux will not use a snapshot but will instead query for data at multiple points in time. As
noted above in the guide for [consistent snapshots](https://docs.marklogic.com/guide/java/data-movement#id_18227), you
may get unpredictable results if your query matches data that changes during the export operation. If your data is
not changing, this approach is recommended, as it avoids the need to configure the merge timestamp.
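
For example, the following notional archive export opts out of point-in-time queries; the command name,
connection string, and path are placeholders that follow the patterns shown elsewhere in this documentation:

```
./bin/flux export-archive-files \
    --connection-string user:password@host:8000 \
    --path destination/archives \
    --no-snapshot
```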

## Transforming document content

17 changes: 13 additions & 4 deletions docs/export/export-documents.md
@@ -122,9 +122,9 @@ you use for running Flux to break the value into multiple lines:
For queries expressed in XML, you may find it easier to use single quotes instead of double quotes, as single quotes
do not require any escaping.

-## Enabling point-in-time queries
+## Exporting consistent results

-Flux depends on MarkLogic's support for
+By default, Flux uses MarkLogic's support for
[point-in-time queries](https://docs.marklogic.com/11.0/guide/app-dev/point_in_time#id_47946) when querying for
documents, thus ensuring a [consistent snapshot of data](https://docs.marklogic.com/guide/java/data-movement#id_18227).
Point-in-time queries depend on the same MarkLogic system timestamp being used for each query. Because system timestamps
@@ -142,8 +142,17 @@ by configuring the `merge timestamp` setting. The recommended practice is to
that exceeds the expected duration of the export operation. For example, a value of `-864,000,000,000` for the merge
timestamp would give the export operation 24 hours to complete.

-Flux will soon include an option to not use a snapshot for queries for when the risk of inconsistent results is deemed
-to be acceptable.
Alternatively, you can disable the use of point-in-time queries by including the following option:

```
--no-snapshot
```

With this option, Flux will not use a snapshot but will instead query for data at multiple points in time. As
noted above in the guide for [consistent snapshots](https://docs.marklogic.com/guide/java/data-movement#id_18227), you
may get unpredictable results if your query matches data that changes during the export operation. If your data is
not changing, this approach is recommended, as it avoids the need to configure the merge timestamp.


## Transforming document content

20 changes: 14 additions & 6 deletions docs/export/export-rdf.md
@@ -90,11 +90,11 @@ graph value that will then be associated with every triple that Flux writes to a

To compress each file written by Flux using gzip, simply include `--gzip` as an option.

-## Enabling point-in-time queries
+## Exporting consistent results

-Flux depends on MarkLogic's support for
+By default, Flux uses MarkLogic's support for
[point-in-time queries](https://docs.marklogic.com/11.0/guide/app-dev/point_in_time#id_47946) when querying for
-documents containing RDF data, thus ensuring a [consistent snapshot of data](https://docs.marklogic.com/guide/java/data-movement#id_18227).
+documents, thus ensuring a [consistent snapshot of data](https://docs.marklogic.com/guide/java/data-movement#id_18227).
Point-in-time queries depend on the same MarkLogic system timestamp being used for each query. Because system timestamps
can be deleted when MarkLogic [merges data](https://docs.marklogic.com/11.0/guide/admin-guide/en/understanding-and-controlling-database-merges.html),
you may encounter the following error that causes an export command to fail:
@@ -108,7 +108,15 @@ To resolve this issue, you must
by configuring the `merge timestamp` setting. The recommended practice is to
[use a negative value](https://docs.marklogic.com/11.0/guide/admin-guide/en/understanding-and-controlling-database-merges/setting-a-negative-merge-timestamp-to-preserve-fragments-for-a-rolling-window-of-time.html)
that exceeds the expected duration of the export operation. For example, a value of `-864,000,000,000` for the merge
timestamp would give the export operation 24 hours to complete.

-Flux will soon include an option to not use a snapshot for queries for when the risk of inconsistent results is deemed
-to be acceptable.
Alternatively, you can disable the use of point-in-time queries by including the following option:

```
--no-snapshot
```

With this option, Flux will not use a snapshot but will instead query for data at multiple points in time. As
noted above in the guide for [consistent snapshots](https://docs.marklogic.com/guide/java/data-movement#id_18227), you
may get unpredictable results if your query matches data that changes during the export operation. If your data is
not changing, this approach is recommended, as it avoids the need to configure the merge timestamp.
16 changes: 8 additions & 8 deletions docs/getting-started.md
@@ -15,18 +15,18 @@ This guide describes how to get started with Flux with some examples demonstrati
## Setup

You can download the latest release of the Flux application zip from [the latest Flux release page](https://github.com/marklogic/flux/releases).
-The Flux application zip is titled `marklogic-flux-1.1.0.zip`. You can extract this zip to any location on your
+The Flux application zip is titled `marklogic-flux-1.1.2.zip`. You can extract this zip to any location on your
filesystem that you prefer.

### Deploying the example application

The examples in this guide, along with examples found throughout this documentation, depend on a small MarkLogic
application that can be deployed to your own instance of MarkLogic server. The application can be downloaded from
[the latest Flux release page](https://github.com/marklogic/flux/releases) in a zip titled
-`marklogic-flux-getting-started-1.1.0.zip`. To use Flux with this example application, perform the following steps:
+`marklogic-flux-getting-started-1.1.2.zip`. To use Flux with this example application, perform the following steps:

-1. Extract the `marklogic-flux-getting-started-1.1.0.zip` file to any location on your local filesystem.
-2. Run `cd marklogic-flux-getting-started-1.1.0` to change to the directory created by extracting the ZIP file.
+1. Extract the `marklogic-flux-getting-started-1.1.2.zip` file to any location on your local filesystem.
+2. Run `cd marklogic-flux-getting-started-1.1.2` to change to the directory created by extracting the ZIP file.
3. Create a file named `gradle-local.properties` and add `mlPassword=your MarkLogic admin user password` to it.
4. Examine the contents of the `gradle.properties` file to ensure that the value of `mlHost` points to your MarkLogic
server and that the value of `mlRestPort` is a port available for a new MarkLogic app server to use.
@@ -38,15 +38,15 @@ privileges for running the examples in this guide. Finally, the application incl
[MarkLogic TDE template](https://docs.marklogic.com/guide/app-dev/TDE) that creates a view in MarkLogic for the purpose
of demonstrating commands that utilize a [MarkLogic Optic query](https://docs.marklogic.com/guide/app-dev/OpticAPI).

-It is recommended to extract the Flux application zip into the `marklogic-flux-getting-started-1.1.0` directory so that
+It is recommended to extract the Flux application zip into the `marklogic-flux-getting-started-1.1.2` directory so that
you can easily execute the examples in this guide. After extracting the application zip, the directory should have a
structure similar to this (not all files may be shown):

```
-./marklogic-flux-getting-started-1.1.0
+./marklogic-flux-getting-started-1.1.2
build.gradle
./data
-./marklogic-flux-1.1.0
+./marklogic-flux-1.1.2
./gradle
gradle.properties
gradlew
@@ -59,7 +59,7 @@ structure similar to this (not all files may be shown):
You can run Flux without any options to see the list of available commands. If you are using Flux to run these examples,
first change your current directory to where you extracted Flux:

-cd marklogic-flux-1.1.0
+cd marklogic-flux-1.1.2

And then run the Flux executable without any options:

22 changes: 21 additions & 1 deletion docs/import/import-files/json.md
@@ -91,6 +91,25 @@ Flux will write two separate JSON documents, each with a completely different sc
The JSON Lines format is often useful for exporting data from MarkLogic as well. Please see
[this guide](../../export/export-rows.md) for more information on exporting data to JSON Lines files.

### Importing JSON Lines files as is

When importing JSON Lines files, Flux uses the
[Spark JSON data source](https://spark.apache.org/docs/latest/sql-data-sources-json.html) to read each line and conform
the JSON objects to a common schema across the entire set of lines. As noted in the Advanced Options section below,
Spark JSON provides a number of configuration options for controlling how the lines are read. These features can result
in changes to the JSON objects, such as the keys being reordered and fields being added to match the common schema.

For some use cases, you may wish to read each line "as is" without any modification to it. To do so, use the
`--json-lines-raw` option instead of `--json-lines`. With the `--json-lines-raw` option, Flux will read each line as
a JSON document and will not attempt to enforce any commonality across the lines. This option also has the following
effects on the `import-aggregate-json-files` command:

1. You cannot use any `-P` options as described in the "Advanced Options" section below.
2. The `--uri-include-file-path` option has no effect as each JSON document will default to a URI including the file path.
3. The following options also have no effect as each JSON document is intentionally left as is: `--json-root-name`, `--xml-root-name`,
`--xml-namespace`, and `--ignore-null-fields`.
4. You can still read a gzipped file if its filename ends in `.gz`.
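
For example, the following notional invocation (the path and connection string are placeholders) reads each line
of each matching file as its own document, with no schema enforcement:

```
./bin/flux import-aggregate-json-files \
    --path path/to/data \
    --connection-string user:password@host:8000 \
    --json-lines-raw
```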

## Specifying a JSON root name

It is often useful to have a single "root" field in a JSON document so that it is more self-describing. It
@@ -130,7 +149,8 @@ bin\flux import-aggregate-json-files ^

Flux will automatically read files compressed with gzip when they have a filename ending in `.gz`; you do not need to
specify a compression option. As noted in the "Advanced options" section below, you can use `-Pcompression=` to
-explicitly specify a compression algorithm if Flux is not able to read your compressed files automatically.
+explicitly specify a compression algorithm if Flux is not able to read your compressed files automatically. Note
+that the use of `-Pcompression=` is only supported if the `--json-lines-raw` option is not used.

## Advanced options

8 changes: 4 additions & 4 deletions docs/spark-integration.md
@@ -35,8 +35,8 @@ Flux integrates with [spark-submit](https://spark.apache.org/docs/latest/submitt
submit a Flux command invocation to a remote Spark cluster. Every Flux command is a Spark application, and thus every
Flux command, along with all of its options, can be invoked via `spark-submit`.

-To use Flux with `spark-submit`, first download the `marklogic-flux-1.1.0-all.jar` file from the
-[GitHub release page](https://github.com/marklogic/flux/releases/tag/1.1.0). This jar file includes Flux and all of
+To use Flux with `spark-submit`, first download the `marklogic-flux-1.1.2-all.jar` file from the
+[GitHub release page](https://github.com/marklogic/flux/releases/tag/1.1.2). This jar file includes Flux and all of
its dependencies, excluding those of Spark itself, which will be provided via the Spark cluster that you connect to
via `spark-submit`.

@@ -48,7 +48,7 @@ The following shows a notional example of running the Flux `import-files` comman
```
$SPARK_HOME/bin/spark-submit --class com.marklogic.flux.spark.Submit \
--master spark://changeme:7077 \
-marklogic-flux-1.1.0-all.jar \
+marklogic-flux-1.1.2-all.jar \
import-files \
--path path/to/data \
--connection-string user:password@host:8000 \
@@ -59,7 +59,7 @@ $SPARK_HOME/bin/spark-submit --class com.marklogic.flux.spark.Submit \
```
$SPARK_HOME\bin\spark-submit --class com.marklogic.flux.spark.Submit ^
--master spark://changeme:7077 ^
-marklogic-flux-1.1.0-all.jar ^
+marklogic-flux-1.1.2-all.jar ^
import-files ^
--path path/to/data ^
--connection-string user:password@host:8000 ^
4 changes: 2 additions & 2 deletions examples/client-project/build.gradle
@@ -6,7 +6,7 @@ buildscript {
mavenLocal()
}
dependencies {
classpath "com.marklogic:flux-api:1.1.0"
classpath "com.marklogic:flux-api:1.1.2"

// Demonstrates removing the Jackson libraries that otherwise cause a conflict with
// Spark, which requires Jackson >= 2.14.0 and < 2.15.0.
@@ -28,7 +28,7 @@ repositories {
}

dependencies {
implementation "com.marklogic:flux-api:1.1.0"
implementation "com.marklogic:flux-api:1.1.2"
}

tasks.register("runApp", JavaExec) {
2 changes: 1 addition & 1 deletion flux-cli/build.gradle
@@ -17,7 +17,7 @@ dependencies {
// The rocksdbjni dependency weighs in at 50mb and so far does not appear necessary for our use of Spark.
exclude module: "rocksdbjni"
}
implementation "com.marklogic:marklogic-spark-connector:2.4.1"
implementation "com.marklogic:marklogic-spark-connector:2.4.2"
implementation "info.picocli:picocli:4.7.6"

// Spark 3.4.3 depends on Hadoop 3.3.4, which depends on AWS SDK 1.12.262. As of August 2024, all public releases of
@@ -17,9 +17,28 @@ interface ReadJsonFilesOptions extends ReadFilesOptions<ReadJsonFilesOptions> {
/**
* @param value set to true to read JSON Lines files. Defaults to reading files that either contain an array
* of JSON objects or a single JSON object.
* @deprecated since 1.1.2; use {@code jsonLines()} instead.
*/
@SuppressWarnings("java:S1133") // Telling Sonar we don't need a reminder to remove this some day.
@Deprecated(since = "1.1.2", forRemoval = true)
ReadJsonFilesOptions jsonLines(boolean value);

/**
* Call this to read JSON Lines files. Otherwise, defaults to reading files that either contain an array of
* JSON objects or a single JSON object.
*
* @since 1.1.2
*/
ReadJsonFilesOptions jsonLines();

/**
* Call this to read JSON Lines files "as is", without any alteration to the documents associated with each
* line.
*
* @since 1.1.2
*/
ReadJsonFilesOptions jsonLinesRaw();

ReadJsonFilesOptions encoding(String encoding);

ReadJsonFilesOptions uriIncludeFilePath(boolean value);
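
A minimal sketch of how the new reader options might be invoked through the Flux API; the
`Flux.importAggregateJsonFiles()` entry point and the `connectionString`/`from`/`execute` builder methods are
assumptions based on the Flux API guide, while `jsonLinesRaw()` comes from the interface above:

```
// Hypothetical sketch: the entry point and builder names are assumptions;
// only jsonLines() and jsonLinesRaw() are defined by this diff.
Flux.importAggregateJsonFiles()
    .connectionString("user:password@host:8000")
    .from(options -> options
        .paths("path/to/data")
        // Read each line as its own JSON document, leaving its content as is.
        .jsonLinesRaw())
    .execute();
```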
@@ -33,6 +33,13 @@ interface ReadTriplesDocumentsOptions {
ReadTriplesDocumentsOptions partitionsPerForest(int partitionsPerForest);

ReadTriplesDocumentsOptions logProgress(int interval);

/**
* Read documents at multiple points in time, as opposed to using a consistent snapshot.
*
* @since 1.1.2
*/
ReadTriplesDocumentsOptions noSnapshot();
}

interface WriteRdfFilesOptions extends WriteFilesOptions<WriteRdfFilesOptions> {
@@ -28,4 +28,11 @@ public interface ReadDocumentsOptions<T extends ReadDocumentsOptions> {
T batchSize(int batchSize);

T partitionsPerForest(int partitionsPerForest);

/**
* Read documents at multiple points in time, as opposed to using a consistent snapshot.
*
* @since 1.1.2
*/
T noSnapshot();
}
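
A similar sketch for the new `noSnapshot()` option, again assuming the entry point and surrounding builder
methods; only `noSnapshot()` itself is defined by this diff:

```
// Hypothetical sketch: exportGenericFiles(), collections(), and path() are assumptions.
Flux.exportGenericFiles()
    .connectionString("user:password@host:8000")
    .from(options -> options
        .collections("example")
        // Accept reads at multiple points in time rather than a consistent snapshot.
        .noSnapshot())
    .to(options -> options.path("destination/files"))
    .execute();
```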
32 changes: 26 additions & 6 deletions flux-cli/src/main/java/com/marklogic/flux/cli/Main.java
@@ -16,6 +16,8 @@
import org.slf4j.LoggerFactory;
import picocli.CommandLine;

import java.io.PrintWriter;

@CommandLine.Command(
name = "./bin/flux",

@@ -98,12 +100,7 @@ private int executeCommand(CommandLine.ParseResult parseResult) {
}
command.execute(session);
} catch (Exception ex) {
-if (parseResult.subcommand().hasMatchedOption("--stacktrace")) {
-    logger.error("Displaying stacktrace due to use of --stacktrace option", ex);
-}
-String message = removeStacktraceFromExceptionMessage(ex);
-parseResult.commandSpec().commandLine().getErr()
-    .println(String.format("%nCommand failed, cause: %s", message));
+printException(parseResult, ex);
return CommandLine.ExitCode.SOFTWARE;
}
return CommandLine.ExitCode.OK;
@@ -121,6 +118,18 @@ protected SparkSession buildSparkSession(Command selectedCommand) {
SparkUtil.buildSparkSession();
}

private void printException(CommandLine.ParseResult parseResult, Exception ex) {
if (parseResult.subcommand().hasMatchedOption("--stacktrace")) {
logger.error("Displaying stacktrace due to use of --stacktrace option", ex);
}
String message = removeStacktraceFromExceptionMessage(ex);
PrintWriter stderr = parseResult.commandSpec().commandLine().getErr();
stderr.println(String.format("%nCommand failed, cause: %s", message));
if (message != null && message.contains("XDMP-OLDSTAMP")) {
printMessageForTimestampError(stderr);
}
}

/**
* In some errors from our connector, such as when the custom code reader invokes invalid code,
* Spark will oddly put the entire stacktrace into the exception message. Showing that stacktrace isn't a
@@ -148,4 +157,15 @@ private String removeStacktraceFromExceptionMessage(Exception ex) {
private boolean isStacktraceLine(String line) {
return line != null && line.trim().startsWith("at ");
}

/**
* A user can encounter an OLDSTAMP error when exporting data with a consistent snapshot, but it can be difficult
* to know how to resolve the error. Thus, additional information is printed to help the user with resolving this
* error.
*/
private void printMessageForTimestampError(PrintWriter stderr) {
stderr.println(String.format("To resolve an XDMP-OLDSTAMP error, consider using the --no-snapshot option " +
"or consult the Flux documentation at https://marklogic.github.io/flux/ for " +
"information on configuring your database to support point-in-time queries."));
}
}