Merge pull request #333 from marklogic/release/1.2.0
Merge release/1.2.0 into main
rjrudin authored Dec 19, 2024
2 parents 47594c0 + ef23542 commit d1640c4
Showing 156 changed files with 4,339 additions and 672 deletions.
3 changes: 1 addition & 2 deletions .gitignore
@@ -9,7 +9,6 @@ flux/lib
flux/ext
flux/bin
flux/conf
flux/export
export
flux-cli/src/dist/ext/*.jar
flux-version.properties
docker/sonarqube
40 changes: 33 additions & 7 deletions CONTRIBUTING.md
@@ -3,10 +3,15 @@ application installed:

1. Ensure you have Java 11 or higher installed; you will need Java 17 if you wish to use the Sonarqube support described below.
2. Clone this repository if you have not already.
3. From the root directory of the project, run `docker-compose up -d --build`.
3. From the root directory of the project, run `docker compose up -d --build`.
4. Wait 10 to 20 seconds and verify that <http://localhost:8001> shows the MarkLogic admin screen before proceeding.
5. Run `./gradlew -i mlDeploy` to deploy this project's test application.

Next, run the following to pull a small model for the test instance of Ollama to use; this will be used by one or more
embedder tests:

docker exec -it flux-ollama-1 ollama pull all-minilm
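
To confirm the model was pulled successfully, you can list the models available in the container (an optional check, not a required step):

docker exec -it flux-ollama-1 ollama list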

Some of the tests depend on the Postgres instance deployed via Docker. Follow these steps to load a sample dataset
into it:

@@ -39,6 +44,22 @@ If you would like to test out the Flux distribution - as either a tar or zip - p

You can now run `./bin/flux` to test out various commands.

If you're testing with the project at `./examples/getting-started`, you can run the following to install Flux in that
directory, thus allowing you to test out the examples in that project:

./gradlew buildToolForGettingStarted

If you wish to build the Flux zip with all the embedding model integration JARs included, you must first run the
`copyEmbeddingModelJarsIntoDistribution` task. That name is intentionally verbose, but it's a lot to type, so take
advantage of Gradle's ability to abbreviate task names:

./gradlew copyemb distZip
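
For reference, the abbreviated task names above expand to the full names, so the command is equivalent to:

./gradlew copyEmbeddingModelJarsIntoDistribution distZip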

You can also run the following to include the integration JARs in the Flux installation in the `examples/getting-started`
project (again taking advantage of Gradle's ability to abbreviate task names):

./gradlew copyemb buildtoolfor

## Configuring the version

You can specify a version for Flux when building Flux via any of the following:
@@ -70,12 +91,17 @@ If you are running the tests in Intellij with Java 17, you will need to perform
--add-opens java.base/sun.util.calendar=ALL-UNNAMED
--add-opens java.base/java.io=ALL-UNNAMED
--add-opens java.base/sun.nio.cs=ALL-UNNAMED
--add-opens java.base/sun.security.action=ALL-UNNAMED
```

When you run one or more tests, the above configuration template settings will be used, allowing all Flux tests to
pass on Java 17. If you are running a test configuration that you ran prior to making the changes, you will need to
delete that configuration first via the "Run -> Edit Configurations" panel.

If you are running tests in Intellij using Intellij's own test runner rather than the Gradle wrapper, you will also need to
run `./gradlew shadowJar` first to ensure a couple of shadow jars required by some of the `flux-cli` tests are created. You
do not need to do this if you have Intellij configured to use Gradle to run tests.

## Generating code quality reports with SonarQube

In order to use SonarQube, you must have used Docker to run this project's `docker-compose.yml` file, and you must
@@ -92,18 +118,18 @@ To configure the SonarQube service, perform the following steps:
7. Click on "Use the global setting" and then "Create project".
8. On the "Analysis Method" page, click on "Locally".
9. In the "Provide a token" panel, click on "Generate". Copy the token.
10. Add `systemProp.sonar.token=your token pasted here` to `gradle-local.properties` in the root of your project, creating
10. Add `systemProp.sonar.login=your token pasted here` to `gradle-local.properties` in the root of your project, creating
that file if it does not exist yet.
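
For reference, `gradle-local.properties` would then contain a single line like the following (the property name, `sonar.token` or `sonar.login`, depends on your SonarQube version; the value is a placeholder):

systemProp.sonar.login=paste-your-generated-token-here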

To run SonarQube, run the following Gradle tasks with Java 17 or higher, which will run all the tests with code
coverage and then generate a quality report with SonarQube:

./gradlew test sonar

If you do not add `systemProp.sonar.token` to your `gradle-local.properties` file, you can specify the token via the
If you do not add `systemProp.sonar.login` to your `gradle-local.properties` file, you can specify the token via the
following:

./gradlew test sonar -Dsonar.token=paste your token here
./gradlew test sonar -Dsonar.login=paste your token here

When that completes, you will see a line like this near the end of the logging:

@@ -256,15 +282,15 @@ are all synonyms):

./gradlew shadowJar

This will produce an assembly jar at `./flux-cli/build/libs/marklogic-flux-1.1-SNAPSHOT-all.jar`.
This will produce an assembly jar at `./flux-cli/build/libs/marklogic-flux-1.2-SNAPSHOT-all.jar`.

You can now run any CLI command via spark-submit. This is an example of previewing an import of files - change the value
of `--path`, as an absolute path is needed, and of course change the value of `--master` to match that of your Spark
cluster:

```
$SPARK_HOME/bin/spark-submit --class com.marklogic.flux.spark.Submit \
--master spark://NYWHYC3G0W:7077 flux-cli/build/libs/marklogic-flux-1.1-SNAPSHOT-all.jar \
--master spark://NYWHYC3G0W:7077 flux-cli/build/libs/marklogic-flux-1.2-SNAPSHOT-all.jar \
import-files --path /Users/rudin/workspace/flux/flux-cli/src/test/resources/mixed-files \
--connection-string "admin:admin@localhost:8000" \
--preview 5 --preview-drop content
@@ -281,7 +307,7 @@ to something you can access):
$SPARK_HOME/bin/spark-submit --class com.marklogic.flux.spark.Submit \
--packages org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hadoop:hadoop-client:3.3.4 \
--master spark://NYWHYC3G0W:7077 \
flux-cli/build/libs/marklogic-flux-1.1-SNAPSHOT-all.jar \
flux-cli/build/libs/marklogic-flux-1.2-SNAPSHOT-all.jar \
import-files --path "s3a://changeme/" \
--connection-string "admin:admin@localhost:8000" \
--s3-add-credentials \
1 change: 1 addition & 0 deletions Jenkinsfile
@@ -6,6 +6,7 @@ def runtests(){
mkdir -p $WORKSPACE/flux/docker/sonarqube;
docker-compose up -d --build;
sleep 30s;
curl "http://localhost:8008/api/pull" -d '{"model":"all-minilm"}'
'''
script{
timeout(time: 60, unit: 'SECONDS') {
9 changes: 4 additions & 5 deletions NOTICE.txt
@@ -1,4 +1,4 @@
MarkLogic® Flux™ v1
MarkLogic® Flux™

Copyright © 2024 MarkLogic Corporation. All Rights Reserved.

@@ -9,7 +9,7 @@ Third Party Notices
aws-java-sdk-s3 1.12.262 (Apache-2.0)
hadoop-aws 3.3.4 (Apache-2.0)
hadoop-client 3.3.4 (Apache-2.0)
marklogic-spark-connector 2.4.0 (Apache-2.0)
marklogic-spark-connector 2.5.0 (Apache-2.0)
picocli 4.7.6 (Apache-2.0)
spark-avro_2.12 3.5.3 (Apache-2.0)
spark-sql_2.12 3.5.3 (Apache-2.0)
@@ -20,13 +20,12 @@ Apache License 2.0 (Apache-2.0)

Third-Party Components

The following is a list of the third-party components used by MarkLogic® Flux™ v1 (last updated July 2, 2024):
The following is a list of the third-party components used by MarkLogic® Flux™ 1.2.0 (last updated December 17, 2024):

aws-java-sdk-s3 1.12.262 (Apache-2.0)
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)


hadoop-aws 3.3.4 (Apache-2.0)
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
@@ -35,7 +34,7 @@ hadoop-client 3.3.4 (Apache-2.0)
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

marklogic-spark-connector 2.34.0(Apache-2.0)
marklogic-spark-connector 2.5.0 (Apache-2.0)
https://repo1.maven.org/maven2/com/marklogic/marklogic-spark-connector
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

9 changes: 9 additions & 0 deletions build.gradle
@@ -16,6 +16,15 @@ subprojects {
}
}

configurations.all {
resolutionStrategy.eachDependency { DependencyResolveDetails details ->
if (details.requested.group.startsWith('com.fasterxml.jackson')) {
details.useVersion '2.15.2'
details.because 'Need to match the version used by Spark.'
}
}
}

test {
useJUnitPlatform()
testLogging {
11 changes: 9 additions & 2 deletions docker-compose.yml
@@ -15,7 +15,7 @@ services:
- 8007:8007

marklogic:
image: "progressofficial/marklogic-db:11.3.0-ubi"
image: "ml-docker-db-dev-tierpoint.bed-artifactory.bedford.progress.com/marklogic/marklogic-server-ubi:latest-12"
platform: linux/amd64
environment:
- MARKLOGIC_INIT=true
@@ -53,7 +53,7 @@

# Copied from https://docs.sonarsource.com/sonarqube/latest/setup-and-upgrade/install-the-server/#example-docker-compose-configuration .
sonarqube:
image: sonarqube:10.6.0-community
image: sonarqube:lts-community
depends_on:
- postgres
environment:
@@ -67,6 +67,13 @@
ports:
- "9000:9000"

# Using Ollama for testing an embedding model.
# See https://github.com/ollama/ollama for more information.
ollama:
image: "ollama/ollama"
ports:
- 8008:11434

volumes:
sonarqube_data:
sonarqube_extensions:
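The Ollama service defined above maps container port 11434 to host port 8008. As an optional sanity check after `docker compose up`, you can verify the service is reachable from the host (a suggested verification, not part of the repository's instructions):

curl http://localhost:8008/api/tags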
32 changes: 7 additions & 25 deletions docs/api.md
@@ -22,15 +22,15 @@ To add Flux as a dependency to your application, add the following to your Maven
<dependency>
<groupId>com.marklogic</groupId>
<artifactId>flux-api</artifactId>
<version>1.1.3</version>
<version>1.2.0</version>
</dependency>
```

Or if you are using Gradle, add the following to your `build.gradle` file:

```
dependencies {
implementation "com.marklogic:flux-api:1.1.3"
implementation "com.marklogic:flux-api:1.2.0"
}
```

@@ -97,7 +97,7 @@ buildscript {
mavenCentral()
}
dependencies {
classpath "com.marklogic:flux-api:1.1.3"
classpath "com.marklogic:flux-api:1.2.0"
}
}
```
@@ -127,25 +127,7 @@ when running Gradle. For example, if you run Gradle with `--stacktrace` and see
The [Gradle documentation](https://docs.gradle.org/current/userguide/build_environment.html) provides more information
on the `org.gradle.jvmargs` property along with other ways to customize the Gradle environment.
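
As an illustration only, `org.gradle.jvmargs` can be set in your project's `gradle.properties` file; the value below is an assumed example, not a recommendation:

```
# Example only: allocate more memory to the Gradle JVM.
org.gradle.jvmargs=-Xmx2g
```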

If you are using a plugin like [ml-gradle](https://github.com/marklogic/ml-gradle) that brings in its own version of the
[FasterXML Jackson APIs](https://github.com/FasterXML/jackson), you need to be sure that the version of Jackson is
between 2.14.0 and 2.15.0 as required by the Apache Spark dependency of Flux. The following shows an example of excluding
these dependencies from ml-gradle in a `build.gradle` file so that ml-gradle will use the Jackson APIs brought in via
Flux:

```
buildscript {
repositories {
mavenCentral()
}
dependencies {
classpath "com.marklogic:flux-api:1.1.3"
classpath("com.marklogic:ml-gradle:4.8.0") {
exclude group: "com.fasterxml.jackson.databind"
exclude group: "com.fasterxml.jackson.core"
exclude group: "com.fasterxml.jackson.dataformat"
}
}
}
```

Please note that you cannot yet use the Flux API in your Gradle buildscript when you are also using the
MarkLogic [ml-gradle plugin](https://github.com/marklogic/ml-gradle). This is due to a classpath conflict, where the
MarkLogic Spark connector used by Flux must alter an underlying library so as not to conflict with Spark itself - but
that altered library then conflicts with ml-gradle. We will have a resolution for this soon.
64 changes: 42 additions & 22 deletions docs/common-options.md
@@ -123,7 +123,7 @@ to the next line:

```
--query
"op.fromView('Example', 'Employees', '')\
"op.fromView('example', 'employees', '')\
.limit(10)"
```

@@ -263,12 +263,12 @@ All available connection options are shown in the table below:

| Option | Description |
| --- | --- |
| `--auth-type` | Type of authentication to use. Possible values are `BASIC`, `DIGEST`, `CLOUD`, `KERBEROS`, `CERTIFICATE`, and `SAML`.|
| `--auth-type` | Type of authentication to use. Possible values are `BASIC`, `DIGEST`, `CLOUD`, `CERTIFICATE`, `KERBEROS`, `OAUTH`, and `SAML`.|
| `--base-path` | Path to prepend to each call to a MarkLogic [REST API app server](https://docs.marklogic.com/guide/rest-dev). |
| `--certificate-file` | File path for a keystore to be used for `CERTIFICATE` authentication. |
| `--certificate-password` | Password for the keystore referenced by `--certificate-file`. |
| `--connection-string` | Defines a connection string as user:password@host:port/optionalDatabaseName; only usable when using `DIGEST` or `BASIC` authentication. |
| `--cloud-api-key` | API key for authenticating with a MarkLogic Cloud cluster when authentication type is `CLOUD`. |
| `--cloud-api-key` | API key for authenticating with a Progress Data Cloud cluster when authentication type is `CLOUD`. |
| `--connection-type` | Set to `DIRECT` if connections can be made directly to each host in the MarkLogic cluster. Defaults to `GATEWAY`. Possible values are `DIRECT` and `GATEWAY`. |
| `--database` | Name of a database to connect if it differs from the one associated with the app server identified by `--port`. |
| `--disable-gzipped-responses` | If included, responses from MarkLogic will not be gzipped. May improve performance when responses are very small.
@@ -278,6 +278,7 @@ All available connection options are shown in the table below:
| `--keystore-password` | Password for the keystore identified by `--keystore-path`. |
| `--keystore-path` | File path for a keystore for two-way SSL connections. |
| `--keystore-type` | Type of the keystore identified by `--keystore-path`; defaults to `JKS`. |
| `--oauth-token` | Token to be used with `OAUTH` authentication. |
| `--password` | Password when using `DIGEST` or `BASIC` authentication. |
| `--port` | Port of the [REST API app server](https://docs.marklogic.com/guide/rest-dev) to connect to. |
| `--saml-token` | Token to be used with `SAML` authentication. |
@@ -331,7 +332,7 @@ instead of in a table:
{% endtab %}
{% tab log Windows %}
```
./bin/flux import-parquet-files ^
bin\flux import-parquet-files ^
--connection-string "flux-example-user:password@localhost:8004" ^
--path export\parquet ^
--preview 10 ^
@@ -355,7 +356,7 @@ that Flux log the schema and not write any data:
```
./bin/flux export-parquet-files \
--connection-string "flux-example-user:password@localhost:8004" \
--query "op.fromView('Example', 'Employees')" \
--query "op.fromView('example', 'employees')" \
--path export/parquet \
--preview-schema
```
@@ -364,7 +365,7 @@ that Flux log the schema and not write any data:
```
bin\flux export-parquet-files ^
--connection-string "flux-example-user:password@localhost:8004" ^
--query "op.fromView('Example', 'Employees')" ^
--query "op.fromView('example', 'employees')" ^
--path export\parquet ^
--preview-schema
```
@@ -488,27 +489,46 @@ time you run Flux:
Flux is built on top of [Apache Spark](https://spark.apache.org/) and provides a number of command line options for
configuring the underlying Spark runtime environment used by Flux.

### Configuring the number of partitions
### Configuring Spark worker threads

Flux uses Spark partitions to allow for data to be read and written in parallel. Each partition can be thought of as
a separate worker, operating in parallel with each other worker.
By default, Flux creates a Spark runtime with a master URL of `local[*]`, which runs Spark with as many worker
threads as logical cores on the machine running Flux. The number of worker threads affects how many partitions can be
processed in parallel. You can change this setting via the `--spark-master-url` option; please see
[the Spark documentation](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls) for examples
of valid values. If you are looking to run a Flux command on a remote Spark cluster, please instead see the
[Spark Integration guide](spark-integration.md) for details on integrating Flux with `spark-submit`.
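
As an illustrative sketch, the export example shown earlier on this page could be run with eight worker threads by overriding the master URL (the thread count here is arbitrary):

```
./bin/flux export-parquet-files \
  --connection-string "flux-example-user:password@localhost:8004" \
  --query "op.fromView('example', 'employees')" \
  --path export/parquet \
  --spark-master-url "local[8]"
```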

A number of partitions will be determined by the command that you run before it reads data. The nature of the data
source directly impacts the number of partitions that will be created.
For import commands, you typically will not need to adjust this, as a partition writer in an import command supports its
own pool of threads via the [MarkLogic data movement library](https://docs.marklogic.com/guide/java/data-movement). However,
depending on the data source, additional worker threads may help with reading data in parallel.

If you find that an insufficient number of partitions are created - i.e. the writer phase of your Flux command is not
sending as much data to MarkLogic as it could - consider using the `--repartition` option to force a number of
partitions to be created after the data has been read. The downside to using `--repartition` is that all the data must
be read first. Generally, this option will help when data can be read quickly and the performance of writing can be
improved by using more partitions than were created when reading data.
For the [`reprocess` command](reprocess.md), setting the number of worker threads is critical to achieving optimal
performance. As of Flux 1.2.0, the `--thread-count` option will adjust the Spark master URL based on the number of
threads you specify. Prior to Flux 1.2.0, you can use `--repartition` to achieve the same effect.

### Configuring a Spark URL
For exporting data, please see the [exporting guide](export/export.md) for information on how to adjust the worker
threads depending on whether you are reading documents or rows from MarkLogic.

By default, Flux creates a Spark session with a master URL of `local[*]`. You can change this via the
`--spark-master-url` option; please see
[the Spark documentation](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls) for examples
of valid values. If you are looking to run a Flux command on a remote Spark cluster, please instead see the
[Spark Integration guide](spark-integration.md) for details on integrating Flux with `spark-submit`.
### Configuring the number of Spark partitions

Flux uses Spark partitions to allow for data to be read and written in parallel. Each partition can be thought of as
a separate worker, operating in parallel with each other worker.

A number of partitions will be determined by the command that you run before it reads data. The nature of the data
source directly impacts the number of partitions that will be created.

For some commands, you may find improved performance by changing the number of partitions used to write data to the
target associated with the command. For example, an `export-jdbc` command may only need a small number of partitions to
read data from MarkLogic, but performance will be improved by using a far higher number of partitions to write data to
the JDBC destination. You can use the `--repartition` option to force the number of partitions to use for writing data.
The downside to this option is that it forces Flux to read all the data from the data source before writing any to the
target. Generally, this option will help when data can be read quickly and the performance of writing can be
improved by using more partitions than were created when reading data - this is almost always the case for the
`reprocess` command.

As of Flux 1.2.0, setting `--repartition` will default the value of the `--spark-master-url` option to be `local[N]`,
where `N` is the value of `--repartition`. This ensures that each partition writer has a Spark worker thread available
to it. You can still override `--spark-master-url` if you wish.
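
As a sketch of how the option is applied (reusing the export example from earlier on this page rather than `export-jdbc`, and with an arbitrary partition count):

```
./bin/flux export-parquet-files \
  --connection-string "flux-example-user:password@localhost:8004" \
  --query "op.fromView('example', 'employees')" \
  --path export/parquet \
  --repartition 16
```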

### Configuring the Spark runtime
