
Commit d1640c4

Merge pull request #333 from marklogic/release/1.2.0
Merge release/1.2.0 into main
2 parents 47594c0 + ef23542 · commit d1640c4

156 files changed: +4339 -672 lines


.gitignore

+1 -2
@@ -9,7 +9,6 @@ flux/lib
 flux/ext
 flux/bin
 flux/conf
-flux/export
-export
+flux-cli/src/dist/ext/*.jar
 flux-version.properties
 docker/sonarqube

CONTRIBUTING.md

+33 -7
@@ -3,10 +3,15 @@ application installed:
 
 1. Ensure you have Java 11 or higher installed; you will need Java 17 if you wish to use the Sonarqube support described below.
 2. Clone this repository if you have not already.
-3. From the root directory of the project, run `docker-compose up -d --build`.
+3. From the root directory of the project, run `docker compose up -d --build`.
 4. Wait 10 to 20 seconds and verify that <http://localhost:8001> shows the MarkLogic admin screen before proceeding.
 5. Run `./gradlew -i mlDeploy` to deploy this project's test application.
 
+Next, run the following to pull a small model for the test instance of Ollama to use; this will be used by one or more
+embedder tests:
+
+    docker exec -it flux-ollama-1 ollama pull all-minilm
+
 Some of the tests depend on the Postgres instance deployed via Docker. Follow these steps to load a sample dataset
 into it:
 
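To verify that the model was pulled successfully, you can list the models available in the test container (a quick check, using the `flux-ollama-1` container name from the step above):

```
docker exec -it flux-ollama-1 ollama list
```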
@@ -39,6 +44,22 @@ If you would like to test out the Flux distribution - as either a tar or zip - p
 
 You can now run `./bin/flux` to test out various commands.
 
+If you're testing with the project at `./examples/getting-started`, you can run the following to install Flux in that
+directory, thus allowing you to test out the examples in that project:
+
+    ./gradlew buildToolForGettingStarted
+
+If you wish to build the Flux zip with all the embedding model integration JARs included, you must first run the
+`copyEmbeddingModelJarsIntoDistribution` task. That name is intentionally verbose, but it's a lot to type, so take
+advantage of Gradle's support for abbreviated task names:
+
+    ./gradlew copyemb distZip
+
+You can also do the following to include the integration JARs in the Flux installation in the `examples/getting-started`
+project (again taking advantage of Gradle's abbreviated task names):
+
+    ./gradlew copyemb buildtoolfor
+
 ## Configuring the version
 
 You can specify a version for Flux when building Flux via any of the following:
@@ -70,12 +91,17 @@ If you are running the tests in Intellij with Java 17, you will need to perform
 --add-opens java.base/sun.util.calendar=ALL-UNNAMED
 --add-opens java.base/java.io=ALL-UNNAMED
 --add-opens java.base/sun.nio.cs=ALL-UNNAMED
+--add-opens java.base/sun.security.action=ALL-UNNAMED
 ```
 
 When you run one or more tests, the above configuration template settings will be used, allowing all Flux tests to
 pass on Java 17. If you are running a test configuration that you ran prior to making the changes, you will need to
 delete that configuration first via the "Run -> Edit Configurations" panel.
 
+If you are running tests with Intellij's own test runner instead of the Gradle wrapper, you will also need to run
+`./gradlew shadowJar` first to ensure that a couple of shadow jars required by some of the `flux-cli` tests are created.
+You do not need to do this if you have Intellij configured to run tests via Gradle.
+
 ## Generating code quality reports with SonarQube
 
 In order to use SonarQube, you must have used Docker to run this project's `docker-compose.yml` file, and you must
@@ -92,18 +118,18 @@ To configure the SonarQube service, perform the following steps:
 7. Click on "Use the global setting" and then "Create project".
 8. On the "Analysis Method" page, click on "Locally".
 9. In the "Provide a token" panel, click on "Generate". Copy the token.
-10. Add `systemProp.sonar.token=your token pasted here` to `gradle-local.properties` in the root of your project, creating
+10. Add `systemProp.sonar.login=your token pasted here` to `gradle-local.properties` in the root of your project, creating
 that file if it does not exist yet.
 
 To run SonarQube, run the following Gradle tasks with Java 17 or higher, which will run all the tests with code
 coverage and then generate a quality report with SonarQube:
 
     ./gradlew test sonar
 
-If you do not add `systemProp.sonar.token` to your `gradle-local.properties` file, you can specify the token via the
+If you do not add `systemProp.sonar.login` to your `gradle-local.properties` file, you can specify the token via the
 following:
 
-    ./gradlew test sonar -Dsonar.token=paste your token here
+    ./gradlew test sonar -Dsonar.login=paste your token here
 
 When that completes, you will see a line like this near the end of the logging:
 
@@ -256,15 +282,15 @@ are all synonyms):
 
     ./gradlew shadowJar
 
-This will produce an assembly jar at `./flux-cli/build/libs/marklogic-flux-1.1-SNAPSHOT-all.jar`.
+This will produce an assembly jar at `./flux-cli/build/libs/marklogic-flux-1.2-SNAPSHOT-all.jar`.
 
 You can now run any CLI command via spark-submit. This is an example of previewing an import of files - change the value
 of `--path`, as an absolute path is needed, and of course change the value of `--master` to match that of your Spark
 cluster:
 
 ```
 $SPARK_HOME/bin/spark-submit --class com.marklogic.flux.spark.Submit \
-  --master spark://NYWHYC3G0W:7077 flux-cli/build/libs/marklogic-flux-1.1-SNAPSHOT-all.jar \
+  --master spark://NYWHYC3G0W:7077 flux-cli/build/libs/marklogic-flux-1.2-SNAPSHOT-all.jar \
   import-files --path /Users/rudin/workspace/flux/flux-cli/src/test/resources/mixed-files \
   --connection-string "admin:admin@localhost:8000" \
   --preview 5 --preview-drop content

@@ -281,7 +307,7 @@ to something you can access):
 $SPARK_HOME/bin/spark-submit --class com.marklogic.flux.spark.Submit \
   --packages org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hadoop:hadoop-client:3.3.4 \
   --master spark://NYWHYC3G0W:7077 \
-  flux-cli/build/libs/marklogic-flux-1.1-SNAPSHOT-all.jar \
+  flux-cli/build/libs/marklogic-flux-1.2-SNAPSHOT-all.jar \
   import-files --path "s3a://changeme/" \
   --connection-string "admin:admin@localhost:8000" \
   --s3-add-credentials \

Jenkinsfile

+1
@@ -6,6 +6,7 @@ def runtests(){
     mkdir -p $WORKSPACE/flux/docker/sonarqube;
     docker-compose up -d --build;
     sleep 30s;
+    curl "http://localhost:8008/api/pull" -d '{"model":"all-minilm"}'
 '''
 script{
     timeout(time: 60, unit: 'SECONDS') {

NOTICE.txt

+4 -5
@@ -1,4 +1,4 @@
-MarkLogic® Flux™ v1
+MarkLogic® Flux™
 
 Copyright © 2024 MarkLogic Corporation. All Rights Reserved.
 

@@ -9,7 +9,7 @@ Third Party Notices
 aws-java-sdk-s3 1.12.262 (Apache-2.0)
 hadoop-aws 3.3.4 (Apache-2.0)
 hadoop-client 3.3.4 (Apache-2.0)
-marklogic-spark-connector 2.4.0 (Apache-2.0)
+marklogic-spark-connector 2.5.0 (Apache-2.0)
 picocli 4.7.6 (Apache-2.0)
 spark-avro_2.12 3.5.3 (Apache-2.0)
 spark-sql_2.12 3.5.3 (Apache-2.0)

@@ -20,13 +20,12 @@ Apache License 2.0 (Apache-2.0)
 
 Third-Party Components
 
-The following is a list of the third-party components used by MarkLogic® Flux™ v1 (last updated July 2, 2024):
+The following is a list of the third-party components used by MarkLogic® Flux™ 1.2.0 (last updated December 17, 2024):
 
 aws-java-sdk-s3 1.12.262 (Apache-2.0)
 https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
-
 hadoop-aws 3.3.4 (Apache-2.0)
 https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

@@ -35,7 +34,7 @@ hadoop-client 3.3.4 (Apache-2.0)
 https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
 
-marklogic-spark-connector 2.34.0(Apache-2.0)
+marklogic-spark-connector 2.5.0 (Apache-2.0)
 https://repo1.maven.org/maven2/com/marklogic/marklogic-spark-connector
 For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

build.gradle

+9
@@ -16,6 +16,15 @@ subprojects {
     }
 }
 
+configurations.all {
+    resolutionStrategy.eachDependency { DependencyResolveDetails details ->
+        if (details.requested.group.startsWith('com.fasterxml.jackson')) {
+            details.useVersion '2.15.2'
+            details.because 'Need to match the version used by Spark.'
+        }
+    }
+}
+
 test {
     useJUnitPlatform()
     testLogging {
docker-compose.yml

+9 -2
@@ -15,7 +15,7 @@ services:
       - 8007:8007
 
   marklogic:
-    image: "progressofficial/marklogic-db:11.3.0-ubi"
+    image: "ml-docker-db-dev-tierpoint.bed-artifactory.bedford.progress.com/marklogic/marklogic-server-ubi:latest-12"
     platform: linux/amd64
     environment:
       - MARKLOGIC_INIT=true

@@ -53,7 +53,7 @@ services:
 
   # Copied from https://docs.sonarsource.com/sonarqube/latest/setup-and-upgrade/install-the-server/#example-docker-compose-configuration .
   sonarqube:
-    image: sonarqube:10.6.0-community
+    image: sonarqube:lts-community
     depends_on:
       - postgres
     environment:

@@ -67,6 +67,13 @@ services:
     ports:
       - "9000:9000"
 
+  # Using Ollama for testing an embedding model.
+  # See https://github.com/ollama/ollama for more information.
+  ollama:
+    image: "ollama/ollama"
+    ports:
+      - 8008:11434
+
 volumes:
   sonarqube_data:
   sonarqube_extensions:
docs/api.md

+7 -25
@@ -22,15 +22,15 @@ To add Flux as a dependency to your application, add the following to your Maven
 <dependency>
     <groupId>com.marklogic</groupId>
     <artifactId>flux-api</artifactId>
-    <version>1.1.3</version>
+    <version>1.2.0</version>
 </dependency>
 ```
 
 Or if you are using Gradle, add the following to your `build.gradle` file:
 
 ```
 dependencies {
-    implementation "com.marklogic:flux-api:1.1.3"
+    implementation "com.marklogic:flux-api:1.2.0"
 }
 ```
 

@@ -97,7 +97,7 @@ buildscript {
         mavenCentral()
     }
     dependencies {
-        classpath "com.marklogic:flux-api:1.1.3"
+        classpath "com.marklogic:flux-api:1.2.0"
     }
 }
 ```

@@ -127,25 +127,7 @@ when running Gradle. For example, if you run Gradle with `--stacktrace` and see
 The [Gradle documentation](https://docs.gradle.org/current/userguide/build_environment.html) provides more information
 on the `org.gradle.jvmargs` property along with other ways to customize the Gradle environment.
 
-If you are using a plugin like [ml-gradle](https://github.com/marklogic/ml-gradle) that brings in its own version of the
-[FasterXML Jackson APIs](https://github.com/FasterXML/jackson), you need to be sure that the version of Jackson is
-between 2.14.0 and 2.15.0 as required by the Apache Spark dependency of Flux. The following shows an example of excluding
-these dependencies from ml-gradle in a `build.gradle` file so that ml-gradle will use the Jackson APIs brought in via
-Flux:
-
-```
-buildscript {
-    repositories {
-        mavenCentral()
-    }
-    dependencies {
-        classpath "com.marklogic:flux-api:1.1.3"
-        classpath("com.marklogic:ml-gradle:4.8.0") {
-            exclude group: "com.fasterxml.jackson.databind"
-            exclude group: "com.fasterxml.jackson.core"
-            exclude group: "com.fasterxml.jackson.dataformat"
-        }
-    }
-}
-```
-
+Please note that you cannot yet use the Flux API in your Gradle buildscript when you are also using the
+MarkLogic [ml-gradle plugin](https://github.com/marklogic/ml-gradle). This is due to a classpath conflict, where the
+MarkLogic Spark connector used by Flux must alter an underlying library so as not to conflict with Spark itself - but
+that altered library then conflicts with ml-gradle. We will have a resolution for this soon.
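For context on the `flux-api` dependency above, here is a minimal sketch of calling the Flux API from plain Java. The connection details and paths are hypothetical, and the fluent method names reflect the published Flux API documentation; verify them against the version you depend on:

```java
import com.marklogic.flux.api.Flux;

public class ImportFilesExample {
    public static void main(String[] args) {
        // Hypothetical connection string and source path - adjust to your environment.
        Flux.importGenericFiles()
            .connectionString("flux-example-user:password@localhost:8004")
            .from(options -> options.paths("data/mixed-files"))
            .to(options -> options.collections("api-example"))
            .execute();
    }
}
```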

docs/common-options.md

+42 -22
@@ -123,7 +123,7 @@ to the next line:
 
 ```
 --query
-"op.fromView('Example', 'Employees', '')\
+"op.fromView('example', 'employees', '')\
 .limit(10)"
 ```
 

@@ -263,12 +263,12 @@ All available connection options are shown in the table below:
 
 | Option | Description |
 | --- | --- |
-| `--auth-type` | Type of authentication to use. Possible values are `BASIC`, `DIGEST`, `CLOUD`, `KERBEROS`, `CERTIFICATE`, and `SAML`. |
+| `--auth-type` | Type of authentication to use. Possible values are `BASIC`, `DIGEST`, `CLOUD`, `CERTIFICATE`, `KERBEROS`, `OAUTH`, and `SAML`. |
 | `--base-path` | Path to prepend to each call to a MarkLogic [REST API app server](https://docs.marklogic.com/guide/rest-dev). |
 | `--certificate-file` | File path for a keystore to be used for `CERTIFICATE` authentication. |
 | `--certificate-password` | Password for the keystore referenced by `--certificate-file`. |
 | `--connection-string` | Defines a connection string as user:password@host:port/optionalDatabaseName; only usable when using `DIGEST` or `BASIC` authentication. |
-| `--cloud-api-key` | API key for authenticating with a MarkLogic Cloud cluster when authentication type is `CLOUD`. |
+| `--cloud-api-key` | API key for authenticating with a Progress Data Cloud cluster when authentication type is `CLOUD`. |
 | `--connection-type` | Set to `DIRECT` if connections can be made directly to each host in the MarkLogic cluster. Defaults to `GATEWAY`. Possible values are `DIRECT` and `GATEWAY`. |
 | `--database` | Name of a database to connect if it differs from the one associated with the app server identified by `--port`. |
 | `--disable-gzipped-responses` | If included, responses from MarkLogic will not be gzipped. May improve performance when responses are very small. |

@@ -278,6 +278,7 @@ All available connection options are shown in the table below:
 | `--keystore-password` | Password for the keystore identified by `--keystore-path`. |
 | `--keystore-path` | File path for a keystore for two-way SSL connections. |
 | `--keystore-type` | Type of the keystore identified by `--keystore-path`; defaults to `JKS`. |
+| `--oauth-token` | Token to be used with `OAUTH` authentication. |
 | `--password` | Password when using `DIGEST` or `BASIC` authentication. |
 | `--port` | Port of the [REST API app server](https://docs.marklogic.com/guide/rest-dev) to connect to. |
 | `--saml-token` | Token to be used with `SAML` authentication. |
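As a hedged illustration of the new `OAUTH` support (hypothetical token value; assumes the standard `--host` option that accompanies `--port`, since `--connection-string` only supports `BASIC` and `DIGEST`):

```
./bin/flux import-files \
    --host localhost \
    --port 8004 \
    --auth-type OAUTH \
    --oauth-token "your-oauth-token-here" \
    --path data/files
```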
@@ -331,7 +332,7 @@ instead of in a table:
 {% endtab %}
 {% tab log Windows %}
 ```
-./bin/flux import-parquet-files ^
+bin\flux import-parquet-files ^
     --connection-string "flux-example-user:password@localhost:8004" ^
     --path export\parquet ^
     --preview 10 ^

@@ -355,7 +356,7 @@ that Flux log the schema and not write any data:
 ```
 ./bin/flux export-parquet-files \
     --connection-string "flux-example-user:password@localhost:8004" \
-    --query "op.fromView('Example', 'Employees')" \
+    --query "op.fromView('example', 'employees')" \
     --path export/parquet \
     --preview-schema
 ```

@@ -364,7 +365,7 @@ that Flux log the schema and not write any data:
 ```
 bin\flux export-parquet-files ^
     --connection-string "flux-example-user:password@localhost:8004" ^
-    --query "op.fromView('Example', 'Employees')" ^
+    --query "op.fromView('example', 'employees')" ^
     --path export\parquet ^
     --preview-schema
 ```
@@ -488,27 +489,46 @@ time you run Flux:
 Flux is built on top of [Apache Spark](https://spark.apache.org/) and provides a number of command line options for
 configuring the underlying Spark runtime environment used by Flux.
 
-### Configuring the number of partitions
+### Configuring Spark worker threads
 
-Flux uses Spark partitions to allow for data to be read and written in parallel. Each partition can be thought of as
-a separate worker, operating in parallel with each other worker.
+By default, Flux creates a Spark runtime with a master URL of `local[*]`, which runs Spark with as many worker
+threads as logical cores on the machine running Flux. The number of worker threads affects how many partitions can be
+processed in parallel. You can change this setting via the `--spark-master-url` option; please see
+[the Spark documentation](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls) for examples
+of valid values. If you are looking to run a Flux command on a remote Spark cluster, please instead see the
+[Spark Integration guide](spark-integration.md) for details on integrating Flux with `spark-submit`.
 
-A number of partitions will be determined by the command that you run before it reads data. The nature of the data
-source directly impacts the number of partitions that will be created.
+For import commands, you typically will not need to adjust this, as a partition writer in an import command supports its
+own pool of threads via the [MarkLogic data movement library](https://docs.marklogic.com/guide/java/data-movement). However,
+depending on the data source, additional worker threads may help with reading data in parallel.
 
-If you find that an insufficient number of partitions are created - i.e. the writer phase of your Flux command is not
-sending as much data to MarkLogic as it could - consider using the `--repartition` option to force a number of
-partitions to be created after the data has been read. The downside to using `--repartition` is that all the data must
-be read first. Generally, this option will help when data can be read quickly and the performance of writing can be
-improved by using more partitions than were created when reading data.
+For the [`reprocess` command](reprocess.md), setting the number of worker threads is critical to achieving optimal
+performance. As of Flux 1.2.0, the `--thread-count` option will adjust the Spark master URL based on the number of
+threads you specify. Prior to Flux 1.2.0, you can use `--repartition` to achieve the same effect.
 
-### Configuring a Spark URL
+For exporting data, please see the [exporting guide](export/export.md) for information on how to adjust the worker
+threads depending on whether you are reading documents or rows from MarkLogic.
 
-By default, Flux creates a Spark session with a master URL of `local[*]`. You can change this via the
-`--spark-master-url` option; please see
-[the Spark documentation](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls) for examples
-of valid values. If you are looking to run a Flux command on a remote Spark cluster, please instead see the
-[Spark Integration guide](spark-integration.md) for details on integrating Flux with `spark-submit`.
+### Configuring the number of Spark partitions
+
+Flux uses Spark partitions to allow for data to be read and written in parallel. Each partition can be thought of as
+a separate worker, operating in parallel with each other worker.
+
+A number of partitions will be determined by the command that you run before it reads data. The nature of the data
+source directly impacts the number of partitions that will be created.
+
+For some commands, you may find improved performance by changing the number of partitions used to write data to the
+target associated with the command. For example, an `export-jdbc` command may only need a small number of partitions to
+read data from MarkLogic, but performance will be improved by using a far higher number of partitions to write data to
+the JDBC destination. You can use the `--repartition` option to force the number of partitions to use for writing data.
+The downside to this option is that it forces Flux to read all the data from the data source before writing any to the
+target. Generally, this option will help when data can be read quickly and the performance of writing can be
+improved by using more partitions than were created when reading data - this is almost always the case for the
+`reprocess` command.
+
+As of Flux 1.2.0, setting `--repartition` will default the value of the `--spark-master-url` option to be `local[N]`,
+where `N` is the value of `--repartition`. This ensures that each partition writer has a Spark worker thread available
+to it. You can still override `--spark-master-url` if you wish.
 
 ### Configuring the Spark runtime
 
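To make the `--repartition` guidance above concrete, here is a sketch of an `export-jdbc` run that reads with however many partitions Flux determines and writes with 16 (hypothetical connection string, JDBC URL, and table name; the JDBC target option names are assumptions based on the command's documented purpose):

```
./bin/flux export-jdbc \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('example', 'employees')" \
    --jdbc-url "jdbc:postgresql://localhost/example?user=postgres&password=postgres" \
    --table employees \
    --repartition 16
```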