Merge pull request #333 from marklogic/release/1.2.0
Merge release/1.2.0 into main
rjrudin authored Dec 19, 2024
2 parents 47594c0 + ef23542 commit d1640c4
Showing 156 changed files with 4,339 additions and 672 deletions.
3 changes: 1 addition & 2 deletions .gitignore
@@ -9,7 +9,6 @@ flux/lib
flux/ext
flux/bin
flux/conf
flux/export
export
flux-cli/src/dist/ext/*.jar
flux-version.properties
docker/sonarqube
40 changes: 33 additions & 7 deletions CONTRIBUTING.md
@@ -3,10 +3,15 @@ application installed:

1. Ensure you have Java 11 or higher installed; you will need Java 17 if you wish to use the Sonarqube support described below.
2. Clone this repository if you have not already.
3. From the root directory of the project, run `docker-compose up -d --build`.
3. From the root directory of the project, run `docker compose up -d --build`.
4. Wait 10 to 20 seconds and verify that <http://localhost:8001> shows the MarkLogic admin screen before proceeding.
5. Run `./gradlew -i mlDeploy` to deploy this project's test application.

Next, run the following to pull a small model for the test instance of Ollama to use; this will be used by one or more
embedder tests:

docker exec -it flux-ollama-1 ollama pull all-minilm
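
To confirm the model was pulled successfully, you can list the models available in the container (an optional check, not a required step):

docker exec -it flux-ollama-1 ollama list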

Some of the tests depend on the Postgres instance deployed via Docker. Follow these steps to load a sample dataset
into it:

@@ -39,6 +44,22 @@ If you would like to test out the Flux distribution - as either a tar or zip - p

You can now run `./bin/flux` to test out various commands.

If you're testing with the project at `./examples/getting-started`, you can run the following to install Flux in that
directory, thus allowing you to test out the examples in that project:

./gradlew buildToolForGettingStarted

If you wish to build the Flux zip with all the embedding model integration JARs included, you must first run the
`copyEmbeddingModelJarsIntoDistribution` task. That name is intentionally verbose, but it's a lot to type, so take
advantage of Gradle's ability to abbreviate task names:

./gradlew copyemb distZip
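
For reference, the abbreviated task names above expand to the full names, so the command is equivalent to:

./gradlew copyEmbeddingModelJarsIntoDistribution distZip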

You can also run the following to include the integration JARs in the Flux installation in the `examples/getting-started`
project (again taking advantage of Gradle's ability to abbreviate task names):

./gradlew copyemb buildtoolfor

## Configuring the version

You can specify a version for Flux when building Flux via any of the following:
@@ -70,12 +91,17 @@ If you are running the tests in Intellij with Java 17, you will need to perform
--add-opens java.base/sun.util.calendar=ALL-UNNAMED
--add-opens java.base/java.io=ALL-UNNAMED
--add-opens java.base/sun.nio.cs=ALL-UNNAMED
--add-opens java.base/sun.security.action=ALL-UNNAMED
```

When you run one or more tests, the above configuration template settings will be used, allowing all Flux tests to
pass on Java 17. If you are running a test configuration that you ran prior to making the changes, you will need to
delete that configuration first via the "Run -> Edit Configurations" panel.

If you are running tests in Intellij using Intellij's own test runner rather than the Gradle wrapper, you will also need to
run `./gradlew shadowJar` first to ensure a couple of shadow jars required by some of the `flux-cli` tests are created. You
do not need to do this if you have Intellij configured to use Gradle to run tests.

## Generating code quality reports with SonarQube

In order to use SonarQube, you must have used Docker to run this project's `docker-compose.yml` file, and you must
@@ -92,18 +118,18 @@ To configure the SonarQube service, perform the following steps:
7. Click on "Use the global setting" and then "Create project".
8. On the "Analysis Method" page, click on "Locally".
9. In the "Provide a token" panel, click on "Generate". Copy the token.
10. Add `systemProp.sonar.token=your token pasted here` to `gradle-local.properties` in the root of your project, creating
10. Add `systemProp.sonar.login=your token pasted here` to `gradle-local.properties` in the root of your project, creating
that file if it does not exist yet.
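
For reference, `gradle-local.properties` would then contain a single line like the following (the property name, `sonar.token` or `sonar.login`, depends on your SonarQube version; the value is a placeholder):

systemProp.sonar.login=paste-your-generated-token-here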

To run SonarQube, run the following Gradle tasks with Java 17 or higher, which will run all the tests with code
coverage and then generate a quality report with SonarQube:

./gradlew test sonar

If you do not add `systemProp.sonar.token` to your `gradle-local.properties` file, you can specify the token via the
If you do not add `systemProp.sonar.login` to your `gradle-local.properties` file, you can specify the token via the
following:

./gradlew test sonar -Dsonar.token=paste your token here
./gradlew test sonar -Dsonar.login=paste your token here

When that completes, you will see a line like this near the end of the logging:

@@ -256,15 +282,15 @@ are all synonyms):

./gradlew shadowJar

This will produce an assembly jar at `./flux-cli/build/libs/marklogic-flux-1.1-SNAPSHOT-all.jar`.
This will produce an assembly jar at `./flux-cli/build/libs/marklogic-flux-1.2-SNAPSHOT-all.jar`.

You can now run any CLI command via spark-submit. This is an example of previewing an import of files - change the value
of `--path`, as an absolute path is needed, and of course change the value of `--master` to match that of your Spark
cluster:

```
$SPARK_HOME/bin/spark-submit --class com.marklogic.flux.spark.Submit \
--master spark://NYWHYC3G0W:7077 flux-cli/build/libs/marklogic-flux-1.1-SNAPSHOT-all.jar \
--master spark://NYWHYC3G0W:7077 flux-cli/build/libs/marklogic-flux-1.2-SNAPSHOT-all.jar \
import-files --path /Users/rudin/workspace/flux/flux-cli/src/test/resources/mixed-files \
--connection-string "admin:admin@localhost:8000" \
--preview 5 --preview-drop content
@@ -281,7 +307,7 @@ to something you can access):
$SPARK_HOME/bin/spark-submit --class com.marklogic.flux.spark.Submit \
--packages org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hadoop:hadoop-client:3.3.4 \
--master spark://NYWHYC3G0W:7077 \
flux-cli/build/libs/marklogic-flux-1.1-SNAPSHOT-all.jar \
flux-cli/build/libs/marklogic-flux-1.2-SNAPSHOT-all.jar \
import-files --path "s3a://changeme/" \
--connection-string "admin:admin@localhost:8000" \
--s3-add-credentials \
1 change: 1 addition & 0 deletions Jenkinsfile
@@ -6,6 +6,7 @@ def runtests(){
mkdir -p $WORKSPACE/flux/docker/sonarqube;
docker-compose up -d --build;
sleep 30s;
curl "http://localhost:8008/api/pull" -d '{"model":"all-minilm"}'
'''
script{
timeout(time: 60, unit: 'SECONDS') {
9 changes: 4 additions & 5 deletions NOTICE.txt
@@ -1,4 +1,4 @@
MarkLogic® Flux™ v1
MarkLogic® Flux™

Copyright © 2024 MarkLogic Corporation. All Rights Reserved.

@@ -9,7 +9,7 @@ Third Party Notices
aws-java-sdk-s3 1.12.262 (Apache-2.0)
hadoop-aws 3.3.4 (Apache-2.0)
hadoop-client 3.3.4 (Apache-2.0)
marklogic-spark-connector 2.4.0 (Apache-2.0)
marklogic-spark-connector 2.5.0 (Apache-2.0)
picocli 4.7.6 (Apache-2.0)
spark-avro_2.12 3.5.3 (Apache-2.0)
spark-sql_2.12 3.5.3 (Apache-2.0)
@@ -20,13 +20,12 @@ Apache License 2.0 (Apache-2.0)

Third-Party Components

The following is a list of the third-party components used by MarkLogic® Flux™ v1 (last updated July 2, 2024):
The following is a list of the third-party components used by MarkLogic® Flux™ 1.2.0 (last updated December 17, 2024):

aws-java-sdk-s3 1.12.262 (Apache-2.0)
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)


hadoop-aws 3.3.4 (Apache-2.0)
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)
@@ -35,7 +34,7 @@ hadoop-client 3.3.4 (Apache-2.0)
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

marklogic-spark-connector 2.34.0(Apache-2.0)
marklogic-spark-connector 2.5.0 (Apache-2.0)
https://repo1.maven.org/maven2/com/marklogic/marklogic-spark-connector
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

9 changes: 9 additions & 0 deletions build.gradle
@@ -16,6 +16,15 @@ subprojects {
}
}

configurations.all {
resolutionStrategy.eachDependency { DependencyResolveDetails details ->
if (details.requested.group.startsWith('com.fasterxml.jackson')) {
details.useVersion '2.15.2'
details.because 'Need to match the version used by Spark.'
}
}
}

test {
useJUnitPlatform()
testLogging {
11 changes: 9 additions & 2 deletions docker-compose.yml
@@ -15,7 +15,7 @@ services:
- 8007:8007

marklogic:
image: "progressofficial/marklogic-db:11.3.0-ubi"
image: "ml-docker-db-dev-tierpoint.bed-artifactory.bedford.progress.com/marklogic/marklogic-server-ubi:latest-12"
platform: linux/amd64
environment:
- MARKLOGIC_INIT=true
@@ -53,7 +53,7 @@

# Copied from https://docs.sonarsource.com/sonarqube/latest/setup-and-upgrade/install-the-server/#example-docker-compose-configuration .
sonarqube:
image: sonarqube:10.6.0-community
image: sonarqube:lts-community
depends_on:
- postgres
environment:
@@ -67,6 +67,13 @@
ports:
- "9000:9000"

# Using Ollama for testing an embedding model.
# See https://github.com/ollama/ollama for more information.
ollama:
image: "ollama/ollama"
ports:
- 8008:11434

volumes:
sonarqube_data:
sonarqube_extensions:
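The Ollama service defined above maps container port 11434 to host port 8008. As an optional sanity check after `docker compose up`, you can verify the service is reachable from the host (a suggested verification, not part of the repository's instructions):

curl http://localhost:8008/api/tags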
32 changes: 7 additions & 25 deletions docs/api.md
@@ -22,15 +22,15 @@ To add Flux as a dependency to your application, add the following to your Maven
<dependency>
<groupId>com.marklogic</groupId>
<artifactId>flux-api</artifactId>
<version>1.1.3</version>
<version>1.2.0</version>
</dependency>
```

Or if you are using Gradle, add the following to your `build.gradle` file:

```
dependencies {
implementation "com.marklogic:flux-api:1.1.3"
implementation "com.marklogic:flux-api:1.2.0"
}
```

@@ -97,7 +97,7 @@ buildscript {
mavenCentral()
}
dependencies {
classpath "com.marklogic:flux-api:1.1.3"
classpath "com.marklogic:flux-api:1.2.0"
}
}
```
@@ -127,25 +127,7 @@ when running Gradle. For example, if you run Gradle with `--stacktrace` and see
The [Gradle documentation](https://docs.gradle.org/current/userguide/build_environment.html) provides more information
on the `org.gradle.jvmargs` property along with other ways to customize the Gradle environment.
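
As an illustration only, `org.gradle.jvmargs` can be set in your project's `gradle.properties` file; the value below is an assumed example, not a recommendation:

```
# Example only: allocate more memory to the Gradle JVM.
org.gradle.jvmargs=-Xmx2g
```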

If you are using a plugin like [ml-gradle](https://github.com/marklogic/ml-gradle) that brings in its own version of the
[FasterXML Jackson APIs](https://github.com/FasterXML/jackson), you need to be sure that the version of Jackson is
between 2.14.0 and 2.15.0 as required by the Apache Spark dependency of Flux. The following shows an example of excluding
these dependencies from ml-gradle in a `build.gradle` file so that ml-gradle will use the Jackson APIs brought in via
Flux:

```
buildscript {
repositories {
mavenCentral()
}
dependencies {
classpath "com.marklogic:flux-api:1.1.3"
classpath("com.marklogic:ml-gradle:4.8.0") {
exclude group: "com.fasterxml.jackson.databind"
exclude group: "com.fasterxml.jackson.core"
exclude group: "com.fasterxml.jackson.dataformat"
}
}
}
```

Please note that you cannot yet use the Flux API in your Gradle buildscript when you are also using the
MarkLogic [ml-gradle plugin](https://github.com/marklogic/ml-gradle). This is due to a classpath conflict, where the
MarkLogic Spark connector used by Flux must alter an underlying library so as not to conflict with Spark itself - but
that altered library then conflicts with ml-gradle. We will have a resolution for this soon.
64 changes: 42 additions & 22 deletions docs/common-options.md
@@ -123,7 +123,7 @@ to the next line:

```
--query
"op.fromView('Example', 'Employees', '')\
"op.fromView('example', 'employees', '')\
.limit(10)"
```

@@ -263,12 +263,12 @@ All available connection options are shown in the table below:

| Option | Description |
| --- | --- |
| `--auth-type` | Type of authentication to use. Possible values are `BASIC`, `DIGEST`, `CLOUD`, `KERBEROS`, `CERTIFICATE`, and `SAML`.|
| `--auth-type` | Type of authentication to use. Possible values are `BASIC`, `DIGEST`, `CLOUD`, `CERTIFICATE`, `KERBEROS`, `OAUTH`, and `SAML`.|
| `--base-path` | Path to prepend to each call to a MarkLogic [REST API app server](https://docs.marklogic.com/guide/rest-dev). |
| `--certificate-file` | File path for a keystore to be used for `CERTIFICATE` authentication. |
| `--certificate-password` | Password for the keystore referenced by `--certificate-file`. |
| `--connection-string` | Defines a connection string as user:password@host:port/optionalDatabaseName; only usable when using `DIGEST` or `BASIC` authentication. |
| `--cloud-api-key` | API key for authenticating with a MarkLogic Cloud cluster when authentication type is `CLOUD`. |
| `--cloud-api-key` | API key for authenticating with a Progress Data Cloud cluster when authentication type is `CLOUD`. |
| `--connection-type` | Set to `DIRECT` if connections can be made directly to each host in the MarkLogic cluster. Defaults to `GATEWAY`. Possible values are `DIRECT` and `GATEWAY`. |
| `--database` | Name of a database to connect if it differs from the one associated with the app server identified by `--port`. |
| `--disable-gzipped-responses` | If included, responses from MarkLogic will not be gzipped. May improve performance when responses are very small.
@@ -278,6 +278,7 @@ All available connection options are shown in the table below:
| `--keystore-password` | Password for the keystore identified by `--keystore-path`. |
| `--keystore-path` | File path for a keystore for two-way SSL connections. |
| `--keystore-type` | Type of the keystore identified by `--keystore-path`; defaults to `JKS`. |
| `--oauth-token` | Token to be used with `OAUTH` authentication. |
| `--password` | Password when using `DIGEST` or `BASIC` authentication. |
| `--port` | Port of the [REST API app server](https://docs.marklogic.com/guide/rest-dev) to connect to. |
| `--saml-token` | Token to be used with `SAML` authentication. |
@@ -331,7 +332,7 @@ instead of in a table:
{% endtab %}
{% tab log Windows %}
```
./bin/flux import-parquet-files ^
bin\flux import-parquet-files ^
--connection-string "flux-example-user:password@localhost:8004" ^
--path export\parquet ^
--preview 10 ^
@@ -355,7 +356,7 @@ that Flux log the schema and not write any data:
```
./bin/flux export-parquet-files \
--connection-string "flux-example-user:password@localhost:8004" \
--query "op.fromView('Example', 'Employees')" \
--query "op.fromView('example', 'employees')" \
--path export/parquet \
--preview-schema
```
@@ -364,7 +365,7 @@ that Flux log the schema and not write any data:
```
bin\flux export-parquet-files ^
--connection-string "flux-example-user:password@localhost:8004" ^
--query "op.fromView('Example', 'Employees')" ^
--query "op.fromView('example', 'employees')" ^
--path export\parquet ^
--preview-schema
```
@@ -488,27 +489,46 @@ time you run Flux:
Flux is built on top of [Apache Spark](https://spark.apache.org/) and provides a number of command line options for
configuring the underlying Spark runtime environment used by Flux.

### Configuring the number of partitions
### Configuring Spark worker threads

Flux uses Spark partitions to allow for data to be read and written in parallel. Each partition can be thought of as
a separate worker, operating in parallel with each other worker.
By default, Flux creates a Spark runtime with a master URL of `local[*]`, which runs Spark with as many worker
threads as logical cores on the machine running Flux. The number of worker threads affects how many partitions can be
processed in parallel. You can change this setting via the `--spark-master-url` option; please see
[the Spark documentation](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls) for examples
of valid values. If you are looking to run a Flux command on a remote Spark cluster, please instead see the
[Spark Integration guide](spark-integration.md) for details on integrating Flux with `spark-submit`.
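
As an illustrative sketch, the export example shown earlier on this page could be run with eight worker threads by overriding the master URL (the thread count here is arbitrary):

```
./bin/flux export-parquet-files \
  --connection-string "flux-example-user:password@localhost:8004" \
  --query "op.fromView('example', 'employees')" \
  --path export/parquet \
  --spark-master-url "local[8]"
```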

A number of partitions will be determined by the command that you run before it reads data. The nature of the data
source directly impacts the number of partitions that will be created.
For import commands, you typically will not need to adjust this, as a partition writer in an import command supports its
own pool of threads via the [MarkLogic data movement library](https://docs.marklogic.com/guide/java/data-movement). However,
depending on the data source, additional worker threads may help with reading data in parallel.

If you find that an insufficient number of partitions are created - i.e. the writer phase of your Flux command is not
sending as much data to MarkLogic as it could - consider using the `--repartition` option to force a number of
partitions to be created after the data has been read. The downside to using `--repartition` is that all the data must
be read first. Generally, this option will help when data can be read quickly and the performance of writing can be
improved by using more partitions than were created when reading data.
For the [`reprocess` command](reprocess.md), setting the number of worker threads is critical to achieving optimal
performance. As of Flux 1.2.0, the `--thread-count` option will adjust the Spark master URL based on the number of
threads you specify. Prior to Flux 1.2.0, you can use `--repartition` to achieve the same effect.

### Configuring a Spark URL
For exporting data, please see the [exporting guide](export/export.md) for information on how to adjust the worker
threads depending on whether you are reading documents or rows from MarkLogic.

By default, Flux creates a Spark session with a master URL of `local[*]`. You can change this via the
`--spark-master-url` option; please see
[the Spark documentation](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls) for examples
of valid values. If you are looking to run a Flux command on a remote Spark cluster, please instead see the
[Spark Integration guide](spark-integration.md) for details on integrating Flux with `spark-submit`.
### Configuring the number of Spark partitions

Flux uses Spark partitions to allow for data to be read and written in parallel. Each partition can be thought of as
a separate worker, operating in parallel with each other worker.

A number of partitions will be determined by the command that you run before it reads data. The nature of the data
source directly impacts the number of partitions that will be created.

For some commands, you may find improved performance by changing the number of partitions used to write data to the
target associated with the command. For example, an `export-jdbc` command may only need a small number of partitions to
read data from MarkLogic, but performance will be improved by using a far higher number of partitions to write data to
the JDBC destination. You can use the `--repartition` option to force the number of partitions to use for writing data.
The downside to this option is that it forces Flux to read all the data from the data source before writing any to the
target. Generally, this option will help when data can be read quickly and the performance of writing can be
improved by using more partitions than were created when reading data - this is almost always the case for the
`reprocess` command.

As of Flux 1.2.0, setting `--repartition` will default the value of the `--spark-master-url` option to be `local[N]`,
where `N` is the value of `--repartition`. This ensures that each partition writer has a Spark worker thread available
to it. You can still override `--spark-master-url` if you wish.
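
As a sketch of how the option is applied (reusing the export example from earlier on this page rather than `export-jdbc`, and with an arbitrary partition count):

```
./bin/flux export-parquet-files \
  --connection-string "flux-example-user:password@localhost:8004" \
  --query "op.fromView('example', 'employees')" \
  --path export/parquet \
  --repartition 16
```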

### Configuring the Spark runtime
