
Releases: marklogic/marklogic-spark-connector

3.1.1

15 May 14:21
86748aa


This patch release addresses the following item:

  • The org.apache.thrift:libthrift transitive dependency, brought in via the Apache jena-arq dependency used for processing RDF data, was bumped from 0.22.0 to 0.23.0 to address several CVEs.

3.1.0

18 Mar 14:55
0877fc9


This minor release provides the following enhancements and bug fixes:

  • Incremental write, as introduced in the MarkLogic Java Client 8.1.0 release, is now available through a set of options prefixed with spark.marklogic.write.incremental.
  • To support incremental write, the option spark.marklogic.write.logSkippedDocuments can be used to control how often messages are logged when documents are skipped.
  • The spark.marklogic.streamTransformBinaryExtensions option defines a comma-delimited list of URI extensions that controls which documents are sent to a REST transform when streaming documents during import. This supports streaming documents into MarkLogic when only a subset of them must be sent to a REST transform, typically to alter the document type - e.g. ensuring that a JSON or XML document is loaded as a binary. Limiting the transform to that subset matters because a REST transform requires loading the document into memory in MarkLogic, which runs counter to the purpose of streaming.
  • To support exporting and importing archives where JSON and/or XML documents are stored in MarkLogic as binaries, the name of a metadata entry in an archive file will now include the document type - e.g. "JSON", "XML", "TEXT", or "BINARY". The document type is not required to be present in an archive file - i.e. archive files created with previous versions of the connector will still be imported correctly.
  • Fixed a bug where empty documents would cause an error when written to an archive file.
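
As a rough illustration of the streaming option described above, write options might be assembled like this in PySpark. This is a sketch under assumptions: the connection string and the transform name are hypothetical, spark.marklogic.write.transform is assumed to be the option that names the REST transform, and whether extension values take a leading dot is not stated in these notes.

```python
# Hypothetical connection details and transform name; adjust for your environment.
stream_write_options = {
    "spark.marklogic.client.uri": "spark-user:password@localhost:8000",
    # Stream each file to MarkLogic instead of reading it into memory.
    "spark.marklogic.streamFiles": "true",
    # Assumed option for naming the REST transform ("to-binary" is made up).
    "spark.marklogic.write.transform": "to-binary",
    # Only URIs with these extensions are sent to the transform
    # (the exact value format, e.g. a leading dot, is an assumption).
    "spark.marklogic.streamTransformBinaryExtensions": "json,xml",
}


def write_streamed(df, options=stream_write_options):
    """Apply the options to a DataFrame writer and save to MarkLogic."""
    writer = df.write.format("marklogic").mode("append")
    for key, value in options.items():
        writer = writer.option(key, value)
    writer.save()
```

Documents whose URIs do not match one of the listed extensions would be streamed without passing through the transform, which keeps them out of MarkLogic's memory.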

3.0.0

13 Jan 21:23
c5f2810


This major release is built against Apache Spark 4.1.0 instead of Spark 3, and thus requires Java 17. Please see below for a complete list of breaking changes, enhancements, and bug fixes.

Breaking Changes

  1. When splitting text and creating chunks, the connector now defaults to creating one sidecar document per chunk as opposed to defaulting to adding all chunks to the source document.
  2. Vector embeddings in XML documents now default to a QName of vector with a namespace of http://marklogic.com/vector, matching upcoming default index exclusions in the MarkLogic server.
  3. Vector embeddings in JSON documents now default to a name of _vector, also matching upcoming default index exclusions in the MarkLogic server.
  4. The deprecated spark.marklogic.write.fileRows.documentType option has been removed. This option was intended to be used with Spark's binaryFile data source, but binary files should instead be read with this connector's own support for reading files.
  5. The com.marklogic.client, okhttp3, okio, and com.burgstaller.okhttp packages are no longer shaded in the connector jar. These had to be shaded when the connector depended on Spark 3 as Spark 3 included its own older version of OkHttp.

Enhancements

  1. When a URI template has an expression that cannot be resolved for a given document, the new spark.marklogic.write.uriTemplate.warnOnMissingField option can be set to true to log a warning instead of failing. The expression will have its value replaced with UNRESOLVED- prepended to a random UUID.
  2. When reading files, the connector now defaults to a number of partitions equal to the value of spark.default.parallelism, helping avoid performance issues due to large numbers of very small partitions.
  3. When classifying text via a Semaphore instance in Progress Data Cloud (PDC), the PDC token will be renewed if it expires during the course of a connector job.
  4. When exporting documents to zip files, a warning will be logged once a zip file contains 500,000 entries. Writing multiple large zip files at once can lead to heap space exhaustion in the JVM; users can avoid this by increasing the number of partitions.
  5. Added spark.marklogic.read.partitions.vars. as a prefix for defining variables to send to the custom code for reading partitions when reading items via custom code.
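
The behavior of spark.marklogic.write.uriTemplate.warnOnMissingField in the first enhancement above can be pictured with a small standalone sketch. This is not the connector's code: the function name, the flat {field} expression handling, and the log message are hypothetical, and only the UNRESOLVED- prefix plus random UUID comes from the release note.

```python
import logging
import re
import uuid

logger = logging.getLogger("uri_template_sketch")


def resolve_uri_template(template, fields, warn_on_missing=False):
    """Replace each {name} expression in the template with a field value.

    Hypothetical helper: it handles only flat field names and mimics the
    documented fallback of UNRESOLVED- followed by a random UUID.
    """

    def replace(match):
        name = match.group(1)
        if fields.get(name) is not None:
            return str(fields[name])
        if warn_on_missing:
            logger.warning("Could not resolve '%s' in URI template '%s'", name, template)
            return "UNRESOLVED-" + str(uuid.uuid4())
        raise ValueError(f"Could not resolve '{name}' in URI template '{template}'")

    return re.sub(r"\{([^}]+)\}", replace, template)
```

With warn_on_missing left at False, a missing field raises an error, matching the failing behavior that the new option lets users opt out of.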

Bug Fixes

  1. Fixed a bug with writing triples where datatype is only set if lang does not exist.
  2. Fixed a bug where, when reading files, a partition could have zero files. Files are now evenly distributed across partitions.
  3. Fixed a bug with exporting documents to zip files on Windows.
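
To make "evenly distributed" in the second bug fix concrete, here is one simple way files could be spread across partitions so that none is empty. It is an illustrative round-robin sketch, not the connector's actual algorithm; the function name and the choice to cap the partition count at the number of files are assumptions.

```python
def distribute_files(paths, num_partitions):
    """Assign file paths to partitions round-robin so sizes differ by at most one.

    Hypothetical helper: the partition count is capped at the number of
    files, so no partition is left with zero files.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    count = min(num_partitions, len(paths))
    partitions = [[] for _ in range(count)]
    for index, path in enumerate(paths):
        partitions[index % count].append(path)
    return partitions
```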

2.7.0

05 Aug 21:05
d61de9b


This minor release addresses the following items:

  • Can now provide a secondary query when reading documents from MarkLogic. This is supported via the following new options:
    • spark.marklogic.read.secondaryUris.invoke
    • spark.marklogic.read.secondaryUris.javascript
    • spark.marklogic.read.secondaryUris.javascriptFile
    • spark.marklogic.read.secondaryUris.xquery
    • spark.marklogic.read.secondaryUris.xqueryFile
    • spark.marklogic.read.secondaryUris.vars.
  • Can now provide a prompt when generating an embedding via the new spark.marklogic.write.embedder.prompt option.
  • Can now encode vectors in documents when generating embeddings via the new spark.marklogic.write.embedder.base64encode option.
  • Fixed a bug where classifying text and generating embeddings did not work when data was read from a structured data source such as JDBC or a delimited text file.
  • Fixed a bug where a document with a URI containing multiple colons could not be read from MarkLogic and written to a file.
  • Fixed a bug where URIs were incorrectly modified when documents were written as entries in a zip file. URIs are now used as the zip entry name.

2.6.0

02 May 14:43


This release addresses the following items:

  • Can now extract text from binary documents via Apache Tika.
  • Can now classify text via Progress Semaphore.
  • Can now specify document properties and metadata values when writing documents to MarkLogic.

2.5.1

07 Jan 19:04


This patch release addresses the following items:

  1. Depends on the MarkLogic Java Client 7.1.0 release, which includes an important bug fix that affects how the connector reads data via custom code.
  2. Added debug-level logging for reading and writing data via custom code.
  3. Fixed an issue with logging progress when reading rows via an Optic query.

2.5.0

17 Dec 21:49
d0d6d9c


This release addresses the following items:

  1. Can now split text in documents when writing them to MarkLogic. Chunks of text can be added to the source document itself or written to separate sidecar documents.
  2. Can now add embeddings to chunks in documents before writing them to MarkLogic. You can reuse the Flux embedding model integrations available from the Flux releases site by adding one or more of these JAR files to your Spark classpath.
  3. When reading rows via an Optic query, the Optic query no longer requires the use of op.fromView. However, when not using op.fromView, the Optic query will be executed in a single call to MarkLogic.
  4. When writing files to a directory, the given path will be created automatically if it does not exist, matching the behavior of Spark file-based data sources.

Please see the writing guide for more information on the splitter and embedder features.
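
For the Optic change in item 3 above, the two kinds of query might be configured as follows. This is a sketch under assumptions: the connection string, schema, view, and literal rows are hypothetical, and op.fromLiterals is used only as an example of an accessor other than op.fromView that the connector is assumed to accept.

```python
# Hypothetical connection string; adjust for your environment.
CLIENT_URI = "spark-user:password@localhost:8000"

# A view-based query, which the connector supported before 2.5.0 as well.
view_query_options = {
    "spark.marklogic.client.uri": CLIENT_URI,
    "spark.marklogic.read.opticQuery": "op.fromView('example', 'employee')",
}

# A query that does not use op.fromView. Per the release note, such a
# query is executed in a single call to MarkLogic.
literal_query_options = {
    "spark.marklogic.client.uri": CLIENT_URI,
    "spark.marklogic.read.opticQuery": "op.fromLiterals([{id: 1, name: 'Jane'}, {id: 2, name: 'John'}])",
}


def read_rows(spark, options):
    """Read rows from MarkLogic using an existing SparkSession."""
    return spark.read.format("marklogic").options(**options).load()
```

Because a query without op.fromView runs in one call, it is likely best suited to result sets small enough to be returned in a single response.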

2.4.2

17 Oct 18:44
afd19a3


This patch release addresses the following two issues:

  1. spark.marklogic.read.snapshot was added to allow a user to configure a non-consistent snapshot when reading documents by setting the option to false. This avoids bugs where a consistent snapshot is not feasible and the downsides of reading at multiple times are not a concern.
  2. Issues with importing JSON Lines files via Flux - such as keys being reordered and added - can be avoided by setting the existing spark.marklogic.read.files.type option to a value of json_lines. The connector will read each line as a separate JSON document and will not perform any modifications on any line, thereby avoiding the issue in Flux of JSON documents being unexpectedly altered.
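
The two options above might be set like this in PySpark. This is a minimal sketch: the connection string and collection name are hypothetical, and spark.marklogic.read.documents.collections is assumed to be the option used to select documents by collection.

```python
# Read documents without a consistent snapshot (hypothetical connection
# string and collection name).
non_snapshot_read_options = {
    "spark.marklogic.client.uri": "spark-user:password@localhost:8000",
    "spark.marklogic.read.documents.collections": "example",
    "spark.marklogic.read.snapshot": "false",
}

# Read a JSON Lines file so that each line becomes its own unmodified document.
json_lines_read_options = {
    "spark.marklogic.read.files.type": "json_lines",
}


def read_json_lines(spark, path):
    """Load a JSON Lines file through the connector using an existing SparkSession."""
    return spark.read.format("marklogic").options(**json_lines_read_options).load(path)
```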

2.4.1

17 Oct 17:06
7ff9e0f


This patch release addresses a single issue:

  • The org.slf4j:slf4j-api transitive dependency is forced to be version 2.0.13, ensuring that no occurrences of the 1.x version of that dependency are included in the connector jar. This resolves a logging issue in the Flux application.

2.4.0

02 Oct 19:25
168cf5f


This minor release addresses the following items:

  1. Can now stream regular files, ZIP files, gzip files, and archive files by setting the new spark.marklogic.streamFiles option to a value of true. Using this option in the reader phase results in the reading of files being deferred until the writer phase. Using this option in the writer phase results in each file being streamed to MarkLogic in a separate request to MarkLogic, thus avoiding ever reading the contents of the file or zip entry into memory.
  2. Can now stream documents from MarkLogic to regular files, ZIP files, gzip files, and archive files by setting the same option above - spark.marklogic.streamFiles - to a value of true. Using this option in the reader phase results in the reading of documents being deferred until the writer phase. Using this option in the writer phase results in each document being streamed from MarkLogic to a file or zip entry, thus avoiding ever reading the contents of the document into memory.
  3. Files with spaces in the path are now handled correctly when reading files into MarkLogic. However, when streaming files into MarkLogic, the spaces in the path will be encoded due to a pending server fix.
  4. Archive files - zip files containing content and metadata - now contain the metadata entry followed by the content entry for each document. This supports streaming archive files. Archive files generated by version 2.3.x of the connector - with the content entry followed by the metadata entry - can still be read, though they cannot be streamed.
  5. Now compiled and tested against Spark 3.5.3.
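
As a sketch of item 1 above, streaming files into MarkLogic means setting spark.marklogic.streamFiles in both the reader and writer phases. The function name, file path, and connection string below are hypothetical, and the caller is assumed to supply an existing SparkSession with the connector on its classpath.

```python
STREAM_FILES_OPTION = "spark.marklogic.streamFiles"


def stream_files_into_marklogic(spark, path, client_uri):
    """Stream files at the given path into MarkLogic without buffering them.

    The reader defers reading each file; the writer then streams each
    file to MarkLogic in its own request.
    """
    (
        spark.read.format("marklogic")
        .option(STREAM_FILES_OPTION, "true")  # reader phase: defer reading
        .load(path)
        .write.format("marklogic")
        .option("spark.marklogic.client.uri", client_uri)
        .option(STREAM_FILES_OPTION, "true")  # writer phase: stream each file
        .mode("append")
        .save()
    )
```

A call might look like stream_files_into_marklogic(spark, "/data/large-files", "spark-user:password@localhost:8000"), where both arguments are placeholders.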