
@ymuppala ymuppala commented Feb 12, 2026

Problem Statement

Spark KIF repush fails for stores with chunking enabled due to two bugs; a third, unrelated CI issue is also addressed here:

  1. Schema validation rejects Venice internal columns: validateDataFrame() rejects all fields starting with _, but chunked KIF DataFrames contain Venice internal columns (__schema_id__, __offset__, __message_type__, __chunked_key_suffix__, __replication_metadata_version_id__) that are required for chunk assembly. These columns are consumed by applyChunkAssembly() and dropped later in runComputeJob(), but validation runs before they are dropped.
  2. NotSerializableException in Spark lambdas: the lambdas in applyChunkAssembly() and applyTTLFilter() reference the instance field accumulatorsForDataWriterJob directly, which captures the enclosing DataWriterSparkJob instance. Since DataWriterSparkJob does not implement Serializable, Spark fails when serializing the task to send to executors.
  3. GitHub Actions updated its Docker version, which in turn caused the PulsarIntegrationTests to fail.

Solution

  1. In validateDataFrame(), when isSourceKafka is true, the five known Venice internal column names are added to an allowlist and exempted from the underscore check; for regular VPJ pushes, unknown underscore-prefixed columns are still rejected (see the first sketch after this list).
  2. Following the existing pattern in applyDedup(), the accumulators are extracted into final local variables before the lambda, so only the serializable LongAccumulator is captured instead of this (see the second sketch after this list).
  3. Pin the Docker version used by the GitHub Actions workflow.
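
Below is a minimal sketch of the allowlist exemption from item 1, assuming the validation reduces to a name check over the DataFrame schema; the class, method, and parameter names are illustrative, not the actual Venice implementation.

```java
import java.util.Set;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Illustrative sketch only; names do not match the real validateDataFrame() code.
public final class InternalColumnAllowlistSketch {
  // The five Venice internal columns required for chunk assembly.
  private static final Set<String> KIF_INTERNAL_COLUMNS = Set.of(
      "__schema_id__",
      "__offset__",
      "__message_type__",
      "__chunked_key_suffix__",
      "__replication_metadata_version_id__");

  static void validateFieldNames(StructType schema, boolean isSourceKafka) {
    for (StructField field: schema.fields()) {
      String name = field.name();
      if (!name.startsWith("_")) {
        continue;
      }
      // Exempt the known internal columns only for KIF (source = Kafka) repush;
      // regular VPJ pushes still reject every underscore-prefixed column.
      if (isSourceKafka && KIF_INTERNAL_COLUMNS.contains(name)) {
        continue;
      }
      throw new IllegalArgumentException("Field name '" + name + "' must not start with an underscore");
    }
  }
}
```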
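
And a sketch of the closure-capture fix from item 2, following the same local-variable pattern as applyDedup(); the class shape, the timestamp column, and the filter logic below are illustrative only.

```java
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.util.LongAccumulator;

// Illustrative sketch only; not the real DataWriterSparkJob code.
class TtlFilterSketch {
  // Registered on the SparkContext elsewhere; the enclosing class itself is not Serializable.
  private LongAccumulator recordsSkippedByTtl;

  Dataset<Row> applyTtlFilter(Dataset<Row> dataFrame, long minTimestampMs) {
    // Copy the accumulator into a final local so the lambda captures only the
    // serializable LongAccumulator, never `this`.
    final LongAccumulator skippedCounter = this.recordsSkippedByTtl;
    return dataFrame.filter((FilterFunction<Row>) row -> {
      long timestampMs = row.getLong(row.fieldIndex("timestamp"));
      if (timestampMs < minTimestampMs) {
        skippedCounter.add(1L);
        return false;
      }
      return true;
    });
  }
}
```

The key point is that the lambda's closure now holds a reference to the LongAccumulator, which Spark can serialize, rather than to the enclosing job object.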

Code changes

  • Added new code behind a config. If so, list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

The PR was tested against a test instance in Airflow rdev: http://rdev-aks-wus3-14.privatelink.westus3.azmk8s.io:42229/dags/vpj-spark-repush-test_faro__voldemort-build-and-push/grid?dag_run_id=manual__2026-02-12T18%3A23%3A00.588766%2B00%3A00&task_id=spark_repush_job&tab=logs

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

@ymuppala ymuppala force-pushed the repush_integration_test branch 2 times, most recently from 79eb63a to 87b9351 on February 12, 2026 21:20
@ymuppala ymuppala force-pushed the repush_integration_test branch from 87b9351 to 71679f7 on February 12, 2026 21:43
@ymuppala ymuppala merged commit 3c0d285 into linkedin:main Feb 13, 2026
69 of 72 checks passed