
Conversation

@kevinliao852

@kevinliao852 kevinliao852 commented Nov 12, 2025

Tracking issue

Why are the changes needed?

The current Spark plugin imports and uses Spark 4-specific data types that do not exist in Spark 3.x (including Spark 3.4).
This results in runtime import errors such as:

File "/home/kevinliao/.local/share/uv/python/cpython-3.10.17-linux-x86_64-gnu/lib/python3.10/importlib/util.py", line 94, in find_spec
    parent = __import__(parent_name, fromlist=['__path__'])
ModuleNotFoundError: No module named 'pyspark.sql.classic'
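The `find_spec` frame in this traceback imports parent packages before resolving the target, so even a feature probe for `pyspark.sql.classic.dataframe` raises on Spark 3.x rather than returning None. A guarded import is one safe way to detect the module; this is a sketch of the pattern, not the plugin's actual code:

```python
import importlib


def has_classic_dataframe_module():
    """Return True iff pyspark.sql.classic.dataframe is importable (Spark 4.x)."""
    try:
        importlib.import_module("pyspark.sql.classic.dataframe")
        return True
    except ModuleNotFoundError:
        # Raised when PySpark is not installed at all, and also on Spark 3.x,
        # where the pyspark.sql.classic package does not exist.
        return False
```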

Because Flyte users run Spark workloads across mixed versions (Spark 3.x or Spark 4.x), the plugin must not assume Spark 4 APIs exist at runtime.

Without this patch, Spark 3.x tasks fail immediately, even if their logic does not depend on Spark-4-only features.

What changes were proposed in this pull request?

  1. Added runtime checks for Spark major version.
  2. Implemented conditional import of Spark 4 data types.
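The two changes above can be sketched roughly as follows; the function names are illustrative, not the plugin's actual API:

```python
# 1. Runtime check of the Spark major version.
def get_spark_major_version():
    """Return PySpark's major version as an int, or None if PySpark is absent."""
    try:
        import pyspark
    except ModuleNotFoundError:
        return None
    return int(pyspark.__version__.split(".")[0])


# 2. Conditional import of the Spark 4-only DataFrame type.
def try_import_classic_dataframe():
    """Return Spark 4's ClassicDataFrame class if importable, else None."""
    try:
        from pyspark.sql.classic.dataframe import DataFrame as ClassicDataFrame
    except ModuleNotFoundError:
        # Spark 3.x (and earlier) has no pyspark.sql.classic package.
        return None
    return ClassicDataFrame
```

Callers can then branch on the result instead of assuming the Spark 4 type exists at import time.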

How was this patch tested?

  • Tested PySpark locally with Spark 3.4 and Spark 4.x.
  • Built a Flyte sandbox image with the updated transformer and schema.
  • Ran Spark tasks in Flyte:
      • Spark 3.4 base image → passed
      • Spark 4.x base image → passed

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Summary by Bito

  • Enhances the Flyte Spark plugin to support both Spark 3.x and 4.x by adding runtime checks for the Spark major version.
  • Introduces conditional imports of Spark 4-specific data types to prevent import errors on Spark 3.x.
  • Registers encoding and decoding handlers that adapt to the available Spark version.

@welcome

welcome bot commented Nov 12, 2025

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; if yours does, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

@machichima
Member

Sorry, I'm not familiar with this part of the code. I think the code path you are modifying is executed when a user sets a parameter with type ClassicDataFrame? Please correct me if that's not the case.

If so, could you also try running a workflow with the parameter type set to ClassicDataFrame on Spark 4.x to make sure it works?
Thank you!

@kevinliao852
Author

Hi Nary, thanks for the review.

To my understanding, Spark introduced Spark Connect in Spark 3.4+, and Spark 4 moves toward making it the primary execution backend. To support both the classic JVM engine and the Connect engine under a unified API, pyspark.sql.DataFrame now acts as a high-level abstract entrypoint rather than a concrete implementation. When a DataFrame is created, Spark redirects the construction to the appropriate backend:

pyspark.sql.DataFrame → pyspark.sql.classic.dataframe.DataFrame

I also tested this behavior in PySpark 4: the resulting DataFrame type is indeed pyspark.sql.classic.dataframe.DataFrame (i.e., ClassicDataFrame), and it runs without issues.

This is the reference you may be interested in: apache/spark@393a84f

Additionally, PySpark 3.x has no pyspark.sql.classic.dataframe module at all, yet the Flyte Spark plugin still attempts to load it; this mismatch is the root cause of the issue. The main fix in this PR is to load the new data type conditionally, only when the DataFrame instance is actually a ClassicDataFrame, so the transformer works correctly in both PySpark 3.x (where the module does not exist) and PySpark 4.x (where ClassicDataFrame is explicitly defined).
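The conditional check described here could look like the following sketch (an illustration of the idea, not the plugin's actual code):

```python
def is_classic_dataframe(df):
    """Return True only when Spark 4's ClassicDataFrame exists and df is one."""
    try:
        from pyspark.sql.classic.dataframe import DataFrame as ClassicDataFrame
    except ModuleNotFoundError:
        # Spark 3.x: the pyspark.sql.classic package does not exist,
        # so df cannot be a ClassicDataFrame.
        return False
    return isinstance(df, ClassicDataFrame)
```

On Spark 3.x this always returns False without touching the missing module; on Spark 4.x it performs a normal isinstance check.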

Member

@machichima machichima left a comment


LGTM! Thank you for the detailed explanation!
cc @pingsutw
