
Conversation

@kevinliao852

@kevinliao852 kevinliao852 commented Nov 12, 2025

Tracking issue

Why are the changes needed?

The current Spark plugin imports and uses Spark 4-specific data types that do not exist in Spark 3.x (including Spark 3.4).
This results in runtime import errors such as:

File "/home/kevinliao/.local/share/uv/python/cpython-3.10.17-linux-x86_64-gnu/lib/python3.10/importlib/util.py", line 94, in find_spec
    parent = __import__(parent_name, fromlist=['__path__'])
ModuleNotFoundError: No module named 'pyspark.sql.classic'
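The `find_spec` frame in this traceback imports parent packages before resolving the target, so even a feature probe for `pyspark.sql.classic.dataframe` raises on Spark 3.x rather than returning None. A guarded import is one safe way to detect the module; this is a sketch of the pattern, not the plugin's actual code:

```python
import importlib


def has_classic_dataframe_module():
    """Return True iff pyspark.sql.classic.dataframe is importable (Spark 4.x)."""
    try:
        importlib.import_module("pyspark.sql.classic.dataframe")
        return True
    except ModuleNotFoundError:
        # Raised when PySpark is not installed at all, and also on Spark 3.x,
        # where the pyspark.sql.classic package does not exist.
        return False
```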

Because Flyte users run Spark workloads across mixed versions (Spark 3.x or Spark 4.x), the plugin must not assume Spark 4 APIs exist at runtime.

Without this patch, Spark 3.x tasks fail immediately, even if their logic does not depend on Spark-4-only features.

What changes were proposed in this pull request?

  1. Added runtime checks for Spark major version.
  2. Implemented conditional import of Spark 4 data types.
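The two changes above can be sketched roughly as follows; the function names are illustrative, not the plugin's actual API:

```python
# 1. Runtime check of the Spark major version.
def get_spark_major_version():
    """Return PySpark's major version as an int, or None if PySpark is absent."""
    try:
        import pyspark
    except ModuleNotFoundError:
        return None
    return int(pyspark.__version__.split(".")[0])


# 2. Conditional import of the Spark 4-only DataFrame type.
def try_import_classic_dataframe():
    """Return Spark 4's ClassicDataFrame class if importable, else None."""
    try:
        from pyspark.sql.classic.dataframe import DataFrame as ClassicDataFrame
    except ModuleNotFoundError:
        # Spark 3.x (and earlier) has no pyspark.sql.classic package.
        return None
    return ClassicDataFrame
```

Callers can then branch on the result instead of assuming the Spark 4 type exists at import time.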

How was this patch tested?

  • Tested PySpark locally with Spark 3.4 and Spark 4.x.
  • Built a Flyte sandbox image with the updated transformer and schema.
  • Ran Spark tasks in Flyte:
      • Spark 3.4 base image → passed
      • Spark 4.x base image → passed

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Summary by Bito

  • Enhances the Flyte Spark plugin to support both Spark 3.x and 4.x by adding runtime checks for the Spark major version.
  • Introduces conditional imports of Spark 4-specific data types to prevent import errors on Spark 3.x.
  • Registers encoding and decoding handlers that adapt to the available Spark version.

@welcome

welcome bot commented Nov 12, 2025

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; if yours does, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

@machichima
Member

Sorry, I'm not familiar with this part of the code. I think the code path you are modifying is executed when a user sets a parameter with type ClassicDataFrame? Please correct me if that's not the case.

If so, could you also try running a workflow with the parameter type set to ClassicDataFrame on Spark 4.x to make sure it works?
Thank you!

@kevinliao852
Author

Hi Nary, thanks for the review.

To my understanding, Spark introduced Spark Connect in Spark 3.4+, and Spark 4 moves toward making it the primary execution backend. To support both the classic JVM engine and the Connect engine under a unified API, pyspark.sql.DataFrame now acts as a high-level abstract entrypoint rather than a concrete implementation. When a DataFrame is created, Spark redirects the construction to the appropriate backend:

pyspark.sql.DataFrame → pyspark.sql.classic.dataframe.DataFrame

I also tested this behavior in PySpark 4: the resulting DataFrame type is indeed pyspark.sql.classic.dataframe.DataFrame (i.e., ClassicDataFrame), and it runs without issues.

This is the reference you may be interested in: apache/spark@393a84f

Additionally, PySpark 3.x has no pyspark.sql.classic.dataframe module at all, yet the Flyte Spark plugin still attempts to load it; this mismatch is the root cause of the issue. The main fix in this PR is to load the new data type conditionally, only when the DataFrame instance is actually a ClassicDataFrame, so the transformer works correctly in both PySpark 3.x (where the module does not exist) and PySpark 4.x (where ClassicDataFrame is explicitly defined).
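The conditional check described here could look like the following sketch (an illustration of the idea, not the plugin's actual code):

```python
def is_classic_dataframe(df):
    """Return True only when Spark 4's ClassicDataFrame exists and df is one."""
    try:
        from pyspark.sql.classic.dataframe import DataFrame as ClassicDataFrame
    except ModuleNotFoundError:
        # Spark 3.x: the pyspark.sql.classic package does not exist,
        # so df cannot be a ClassicDataFrame.
        return False
    return isinstance(df, ClassicDataFrame)
```

On Spark 3.x this always returns False without touching the missing module; on Spark 4.x it performs a normal isinstance check.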

Member

@machichima machichima left a comment


LGTM! Thank you for the detailed explanation!
cc @pingsutw
