Pass bytes object from remote to local #954

@jacgoldsm

Currently, the way to transfer a dataframe from remote to local is the %%spark -o df command. However, this is not optimal: it is less efficient than the normal .toPandas() method of a Spark dataframe, and the resulting data types are not equivalent to what .toPandas() produces.

The easiest workaround is to write the dataframe to a file system that both remote and local have access to. For example, you could do:

df = spark.table("data")
df.toPandas().to_parquet("/file/system/data.parquet")

%%local
import pandas as pd
df = pd.read_parquet("/file/system/data.parquet")

If you could transfer the serialized Parquet bytes directly to local, you could avoid the external file system:

df = spark.table("data")
buf = df.toPandas().to_parquet(path=None)

%%send_bytes buf

%%local
import io
import pandas as pd
df = pd.read_parquet(io.BytesIO(buf))
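
To illustrate the transfer step, here is a minimal sketch of how a bytes object could be carried between the two sides, assuming the channel between remote and local only carries text. That is an assumption on my part, not something I have confirmed in the codebase. The helper names `encode_for_transport` and `decode_from_transport` are hypothetical, and the payload is a stand-in for the buffer returned by `to_parquet(path=None)`.

```python
import base64


def encode_for_transport(payload: bytes) -> str:
    """Remote side (hypothetical helper): turn raw bytes into ASCII text."""
    return base64.b64encode(payload).decode("ascii")


def decode_from_transport(text: str) -> bytes:
    """Local side (hypothetical helper): recover the original bytes."""
    # validate=True rejects characters outside the base64 alphabet
    # instead of silently discarding them.
    return base64.b64decode(text.encode("ascii"), validate=True)


# Stand-in payload; in the proposal above this would be the Parquet buffer.
buf = b"PAR1" + bytes(range(256)) + b"PAR1"

wire_text = encode_for_transport(buf)        # text that can cross the channel
restored = decode_from_transport(wire_text)  # bytes again on the local side

assert restored == buf
```

Base64 inflates the payload by roughly a third, so this only sketches the round trip; whether it would still be more efficient than %%spark -o df for large dataframes would need to be measured.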

How difficult would it be to add the %%send_bytes magic? From looking through the code, it seems as though it should be doable, but I may be missing something. I am happy to help as best I can with the implementation, but I am not very familiar with the codebase.
