Overview
@ivirshup, working on the cellxgene-census builder, has some new dataframes which have nullable-int columns.
He finds that when such data are read from TileDB-SOMA to Arrow using arrow_table = soma_dataframe.read().concat(), nullability of the int columns is conveyed correctly, but when a user then does pandas_dataframe = arrow_table.to_pandas(), nullability of the int columns is lost. This is a Pandas/Arrow issue, but the current Arrow-to-Pandas behavior for nullable-int columns is documented here, so it is probably not considered a bug.
The ask here is for TileDB-SOMA to (proactively and defensively) work around this by enhancing the Arrow tables we create on soma_dataframe.read(), so that when users go on to convert those to Pandas format in their workflows, nullability of integer columns is conveyed from Arrow to Pandas as desired.
Context: nullability in Pandas, Arrow, and TileDB
There are Pandas dataframes, Arrow tables, and TileDB arrays. (It's easy to conflate the second and third since TileDB-SOMA uses Arrow schema in its data model, but there are tiledb-schema-to-arrow-schema and arrow-schema-to-tiledb-schema functions within the libtiledbsoma codebase.)
Nullability in Pandas:
You can't see a nullability flag by printing .dtypes on a Pandas dataframe; the dtype name itself (e.g. Int64 vs. int64) is the signal
If a Pandas dataframe has a nullable Pandas int column, you can see NA values in that column when you print the Pandas dataframe
If a Pandas dataframe has missing values in a non-nullable (NumPy-backed) int column, that column will have been coerced to float, and you can see NaN values in that column when you print the Pandas dataframe
Nullability in Arrow tables:
Within the nanoarrow API, nullability can be expressed using a bit-flag on a given column, or via Arrow-table metadata in the form of a key-value pair "columnname": "nullable"
If an Arrow table has a nullable column, you can see this one of two ways in the pyarrow API:
Print arrow_table.schema -- non-nullable fields will say not null; nullable fields won't
More explicitly, print out arrow_table.schema.field("field_name").nullable
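For instance, a minimal pyarrow sketch showing both checks (the field names are illustrative):

import pyarrow as pa

# Build a schema with one nullable and one non-nullable field
schema = pa.schema([
    pa.field("a", pa.int64(), nullable=True),
    pa.field("b", pa.int64(), nullable=False),
])

# >>> print(schema)
# a: int64
# b: int64 not null
# >>> schema.field("a").nullable
# True
# >>> schema.field("b").nullable
# False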
Nullability in TileDB arrays:
You can see these one of two ways:
Open the array with tiledb-py's tiledb.open and print A.schema and look for nullable=True or nullable=False
Open the dataframe using tiledbsoma and print sdf.schema -- this is Arrow schema, so, as above, non-nullable fields will say not null; nullable fields won't
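For instance, a minimal sketch of both views (the URI is hypothetical):

import tiledb
import tiledbsoma

uri = "/path/to/some/array"  # hypothetical

# tiledb-py view: per-attribute nullable=True / nullable=False
with tiledb.open(uri) as A:
    print(A.schema)

# tiledbsoma view: Arrow schema, so non-nullable fields print "not null"
with tiledbsoma.open(uri) as sdf:
    print(sdf.schema)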
Context: conversions
Conversion from Pandas to Arrow:
E.g. arrow_from_pandas = pa.Table.from_pandas(pandas_df)
The output Arrow table will have a metadata key b'pandas' with a JSON-encoded metadata value, like the one shown in the worked example below
Conversion from Arrow to Pandas:
E.g. pandas_from_arrow = arrow_table.to_pandas()
If the input Arrow table has a "pandas" metadata key-value pair then that is consulted: what should be an int column will be int, with null in it
If the input Arrow table does not have a "pandas" metadata key-value pair then the output Pandas dataframe that is produced will have float columns where there should be ints, with NaN in it -- THIS IS THE ISSUE (see below)
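As an aside, a user can force nullable dtypes on their side via to_pandas's types_mapper argument, with no metadata involved. A minimal sketch:

import pyarrow as pa
import pandas as pd

arrow_table = pa.Table.from_pydict({"a": [1, 2, None]})

# types_mapper is called with each Arrow DataType and may return a
# Pandas extension dtype (or None to fall back to the default mapping)
df = arrow_table.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get)

# >>> df.dtypes
# a    Int64
# dtype: object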
Conversion from Arrow to TileDB:
I.e. tiledbsoma.DataFrame.create
As noted in TileDB-SOMA PR 2868, we consult the Arrow schema's per-column nullable bit-flag, as well as the Arrow schema's per-table metadata (if present) in the form of {"column_name": "nullable"}
We do in all cases correctly map Arrow-table nullability into TileDB-array nullability -- this is not broken
We do not copy any Arrow-table metadata into TileDB-array metadata
In particular, we do not copy the "pandas" metadata from the input Arrow table's metadata to the output TileDB-SOMA DataFrame's metadata
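For reference, a minimal sketch of this write path (the URI is illustrative; a non-nullable soma_joinid index column is part of the schema):

import pyarrow as pa
import tiledbsoma

schema = pa.schema([
    pa.field("soma_joinid", pa.int64(), nullable=False),
    pa.field("a", pa.int64(), nullable=True),
])

# The per-column nullable bit-flag is mapped to TileDB-array nullability
with tiledbsoma.DataFrame.create("/tmp/example_sdf", schema=schema) as sdf:
    sdf.write(pa.Table.from_pydict({"soma_joinid": [0, 1, 2], "a": [1, 2, None]}))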
Conversion from TileDB to Arrow:
I.e. tiledbsoma.DataFrame.read
We do in all cases correctly map TileDB-array nullability to Arrow-table nullability -- this is not broken
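And a minimal sketch of this read path, continuing the write-path sketch above:

import tiledbsoma

with tiledbsoma.open("/tmp/example_sdf") as sdf:
    arrow_table = sdf.read().concat()

# >>> arrow_table.schema.field("a").nullable
# True   <-- conveyed from TileDB to Arrow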
Here is an example of the various Pandas-to-Arrow and Arrow-to-Pandas issues involved here:
import pyarrow as pa
import pandas as pd
# ----------------------------------------------------------------
# Pandas has nullable integer types:
pandas_df = pd.DataFrame({"a": pd.Series([1, 2, None], dtype=pd.Int64Dtype())})
# >>> pandas_df.dtypes
# a Int64
# dtype: object
# >>> pandas_df
# a
# 0 1
# 1 2
# 2 <NA>
# ----------------------------------------------------------------
# As does Arrow:
arrow_table = pa.Table.from_pydict({"a": [1, 2, None]})
# >>> arrow_table.schema
# a: int64
# >>> arrow_table.schema.field('a').nullable
# True
# >>> arrow_table
# pyarrow.Table
# a: int64
# ----
# a: [[1,2,null]]
# >>> arrow_table.schema.metadata
# (nothing prints)
# ----------------------------------------------------------------
# But the to_pandas method on Arrow tables typically does not preserve the nullable type:
pandas_from_arrow = arrow_table.to_pandas()
# >>> pandas_from_arrow.dtypes
# a float64
# dtype: object
# >>> pandas_from_arrow
# a
# 0 1.0
# 1 2.0
# 2 NaN
# ----------------------------------------------------------------
# But if that table was created from a Pandas dataframe, it will:
arrow_from_pandas = pa.Table.from_pandas(pandas_df)
# >>> arrow_from_pandas.schema
# a: int64
# -- schema metadata --
# pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 358
# >>> arrow_from_pandas.schema.metadata
# {b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 3, "step": 1}],
# "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type":
# "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a",
# "pandas_type": "int64", "numpy_type": "Int64", "metadata": null}], "creator": {"library": "pyarrow",
# "version": "19.0.0"}, "pandas_version": "2.2.3"}'}
# >>> arrow_from_pandas
# pyarrow.Table
# a: int64
# ----
# a: [[1,2,null]]
# >>> arrow_from_pandas.schema.field('a').nullable
# True
# ----------------------------------------------------------------
# Round-tripping:
pandas_from_arrow_from_pandas = arrow_from_pandas.to_pandas()
# >>> pandas_from_arrow_from_pandas.dtypes
# a Int64
# dtype: object
# >>> pandas_from_arrow_from_pandas
# a
# 0 1
# 1 2
# 2 <NA>
Specific issue
@ivirshup's issue is the following:
He has a Pandas dataframe with nullable-int columns
He converts that to Arrow, and the resulting Arrow table's nullability is correctly conveyed from Pandas to Arrow (as described above)
He ingests that to TileDB-SOMA, and the resulting soma dataframe's nullability is correctly conveyed from Arrow to TileDB (as described above)
A user later reads the TileDB-SOMA dataframe (as Arrow), and the resulting Arrow table's nullability is correctly conveyed from TileDB to Arrow (as described above)
The user then does .to_pandas() on that Arrow table, and the nullability is not correctly conveyed from Arrow to Pandas (as described above -- please see "THIS IS THE ISSUE" above)
This is not a TileDB-SOMA bug
Strictly speaking it's a Pandas/Arrow issue, but @ivirshup asserts it's likely to remain that way. Specifically, the current behavior is documented here, so it is probably not considered a bug.
Possible solutions
One solution considered
What we can do to help work around that Pandas/Arrow issue:
When we convert Arrow Table to soma DataFrame (at TileDB-SOMA create), copy over all Arrow-table metadata -- which will include the key-value pair with key b'pandas', whose value is a JSON-encoded string
When we convert TileDB-SOMA DataFrame to Arrow Table (at TileDB-SOMA read), copy all array metadata that does not start with "soma_" to the Arrow table we create
Then, when the user does .to_pandas() on the Arrow table they got from our .read() method, they'll get the intended nullabilities
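A minimal sketch of the read-side half of this, assuming a hypothetical array_metadata dict already holding the TileDB-array metadata (the real libtiledbsoma plumbing differs):

import pyarrow as pa

def attach_non_soma_metadata(arrow_table: pa.Table, array_metadata: dict) -> pa.Table:
    # Copy through all array metadata whose keys don't start with "soma_",
    # so that a later .to_pandas() can consult the b'pandas' key-value pair
    passthrough = {
        key: value
        for key, value in array_metadata.items()
        if not str(key).startswith("soma_")
    }
    # replace_schema_metadata returns a new table with the given schema metadata
    return arrow_table.replace_schema_metadata(passthrough)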
Other issues / questions:
One is: at TileDB-SOMA DataFrame.create, given a key named "pandas" (or "k"), should we have the TileDB-array metadata name also be "pandas" / "k" / etc.? Or maybe some prefix? (I don't care either way; we just need to decide)
A second one is: suppose there is data which did not come from Pandas -- e.g. an Arrow table constructed directly, as in the worked example above. In this case, the TileDB-array metadata will not have any metadata with key "pandas". Then, on our .read() method, we could do this:
If the array metadata has a key-value pair with key "pandas", then copy that key-value pair to the output Arrow Table
If the array metadata does not have a key-value pair with key "pandas", then we should create one
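For the "create one" case, an assumption-laden sketch of what synthesizing such metadata could look like -- it handles only int64 columns, and leans on pa.Table.from_pandas to generate the b'pandas' key rather than hand-rolling its JSON:

import pandas as pd
import pyarrow as pa

def synthesize_pandas_metadata(arrow_table: pa.Table) -> pa.Table:
    # Convert a zero-row slice to Pandas, mapping int64 to the nullable
    # Int64 extension dtype, then let pa.Table.from_pandas generate the
    # b'pandas' metadata for us
    empty_df = arrow_table.slice(0, 0).to_pandas(
        types_mapper={pa.int64(): pd.Int64Dtype()}.get
    )
    pandas_metadata = pa.Table.from_pandas(empty_df).schema.metadata
    return arrow_table.replace_schema_metadata(pandas_metadata)

# >>> synthesize_pandas_metadata(pa.Table.from_pydict({"a": [1, 2, None]})).to_pandas().dtypes
# a    Int64
# dtype: object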
Another solution considered
A much simpler proposal from @nguyenv:
When we convert Arrow Table to soma DataFrame (at TileDB-SOMA create), don't copy any metadata from Arrow to TileDB
When we convert TileDB-SOMA DataFrame to Arrow Table (at TileDB-SOMA read), don't copy any metadata from TileDB to Arrow
Benefits of this second solution:
See also
More about nullability tracking in TileDB-SOMA more broadly: #2858.
Findings from option 2
Experimental PR: #3645.
Findings: Doing as proposed above leads to many existing unit-test cases failing.
Context:
NumPy arrays (one option for columns in Pandas dataframes) cannot ever have nulls -- only NaN (other than the masked_array feature)
Pandas arrays (another option for columns in Pandas dataframes) can have true nulls (these print as NA) -- or of course they can also have NaN in floating-point columns
We have many unit-test cases that start with a pd.Series of type np.int64 (for example; other types as well), and these (variously) move data through Pandas -> Arrow -> TileDB-SOMA -> Arrow -> Pandas dataframe and do things like assert original_df == readback_df
On the PR, the point is to make things nullable -- so we make the Arrow-to-Pandas dtype be things like pd.Int64Dtype()
That makes round-tripping of nulls from Pandas -> Arrow -> TileDB -> Arrow -> Pandas possible, but it makes existing unit-test cases fail
Three possibilities:
Modify our unit-test cases to not do assert original_df == readback_df but rather some new helper function assert new_helper(original_df, readback_df), where new_helper will check everything except that it allows one side's dtype to be like np.int64 and the other's to be like pd.Int64Dtype() (see the sketch after this list)
This is a mod that only affects the read path: when we do TileDB -> Arrow table
Modify our unit-test cases to -- in the first place -- construct their pd.Series with types like pd.Int64Dtype(), and keep the same assert original_df == readback_df
This is a mod that only affects the read path: when we do TileDB -> Arrow table
Only produce a nullable column on TileDB -> Arrow if the input Arrow -> TileDB data was nullable
Requires that we store Arrow-table metadata as TileDB-array metadata
This is a mod that affects the write and read paths: when we do Arrow table -> TileDB, and TileDB -> Arrow table
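For the first possibility, a sketch of what the hypothetical new_helper could look like, built on pandas.testing (assuming dtype-insensitive value comparison is acceptable):

import pandas as pd

def new_helper(original_df: pd.DataFrame, readback_df: pd.DataFrame) -> bool:
    # Like original_df.equals(readback_df), but allows one side's dtype to
    # be np.int64 while the other's is pd.Int64Dtype()
    try:
        pd.testing.assert_frame_equal(original_df, readback_df, check_dtype=False)
        return True
    except AssertionError:
        return False

# Usage in a test:
# assert new_helper(original_df, readback_df)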