Overview
@ivirshup, working on the cellxgene-census builder, has some new dataframes which have nullable-int columns.
He finds that when such data are read from TileDB-SOMA to Arrow using arrow_table = soma_dataframe.read().concat(), nullability of the int columns is conveyed correctly, but when a user then does pandas_dataframe = arrow_table.to_pandas(), nullability of the int columns is lost. This is a Pandas/Arrow issue, but the current Arrow-to-Pandas behavior for nullable-int columns is documented here, so it is probably not considered a bug.
The ask here is for TileDB-SOMA to (proactively and defensively) work around this by enhancing the Arrow tables we create on soma_dataframe.read(), so that when users go on to convert those to Pandas format in their workflows, nullability of integer columns is conveyed from Arrow to Pandas as desired.
Context: nullability in Pandas, Arrow, and TileDB
There are Pandas dataframes, Arrow tables, and TileDB arrays. (It's easy to conflate the second and third since TileDB-SOMA uses Arrow schema in its data model, but there are tiledb-schema-to-arrow-schema and arrow-schema-to-tiledb-schema functions within the libtiledbsoma codebase.)
Nullability in Pandas:
You can't see a nullability flag by printing .dtypes on a Pandas dataframe; the dtype name itself (e.g. Int64 vs. int64) is the signal
If a Pandas dataframe has a nullable Pandas int column, you can see NA values in that column when you print the Pandas dataframe
If a Pandas dataframe has missing values in a non-nullable (NumPy-backed) int column, that column will have been coerced to float, and you can see NaN values in that column when you print the Pandas dataframe
Nullability in Arrow tables:
Within the nanoarrow API, nullability can be expressed using a bit-flag on a given column, or via Arrow-table metadata in the form of a key-value pair "columnname": "nullable"
If an Arrow table has a nullable column, you can see this one of two ways in the pyarrow API:
Print arrow_table.schema -- non-nullable fields will say not null; nullable fields won't
More explicitly, print out arrow_table.schema.field("field_name").nullable
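For instance, a minimal pyarrow sketch showing both checks (the field names are illustrative):

import pyarrow as pa

# Build a schema with one nullable and one non-nullable field
schema = pa.schema([
    pa.field("a", pa.int64(), nullable=True),
    pa.field("b", pa.int64(), nullable=False),
])

# >>> print(schema)
# a: int64
# b: int64 not null
# >>> schema.field("a").nullable
# True
# >>> schema.field("b").nullable
# False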
Nullability in TileDB arrays:
You can see these one of two ways:
Open the array with tiledb-py's tiledb.open and print A.schema and look for nullable=True or nullable=False
Open the dataframe using tiledbsoma and print sdf.schema -- this is Arrow schema, so, as above, non-nullable fields will say not null; nullable fields won't
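For instance, a minimal sketch of both views (the URI is hypothetical):

import tiledb
import tiledbsoma

uri = "/path/to/some/array"  # hypothetical

# tiledb-py view: per-attribute nullable=True / nullable=False
with tiledb.open(uri) as A:
    print(A.schema)

# tiledbsoma view: Arrow schema, so non-nullable fields print "not null"
with tiledbsoma.open(uri) as sdf:
    print(sdf.schema)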
Context: conversions
Conversion from Pandas to Arrow:
E.g. arrow_from_pandas = pa.Table.from_pandas(pandas_df)
The output Arrow table will have a metadata key b'pandas' with a JSON-encoded metadata value, like the one shown in the worked example below
Conversion from Arrow to Pandas:
E.g. pandas_from_arrow = arrow_table.to_pandas()
If the input Arrow table has a "pandas" metadata key-value pair then that is consulted: what should be an int column will be int, with null in it
If the input Arrow table does not have a "pandas" metadata key-value pair then the output Pandas dataframe that is produced will have float columns where there should be ints, with NaN in it -- THIS IS THE ISSUE (see below)
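As an aside, a user can force nullable dtypes on their side via to_pandas's types_mapper argument, with no metadata involved. A minimal sketch:

import pyarrow as pa
import pandas as pd

arrow_table = pa.Table.from_pydict({"a": [1, 2, None]})

# types_mapper is called with each Arrow DataType and may return a
# Pandas extension dtype (or None to fall back to the default mapping)
df = arrow_table.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get)

# >>> df.dtypes
# a    Int64
# dtype: object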
Conversion from Arrow to TileDB:
I.e. tiledbsoma.DataFrame.create
As noted in TileDB-SOMA PR 2868, we consult the Arrow schema's per-column nullable bit-flag, as well as the Arrow schema's per-table metadata (if present) in the form of {"column_name": "nullable"}
We do in all cases correctly map Arrow-table nullability into TileDB-array nullability -- this is not broken
We do not copy any Arrow-table metadata into TileDB-array metadata
In particular, we do not copy the "pandas" metadata from the input Arrow table's metadata to the output TileDB-SOMA DataFrame's metadata
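For reference, a minimal sketch of this write path (the URI is illustrative; a non-nullable soma_joinid index column is part of the schema):

import pyarrow as pa
import tiledbsoma

schema = pa.schema([
    pa.field("soma_joinid", pa.int64(), nullable=False),
    pa.field("a", pa.int64(), nullable=True),
])

# The per-column nullable bit-flag is mapped to TileDB-array nullability
with tiledbsoma.DataFrame.create("/tmp/example_sdf", schema=schema) as sdf:
    sdf.write(pa.Table.from_pydict({"soma_joinid": [0, 1, 2], "a": [1, 2, None]}))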
Conversion from TileDB to Arrow:
I.e. tiledbsoma.DataFrame.read
We do in all cases correctly map TileDB-array nullability to Arrow-table nullability -- this is not broken
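And a minimal sketch of this read path, continuing the write-path sketch above:

import tiledbsoma

with tiledbsoma.open("/tmp/example_sdf") as sdf:
    arrow_table = sdf.read().concat()

# >>> arrow_table.schema.field("a").nullable
# True   <-- conveyed from TileDB to Arrow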
Here is an example of the various Pandas-to-Arrow and Arrow-to-Pandas issues involved here:
import pyarrow as pa
import pandas as pd
# ----------------------------------------------------------------
# Pandas has nullable integer types:
pandas_df = pd.DataFrame({"a": pd.Series([1, 2, None], dtype=pd.Int64Dtype())})
# >>> pandas_df.dtypes
# a Int64
# dtype: object
# >>> pandas_df
# a
# 0 1
# 1 2
# 2 <NA>
# ----------------------------------------------------------------
# As does Arrow:
arrow_table = pa.Table.from_pydict({"a": [1, 2, None]})
# >>> arrow_table.schema
# a: int64
# >>> arrow_table.schema.field('a').nullable
# True
# >>> arrow_table
# pyarrow.Table
# a: int64
# ----
# a: [[1,2,null]]
# >>> arrow_table.schema.metadata
# (nothing prints)
# ----------------------------------------------------------------
# But the to_pandas method on Arrow tables typically does not preserve the nullable type:
pandas_from_arrow = arrow_table.to_pandas()
# >>> pandas_from_arrow.dtypes
# a float64
# dtype: object
# >>> pandas_from_arrow
# a
# 0 1.0
# 1 2.0
# 2 NaN
# ----------------------------------------------------------------
# But if that table was created from a Pandas dataframe, it will:
arrow_from_pandas = pa.Table.from_pandas(pandas_df)
# >>> arrow_from_pandas.schema
# a: int64
# -- schema metadata --
# pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 358
# >>> arrow_from_pandas.schema.metadata
# {b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 3, "step": 1}],
# "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type":
# "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a",
# "pandas_type": "int64", "numpy_type": "Int64", "metadata": null}], "creator": {"library": "pyarrow",
# "version": "19.0.0"}, "pandas_version": "2.2.3"}'}
# >>> arrow_from_pandas
# pyarrow.Table
# a: int64
# ----
# a: [[1,2,null]]
# >>> arrow_from_pandas.schema.field('a').nullable
# True
# ----------------------------------------------------------------
# Round-tripping:
pandas_from_arrow_from_pandas = arrow_from_pandas.to_pandas()
# >>> pandas_from_arrow_from_pandas.dtypes
# a Int64
# dtype: object
# >>> pandas_from_arrow_from_pandas
# a
# 0 1
# 1 2
# 2 <NA>
Specific issue
@ivirshup's issue is the following:
He has a Pandas dataframe with nullable-int columns
He converts that to Arrow, and the resulting Arrow table's nullability is correctly conveyed from Pandas to Arrow (as described above)
He ingests that to TileDB-SOMA, and the resulting soma dataframe's nullability is correctly conveyed from Arrow to TileDB (as described above)
A user later reads the TileDB-SOMA dataframe (as Arrow), and the resulting Arrow table's nullability is correctly conveyed from TileDB to Arrow (as described above)
The user then does .to_pandas() on that Arrow table, and the nullability is not correctly conveyed from Arrow to Pandas (as described above -- please see "THIS IS THE ISSUE" above)
This is not a TileDB-SOMA bug
Strictly speaking it's a Pandas/Arrow issue, but @ivirshup asserts it's likely to remain that way. Specifically, the current behavior is documented here, so it is probably not considered a bug.
Possible solutions
One solution considered
What we can do to help work around that Pandas/Arrow issue:
When we convert Arrow Table to soma DataFrame (at TileDB-SOMA create), copy over all Arrow-table metadata -- which will include the key-value pair with key b'pandas', whose value is a JSON-encoded string
When we convert TileDB-SOMA DataFrame to Arrow Table (at TileDB-SOMA read), copy all array metadata that does not start with "soma_" to the Arrow table we create
Then, when the user does .to_pandas() on the Arrow table they got from our .read() method, they'll get the intended nullabilities
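A minimal sketch of the read-side half of this, assuming a hypothetical array_metadata dict already holding the TileDB-array metadata (the real libtiledbsoma plumbing differs):

import pyarrow as pa

def attach_non_soma_metadata(arrow_table: pa.Table, array_metadata: dict) -> pa.Table:
    # Copy through all array metadata whose keys don't start with "soma_",
    # so that a later .to_pandas() can consult the b'pandas' key-value pair
    passthrough = {
        key: value
        for key, value in array_metadata.items()
        if not str(key).startswith("soma_")
    }
    # replace_schema_metadata returns a new table with the given schema metadata
    return arrow_table.replace_schema_metadata(passthrough)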
Other issues / questions:
One is: at TileDB-SOMA DataFrame.create, given a key named "pandas" (or "k"), should we have the TileDB-array metadata name also be "pandas" / "k" / etc.? Or maybe some prefix? (I don't care either way; we just need to decide)
A second one is: suppose there is data which did not come from Pandas -- e.g. an Arrow table constructed directly, as in the worked example above. In this case, the TileDB-array metadata will not have any metadata with key "pandas". Then, on our .read() method, we could do this:
If the array metadata has a key-value pair with key "pandas", then copy that key-value pair to the output Arrow Table
If the array metadata does not have a key-value pair with key "pandas", then we should create one
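For the "create one" case, an assumption-laden sketch of what synthesizing such metadata could look like -- it handles only int64 columns, and leans on pa.Table.from_pandas to generate the b'pandas' key rather than hand-rolling its JSON:

import pandas as pd
import pyarrow as pa

def synthesize_pandas_metadata(arrow_table: pa.Table) -> pa.Table:
    # Convert a zero-row slice to Pandas, mapping int64 to the nullable
    # Int64 extension dtype, then let pa.Table.from_pandas generate the
    # b'pandas' metadata for us
    empty_df = arrow_table.slice(0, 0).to_pandas(
        types_mapper={pa.int64(): pd.Int64Dtype()}.get
    )
    pandas_metadata = pa.Table.from_pandas(empty_df).schema.metadata
    return arrow_table.replace_schema_metadata(pandas_metadata)

# >>> synthesize_pandas_metadata(pa.Table.from_pydict({"a": [1, 2, None]})).to_pandas().dtypes
# a    Int64
# dtype: object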
Another solution considered
A much simpler proposal from @nguyenv:
When we convert Arrow Table to soma DataFrame (at TileDB-SOMA create), don't copy any metadata from Arrow to TileDB
When we convert TileDB-SOMA DataFrame to Arrow Table (at TileDB-SOMA read), don't copy any metadata from TileDB to Arrow
Benefits of this second solution:
See also
More about nullability tracking in TileDB-SOMA more broadly: #2858.
Findings from option 2
Experimental PR: #3645.
Findings: Doing as proposed above leads to many existing unit-test cases failing.
Context:
NumPy arrays (one option for columns in Pandas dataframes) cannot ever have nulls -- only NaN (other than the masked_array feature)
Pandas arrays (another option for columns in Pandas dataframes) can have true nulls (these print as NA) -- or of course they can also have NaN in floating-point columns
We have many unit-test cases that start with a pd.Series of type np.int64 (for example; other types as well), and these (variously) move data through Pandas -> Arrow -> TileDB-SOMA -> Arrow -> Pandas dataframe and do things like assert original_df == readback_df
On the PR, the point is to make things nullable -- so we make the Arrow-to-Pandas dtype be things like pd.Int64Dtype()
That makes round-tripping of nulls from Pandas -> Arrow -> TileDB -> Arrow -> Pandas possible, but it makes existing unit-test cases fail
Three possibilities:
Modify our unit-test cases to not do assert original_df == readback_df but rather some new helper function assert new_helper(original_df, readback_df), where new_helper will check everything except that it allows one side's dtype to be like np.int64 and the other's to be like pd.Int64Dtype() (see the sketch after this list)
This is a mod that only affects the read path: when we do TileDB -> Arrow table
Modify our unit-test cases to -- in the first place -- construct their pd.Series with types like pd.Int64Dtype(), and keep the same assert original_df == readback_df
This is a mod that only affects the read path: when we do TileDB -> Arrow table
Only produce a nullable column on TileDB -> Arrow if the input Arrow -> TileDB data was nullable
Requires that we store Arrow-table metadata as TileDB-array metadata
This is a mod that affects the write and read paths: when we do Arrow table -> TileDB, and TileDB -> Arrow table
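For the first possibility, a sketch of what the hypothetical new_helper could look like, built on pandas.testing (assuming dtype-insensitive value comparison is acceptable):

import pandas as pd

def new_helper(original_df: pd.DataFrame, readback_df: pd.DataFrame) -> bool:
    # Like original_df.equals(readback_df), but allows one side's dtype to
    # be np.int64 while the other's is pd.Int64Dtype()
    try:
        pd.testing.assert_frame_equal(original_df, readback_df, check_dtype=False)
        return True
    except AssertionError:
        return False

# Usage in a test:
# assert new_helper(original_df, readback_df)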