BUG: loc __setitem__ has incorrect behavior when assigned a DataFrame and new columns and duplicated columns are added. #58317

@sfc-gh-vbudati

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame(
    [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]], columns=["D", "B", "C", "A"]
)

item = pd.DataFrame(
    [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]],
    columns=["A", "B", "C", "X"],
    index=[
        3,  # 3 does not exist in the row key, so it will be skipped
        2,
        1,
    ],
)

df.loc[[True, False, True], ["B", "E", "B"]] = item

Issue Description

With pandas versions 2.2.0+, loc __setitem__ behaves incorrectly when a DataFrame is assigned to another DataFrame and the column key both adds new columns and repeats existing ones. The bug only appears when the column keys follow the pattern [existing column(s), non-existent column(s), duplicated existing column(s)]. In the example provided, "B" exists but "E" does not. The following loc __setitem__ operations reproduce it as well:

df.loc[[True, False, True], ["B", "E", 1, "B"]] = item
df.loc[[True, False, True], ["B", "E", 1, "B", "C", "X", "C", 2, "C"]] = item
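To make the triggering key pattern concrete, here is a minimal sketch of a check for it. The helper name `matches_trigger_pattern` is hypothetical and not part of pandas, and it encodes only one reading of the pattern described above (an existing label, then a new label, then a repeat of an existing label already seen); it has not been validated against the pandas internals.

```python
def matches_trigger_pattern(existing_columns, key):
    """Return True if `key` contains: an existing label, later a label that
    is not in `existing_columns`, and later still a repeat of an existing
    label that already appeared in `key`.

    Hypothetical helper: this is an interpretation of the pattern in this
    report, not pandas behavior.
    """
    existing = set(existing_columns)
    seen_existing = set()
    new_label_after_existing = False
    for label in key:
        if label in existing:
            if label in seen_existing and new_label_after_existing:
                return True
            seen_existing.add(label)
        elif seen_existing:
            # A non-existent label that follows at least one existing label.
            new_label_after_existing = True
    return False


columns = ["D", "B", "C", "A"]
print(matches_trigger_pattern(columns, ["B", "E", "B"]))
print(matches_trigger_pattern(columns, ["B", "B", "E"]))
```

All three keys used in this report satisfy the check; a key such as ["B", "B", "E"] does not, because the new label comes after the duplicate.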

Also note that in some cases the resulting DataFrame cannot be printed at all; attempting to print it raises the error below:

>>> df = pd.DataFrame(
...     [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]], columns=["D", "B", "C", "A"]
... )
>>> df
   D  B  C   A
0  1  2  3   4
1  4  5  6   7
2  7  8  9  10

>>> item = pd.DataFrame(
...     [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]],
...     columns=["A", "B", "C", "X"],
...     index=[
...         3,  # 3 does not exist in the row key, so it will be skipped
...         2,
...         1,
...     ],
... )
>>> item
   A  B  C   X
3  1  2  3   4
2  4  5  6   7
1  7  8  9  10

>>> df.loc[[True, False, True], ["B", "E", "B"]] = item
>>> df

# ERROR!
"""
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/core/frame.py", line 1203, in __repr__
    return self.to_string(**repr_params)
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/util/_decorators.py", line 333, in wrapper
    return func(*args, **kwargs)
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/core/frame.py", line 1383, in to_string
    return fmt.DataFrameRenderer(formatter).to_string(
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/format.py", line 962, in to_string
    string = string_formatter.to_string()
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/string.py", line 29, in to_string
    text = self._get_string_representation()
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/string.py", line 44, in _get_string_representation
    strcols = self._get_strcols()
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/string.py", line 35, in _get_strcols
    strcols = self.fmt.get_strcols()
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/format.py", line 476, in get_strcols
    strcols = self._get_strcols_without_index()
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/format.py", line 740, in _get_strcols_without_index
    fmt_values = self.format_col(i)
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/format.py", line 754, in format_col
    return format_array(
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/format.py", line 1161, in format_array
    return fmt_obj.get_result()
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/format.py", line 1194, in get_result
    fmt_values = self._format_strings()
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/io/formats/format.py", line 1250, in _format_strings
    & np.all(notna(vals), axis=tuple(range(1, len(vals.shape))))
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/core/dtypes/missing.py", line 457, in notna
    res = isna(obj)
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/core/dtypes/missing.py", line 178, in isna
    return _isna(obj)
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/core/dtypes/missing.py", line 207, in _isna
    return _isna_array(obj, inf_as_na=inf_as_na)
  File "/Users/vbudati/anaconda3/envs/pandas-dev-39/lib/python3.9/site-packages/pandas/core/dtypes/missing.py", line 300, in _isna_array
    result = np.isnan(values)
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
"""

Since the new df cannot be printed directly, I inspected it row by row via iloc __getitem__:

>>> df.iloc[0]
D      1
B    NaN
B    NaN
C      4
A     []
E    NaN
1    NaN
Name: 0, dtype: object

>>> df.iloc[1]
D      4
B    5.0
B    6.0
C      7
A     []
E    NaN
1    NaN
Name: 1, dtype: object

>>> df.iloc[2]
D      7
B    5.0
B    5.0
C     10
A     []
E    NaN
1    NaN
Name: 2, dtype: object
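The row-by-row inspection above can be complemented by a column-by-column summary that never calls the frame's repr. The sketch below uses a hypothetical helper, `summarize_columns`, and demonstrates it on a healthy frame with a duplicated label; whether positional access also succeeds on the corrupted frame from this report is an assumption that has not been verified, which is why each column read is guarded.

```python
import pandas as pd


def summarize_columns(frame):
    """Return (position, label, dtype, values) for each column.

    Columns are read by position with iloc, so duplicated labels are
    reported separately. Hypothetical helper for debugging only.
    """
    summary = []
    for pos, (label, dtype) in enumerate(zip(frame.columns, frame.dtypes)):
        try:
            values = frame.iloc[:, pos].tolist()
        except Exception as exc:  # keep going if one column is unreadable
            values = f"<unreadable: {type(exc).__name__}>"
        summary.append((pos, label, str(dtype), values))
    return summary


# Demo frame with a duplicated "B" label (not the corrupted frame itself).
demo = pd.DataFrame({"D": [1, 4], "B": [2.0, 5.0], "tmp": [3.0, 6.0]})
demo.columns = ["D", "B", "B"]
for entry in summarize_columns(demo):
    print(entry)
```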

The expected result is:

   D    B    B  C   A    E 
0  1  NaN  NaN  3   4  NaN
1  4  5.0  5.0  6   7  NaN 
2  7  5.0  5.0  9  10  NaN 

Notice column A: it originally held the values [4, 7, 10], as the expected result shows. In the faulty result, every value in A is replaced by [].

Expected Behavior

# Expected behavior: on pandas versions 2.1.x and below, the result is:
   D    B    B  C   A    E
0  1  NaN  NaN  3   4  NaN
1  4  5.0  5.0  6   7  NaN
2  7  5.0  5.0  9  10  NaN

# However, pandas versions 2.2.0+ error out.
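The comparison against the expected table can be automated. The following is a minimal sketch, not a proposed pandas test: `EXPECTED` is transcribed from the table in this report, `check_setitem_with_expansion` is a hypothetical name, and dtypes are deliberately not compared because the report does not state them.

```python
import numpy as np
import pandas as pd

# Expected frame, transcribed from the table in this report.
EXPECTED = pd.DataFrame(
    [
        [1, np.nan, np.nan, 3, 4, np.nan],
        [4, 5.0, 5.0, 6, 7, np.nan],
        [7, 5.0, 5.0, 9, 10, np.nan],
    ],
    columns=["D", "B", "B", "C", "A", "E"],
)


def check_setitem_with_expansion():
    """Run the reproducer and return (matches_expected, detail)."""
    df = pd.DataFrame(
        [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]],
        columns=["D", "B", "C", "A"],
    )
    item = pd.DataFrame(
        [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]],
        columns=["A", "B", "C", "X"],
        index=[3, 2, 1],
    )
    try:
        df.loc[[True, False, True], ["B", "E", "B"]] = item
        # Values and labels only; dtypes are an open question here.
        pd.testing.assert_frame_equal(df, EXPECTED, check_dtype=False)
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, "matches the expected frame"


ok, detail = check_setitem_with_expansion()
print(pd.__version__, ok, detail.splitlines()[0])
```

The outcome depends on the installed pandas version, so the script reports the version alongside the result instead of asserting either way.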

Installed Versions

INSTALLED VERSIONS
------------------
commit : bdc79c1
python : 3.9.18.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:12:49 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.2.1
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.3.1
Cython : None
pytest : 7.4.2
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.18.1
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Labels: Bug, Indexing, Needs Triage, setitem-with-expansion