BUG: read_csv not converting to float for python engine with decimal sep, usecols and parse_dates #38334

Merged (21 commits, Jan 3, 2021)
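The reported bug boils down to this: with the python engine, combining `decimal=","` with `usecols` and `parse_dates` left the numeric columns unconverted. A minimal repro using the same row as the PR's test (behavior shown is the post-fix one):

```python
from io import StringIO

import pandas as pd

# One row; "," is the decimal separator, so "-9,1" means -9.1,
# and 20101010 is a date to be parsed.
data = '"dump","-9,1","-9,1",20101010'

df = pd.read_csv(
    StringIO(data),
    engine="python",
    names=["col", "col1", "col2", "col3"],
    usecols=["col1", "col2", "col3"],
    parse_dates=["col3"],
    decimal=",",
)
# Before this fix, col1/col2 came back as object dtype (the raw "-9,1"
# strings); with the fix they are converted to float64.
print(df.dtypes)
```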
Changes from 6 commits
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
@@ -740,6 +740,7 @@ I/O
- Bug in :meth:`DataFrame.to_hdf` was not dropping missing rows with ``dropna=True`` (:issue:`35719`)
- Bug in :func:`read_html` was raising a ``TypeError`` when supplying a ``pathlib.Path`` argument to the ``io`` parameter (:issue:`37705`)
- :meth:`DataFrame.to_excel`, :meth:`Series.to_excel`, :meth:`DataFrame.to_markdown`, and :meth:`Series.to_markdown` now support writing to fsspec URLs such as S3 and Google Cloud Storage (:issue:`33987`)
+- Bug in :meth:`read_csv` returning object dtype when ``delimiter=","`` with ``usecols`` and ``parse_dates`` specified for ``engine="python"`` (:issue:`35873`)
Member: This will need to be removed from 1.2

Member Author: Hm, weird, must have missed that. Thanks.

- Bug in :func:`read_fwf` with ``skip_blank_lines=True`` was not skipping blank lines (:issue:`37758`)
- Parse missing values using :func:`read_json` with ``dtype=False`` to ``NaN`` instead of ``None`` (:issue:`28501`)
- :meth:`read_fwf` was inferring compression with ``compression=None`` which was not consistent with the other :meth:``read_*`` functions (:issue:`37909`)
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.3.0.rst
@@ -146,7 +146,7 @@ MultiIndex
I/O
^^^

-
+- Bug in :meth:`read_csv` returning object dtype when ``delimiter=","`` with ``usecols`` and ``parse_dates`` specified for ``engine="python"`` (:issue:`35873`)
-

Period
14 changes: 7 additions & 7 deletions pandas/io/parsers.py
@@ -2294,7 +2294,6 @@ def __init__(self, f: Union[FilePathOrBuffer, List], **kwds):

# Get columns in two steps: infer from data, then
# infer column indices from self.usecols if it is specified.
- self._col_indices = None
try:
(
self.columns,
@@ -2336,6 +2335,9 @@ def __init__(self, f: Union[FilePathOrBuffer, List], **kwds):
if self.index_names is None:
self.index_names = index_names

+ if not hasattr(self, "_col_indices"):
Contributor: Why can't we always define this way on L2310 or L2297?

Member Author: We do not have access to self.columns in L2297. Sometimes we set this in _infer_columns in L2302, so we have to check whether it was already set so we do not override it. Could move it up a bit, but we would have to check whether it exists anyway.

Contributor: Can you instead define _col_list = None then as the default?

Member Author: This would work, of course, but we would get the mypy problems again. Could do that nevertheless if that is preferable.

Contributor: Yes, I think it is important that this is always defined. You can use Optional[List[int]] and then assert it is not None.

Member Author: Optional[List[int]] unfortunately raises the mypy error too, and asserting it is not None does not help, so I added the ignores back in.

+ self._col_indices = list(range(len(self.columns)))

self._validate_parse_dates_presence(self.columns)
if self.parse_dates:
self._no_thousands_columns = self._set_no_thousands_columns()
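The review thread above debates how to guarantee that `_col_indices` is always defined. A toy sketch of the pattern the PR lands on (class and names are hypothetical, not the real pandas parser): start from None, let a usecols path set a subset, then fall back to "all columns" so the attribute always ends up a concrete list.

```python
from typing import List, Optional


class ParserSketch:
    """Hypothetical illustration of the _col_indices initialization debate."""

    def __init__(self, columns: List[str], usecols: Optional[List[str]] = None):
        self.columns = columns
        # Reviewer's suggestion: default the attribute up front ...
        self._col_indices: Optional[List[int]] = None
        if usecols is not None:
            # _handle_usecols-style path: record where the selected names
            # live in the full column list, in file order.
            self._col_indices = sorted(columns.index(c) for c in usecols)
        # ... then fall back to "all columns", so the attribute is always a
        # concrete list afterwards (the invariant the reviewer asks for).
        if self._col_indices is None:
            self._col_indices = list(range(len(self.columns)))


print(ParserSketch(["a", "b", "c"])._col_indices)               # [0, 1, 2]
print(ParserSketch(["a", "b", "c"], ["c", "a"])._col_indices)   # [0, 2]
```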
@@ -2359,7 +2361,7 @@ def _set(x):
if is_integer(x):
noconvert_columns.add(x)
else:
- noconvert_columns.add(self.columns.index(x))
+ noconvert_columns.add(self._col_indices[self.columns.index(x)])

if isinstance(self.parse_dates, list):
for val in self.parse_dates:
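The one-line change above matters because, once usecols filters the data, `self.columns` holds only the selected names: `columns.index(x)` gives a position within the filtered row, which must be mapped through `_col_indices` back to a position in the raw row. A hypothetical illustration:

```python
# All fields in the raw row vs. the usecols-filtered view (data hypothetical).
raw_row = ["dump", "-9,1", "-9,1", "20101010"]
columns = ["col1", "col2", "col3"]   # after usecols=["col1", "col2", "col3"]
col_indices = [1, 2, 3]              # where those names live in the raw row

x = "col3"
buggy = columns.index(x)                # 2 -> raw_row[2] is col2's value
fixed = col_indices[columns.index(x)]   # 3 -> raw_row[3] is col3's value
print(buggy, fixed)  # 2 3
```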
@@ -2709,7 +2711,6 @@ def _infer_columns(self):
# overwritten.
self._handle_usecols(columns, names)
else:
- self._col_indices = None
num_original_columns = len(names)
columns = [names]
else:
@@ -2791,7 +2792,7 @@ def _handle_usecols(self, columns, usecols_key):
[n for i, n in enumerate(column) if i in col_indices]
for column in columns
]
- self._col_indices = col_indices
+ self._col_indices = sorted(col_indices)
return columns
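Sorting here guards against usecols arriving in arbitrary order, for instance as a set, so the recorded indices always match the left-to-right order in which the filtered columns appear in the file. A small sketch (names hypothetical):

```python
# usecols may be a set, so iteration order is not the file order.
usecols = {"col3", "col1"}
raw_names = ["col", "col1", "col2", "col3"]

col_indices = sorted(raw_names.index(c) for c in usecols)
print(col_indices)  # [1, 3] no matter how the set iterates
```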

def _buffered_line(self):
@@ -3193,8 +3194,7 @@ def _rows_to_cols(self, content):
i < len(self.index_col)
# pandas\io\parsers.py:3159: error: Unsupported right
# operand type for in ("Optional[Any]") [operator]
- or i - len(self.index_col)  # type: ignore[operator]
- in self._col_indices
+ or i - len(self.index_col) in self._col_indices
Member: Since you do not ignore mypy checking anymore, the comments above are irrelevant. Maybe xref #37715?

Member Author: Thanks, that is a good point.

)
]
else:
@@ -3203,7 +3203,7 @@ def _rows_to_cols(self, content):
# operand type for in ("Optional[Any]") [operator]
a
for i, a in enumerate(zipped_content)
- if i in self._col_indices  # type: ignore[operator]
+ if i in self._col_indices
Member: Same here (comment above about the mypy error).

]
return zipped_content
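What the `_rows_to_cols` filtering above does, in miniature: transpose the parsed rows to column-major tuples, then keep only the columns whose index is in `_col_indices` (data hypothetical):

```python
# Rows as read, then transposed to columns and filtered by usecols indices.
rows = [["dump", "-9,1", "-9,1", "20101010"]]
col_indices = [1, 2, 3]

zipped_content = list(zip(*rows))  # column-major tuples
filtered = [a for i, a in enumerate(zipped_content) if i in col_indices]
print(filtered)  # [('-9,1',), ('-9,1',), ('20101010',)]
```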

18 changes: 17 additions & 1 deletion pandas/tests/io/parser/test_python_parser_only.py
@@ -12,7 +12,7 @@

from pandas.errors import ParserError

- from pandas import DataFrame, Index, MultiIndex
+ from pandas import DataFrame, Index, MultiIndex, Timestamp
import pandas._testing as tm


@@ -314,3 +314,19 @@ def test_malformed_skipfooter(python_parser_only):
msg = "Expected 3 fields in line 4, saw 5"
with pytest.raises(ParserError, match=msg):
parser.read_csv(StringIO(data), header=1, comment="#", skipfooter=1)


def test_delimiter_with_usecols_and_parse_dates(python_parser_only):
Contributor: Does this not work in the c-parser? Why?

Member Author: It works for c too; not quite sure why I tested only for python. Moved it.

    # GH#35873
    result = python_parser_only.read_csv(
        StringIO('"dump","-9,1","-9,1",20101010'),
        engine="python",
        names=["col", "col1", "col2", "col3"],
        usecols=["col1", "col2", "col3"],
        parse_dates=["col3"],
        decimal=",",
    )
    expected = DataFrame(
        {"col1": [-9.1], "col2": [-9.1], "col3": [Timestamp("2010-10-10")]}
    )
    tm.assert_frame_equal(result, expected)