BUG: read_excel with openpyxl produces trailing rows of nan #39547

Merged: 26 commits, Feb 8, 2021

rhshadrach
Member

This is on top of #39486

@jreback
Contributor

jreback commented Feb 1, 2021

why is this not an upstream bug? (not averse to fixing in pandas at least tactically).

@rhshadrach
Member Author

rhshadrach commented Feb 1, 2021

@jreback openpyxl docs:

Read-only mode relies on applications and libraries that created the file providing correct information about the worksheets, specifically the used part of it, known as the dimensions. Some applications set this incorrectly. You can check the apparent dimensions of a worksheet using ws.calculate_dimension(). If this returns a range that you know is incorrect, say A1:A1 then simply resetting the max_row and max_column attributes should allow you to work with the file:

We could fix this issue by calling sheet.calculate_dimension(force=True) but this involves a 50% performance hit (measured on a frame with 10000 rows and 10 columns). The method used here involves a 2% performance hit. Ref: #39001 (comment)
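
The single-pass trimming idea behind the cheaper approach can be sketched in plain Python. This is a minimal illustration, not the code merged in this PR: the helper name `trim_trailing_empty_rows` is hypothetical, and it assumes rows have already been converted to lists in which an empty cell is the empty string. In real use the rows would come from an openpyxl worksheet rather than a literal list.

```python
from typing import Iterable, List


def trim_trailing_empty_rows(rows: Iterable[list], empty="") -> List[list]:
    """Collect rows in one pass and drop the empty rows at the end.

    Hypothetical helper for illustration; `empty` is the assumed marker
    for a cell without data.
    """
    data = []
    last_row_with_data = -1
    for row_number, row in enumerate(rows):
        row = list(row)
        # A row counts as data if any cell differs from the empty marker.
        if any(cell != empty for cell in row):
            last_row_with_data = row_number
        data.append(row)
    # Slicing keeps interior blank rows and drops only the trailing ones.
    return data[: last_row_with_data + 1]


rows = [["a", "b"], ["", ""], [1, 2], ["", ""], ["", ""]]
trimmed = trim_trailing_empty_rows(rows)
```

Only the trailing blank rows are dropped; the blank row in the middle is kept. Tracking the last non-empty row while iterating avoids a second pass over the sheet.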

@jreback
Contributor

jreback commented Feb 1, 2021

ok! yeah, as long as it's robust, ok with this

@jreback jreback added the Bug, IO Excel (read_excel, to_excel), and Performance (memory or execution speed) labels Feb 1, 2021

data = data[: last_row_with_data + 1]

# With dimension reset, openpyxl no longer pads rows
max_width = max(len(data_row) for data_row in data)

With some xlsx files I get an error:

max() arg is an empty sequence

I am pretty sure this happens when data object is empty

I can't share the file but I can tell it starts with empty rows and empty columns.

Member Author

Yep - thanks for catching this!
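
The failure reported above can be reproduced without an Excel file: `max()` over an empty iterable raises `ValueError`. The sketch below shows one possible guard, using `max(..., default=0)`. The helper name `pad_rows` is hypothetical, and this is not necessarily the fix that was merged.

```python
def pad_rows(data, empty_cell=""):
    """Pad ragged rows to a common width; tolerate an empty `data`.

    Hypothetical helper for illustration only.
    """
    # default=0 keeps max() from raising when there are no rows at all.
    max_width = max((len(row) for row in data), default=0)
    return [row + [empty_cell] * (max_width - len(row)) for row in data]


# Without a guard, a sheet with no data rows triggers the reported error.
try:
    max(len(row) for row in [])
    raised = False
except ValueError:
    raised = True

padded = pad_rows([["a", "b", "c"], [1], []])
no_rows = pad_rows([])
```

An explicit `if data:` check around the padding step would work equally well; `default=0` just keeps the guard on one line.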

@rhshadrach rhshadrach marked this pull request as ready for review February 7, 2021 17:15
@rhshadrach rhshadrach mentioned this pull request Feb 7, 2021
@jreback jreback added this to the 1.2.2 milestone Feb 7, 2021
@jreback
Contributor

jreback commented Feb 7, 2021

lgtm. @simonjayhawkins prob worth doing for 1.2.2

@simonjayhawkins
Member

sure

@jreback
Contributor

jreback commented Feb 8, 2021

==================================== ERRORS ====================================
__________ ERROR collecting scripts/tests/test_validate_docstrings.py __________
ImportError while importing test module '/home/runner/work/pandas/pandas/scripts/tests/test_validate_docstrings.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/share/miniconda/envs/pandas-dev/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
scripts/tests/test_validate_docstrings.py:5: in <module>
    import validate_docstrings
scripts/validate_docstrings.py:50: in <module>
    from numpydoc.validate import validate, Docstring  # isort:skip
E   ImportError: cannot import name 'Docstring' from 'numpydoc.validate' (/usr/share/miniconda/envs/pandas-dev/lib/python3.8/site-packages/numpydoc/validate.py)
=========================== short test summary info ============================
ERROR scripts/tests/test_validate_docstrings.py
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
=============================== 1 error in 0.54s ===============================

hmm, is this a spurious issue on the CI / checks?

@jorisvandenbossche
Member

numpydoc failure is also happening on master, so unrelated

@jorisvandenbossche jorisvandenbossche merged commit 64e8720 into pandas-dev:master Feb 8, 2021
@simonjayhawkins
Member

@meeseeksdev backport 1.2.x

Successfully merging this pull request may close these issues.

BUG: ExcelWriter.book --> no member ->Slow Execution Time<-
5 participants