BUG: read_excel with openpyxl produces trailing rows of nan #39547

rhshadrach · 2021-02-01T22:43:57Z

closes BUG: ExcelWriter.book --> no member ->Slow Execution Time<- #39181
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

This is on top of #39486

jreback · 2021-02-01T22:54:47Z

why is this not an upstream bug? (not averse to fixing in pandas at least tactically).

rhshadrach · 2021-02-01T23:04:26Z

@jreback openpyxl docs:

Read-only mode relies on applications and libraries that created the file providing correct information about the worksheets, specifically the used part of it, known as the dimensions. Some applications set this incorrectly. You can check the apparent dimensions of a worksheet using ws.calculate_dimension(). If this returns a range that you know is incorrect, say A1:A1 then simply resetting the max_row and max_column attributes should allow you to work with the file:

We could fix this issue by calling sheet.calculate_dimension(force=True) but this involves a 50% performance hit (measured on a frame with 10000 rows and 10 columns). The method used here involves a 2% performance hit. Ref: #39001 (comment)
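
For readers following the thread, the dimension reset described in the openpyxl docs can be sketched roughly as below. This is an illustration, not the code in this PR: rows_after_dimension_reset and read_first_sheet are hypothetical helper names, and the sketch assumes openpyxl is installed and that the worksheet comes from a workbook opened with read_only=True.

```python
def rows_after_dimension_reset(ws):
    """Return all cell values of a read-only worksheet, row by row.

    Hypothetical helper: `ws` is assumed to be an openpyxl worksheet
    from a workbook opened with read_only=True.
    """
    # Discard the dimension record stored in the file so that iteration
    # follows the rows actually present, not a stale range such as "A1:A1".
    ws.reset_dimensions()
    return [list(row) for row in ws.iter_rows(values_only=True)]


def read_first_sheet(path):
    """Open `path` in read-only mode and read its first sheet (hypothetical helper)."""
    import openpyxl  # imported lazily; assumed to be installed

    wb = openpyxl.load_workbook(path, read_only=True)
    try:
        return rows_after_dimension_reset(wb.worksheets[0])
    finally:
        # Read-only workbooks keep the file handle open until closed.
        wb.close()
```

After a reset, rows may come back with different lengths and may include trailing empty rows, which is why the reader has to trim and pad the result itself.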

jreback · 2021-02-01T23:47:11Z

Quoting rhshadrach: "@jreback openpyxl docs: [...]"

ok! yeah, as long as it's robust, ok with this

Barben360 · 2021-02-03T10:59:34Z

pandas/io/excel/_openpyxl.py

+            data = data[: last_row_with_data + 1]
+
+            # With dimension reset, openpyxl no longer pads rows
+            max_width = max(len(data_row) for data_row in data)


With some xlsx files I get an error:

max() arg is an empty sequence

I am pretty sure this happens when the data object is empty.

I can't share the file but I can tell it starts with empty rows and empty columns.

rhshadrach · 2021-02-03

Yep - thanks for catching this!
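
The failure reported in this review can be reproduced without any Excel file. The sketch below follows the logic visible in the diff excerpt (last_row_with_data, max_width), but trim_and_pad is a hypothetical stand-alone function, and the empty-data guard is only one possible fix; the thread does not show the change that was actually committed.

```python
def trim_and_pad(rows, empty=""):
    """Drop trailing all-empty rows, then pad ragged rows to a common width."""
    last_row_with_data = -1
    for i, row in enumerate(rows):
        if any(cell != empty for cell in row):
            last_row_with_data = i

    # Trim trailing empty rows.
    data = [list(row) for row in rows[: last_row_with_data + 1]]

    # Guard: if nothing is left, max() below would raise
    # "ValueError: max() arg is an empty sequence".
    if not data:
        return data

    max_width = max(len(row) for row in data)
    return [row + [empty] * (max_width - len(row)) for row in data]


print(trim_and_pad([["a"], ["b", "c"], ["", ""]]))  # [['a', ''], ['b', 'c']]
print(trim_and_pad([[""], ["", ""]]))               # []
```

A sheet containing only empty cells yields an empty list here instead of raising.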

jreback · 2021-02-07T17:21:01Z

lgtm. @simonjayhawkins prob worth doing for 1.2.2

simonjayhawkins · 2021-02-07T17:21:46Z

sure

jreback · 2021-02-08T14:50:16Z

==================================== ERRORS ====================================
__________ ERROR collecting scripts/tests/test_validate_docstrings.py __________
ImportError while importing test module '/home/runner/work/pandas/pandas/scripts/tests/test_validate_docstrings.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/share/miniconda/envs/pandas-dev/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
scripts/tests/test_validate_docstrings.py:5: in <module>
    import validate_docstrings
scripts/validate_docstrings.py:50: in <module>
    from numpydoc.validate import validate, Docstring  # isort:skip
E   ImportError: cannot import name 'Docstring' from 'numpydoc.validate' (/usr/share/miniconda/envs/pandas-dev/lib/python3.8/site-packages/numpydoc/validate.py)
=========================== short test summary info ============================
ERROR scripts/tests/test_validate_docstrings.py
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
=============================== 1 error in 0.54s ===============================

hmm, is this a spurious issue on the CI / checks?

jorisvandenbossche · 2021-02-08T15:14:08Z

numpydoc failure is also happening on master, so unrelated

simonjayhawkins · 2021-02-08T15:15:46Z

@meeseeksdev backport 1.2.x

rhshadrach added 8 commits January 30, 2021 13:54

BUG: read_excel with openpyxl and missing dimension

2bcf35b

fixups

ea18d61

Added fixes for incorrect dimension information

d5215f7

Return "" for null date columns, trim empty trailing rows

d6c3af1

whatsnew

c51340c

Removed fix for 39181

8cd7aad

Merge branch 'master' of https://github.com/pandas-dev/pandas into op…

7ed5c36

…enpyxl_header (Conflicts: doc/source/whatsnew/v1.2.2.rst)

BUG: read_excel with openpyxl produces trailing rows of nan

b70b65d

rhshadrach mentioned this pull request Feb 1, 2021

BUG: read_excel with openpyxl and missing dimension #39486

Merged

5 tasks

Add test excel file

ea150b3

jreback added the Bug, IO Excel (read_excel, to_excel), and Performance (memory or execution speed performance) labels Feb 1, 2021

Barben360 reviewed Feb 3, 2021

rhshadrach added 12 commits February 3, 2021 18:30

REG: read_excel with engine specified raises on non-path/non-buffer

b887183

Merge branch 'master' of https://github.com/pandas-dev/pandas into op…

2291988

…enpyxl_workbook

Restore special-casing for xlrd.Book even when engine is None

601ad87

GH # in test

d835dff

Added wb.close() to test

e6684e9

Merge branch 'master' of https://github.com/pandas-dev/pandas into op…

1f16602

…enpyxl_workbook (Conflicts: doc/source/whatsnew/v1.2.2.rst, pandas/tests/io/excel/test_openpyxl.py)

Added logic/tests for determining if a sheet is read-only

ba2bc75

Added comment

1381ecc

Combine and reorg tests

a3db3eb

-

becd2cf

Merge branch 'openpyxl_workbook' of https://github.com/rhshadrach/pandas

41e8b81

into openpyxl_nans (Conflicts: doc/source/whatsnew/v1.2.2.rst, pandas/io/excel/_openpyxl.py, pandas/tests/io/excel/test_openpyxl.py)

Merge branch 'master' of https://github.com/pandas-dev/pandas into op…

d773418

…enpyxl_nans (Conflicts: pandas/io/excel/_openpyxl.py, pandas/tests/io/excel/test_openpyxl.py)

rhshadrach marked this pull request as ready for review February 7, 2021 17:15

Added xlsx test file

a3b6369

rhshadrach mentioned this pull request Feb 7, 2021

RLS: 1.2.2 #39295

Closed

jreback added this to the 1.2.2 milestone Feb 7, 2021

rhshadrach and others added 4 commits February 7, 2021 12:41

Merge branch 'master' of https://github.com/pandas-dev/pandas into op…

e9f6fa2

…enpyxl_nans (Conflicts: pandas/tests/io/excel/test_openpyxl.py)

xfail and improve tests

3e9bd4f

Merge branch 'master' of https://github.com/pandas-dev/pandas into op…

b5d662b

…enpyxl_nans

Merge branch 'master' into openpyxl_nans

8773c32

jorisvandenbossche merged commit 64e8720 into pandas-dev:master Feb 8, 2021

meeseeksmachine mentioned this pull request Feb 8, 2021

Backport PR #39547 on branch 1.2.x (BUG: read_excel with openpyxl produces trailing rows of nan) #39679

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Feb 8, 2021

Backport PR pandas-dev#39547: BUG: read_excel with openpyxl produces …

74842a5

…trailing rows of nan

rhshadrach deleted the openpyxl_nans branch February 8, 2021 16:37

jreback pushed a commit that referenced this pull request Feb 8, 2021

Backport PR #39547: BUG: read_excel with openpyxl produces trailing r…

a354a5c

…ows of nan (#39679) Co-authored-by: Richard Shadrach

This was referenced Apr 26, 2021

BUG: read_excel blows the memory when using openpyxl engine #40569

Closed

BUG: some read_excel engines still load trailing blank cells #41167

Closed
