Better inference of spreadsheet formats. #38522

cjw296 · 2020-12-16T13:09:26Z

See:

Discussion happening on #38424, code review happening here ;-)

cjw296 · 2020-12-16T13:23:53Z

pandas/tests/io/excel/test_readers.py

@@ -624,7 +624,6 @@ def test_read_from_http_url(self, read_ext):
        local_table = pd.read_excel("test1" + read_ext)
        tm.assert_frame_equal(url_table, local_table)

-    @td.skip_if_not_us_locale


Does anyone know what the intention was in this skip? I'm in the UK and wanted to make sure I hadn't broken these tests. I couldn't think of any reason why this needed to be in, but have removed it in a separate commit which I have no problem dropping out of this PR if people would prefer.

This was added in #21814. But no idea why (the test doesn't seem to have anything locale specific (like eg date parsing)). But we can see what CI says.
cc @TomAugspurger

cjw296 · 2020-12-16T13:25:06Z

pandas/tests/io/excel/test_readers.py

            pytest.skip(f"Skipped for engine: {engine}")

-        actual = pd.read_excel(basename + read_ext)
+        actual = pd.read_excel(basename + ".ods")


We can now run these tests on the None engine since read_excel will now correctly infer the odf engine rather than falling through to xlrd.

cjw296 · 2020-12-16T13:39:03Z

Why is {repr(foo)} preferred (insisted on by commit hook...) to {foo!r}?

This uses more reliable content introspection that results in better engine selection. It also removes situations where xlrd gets handed a file it no longer supports.

pandas/io/excel/_base.py

jreback · 2020-12-16T13:42:49Z

Why is {repr(foo)} preferred (insisted on by commit hook...) to {foo!r}?

I don't think we have settled on this explicitly, likely we have a mix of these in the codebase

jreback · 2020-12-16T13:43:19Z

cc @pandas-dev/pandas-core

pep8speaks · 2020-12-16T13:45:43Z

Hello @cjw296! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-12-16 19:16:29 UTC

cjw296 · 2020-12-16T13:46:14Z

I don't think we have settled on this explicitly, likely we have a mix of these in the codebase

I don't feel strongly, so have changed it in this PR, but was curious as to the reasoning :-)

cjw296 · 2020-12-16T13:55:40Z

(I'll squash the commits down to semantics before merging - just easier to review if I leave them unsquashed for now)

jorisvandenbossche · 2020-12-16T13:57:43Z

(I'll squash the commits down to semantics before merging - just easier to review if I leave them unsquashed for now)

Not need to do so, we squash on merge

jreback · 2020-12-16T15:37:58Z

appears we are now picking up xlrd 2.0.1 and causing some builds to fail.

cjw296 · 2020-12-16T16:02:21Z

@jreback - that should be fine on this branch, I'll dig into the CI failures once there's consensus on the approach.

jreback · 2020-12-16T16:11:59Z

@jreback - that should be fine on this branch, I'll dig into the CI failures once there's consensus on the approach.

yeah i think its just 1 or 2 tests where we maybe are not safe importing

pandas/io/excel/_base.py

bashtage · 2020-12-16T16:18:31Z

pandas/io/excel/_base.py

-    is_ods : bool
-        Boolean indication that this is indeed an ODS file or not
+    content : bytes
+        The bytes founds.
    """
    stream.seek(0)


What if stream doesn't support seek? I think this is possible with compressed files.

I have a feeling we'd have seen reports of this either in the pandas tracker or xlrd's tracker, since, unless I'm mistaken this type of seeking has been used in xlrd for many years now.

jorisvandenbossche

Just to be clear: before this PR can be considered to merge, we first need to come to a decision on #38424 (comment)

jorisvandenbossche

Thanks a lot for this @cjw296 !

Just to be clear: I think we can keep almost all of these changes, even if we decide in #38424 to keep the xlrd-fallback-with-warning, right?
In such a case, we would need to distinguish between the "inferred" engine and the user specified one. And if the inferred one is openpyxl, but openpyxl is not installed, we can fallback to xlrd with a clear warning.

(note, not saying that you should (already) implement this, but want to make sure that it's clear this PR is useful in any case, and does not strictly depend on not having the fallback)

jorisvandenbossche · 2020-12-16T19:41:32Z

pandas/io/excel/_base.py

-                if ext == ".ods":
-                    engine = "odf"
+                handles = get_handle(
+                    stringify_path(path_or_buffer),


Is the stringigy_path here needed? (I would assume that get_handle handles that)

yes, stringify_path is not needed for get_handle

I believe the user expansion that stringify_path is needed here.

jorisvandenbossche · 2020-12-16T19:46:51Z

pandas/tests/io/excel/test_readers.py

@@ -624,7 +624,6 @@ def test_read_from_http_url(self, read_ext):
        local_table = pd.read_excel("test1" + read_ext)
        tm.assert_frame_equal(url_table, local_table)

-    @td.skip_if_not_us_locale


This was added in #21814. But no idea why (the test doesn't seem to have anything locale specific (like eg date parsing)). But we can see what CI says.
cc @TomAugspurger

jorisvandenbossche · 2020-12-16T19:56:17Z

In such a case, we would need to distinguish between the "inferred" engine and the user specified one. And if the inferred one is openpyxl, but openpyxl is not installed, we can fallback to xlrd with a clear warning.

Thinking further on it: we should maybe do this anyway, even when not doing the fallback+warning. We might want to give a custom error message explaining that xlrd is no longer maintained / does no longer support files other than .xls, and so you now need to install another dependency, i.e. openpyxl.
Although the error message ("Missing optional dependency: openpyxl") is clear, it's missing the "why" (since people have always been able to read their excel files with xlrd up to now)

jreback · 2020-12-17T22:44:14Z

@c123w can you see if can get passing (also merge master).

jreback · 2020-12-18T17:08:15Z

i hate to block on this for 1.2, we can always do this in 1.2.1.

@jorisvandenbossche @simonjayhawkins

cjw296 · 2020-12-19T07:24:21Z

@jreback - when you tagged @c123w, did you mean to tag me?

jorisvandenbossche · 2020-12-21T15:52:00Z

Just to be clear: I think we can keep almost all of these changes, even if we decide in #38424 to keep the xlrd-fallback-with-warning, right?
In such a case, we would need to distinguish between the "inferred" engine and the user specified one. And if the inferred one is openpyxl, but openpyxl is not installed, we can fallback to xlrd with a clear warning.

@cjw296 assuming we keep the decision of have the fallback+warning, do you want to update this PR to reflect that behaviour? Or, would you be OK with someone of us pushing some changes to this PR for that?

cjw296 · 2020-12-21T17:26:31Z

I won't stop you, I'll just be really disappointed. https://github.com/pandas-dev/pandas/pull/38522/files#diff-63200ddb7f5656b8ee868a28d9cb7720ffe50689b0e3fb0b4e15cc5c0ae80dd7R1065 seems like an obvious place to insert an else clause that can look at the infered engine and do a set intersection with the libraries installed. It could also be commented such that it can be cleanly excised at the absolute earliest opportunity.

If you go this route, please ensure you put a LOUD WARNING (if only we had exceptions that could be really loud warnings, eh?) that using xlrd for xlsx files is explicitly unsupported and that under no circumstances should issues, PRs or other completely inappropriate comments be added to the xlrd repository.

I would also appreciate if this warning is included in the "what's news", particularly the part about being explicitly unsupported and exposing users to potentially dangerous security issues also with the most verbose form of the above warning about raising upstream issues.

jorisvandenbossche · 2020-12-21T22:52:08Z

If you go this route, please ensure you put a LOUD WARNING

Again, that is already what we do on master (except for a few corner cases, as discussed in #38424, and those can be fixed).
It's also already included on top of the whatsnew page. You actually have a PR to improve the wording in the warning messages, so if you have further suggestions about the precise message, feel free to further update that.

And I am sorry that you get such backlash on it.
(but, it's also a bit annoying situation right now with the stable pandas release still defaulting to xlrd even for xlsx files (which is not the fault ox xlrd of course, but influences the experience of pandas users). So the 1.2 release will be a big help to improve that in any case regardless of the discussion above).

cjw296 · 2020-12-22T12:00:22Z

@jorisvandenbossche - the frustration is that I advertised the deprecation of this package for non-xls files 4 years ago and nothing has been done... Now I try and help here, and the same lukewarm attitude continues.

jorisvandenbossche · 2020-12-22T22:08:59Z

And I understand that frustration. We also acted on it too slowly in pandas (although we haven't been aware of it for 4 years, I think, the main issue where the discussion happened was only opened last year: #28547). But you also have to understand that the average pandas user was not aware of this. They simply used pd.read_excel, and it worked (I suppose most pandas users wouldn't even know what xlrd is, let alone know that it was no longer maintained).

It's always difficult to communicate with users ("nowbody reads the docs"), and so a main way of communicating future changes that we use in pandas is through warnings.

cjw296 · 2020-12-22T22:43:02Z

Okay, but as I see it, to get to pandas 1.2 you will have done one of two things:

set up a new environment.
upgraded pandas.

In either case, a clear exception that says "you need to install package x" would be the cleanest and quickest way to get people using the best software.

"nobody looks at warnings that doesn't stop their code working" is sadly as true as "nobody reads the docs", especially given the poor use of warnings in cpython itself over the last few years.

gfyoung · 2020-12-22T22:46:16Z

"nobody looks at warnings that doesn't stop their code working" is sadly as true as "nobody reads the docs", especially given the poor use of warnings in cpython itself over the last few years.

Warnings have done decently well for us empirically speaking. If they don't want to heed the warnings and upgrade, then that is on them because they will have had sufficient time to consume and prepare.

Just like we unfortunately did not see your warnings (that went above and beyond BTW) about xlrd, we now have to deal with the ramifications of being slow to upgrade and switching to openpyxl.

jreback · 2020-12-23T14:11:05Z

closing in favor of #38571

This was referenced Dec 16, 2020

DOC: update wording about when xlrd engine can be used #38456

Merged

shift default excel read engine from xlrd to openpyxl #38424

Closed

cjw296 commented Dec 16, 2020

View reviewed changes

cjw296 force-pushed the infer_formats branch from d5a2753 to cb81e05 Compare December 16, 2020 13:31

jreback added the IO Excel read_excel, to_excel label Dec 16, 2020

cjw296 added 3 commits December 16, 2020 13:39

Re-work engine guessing in ExcelFile / read_excel.

a2ce319

This uses more reliable content introspection that results in better engine selection. It also removes situations where xlrd gets handed a file it no longer supports.

no reason to skip these in other locales

4266368

cleanup imports

998a778

cjw296 force-pushed the infer_formats branch from cb81e05 to 998a778 Compare December 16, 2020 13:40

jreback reviewed Dec 16, 2020

View reviewed changes

pandas/io/excel/_base.py Outdated Show resolved Hide resolved

pandas/io/excel/_base.py Show resolved Hide resolved

jreback added this to the 1.2 milestone Dec 16, 2020

cjw296 added 2 commits December 16, 2020 13:45

add param missed from docstring

29d1dd0

rename to _peek_stream

163285c

cjw296 added 2 commits December 16, 2020 13:47

line width

de99f1a

:rageface:

9aea6b9

jreback mentioned this pull request Dec 16, 2020

CI: pin xlrd <2 for now #38524

Closed

jreback reviewed Dec 16, 2020

View reviewed changes

pandas/io/excel/_base.py Show resolved Hide resolved

bashtage reviewed Dec 16, 2020

View reviewed changes

test for non-existent file

21b43e1

jorisvandenbossche requested changes Dec 16, 2020

View reviewed changes

jorisvandenbossche reviewed Dec 16, 2020

View reviewed changes

rhshadrach mentioned this pull request Dec 18, 2020

DEPR: Adjust read excel behavior for xlrd >= 2.0 #38571

Merged

5 tasks

jreback closed this Dec 23, 2020

cjw296 mentioned this pull request Oct 29, 2022

CLN: Remove xlrd < 2.0 code #49376

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better inference of spreadsheet formats. #38522

Better inference of spreadsheet formats. #38522

cjw296 commented Dec 16, 2020 •

edited

Loading

cjw296 Dec 16, 2020

jorisvandenbossche Dec 16, 2020

cjw296 Dec 16, 2020

cjw296 commented Dec 16, 2020

jreback commented Dec 16, 2020

jreback commented Dec 16, 2020

pep8speaks commented Dec 16, 2020 •

edited

Loading

cjw296 commented Dec 16, 2020

cjw296 commented Dec 16, 2020

jorisvandenbossche commented Dec 16, 2020

jreback commented Dec 16, 2020

cjw296 commented Dec 16, 2020

jreback commented Dec 16, 2020

bashtage Dec 16, 2020

cjw296 Dec 16, 2020

jorisvandenbossche left a comment

jorisvandenbossche left a comment

jorisvandenbossche Dec 16, 2020

twoertwein Dec 18, 2020

cjw296 Dec 19, 2020

jorisvandenbossche Dec 16, 2020

jorisvandenbossche commented Dec 16, 2020

jreback commented Dec 17, 2020

jreback commented Dec 18, 2020

cjw296 commented Dec 19, 2020

jorisvandenbossche commented Dec 21, 2020

cjw296 commented Dec 21, 2020

jorisvandenbossche commented Dec 21, 2020

cjw296 commented Dec 22, 2020

jorisvandenbossche commented Dec 22, 2020

cjw296 commented Dec 22, 2020

gfyoung commented Dec 22, 2020 •

edited

Loading

jreback commented Dec 23, 2020

Better inference of spreadsheet formats. #38522

Better inference of spreadsheet formats. #38522

Conversation

cjw296 commented Dec 16, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cjw296 commented Dec 16, 2020

jreback commented Dec 16, 2020

jreback commented Dec 16, 2020

pep8speaks commented Dec 16, 2020 • edited Loading

Comment last updated at 2020-12-16 19:16:29 UTC

cjw296 commented Dec 16, 2020

cjw296 commented Dec 16, 2020

jorisvandenbossche commented Dec 16, 2020

jreback commented Dec 16, 2020

cjw296 commented Dec 16, 2020

jreback commented Dec 16, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche left a comment

Choose a reason for hiding this comment

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Dec 16, 2020

jreback commented Dec 17, 2020

jreback commented Dec 18, 2020

cjw296 commented Dec 19, 2020

jorisvandenbossche commented Dec 21, 2020

cjw296 commented Dec 21, 2020

jorisvandenbossche commented Dec 21, 2020

cjw296 commented Dec 22, 2020

jorisvandenbossche commented Dec 22, 2020

cjw296 commented Dec 22, 2020

gfyoung commented Dec 22, 2020 • edited Loading

jreback commented Dec 23, 2020

cjw296 commented Dec 16, 2020 •

edited

Loading

pep8speaks commented Dec 16, 2020 •

edited

Loading

gfyoung commented Dec 22, 2020 •

edited

Loading