Skip to content

Better inference of spreadsheet formats. #38522

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from

Conversation

cjw296
Copy link
Contributor

@cjw296 cjw296 commented Dec 16, 2020

See:

#38424 (comment)
#38456

Discussion happening on #38424, code review happening here ;-)

@@ -624,7 +624,6 @@ def test_read_from_http_url(self, read_ext):
local_table = pd.read_excel("test1" + read_ext)
tm.assert_frame_equal(url_table, local_table)

@td.skip_if_not_us_locale
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does anyone know what the intention was in this skip? I'm in the UK and wanted to make sure I hadn't broken these tests. I couldn't think of any reason why this needed to be in, but have removed it in a separate commit which I have no problem dropping out of this PR if people would prefer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added in #21814. But no idea why (the test doesn't seem to have anything locale specific (like eg date parsing)). But we can see what CI says.
cc @TomAugspurger

pytest.skip(f"Skipped for engine: {engine}")

actual = pd.read_excel(basename + read_ext)
actual = pd.read_excel(basename + ".ods")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can now run these tests on the None engine since read_excel will now correctly infer the odf engine rather than falling through to xlrd.

@jreback jreback added the IO Excel read_excel, to_excel label Dec 16, 2020
@cjw296
Copy link
Contributor Author

cjw296 commented Dec 16, 2020

Why is {repr(foo)} preferred (insisted on by commit hook...) to {foo!r}?

This uses more reliable content introspection that results in better engine selection.
It also removes situations where xlrd gets handed a file it no longer supports.
@jreback
Copy link
Contributor

jreback commented Dec 16, 2020

Why is {repr(foo)} preferred (insisted on by commit hook...) to {foo!r}?

I don't think we have settled on this explicitly, likely we have a mix of these in the codebase

@jreback jreback added this to the 1.2 milestone Dec 16, 2020
@jreback
Copy link
Contributor

jreback commented Dec 16, 2020

cc @pandas-dev/pandas-core

@pep8speaks
Copy link

pep8speaks commented Dec 16, 2020

Hello @cjw296! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-12-16 19:16:29 UTC

@cjw296
Copy link
Contributor Author

cjw296 commented Dec 16, 2020

I don't think we have settled on this explicitly, likely we have a mix of these in the codebase

I don't feel strongly, so have changed it in this PR, but was curious as to the reasoning :-)

@cjw296
Copy link
Contributor Author

cjw296 commented Dec 16, 2020

(I'll squash the commits down to semantics before merging - just easier to review if I leave them unsquashed for now)

@jorisvandenbossche
Copy link
Member

(I'll squash the commits down to semantics before merging - just easier to review if I leave them unsquashed for now)

Not need to do so, we squash on merge

@jreback
Copy link
Contributor

jreback commented Dec 16, 2020

appears we are now picking up xlrd 2.0.1 and causing some builds to fail.

@cjw296
Copy link
Contributor Author

cjw296 commented Dec 16, 2020

@jreback - that should be fine on this branch, I'll dig into the CI failures once there's consensus on the approach.

@jreback
Copy link
Contributor

jreback commented Dec 16, 2020

@jreback - that should be fine on this branch, I'll dig into the CI failures once there's consensus on the approach.

yeah i think its just 1 or 2 tests where we maybe are not safe importing

is_ods : bool
Boolean indication that this is indeed an ODS file or not
content : bytes
The bytes founds.
"""
stream.seek(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if stream doesn't support seek? I think this is possible with compressed files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a feeling we'd have seen reports of this either in the pandas tracker or xlrd's tracker, since, unless I'm mistaken this type of seeking has been used in xlrd for many years now.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear: before this PR can be considered to merge, we first need to come to a decision on #38424 (comment)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this @cjw296 !

Just to be clear: I think we can keep almost all of these changes, even if we decide in #38424 to keep the xlrd-fallback-with-warning, right?
In such a case, we would need to distinguish between the "inferred" engine and the user specified one. And if the inferred one is openpyxl, but openpyxl is not installed, we can fallback to xlrd with a clear warning.

(note, not saying that you should (already) implement this, but want to make sure that it's clear this PR is useful in any case, and does not strictly depend on not having the fallback)

if ext == ".ods":
engine = "odf"
handles = get_handle(
stringify_path(path_or_buffer),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the stringigy_path here needed? (I would assume that get_handle handles that)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, stringify_path is not needed for get_handle

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the user expansion that stringify_path is needed here.

@@ -624,7 +624,6 @@ def test_read_from_http_url(self, read_ext):
local_table = pd.read_excel("test1" + read_ext)
tm.assert_frame_equal(url_table, local_table)

@td.skip_if_not_us_locale
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added in #21814. But no idea why (the test doesn't seem to have anything locale specific (like eg date parsing)). But we can see what CI says.
cc @TomAugspurger

@jorisvandenbossche
Copy link
Member

In such a case, we would need to distinguish between the "inferred" engine and the user specified one. And if the inferred one is openpyxl, but openpyxl is not installed, we can fallback to xlrd with a clear warning.

Thinking further on it: we should maybe do this anyway, even when not doing the fallback+warning. We might want to give a custom error message explaining that xlrd is no longer maintained / does no longer support files other than .xls, and so you now need to install another dependency, i.e. openpyxl.
Although the error message ("Missing optional dependency: openpyxl") is clear, it's missing the "why" (since people have always been able to read their excel files with xlrd up to now)

@jreback
Copy link
Contributor

jreback commented Dec 17, 2020

@c123w can you see if can get passing (also merge master).

@jreback
Copy link
Contributor

jreback commented Dec 18, 2020

i hate to block on this for 1.2, we can always do this in 1.2.1.

@jorisvandenbossche @simonjayhawkins

@cjw296
Copy link
Contributor Author

cjw296 commented Dec 19, 2020

@jreback - when you tagged @c123w, did you mean to tag me?

@jorisvandenbossche
Copy link
Member

Just to be clear: I think we can keep almost all of these changes, even if we decide in #38424 to keep the xlrd-fallback-with-warning, right?
In such a case, we would need to distinguish between the "inferred" engine and the user specified one. And if the inferred one is openpyxl, but openpyxl is not installed, we can fallback to xlrd with a clear warning.

@cjw296 assuming we keep the decision of have the fallback+warning, do you want to update this PR to reflect that behaviour? Or, would you be OK with someone of us pushing some changes to this PR for that?

@cjw296
Copy link
Contributor Author

cjw296 commented Dec 21, 2020

I won't stop you, I'll just be really disappointed. https://github.com/pandas-dev/pandas/pull/38522/files#diff-63200ddb7f5656b8ee868a28d9cb7720ffe50689b0e3fb0b4e15cc5c0ae80dd7R1065 seems like an obvious place to insert an else clause that can look at the infered engine and do a set intersection with the libraries installed. It could also be commented such that it can be cleanly excised at the absolute earliest opportunity.

If you go this route, please ensure you put a LOUD WARNING (if only we had exceptions that could be really loud warnings, eh?) that using xlrd for xlsx files is explicitly unsupported and that under no circumstances should issues, PRs or other completely inappropriate comments be added to the xlrd repository.

I would also appreciate if this warning is included in the "what's news", particularly the part about being explicitly unsupported and exposing users to potentially dangerous security issues also with the most verbose form of the above warning about raising upstream issues.

@jorisvandenbossche
Copy link
Member

If you go this route, please ensure you put a LOUD WARNING

Again, that is already what we do on master (except for a few corner cases, as discussed in #38424, and those can be fixed).
It's also already included on top of the whatsnew page. You actually have a PR to improve the wording in the warning messages, so if you have further suggestions about the precise message, feel free to further update that.

And I am sorry that you get such backlash on it.
(but, it's also a bit annoying situation right now with the stable pandas release still defaulting to xlrd even for xlsx files (which is not the fault ox xlrd of course, but influences the experience of pandas users). So the 1.2 release will be a big help to improve that in any case regardless of the discussion above).

@cjw296
Copy link
Contributor Author

cjw296 commented Dec 22, 2020

@jorisvandenbossche - the frustration is that I advertised the deprecation of this package for non-xls files 4 years ago and nothing has been done... Now I try and help here, and the same lukewarm attitude continues.

@jorisvandenbossche
Copy link
Member

And I understand that frustration. We also acted on it too slowly in pandas (although we haven't been aware of it for 4 years, I think, the main issue where the discussion happened was only opened last year: #28547). But you also have to understand that the average pandas user was not aware of this. They simply used pd.read_excel, and it worked (I suppose most pandas users wouldn't even know what xlrd is, let alone know that it was no longer maintained).

It's always difficult to communicate with users ("nowbody reads the docs"), and so a main way of communicating future changes that we use in pandas is through warnings.

@cjw296
Copy link
Contributor Author

cjw296 commented Dec 22, 2020

Okay, but as I see it, to get to pandas 1.2 you will have done one of two things:

  • set up a new environment.
  • upgraded pandas.

In either case, a clear exception that says "you need to install package x" would be the cleanest and quickest way to get people using the best software.

"nobody looks at warnings that doesn't stop their code working" is sadly as true as "nobody reads the docs", especially given the poor use of warnings in cpython itself over the last few years.

@gfyoung
Copy link
Member

gfyoung commented Dec 22, 2020

"nobody looks at warnings that doesn't stop their code working" is sadly as true as "nobody reads the docs", especially given the poor use of warnings in cpython itself over the last few years.

Warnings have done decently well for us empirically speaking. If they don't want to heed the warnings and upgrade, then that is on them because they will have had sufficient time to consume and prepare.

Just like we unfortunately did not see your warnings (that went above and beyond BTW) about xlrd, we now have to deal with the ramifications of being slow to upgrade and switching to openpyxl.

@jreback
Copy link
Contributor

jreback commented Dec 23, 2020

closing in favor of #38571

@jreback jreback closed this Dec 23, 2020
@cjw296 cjw296 mentioned this pull request Oct 29, 2022
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants