Better inference of spreadsheet formats. #38522
@@ -624,7 +624,6 @@ def test_read_from_http_url(self, read_ext):
        local_table = pd.read_excel("test1" + read_ext)
        tm.assert_frame_equal(url_table, local_table)

    @td.skip_if_not_us_locale
Does anyone know what the intention was in this skip? I'm in the UK and wanted to make sure I hadn't broken these tests. I couldn't think of any reason why this needed to be in, but have removed it in a separate commit which I have no problem dropping out of this PR if people would prefer.
This was added in #21814. But no idea why (the test doesn't seem to have anything locale specific (like eg date parsing)). But we can see what CI says.
cc @TomAugspurger
pytest.skip(f"Skipped for engine: {engine}")

actual = pd.read_excel(basename + read_ext)
actual = pd.read_excel(basename + ".ods")
We can now run these tests on the `None` engine since `read_excel` will now correctly infer the `odf` engine rather than falling through to `xlrd`.
Why is
This uses more reliable content introspection that results in better engine selection. It also removes situations where xlrd gets handed a file it no longer supports.
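As an illustration of what content introspection can look like, here is a minimal sketch that tells the three formats apart by their leading bytes and, for ZIP containers, by their members. The function name `sniff_spreadsheet_format` and its return values are hypothetical and this is not the code from the PR; the `xl/workbook.xml` check in particular is a heuristic assumption about typical xlsx files.

```python
import io
import zipfile

# Well-known container signatures. Note that the OLE2 signature is shared by
# other legacy Office formats, so "xls" here really means "OLE2 container".
OLE2_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"
ZIP_SIGNATURE = b"PK\x03\x04"  # .xlsx and .ods are both ZIP archives
ODS_MIMETYPE = b"application/vnd.oasis.opendocument.spreadsheet"


def sniff_spreadsheet_format(stream):
    """Return "xls", "xlsx", "ods" or None for a seekable binary stream.

    Hypothetical helper for illustration only.
    """
    stream.seek(0)
    head = stream.read(8)
    stream.seek(0)

    if head.startswith(OLE2_SIGNATURE):
        return "xls"
    if not head.startswith(ZIP_SIGNATURE):
        return None

    try:
        with zipfile.ZipFile(stream) as archive:
            names = set(archive.namelist())
            # ODS archives carry a "mimetype" member naming the format.
            if "mimetype" in names:
                if archive.read("mimetype").strip() == ODS_MIMETYPE:
                    return "ods"
            # Assumption: xlsx workbooks normally contain this member.
            if "xl/workbook.xml" in names:
                return "xlsx"
        return None
    except zipfile.BadZipFile:
        return None
    finally:
        # Leave the stream rewound so the chosen engine can read it.
        stream.seek(0)
```

Inspecting the content rather than the file extension means a misnamed or extension-less file can still be routed to a suitable engine.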
I don't think we have settled on this explicitly, likely we have a mix of these in the codebase
cc @pandas-dev/pandas-core
I don't feel strongly, so have changed it in this PR, but was curious as to the reasoning :-)
(I'll squash the commits down to semantics before merging - just easier to review if I leave them unsquashed for now)
No need to do so, we squash on merge
appears we are now picking up
@jreback - that should be fine on this branch, I'll dig into the CI failures once there's consensus on the approach.
yeah i think it's just 1 or 2 tests where we maybe are not safe importing
is_ods : bool
    Boolean indication that this is indeed an ODS file or not
content : bytes
    The bytes found.
"""
stream.seek(0)
What if stream doesn't support seek? I think this is possible with compressed files.
I have a feeling we'd have seen reports of this either in the pandas tracker or xlrd's tracker, since, unless I'm mistaken this type of seeking has been used in xlrd for many years now.
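If non-seekable streams did turn out to be a problem, one possible guard is to buffer them in memory before sniffing. The sketch below is an assumption about how that could be done, not something proposed in this thread; `ensure_seekable` is a hypothetical name.

```python
import io


def ensure_seekable(stream):
    """Return a seekable binary stream with the same remaining content.

    Hypothetical helper: seekable streams are returned unchanged, anything
    else is read fully into an in-memory buffer.
    """
    seekable = getattr(stream, "seekable", None)
    if callable(seekable) and seekable():
        return stream
    # Trades memory for the ability to call seek(0) later.
    return io.BytesIO(stream.read())
```

The cost is that the whole payload is held in memory, which may matter for very large workbooks.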
Just to be clear: before this PR can be considered to merge, we first need to come to a decision on #38424 (comment)
Thanks a lot for this @cjw296 !
Just to be clear: I think we can keep almost all of these changes, even if we decide in #38424 to keep the xlrd-fallback-with-warning, right?
In such a case, we would need to distinguish between the "inferred" engine and the user specified one. And if the inferred one is openpyxl, but openpyxl is not installed, we can fallback to xlrd with a clear warning.
(note, not saying that you should (already) implement this, but want to make sure that it's clear this PR is useful in any case, and does not strictly depend on not having the fallback)
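To make the inferred-versus-specified distinction concrete, here is a rough sketch of the decision described above. The function `resolve_engine`, its `is_installed` parameter, the warning category, and the message wording are all assumptions for illustration and do not reflect what pandas actually ships.

```python
import importlib.util
import warnings


def _default_is_installed(name):
    # True if the named package can be found for import.
    return importlib.util.find_spec(name) is not None


def resolve_engine(user_engine, inferred_engine, is_installed=None):
    """Pick a reader engine, preferring an explicit user choice.

    Hypothetical helper; `is_installed` is injectable so the logic can be
    exercised without changing the environment.
    """
    if is_installed is None:
        is_installed = _default_is_installed

    # An engine the user asked for is never second-guessed.
    if user_engine is not None:
        return user_engine

    if inferred_engine == "openpyxl" and not is_installed("openpyxl"):
        if is_installed("xlrd"):
            warnings.warn(
                "openpyxl is not installed; falling back to xlrd. Using xlrd "
                "for files other than .xls is unsupported. Install openpyxl.",
                FutureWarning,
                stacklevel=2,
            )
            return "xlrd"
        raise ImportError(
            "Reading this file requires openpyxl, which is not installed."
        )
    return inferred_engine
```

Keeping the user's choice and the inferred choice as separate arguments is what allows the fallback to apply only when the engine was inferred.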
if ext == ".ods":
    engine = "odf"
handles = get_handle(
    stringify_path(path_or_buffer),
Is the `stringify_path` here needed? (I would assume that `get_handle` handles that)
yes, `stringify_path` is not needed for `get_handle`
I believe the user expansion that `stringify_path` performs is needed here.
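For readers unfamiliar with the term, "user expansion" means turning a leading `~` into the user's home directory. The sketch below shows the general idea using only the standard library; `expand_user_if_path` is a hypothetical stand-in and not pandas' `stringify_path`.

```python
import os


def expand_user_if_path(path_or_buffer):
    """Expand a leading "~" for path-like input; pass buffers through.

    Hypothetical helper for illustration only.
    """
    if isinstance(path_or_buffer, os.PathLike):
        path_or_buffer = os.fspath(path_or_buffer)
    if isinstance(path_or_buffer, str):
        return os.path.expanduser(path_or_buffer)
    # File-like objects and anything else are returned unchanged.
    return path_or_buffer
```

Without this step, a path such as `~/book.ods` would be handed to the opener literally and fail to resolve.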
Thinking further on it: we should maybe do this anyway, even when not doing the fallback+warning. We might want to give a custom error message explaining that xlrd is no longer maintained / no longer supports files other than .xls, and so you now need to install another dependency, i.e. openpyxl.
@c123w can you see if you can get this passing (also merge master).
i hate to block on this for 1.2, we can always do this in 1.2.1.
@cjw296 assuming we keep the decision to have the fallback+warning, do you want to update this PR to reflect that behaviour? Or would you be OK with one of us pushing some changes to this PR for that?
I won't stop you, I'll just be really disappointed. https://github.com/pandas-dev/pandas/pull/38522/files#diff-63200ddb7f5656b8ee868a28d9cb7720ffe50689b0e3fb0b4e15cc5c0ae80dd7R1065 seems like an obvious place to insert an If you go this route, please ensure you put a LOUD WARNING (if only we had exceptions that could be really loud warnings, eh?) that using xlrd for xlsx files is explicitly unsupported and that under no circumstances should issues, PRs or other completely inappropriate comments be added to the xlrd repository. I would also appreciate it if this warning is included in the "what's new", particularly the part about being explicitly unsupported and exposing users to potentially dangerous security issues, along with the most verbose form of the above warning about raising upstream issues.
Again, that is already what we do on master (except for a few corner cases, as discussed in #38424, and those can be fixed). And I am sorry that you get such backlash on it.
@jorisvandenbossche - the frustration is that I advertised the deprecation of this package for non-xls files 4 years ago and nothing has been done... Now I try and help here, and the same lukewarm attitude continues.
And I understand that frustration. We also acted on it too slowly in pandas (although we haven't been aware of it for 4 years, I think, the main issue where the discussion happened was only opened last year: #28547). But you also have to understand that the average pandas user was not aware of this. They simply used It's always difficult to communicate with users ("nobody reads the docs"), and so a main way of communicating future changes that we use in pandas is through warnings.
Okay, but as I see it, to get to pandas 1.2 you will have done one of two things:
In either case, a clear exception that says "you need to install package x" would be the cleanest and quickest way to get people using the best software. "nobody looks at warnings that don't stop their code working" is sadly as true as "nobody reads the docs", especially given the poor use of warnings in cpython itself over the last few years.
Warnings have done decently well for us empirically speaking. If they don't want to heed the warnings and upgrade, then that is on them because they will have had sufficient time to consume and prepare. Just like we unfortunately did not see your warnings (that went above and beyond BTW) about
closing in favor of #38571
See:
#38424 (comment)
#38456
Discussion happening on #38424, code review happening here ;-)