ENH: Preserve key order when passing list of dicts to DataFrame on py 3.6+ #27309
Thanks for following through on this! If I can ask for a small amount of scope creep, though - it seems like all other things being equal, the ordering of columns added "later on" should still match the input if possible. This would be analogous to the behavior of
That said, this is already a big improvement and I agree with you that you're covering the most common case. |
yep, I totally did. Never mind! |
So this has come up a few times in other incarnations but I don't really see the point of this PR. If the data is truly ordered then representing as a list of dicts on the way in is a rather inefficient way of storing the data. If it's not ordered picking the first row or just iterating keys in a 2D plane seems rather arbitrary. There are a lot of edge cases and nuances that make behavior undefined so I generally don't see a value add |
To supply a data point here: My teams have run into this issue fairly frequently and have been consistently surprised by/ had to work around @WillAyd What are the edge cases and nuances you're thinking of? You may find this python-dev conversation useful. Python core developers recently (late 2017) decided that they had all of the edge cases nailed down enough to commit to insertion ordering for |
The topic of Python insertion order into a single dictionary is not entirely relevant when talking about a list of dictionaries. I'm not sure why we would assume the first row is really indicative of anything in all of the below cases (should be non-trivial to think of more)
>>> [{'z': 1}, {'c': 1, 'b': 1, 'a': 1}]
>>> [{'a': 1, 'b': 1, 'c': 1}, {'c': 1, 'b': 1, 'a': 1}]
>>> [{'a': 1, 'b': 1, 'c': 1}, {'z': 1, 'y': 1, 'x': 1}]
>>> [{'x': 1, 'y': 1, 'z': 1}, {'a': 1, 'y': 1, 'z': 1}] |
I am -0 on this change because it's hard for a user to know the 'rule', and it's pretty arbitrary that it's the first row's keys (which is of course data dependent).
It is much simpler to pass columns= if you actually do care about the ordering, or .reindex() anytime you actually want a specific ordering.
I am not sure there is a 'good' solution here.
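For readers following along, here is a minimal sketch of the two workarounds mentioned in this comment. The `records` data and the chosen column names are made up for illustration; only `DataFrame(..., columns=...)` and `DataFrame.reindex(columns=...)` come from the comment itself.

```python
import pandas as pd

# Hypothetical input: rows with differing keys in differing order.
records = [{"b": 1, "a": 2}, {"a": 3, "c": 4}]

# Option 1: fix the column order at construction time.
df = pd.DataFrame(records, columns=["c", "b", "a"])

# Option 2: build the frame first, then reorder the columns explicitly.
df2 = pd.DataFrame(records).reindex(columns=["a", "b", "c"])

print(list(df.columns))
print(list(df2.columns))
```

Both forms give an ordering that does not depend on the data, at the cost of having to know the column names up front.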
About the "other incarnations" that @WillAyd mentioned that recently came up, see #26587, #24859, #26113 and some others (I thought we had a PR doing exactly the same change as this recently, but can't directly find it). Personally, I would be fine with a change along the lines of this PR. But for me, the main reason is that I think we should make the treatment of "lists of records" more consistent in general. Whether they are lists of series, lists of dicts or lists of namedtuples, those cases should be broadly equivalent.
So I don't think we should try to think of a new "set of rules" for dicts (as spelled out in the top post). |
Opened a separate issue for the unsafe namedtuple behaviour: #27329 |
As a user, I beg to differ. I regularly see cases (see my previous comments) that would be more straightforward and intuitive if you didn't have to pass Completely agree with @jorisvandenbossche above re: focusing on consistency and not reinventing the wheel. It sounds like Lists of Series already follows the maintain-insertion-order (MIO) rule that I advocated above. Landing on something consistent would be really helpful to me and my colleagues. MIO seems like as good a rule as any. The data dependence of ordering is an odd side-effect, but it seems worth it for consistency. |
I'd love to know if PRs moving things more towards MIO-based unified handling of "lists of record-like objects" would be considered in general. If that direction were agreed upon, this PR seems like one of the more important steps but there would be more that I would like to contribute. |
I don't have an exact definition, but something like that yes. I think Series, dicts and namedtuples are all similar enough (and record-like) to expect them to be handled similarly.
I was describing the current behaviour (which is sorted lexically, not random as you said in the top post)
That's the basic "rule" for Series, but not the full rule. For Series, other behaviour is not undefined (we don't want undefined behaviour, at least it should be deterministically sorted as is now).
There is a perfectly fine solution, it is what Series already does, and the same logic is followed by the Index.union method. |
BTW, we already use this exact behaviour when passing OrderedDicts:
So basically what I want to say: we already have the defined behaviour and code for Series and OrderedDict, so I don't see any reason not to do this for normal dicts as well, since they are now ordered too. Given the above, I think this PR can be simplified a lot. I think it basically comes down to making the check for OrderedDict also allow normal dicts: pandas/pandas/core/internals/construction.py Lines 539 to 542 in 2d0b20b
|
looks ok, ping . on green.
Last tiny remark, should be good otherwise (sorry for the many back and forth rounds)
@jreback this PR has seen enough back and forth on small details, I think we can just merge as is. |
I disagree |
Thanks @jorisvandenbossche for the fixes. I pushed a few more of @jreback's comments. For the record, the overhead of doing full column discovery is not negligible. I profile it at about 50% slower compared to passing columns explicitly. I think that's acceptable (it has been for
In [4]: import pandas as pd
...:
...: from typing import NamedTuple
...: from collections import namedtuple
...: import gc
...:
...: Foo=namedtuple("Foo","a,b,c,d")
...:
...: d=dict(a=1,b=2,c=3,d=4)
...: data1=[d]*100000
...: %timeit pd.DataFrame(data1, columns=['a','b','c','d'])
...: %timeit pd.DataFrame(data1)
96 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
132 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) |
Yes, but note that this has not changed. Also on master, specifying the column names is faster than doing the discovery. The only change is that the union of the dict keys keep the order instead of doing a sort, and that has basically no performance implication. |
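To make the difference concrete, here is a pure-Python sketch contrasting a sorted union of dict keys with a first-seen-order union. This is only an illustration of the idea described in this comment, not the pandas implementation; `sorted_union` and `ordered_union` are hypothetical helper names, and the sample input is one of the cases raised earlier in the thread.

```python
def sorted_union(records):
    # Union of all keys, sorted lexically (the behaviour described as current).
    return sorted(set().union(*records))


def ordered_union(records):
    # Union of all keys in first-seen order. Relies on dicts preserving
    # insertion order (guaranteed from Python 3.7, true in CPython 3.6).
    seen = {}
    for record in records:
        for key in record:
            seen.setdefault(key, None)
    return list(seen)


records = [{"x": 1, "y": 1, "z": 1}, {"a": 1, "y": 1, "z": 1}]
print(sorted_union(records))   # ['a', 'x', 'y', 'z']
print(ordered_union(records))  # ['x', 'y', 'z', 'a']
```

Both variants visit every key of every row, which is consistent with the point that keeping the order instead of sorting should not change the cost of column discovery much.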
@jorisvandenbossche, thanks for saving this PR after a false start and for your patience in reviewing. |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Related to #25915, #26587, #24859, #26113, #10056 (OrderedDicts), #11181/#11416 (list of namedtuple), #25911.
Update: #13304/#13309 was merged three years ago, so let's just make a list of dicts act like a list of OrderedDict (as pointed out by @jorisvandenbossche).
Actual
Expected
Four years ago, #10056 asked for the implied order of columns in a list of `OrderedDict` to be preserved by the `DataFrame` constructor. @thatneat [commented](https://github.com//issues/10056#issuecomment-509383829) yesterday that with 3.7 guaranteed dict order, this should extend to dict-like in general. I think users have a reasonable expectation for this to work, and therefore that pandas should support it. @jreback [voted](https://github.com//issues/10056#issuecomment-98812435) +0 on adding this (four years ago).
- Only look at the first dict in the list.
- Only guarantee the column order of the keys which actually appear in it.
- Clarification: the order among columns not included in the first dict is undefined, except that they will appear after all the columns that are.
- Added: changes apply to Python 3.6+ only.
namedtuple has the convenient property of homogeneous keys and key order, which a list of dicts doesn't have: dicts are allowed to omit keys, and the key order may also change from dict to dict.
Given that, I settled on a reasonable compromise that matches user expectations in practice:
In practice, I think the only case users actually care about is sensible behavior when passing a list of dicts which is homogeneous in terms of key and key-order, which this PR provides.