ENH: Preserve key order when passing list of dicts to DataFrame on py 3.6+ #27309

ghost · 2019-07-09T17:51:59Z

tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Related to #25915, #26587, #24859, #26113, #10056 (OrderedDicts), #11181/#11416 (list of namedtuple), #25911.

Update: #13304/#13309 was merged three years ago, so let's just make a list of dicts act like a list of OrderedDict (as pointed out by @jorisvandenbossche).

Actual

In [63]: data= [
    ...: {'name': 'Joe', 'state': 'NY', 'age': 18},
    ...: {'name': 'Jane', 'state': 'KY', 'age': 19}
    ...: ]
    ...: pd.DataFrame(data)
Out[63]: 
   age  name state
0   18   Joe    NY
1   19  Jane    KY

Expected

In [64]: pd.DataFrame(data)
Out[64]: 
   name state  age
0   Joe    NY   18
1  Jane    KY   19

Four years ago, #10056 asked for the implied order of columns in a list of `OrderedDict` to be preserved by the `DataFrame` constructor. @thatneat [commented](https://github.com//issues/10056#issuecomment-509383829) yesterday that with 3.7 guaranteed dict order, this should extend to dict-like in general. I think users have a reasonable expectation for this to work, and therefore that pandas should support it. @jreback [voted](https://github.com//issues/10056#issuecomment-98812435) +0 on adding this (four years ago).
namedtuple has the convenient property of homogeneous keys and key order, which a list of dicts does not have: dicts are allowed to omit keys, and the key order may also change from dict to dict.
Given that, I settled on a reasonable compromise that matches user expectations in practice:

Only look at the first dict in the list.

Only guarantee the column order of the keys which actually appear in it.

Clarification: the order among columns not included in the first dict is undefined, except that they will appear after all the columns that are.

Added: changes apply to Python 3.6+ only.

In practice, I think the only case users actually care about is sensible behavior when passing a list of dicts which is homogeneous in terms of key and key-order, which this PR provides.
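The rule described above can be sketched in plain Python, independent of pandas. This is only an illustration: `first_dict_columns` is a hypothetical helper, not code from this PR, and sorting the leftover keys is an assumption made here to keep the output deterministic (the description leaves their relative order undefined).

```python
def first_dict_columns(records):
    """Column order implied by the 'first dict' rule (illustrative only)."""
    if not records:
        return []
    # Keys of the first dict, in their insertion order (Python 3.6+/3.7+).
    first = list(records[0])
    seen = set(first)
    # Keys that only appear in later dicts go after the first dict's keys.
    # Sorting them is an assumption; the rule itself leaves this undefined.
    rest = sorted({key for rec in records[1:] for key in rec} - seen)
    return first + rest


records = [
    {"name": "Joe", "state": "NY", "age": 18},
    {"name": "Jane", "zip": "40201", "city": "Louisville"},
]
columns = first_dict_columns(records)
print(columns)
```

For a homogeneous list of dicts the result is simply the key order of the first dict, which is the case the description focuses on.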

pandas/tests/frame/test_constructors.py

jason-curtis · 2019-07-09T18:06:05Z

Thanks for following through on this!

If I can ask for a small amount of scope creep, though - it seems like all other things being equal, the ordering of columns added "later on" should still match the input if possible. This would be analogous to the behavior of dict.update():

In [1]: d={5:4, 3:2}                                                                                                 

In [2]: list(d.keys())                                                                                               
Out[2]: [5, 3]

In [3]: d.update({3:3, 5:5, 4:4, 1:1})

In [4]: list(d.keys())
Out[4]: [5, 3, 4, 1]

That said, this is already a big improvement and I agree with you that you're covering the most common case. If implementing the above turns out to be too tricky for now, maybe you could just remove the part of the test case that explicitly shows that "added later" "XXX" and "YYY" columns are sorted - I wouldn't want to imply that that's a desired behavior and close down the option of implementing this later. [EDIT: I misread the test, it's great 😄 ]

jason-curtis · 2019-07-09T18:15:35Z

> I think you've misread the test.

yep, I totally did. Never mind!

pep8speaks · 2019-07-09T21:26:24Z

Hello @pilkibun! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-16 00:14:13 UTC

WillAyd · 2019-07-10T13:00:00Z

So this has come up a few times in other incarnations but I don't really see the point of this PR. If the data is truly ordered then representing as a list of dicts on the way in is a rather inefficient way of storing the data. If it's not ordered picking the first row or just iterating keys in a 2D plane seems rather arbitrary.

There are a lot of edge cases and nuances that make behavior undefined so I generally don't see a value add

jason-curtis · 2019-07-10T19:39:41Z

> I don't really see the point of this PR

To supply a data point here:

My teams have run into this issue fairly frequently and have been consistently surprised by, and had to work around, DataFrame column ordering behaviors. A common case is in tests, where you want it to be extremely easy to construct a DataFrame fixture with a particular shape, but it turns out to be more complex because we have to explicitly pass columns or use less straightforward data types than dict. It's not a huge cost to deal with, but I can attest that this is a fairly common source of friction and this PR will help.

@WillAyd What are the edge cases and nuances you're thinking of?
Does my suggestion of using the maintain-insertion-order behavior of dict and dict.update() help? It may not always be the ordering the user wants, but at least it is consistent and easy to explain.

You may find this python-dev conversation useful. Python core developers recently (late 2017) decided that they had all of the edge cases nailed down enough to commit to insertion ordering for dicts. Now that they've figured it out, maybe it's a natural course for pandas to follow along?

WillAyd · 2019-07-10T19:59:10Z

The topic of Python insertion order into a single dictionary is not entirely relevant when talking about a list of dictionaries. I'm not sure why we would assume the first row is really indicative of anything in all of the below cases (should be non-trivial to think of more)

>>> [{'z': 1}, {'c': 1, 'b': 1, 'a': 1}]
>>> [{'a': 1, 'b': 1, 'c': 1}, {'c': 1, 'b': 1, 'a': 1}]
>>> [{'a': 1, 'b': 1, 'c': 1}, {'z': 1, 'y': 1, 'x': 1}]
>>> [{'x': 1, 'y': 1, 'z': 1}, {'a': 1, 'y': 1, 'z': 1}]
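To make the four cases above concrete, here is a small pure-Python comparison of two candidate rules: lexical sorting of all keys, and a first-seen (insertion-order) union across the rows. Both helper names are hypothetical, and neither is claimed to be what pandas does internally; the sketch assumes Python 3.7+ dict ordering.

```python
from itertools import chain

cases = [
    [{'z': 1}, {'c': 1, 'b': 1, 'a': 1}],
    [{'a': 1, 'b': 1, 'c': 1}, {'c': 1, 'b': 1, 'a': 1}],
    [{'a': 1, 'b': 1, 'c': 1}, {'z': 1, 'y': 1, 'x': 1}],
    [{'x': 1, 'y': 1, 'z': 1}, {'a': 1, 'y': 1, 'z': 1}],
]


def sorted_columns(records):
    # Rule 1: collect every key and sort lexically.
    return sorted(set(chain.from_iterable(records)))


def first_seen_columns(records):
    # Rule 2: keep each key at the position where it was first seen.
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(chain.from_iterable(records)))


sorted_results = [sorted_columns(case) for case in cases]
first_seen_results = [first_seen_columns(case) for case in cases]
for case, by_sort, by_seen in zip(cases, sorted_results, first_seen_results):
    print(case, by_sort, by_seen)
```

Both rules are deterministic; they differ only in whether the result depends on the order in which keys appear in the data.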

jreback

I am -0 on this change because it's hard to know the 'rule' for a user and it's pretty arbitrary that it's the first row's keys (which is of course data dependent).

it is much simpler to simply pass columns= if you actually do care about the ordering, or .reindex() anytime you actually want a specific ordering.

I am not sure there is a 'good' soln here.
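For reference, the two explicit alternatives mentioned above look roughly like this. It is a minimal sketch using the data from the top post; the variable names are illustrative.

```python
import pandas as pd

data = [
    {'name': 'Joe', 'state': 'NY', 'age': 18},
    {'name': 'Jane', 'state': 'KY', 'age': 19},
]
wanted = ['name', 'state', 'age']

# Option 1: state the column order up front.
df_explicit = pd.DataFrame(data, columns=wanted)

# Option 2: build the frame first, then reorder its columns.
df_reindexed = pd.DataFrame(data).reindex(columns=wanted)

print(df_explicit)
```

Both options give the same column order regardless of how the constructor orders dict keys by default.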

jorisvandenbossche · 2019-07-10T21:33:31Z

About the "other incarnations" that @WillAyd mentioned that recently came up, see #26587, #24859, #26113 and some others (I thought we had a PR doing exactly the same change as this recently, but can't directly find it).

Personally, I would be fine with a change along the lines of this PR. But for me, the main reason is that I think we should make the treatment of "lists of records" more consistent in general. Whether they are lists of Series, lists of dicts or lists of namedtuples, those cases should be broadly equivalent.

- Lists of Series currently preserve column order based on the index. It seems they try to preserve the order of the first Series as much as possible, appending newly observed labels.
- Lists of namedtuples only take the order (and length!) of the first, which leads to some broken cases (e.g. a list of namedtuples with different fields gets silently written to the wrong columns).
- Lists of dicts: the keys are sorted.

So I don't think we should try to think of a new "set of rules" for dicts (as spelled out in the top post).
But I would personally very much welcome a change trying to make this handling of "lists of record-like objects" consistent.

jorisvandenbossche · 2019-07-10T21:37:33Z

Opened a separate issue for the unsafe namedtuple behaviour: #27329
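The hazard with taking only the first namedtuple's fields can be shown without pandas. The sketch below is not the pandas implementation; it just contrasts positional matching against name-based matching, with `Point` and `Flipped` as made-up types.

```python
from collections import namedtuple

Point = namedtuple("Point", "x y")
Flipped = namedtuple("Flipped", "y x")  # same fields, different order

rows = [Point(x=1, y=2), Flipped(y=3, x=4)]

# Positional matching: use the first row's field names for every row.
columns = rows[0]._fields
positional = [dict(zip(columns, row)) for row in rows]

# Name-based matching: let each row report its own field names.
by_name = [dict(row._asdict()) for row in rows]

print(positional[1])  # the second row's y value lands under 'x'
print(by_name[1])
```

With positional matching the second row silently ends up with its values swapped, which is the kind of mis-assignment described above.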

jason-curtis · 2019-07-10T21:50:30Z

> it is much simpler to simply pass columns= if you actually do care about the ordering, or .reindex() anytime you actually want a specific ordering.

As a user, I beg to differ. I regularly see cases (see my previous comments) that would be more straightforward and intuitive if you didn't have to pass columns or reindex, as judged by seeing my team members try without columns or reindex first, and then be disappointed when they realize they have to.

Completely agree with @jorisvandenbossche above re: focusing on consistency and not reinventing the wheel. It sounds like Lists of Series already follows the maintain-insertion-order (MIO) rule that I advocated above. Landing on something consistent would be really helpful to me and my colleagues.

MIO seems like as good a rule as any. The data dependence of ordering is an odd side-effect, but it seems worth it for consistency.

jason-curtis · 2019-07-10T21:53:41Z

I'd love to know if PRs moving things more towards MIO-based unified handling of "lists of record-like objects" would be considered in general. If that direction were agreed upon, this PR seems like one of the more important steps but there would be more that I would like to contribute.

jorisvandenbossche · 2019-07-11T01:46:23Z

> Just to be clear, is the definition of record "an ordered set of key-value pairs"?

I don't have an exact definition, but something like that, yes. I think Series, dicts and namedtuples are all similar enough (and record-like) to expect them to be handled similarly.

> Lists of dicts -> the keys are sorted
>
> did you mean sorted (as in lexically) or ordered (what we've been discussing)?

I was describing the current behaviour (which is sorted lexically, not random as you said in the top post).

> Let's be precise about the ordering guarantee pandas provides. I'm working under the assumption that it's simply
>
> If you pass a like-ordered like-keyed list of records, pandas will preserve the key ordering
>
> That's it. We don't need to argue about other, undefined behavior.
> That rule already holds for namedtuples, and this PR makes it true also for dicts.

That's the basic "rule" for Series, but not the full rule. For Series, other behaviour is not undefined (we don't want undefined behaviour; at least it should be deterministically sorted as it is now).
I would need to look more closely at the actual code, but I think the logic for Series is basically doing a "union" of the indices (without sorting, so idx1.union(idx2, sort=False) for two indexes, but then of course using another routine that can handle multiple).

> "There is no good solution" as @jreback said, to making any further guarantees on ordering. Because they'll have to be arbitrary in some sense.

There is a perfectly fine solution, it is what Series already does, and the same logic is followed by the Index.union method.
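As a rough illustration of the unsorted union mentioned above, the following uses `Index.union(..., sort=False)` on two indexes and then folds it over several with `functools.reduce`. The fold is only a stand-in for "another routine that can handle multiple"; it is an assumption for this sketch, not the routine pandas uses.

```python
from functools import reduce

import pandas as pd

idx1 = pd.Index(['c', 'a'])
idx2 = pd.Index(['b', 'a'])
idx3 = pd.Index(['d', 'c'])

# Union of two indexes without sorting: labels from idx1 first,
# then labels from idx2 that were not already present.
pair = idx1.union(idx2, sort=False)

# Folding the same operation over more than two indexes.
combined = reduce(lambda left, right: left.union(right, sort=False),
                  [idx1, idx2, idx3])

print(list(pair))
print(list(combined))
```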

jorisvandenbossche · 2019-07-11T01:53:54Z

> There is a perfectly fine solution, it is what Series already does, and the same logic is followed by the Index.union method.

BTW, we already use this exact behaviour when passing OrderedDicts:

In [3]: records = [OrderedDict([('c', 1), ('a', 2)]), OrderedDict([('b', 3), ('a', 4)])]                                                                      

In [4]: pd.DataFrame(records)                                                                                                                                 
Out[4]: 
     c  a    b
0  1.0  2  NaN
1  NaN  4  3.0

So basically what I want to say: we already have the defined behaviour and code for Series and OrderedDict, so I don't see any reason not to do this for normal dicts as well, since they are now ordered too.
(I would then separately also fix the namedtuple case, as that feels buggy, but let's discuss that separately.)

Given the above, I think this PR can be simplified a lot. I think it basically comes down to making the check for OrderedDict also allow normal dicts:

pandas/pandas/core/internals/construction.py

Lines 539 to 542 in 2d0b20b

    
    if columns is None:
        gen = (list(x.keys()) for x in data)
        sort = not any(isinstance(d, OrderedDict) for d in data)
        columns = lib.fast_unique_multiple_list_gen(gen, sort=sort)

jreback

looks ok, ping on green.

doc/source/whatsnew/v0.25.0.rst

pandas/core/internals/construction.py

jorisvandenbossche

Last tiny remark, should be good otherwise (sorry for the many back and forth rounds)

pandas/core/internals/construction.py

doc/source/whatsnew/v0.25.0.rst

pandas/core/internals/construction.py

jorisvandenbossche · 2019-07-15T21:16:06Z

@jreback this PR has seen enough back and forth on small details, I think we can just merge as is.

jreback · 2019-07-15T21:18:21Z

> @jreback this PR has seen enough back and forth on small details, I think we can just merge as is.

I disagree

ghost · 2019-07-15T23:24:12Z

Thanks @jorisvandenbossche for the fixes. I pushed a few more of @jreback's comments.

For the record, the overhead of doing full column discovery is not negligible. I profile it at about 50% slower compared to passing columns explicitly. I think that's acceptable (it has been for OrderedDict), but it should be clear moving forward with this for dicts, and in the future for namedtuples too.

In [4]: import pandas as pd
   ...: 
   ...: from typing import NamedTuple
   ...: from collections import namedtuple
   ...: import gc 
   ...: 
   ...: Foo=namedtuple("Foo","a,b,c,d")
   ...: 
   ...: d=dict(a=1,b=2,c=3,d=4)        
   ...: data1=[d]*100000
   ...: %timeit pd.DataFrame(data1, columns=['a','b','c','d']) 
   ...: %timeit pd.DataFrame(data1)
96 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
132 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

jorisvandenbossche · 2019-07-15T23:59:23Z

> For the record, the overhead of doing full column discovery is not negligible.

Yes, but note that this has not changed. Also on master, specifying the column names is faster than doing the discovery. The only change is that the union of the dict keys keeps the order instead of doing a sort, and that has basically no performance implication.

pandas/core/internals/construction.py

ghost · 2019-07-17T23:43:32Z

@jorisvandenbossche, thanks for saving this PR after a false start and for your patience in reviewing.

ENH: Support new case of implied column ordering in Dataframe()

afa72b4

ghost changed the title ~~ENH: Support new case of implied column ordering in Dataframe()~~ ENH: Preserve implied column ordering when passing list of dict to DataFrame Jul 9, 2019

WillAyd reviewed Jul 9, 2019

pandas/tests/frame/test_constructors.py

pilkibun added 3 commits July 9, 2019 13:22

Safer

8a4113c

Restrict to Index case

408ad8b

Fix tests

b732096

pilkibun added 8 commits July 9, 2019 17:02

Style

be57fd9

Fix test

717716b

rename

e9d4989

Restrict to PY37

63adbfe

Style

0ed89ff

Restrict to PY36

0a48016

Work around fake test failure on PY35

4b73536

Style

eb64d31

fix test

b5db0bc

jreback requested changes Jul 10, 2019

ghost changed the title ~~ENH: Preserve implied column ordering when passing list of dict to DataFrame~~ ENH: Use key order for column ordering when passing list of homogeneous dicts to DataFrame Jul 11, 2019

ghost changed the title ~~ENH: Use key order for column ordering when passing list of homogeneous dicts to DataFrame~~ ENH: Preserve key order when passing list of homogeneous dicts to DataFrame Jul 11, 2019

jreback requested changes Jul 12, 2019

jreback added this to the 0.25.0 milestone Jul 12, 2019

pilkibun added 4 commits July 12, 2019 12:16

fix tests

4d52802

CI

5371de5

whatsnew

9afdec3

comment

e1f5f6b

pilkibun added 3 commits July 12, 2019 13:48

whatsnew

b8d8e28

comment

807e341

checks

209c922

jorisvandenbossche approved these changes Jul 12, 2019

pandas/core/internals/construction.py

doc/source/whatsnew/v0.25.0.rst

pilkibun added 2 commits July 12, 2019 18:03

docstring

e3dfa45

whatsnew

e0749fe

jreback requested changes Jul 15, 2019

jorisvandenbossche and others added 4 commits July 15, 2019 17:48

doc comments

4f815cd

typo

60236e5

whatsnew

f4e6309

document parameters

10024c1

jorisvandenbossche reviewed Jul 16, 2019

pandas/core/internals/construction.py

remove wrong description

0d194f1

jreback approved these changes Jul 17, 2019

jreback merged commit f1684a1 into pandas-dev:master Jul 17, 2019

ghost deleted the 10056 branch July 17, 2019 14:37

ghost mentioned this pull request Jul 24, 2019

ENH: treat list of namedtuples like list of dict in DataFrame() #27494

Closed

4 tasks

WillAyd mentioned this pull request Sep 21, 2019

pd.read_json and orient="index" sorts results #28557

Closed
