ENH: MultiIndex.from_frame #23141

Merged
merged 52 commits into from
Dec 9, 2018
Changes from 17 commits
79bdecb
ENH - add from_frame method and accompanying squeeze method to multii…
sds9995 Oct 13, 2018
fa82618
ENH - guarentee that order of labels is preserved in multiindex to_fr…
sds9995 Oct 13, 2018
64b45d6
CLN - adhere to PEP8 line length
sds9995 Oct 13, 2018
64c7bb1
CLN - remove trailing whitespace
sds9995 Oct 13, 2018
3ee676c
ENH - raise TypeError on inappropriate input
sds9995 Oct 16, 2018
fd266f5
TST - add tests for mi.from_frame and mi.squeeze
sds9995 Oct 16, 2018
4bc8f5b
CLN - pep8 adherence in tests
sds9995 Oct 16, 2018
9d92b70
CLN - last missed pep8 fix
sds9995 Oct 16, 2018
45595ad
BUG - remove pd.DataFrame in favor of local import
ms7463 Oct 16, 2018
3530cd3
DOC - add more detailed docstrings for from_frame and squeeze
sds9995 Oct 18, 2018
1c22791
DOC - update MultiIndex.from_frame and squeeze doctests to comply wit…
sds9995 Oct 28, 2018
cf78780
CLN - cleanup docstrings and source
sds9995 Oct 28, 2018
64c2750
TST - reorganize some of the multiindex tests
sds9995 Oct 28, 2018
ede030b
CLN - adhere to pep8 line length
sds9995 Oct 28, 2018
190c341
BUG - ensure dtypes are preserved in from_frame and to_frame
sds9995 Nov 3, 2018
e0df632
TST - add tests for ensuring dtype fidelity and custom names for from…
sds9995 Nov 3, 2018
78ff5c2
CLN - pep8 adherence
sds9995 Nov 3, 2018
0252db9
DOC - add examples and change order of kwargs for from_frame
sds9995 Nov 3, 2018
d98c8a9
TST - parameterize tests
sds9995 Nov 3, 2018
8a1906e
CLN - pep8 adherence
sds9995 Nov 3, 2018
08c120f
CLN - pep8 adherence
sds9995 Nov 3, 2018
8353c3f
DOC/CLN - add versionadded tags, add to whatsnew page, and clean up i…
sds9995 Nov 4, 2018
9df3c11
CLN - squeeze -> _squeeze
sds9995 Nov 10, 2018
6d4915e
DOC - squeeze -> _squeeze in whatsnew
ms7463 Nov 10, 2018
b5df7b2
BUG - allow repeat column names in from_frame, and falsey column name…
sds9995 Nov 11, 2018
ab3259c
DOC - whatsnew formatting
sds9995 Nov 11, 2018
cf95261
TST - reorganize and add tests for more incompatible from_frame types
sds9995 Nov 11, 2018
63051d7
Merge branch 'enhancement/from_frame' of https://github.com/ArtinSarr…
sds9995 Nov 11, 2018
a75a4a5
CLN - remove squeeze tests
sds9995 Nov 12, 2018
8d23df9
CLN - remove squeeze parameter from from_frame
sds9995 Nov 12, 2018
c8d696d
Merge branch 'master' into enhancement/from_frame
sds9995 Nov 12, 2018
7cf82d1
TST - remove callable name option
sds9995 Nov 12, 2018
1a282e5
ENH - from_data initial commit
sds9995 Nov 14, 2018
b3c6a90
DOC - reduce whatsnew entry for to_frame
sds9995 Nov 19, 2018
c760359
CLN/DOC - add examples to from_frame docstring and make code more rea…
sds9995 Nov 19, 2018
bb69314
Merge branch 'master' into enhancement/from_frame
sds9995 Nov 19, 2018
9e11180
TST - use OrderedDict for dataframe construction
sds9995 Nov 20, 2018
96c6af3
Merge branch 'master' into enhancement/from_frame
sds9995 Nov 28, 2018
a5236bf
CLN - clean up code and use pytest.raises
sds9995 Dec 1, 2018
c78f364
Merge branch 'master' into enhancement/from_frame
sds9995 Dec 1, 2018
14bfea8
DOC - move to_frame breaking changes to backwards incompatible sectio…
sds9995 Dec 2, 2018
6960804
Merge branch 'master' into enhancement/from_frame
ms7463 Dec 2, 2018
11c5947
Merge branch 'master' into enhancement/from_frame
sds9995 Dec 4, 2018
904644a
Merge branch 'enhancement/from_frame' of https://github.com/ArtinSarr…
sds9995 Dec 4, 2018
30fe0df
DOC - add advanced.rst section
sds9995 Dec 5, 2018
ec60563
Merge branch 'master' into enhancement/from_frame
sds9995 Dec 5, 2018
8fc6609
Merge branch 'master' into enhancement/from_frame
sds9995 Dec 6, 2018
9b906c6
DOC/CLN - cleanup documentation
sds9995 Dec 6, 2018
e416122
CLN - fix linting error according to pandas-dev.pandas test
sds9995 Dec 6, 2018
4ef9ec4
DOC - fix docstrings
sds9995 Dec 7, 2018
4240a1e
CLN - fix import order with isort
sds9995 Dec 7, 2018
9159b2d
Merge branch 'master' into enhancement/from_frame
sds9995 Dec 7, 2018
113 changes: 108 additions & 5 deletions pandas/core/indexes/multi.py
@@ -2,6 +2,7 @@
# pylint: disable=E1101,E1103,W0232
import datetime
import warnings
from collections import OrderedDict
from sys import getsizeof

import numpy as np
@@ -1189,11 +1190,15 @@ def to_frame(self, index=True, name=None):
else:
idx_names = self.names

- result = DataFrame({(name or level):
- self._get_level_values(level)
- for name, level in
- zip(idx_names, range(len(self.levels)))},
- copy=False)
+ result = DataFrame(
+ OrderedDict([
+ ((name or level), self._get_level_values(level))
+ for name, level in zip(idx_names, range(len(self.levels)))
+ ]),
+ copy=False
+ )


if index:
result.index = self
return result
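
The switch to `OrderedDict` in this hunk appears intended to keep the frame's columns in level order (a plain `dict` only guarantees insertion order from Python 3.7 on). A minimal sketch of the resulting behaviour, using made-up level names and assuming a pandas version that includes this change:

```python
import pandas as pd

# Hypothetical level names, chosen so that alphabetical order differs
# from level order.
mi = pd.MultiIndex.from_arrays(
    [[1, 2], ['x', 'y'], [0.5, 1.5]],
    names=['zeta', 'alpha', 'mid'],
)

# index=False returns a frame with a default RangeIndex.
frame = mi.to_frame(index=False)

# Columns are expected to come back in level order, not sorted by name.
column_order = list(frame.columns)
```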
@@ -1294,6 +1299,7 @@ def from_arrays(cls, arrays, sortorder=None, names=None):
MultiIndex.from_tuples : Convert list of tuples to MultiIndex
MultiIndex.from_product : Make a MultiIndex from cartesian product
of iterables
MultiIndex.from_frame : Make a MultiIndex from a DataFrame.
"""
if not is_list_like(arrays):
raise TypeError("Input must be a list / sequence of array-likes.")
@@ -1343,6 +1349,7 @@ def from_tuples(cls, tuples, sortorder=None, names=None):
MultiIndex.from_arrays : Convert list of arrays to MultiIndex
MultiIndex.from_product : Make a MultiIndex from cartesian product
of iterables
MultiIndex.from_frame : Make a MultiIndex from a DataFrame.
"""
if not is_list_like(tuples):
raise TypeError('Input must be a list / sequence of tuple-likes.')
@@ -1399,6 +1406,7 @@ def from_product(cls, iterables, sortorder=None, names=None):
--------
MultiIndex.from_arrays : Convert list of arrays to MultiIndex
MultiIndex.from_tuples : Convert list of tuples to MultiIndex
MultiIndex.from_frame : Make a MultiIndex from a DataFrame.
"""
from pandas.core.arrays.categorical import _factorize_from_iterables
from pandas.core.reshape.util import cartesian_product
@@ -1412,6 +1420,77 @@ def from_product(cls, iterables, sortorder=None, names=None):
labels = cartesian_product(labels)
return MultiIndex(levels, labels, sortorder=sortorder, names=names)

@classmethod
def from_frame(cls, df, squeeze=True, names=None):
"""
Make a MultiIndex from a DataFrame.

Parameters
----------
df : pd.DataFrame
DataFrame to be converted to MultiIndex.
squeeze : bool, default True
If df is a single column, squeeze MultiIndex to be a regular Index.
names : list / sequence / callable, optional
If no names are provided, use column names, or tuple of column names if
the columns are a MultiIndex. If sequence, overwrite names with the
given sequence. If callable, pass each column name or tuples of
Contributor

Not really sure of the difference between these; can you show the rationale for all of these options?

Contributor Author

The callable option was mostly for cases where the frame used to construct the mi is itself multiindexed on the columns. Example below:

    >>> df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['e', 'f']],
    ...                   columns=pd.MultiIndex.from_tuples([('L1', 'x'), 
    ...                                                      ('L2', 'y')]))
    >>> df
      L1 L2
       x  y
    0  a  b
    1  c  d
    2  e  f
    >>> pd.MultiIndex.from_frame(df, names=lambda x: '_'.join(x))
    MultiIndex(levels=[['a', 'c', 'e'], ['b', 'd', 'f']],
               labels=[[0, 1, 2], [0, 1, 2]],
               names=['L1_x', 'L2_y'])

Contributor Author

This would be an uncommon occurrence. Would it make more sense to just not provide the callable option?

Contributor

let's remove the callable option for now
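
With the callable option dropped, the same names can still be produced by building the list before the call. A minimal sketch, reusing the frame from the example earlier in this thread and assuming `names` accepts a plain list (`joined_names` is an illustrative variable, not part of the PR):

```python
import pandas as pd

df = pd.DataFrame(
    [['a', 'b'], ['c', 'd'], ['e', 'f']],
    columns=pd.MultiIndex.from_tuples([('L1', 'x'), ('L2', 'y')]),
)

# Each column label is a tuple such as ('L1', 'x'); join it by hand
# instead of passing a callable to from_frame.
joined_names = ['_'.join(col) for col in df.columns]

mi = pd.MultiIndex.from_frame(df, names=joined_names)
```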

names to the callable.

Returns
-------
MultiIndex or Index
The MultiIndex representation of the given DataFrame. Returns an
Index if the DataFrame is single column and squeeze is True.

Examples
--------
>>> df = pd.DataFrame([[0, 'happy'], [0, 'jolly'], [1, 'happy'],
... [1, 'jolly'], [2, 'joy'], [2, 'joy']],
... columns=['number', 'mood'])
>>> df
number mood
0 0 happy
1 0 jolly
2 1 happy
3 1 jolly
4 2 joy
5 2 joy
>>> pd.MultiIndex.from_frame(df)
MultiIndex(levels=[[0, 1, 2], ['happy', 'jolly', 'joy']],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 2, 2]],
names=['number', 'mood'])

See Also
--------
MultiIndex.from_arrays : Convert list of arrays to MultiIndex
MultiIndex.from_tuples : Convert list of tuples to MultiIndex
MultiIndex.from_product : Make a MultiIndex from cartesian product
of iterables
"""
from pandas import DataFrame
if not isinstance(df, DataFrame):
raise TypeError("Input must be a DataFrame")

# Get MultiIndex names
if names is None:
names = list(df)
else:
if callable(names):
names = [names(x) for x in list(df)]
else:
if not is_list_like(names):
raise TypeError("'names' must be a list / sequence "
"of column names, or a callable.")

if len(names) != len(list(df)):
raise ValueError("'names' should have same length as "
"number of columns in df.")
Member

I suggested in my previous comment that all checks on names are superfluous: they exactly repeat the checks done inside the constructor. Isn't

names = column_names if names is None else names

sufficient?!

Contributor Author

Yup, removed the redundant code. Thanks.
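
A rough sketch of what the simplified logic could look like once the redundant checks are gone. `from_frame_sketch` is a hypothetical standalone function written for illustration, not the method as merged; it relies on `MultiIndex.from_arrays` to validate `names`, as the reviewer suggests:

```python
import pandas as pd


def from_frame_sketch(df, names=None):
    """Hypothetical standalone version of the simplified from_frame."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input must be a DataFrame")

    # Split the frame into column labels and column data in one pass.
    column_names, columns = zip(*df.items())

    # No further checks here: the constructor validates names itself.
    names = column_names if names is None else names
    return pd.MultiIndex.from_arrays(columns, names=names)
```

Under this sketch a wrong-length `names` is rejected by the constructor rather than by `from_frame`, so the error message comes from there.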


# This way will preserve dtype of columns
mi = cls.from_arrays([df[x] for x in df], names=names)
return mi.squeeze() if squeeze else mi

def _sort_levels_monotonic(self):
"""
.. versionadded:: 0.20.0
@@ -1474,6 +1553,30 @@ def _sort_levels_monotonic(self):
names=self.names, sortorder=self.sortorder,
verify_integrity=False)

def squeeze(self):
"""
Squeeze a single level MultiIndex to be a regular Index instance.

Contributor

I don't think we need to make this a public method.

Contributor Author

changed squeeze -> _squeeze

Contributor

I am re-thinking if this should be public, see #22866
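
For readers who want the squeeze behaviour without a public method, a similar result can be had from public API. A minimal sketch, where `squeeze_index` is a hypothetical helper (not part of pandas) that mirrors the logic in this diff:

```python
import pandas as pd


def squeeze_index(index):
    """Hypothetical helper mirroring the squeeze logic in this diff."""
    if isinstance(index, pd.MultiIndex) and index.nlevels == 1:
        # get_level_values returns a flat Index carrying the level name.
        return index.get_level_values(0)
    # Anything else is returned as a copy, as in the diff.
    return index.copy()
```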

Returns
-------
Index or MultiIndex
Returns Index equivalent of single level MultiIndex. Returns
copy of MultiIndex if multilevel.

Examples
--------
>>> mi = pd.MultiIndex.from_tuples([('a',), ('b',), ('c',)])
>>> mi
MultiIndex(levels=[['a', 'b', 'c']],
labels=[[0, 1, 2]])
>>> mi.squeeze()
Index(['a', 'b', 'c'], dtype='object')
"""
if len(self.levels) == 1:
return self.levels[0][self.labels[0]]
else:
return self.copy()

def remove_unused_levels(self):
"""
Create a new MultiIndex from the current that removes
74 changes: 74 additions & 0 deletions pandas/tests/indexes/multi/test_constructor.py
@@ -472,3 +472,77 @@ def test_from_tuples_with_tuple_label():
idx = pd.MultiIndex.from_tuples([(2, 1), (4, (1, 2))], names=('a', 'b'))
result = pd.DataFrame([2, 3], columns=['c'], index=idx)
tm.assert_frame_equal(expected, result)


def test_from_frame():
df = pd.DataFrame([['a', 'a'], ['a', 'b'], ['b', 'a'], ['b', 'b']],
columns=['L1', 'L2'])
expected = pd.MultiIndex.from_tuples([('a', 'a'), ('a', 'b'),
('b', 'a'), ('b', 'b')],
names=['L1', 'L2'])
result = pd.MultiIndex.from_frame(df)
tm.assert_index_equal(expected, result)


@pytest.mark.parametrize('squeeze,input_type,expected', [
(True, 'multi', pd.MultiIndex.from_tuples([('a', 'a'), ('a', 'b'),
('b', 'a'), ('b', 'b')],
names=['L1', 'L2'])),
(True, 'single', pd.Index(['a', 'a', 'b', 'b'], name='L1')),
(False, 'multi', pd.MultiIndex.from_tuples([('a', 'a'), ('a', 'b'),
('b', 'a'), ('b', 'b')],
names=['L1', 'L2'])),
(False, 'single', pd.MultiIndex.from_tuples([('a',), ('a',),
('b',), ('b',)],
names=['L1']))
])
def test_from_frame_squeeze(squeeze, input_type, expected):
if input_type == 'multi':
df = pd.DataFrame([['a', 'a'], ['a', 'b'], ['b', 'a'], ['b', 'b']],
columns=['L1', 'L2'])
elif input_type == 'single':
df = pd.DataFrame([['a'], ['a'], ['b'], ['b']], columns=['L1'])

result = pd.MultiIndex.from_frame(df, squeeze=squeeze)
tm.assert_index_equal(expected, result)


def test_from_frame_non_frame():
with tm.assert_raises_regex(TypeError, 'Input must be a DataFrame'):
Contributor

This should be with pytest.raises(TypeError, match='Input must be a DataFrame') now.

Contributor Author

fixed, thanks.

pd.MultiIndex.from_frame([1, 2, 3, 4])
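
The form the reviewer asks for would look roughly like this. A sketch only, assuming pytest is installed and that `from_frame` raises with the message shown in the diff:

```python
import pandas as pd
import pytest


def test_from_frame_non_frame():
    # match is a regular expression searched against the error message.
    with pytest.raises(TypeError, match='Input must be a DataFrame'):
        pd.MultiIndex.from_frame([1, 2, 3, 4])
```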


def test_from_frame_dtype_fidelity():
df = pd.DataFrame({
'dates': pd.date_range('19910905', periods=6),
Member

Can you also test dates with timezones?

Contributor Author

Done, in both to_frame and from_frame.

'a': [1, 1, 1, 2, 2, 2],
'b': pd.Categorical(['a', 'a', 'b', 'b', 'c', 'c'], ordered=True),
'c': ['x', 'x', 'y', 'z', 'x', 'y']
})
original_dtypes = df.dtypes.to_dict()
mi = pd.MultiIndex.from_frame(df)
mi_dtypes = {name: mi.levels[i].dtype for i, name in enumerate(mi.names)}
assert original_dtypes == mi_dtypes
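
The timezone case the reviewer asks about can be checked the same way. A minimal sketch, assuming UTC as the timezone (the test actually added to the PR may use a different one) and keeping only the date and integer columns:

```python
import pandas as pd

df = pd.DataFrame({
    # tz-aware dates: the column dtype carries the timezone.
    'dates': pd.date_range('19910905', periods=6, tz='UTC'),
    'a': [1, 1, 1, 2, 2, 2],
})

original_dtypes = df.dtypes.to_dict()
mi = pd.MultiIndex.from_frame(df)

# Compare each level's dtype against the column it was built from.
mi_dtypes = {name: mi.levels[i].dtype for i, name in enumerate(mi.names)}
```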


def test_from_frame_names_as_list():
df = pd.DataFrame([['a', 'a'], ['a', 'b'], ['b', 'a'], ['b', 'b']],
columns=['L1', 'L2'])
mi = pd.MultiIndex.from_frame(df, names=['a', 'b'])
assert mi.names == ['a', 'b']


def test_from_frame_names_as_callable():
df = pd.DataFrame([['a', 'a'], ['a', 'b'], ['b', 'a'], ['b', 'b']],
columns=pd.MultiIndex.from_tuples([('L1', 'x'),
('L2', 'y')]))
mi = pd.MultiIndex.from_frame(df, names=lambda x: '_'.join(x))
assert mi.names == ['L1_x', 'L2_y']


def test_from_frame_names_bad_input():
df = pd.DataFrame([['a', 'a'], ['a', 'b'], ['b', 'a'], ['b', 'b']],
columns=['L1', 'L2'])
with tm.assert_raises_regex(TypeError, "names' must be a list / sequence "
"of column names, or a callable."):
pd.MultiIndex.from_frame(df, names='bad')
31 changes: 31 additions & 0 deletions pandas/tests/indexes/multi/test_conversion.py
@@ -82,6 +82,20 @@ def test_to_frame():
tm.assert_frame_equal(result, expected)


def test_to_frame_dtype_fidelity():
mi = pd.MultiIndex.from_arrays([
pd.date_range('19910905', periods=6),
[1, 1, 1, 2, 2, 2],
Contributor

Is this a repeat of the test above? If so, it is not necessary here.

Contributor Author

This test was at the suggestion of @TomAugspurger
#23141 (comment)

pd.Categorical(['a', 'a', 'b', 'b', 'c', 'c'], ordered=True),
['x', 'x', 'y', 'z', 'x', 'y']
], names=['dates', 'a', 'b', 'c'])
original_dtypes = {name: mi.levels[i].dtype
for i, name in enumerate(mi.names)}
df = mi.to_frame()
df_dtypes = df.dtypes.to_dict()
assert original_dtypes == df_dtypes


def test_to_hierarchical():
index = MultiIndex.from_tuples([(1, 'one'), (1, 'two'), (2, 'one'), (
2, 'two')])
@@ -169,3 +183,20 @@ def test_to_series_with_arguments(idx):
assert s.values is not idx.values
assert s.index is not idx
assert s.name != idx.name


def test_squeeze_single_level():
mi = pd.MultiIndex.from_tuples([('a',), ('a',), ('b',), ('b',)],
names=['L1'])
expected = pd.Index(['a', 'a', 'b', 'b'], name='L1')
result = mi.squeeze()
tm.assert_index_equal(expected, result)


def test_squeeze_multi_level():
mi = pd.MultiIndex.from_tuples([('a', 'a'), ('a', 'b'), ('b', 'a'),
('b', 'b')],
names=['L1', 'L2'])
expected = mi.copy()
result = mi.squeeze()
tm.assert_index_equal(expected, result)