BUG: Fix pd.json_normalize to not skip the first element of a generator input #38698


Merged
merged 11 commits into from
Dec 30, 2020
Changes from 4 commits
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -258,6 +258,8 @@ I/O
- Allow custom error values for parse_dates argument of :func:`read_sql`, :func:`read_sql_query` and :func:`read_sql_table` (:issue:`35185`)
- Bug in :func:`to_hdf` raising ``KeyError`` when trying to apply
for subclasses of ``DataFrame`` or ``Series`` (:issue:`33748`).
- Bug in :func:`json_normalize` resulting in the first element of a generator object not being included in the returned ``DataFrame`` (:issue:`35923`)


Period
^^^^^^
7 changes: 6 additions & 1 deletion pandas/io/json/_normalize.py
@@ -1,7 +1,7 @@
# ---------------------------------------------------------------------
# JSON normalization routines

-from collections import defaultdict
+from collections import abc, defaultdict
import copy
from typing import Any, DefaultDict, Dict, Iterable, List, Optional, Union

@@ -262,6 +262,11 @@ def _pull_records(js: Dict[str, Any], spec: Union[List, str]) -> List:
    if isinstance(data, list) and not data:
        return DataFrame()

    if isinstance(data, abc.Iterator):
Contributor

Can you make these if/elif (all 3 conditions)?

Contributor Author

Done

        # GH35923 Fix pd.json_normalize to not skip the first element of a
        # generator input
        data = list(data)
Member

This could have some big performance implications when dealing with large generators. Is it not alternatively possible to just store the first element for inspection and reuse it as necessary while maintaining the state of the generator?

Contributor

We barely support generators (it's not even documented), so -1 if this adds any complexity.
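
For reference, the alternative raised in this thread (inspect the first element, then put it back so the generator still yields everything lazily) can be sketched with the standard library alone. This is only an illustration of the idea, not code from this PR or from pandas; `peek` and `records` are hypothetical names and the rows are made-up data.

```python
from itertools import chain
from typing import Any, Iterator, Tuple


def peek(iterator: Iterator[Any]) -> Tuple[Any, Iterator[Any]]:
    """Return the first item plus an iterator that still yields every item.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    first = next(iterator)  # raises StopIteration on an empty iterator
    # Put the inspected item back in front of the items not yet consumed.
    return first, chain([first], iterator)


def records():
    # Made-up rows standing in for JSON records.
    yield {"name": "Summit", "population": 1234}
    yield {"name": "Cuyahoga", "population": 1337}


first, restored = peek(records())
names = [row["name"] for row in restored]
```

The trade-off discussed above is visible here: `peek` avoids holding the whole input in memory, but every caller has to switch to the re-chained iterator, which is the extra complexity the second reviewer objects to.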


    # A bit of a hackjob
    if isinstance(data, dict):
        data = [data]
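
As a rough illustration of the review request to fold the three input checks into one if/elif chain, the following stdlib-only sketch mirrors the conditions shown in this hunk. It is an assumption about how such a chain could look, not necessarily the form that was merged; `coerce_records` is a hypothetical name, and it returns plain lists instead of a `DataFrame` so that it runs without pandas.

```python
from collections import abc
from typing import Any


def coerce_records(data: Any) -> Any:
    """Hypothetical stand-in for the input checks; not pandas code."""
    if isinstance(data, list) and not data:
        # Empty input: nothing to normalize.
        return []
    elif isinstance(data, dict):
        # A single record is wrapped so later code can iterate over it.
        data = [data]
    elif isinstance(data, abc.Iterator):
        # Materialize one-shot iterators so that later inspection
        # cannot consume their first element.
        data = list(data)
    return data


gen = (row for row in [{"a": 1}, {"a": 2}])
result = coerce_records(gen)
single = coerce_records({"a": 1})
```

Because the branches are mutually exclusive, the chain makes it explicit that each input takes exactly one path.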
11 changes: 11 additions & 0 deletions pandas/tests/io/json/test_normalize.py
@@ -518,6 +518,17 @@ def test_meta_non_iterable(self):
        )
        tm.assert_frame_equal(result, expected)

    def test_generator(self, state_data):
        # GH35923 Fix pd.json_normalize to not skip the first element of a
        # generator input
        def generator_data():
            yield from state_data[0]["counties"]

        result = json_normalize(generator_data())
        expected = DataFrame(state_data[0]["counties"])

        tm.assert_frame_equal(result, expected)


class TestNestedToRecord:
    def test_flat_stays_flat(self):
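
The regression test in this file guards against a general iterator pitfall: inspecting a generator consumes whatever it looks at. The stdlib-only sketch below shows that pitfall and why materializing with `list()` avoids it. It is not claimed to reproduce the exact code path inside `json_normalize`; `rows` is a hypothetical generator with made-up data.

```python
def rows():
    # Made-up records for illustration.
    yield {"name": "Dade"}
    yield {"name": "Broward"}
    yield {"name": "Palm Beach"}


# Inspecting a one-shot iterator consumes the items it examines.
inspected = rows()
has_dict = any(isinstance(row, dict) for row in inspected)  # stops at first hit
remaining = [row["name"] for row in inspected]  # first row is already gone

# Materializing first keeps every row available after the inspection.
materialized = list(rows())
has_dict_again = any(isinstance(row, dict) for row in materialized)
all_names = [row["name"] for row in materialized]
```

Here `remaining` lacks the first record, while `all_names` contains all three, which is the behaviour `test_generator` asserts for the patched function.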