BUG: Fix pd.json_normalize to not skip the first element of a generator input #38698

Merged
merged 11 commits into from
Dec 30, 2020
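
The bug named in the title is an instance of a general property of Python iterators: they are single-pass, so code that inspects the first element without saving it loses that element. The following is a minimal stdlib-only sketch of that failure mode. It is not the pandas implementation; `first_is_dict` and `rows` are hypothetical names used only for illustration.

```python
from collections import abc


def first_is_dict(records):
    """Hypothetical helper: inspect the first record by iterating."""
    for record in records:
        return isinstance(record, dict)
    return False


def rows():
    """Hypothetical generator of three flat records."""
    yield {"a": 1}
    yield {"a": 2}
    yield {"a": 3}


# A list can be traversed repeatedly, so peeking loses nothing.
as_list = list(rows())
first_is_dict(as_list)
remaining_from_list = list(as_list)

# A generator is single-pass: the peek consumes the first record.
gen = rows()
first_is_dict(gen)
remaining_from_gen = list(gen)

# Generators are iterators; lists are iterable but are not iterators.
gen_is_iterator = isinstance(rows(), abc.Iterator)
list_is_iterator = isinstance(as_list, abc.Iterator)
```

The `abc.Iterator` check at the end is the same test the diff below uses to decide when to materialize the input.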
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -258,6 +258,8 @@ I/O
 - Allow custom error values for parse_dates argument of :func:`read_sql`, :func:`read_sql_query` and :func:`read_sql_table` (:issue:`35185`)
 - Bug in :func:`to_hdf` raising ``KeyError`` when trying to apply
   for subclasses of ``DataFrame`` or ``Series`` (:issue:`33748`).
+- Bug in :func:`json_normalize` resulting in the first element of a generator object not being included in the returned ``DataFrame`` (:issue:`35923`)
+

 Period
 ^^^^^^
11 changes: 7 additions & 4 deletions pandas/io/json/_normalize.py
@@ -1,7 +1,7 @@
 # ---------------------------------------------------------------------
 # JSON normalization routines

-from collections import defaultdict
+from collections import abc, defaultdict
 import copy
 from typing import Any, DefaultDict, Dict, Iterable, List, Optional, Union

@@ -261,9 +261,12 @@ def _pull_records(js: Dict[str, Any], spec: Union[List, str]) -> List:

     if isinstance(data, list) and not data:
         return DataFrame()
-
-    # A bit of a hackjob
-    if isinstance(data, dict):
+    elif isinstance(data, abc.Iterator):
+        # GH35923 Fix pd.json_normalize to not skip the first element of a
+        # generator input
+        data = list(data)
Member

This could have significant performance implications for large generators. Would it alternatively be possible to store just the first element for inspection and reuse it as needed, while preserving the state of the generator?

Contributor

We barely support generators (it's not even documented), so -1 if this adds any complexity.
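
For reference, the two options discussed in this thread can be sketched side by side using only the standard library. This is an illustrative sketch, not pandas code: `peek` and `materialize_if_iterator` are hypothetical helpers, and whether the rest of `json_normalize` could work with a lazy iterator is not established here.

```python
from collections import abc
from itertools import chain


def peek(iterator):
    """Return (first_item, equivalent_iterator) without losing first_item.

    Hypothetical helper. Raises StopIteration if the iterator is empty.
    """
    first = next(iterator)
    # Re-attach the consumed item in front of the remaining items.
    return first, chain([first], iterator)


def materialize_if_iterator(data):
    """Mirror the merged approach: convert an iterator into a list."""
    if isinstance(data, abc.Iterator):
        return list(data)
    return data


def numbers():
    """Hypothetical generator standing in for a stream of records."""
    yield from range(5)


# Lazy option: inspect the first item, then iterate over everything.
first, restored = peek(numbers())
lazy_result = list(restored)

# Eager option: hold the whole input in memory up front.
eager_result = materialize_if_iterator(numbers())
```

The lazy variant keeps memory use flat until the data is consumed, but adds a helper and an extra code path; the eager variant is a single branch, which matches the preference for low complexity expressed above.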

+    elif isinstance(data, dict):
+        # A bit of a hackjob
         data = [data]

     if record_path is None:
11 changes: 11 additions & 0 deletions pandas/tests/io/json/test_normalize.py
@@ -518,6 +518,17 @@ def test_meta_non_iterable(self):
         )
         tm.assert_frame_equal(result, expected)

+    def test_generator(self, state_data):
+        # GH35923 Fix pd.json_normalize to not skip the first element of a
+        # generator input
+        def generator_data():
+            yield from state_data[0]["counties"]
+
+        result = json_normalize(generator_data())
+        expected = DataFrame(state_data[0]["counties"])
+
+        tm.assert_frame_equal(result, expected)
+

 class TestNestedToRecord:
     def test_flat_stays_flat(self):