improve datetime conversion consistency in APX processor #73

Merged
andersy005 merged 1 commit into main from improve-datetime-conversion-consistency
May 9, 2024

andersy005
Member

@andersy005 andersy005 commented May 9, 2024

This PR fixes an inconsistency in datetime conversion in the APX data processing workflow. I've added an intermediate step that trims the transaction date so that only the date component is kept and the time component is dropped. Without it, `convert_to_datetime` fails when the `transaction_date` column mixes date-only values with full timestamps:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 df.process_apx_credits(registry_name='american-carbon-registry', download_type='retirements')

File ~/mambaforge/envs/offsets-db-data/lib/python3.10/site-packages/pandas_flavor/register.py:153, in register_dataframe_method.<locals>.inner.<locals>.AccessorMethod.__call__(self, *args, **kwargs)
    151 global method_call_ctx_factory
    152 if method_call_ctx_factory is None:
--> 153     return method(self._obj, *args, **kwargs)
    155 return handle_pandas_extension_call(
    156     method, method_signature, self._obj, args, kwargs
    157 )

File ~/mambaforge/envs/offsets-db-data/lib/python3.10/site-packages/offsets_db_data/apx.py:79, in process_apx_credits(df, download_type, registry_name, arb)
     72 column_mapping = load_column_mapping(
     73     registry_name=registry_name, download_type=download_type, mapping_path=CREDIT_SCHEMA_UPATH
     74 )
     76 columns = {v: k for k, v in column_mapping.items()}
     78 data = (
---> 79     df.set_registry(registry_name=registry_name)
     80     .determine_transaction_type(download_type=download_type)
     81     .rename(columns=columns)
     82     .convert_to_datetime(columns=['transaction_date'])
     83 )
     85 if download_type == 'issuances':
     86     data = data.aggregate_issuance_transactions()

File ~/mambaforge/envs/offsets-db-data/lib/python3.10/site-packages/pandas_flavor/register.py:153, in register_dataframe_method.<locals>.inner.<locals>.AccessorMethod.__call__(self, *args, **kwargs)
    151 global method_call_ctx_factory
    152 if method_call_ctx_factory is None:
--> 153     return method(self._obj, *args, **kwargs)
    155 return handle_pandas_extension_call(
    156     method, method_signature, self._obj, args, kwargs
    157 )

File ~/mambaforge/envs/offsets-db-data/lib/python3.10/site-packages/offsets_db_data/common.py:104, in convert_to_datetime(df, columns, utc, **kwargs)
    102 for column in columns:
    103     if column in df.columns:
--> 104         df[column] = pd.to_datetime(df[column], utc=utc, **kwargs).dt.normalize()
    105     else:
    106         raise KeyError(f"The column '{column}' is missing.")

File ~/mambaforge/envs/offsets-db-data/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:1112, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1110         result = arg.map(cache_array)
   1111     else:
-> 1112         values = convert_listlike(arg._values, format)
   1113         result = arg._constructor(values, index=arg.index, name=arg.name)
   1114 elif isinstance(arg, (ABCDataFrame, abc.MutableMapping)):

File ~/mambaforge/envs/offsets-db-data/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:488, in _convert_listlike_datetimes(arg, format, name, utc, unit, errors, dayfirst, yearfirst, exact)
    486 # `format` could be inferred, or user didn't ask for mixed-format parsing.
    487 if format is not None and format != "mixed":
--> 488     return _array_strptime_with_fallback(arg, name, utc, format, exact, errors)
    490 result, tz_parsed = objects_to_datetime64ns(
    491     arg,
    492     dayfirst=dayfirst,
   (...)
    496     allow_object=True,
    497 )
    499 if tz_parsed is not None:
    500     # We can take a shortcut since the datetime64 numpy array
    501     # is in UTC

File ~/mambaforge/envs/offsets-db-data/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:519, in _array_strptime_with_fallback(arg, name, utc, fmt, exact, errors)
    508 def _array_strptime_with_fallback(
    509     arg,
    510     name,
   (...)
    514     errors: str,
    515 ) -> Index:
    516     """
    517     Call array_strptime, with fallback behavior depending on 'errors'.
    518     """
--> 519     result, timezones = array_strptime(arg, fmt, exact=exact, errors=errors, utc=utc)
    520     if any(tz is not None for tz in timezones):
    521         return _return_parsed_timezone_results(result, timezones, utc, name)

File strptime.pyx:534, in pandas._libs.tslibs.strptime.array_strptime()

File strptime.pyx:355, in pandas._libs.tslibs.strptime.array_strptime()

ValueError: time data "4/10/2012" doesn't match format "%m/%d/%Y %I:%M:%S %p", at position 4860. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
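
For illustration, here is a minimal sketch of the trimming step described above. It is not the code merged in this PR: `trim_to_date` is a hypothetical helper, and the sample values other than "4/10/2012" are made up to mimic the mixed formats reported in the error. It assumes the date is always the first space-separated token of each value.

```python
import pandas as pd


def trim_to_date(series: pd.Series) -> pd.Series:
    """Hypothetical helper: keep only the leading date token of each string.

    '5/3/2013 12:00:00 AM' -> '5/3/2013'; '4/10/2012' is left unchanged.
    Missing values stay missing because the .str accessor propagates them.
    """
    return series.str.strip().str.split(' ').str[0]


# Made-up sample mixing date-only values and full timestamps,
# similar to the column that triggered the ValueError above.
raw = pd.Series(['4/10/2012', '5/3/2013 12:00:00 AM', '12/31/2020 11:59:59 PM'])

# After trimming, every value matches a single explicit format.
dates = pd.to_datetime(trim_to_date(raw), format='%m/%d/%Y', utc=True).dt.normalize()
```

Trimming first means one explicit format covers every row, so no per-element format inference (`format='mixed'`) is needed. The trade-off is that any time-of-day information is discarded, which matches the intent stated in the description.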

@andersy005 andersy005 merged commit ff42280 into main May 9, 2024
6 checks passed
@andersy005 andersy005 deleted the improve-datetime-conversion-consistency branch May 9, 2024 22:36