Update/calculate hourly airqualitydata using bigqdata #4524
Conversation
📝 Walkthrough
This pull request updates several utility modules for air quality data processing. In particular, it modifies device retrieval logic for BAM network data in the extraction routines and refines the data calibration process. The changes in the data utility module improve type conversion, filtering, and handling of API data, while two DAG and utility files related to KCCA measurements have been removed. Error handling and method signatures remain consistent with previous behavior.
Sequence Diagram(s)

sequenceDiagram
    participant U as User
    participant AN as AirnowDataUtils
    participant AQ as AirQoApi
    participant DU as DataUtils
    U->>AN: Call extract_bam_data(start_date_time, end_date_time)
    AN->>AQ: get_devices_by_network()
    AN->>DU: get_devices(device_category, network)
    DU-->>AN: Return device list
    AQ-->>AN: Return device list
    AN->>AN: Merge and process devices
    AN-->>U: Return DataFrame or raise ValueError if no devices found

sequenceDiagram
    participant U as User
    participant AD as AirQoDataUtils
    participant M as CalibrationModel
    U->>AD: Call calibrate_data(data, groupby)
    AD->>AD: Filter and prepare input DataFrame
    AD->>M: Request model predictions for PM values
    M-->>AD: Return pm2_5_calibrated and pm10_calibrated values
    AD->>AD: Merge predictions to form calibrated_data
    AD-->>U: Return calibrated_data DataFrame
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/workflows/airqo_etl_utils/airqo_utils.py (1)
`699-736`: Optimize repeated model downloads.
Each grouped calibration retrieves models from GCS, potentially causing performance overhead for large datasets. Consider caching downloaded models the first time they are needed, then reusing them for subsequent groups.
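One way to avoid the repeated downloads is an in-process cache keyed by the model's GCS location. The sketch below is illustrative only; the function and parameter names are assumptions, not the module's actual API:

```python
from functools import lru_cache

from google.cloud import storage


@lru_cache(maxsize=None)
def load_calibration_model_bytes(bucket_name: str, blob_name: str) -> bytes:
    """Download a calibration model from GCS once; later calls reuse the cached bytes."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    return blob.download_as_bytes()
```

With this pattern, each distinct model is fetched from GCS at most once per process, and subsequent grouped calibrations read it from memory.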
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- src/workflows/airqo_etl_utils/airnow_utils.py (2 hunks)
- src/workflows/airqo_etl_utils/airqo_utils.py (4 hunks)
- src/workflows/airqo_etl_utils/datautils.py (4 hunks)
- src/workflows/airqo_etl_utils/kcca_utils.py (0 hunks)
- src/workflows/dags/kcca_measurements.py (0 hunks)
💤 Files with no reviewable changes (2)
- src/workflows/dags/kcca_measurements.py
- src/workflows/airqo_etl_utils/kcca_utils.py
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: Analyze (python)
- GitHub Check: Analyze (javascript)
🔇 Additional comments (6)
src/workflows/airqo_etl_utils/airnow_utils.py (1)
`11-11`: No concerns with this new import.
It cleanly references `DataUtils` without introducing side effects.
src/workflows/airqo_etl_utils/airqo_utils.py (1)
`611-622`: Confirm the need to drop calibration columns.
Dropping columns like `avg_pm2_5`, `avg_pm10`, and others may limit post-calibration analysis. If long-term traceability or further debugging is required, consider preserving them.
src/workflows/airqo_etl_utils/datautils.py (4)
`211-214`: Recheck strict filtering on PM columns.
Dropping rows where all PM columns are null may discard records containing valid data in other columns (e.g., temperature). Confirm that losing such rows is acceptable; the pattern in question is sketched below.
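For reference, the filtering pattern under discussion usually looks like the sketch below; the PM column names are illustrative placeholders rather than the exact columns in datautils.py:

```python
import pandas as pd

PM_COLUMNS = ["pm2_5", "pm10"]  # illustrative; substitute the module's actual PM columns


def drop_rows_without_pm(data: pd.DataFrame) -> pd.DataFrame:
    """Drop a row only when every PM column is null; other columns are not considered."""
    return data.dropna(subset=PM_COLUMNS, how="all")
```

Rows with a valid temperature but no PM readings would still be dropped here, which is exactly the trade-off to confirm.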
`1030-1121`: Confirm robust coordinate parsing and device lookup logic.
`_flatten_location_coordinates_clarity` logs stack traces on invalid coordinates; consider downgrading to `logger.error` if these are frequent user-input errors. `_add_site_and_device_details` depends on `devices.name` existing; verify that the column matches the local device naming convention.
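As an illustration of the logging suggestion, a hedged sketch is shown below; the function name and parsing details are assumptions, not the code in datautils.py:

```python
import logging
from typing import Optional, Tuple

logger = logging.getLogger(__name__)


def parse_coordinates(raw: str) -> Optional[Tuple[float, float]]:
    """Parse a 'lat,lon' string; invalid input is logged without a stack trace."""
    try:
        lat_str, lon_str = raw.split(",")
        return float(lat_str), float(lon_str)
    except (ValueError, AttributeError):
        # logger.error keeps the message terse; logger.exception would also emit a traceback.
        logger.error("Invalid coordinates received: %r", raw)
        return None
```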
`811-811`: ❓ Verification inconclusive
Ensure `format="mixed"` is supported in your pandas environment.
This parameter requires a recent pandas version; double-check your environment's compatibility to prevent runtime errors.
🌐 Web query:
Is "format='mixed'" supported in pandas 1.x?

💡 Result:
The `format='mixed'` option in pandas' date parsing functions like `pd.to_datetime()` was introduced in pandas 2.0.0 (April 2023) and is not supported in pandas 1.x [9][10].

Key points:
- Purpose: allows parsing date strings with inconsistent formats by inferring the format for each element individually.
- Pandas 1.x limitation: users had to either specify a single format for all elements (`format="%Y-%m-%d"`), use `infer_datetime_format=True` (less flexible), or process dates iteratively with custom logic.
- New in 2.0.0: explicit support for mixed-format dates via `format="mixed"` was added to handle real-world datasets with irregular date formatting [9][10].

Example of the new syntax (only works in ≥2.0.0):
pd.to_datetime(["2020-01-01", "01/02/2023"], format="mixed")

For pandas 1.x users, alternatives include:
- Preprocessing data to standardize formats
- Using `errors='coerce'` with `pd.to_datetime()`
- Applying custom parsing logic row-wise with `apply()`
Citations:
- 1: https://community.plotly.com/t/announcing-plotly-py-4-8-plotly-express-support-for-wide-and-mixed-form-data-plus-a-pandas-backend/40048
- 2: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
- 3: https://pandas.pydata.org/pandas-docs/version/2.1.3/user_guide/timeseries.html
- 4: https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
- 5: https://pandas.pydata.org/docs/dev/reference/api/pandas.read_csv.html
- 6: https://hedaro.com/Programming-Languages/Python/Data-Analyst/Data-Analyst---The-Donor-Data-Debacle
- 7: https://pandas.pydata.org/docs/user_guide/io.html
- 8: https://docs.astral.sh/ruff/rules/
- 9: https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html
- 10: DEPR deprecate mixed timezone offsets with utc=False? pandas-dev/pandas#50887
Ensure Compatibility with Pandas 2.0.0 or Later for `format="mixed"`
The `pd.to_datetime(..., format="mixed")` call will only work if you're running pandas 2.0.0 or newer; pandas 1.x does not support this parameter. Please verify that your runtime environment meets this requirement. If not, consider either upgrading pandas or revising the date parsing strategy (e.g., using `infer_datetime_format=True` or custom parsing logic).
- Location: src/workflows/airqo_etl_utils/datautils.py at line 811
- Code snippet: `data["timestamp"] = pd.to_datetime(data["timestamp"], format="mixed")`
`69-89`: ❓ Verification inconclusive
Validate potential non-integer device numbers.
Casting `device_number` to `int` after filling with -1 is fine for missing values, but ensure no rows have non-numeric entries (e.g., string codes) that would raise errors outside the try block.

🏁 Script executed:
#!/bin/bash
# Confirm only integer-like device_number values exist before type casting
rg '"device_number":' -A5
Length of output: 114

Attention: Re-check Device Number Casting for Non-Numeric Values
The initial verification command failed due to an escape sequence error, so the original check did not yield any useful results. While filling missing values with -1 and casting to int is acceptable for missing entries, please manually verify that no non-numeric entries (e.g., string codes) exist in the "device_number" column. For a more precise inspection, consider running a revised shell script such as:
#!/bin/bash
# Search for how device_number is assigned and converted in the code
rg 'devices\["device_number"\]' -n src/workflows/airqo_etl_utils/datautils.py
This should provide context on how "device_number" is handled in the file and help confirm that there is no unexpected non-numeric data leading to potential errors.
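A defensive variant of the cast being reviewed could look like the sketch below; it mirrors the described -1 fill but is not the repository's exact code:

```python
import pandas as pd


def normalize_device_number(devices: pd.DataFrame) -> pd.DataFrame:
    """Coerce device_number to int, mapping missing or non-numeric values to -1."""
    devices["device_number"] = (
        pd.to_numeric(devices["device_number"], errors="coerce").fillna(-1).astype(int)
    )
    return devices
```

`pd.to_numeric(..., errors="coerce")` turns stray string codes into NaN, so they become -1 instead of raising during the `astype(int)` cast.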
devices, _ = DataUtils.get_devices(
    device_category=DeviceCategory.BAM, device_network=DeviceNetwork.METONE
)
🛠️ Refactor suggestion
Consolidate or remove the redundant device retrieval call.
Currently, the result from `AirQoApi().get_devices_by_network(...)` gets overshadowed by the subsequent `DataUtils.get_devices(...)`. Consider removing one of these calls or merging their results to avoid confusion and ensure consistency.
- devices = AirQoApi().get_devices_by_network(
-     device_network=DeviceNetwork.METONE, device_category=DeviceCategory.BAM
- )
  devices, _ = DataUtils.get_devices(
      device_category=DeviceCategory.BAM, device_network=DeviceNetwork.METONE
  )
data["timestamp"] = pd.to_datetime(data["timestamp"])

to_calibrate = data["network"] == "xxxx"
Network check seems to be a placeholder.
Using "xxxx" as the network condition might be unintentional. Verify whether a production network name should be used instead, or if this logic is still under development.
- to_calibrate = data["network"] == "xxxx"
+ # Example correction:
+ to_calibrate = data["network"] == "airqo" # or whichever name is intended
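For context, such a boolean mask is normally used to restrict calibration to matching rows only. The sketch below is a hypothetical illustration; the "airqo" value and the helper name are placeholders:

```python
import pandas as pd


def select_rows_to_calibrate(data: pd.DataFrame, network: str = "airqo") -> pd.DataFrame:
    """Return only the rows whose network matches; other rows would skip calibration."""
    to_calibrate = data["network"] == network
    return data.loc[to_calibrate]
```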
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
to_calibrate = data["network"] == "xxxx" | |
# Example correction: | |
to_calibrate = data["network"] == "airqo" # or whichever name is intended |
Clean up
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##           staging    #4524     +/-   ##
===========================================
- Coverage    11.51%   11.47%   -0.04%
===========================================
  Files          155       63      -92
  Lines        18709     6237   -12472
  Branches       549      549
===========================================
- Hits          2154      716    -1438
+ Misses       16551     5517   -11034
  Partials         4        4
Description
Clean up

Summary by CodeRabbit
- New Features
- Refactor