Update/calculate hourly airqualitydata using bigqdata #4478

NicholasTurner23 · 2025-02-25T09:08:44Z

Description

Update model selection to use country models.
Optimize and document machine learning utils.

Related Issues

JIRA cards:
- OPS-354

Summary by CodeRabbit

New Features
- The calibration process now supports customizable grouping (e.g., grouping by country) for improved data accuracy.
- Introduced new country constants to enhance the selection of calibration models.
Refactor
- Standardized data preprocessing and forecasting workflows with improved type-safety and consistent frequency handling.
- Enhanced internal logging for clearer operational insights.
Documentation
- Updated method descriptions and inline guidance to improve clarity around data processing and forecasting procedures.

coderabbitai · 2025-02-25T09:08:52Z

📝 Walkthrough

Walkthrough

The changes enhance the flexibility and clarity of calibration and forecasting methods across multiple modules. The calibrate_data method now accepts a customizable grouping parameter and leverages the new CountryModels enum for model retrieval. Several methods across utilities and workflows have improved type annotations, detailed docstrings, and updated logging. Additionally, minor adjustments such as parameter renaming and the replacement of hardcoded frequency strings with centralized constants ensure consistency throughout the codebase.

Changes

| File(s) | Change Summary |
| --- | --- |
| `src/workflows/airqo_etl_utils/airqo_utils.py`, `src/workflows/dags/airqo_measurements.py`, `src/workflows/dags/airqo_mobile_measurements.py` | Updated `calibrate_data` to include a `groupby` parameter (defaulting to "country") and modified internal logic to use `CountryModels` for model retrieval. |
| `src/workflows/airqo_etl_utils/constants.py`, `src/workflows/dags/forecast_prediction_jobs.py` | Added new `CountryModels` enum and replaced hardcoded frequency strings with `Frequency` constants for consistent frequency handling. |
| `src/workflows/airqo_etl_utils/ml_utils.py` | Enhanced type annotations, updated method signatures, enriched docstrings, and improved logging in multiple ML utility functions. |
| `src/workflows/airqo_etl_utils/utils.py` | Modified `get_calibration_model_path` to accept a flexible parameter (`calibrateby`) with type checks for city vs. `CountryModels`. |
| `src/workflows/dags/weather_measurements.py` | Renamed parameter in `save_weather_data` from `data` to `dataframe` for clarity and consistency. |

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant AirQoDataUtils
    participant CountryModels
    Caller->>AirQoDataUtils: Call calibrate_data(data, groupby="country")
    AirQoDataUtils->>data: Group data using provided `groupby`
    AirQoDataUtils->>CountryModels: Retrieve model list (using lowercase key)
    AirQoDataUtils-->>Caller: Return calibrated DataFrame
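
The flow in the diagram can be sketched roughly as follows. This is a minimal pure-Python illustration, not the code from the PR: the enum members, the `group_rows` and `pick_model_key` helpers, and the sample rows are all hypothetical, and plain dicts stand in for DataFrame rows.

```python
from collections import defaultdict
from enum import Enum


class CountryModels(Enum):
    # Hypothetical members; the real enum lives in constants.py.
    DEFAULT = "default"
    UGANDA = "uganda"
    KENYA = "kenya"


def group_rows(rows, groupby="country"):
    """Group row dicts by the value stored under `groupby`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[groupby]].append(row)
    return dict(groups)


def pick_model_key(group_name):
    """Return the lowercase key if a model exists for it, else the default key."""
    available = [member.value for member in CountryModels]
    key = str(group_name).lower()
    return key if key in available else CountryModels.DEFAULT.value


# Made-up sample rows, for illustration only.
rows = [
    {"site_id": "a", "country": "Uganda"},
    {"site_id": "b", "country": "Kenya"},
    {"site_id": "c", "country": "Ghana"},
]
selected = {name: pick_model_key(name) for name in group_rows(rows)}
```

In this sketch a group with no dedicated model ("Ghana") falls back to the default key, mirroring the fallback behaviour the review describes further down.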

Possibly related PRs

Update/calculate hourly airqualitydata using bigqdata #4355: Updates to the calibrate_data method with changes focused on grouping and parameter handling.
Update/calculate hourly airqualitydata using bigqdata #4450: Direct modifications to the calibrate_data signature and internal logic for calibration.
Update/calculate hourly airqualitydata using bigqdata #4474: Adjustments in DAGs to incorporate the new groupby parameter when calling calibrate_data.

Suggested reviewers

Baalmart
Psalmz777
BenjaminSsempala

Poem

In our code’s vibrant flow, new parameters gleam,
With grouping by country, we refine the calibration scheme.
Enums and type hints light up our flow,
Logging and docs ensure clarity as we grow.
Cheers to improved code, shining bright like a dream!

Warning

Review ran into problems

🔥 Problems

Errors were encountered while retrieving linked issues.

Errors (1)

JIRA integration encountered authorization issues. Please disconnect and reconnect the integration in the CodeRabbit UI.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (6)

src/workflows/airqo_etl_utils/utils.py (3)

11-18: Remove unused Frequency import.

The Frequency enum is imported but never used here, triggering a lint warning. Consider removing it from the import list to keep imports clean.

🧰 Tools

🪛 Ruff (0.8.2)

16-16: .constants.Frequency imported but unused

Remove unused import: .constants.Frequency

(F401)

21-21: Eliminate the unused Optional.

Optional is currently not referenced anywhere in this file. Removing it helps maintain a tidy import list and avoid future confusion.

🧰 Tools

🪛 Ruff (0.8.2)

21-21: typing.Optional imported but unused

Remove unused import: typing.Optional

(F401)

308-310: Enhance docstring parameter explanation.

Although it references a generic city or country, it may be helpful to add examples indicating how calibrateby can be either a CountryModels enum member or a simple string.
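
A docstring along these lines might cover both accepted forms. This is only a sketch: the `pollutant` parameter, the enum member, and the returned file-name pattern are assumptions made for illustration and are not confirmed by the PR.

```python
from enum import Enum
from typing import Union


class CountryModels(Enum):
    # Hypothetical member, for illustration only.
    UGANDA = "uganda"


def get_calibration_model_path(
    calibrateby: Union[str, CountryModels], pollutant: str
) -> str:
    """Build the file name of a calibration model.

    Args:
        calibrateby: Either a ``CountryModels`` member
            (e.g. ``CountryModels.UGANDA``) or a plain string such as a
            city name (e.g. ``"kampala"``).
        pollutant: Pollutant identifier, e.g. ``"pm2_5"`` (assumed parameter).

    Returns:
        A file name. The naming pattern is assumed, not taken from the PR.
    """
    if isinstance(calibrateby, CountryModels):
        name = calibrateby.value
    else:
        name = str(calibrateby).lower()
    return f"{name}_{pollutant}_calibration_model.pkl"
```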

src/workflows/airqo_etl_utils/airqo_utils.py (2)

10-17: Remove unused CityModel import.

CityModel import appears redundant now that CountryModels is used. Please remove it to address the lint warning.

🧰 Tools

🪛 Ruff (0.8.2)

14-14: .constants.CityModel imported but unused

Remove unused import: .constants.CityModel

(F401)

665-665: Case-insensitive matching detail.

While .lower() works here, consider .casefold() for more robust caseless matching of non-ASCII names if needed. Note that neither method is locale-aware. This is a minor nuance.
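
To illustrate the difference, here is a small standalone comparison using only built-in string methods. The `same_name` helper is hypothetical and not part of the PR.

```python
def same_name(a: str, b: str) -> bool:
    """Caseless comparison using casefold(), which also folds characters like 'ß'."""
    return a.casefold() == b.casefold()


# lower() leaves the German sharp s untouched, so the two spellings differ.
lower_match = "STRASSE".lower() == "straße".lower()

# casefold() maps 'ß' to 'ss', so the two spellings compare equal.
casefold_match = same_name("STRASSE", "straße")
```

For plain ASCII country names such as "Uganda" the two methods behave identically, which is why this stays a nitpick.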

src/workflows/airqo_etl_utils/ml_utils.py (1)

335-353: Clear 3D coordinate transformations.

The docstring clarifies the need for radians, which is accurate. Consider adding a quick check or conversion for cases where lat/long might still be in degrees.
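
One possible shape for such a check, assuming the usual unit-sphere transform. The `to_cartesian` name and its `degrees` flag are hypothetical and not taken from `ml_utils.py`.

```python
import math


def to_cartesian(lat: float, lon: float, degrees: bool = False):
    """Convert latitude/longitude to (x, y, z) on the unit sphere.

    Inputs are expected in radians unless degrees=True is passed.
    """
    if degrees:
        lat, lon = math.radians(lat), math.radians(lon)
    # Values outside the valid radian ranges almost certainly mean the
    # caller passed degrees by mistake.
    if abs(lat) > math.pi / 2 or abs(lon) > math.pi:
        raise ValueError("lat/lon look like degrees; pass degrees=True")
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return x, y, z
```

A range check like this cannot catch small degree values, for example latitudes close to the equator, so an explicit flag is likely safer than guessing the unit from the data.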

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 58091d1 and 71bfda4.

📒 Files selected for processing (8)

src/workflows/airqo_etl_utils/airqo_utils.py (5 hunks)
src/workflows/airqo_etl_utils/constants.py (1 hunks)
src/workflows/airqo_etl_utils/ml_utils.py (13 hunks)
src/workflows/airqo_etl_utils/utils.py (2 hunks)
src/workflows/dags/airqo_measurements.py (2 hunks)
src/workflows/dags/airqo_mobile_measurements.py (1 hunks)
src/workflows/dags/forecast_prediction_jobs.py (7 hunks)
src/workflows/dags/weather_measurements.py (1 hunks)

✅ Files skipped from review due to trivial changes (1)

src/workflows/dags/weather_measurements.py

🧰 Additional context used

🪛 Ruff (0.8.2)

src/workflows/airqo_etl_utils/airqo_utils.py

14-14: .constants.CityModel imported but unused

Remove unused import: .constants.CityModel

(F401)

src/workflows/airqo_etl_utils/utils.py

16-16: .constants.Frequency imported but unused

Remove unused import: .constants.Frequency

(F401)

21-21: typing.Optional imported but unused

Remove unused import: typing.Optional

(F401)

⏰ Context from checks skipped due to timeout of 90000ms (2)

GitHub Check: Analyze (python)
GitHub Check: Analyze (javascript)

🔇 Additional comments (24)

src/workflows/airqo_etl_utils/constants.py (1)

281-300: Well-structured addition of CountryModels enum

The CountryModels enum follows the consistent pattern established by other enums in this file, with proper string representation methods and clear documentation. This addition enables a more flexible geographic model selection approach.
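
For readers without the diff at hand, an enum in this style might look roughly like the sketch below. The members and the `str` property are assumptions based on the `frequency.str` usage mentioned elsewhere in this review, not the actual contents of `constants.py`.

```python
from enum import Enum


class CountryModels(Enum):
    """Countries that have a dedicated calibration model (hypothetical members)."""

    DEFAULT = "default"
    UGANDA = "uganda"
    KENYA = "kenya"

    def __str__(self) -> str:
        # Render the member as its plain string value.
        return self.value

    @property
    def str(self) -> str:
        # Convenience accessor, assumed to mirror the pattern of the other enums.
        return self.value


# Lookup by value returns the matching member.
member = CountryModels("uganda")
```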

src/workflows/dags/airqo_measurements.py (2)

107-108: Parameter addition enhances calibration flexibility

Adding the groupby="country" parameter to the calibration method aligns with the PR objective to utilize country-specific models, allowing for more targeted data calibration processes.

419-419: Consistent implementation of country-based grouping

The same groupby="country" parameter is appropriately applied to the realtime measurements calibration process, ensuring consistent behavior across different DAG workflows.

src/workflows/dags/forecast_prediction_jobs.py (3)

13-13: Good addition of Frequency import

Adding this import prepares the file for the string-to-enum conversion below.

43-45: Improved code maintainability with enum usage

Replacing string literals with the Frequency enum values enhances code quality by centralizing frequency definitions, preventing typos, and simplifying future changes to frequency handling.
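
As a rough illustration of why this helps, consider the sketch below. The member values are assumptions, the `forecast_model_filename` helper is hypothetical, and the file-name pattern is the one quoted in the `ml_utils.py` comments of this review.

```python
from enum import Enum


class Frequency(Enum):
    # Assumed values; the real enum is defined in constants.py.
    HOURLY = "hourly"
    DAILY = "daily"

    @property
    def str(self) -> str:
        return self.value


def forecast_model_filename(frequency: Frequency) -> str:
    """Build the model file name from the enum instead of a raw string."""
    return f"{frequency.str}_forecast_model.pkl"


def parse_frequency(raw: str) -> Frequency:
    """Convert external input once, at the boundary; typos fail loudly here."""
    return Frequency(raw.strip().lower())
```

With this shape, a misspelling such as "houly" raises a `ValueError` at the boundary rather than silently producing a wrong file name downstream.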

49-49: Systematic replacement of string literals with enum constants

This comprehensive replacement of hardcoded frequency strings with centralized enum constants across all relevant functions demonstrates excellent attention to consistency and maintainability.

Also applies to: 53-53, 57-57, 66-70, 79-79, 97-97, 101-101, 105-105, 109-109, 117-122, 131-131

src/workflows/dags/airqo_mobile_measurements.py (1)

63-63: Enhanced calibration with country-based grouping

Adding the groupby="country" parameter to mobile measurements calibration ensures consistency with the approach used in other workflows and enables country-specific model selection.

src/workflows/airqo_etl_utils/utils.py (1)

326-328: Approved usage of CountryModels.

This check and distinct return logic nicely accommodate both enum-based and string-based calibration targets. The code is concise and clear.

src/workflows/airqo_etl_utils/airqo_utils.py (10)

574-574: Parameter extension for flexible calibration.

Introducing groupby enhances reusability, allowing calibrations by various location types. This aligns with the broader country-based model logic.

603-603: Potential KeyError for unexpected group fields.

Accessing sites[["site_id", groupby]] could fail if groupby is not set correctly. Consider validating or providing a fallback.
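
A guard for this could be as small as the sketch below. The `validate_groupby` helper and its `fallback` argument are hypothetical. `columns` can be any iterable of column names, for example `sites.columns`.

```python
from typing import Iterable, Optional


def validate_groupby(
    columns: Iterable[str], groupby: str, fallback: Optional[str] = None
) -> str:
    """Return a column name that is safe to select, or raise a clear error."""
    available = list(columns)
    if groupby in available:
        return groupby
    if fallback is not None and fallback in available:
        return fallback
    raise ValueError(
        f"groupby column {groupby!r} not found; available columns: {available}"
    )
```

Calling this before the column selection turns an opaque `KeyError` into a message that names the missing field.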

645-645: Consistent grouping by dynamic field.

Switching to dynamic grouping fosters scalable logic for different calibration levels (e.g., city, country). Looks good.

650-650: Robust default PM2.5 model fallback.

Relying on CountryModels.DEFAULT ensures continuity when a custom model is unavailable. This increases reliability.

657-657: Logical default PM10 model usage.

Same structured fallback for PM10 calibration prevents potential model-not-found errors.

661-661: Maintaining consistent model keys.

Gathering CountryModels values into a list enables uniform checking of available calibration keys. This is clean and future-proof.

663-663: Clear iteration over grouped DataFrame.

Iterating (country, group) from grouped_df is straightforward and maintainable. Appropriately splits calibration tasks by location.

671-671: Prioritizing custom PM2.5 model.

Fetching the PM2.5 model by explicit country name further personalizes the calibration flow.

678-678: Custom PM10 model consistency.

Same approach to PM10 ensures uniform calibration logic across pollutants.

893-893: Country-based grouping in action.

Passing "country" for the groupby parameter ensures the calibration logic aligns with the newly introduced approach.

src/workflows/airqo_etl_utils/ml_utils.py (6)

99-123: Improved docstring clarity in preprocess_data.

Defining the mandatory columns and explaining the daily vs. hourly logic is helpful. You might also clarify how to handle large missing data segments beyond simple interpolation.

162-190: Comprehensive lag & roll documentation.

The function thoroughly covers both daily and hourly scenarios, including rolling stats. This structured approach provides convenient feature engineering. Keep watching memory impact when dealing with large DataFrames.

241-294: Robust time-features extraction.

Allowing both daily and hourly branches with freq.str is tidy. The added error handling for invalid frequencies is good practice.
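
The branching pattern described here might look something like this sketch. The feature names, the `WEEKLY` member, and the `time_features` helper are hypothetical, and a single `datetime` stands in for a timestamp column.

```python
from datetime import datetime
from enum import Enum


class Frequency(Enum):
    # Assumed values; WEEKLY exists only to exercise the error branch.
    HOURLY = "hourly"
    DAILY = "daily"
    WEEKLY = "weekly"

    @property
    def str(self) -> str:
        return self.value


def time_features(ts: datetime, freq: Frequency) -> dict:
    """Extract calendar features, branching on the frequency."""
    features = {"year": ts.year, "month": ts.month, "dayofweek": ts.weekday()}
    if freq.str == "daily":
        features["week"] = ts.isocalendar()[1]
    elif freq.str == "hourly":
        features["hour"] = ts.hour
    else:
        raise ValueError(f"Unsupported frequency: {freq.str!r}")
    return features


hourly = time_features(datetime(2025, 2, 25, 9), Frequency.HOURLY)
daily = time_features(datetime(2025, 2, 25, 9), Frequency.DAILY)
```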

296-333: Cyclic features for daily/hourly data.

Capturing cyclical seasonality with sine/cosine transformations is a solid approach. Ensure these transformations align with your domain boundaries (e.g., year, month).
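
The standard sine/cosine encoding is sketched below. The `cyclic_encode` helper is hypothetical. The `period` argument must match the feature's domain (24 for hours, 12 for months, and so on), which is the boundary concern raised above.

```python
import math


def cyclic_encode(value: float, period: float):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def circle_distance(a, b) -> float:
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


midnight = cyclic_encode(0, 24)
late_evening = cyclic_encode(23, 24)
noon = cyclic_encode(12, 24)
```

In the encoded space, hour 23 sits next to hour 0 while noon is the farthest point from midnight, which a raw integer hour cannot express.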

703-703: Model retrieval with frequency.str.

Using f"{frequency.str}_forecast_model.pkl" ensures consistent naming logic. Good synergy with the GCS reading pattern.

757-759: Frequency-based forecast collection.

The branching on frequency.str is straightforward, ensuring the correct MongoDB stores for either daily or hourly forecasts.

NicholasTurner23 added 6 commits February 25, 2025 11:54

Optimize and document machine learning utils

03f7d65

Use enums for frequency

e24ba68

Clean up model selection

a336c1f

Update model selection implementation to use country models

3ffa715

Clean up

da53303

Updates from airqstaging staging

71bfda4

coderabbitai bot reviewed Feb 25, 2025

Baalmart approved these changes Feb 25, 2025

Baalmart merged commit df9f410 into airqo-platform:staging Feb 25, 2025
46 checks passed

Baalmart mentioned this pull request Feb 25, 2025

move to production #4479

Merged

1 task

This was referenced Feb 25, 2025

Update/calculate hourly airqualitydata using bigqdata #4482

Merged

Update/calculate hourly airqualitydata using bigqdata #4488

Merged
