Review dbt data validation failures on transform_warehouse #3684

Open
erikamov opened this issue Feb 10, 2025 · 8 comments

erikamov commented Feb 10, 2025

User story / feature request

Review dbt data validation failures on transform_warehouse:

  • accepted_values_int_payments__authorisations_deduped_request_type__AUTHORISATION__DEBT_RECOVERY_AUTHCHECK__DEBT_RECOVERY_REVERSAL__CARD_CHECK

    • models/intermediate/payments/_int_payments.yml
  • dbt_utils_accepted_range_fct_daily_schedule_feed_validation_notices_date__DATE_2024_03_26___DATE_2024_01_20_

    • models/mart/gtfs_quality/_mart_gtfs_quality.yml
  • dbt_utils_accepted_range_int_payments__settlements_to_aggregations_net_settled_amount_dollars__True__0

    • models/intermediate/payments/_int_payments.yml
  • dbt_utils_expression_is_true_int_gtfs_schedule__stop_times_grouped_num_approximate_timepoint_stop_times_num_exact_timepoint_stop_times_num_stop_times

    • models/intermediate/gtfs/_int_gtfs.yaml
  • dbt_utils_expression_is_true_int_gtfs_schedule__stop_times_grouped__num_gtfs_flex_stop_times_0_is_gtfs_flex_trip

    • models/intermediate/gtfs/_int_gtfs.yaml
  • dbt_utils_mutually_exclusive_ranges_dim_contract_attachments_required___valid_from__source_record_id___valid_to

    • models/mart/transit_database/_mart_transit_database.yml
  • dbt_utils_expression_is_true_int_payments__micropayments_adjustments_refunds_joined_micropayment_refund_amount_charge_amount

    • models/intermediate/payments/_int_payments.yml
  • dbt_utils_unique_combination_of_columns_base_tts_services_ct_services_map_ct_key__ct_date

    • models/staging/transit_database/base/_base_transit_database.yml
  • dbt_utils_unique_combination_of_columns_base_tts_services_ct_services_map_tts_key__tts_date

    • models/staging/transit_database/base/_base_transit_database.yml
  • not_null_dim_calendar_service_id

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_schedule_feeds_feed_timezone

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_transfers_from_stop_id

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_transfers_to_stop_id

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_trips_route_id

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_trips_trip_id

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_fct_daily_organization_combined_guideline_checks_status

    • models/mart/gtfs_quality/_mart_gtfs_quality.yml
  • not_null_fct_daily_schedule_feed_validation_notices_code

    • models/mart/gtfs_quality/_mart_gtfs_quality.yml
  • not_null_fct_daily_schedule_feed_validation_notices_validation_validator_version

    • models/mart/gtfs_quality/_mart_gtfs_quality.yml
  • not_null_fct_daily_schedule_feed_validation_notices_severity

    • models/mart/gtfs_quality/_mart_gtfs_quality.yml
  • not_null_fct_daily_service_combined_guideline_checks_status

    • models/mart/gtfs_quality/_mart_gtfs_quality.yml
  • not_null_fct_vehicle_locations_grouped_trip_instance_key

    • models/mart/gtfs/_mart_gtfs_fcts.yml
  • not_null_fct_vehicle_locations_trip_instance_key

    • models/mart/gtfs/_mart_gtfs_fcts.yml
  • not_null_int_gtfs_quality__guideline_checks_long_status

    • models/intermediate/gtfs_quality/_int_gtfs_quality.yml
  • relationships_fct_daily_rt_feed_files_schedule_to_use_for_rt_validation_gtfs_dataset_key__key__ref_dim_gtfs_datasets_

    • models/mart/gtfs/_mart_gtfs_fcts.yml
  • unique_dim_calendar__gtfs_key

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • unique_dim_contract_attachments_key

    • models/mart/transit_database/_mart_transit_database.yml
  • unique_dim_fare_media__gtfs_key

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • unique_fct_elavon__transactions_trn_ref_num

    • models/mart/payments/_payments.yml
  • unique_fct_payments_deposit_transactions_trn_ref_num

    • models/mart/payments/_payments.yml
  • unique_fct_payments_aggregations_aggregation_id

    • models/mart/payments/_payments.yml
  • unique_int_elavon__deposit_transactions_trn_ref_num

    • models/staging/payments/elavon/_elavon.yml
  • unique_int_gtfs_quality__guideline_checks_long_key

    • models/intermediate/gtfs_quality/_int_gtfs_quality.yml
  • unique_int_gtfs_quality__rt_validation_outcomes_key

    • models/intermediate/gtfs_quality/_int_gtfs_quality.yml
  • unique_int_payments__settlements_to_aggregations_aggregation_id

    • models/intermediate/payments/_int_payments.yml
  • unique_int_payments__settlements_to_aggregations_retrieval_reference_number

    • models/intermediate/payments/_int_payments.yml
  • unique_miles_traveled_location_name

    • seeds/_seeds.yml
  • unique_miles_traveled_off_location_name

    • seeds/_seeds.yml
  • unique_int_payments__micropayments_adjustments_refunds_joined_micropayment_id

    • models/intermediate/payments/_int_payments.yml
  • unique_proportion_fct_vehicle_locations_grouped_0_9999__trip_instance_key

    • models/mart/gtfs/_mart_gtfs_fcts.yml
  • unique_proportion_fct_vehicle_locations_0_9999__trip_instance_key

    • models/mart/gtfs/_mart_gtfs_fcts.yml
  • unique_fct_service_alerts_messages_key

    • models/mart/gtfs/_mart_gtfs_fcts.yml
  • validate_guidelines_checks_rows_per_check_matches_index

    • tests/validate_guidelines_checks_rows_per_check_matches_index.sql
  • dbt_utils_unique_combination_of_columns_int_transit_database__service_components_unnested_service_key__product_key__component_key__dt

    • models/intermediate/transit_database/_int_transit_database.yml
  • not_null_dim_calendar_end_date

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_calendar_friday

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_calendar_saturday

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_calendar_start_date

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_calendar_sunday

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_calendar_wednesday

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_calendar_thursday

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_fare_attributes_transfers

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_feed_info_feed_publisher_name

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_shapes_shape_pt_sequence

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_shapes_shape_pt_lon

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_stop_times_stop_sequence

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_transfers_transfer_type

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_fct_daily_schedule_feeds_gtfs_dataset_key

    • models/mart/gtfs/_mart_gtfs_fcts.yml
  • not_null_fct_schedule_feed_downloads_gtfs_dataset_key

    • models/mart/gtfs/_mart_gtfs_fcts.yml

Acceptance Criteria

Tests fixed or new tickets created.

Notes

erikamov self-assigned this Feb 10, 2025
erikamov added the bug label Feb 10, 2025
erikamov changed the title from "Fix DBT test failures on transform_warehouse" to "Fix dbt data validation failures on transform_warehouse" Feb 10, 2025
erikamov changed the title from "Fix dbt data validation failures on transform_warehouse" to "Review dbt data validation failures on transform_warehouse" Feb 10, 2025

erikamov commented Feb 10, 2025

I will be adding comments with more details about each failure.

  • accepted_values_int_payments__authorisations_deduped_request_type__AUTHORISATION__DEBT_RECOVERY_AUTHCHECK__DEBT_RECOVERY_REVERSAL__CARD_CHECK

    Table: staging.int_payments__authorisations_deduped
    Column: request_type
    Accepted values: 'AUTHORISATION', 'DEBT_RECOVERY_AUTHCHECK', 'DEBT_RECOVERY_REVERSAL', 'CARD_CHECK'

    Failing because of 2 records with request_type = 'Reversal' from MST on 2024-12-05 and ATN on 2024-12-16

select *
from staging.int_payments__authorisations_deduped
where request_type not in ('AUTHORISATION', 'DEBT_RECOVERY_AUTHCHECK', 'DEBT_RECOVERY_REVERSAL', 'CARD_CHECK')

erikamov removed the bug label Feb 11, 2025

erikamov commented

  • dbt_utils_accepted_range_int_payments__settlements_to_aggregations_net_settled_amount_dollars__True__0
    Table: staging.int_payments__settlements_to_aggregations
    Column: net_settled_amount_dollars
    Accepted value >= 0

    Failing because 2 CCJPA records have negative values: -4 on 2024-08-21 and -6 on 2024-12-17

select *
from staging.int_payments__settlements_to_aggregations
where net_settled_amount_dollars < 0
  • unique_int_payments__settlements_to_aggregations_aggregation_id
    Table: staging.int_payments__settlements_to_aggregations
    Column: aggregation_id
    Unique values

    Failing because of 2 duplicated aggregation_id values from MST between 2024-11-30 and 2024-12-03

select *
from staging.int_payments__settlements_to_aggregations
where aggregation_id in (select aggregation_id
                         from staging.int_payments__settlements_to_aggregations
                         group by aggregation_id
                         having count(*) > 1)
  • unique_int_payments__settlements_to_aggregations_retrieval_reference_number
    Table: staging.int_payments__settlements_to_aggregations
    Column: retrieval_reference_number
    Unique values

    Failing because of 8 duplicated retrieval_reference_number values from MST and ATN

select *
from staging.int_payments__settlements_to_aggregations
where retrieval_reference_number in (select retrieval_reference_number
                         from staging.int_payments__settlements_to_aggregations
                         group by retrieval_reference_number
                         having count(*) > 1)


erikamov commented Feb 11, 2025

  • unique_int_payments__micropayments_adjustments_refunds_joined_micropayment_id
    Table: staging.int_payments__micropayments_adjustments_refunds_joined
    Column: micropayment_id
    Unique values

    Failing because of 2 duplicated micropayment_id values: CCJPA on 2024-11-19 and MST on 2024-08-27

select *
from staging.int_payments__micropayments_adjustments_refunds_joined
where micropayment_id in (select micropayment_id
                            from staging.int_payments__micropayments_adjustments_refunds_joined
                           group by micropayment_id
                          having count(*) > 1)
  • dbt_utils_expression_is_true_int_payments__micropayments_adjustments_refunds_joined_micropayment_refund_amount_charge_amount
    Table: staging.int_payments__micropayments_adjustments_refunds_joined
    Columns: micropayment_refund_amount and charge_amount
    micropayment_refund_amount <= charge_amount

    Failing because of 4 records where micropayment_refund_amount > charge_amount
    CCJPA on 2024-08-09, 24 > 20
    ATN on 2024-08-27, 22.5 > 7.5
    SBMTD on 2024-02-23, 1.7 > 0.85
    MST on 2025-01-14, 4 > 2

select *
from staging.int_payments__micropayments_adjustments_refunds_joined
where micropayment_refund_amount > charge_amount

erikamov commented

  • dbt_utils_accepted_range_fct_daily_schedule_feed_validation_notices_date__DATE_2024_03_26___DATE_2024_01_20_

    Table: mart_gtfs_quality.fct_daily_schedule_feed_validation_notices
    Column: date

    • accepted_range:
      min_value: "DATE'2021-04-16'"
      max_value: "DATE'2022-09-14'"
      where: validation_validator_version = 'v2.0.0'
    • accepted_range:
      min_value: "DATE'2022-09-15'"
      max_value: "DATE'2022-11-15'"
      where: validation_validator_version = 'v3.1.1'
    • accepted_range:
      min_value: "DATE'2022-11-16'"
      max_value: "DATE'2023-08-31'"
      where: validation_validator_version = 'v4.0.0'
    • accepted_range:
      min_value: "DATE'2023-09-01'"
      max_value: "DATE'2024-01-19'"
      where: validation_validator_version = 'v4.1.0'
    • accepted_range:
      min_value: "DATE'2024-01-20'"
      max_value: "DATE'2024-03-26'"
      where: validation_validator_version = 'v4.2.0'
    • accepted_range:
      min_value: "DATE'2024-03-27'"
      where: validation_validator_version = 'v5.0.0'

    Failing because there are 29,140 records with validation_validator_version = 'v4.2.0' and date between 2024-03-27 and 2024-07-29 (outcome_extract_dt between 2024-01-20 and 2024-02-23)

select date, outcome_extract_dt, validation_validator_version, code, count(*) as records_count
  from mart_gtfs_quality.fct_daily_schedule_feed_validation_notices
 where date between '2024-01-20' and '2024-03-27'
   and validation_validator_version != 'v4.2.0'
 group by date, outcome_extract_dt, validation_validator_version, code
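
The query above lists rows dated inside the v4.2.0 window that carry a different validator version. To list the rows the failing test itself flags (v4.2.0 rows dated outside 2024-01-20 through 2024-03-26), a sketch along these lines may help. It assumes the `where` filter and inclusive bounds shown in the test configuration above, and it is not the compiled dbt_utils SQL.

```sql
-- Sketch only: approximates what the accepted_range test for v4.2.0 checks.
-- Bounds are taken from the test configuration quoted in this comment.
select date, outcome_extract_dt, validation_validator_version, code, count(*) as records_count
from mart_gtfs_quality.fct_daily_schedule_feed_validation_notices
where validation_validator_version = 'v4.2.0'
  and (date < DATE '2024-01-20' or date > DATE '2024-03-26')
group by date, outcome_extract_dt, validation_validator_version, code
order by date
```
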
  • not_null_fct_daily_schedule_feed_validation_notices_code / not_null_fct_daily_schedule_feed_validation_notices_validation_validator_version / not_null_fct_daily_schedule_feed_validation_notices_severity

    Table: mart_gtfs_quality.fct_daily_schedule_feed_validation_notices
    Columns: code, validation_validator_version, and severity

    Failing because 23 records have no code, validation_validator_version, or severity, on dates 2025-01-26 and 2025-02-04

select *
from mart_gtfs_quality.fct_daily_schedule_feed_validation_notices
where code is null
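
The query above filters on code only. If the three not_null failures need to be checked together, a broader variant could look like the following; it is a sketch that assumes the same table and simply extends the filter to the other two columns.

```sql
-- Sketch only: covers all three columns named in the failing not_null tests.
select *
from mart_gtfs_quality.fct_daily_schedule_feed_validation_notices
where code is null
   or validation_validator_version is null
   or severity is null
```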

erikamov commented

  • dbt_utils_expression_is_true_int_gtfs_schedule__stop_times_grouped_num_approximate_timepoint_stop_times_num_exact_timepoint_stop_times_num_stop_times

    Table: staging.int_gtfs_schedule__stop_times_grouped
    expression: num_approximate_timepoint_stop_times + num_exact_timepoint_stop_times = num_stop_times
    Failing because of 1 record where num_approximate_timepoint_stop_times + num_exact_timepoint_stop_times does not equal num_stop_times

select *
from staging.int_gtfs_schedule__stop_times_grouped
where num_approximate_timepoint_stop_times + num_exact_timepoint_stop_times != num_stop_times
  • dbt_utils_expression_is_true_int_gtfs_schedule__stop_times_grouped__num_gtfs_flex_stop_times_0_is_gtfs_flex_trip

    Table: staging.int_gtfs_schedule__stop_times_grouped
    expression: (num_gtfs_flex_stop_times > 0) = is_gtfs_flex_trip
    Failing because of 82 records where num_gtfs_flex_stop_times > 0 does not match is_gtfs_flex_trip

select *
from staging.int_gtfs_schedule__stop_times_grouped
where (num_gtfs_flex_stop_times > 0) != is_gtfs_flex_trip


erikamov commented Feb 11, 2025

  • dbt_utils_mutually_exclusive_ranges_dim_contract_attachments_required___valid_from__source_record_id___valid_to / unique_dim_contract_attachments_key

    Table: mart_transit_database.dim_contract_attachments
    mutually_exclusive_ranges:
      lower_bound_column: _valid_from
      upper_bound_column: _valid_to
      partition_by: source_record_id
      gaps: required

    unique key: 'unnested_attachments.id' and '_valid_from'

    Failing because 2 records have overlapping ranges

with window_functions as (
    select
        
        source_record_id as partition_by_col,
        
        _valid_from as lower_bound,
        _valid_to as upper_bound,

        lead(_valid_from) over (
            partition by source_record_id
            order by _valid_from, _valid_to
        ) as next_lower_bound,

        row_number() over (
            partition by source_record_id
            order by _valid_from desc, _valid_to desc
        ) = 1 as is_last_record

    from  mart_transit_database.dim_contract_attachments
),

calc as (
    -- We want to return records where one of our assumptions fails, so we'll use
    -- the `not` function with `and` statements so we can write our assumptions more cleanly
    select
        *,

        -- For each record: lower_bound should be < upper_bound.
        -- Coalesce it to return an error on the null case (implicit assumption
        -- these columns are not_null)
        coalesce(
            lower_bound < upper_bound,
            false
        ) as lower_bound_less_than_upper_bound,

        -- For each record: upper_bound < the next lower_bound.
        -- Coalesce it to handle null cases for the last record.
        coalesce(
            upper_bound < next_lower_bound,
            is_last_record,
            false
        ) as upper_bound_less_than_next_lower_bound

    from window_functions

),

validation_errors as (

    select
        *
    from calc

    where not(
        -- THE FOLLOWING SHOULD BE TRUE --
        lower_bound_less_than_upper_bound
        and upper_bound_less_than_next_lower_bound
    )
)

select * from validation_errors

erikamov commented

  • dbt_utils_unique_combination_of_columns_base_tts_services_ct_services_map_ct_key__ct_date / dbt_utils_unique_combination_of_columns_base_tts_services_ct_services_map_tts_key__tts_date

    Table: staging.base_tts_services_ct_services_map

    unique_combination_of_columns: tts_key and tts_date
    Failing because of 78 records with duplicated tts_key and tts_date (2022-08-17 to 2023-03-29)

select tts_key, tts_date, count(*)
from staging.base_tts_services_ct_services_map
group by tts_key, tts_date
having count(*) > 1

    unique_combination_of_columns: ct_key and ct_date
    Failing because of 78 records with duplicated ct_key and ct_date (2022-08-17 to 2023-03-29)

select ct_key, ct_date, count(*)
from staging.base_tts_services_ct_services_map
group by ct_key, ct_date
having count(*) > 1

A unique-combination check on all four columns (tts_key, tts_date, ct_key, and ct_date) would return 8 duplicated records
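
The four-column check mentioned above could be run with a query along these lines. This is a hypothetical follow-up query modeled on the two queries above, not one of the tests currently configured in the project.

```sql
-- Sketch only: groups on all four key columns to find remaining duplicates.
select tts_key, tts_date, ct_key, ct_date, count(*) as records_count
from staging.base_tts_services_ct_services_map
group by tts_key, tts_date, ct_key, ct_date
having count(*) > 1
```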

erikamov commented

  • unique_dim_calendar__gtfs_key

    Table: mart_gtfs.dim_calendar
    Column: _gtfs_key
    unique 'feed_key' and 'service_id'
    where: "feed_key != '6368fe701bdd68c4f521751a9a222a10'" (filtered out because of some issue with Auburn data)

    Failing because of 3 duplicated _gtfs_key

select _gtfs_key, feed_key, service_id, count(1)
from mart_gtfs.dim_calendar
where feed_key != '6368fe701bdd68c4f521751a9a222a10'
group by _gtfs_key, feed_key, service_id
having count(*) > 1
  • not_null_dim_calendar_service_id / not_null_dim_calendar_end_date / not_null_dim_calendar_friday / not_null_dim_calendar_saturday / not_null_dim_calendar_start_date / not_null_dim_calendar_sunday / not_null_dim_calendar_wednesday / not_null_dim_calendar_thursday

    Table: mart_gtfs.dim_calendar
    Columns cannot have null values: service_id, start_date, end_date, wednesday, thursday, friday, saturday, sunday
    where: "feed_key != '6368fe701bdd68c4f521751a9a222a10'" (filtered out because of some issue with Auburn data)

    Failing because of 1 record without values on service_id, start_date, end_date, wednesday, thursday, friday, saturday, and sunday columns

select *
from mart_gtfs.dim_calendar
where feed_key != '6368fe701bdd68c4f521751a9a222a10'
and (service_id is null or service_id = ''
     or start_date is null
     or end_date is null
     or wednesday is null
     or thursday is null
     or friday is null
     or saturday is null
     or sunday is null)
