Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/model too many joins #413

Merged
merged 40 commits into from
Mar 28, 2024

Conversation

BradCr
Copy link
Contributor

@BradCr BradCr commented Jan 20, 2024

This is a:

  • bug fix PR with no breaking changes
  • new functionality

Link to Issue

Closes #394

Description & motivation

Per the dbt best practices docs, models should bring together a reasonable number (typically 4 to 6) of entities or concepts (staging models, or perhaps other intermediate models) that will be joined with another similarly purposed intermediate model to generate a mart. Having too many models in our mart increases the complexity. We can join two intermediate models that each house a piece of the complexity, giving us increased readability, flexibility, testing surface area, and insight into our components.

This new model, fct_number_of_joins, will identify models that join from seven (7) or more other models and should be refactored into intermediate models each taking some of the joins.

Integration Test Screenshot

The equality test for the new fct_number_of_joins model passed in the integration tests.

dbt test fct_number_of_joins

Screenshot of the fct_number_of_joins model being built:

dbt_fct_number_of_joins_build

I updated the documentation to explain the new model. For the screenshots I had two concerns; When I attached the screenshots they are linked to my repository (not sure if that gets automatically updated or if it's something that a dbt maintainer would have to deal with) and the screenshots are from the new Cloud Lineage diagram, so the look and feel is different from the existing screenshots.

Checklist

  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)
    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
    • Databricks
    • DuckDB
    • Trino/Starburst
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)

@BradCr
Copy link
Contributor Author

BradCr commented Jan 20, 2024

The integration tests look like they're failing because, now that I've introduced this new model, some test models, and the associated seed, it is failing on some other equality tests. I assume I need to update the contents of those seed files to include what I've added.

Is that the correct approach?

@b-per
Copy link
Collaborator

b-per commented Jan 22, 2024

Yes, this is a bit annoying, but we need to update a couple of seed files when adding new models in the integration_tests folder. So, please, feel free to update the seeds for the tests that started failing.

This doesn't happen every day so we kept it like this for now, but we know that it is not fun to do 😄

@graciegoheen
Copy link
Collaborator

Hey! How would you feel about naming the model fct_too_many_joins or something similar so that the model name describes the problem being flagged?

@BradCr
Copy link
Contributor Author

BradCr commented Jan 22, 2024

@b-per Thanks for clarifying, I'll align all the seeds so the equality tests no longer fail. I think I saw a couple discrepancies that weren't just about my new model (when I tested locally in VSCode) so I'll double-check for those as well.

@graciegoheen That's a good suggestion. Now that you mention it, I'm not sure why I didn't name the model that to begin with, seeing as I named both the branch and the PR "too many joins".

@BradCr
Copy link
Contributor Author

BradCr commented Jan 22, 2024

Renamed the model from fct_number_of_joins to fct_too_many_joins. Next I'll work on updating the seeds for the integration test.

@b-per
Copy link
Collaborator

b-per commented Jan 25, 2024

OK, so I spent a bit of time looking at the test results that need to be change but unfortunately, the new model under integration_tests requires changing the majority of the existing tests that we have.

For example, with the new model, there is now no violation of fct_rejoining_of_upstream_concepts anymore.

We should try to get your local version working but I am thinking that the test might need to be updated as well.

I am thinking that we might want to make the number of joins a variable, that defaults to 7, but that we could change to 3 in our integration_tests folder. Otherwise, it will impact way too many other models. (and I also think that people might want to configure themselves a number different from 7)

@b-per
Copy link
Collaborator

b-per commented Jan 25, 2024

Also, one of the seeds is the following:

resource_name,resource_type,file_path,join_count
fct_model_10,model,"models\\marts\\fct_model_10.sql",7

This is not going to work with CI as it runs on Linux and the file separator is the Windows one. We should just maybe remove checking that specific column.

This makes me think that some of the discrepancies you see when running locally might be due to you running on Windows when CI runs on Linux and we run those on MacOS at dbt Labs

…, gitignored integration_tests/package-lock.yml
@BradCr
Copy link
Contributor Author

BradCr commented Jan 26, 2024

Thank you for all the suggestions. Here is a list of what I've done and the info you asked for. I did the dbt run --full-refresh and then uploaded the duckdb database to my Google drive. I've messaged that link to you on Slack.

I changed the threshold for fct_too_many_joins to be a variable, defaulted to 7 in the main dbt_project.yml, and to 3 in the integration_tests.

I corrected the test_fct_too_many_joins.csv seed to use forward slashes (I had originally changed it to double-backslash because that's what my build put into the model in the duckdb.)

Contents of pip list, with dbt-code, dbt-duckdb, and duckdb at the top:
Package Version


dbt-core 1.7.5
dbt-duckdb 1.7.1
duckdb 0.9.2
agate 1.7.1
annotated-types 0.6.0
attrs 23.2.0
Babel 2.14.0
certifi 2023.11.17
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
colorama 0.4.6
dbt-extractor 0.5.1
dbt-semantic-interfaces 0.4.3
idna 3.6
importlib-metadata 6.11.0
isodate 0.6.1
Jinja2 3.1.3
jsonschema 4.21.0
jsonschema-specifications 2023.12.1
leather 0.3.4
Logbook 1.5.3
MarkupSafe 2.1.3
mashumaro 3.11
minimal-snowplow-tracker 0.0.2
more-itertools 10.2.0
msgpack 1.0.7
networkx 3.2.1
packaging 23.2
parsedatetime 2.6
pathspec 0.11.2
pip 23.3.2
protobuf 4.25.2
pycparser 2.21
pydantic 2.5.3
pydantic_core 2.14.6
python-dateutil 2.8.2
python-slugify 8.0.1
pytimeparse 1.1.8
pytz 2023.3.post1
PyYAML 6.0.1
referencing 0.32.1
requests 2.31.0
rpds-py 0.17.1
setuptools 65.5.0
six 1.16.0
sqlparse 0.4.4
text-unidecode 1.3
typing_extensions 4.9.0
urllib3 1.26.18
zipp 3.17.0

Duckdb profiles.yml:

integration_tests:
  target: duckdb
  outputs:
    duckdb:
      type: duckdb
      path: '/Users/bcron/python/duckdb/dbt_integration_tests.duckdb'

@BradCr
Copy link
Contributor Author

BradCr commented Jan 27, 2024

The fact that I haven't gotten the integration tests to pass is bugging me, so I decided to try something. I created a new branch in my repository to try and get a good, clean baseline test before making any changes. I installed the dbt-duckdb package, ran dbt deps in the root directory and in integration_tests, then ran dbt build -f in root and in integration_tests.

Without any modifications to any models or seeds, I am getting errors on 10 tests. I was able to resolve or explain all but one. Is it possible that some integration_tests seeds in main aren't up to date and need corrections in order to work? It looks like there have been other recent PRs that passed circleci tests just fine, so I've got to assume the issue is on my end, but then I don't understand why it would throw me errors on a new clone.


These three have the single forward slash vs. double backward slash issue, but models and paths are the same, so I imagine they would pass circleci just fine if I leave the seeds alone:

  • equality_fct_source_directories
  • equality_fct_model_directories
  • equality_fct_test_directories

These tests had additional rows in the table that weren't in the seed. When I added those rows to the seeds, the tests pass just fine:

  • equality_fct_undocumented_models
  • equality_fct_chained_views_dependencies
  • equality_fct_model_fanout
  • equality_fct_root_models
  • equality_fct_missing_primary_key_tests

This covergae test had different values in the seed vs. the table build by the model. When I changed the values in the seed, the test passes just fine:

  • equality_fct_documentation_coverage

This covergae test had different values in the seed vs. the table built by the model. When I changed the values in the seed, there is still a weird error:

  • equality_fct_test_coverage

After updating the test_fct_test_coverage seed, the equality_fct_test_coverage still fails. Here's where it gets interesting. The output of equality_test_fct_test_coverage.sql looks like this:

which_diff total_models total_tests tested_models test_coverage_pct test_to_model_ratio staging_test_coverage_pct intermediate_test_coverage_pct marts_test_coverage_pct other_test_coverage_pct
a_minus_b 18 13 6 33.33000183105469 0.7222 80.0 25.0 0.0 25.0
b_minus_a 18 13 6 33.33 0.7222 80.0 25.0 0.0 25.0

Notice the test_coverage_pct from a_minus_b is showing in this query with 15 decimal places, rather than 2. But when I select * from each table test_fct_test_coverage and fct_test_coverage independently, both of them show 33.33 as the value. Not sure what's happening here, but perhaps the circleci test would pass?

@b-per b-per marked this pull request as ready for review February 6, 2024 10:46
@graciegoheen
Copy link
Collaborator

We saw the same rounding issue here #427 (comment)

@BradCr
Copy link
Contributor Author

BradCr commented Feb 15, 2024

We saw the same rounding issue here #427 (comment)

Good to know it wasn't just me.

@BradCr
Copy link
Contributor Author

BradCr commented Mar 3, 2024

I saw there were a couple of PRs merged and my branch was out of date. I synched my branch, and it now fails the CircleCI tests. It was passing previously.

@dave-connors-3 dave-connors-3 merged commit a46d0da into dbt-labs:main Mar 28, 2024
7 checks passed
@dave-connors-3
Copy link
Collaborator

@BradCr thanks so much for the contribution! sorry this took so long!

@BradCr
Copy link
Contributor Author

BradCr commented Mar 28, 2024

@dave-connors-3 No problem, Dave. This was such a fun experience. I'm looking forward to contributing again.

Thank you also @b-per and @graciegoheen for all the help and guidance!

@BradCr BradCr deleted the feature/model-too-many-joins branch March 28, 2024 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add modeling rule for joining too many models
4 participants