Feature/model too many joins #413

BradCr · 2024-01-20T04:45:25Z

This is a:

bug fix PR with no breaking changes
new functionality

Link to Issue

Closes #394

Description & motivation

Per the dbt best practices docs, models should bring together a reasonable number (typically 4 to 6) of entities or concepts (staging models, or perhaps other intermediate models) that will be joined with another similarly purposed intermediate model to generate a mart. Having too many models in our mart increases the complexity. We can join two intermediate models that each house a piece of the complexity, giving us increased readability, flexibility, testing surface area, and insight into our components.

This new model, fct_number_of_joins, will identify models that join from seven (7) or more other models and should be refactored into intermediate models each taking some of the joins.

Integration Test Screenshot

The equality test for the new fct_number_of_joins model passed in the integration tests.

Screenshot of the fct_number_of_joins model being built:

I updated the documentation to explain the new model. For the screenshots I had two concerns; When I attached the screenshots they are linked to my repository (not sure if that gets automatically updated or if it's something that a dbt maintainer would have to deal with) and the screenshots are from the new Cloud Lineage diagram, so the look and feel is different from the existing screenshots.

Checklist

…se give error

… identify it

…dCr/dbt-project-evaluator into feature/model-too-many-joins

BradCr · 2024-01-20T18:46:34Z

The integration tests look like they're failing because, now that I've introduced this new model, some test models, and the associated seed, it is failing on some other equality tests. I assume I need to update the contents of those seed files to include what I've added.

Is that the correct approach?

b-per · 2024-01-22T11:49:57Z

Yes, this is a bit annoying, but we need to update a couple of seed files when adding new models in the integration_tests folder. So, please, feel free to update the seeds for the tests that started failing.

This doesn't happen every day so we kept it like this for now, but we know that it is not fun to do 😄

graciegoheen · 2024-01-22T15:39:37Z

Hey! How would you feel about naming the model fct_too_many_joins or something similar so that the model name describes the problem being flagged?

BradCr · 2024-01-22T16:53:11Z

@b-per Thanks for clarifying, I'll align all the seeds so the equality tests no longer fail. I think I saw a couple discrepancies that weren't just about my new model (when I tested locally in VSCode) so I'll double-check for those as well.

@graciegoheen That's a good suggestion. Now that you mention it, I'm not sure why I didn't name the model that to begin with, seeing as I named both the branch and the PR "too many joins".

BradCr · 2024-01-22T23:28:11Z

Renamed the model from fct_number_of_joins to fct_too_many_joins. Next I'll work on updating the seeds for the integration test.

b-per · 2024-01-25T12:40:45Z

OK, so I spent a bit of time looking at the test results that need to be change but unfortunately, the new model under integration_tests requires changing the majority of the existing tests that we have.

For example, with the new model, there is now no violation of fct_rejoining_of_upstream_concepts anymore.

We should try to get your local version working but I am thinking that the test might need to be updated as well.

I am thinking that we might want to make the number of joins a variable, that defaults to 7, but that we could change to 3 in our integration_tests folder. Otherwise, it will impact way too many other models. (and I also think that people might want to configure themselves a number different from 7)

b-per · 2024-01-25T12:42:52Z

Also, one of the seeds is the following:

resource_name,resource_type,file_path,join_count
fct_model_10,model,"models\\marts\\fct_model_10.sql",7

This is not going to work with CI as it runs on Linux and the file separator is the Windows one. We should just maybe remove checking that specific column.

This makes me think that some of the discrepancies you see when running locally might be due to you running on Windows when CI runs on Linux and we run those on MacOS at dbt Labs

…, gitignored integration_tests/package-lock.yml

BradCr · 2024-01-26T03:19:51Z

Thank you for all the suggestions. Here is a list of what I've done and the info you asked for. I did the dbt run --full-refresh and then uploaded the duckdb database to my Google drive. I've messaged that link to you on Slack.

I changed the threshold for fct_too_many_joins to be a variable, defaulted to 7 in the main dbt_project.yml, and to 3 in the integration_tests.

I corrected the test_fct_too_many_joins.csv seed to use forward slashes (I had originally changed it to double-backslash because that's what my build put into the model in the duckdb.)

Contents of pip list, with dbt-code, dbt-duckdb, and duckdb at the top:
Package Version

dbt-core 1.7.5
dbt-duckdb 1.7.1
duckdb 0.9.2
agate 1.7.1
annotated-types 0.6.0
attrs 23.2.0
Babel 2.14.0
certifi 2023.11.17
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
colorama 0.4.6
dbt-extractor 0.5.1
dbt-semantic-interfaces 0.4.3
idna 3.6
importlib-metadata 6.11.0
isodate 0.6.1
Jinja2 3.1.3
jsonschema 4.21.0
jsonschema-specifications 2023.12.1
leather 0.3.4
Logbook 1.5.3
MarkupSafe 2.1.3
mashumaro 3.11
minimal-snowplow-tracker 0.0.2
more-itertools 10.2.0
msgpack 1.0.7
networkx 3.2.1
packaging 23.2
parsedatetime 2.6
pathspec 0.11.2
pip 23.3.2
protobuf 4.25.2
pycparser 2.21
pydantic 2.5.3
pydantic_core 2.14.6
python-dateutil 2.8.2
python-slugify 8.0.1
pytimeparse 1.1.8
pytz 2023.3.post1
PyYAML 6.0.1
referencing 0.32.1
requests 2.31.0
rpds-py 0.17.1
setuptools 65.5.0
six 1.16.0
sqlparse 0.4.4
text-unidecode 1.3
typing_extensions 4.9.0
urllib3 1.26.18
zipp 3.17.0

Duckdb profiles.yml:

integration_tests:
  target: duckdb
  outputs:
    duckdb:
      type: duckdb
      path: '/Users/bcron/python/duckdb/dbt_integration_tests.duckdb'

BradCr · 2024-01-27T22:32:23Z

The fact that I haven't gotten the integration tests to pass is bugging me, so I decided to try something. I created a new branch in my repository to try and get a good, clean baseline test before making any changes. I installed the dbt-duckdb package, ran dbt deps in the root directory and in integration_tests, then ran dbt build -f in root and in integration_tests.

Without any modifications to any models or seeds, I am getting errors on 10 tests. I was able to resolve or explain all but one. Is it possible that some integration_tests seeds in main aren't up to date and need corrections in order to work? It looks like there have been other recent PRs that passed circleci tests just fine, so I've got to assume the issue is on my end, but then I don't understand why it would throw me errors on a new clone.

These three have the single forward slash vs. double backward slash issue, but models and paths are the same, so I imagine they would pass circleci just fine if I leave the seeds alone:

equality_fct_source_directories
equality_fct_model_directories
equality_fct_test_directories

These tests had additional rows in the table that weren't in the seed. When I added those rows to the seeds, the tests pass just fine:

equality_fct_undocumented_models
equality_fct_chained_views_dependencies
equality_fct_model_fanout
equality_fct_root_models
equality_fct_missing_primary_key_tests

This covergae test had different values in the seed vs. the table build by the model. When I changed the values in the seed, the test passes just fine:

equality_fct_documentation_coverage

This covergae test had different values in the seed vs. the table built by the model. When I changed the values in the seed, there is still a weird error:

equality_fct_test_coverage

After updating the test_fct_test_coverage seed, the equality_fct_test_coverage still fails. Here's where it gets interesting. The output of equality_test_fct_test_coverage.sql looks like this:

which_diff	total_models	total_tests	tested_models	test_coverage_pct	test_to_model_ratio	staging_test_coverage_pct	intermediate_test_coverage_pct	marts_test_coverage_pct	other_test_coverage_pct
a_minus_b	18	13	6	33.33000183105469	0.7222	80.0	25.0	0.0	25.0
b_minus_a	18	13	6	33.33	0.7222	80.0	25.0	0.0	25.0

Notice the test_coverage_pct from a_minus_b is showing in this query with 15 decimal places, rather than 2. But when I select * from each table test_fct_test_coverage and fct_test_coverage independently, both of them show 33.33 as the value. Not sure what's happening here, but perhaps the circleci test would pass?

… feature/model-too-many-joins

graciegoheen · 2024-02-15T16:44:51Z

We saw the same rounding issue here #427 (comment)

BradCr · 2024-02-15T16:50:02Z

We saw the same rounding issue here #427 (comment)

Good to know it wasn't just me.

BradCr · 2024-03-03T17:32:17Z

I saw there were a couple of PRs merged and my branch was out of date. I synched my branch, and it now fails the CircleCI tests. It was passing previously.

models/marts/dag/fct_too_many_joins.sql

integration_tests/seeds/dag/test_fct_too_many_joins.csv

docs/rules/modeling.md

integration_tests/seeds/dag/dag_seeds.yml

dave-connors-3 · 2024-03-28T16:14:59Z

@BradCr thanks so much for the contribution! sorry this took so long!

BradCr · 2024-03-28T16:24:36Z

@dave-connors-3 No problem, Dave. This was such a fun experience. I'm looking forward to contributing again.

Thank you also @b-per and @graciegoheen for all the help and guidance!

BradCr and others added 20 commits January 4, 2024 13:27

Created fct_number_of_joins.sql model

2df96bd

Added fct_number_of_joins and a test to make sure it's empty, otherwi…

7c09502

…se give error

Created fct_model_10.sql, which should trigger fct_number_of_joins to…

5f5319c

… identify it

Created test_fct_number_of_joins.csv seed.

29075c4

Added fct_number_of_joins to the Rules Documentation.

e13af09

Added fct_number_of_joins to rules.md

cb16e9a

Update modeling.md with screenshots of models

bc233e1

Edited dbt_project.yml

40c46bb

Edited dbt_project.yml

e03d02d

Reset gitignore

50e7527

Update modeling.md (grammar)

53f0afe

Corrected linting on fct_number_of_joins.sql

1ec51be

Update dbt_project.yml (remove extra lines)

1d4b011

Update fct_model_10 corrected model reference name

157a512

Added equality test for number_of_joins model and seed

1e09637

Fixed fct_model_10 joins

77d7530

Fixed fct_model_10

e257d25

Merge branch 'feature/model-too-many-joins' of https://github.com/Bra…

69fda09

…dCr/dbt-project-evaluator into feature/model-too-many-joins

Corrected test_fct_number_of_joins seed

6f62243

Update rules.md corrected number of joins

e9dbf7b

BradCr mentioned this pull request Jan 20, 2024

Add modeling rule for joining too many models #394

Closed

BradCr and others added 2 commits January 22, 2024 15:19

Merge branch 'main' into feature/model-too-many-joins

d9df206

Rename new model from 'fct_number_of_joins' to 'fct_too_many_joins'

cefa619

BradCr added 2 commits January 22, 2024 18:59

Update seed files

7e57214

Adjust test seed files

1bcd884

Made changes to seeds, converted too_many_joins threshold to variable…

e882b3e

…, gitignored integration_tests/package-lock.yml

graciegoheen assigned b-per Feb 1, 2024

b-per added 5 commits February 6, 2024 10:56

Update seeds and tests for too many joins

0c71be4

Update seed to prevent error in BQ

4afee87

Update docs with new variable

127b34c

Merge branch 'main' of github.com:dbt-labs/dbt-project-evaluator into…

ba55085

… feature/model-too-many-joins

Uncomment exception filtering

83245be

b-per marked this pull request as ready for review February 6, 2024 10:46

b-per requested review from graciegoheen and dave-connors-3 as code owners February 6, 2024 10:46

Merge branch 'main' into feature/model-too-many-joins

86c9aa7

graciegoheen and others added 2 commits March 14, 2024 10:01

Merge branch 'main' into feature/model-too-many-joins

7392689

Merge branch 'main' into feature/model-too-many-joins

dd298d3

graciegoheen reviewed Mar 28, 2024

View reviewed changes

Apply suggestions from code review

b51f9bb

graciegoheen reviewed Mar 28, 2024

View reviewed changes

integration_tests/seeds/dag/dag_seeds.yml Outdated Show resolved Hide resolved

Update integration_tests/seeds/dag/dag_seeds.yml

c88170c

graciegoheen approved these changes Mar 28, 2024

View reviewed changes

dave-connors-3 approved these changes Mar 28, 2024

View reviewed changes

dave-connors-3 merged commit a46d0da into dbt-labs:main Mar 28, 2024

BradCr deleted the feature/model-too-many-joins branch March 28, 2024 16:25

Feature/model too many joins #413

Feature/model too many joins #413

Uh oh!

Conversation

BradCr commented Jan 20, 2024

Link to Issue

Description & motivation

Integration Test Screenshot

Checklist

Uh oh!

BradCr commented Jan 20, 2024

Uh oh!

b-per commented Jan 22, 2024

Uh oh!

graciegoheen commented Jan 22, 2024

Uh oh!

BradCr commented Jan 22, 2024

Uh oh!

BradCr commented Jan 22, 2024

Uh oh!

b-per commented Jan 25, 2024

Uh oh!

b-per commented Jan 25, 2024

Uh oh!

BradCr commented Jan 26, 2024

Uh oh!

BradCr commented Jan 27, 2024

Uh oh!

graciegoheen commented Feb 15, 2024

Uh oh!

BradCr commented Feb 15, 2024

Uh oh!

BradCr commented Mar 3, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dave-connors-3 commented Mar 28, 2024

Uh oh!

BradCr commented Mar 28, 2024

Uh oh!

Uh oh!