Describe the feature you'd like to request

SQLMesh, while great, has a different view of models than we do. I think we may need our own custom scheduling so that models can be eventually consistent. This would let us spend less while continuously providing mostly complete data.

This may also make using sqlmesh (or at least our favorite parts of sqlmesh) easier for more people, and less of the "all-or-nothing" system that it is right now, while simultaneously being more informative than the current state of things.
Describe the solution you'd like
This is not a fully baked idea, but here are some early thoughts.

Essentially, this means handling the scheduling of materializations very differently so that we support both the models in the repo and models contributed through some other system, likely through pyoso and stored in an application database.
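To make that concrete, here's a minimal sketch of a registry that unions the two contribution paths. The loaders and model fields are hypothetical stand-ins, not existing pyoso or sqlmesh APIs:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisteredModel:
    """A model known to the scheduler, regardless of where it was defined."""

    name: str
    source: str  # "repo" or "pyoso"
    query: str


def load_repo_models() -> list[RegisteredModel]:
    # In practice this would walk the sqlmesh models in the repo; stubbed here.
    return [RegisteredModel("artifacts", "repo", "SELECT ...")]


def load_pyoso_models() -> list[RegisteredModel]:
    # In practice this would read contributed models out of the application
    # database; stubbed here.
    return [RegisteredModel("my_custom_metric", "pyoso", "SELECT ...")]


def all_models() -> list[RegisteredModel]:
    """The scheduler operates over the union of both contribution paths."""
    return load_repo_models() + load_pyoso_models()
```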
Each model should have some execution status:

- CURRENTLY_MATERIALIZING - you're likely looking at some mix of old data + new data
- MATERIALIZED (and some date) - materialized within the last 24 hours
Each model can also have a dependency status:

- UP_TO_DATE - all dependencies are MATERIALIZED
- PENDING - some dependencies are updating, so this is likely out of date
Additionally, there should be a maintenance status:

- ACTIVE - this is actively maintained
- ARCHIVED - you're looking at a model whose data is available but should no longer be relied on and may be deleted
At least internally, we should be able to have a visibility status:

- PUBLIC - publicly visible
- PRIVATE - private only
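As a sketch of how these statuses might hang together, here's one way to model them as enums plus a combined status record stored in the application database. All names and fields are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class ExecutionStatus(Enum):
    CURRENTLY_MATERIALIZING = "currently_materializing"
    MATERIALIZED = "materialized"


class DependencyStatus(Enum):
    UP_TO_DATE = "up_to_date"
    PENDING = "pending"


class MaintenanceStatus(Enum):
    ACTIVE = "active"
    ARCHIVED = "archived"


class Visibility(Enum):
    PUBLIC = "public"
    PRIVATE = "private"


@dataclass
class ModelStatus:
    """Combined status record for a single model."""

    execution: ExecutionStatus
    # Only meaningful when execution == MATERIALIZED.
    last_materialized_at: datetime | None
    dependencies: DependencyStatus
    maintenance: MaintenanceStatus
    visibility: Visibility

    @property
    def is_fresh(self) -> bool:
        """Materialized within the last 24 hours, per the definition above."""
        if self.execution != ExecutionStatus.MATERIALIZED:
            return False
        if self.last_materialized_at is None:
            return False
        age = datetime.now() - self.last_materialized_at
        return age.total_seconds() < 24 * 3600
```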
Incremental models should all be forward-only. Any breaking change should force a version bump and result in a new version of that model being created (via CI for models in code, via automatic versioning for any pyoso contributions).
Backfills to incremental models can be done asynchronously.
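As a rough illustration of the forward-only rule, a check like the following could gate version bumps. The schema-fingerprint approach here is an assumption for the sketch, not sqlmesh's actual change detection:

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    """Hypothetical versioned model record."""

    name: str
    version: int
    # Column name -> column type, used as a crude schema fingerprint.
    schema: dict[str, str] = field(default_factory=dict)


def is_breaking_change(old: ModelVersion, new: ModelVersion) -> bool:
    """A change is breaking if it removes or retypes an existing column.

    Purely additive changes (new columns) are forward-compatible, so the
    existing incremental state can keep accumulating under the same version.
    """
    for column, col_type in old.schema.items():
        if column not in new.schema or new.schema[column] != col_type:
            return True
    return False


def next_version(old: ModelVersion, new: ModelVersion) -> int:
    """Force a version bump on breaking changes; otherwise keep the version.

    In CI this would reject an unbumped breaking change; for pyoso
    contributions the bump could happen automatically.
    """
    return old.version + 1 if is_breaking_change(old, new) else old.version
```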
If we treat all models like this, we maintain high availability: some things may be changing at any given time, but recalculations happen in a more "eventual" way that allows us to save money. It might end up being more complicated, so I think we need to think through a few things here, but I wanted to get some ideas written down.
Describe alternatives you've considered
We've currently only used sqlmesh + dbt.
Another thing we should do is allow some models to categorize their changes by partition. For example, consider the superchain dataset: say we only want to process things that have changed for a given partition, and a model is only interested in changes for BASE. We shouldn't unnecessarily trigger recalculations for that model if changes only occur on optimism.
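A small sketch of how the scheduler could use declared partition interest to skip unnecessary runs; the ModelSpec declaration mechanism here is hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical model declaration with an optional partition filter."""

    name: str
    # Partitions (e.g. chains in the superchain dataset) this model cares
    # about. None means the model depends on all partitions.
    partitions_of_interest: frozenset[str] | None = None


def needs_recalculation(model: ModelSpec, changed_partitions: set[str]) -> bool:
    """Only schedule a run if a partition the model cares about changed."""
    if model.partitions_of_interest is None:
        return bool(changed_partitions)
    return bool(model.partitions_of_interest & changed_partitions)


# Example: a BASE-only model is skipped when only optimism changed.
base_only = ModelSpec("base_metrics", frozenset({"base"}))
print(needs_recalculation(base_only, {"optimism"}))  # False -> skip
print(needs_recalculation(base_only, {"base", "optimism"}))  # True -> run
```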