
Custom SQL Model Materialization Schedule #3236

Open
ravenac95 opened this issue Mar 11, 2025 · 2 comments

Comments

@ravenac95
Member

Describe the feature you'd like to request

SQLMesh, while great, has a different view on models than we do. I think we may need our own custom scheduling so that models can be eventually consistent. This would let us spend less and continuously provide mostly complete data.

This may also make using sqlmesh (or at least our favorite parts of sqlmesh) easier for more people, and less of the "all-or-nothing" system that it is right now, while simultaneously being more informative than the current state of things.

Describe the solution you'd like

This is not a fully baked idea, but here are some early thoughts.

Essentially, this means handling the scheduling of materializations in a very different way, so that we support both the models in the repo and models contributed through some other system, likely via pyoso and stored in an application database.

Each model should have an execution status (a code sketch of all of these status dimensions follows the lists below):

  • CURRENTLY_MATERIALIZING - you're likely looking at some mix of old data + new data
  • MATERIALIZED (and some date) - materialized within the last 24 hours

Each model can also have a dependency status:

  • UP_TO_DATE - all dependencies are MATERIALIZED
  • PENDING - some dependencies are updating so this is likely out of date

Additionally there should be a maintenance status:

  • ACTIVE - this is actively maintained
  • ARCHIVED - you're looking at a model whose data is still available but should no longer be relied on and may be deleted

At least internally we should be able to have a visibility status:

  • PUBLIC - Publicly visible
  • PRIVATE - Private only
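
Below is a minimal sketch of these status dimensions as Python enums, plus the rule that a model is UP_TO_DATE only when all of its dependencies are MATERIALIZED. All of the names here (ModelStatusRecord, derive_dependency_status, is_fresh) are hypothetical and only illustrate the idea; nothing in this sketch is an existing OSO or SQLMesh API.

```python
# Hypothetical sketch of the proposed status dimensions. None of these
# names come from OSO or SQLMesh; they only illustrate the idea.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class ExecutionStatus(Enum):
    CURRENTLY_MATERIALIZING = "currently_materializing"
    MATERIALIZED = "materialized"


class DependencyStatus(Enum):
    UP_TO_DATE = "up_to_date"
    PENDING = "pending"


class MaintenanceStatus(Enum):
    ACTIVE = "active"
    ARCHIVED = "archived"


class Visibility(Enum):
    PUBLIC = "public"
    PRIVATE = "private"


@dataclass
class ModelStatusRecord:
    name: str
    execution: ExecutionStatus
    last_materialized_at: datetime | None
    maintenance: MaintenanceStatus
    visibility: Visibility


def derive_dependency_status(deps: list[ModelStatusRecord]) -> DependencyStatus:
    """A model is UP_TO_DATE only if every dependency is MATERIALIZED."""
    if all(d.execution is ExecutionStatus.MATERIALIZED for d in deps):
        return DependencyStatus.UP_TO_DATE
    return DependencyStatus.PENDING


def is_fresh(record: ModelStatusRecord, max_age: timedelta = timedelta(hours=24)) -> bool:
    """MATERIALIZED within the last 24 hours counts as fresh."""
    return (
        record.execution is ExecutionStatus.MATERIALIZED
        and record.last_materialized_at is not None
        and datetime.now(timezone.utc) - record.last_materialized_at < max_age
    )
```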

Incremental models should all be forward-only. Any breaking change should result in a new version of that model and force a version bump (via CI for models in code, via automatic versioning for any pyoso contributions).

Backfills to incremental models can be done asynchronously.
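
Here is one rough way a CI check for the forward-only rule could look: compare the previously published version of a model against the proposed one and fail if a breaking change isn't accompanied by a version bump. Everything below (ModelVersion, check_version_bump, the fingerprint idea) is a hypothetical sketch, not SQLMesh behavior, and treating any fingerprint change as breaking is a deliberate simplification.

```python
# Hypothetical CI-style check for forward-only incremental models: if a
# model's query/schema fingerprint changed, require a version bump so the
# old table is left intact and a new versioned table is built instead.
from dataclasses import dataclass


@dataclass
class ModelVersion:
    name: str          # e.g. "int_events"
    version: int       # e.g. 2, materialized as "int_events_v2"
    fingerprint: str   # hash of the query + schema (simplification)


def check_version_bump(previous: ModelVersion, proposed: ModelVersion) -> list[str]:
    """Return a list of CI errors; an empty list means the change is acceptable."""
    errors: list[str] = []
    breaking_change = previous.fingerprint != proposed.fingerprint
    if breaking_change and proposed.version <= previous.version:
        errors.append(
            f"{proposed.name}: breaking change detected but version was not "
            f"bumped (still v{proposed.version}); incremental models are "
            "forward-only, so publish a new version instead."
        )
    return errors
```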

If we treat all models like this, then we maintain high availability: some things may change over time, but recalculations happen in a more "eventual" way that saves money. It might end up being more complicated, so we need to think through a few things here, but I wanted to get some ideas written down.

Describe alternatives you've considered

We've currently only used sqlmesh + dbt.


linear bot commented Mar 11, 2025

github-project-automation bot moved this to Backlog in OSO Mar 11, 2025
@ravenac95
Member Author

Another thing we should do is allow some models to categorize their changes by partition. For example, consider the superchain dataset: say a model is only interested in changes for BASE. We shouldn't trigger unnecessary recalculations for that model if changes only occur on optimism.
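
As a rough sketch of what this could look like: a downstream model declares which partition values it cares about, and the scheduler only queues a recalculation when an upstream change touches one of those partitions. The names below (PartitionInterest, UpstreamChange, should_recalculate) are made up for illustration and are not part of sqlmesh or pyoso.

```python
# Hypothetical partition-scoped change detection for scheduling.
from dataclasses import dataclass, field


@dataclass
class PartitionInterest:
    # e.g. column "chain" with values {"base"} for a BASE-only model;
    # an empty set means "any change triggers recalculation".
    column: str
    values: set[str] = field(default_factory=set)


@dataclass
class UpstreamChange:
    model: str
    column: str               # partition column that changed, e.g. "chain"
    changed_values: set[str]  # e.g. {"optimism"}


def should_recalculate(interest: PartitionInterest, change: UpstreamChange) -> bool:
    """Skip recalculation when the change only touched partitions the model ignores."""
    if change.column != interest.column or not interest.values:
        return True  # can't scope the change, so play it safe and recompute
    return bool(interest.values & change.changed_values)


# Example: a BASE-only model is untouched by an optimism-only change.
base_only = PartitionInterest(column="chain", values={"base"})
optimism_update = UpstreamChange(model="superchain_events", column="chain", changed_values={"optimism"})
assert should_recalculate(base_only, optimism_update) is False
```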
