Mixed per-user and team-shared pipeline runs #199

windiana42 · 2024-06-11T07:19:17Z

Sometimes a pipeline uses very large inputs in its first stage, which makes it slow to run and takes up a lot of disk space. It would still be nice to try out new code quickly on single tasks or stages. Pipedag already supports running just single tasks or stages. When running a single task, a user can already work in the temporary schema and never trigger a schema swap. However, sometimes it would also be nice to commit a stage "per-user" and then run tasks whose inputs are a mixture of per-user and team-shared outputs.

This issue is about implementing a mixed per-user/team-shared mode. In this mode, inputs to the subgraph being run would be fetched from the per-user version if they exist there, and from the team-shared version otherwise. Temporary schemas and committed stage schemas should always be per-user, so mostly dematerialization would have to be adapted.
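A minimal sketch of this lookup rule, written in plain Python rather than against pipedag's actual API. The names `resolve_input`, `per_user`, and `team_shared` are hypothetical, and each instance is modelled simply as a dictionary from `(stage, table)` to a schema-qualified table name:

```python
from typing import Dict, Tuple

TableKey = Tuple[str, str]  # (stage name, table name)


def resolve_input(
    key: TableKey,
    per_user: Dict[TableKey, str],
    team_shared: Dict[TableKey, str],
) -> str:
    """Return the location to read `key` from, preferring the per-user copy."""
    if key in per_user:
        return per_user[key]
    if key in team_shared:
        # Fall back to the team-shared output; it is only read, never written.
        return team_shared[key]
    raise KeyError(f"no per-user or team-shared output for {key}")


# Hypothetical example data: the large first stage exists only team-shared,
# while the user has committed their own version of stage_2.
team_shared = {
    ("stage_1", "raw"): "team.stage_1.raw",
    ("stage_2", "features"): "team.stage_2.features",
}
per_user = {("stage_2", "features"): "alice.stage_2.features"}
```

With this data, a task reading both tables would get `stage_1.raw` from the team-shared instance and `stage_2.features` from the per-user instance, so the expensive first stage never has to be recomputed per user.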

Options:

- An advanced version of this idea could even run cache-invalidation checks against the team-shared instance, provided there is a protection mechanism that prevents overwriting data in the team-shared instance.
- This issue could interact with Retry of producing a stage output #167: one could update information table by table in the per-user temp schema over multiple runs.
- It is even conceivable to allow mixed execution on two arbitrary pipeline instance configurations. The dominant use will probably still be per-user / team-shared instances of the same instance_id.
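The first option (cache checks against the team-shared instance, with write protection) could look roughly like the sketch below. This is an illustration under stated assumptions, not pipedag's implementation: `Instance`, `materialize`, and the string cache keys are all hypothetical, and a real cache key would be derived from the task version and its inputs.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Instance:
    name: str
    read_only: bool = False
    # table name -> cache key of the run that produced the table
    cache_keys: Dict[str, str] = field(default_factory=dict)

    def write(self, table: str, cache_key: str) -> None:
        # Protection mechanism: a read-only instance refuses every write.
        if self.read_only:
            raise PermissionError(f"instance {self.name!r} is read-only")
        self.cache_keys[table] = cache_key


def materialize(table: str, cache_key: str, user: Instance, shared: Instance) -> str:
    """Return the name of the instance that holds a cache-valid `table`."""
    if user.cache_keys.get(table) == cache_key:
        return user.name  # per-user copy is already cache-valid
    if shared.cache_keys.get(table) == cache_key:
        return shared.name  # reuse the team-shared output without writing
    user.write(table, cache_key)  # recompute into the per-user instance only
    return user.name
```

The design choice here is that the team-shared instance is consulted for cache validity but all new outputs land in the per-user instance, so a stale team-shared table is shadowed rather than overwritten.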

windiana42 added the enhancement, usability, and speed labels on Jul 4, 2024
