Add static and runtime dag info, API to fetch ancestor and successor tasks #2124

Open
talsperre wants to merge 34 commits into master from dev/add-runtime-dag-info

Conversation

@talsperre (Collaborator) commented on Oct 31, 2024:

Add runtime DAG info so that we can query the ancestor and successor tasks for a given task easily.

Usage

from metaflow import Task, namespace
namespace(None)
task = Task('RuntimeDAGFlow/18/step_c/32076012', attempt=0)

To get the ancestor and successor tasks of a task, use the following API:

ancestors = task.ancestors
successors = task.successors

The output is a list of Metaflow Task objects.
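For illustration, a minimal sketch of inspecting the returned objects (assuming the flow, run, and task shown above exist):

from metaflow import Task, namespace

namespace(None)
task = Task('RuntimeDAGFlow/18/step_c/32076012', attempt=0)

# Each ancestor/successor is a regular client Task, so the usual
# properties (pathspec, finished, data, ...) are available.
for ancestor in task.ancestors:
    print(ancestor.pathspec, ancestor.finished)
for successor in task.successors:
    print(successor.pathspec, successor.finished)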

@talsperre force-pushed the dev/add-runtime-dag-info branch from 48c771d to ec43f14 on November 1, 2024 at 18:34
Comment on lines 675 to 690
@classmethod
def _filter_tasks_by_metadata(
cls, flow_id, run_id, query_step, field_name, field_value
):
raise NotImplementedError()

@classmethod
def filter_tasks_by_metadata(
cls, flow_id, run_id, query_step, field_name, field_value
):
# TODO: Do we need to do anything wrt to task attempt?
task_ids = cls._filter_tasks_by_metadata(
flow_id, run_id, query_step, field_name, field_value
)
return task_ids

Collaborator:

Is there a need for the private method, or could this simply be contained in the public-facing one? Right now it's not doing anything before calling the private one.
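A minimal sketch of the collapsed version this comment suggests, keeping the same signature as the snippet above (whether subclasses still need a separate override hook is the open question):

@classmethod
def filter_tasks_by_metadata(
    cls, flow_id, run_id, query_step, field_name, field_value
):
    # Collapsed version: backends would override this public classmethod
    # directly instead of a private _filter_tasks_by_metadata hook.
    raise NotImplementedError()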

Collaborator:

also, did you have an implementation of this for service.py yet?

def filter_tasks_by_metadata(
cls, flow_id, run_id, query_step, field_name, field_value
):
# TODO: Do we need to do anything wrt to task attempt?
Collaborator:

probably not, as the ancestors for task attempts should be identical, right? What about the immediate_siblings though, will they include or exclude attempts of the same task?

@talsperre force-pushed the dev/add-runtime-dag-info branch from ffbf68a to c6fb9ac on January 2, 2025 at 23:25
@talsperre changed the title from "Add static and runtime dag info, API to fetch ancestor tasks" to "Add static and runtime dag info, API to fetch ancestor and successor tasks" on Jan 7, 2025
@talsperre force-pushed the dev/add-runtime-dag-info branch 2 times, most recently from d66d32b to 7644058 on January 12, 2025 at 03:12
@romain-intel (Contributor) left a comment:

A few comments. I think it's pretty close though. I haven't looked at the metadata service changes. We may also want to raise a better error message if the service is not new enough?

run_id: str,
cur_foreach_stack_len: int,
steps: List[str],
query_type: str,
Contributor:

nit: I would just use a boolean. Something like is_ancestor. These are internal functions anyways and slightly more efficient to use bools :)
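A sketch of the boolean variant the nit suggests; the enclosing function name is not visible in the fragment above, so the one below is assumed:

from typing import List

def _get_related_task_info(
    self,
    run_id: str,
    cur_foreach_stack_len: int,
    steps: List[str],
    is_ancestor: bool,  # True when resolving ancestors, False for successors
):
    ...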

metaflow/client/core.py (outdated; resolved)
if query_foreach_stack_len == cur_foreach_stack_len:
# The successor or ancestor tasks belong to the same foreach stack level
field_name = "foreach-indices"
field_value = self.metadata_dict.get(field_name)
Contributor:

We don't currently cache metadata_dict, so we could either fix that or cache it here to avoid making multiple calls to the metadata service and the subsequent sorts. It would need to be cached across _get_related_tasks and this function.
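One way to cache it on the Task object, matching the shape the later diff settles on (a sketch; typing imports omitted, and the uncached property is assumed to rebuild the dict on every access today):

@property
def metadata_dict(self) -> Dict[str, str]:
    # Build the name -> value mapping once and reuse it across
    # _get_related_tasks and this helper.
    if getattr(self, "_metadata_dict", None) is None:
        self._metadata_dict = {
            m.name: m.value
            for m in sorted(self.metadata, key=lambda m: m.created_at)
        }
    return self._metadata_dict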

# Current Task: foreach-indices = [0, 1, 2], foreach-indices-truncated = [0, 1]
# Ancestor Task: foreach-indices = [0, 1], foreach-indices-truncated = [0]
# We will compare the foreach-indices value of ancestor task with the
# foreach-indices value of current task
Contributor:

nit: foreach-indices-truncated value of the current task

return field_name, field_value

def _get_related_tasks(self, relation_type: str) -> Dict[str, List[str]]:
start_time = time.time()
Contributor:

not used -- can strip or use.

@@ -248,8 +248,7 @@
# Default container registry
DEFAULT_CONTAINER_REGISTRY = from_conf("DEFAULT_CONTAINER_REGISTRY")
# Controls whether to include foreach stack information in metadata.
# TODO(Darin, 05/01/24): Remove this flag once we are confident with this feature.
- INCLUDE_FOREACH_STACK = from_conf("INCLUDE_FOREACH_STACK", False)
+ INCLUDE_FOREACH_STACK = from_conf("INCLUDE_FOREACH_STACK", True)
Contributor:

We should probably change this at some point and remove it to not make it optional anymore.

# Filter tasks based on metadata
for task in tasks:
task_id = task.get("task_id")
if not task_id:
Contributor:

when does this happen? Also, task_id of zero is valid iirc.

# and the artifact files are saved as: <attempt>_artifact__<artifact_name>.json
# We loop over all the JSON files in the directory and find the latest one
# that matches the field prefix.
json_files = glob.glob(os.path.join(path, "*.json"))
Contributor:

we should be able to do a more efficient globbing so we don't have to filter by field_prefix later on. Something like f"{field_prefix}*.json".
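A sketch of the narrower glob the comment suggests, with path and field_prefix as in the snippet above:

import glob
import os

# Glob only files that start with the field prefix instead of listing
# every JSON file and filtering afterwards.
json_files = glob.glob(os.path.join(path, f"{field_prefix}*.json"))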

metaflow/task.py (outdated)
type="foreach-indices-truncated",
tags=metadata_tags,
),
MetaDatum(
Contributor:

I believe this is only used in the siblings thing. If that's the case, we may be able to get rid of this when we refactor the siblings thing (if we do that). I am also a little confused as to why this is needed.

Collaborator (Author):

Will refactor the siblings function, mostly to return siblings irrespective of whether the task is in a foreach or not.

Contributor:

Cool -- I think we can now get rid of this metadatum then right?

metaflow/task.py (outdated)
tags=metadata_tags,
),
MetaDatum(
field="previous_steps",
Contributor:

nit: consistency here, e.g. between previous_steps and foreach-indices (underscores vs. hyphens).

}
url = ServiceMetadataProvider._obj_path(flow_id, run_id, query_step)
url = f"{url}/tasks?{urlencode(query_params)}"
return cls._request(cls._monitor, url, "GET")
Collaborator:

Getting an error with this: cls does not have _monitor. All other calls to _request pass in None.

Collaborator (Author):

Yes, I will simply pass in None.
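A sketch of the corrected call, mirroring the other _request call sites mentioned above:

url = ServiceMetadataProvider._obj_path(flow_id, run_id, query_step)
url = f"{url}/tasks?{urlencode(query_params)}"
# Pass None for the monitor, as the other classmethod call sites do;
# cls has no _monitor attribute here.
return cls._request(None, url, "GET")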

"query_step": query_step,
}
url = ServiceMetadataProvider._obj_path(flow_id, run_id, query_step)
url = f"{url}/tasks?{urlencode(query_params)}"
Collaborator:

missing import for urlencode. f-strings are probably fine by 2025, as we've gotten rid of the older tests that break with them.
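The missing import would come from the standard library:

from urllib.parse import urlencode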

Contributor:

Ya, I was wondering about f-strings but did check, and our official minimum version is 3.6, which supports them -- and yes, let's move at least a tad into the future :). I'm going to start using them too and the code will slowly migrate (and become infinitesimally faster :)).

@talsperre force-pushed the dev/add-runtime-dag-info branch from 17a4489 to 7cdfb41 on January 15, 2025 at 00:53
m.name: m.value
for m in sorted(self.metadata, key=lambda m: m.created_at)
}
return self._metadata_dict
Contributor:

Note: this slightly changes the semantics, since now if there is new metadata, the user won't get it. Should check if this impacts other operations, or scope the caching to just the functions that need it.


def _get_related_tasks(self, is_ancestor: bool) -> Dict[str, List[str]]:
flow_id, run_id, _, _ = self.path_components
steps = (
@saikonen (Collaborator) commented on Jan 16, 2025:

Data type problem here which leads to the queries not working correctly:
steps ends up being of type str on the OSS metadata service, so you end up iterating over characters instead of step names, e.g.:

/flows/SplitFlow/runs/63/steps/{/filtered_tasks?metadata_field_name=foreach-indices&metadata_field_value=%7B%7D&query_step=%7B
/flows/SplitFlow/runs/63/steps/e/filtered_tasks?metadata_field_name=foreach-indices&metadata_field_value=%7B%7D&query_step=e
/flows/SplitFlow/runs/63/steps/n/filtered_tasks?metadata_field_name=foreach-indices&metadata_field_value=%7B%7D&query_step=n
/flows/SplitFlow/runs/63/steps/d/filtered_tasks?metadata_field_name=foreach-indices&metadata_field_value=%7B%7D&query_step=d
/flows/SplitFlow/runs/63/steps/}/filtered_tasks?metadata_field_name=foreach-indices&metadata_field_value=%7B%7D&query_step=%7D
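A defensive sketch of one way to guard against this; the helper name is hypothetical, and the point is simply to guarantee a list of step names rather than a bare string before iterating:

from typing import List, Union

def _as_step_list(steps: Union[str, List[str]]) -> List[str]:
    # A single step name arriving as a bare str would otherwise be iterated
    # character by character, producing URLs like the ones above.
    return [steps] if isinstance(steps, str) else list(steps)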

@talsperre force-pushed the dev/add-runtime-dag-info branch from a8df33d to 7833e40 on January 22, 2025 at 08:30
@savingoyal (Collaborator) left a comment:

quick UX feedback - let me know if the newly proposed UX misses out on any use cases. I am reviewing the rest of the PR meanwhile.

}

@property
def immediate_ancestors(self) -> Dict[str, List[str]]:
Collaborator:

can we offer a property parents which simply returns a list of task pathspecs? it will return None for the start step. an open question is if we would want to also offer parent - maybe we can cross that bridge as a follow-up PR if needed.
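A minimal class-body sketch of the suggested parents property, assuming it can be layered on the immediate_ancestors mapping shown above (typing imports omitted):

@property
def parents(self) -> Optional[List[str]]:
    # Flatten {step_name: [task_pathspec, ...]} into a flat list of
    # pathspecs; the start step has no ancestors, so return None there.
    ancestors = self.immediate_ancestors
    if not ancestors:
        return None
    return [pathspec for specs in ancestors.values() for pathspec in specs]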

Collaborator:

what are the consistency guarantees offered by this property. do we expect that this property will return an immutable set of parents as soon as this task is registered?

Collaborator:

also - we are assuming that return values here point to the latest successful attempt. might be good to note this in the doc string.

return self._get_related_tasks(is_ancestor=True)

@property
def immediate_successors(self) -> Dict[str, List[str]]:
Collaborator:

can we offer a property children which simply returns a list of task pathspecs. it will return None for the end step. similar comment for child as for parent

Collaborator:

what is the expected behavior of this property? as soon as child tasks are registered, do we start updating the children property? also, when do we know that there are not going to be any more child tasks?

return self._get_related_tasks(is_ancestor=False)

@property
def siblings(self) -> Dict[str, List[str]]:
Collaborator:

can this be siblings(self, ancestors) - where ancestors is a list of task pathspecs and defaults to task.parents?

metaflow/task.py (outdated)
@@ -493,6 +512,36 @@ def run_step(
)
)

# Add runtime dag info - for a nested foreach this may look like:
Collaborator:

in terms of metadata that needs to be stored - we could simply stringify the stack and store it - step1:0,step2:1... - and that should be it, no? a prefix query should be fast enough
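A sketch of the stringified encoding the comment describes; the .step and .index attributes on the foreach stack frames are an assumption here:

def _foreach_execution_path(foreach_stack):
    # Encode the stack as "step1:0,step2:1,..." so related tasks can be
    # found with a single prefix query on one metadata field.
    return ",".join(f"{frame.step}:{frame.index}" for frame in foreach_stack)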

metaflow/client/core.py (two outdated, resolved comments)
return Step(f"{flow_id}/{run_id}/{query_step}", _namespace_check=False).task

@property
def _graph_info(self):
Collaborator:

do we need to expose _graph_info, ancestor_steps or successor_steps to the end user?

Collaborator:

is my understanding correct that _graph_info is introduced to support static dag info? if so - a better ux would be to introduce parent_steps and child_steps in the step object instead of exposing _graph_info as is. we treat _graph_info as a special escape hatch - so best to not make it formal at the moment and lose some much-needed flexibility around its structure.

metaflow/task.py (resolved)
metaflow/task.py (outdated)
@@ -493,6 +506,20 @@ def run_step(
)
)

# Add runtime dag info - for a nested foreach this may look like:
# foreach_indices: "step1:idx1,step2:idx2,step3:idx3"
foreach_indices = self._dynamic_runtime_metadata(foreach_stack)
Collaborator:

we could use a different term for this field - it's a stack of foreach_indices - maybe - foreach_execution_path

@@ -301,7 +303,7 @@ def __init__(
# distinguish between "attempt will happen" and "no such
# attempt exists".

- if pathspec:
+ if _use_pathspec and pathspec:
Collaborator:

task.ancestors[0].system_tags is unfortunately always empty and different from parent_task.system_tags - it might be better to just yield the full task objects instead

Contributor:

I think we could fetch it once for the whole run and use it?

ancestor_pathspecs = set([task.pathspec for task in ancestors])

# Compare with stored parent_task_pathspecs
task_pathspec = task.data.task_pathspec
Collaborator:

You should check not just the pathspec but that the contents of the Task object are the same.

List["Task"]
List of all ancestor tasks of the current task.
"""
return self._get_related_tasks(is_ancestor=True)
Collaborator:

        steps = self.ancestor_steps
        if not steps:
            return

        if len(steps) > 1:
            # Static join - use exact path matching
            pattern = self.metadata_dict.get("foreach-indices", ".*")
        else:
            # Foreach join - match tasks with shorter foreach path
            current_path = self.metadata_dict.get("foreach-indices", "")
            if not current_path:
                pattern = ".*"
            else:
                target_task = Step(f"{self.flow_id}/{self.run_id}/{steps[0]}", _namespace_check=False).task
                target_depth = len(target_task.metadata_dict.get("foreach-indices", "").split(","))
                pattern = ",".join(current_path.split(",")[:target_depth])
            
        yield from self._iter_matching_tasks(steps, pattern)


@property
def successors(self) -> List["Task"]:
"""
Collaborator:

        steps = self.parent.child_steps
        if not steps:
            return
            
        if len(steps) > 1:
            # Static split - use exact path matching
            pattern = self.metadata_dict.get("foreach-indices", ".*")
        else:
            # Foreach split - match tasks with longer foreach path
            current_path = self.metadata_dict.get("foreach-indices", "")
            pattern = f"{current_path},.*" if current_path else ".*"
            
        yield from self._iter_matching_tasks(steps, pattern)

_, _, step_name, _ = self.path_components
return self._graph_info[step_name]["next"]

def _get_metadata_query_vals(
Collaborator:

  def _iter_matching_tasks(self, steps, pattern):
        """
        Yield tasks from specified steps matching a foreach path pattern.

        Parameters
        ----------
        steps : List[str]
            List of step names to search for tasks
        pattern : str
            Regex pattern to match foreach-indices metadata

        Returns
        -------
        Iterator[Task]
            Tasks matching the foreach path pattern
        """
        flow_id, run_id, _, _ = self.path_components
        
        for step in steps:
            task_ids = self._metaflow.metadata.filter_tasks_by_metadata(
                flow_id, run_id, step, "foreach-indices", pattern
            )
            for task_id in task_ids:
                yield Task(
                    pathspec="%s/%s/%s/%s" % (flow_id, run_id, step, task_id),
                    _namespace_check=False
                )

Contributor:

Why couldn't the filter_task_by_metadata return all the info we need to form the task? It seems it would save a few RT calls right?

@romain-intel (Contributor) left a comment:

not a review -- just some comments.

@talsperre force-pushed the dev/add-runtime-dag-info branch from 75c1301 to bc9e456 on February 12, 2025 at 11:49
target_task = Step(
f"{flow_id}/{run_id}/{steps[0]}", _namespace_check=False
).task
target_path = target_task.metadata_dict.get("foreach-execution-path", "")
Collaborator:

nit - just target_task.metadata_dict.get("foreach-execution-path")

if not steps:
return

current_path = self.metadata_dict.get("foreach-execution-path", "")
Collaborator:

nit - just self.metadata_dict.get("foreach-execution-path") - just to guard against an eventuality where foreach-execution-path set to empty starts having a meaning.

yield Task(pathspec=task_pathspec, _namespace_check=False)

@property
def parent_tasks(self) -> List["Task"]:
Collaborator:

would be good to handle the case where the user is inspecting a task that ran using an old version of metaflow or is using an old version of the service...
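One possible guard for that case (a sketch; the helper name is hypothetical, and it assumes newer Metaflow versions always record the foreach-execution-path metadatum, even when empty, so its absence signals an old client or service):

from metaflow.exception import MetaflowException

def _ensure_lineage_metadata(task):
    # Tasks written by older Metaflow versions or an older metadata service
    # never carry this metadatum, so the lineage queries cannot work for them.
    if "foreach-execution-path" not in task.metadata_dict:
        raise MetaflowException(
            "Ancestor/successor information is not available for this task; "
            "it was likely produced by an older version of Metaflow or an "
            "older metadata service."
        )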

@@ -1123,6 +1123,139 @@ def _iter_filter(self, x):
# exclude private data artifacts
return x.id[0] != "_"

def _iter_matching_tasks(self, steps, pattern):
Collaborator:

minor nit - _iter_matching_tasks(self, steps, metadata_key, metadata_pattern)

def parent_tasks(self) -> List["Task"]:
"""
Returns a list of all parent tasks of the current task for the latest successful
attempt.
Collaborator:

parents should be the same across attempts?

cls,
flow_id: str,
run_id: str,
query_step: str,
Collaborator:

nit - step_name

if not task_id:
continue

task_name = task.get("task_name")
Collaborator:

curious - why task_name?

@@ -493,6 +504,19 @@ def run_step(
)
)

# Add runtime dag information to the metadata of the task
foreach_execution_path = self._dynamic_runtime_metadata(foreach_stack)
Collaborator:

this comment still applies

from metaflow_test import MetaflowTest, ExpectationFailed, steps


class ChildrenTest(MetaflowTest):
Collaborator:

you can combine these two tests together.

child_steps = task.parent.child_steps

for child_task in child_tasks:
assert task.pathspec in child_task.data.parent_pathspecs, (
Collaborator:

can you verify other properties as well?
