-
Notifications
You must be signed in to change notification settings - Fork 797
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add static and runtime dag info, API to fetch ancestor and successor tasks #2124
base: master
Are you sure you want to change the base?
Conversation
48c771d
to
ec43f14
Compare
metaflow/metadata/metadata.py
Outdated
@classmethod | ||
def _filter_tasks_by_metadata( | ||
cls, flow_id, run_id, query_step, field_name, field_value | ||
): | ||
raise NotImplementedError() | ||
|
||
@classmethod | ||
def filter_tasks_by_metadata( | ||
cls, flow_id, run_id, query_step, field_name, field_value | ||
): | ||
# TODO: Do we need to do anything wrt to task attempt? | ||
task_ids = cls._filter_tasks_by_metadata( | ||
flow_id, run_id, query_step, field_name, field_value | ||
) | ||
return task_ids | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a need for the private method, or could this simply be contained in the public-facing one? right now its not doing anything before calling the private one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, did you have an implementation of this for service.py
yet?
metaflow/metadata/metadata.py
Outdated
def filter_tasks_by_metadata( | ||
cls, flow_id, run_id, query_step, field_name, field_value | ||
): | ||
# TODO: Do we need to do anything wrt to task attempt? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably not, as the ancestors for task attempts should be identical, right? What about the immediate_siblings
though, will they include or exclude attempts of the same task?
ffbf68a
to
c6fb9ac
Compare
d66d32b
to
7644058
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments. I think it's pretty close though. I haven't looked at hte metadata service changes. We may also want to raise a better error message if the service is not new enough?
metaflow/client/core.py
Outdated
run_id: str, | ||
cur_foreach_stack_len: int, | ||
steps: List[str], | ||
query_type: str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would just use a boolean. Something like is_ancestor
. These are internal functions anyways and slightly more efficient to use bools :)
metaflow/client/core.py
Outdated
if query_foreach_stack_len == cur_foreach_stack_len: | ||
# The successor or ancestor tasks belong to the same foreach stack level | ||
field_name = "foreach-indices" | ||
field_value = self.metadata_dict.get(field_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't currently cache metadata_dict so either we could fix that or cache it here to avoid making multiple calls to the metadata service and then sorts. It would need to be cached across _get_related_tasks and this function.
metaflow/client/core.py
Outdated
# Current Task: foreach-indices = [0, 1, 2], foreach-indices-truncated = [0, 1] | ||
# Ancestor Task: foreach-indices = [0, 1], foreach-indices-truncated = [0] | ||
# We will compare the foreach-indices value of ancestor task with the | ||
# foreach-indices value of current task |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: foreach-indices-truncated value of the current task
metaflow/client/core.py
Outdated
return field_name, field_value | ||
|
||
def _get_related_tasks(self, relation_type: str) -> Dict[str, List[str]]: | ||
start_time = time.time() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not used -- can strip or use.
metaflow/metaflow_config.py
Outdated
@@ -248,8 +248,7 @@ | |||
# Default container registry | |||
DEFAULT_CONTAINER_REGISTRY = from_conf("DEFAULT_CONTAINER_REGISTRY") | |||
# Controls whether to include foreach stack information in metadata. | |||
# TODO(Darin, 05/01/24): Remove this flag once we are confident with this feature. | |||
INCLUDE_FOREACH_STACK = from_conf("INCLUDE_FOREACH_STACK", False) | |||
INCLUDE_FOREACH_STACK = from_conf("INCLUDE_FOREACH_STACK", True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably change this at some point and remove it to not make it optional anymore.
# Filter tasks based on metadata | ||
for task in tasks: | ||
task_id = task.get("task_id") | ||
if not task_id: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when does this happen? Also, task_id of zero is valid iirc.
# and the artifact files are saved as: <attempt>_artifact__<artifact_name>.json | ||
# We loop over all the JSON files in the directory and find the latest one | ||
# that matches the field prefix. | ||
json_files = glob.glob(os.path.join(path, "*.json")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should be able to do a more efficient globbing so we don't have to filter by field_prefix later on. SOmething like f"{field_prefix}*.json"
.
metaflow/task.py
Outdated
type="foreach-indices-truncated", | ||
tags=metadata_tags, | ||
), | ||
MetaDatum( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is only used in the siblings thing. If that's the case, we may be able to get rid of this when we refactor the siblings thing (if we do that). I am also a little confused as to why this is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will refactor siblings function mostly to return siblings irrespective of whether it is in a for each or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool -- I think we can now get rid of this metadatum then right?
metaflow/task.py
Outdated
tags=metadata_tags, | ||
), | ||
MetaDatum( | ||
field="previous_steps", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: consistency here between previous_steps and foreach-indices for example.
} | ||
url = ServiceMetadataProvider._obj_path(flow_id, run_id, query_step) | ||
url = f"{url}/tasks?{urlencode(query_params)}" | ||
return cls._request(cls._monitor, url, "GET") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getting an error with this that cls
does not have _monitor
. All other calls to _request
pass in None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will pass in None
simply.
"query_step": query_step, | ||
} | ||
url = ServiceMetadataProvider._obj_path(flow_id, run_id, query_step) | ||
url = f"{url}/tasks?{urlencode(query_params)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing import for urlencode. f-strings are probably fine by 2025, as we've gotten rid of the older tests that break with them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya, I was wondering about fstrings but did check and our official minimum version is 3.6 which supports it -- and yes, let's move at least a tad into the future :). I'm going to start using them too and the code will slowly migrate to it (and become infinitesimally faster :) )
17a4489
to
7cdfb41
Compare
metaflow/client/core.py
Outdated
m.name: m.value | ||
for m in sorted(self.metadata, key=lambda m: m.created_at) | ||
} | ||
return self._metadata_dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this slightly changes the syntax since now if there is new metadata, the user won't get it. Should check if this impacts other operations. Or scope the caching to just the functions that need it.
metaflow/client/core.py
Outdated
|
||
def _get_related_tasks(self, is_ancestor: bool) -> Dict[str, List[str]]: | ||
flow_id, run_id, _, _ = self.path_components | ||
steps = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data type problem here which leads to the queries not working correctly:
the steps ends up being of type str
on OSS metadata-service, so using these you end up iterating over characters instead of step names, e.g.:
/flows/SplitFlow/runs/63/steps/{/filtered_tasks?metadata_field_name=foreach-indices&metadata_field_value=%7B%7D&query_step=%7B
/flows/SplitFlow/runs/63/steps/e/filtered_tasks?metadata_field_name=foreach-indices&metadata_field_value=%7B%7D&query_step=e
/flows/SplitFlow/runs/63/steps/n/filtered_tasks?metadata_field_name=foreach-indices&metadata_field_value=%7B%7D&query_step=n
/flows/SplitFlow/runs/63/steps/d/filtered_tasks?metadata_field_name=foreach-indices&metadata_field_value=%7B%7D&query_step=d
/flows/SplitFlow/runs/63/steps/}/filtered_tasks?metadata_field_name=foreach-indices&metadata_field_value=%7B%7D&query_step=%7D
a8df33d
to
7833e40
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quick UX feedback - let me know if in the new proposed UX we miss out on any use cases. i am reviewing the rest of the PR meanwhile.
metaflow/client/core.py
Outdated
} | ||
|
||
@property | ||
def immediate_ancestors(self) -> Dict[str, List[str]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we offer a property parents
which simply returns a list of task pathspecs. it will return None
for the start
step. an open question is if we would want to also offer parent
- maybe we can cross that bridge as a follow up PR if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are the consistency guarantees offered by this property. do we expect that this property will return an immutable set of parents as soon as this task is registered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also - we are assuming that return values here point to the latest successful attempt. might be good to note this in the doc string.
metaflow/client/core.py
Outdated
return self._get_related_tasks(is_ancestor=True) | ||
|
||
@property | ||
def immediate_successors(self) -> Dict[str, List[str]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we offer a property children
which simply returns a list of task pathspecs. it will return None
for the end
step. similar comment for child
as for parent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the expected behavior of this property? as soon as children task are registered, do we start updating the children
property? also, when do we know that there are not going to be any more children tasks?
metaflow/client/core.py
Outdated
return self._get_related_tasks(is_ancestor=False) | ||
|
||
@property | ||
def siblings(self) -> Dict[str, List[str]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be siblings(self, ancestors)
- where ancestors
is a list of task pathspec and defaults to task.parents
?
metaflow/task.py
Outdated
@@ -493,6 +512,36 @@ def run_step( | |||
) | |||
) | |||
|
|||
# Add runtime dag info - for a nested foreach this may look like: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in terms of metadata that needs to be stored - we could simply stringify the stack and store it - step1:0,step2:1...
- and that should be it, no? a prefix query should be fast enough
metaflow/client/core.py
Outdated
return Step(f"{flow_id}/{run_id}/{query_step}", _namespace_check=False).task | ||
|
||
@property | ||
def _graph_info(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to expose _graph_info
, ancestor_steps
or successor_steps
to the end user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is my understanding correct that _graph_info is introduced to support static dag info? if so - a better ux would be to introduce parent_steps and child_steps in the step object instead of exposing _graph_info
as is. we treat _graph_info
as a special escape hatch - so best to not make it formal at the moment and lose some much needed flexibility around it's structure.
metaflow/task.py
Outdated
@@ -493,6 +506,20 @@ def run_step( | |||
) | |||
) | |||
|
|||
# Add runtime dag info - for a nested foreach this may look like: | |||
# foreach_indices: "step1:idx1,step2:idx2,step3:idx3" | |||
foreach_indices = self._dynamic_runtime_metadata(foreach_stack) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could use a different term for this field - it's a stack of foreach_indices - maybe - foreach_execution_path
metaflow/client/core.py
Outdated
@@ -301,7 +303,7 @@ def __init__( | |||
# distinguish between "attempt will happen" and "no such | |||
# attempt exists". | |||
|
|||
if pathspec: | |||
if _use_pathspec and pathspec: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
task.ancestors[0].system_tags is unfortunately always empty and different that parent_task.system_tags - it might be better to just yield the full task objects instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could fetch it once for the whole run and use it?
test/core/tests/client_ancestors.py
Outdated
ancestor_pathspecs = set([task.pathspec for task in ancestors]) | ||
|
||
# Compare with stored parent_task_pathspecs | ||
task_pathspec = task.data.task_pathspec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should check not just the pathspec but that the contents of the Task object are the same.
metaflow/client/core.py
Outdated
List["Task"] | ||
List of all ancestor tasks of the current task. | ||
""" | ||
return self._get_related_tasks(is_ancestor=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
steps = self.ancestor_steps
if not steps:
return
if len(steps) > 1:
# Static join - use exact path matching
pattern = self.metadata_dict.get("foreach-indices", ".*")
else:
# Foreach join - match tasks with shorter foreach path
current_path = self.metadata_dict.get("foreach-indices", "")
if not current_path:
pattern = ".*"
else:
target_task = Step(f"{self.flow_id}/{self.run_id}/{steps[0]}", _namespace_check=False).task
target_depth = len(target_task.metadata_dict.get("foreach-indices", "").split(","))
pattern = ",".join(current_path.split(",")[:target_depth])
yield from self._iter_matching_tasks(steps, pattern)
metaflow/client/core.py
Outdated
|
||
@property | ||
def successors(self) -> List["Task"]: | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
steps = self.parent.child_steps
if not steps:
return
if len(steps) > 1:
# Static split - use exact path matching
pattern = self.metadata_dict.get("foreach-indices", ".*")
else:
# Foreach split - match tasks with longer foreach path
current_path = self.metadata_dict.get("foreach-indices", "")
pattern = f"{current_path},.*" if current_path else ".*"
yield from self._iter_matching_tasks(steps, pattern)
metaflow/client/core.py
Outdated
_, _, step_name, _ = self.path_components | ||
return self._graph_info[step_name]["next"] | ||
|
||
def _get_metadata_query_vals( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _iter_matching_tasks(self, steps, pattern):
"""
Yield tasks from specified steps matching a foreach path pattern.
Parameters
----------
steps : List[str]
List of step names to search for tasks
pattern : str
Regex pattern to match foreach-indices metadata
Returns
-------
Iterator[Task]
Tasks matching the foreach path pattern
"""
flow_id, run_id, _, _ = self.path_components
for step in steps:
task_ids = self._metaflow.metadata.filter_tasks_by_metadata(
flow_id, run_id, step, "foreach-indices", pattern
)
for task_id in task_ids:
yield Task(
pathspec="%s/%s/%s/%s" % (flow_id, run_id, step, task_id),
_namespace_check=False
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why couldn't the filter_task_by_metadata return all the info we need to form the task? It seems it would save a few RT calls right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a review -- just some comments.
metaflow/client/core.py
Outdated
@@ -301,7 +303,7 @@ def __init__( | |||
# distinguish between "attempt will happen" and "no such | |||
# attempt exists". | |||
|
|||
if pathspec: | |||
if _use_pathspec and pathspec: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could fetch it once for the whole run and use it?
metaflow/client/core.py
Outdated
_, _, step_name, _ = self.path_components | ||
return self._graph_info[step_name]["next"] | ||
|
||
def _get_metadata_query_vals( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why couldn't the filter_task_by_metadata return all the info we need to form the task? It seems it would save a few RT calls right?
75c1301
to
bc9e456
Compare
target_task = Step( | ||
f"{flow_id}/{run_id}/{steps[0]}", _namespace_check=False | ||
).task | ||
target_path = target_task.metadata_dict.get("foreach-execution-path", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - just target_task.metadata_dict.get("foreach-execution-path")
if not steps: | ||
return | ||
|
||
current_path = self.metadata_dict.get("foreach-execution-path", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - just self.metadata_dict.get("foreach-execution-path")
- just to guard against an eventuality where foreach-execution-path set to empty starts having a meaning.
yield Task(pathspec=task_pathspec, _namespace_check=False) | ||
|
||
@property | ||
def parent_tasks(self) -> List["Task"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be good to handle the case where the user is inspecting a task that ran using an old version of metaflow or is using an old version of the service...
@@ -1123,6 +1123,139 @@ def _iter_filter(self, x): | |||
# exclude private data artifacts | |||
return x.id[0] != "_" | |||
|
|||
def _iter_matching_tasks(self, steps, pattern): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit - _iter_matching_tasks(self, steps, metadata_key, metadata_pattern)
def parent_tasks(self) -> List["Task"]: | ||
""" | ||
Returns a list of all parent tasks of the current task for the latest successful | ||
attempt. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parents should be the same across attempts?
cls, | ||
flow_id: str, | ||
run_id: str, | ||
query_step: str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - step_name
if not task_id: | ||
continue | ||
|
||
task_name = task.get("task_name") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious - why task_name
?
@@ -493,6 +504,19 @@ def run_step( | |||
) | |||
) | |||
|
|||
# Add runtime dag information to the metadata of the task | |||
foreach_execution_path = self._dynamic_runtime_metadata(foreach_stack) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment still applies
from metaflow_test import MetaflowTest, ExpectationFailed, steps | ||
|
||
|
||
class ChildrenTest(MetaflowTest): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can combine these two tests together.
child_steps = task.parent.child_steps | ||
|
||
for child_task in child_tasks: | ||
assert task.pathspec in child_task.data.parent_pathspecs, ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you verify other properties as well?
Add runtime DAG info so that we can query the ancestor and successor tasks for a given task easily.
Usage
To get ancestors, progenies, and siblings, use the following API:
The output would be a list of metaflow
Task
objects.