-
Notifications
You must be signed in to change notification settings - Fork 292
CP-50836: Add VM_migrate_downtime and request_shutdown spans to help visualise potential optimisations #6379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CP-50836: Add VM_migrate_downtime and request_shutdown spans to help visualise potential optimisations #6379
Conversation
Signed-off-by: Steven Woods <[email protected]>
5162e09
to
da753e6
Compare
@@ -1778,7 +1778,8 @@ module VM = struct | |||
if di.Xenctrl.total_memory_pages = 0n then | |||
raise (Xenopsd_error Domain_not_built) ; | |||
Domain.unpause ~xc di.Xenctrl.domid | |||
) | |||
) ; | |||
with_tracing ~task:t ~name:"VM_migrate_downtime_end" Fun.id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this span going to be generated when a domain is unpaused for another reason? such as after a suspend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh good point, you're right! Hmm the simplest way to resolve this would be to add a bool e.g. during_migrate to save and unpause and only create the span when it's true. However, the VM_save atomic already takes a 4-sized tuple and if I try this fix it gives "Tuples with arity > 4 are not supported", which is understandable, but I'm not sure what the nicest way to resolve this would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also starting a VM includes unpausing... Why not add this in the VM_receive_memory op in Xenops_server itself, which is specific to migration?
Is this empty "span" just meant as marker, because we can't have an actual cross-host downtime span?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Yes, if we could have a single span with the correct downtime begin and end that would be best, but it appears to be rather complicated to do. Baggage works cross-host but won't work here as they are sibling spans/are on different subtrees. A possible solution would be to add to the xenops http headers and use a propagation_trace context, I've created CP-54049 to make this improvement, but it is lower priority as the cost/reward doesn't seem worth it currently. The span would also go across multiple trees, so we would want to introduce the concept of an async span, happening separately to the rest of the flow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still surprised that a span over an identity function is created.
@@ -2523,6 +2525,7 @@ module VM = struct | |||
) | |||
then | |||
raise (Xenopsd_error Failed_to_acknowledge_suspend_request) ; | |||
with_tracing ~task ~name:"VM_migrate_downtime_begin" Fun.id ; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll see this for suspending as well. You could detect the migration case by matching data
with Some (FD _)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
⏫
da753e6
to
b47a601
Compare
This simplifies making optimisations as it removes the need to look in the logs for these events. Signed-off-by: Steven Woods <[email protected]>
b47a601
to
caea30d
Compare
I've fixed the comments above, so the spans are now correctly only being created for VM_migrate. You can see this in jaeger for the job 4271591. |
Also simplify two tracing functions which could just use Span.get_trace_context directly