Skip to content

CP-50836: Add VM_migrate_downtime and request_shutdown spans to help visualise potential optimisations #6379

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 8, 2025

Conversation

snwoods
Copy link
Contributor

@snwoods snwoods commented Mar 20, 2025

Also simplify two tracing functions which could just use Span.get_trace_context directly

@snwoods snwoods force-pushed the private/stevenwo/CP-50836 branch from 5162e09 to da753e6 Compare March 20, 2025 17:23
@@ -1778,7 +1778,8 @@ module VM = struct
if di.Xenctrl.total_memory_pages = 0n then
raise (Xenopsd_error Domain_not_built) ;
Domain.unpause ~xc di.Xenctrl.domid
)
) ;
with_tracing ~task:t ~name:"VM_migrate_downtime_end" Fun.id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this span going to be generated when a domain is unpaused for another reason? such as after a suspend

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh good point, you're right! Hmm the simplest way to resolve this would be to add a bool e.g. during_migrate to save and unpause and only create the span when it's true. However, the VM_save atomic already takes a 4-sized tuple and if I try this fix it gives "Tuples with arity > 4 are not supported", which is understandable, but I'm not sure what the nicest way to resolve this would be.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also starting a VM includes unpausing... Why not add this in the VM_receive_memory op in Xenops_server itself, which is specific to migration?

Is this empty "span" just meant as marker, because we can't have an actual cross-host downtime span?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Yes, if we could have a single span with the correct downtime begin and end that would be best, but it appears to be rather complicated to do. Baggage works cross-host but won't work here as they are sibling spans/are on different subtrees. A possible solution would be to add to the xenops http headers and use a propagation_trace context, I've created CP-54049 to make this improvement, but it is lower priority as the cost/reward doesn't seem worth it currently. The span would also go across multiple trees, so we would want to introduce the concept of an async span, happening separately to the rest of the flow.

Copy link
Contributor

@lindig lindig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still surprised that a span over an identity function is created.

@@ -2523,6 +2525,7 @@ module VM = struct
)
then
raise (Xenopsd_error Failed_to_acknowledge_suspend_request) ;
with_tracing ~task ~name:"VM_migrate_downtime_begin" Fun.id ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll see this for suspending as well. You could detect the migration case by matching data with Some (FD _).

Copy link
Member

@robhoes robhoes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snwoods snwoods force-pushed the private/stevenwo/CP-50836 branch from da753e6 to b47a601 Compare April 2, 2025 09:12
This simplifies making optimisations as it removes the need to look in
the logs for these events.

Signed-off-by: Steven Woods <[email protected]>
@snwoods snwoods force-pushed the private/stevenwo/CP-50836 branch from b47a601 to caea30d Compare April 2, 2025 09:29
@snwoods
Copy link
Contributor Author

snwoods commented Apr 2, 2025

I've fixed the comments above, so the spans are now correctly only being created for VM_migrate. You can see this in jaeger for the job 4271591.

@snwoods snwoods added this pull request to the merge queue Apr 8, 2025
Merged via the queue into xapi-project:master with commit 6fe1395 Apr 8, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants