
CA-409510: Make xenopsd nested Parallel atoms explicit #6469


Open: snwoods wants to merge 2 commits into master from private/stevenwo/CA-409510

Conversation

snwoods
Contributor

@snwoods snwoods commented May 15, 2025

Each Parallel atom takes up a worker thread whilst its children do the actual work, so we have parallel_queues to prevent a deadlock. However, nested Parallel atoms take up an additional worker, meaning they can still cause a deadlock. This commit adds a new Nested_parallel atomic with matching nested_parallel_queues to remove the possibility of this deadlock.
This increases the total number of workers, but the extra workers only hold the Nested_parallel atomics and will not be doing any actual work.
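For readers less familiar with the worker-pool scheme, here is a minimal sketch of the idea (placeholder types and names, not xenopsd's actual code): each class of atom is routed to its own pool, so a worker that blocks waiting for its children can never be starved by atoms of the same class.

```ocaml
(* Sketch only: placeholder types, not xenopsd's actual implementation. *)
type atom =
  | Atomic of string              (* a real, indivisible unit of work *)
  | Serial of atom list           (* run children one after another *)
  | Parallel of atom list         (* run children concurrently *)
  | Nested_parallel of atom list  (* a Parallel nested inside a Parallel *)

type queue_class = Work | Parallel_q | Nested_parallel_q

(* A worker executing [Parallel xs] blocks until its children finish, so it
   must sit in a pool that those children (including nested Parallels) never
   consume; hence the separate queue for Nested_parallel. *)
let queue_of_atom = function
  | Parallel _ -> Parallel_q
  | Nested_parallel _ -> Nested_parallel_q
  | Atomic _ | Serial _ -> Work
```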

@snwoods snwoods force-pushed the private/stevenwo/CA-409510 branch 2 times, most recently from 6bdd89f to 18d9bb4 on May 15, 2025 10:38
@last-genius
Contributor

@robhoes Is this the work you mentioned yesterday? Was this deadlock noticed in testing/production?

Each Parallel atom takes up a worker thread whilst its children do the
actual work, so we have parallel_queues to prevent a deadlock. However,
nested Parallel atoms take up an additional worker, meaning they can
still cause a deadlock. This commit adds a new Nested_parallel atomic
with matching nested_parallel_queues to remove the possibility of this
deadlock.
This increases the total number of workers, but these workers are just
to hold the Nested_parallel atomics and will not be doing any actual work.

Signed-off-by: Steven Woods <[email protected]>
@snwoods snwoods force-pushed the private/stevenwo/CA-409510 branch from 18d9bb4 to cedf836 on May 15, 2025 11:12
@psafont
Member

psafont commented May 15, 2025

Is this the work you mentioned yesterday?

Yes, this has been in the product since I introduced the nested parallel > serial > parallel construct. We detected it during testing under heavy workloads.

@lindig
Contributor

lindig commented May 15, 2025

Can we accidentally create a nesting that is not safe? Do we need predicates to check that we don't end up in such a situation?

@edwintorok
Contributor

edwintorok commented May 15, 2025

by accident a nesting that is not safe

Even with this change only 1 level of nesting is supported. Parallel/Serial/NestedParallel/Serial/NestedParallel would probably still deadlock (I don't think we have such a construct in the code currently).

We can check the constraints at the top level where we have the full view of the atom tree. Although that'd be a runtime check, not a compile time check.
If we want to do this at compile time then we should change the variant to a polymorphic variant, to allow constraints to be enforced more easily.

Then at the top level you can have `NestedParallel of [`Serial of ...] | ... and `Parallel of [`NestedParallel of ...]. To make this a bit easier, type aliases can be introduced: type true_atoms = [`VM_start | ...] (the ones that are really indivisible), then type nested_parallel = [`NestedParallel of [`Serial of true_atoms | true_atoms] | `Serial of true_atoms | true_atoms], and type top_level = [nested_parallel | `Parallel of nested_parallel | `Serial of [`Parallel of nested_parallel | true_atoms]], maybe with a few more aliases to factor out repetition.

I'd suggest trying to write down the polymorphic types for just one or two atoms (plus serial, parallel and nested_parallel), and once we're happy with those, we could try an experiment: replace them in the code and see what compile errors we get. If we add a constraint that the top-level atoms must be [< top_level], then the compiler will check that we don't create any nesting that is not allowed by the polymorphic types (e.g. if we never declare `NestedParallel to accept `Parallel or `NestedParallel as children, then attempting to use them there should be a compile error).

See here how a state machine can be encoded into polymorphic variants for the basic idea: https://github.com/hammerlab/ketrew/blob/master/src/pure/target.ml#L169
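As a rough illustration of this compile-time approach (placeholder atom names, not xenopsd's real constructors; a sketch under the one-level-of-nesting assumption, not a drop-in design):

```ocaml
(* Sketch only: placeholder atoms, structured so that a parallel construct
   can never appear underneath a Nested_parallel. *)
type true_atom = [ `VM_start of string | `VBD_plug of string ]

(* Allowed directly under a Nested_parallel: true atoms, or serial groups
   of true atoms -- never another Parallel or Nested_parallel. *)
type nested_child = [ true_atom | `Serial of true_atom list ]

type nested_parallel = [ `Nested_parallel of nested_child list ]

(* Allowed under a top-level Parallel: nested parallelism is fine, but
   another Parallel is not. *)
type parallel_child = [ true_atom | `Serial of true_atom list | nested_parallel ]

type top_level =
  [ true_atom
  | nested_parallel
  | `Parallel of parallel_child list
  | `Serial of parallel_child list ]

(* Accepted by the compiler: *)
let ok : top_level =
  `Parallel [ `Nested_parallel [ `Serial [ `VM_start "vm1" ] ]; `VBD_plug "vbd1" ]

(* Rejected by the compiler, because `Parallel is not a nested_child:
   let bad : top_level = `Nested_parallel [ `Parallel [ `VM_start "vm1" ] ] *)
```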

Contributor

@lindig lindig left a comment


I believe this so far does not address the problem of illegal nestings (which would represent an implementation defect).

@lindig
Contributor

lindig commented May 19, 2025

Is this tied to an XSI? If so, we really should get this merged. If it is not strictly required to resolve the XSI, I am happy to see more iterations.

This is a stopgap until we add compile-time constraints on the nesting,
by for example using a polymorphic variant.

Signed-off-by: Steven Woods <[email protected]>
@snwoods snwoods force-pushed the private/stevenwo/CA-409510 branch from 93c5b97 to 6258405 on May 19, 2025 15:42
@snwoods
Contributor Author

snwoods commented May 19, 2025

this is so far not addressing the problem of illegal nestings

Correct. I've added a check that will log a warning if there are illegal nestings, and created ticket CP-308087 to add proper constraints, e.g. using polymorphic variants.
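Such a check could look roughly like the following (a hypothetical sketch reusing the placeholder atom type from the earlier sketch; the actual check in this PR may differ):

```ocaml
(* Hypothetical sketch: walk the atom tree and warn whenever a parallel
   construct appears underneath a Nested_parallel, since only one level of
   nesting is supported. *)
let rec check_nesting ~inside_nested = function
  | Atomic _ -> ()
  | Serial children -> List.iter (check_nesting ~inside_nested) children
  | Parallel children ->
      if inside_nested then
        Printf.eprintf "warning: Parallel nested under Nested_parallel\n" ;
      List.iter (check_nesting ~inside_nested) children
  | Nested_parallel children ->
      if inside_nested then
        Printf.eprintf "warning: Nested_parallel nested under Nested_parallel\n" ;
      List.iter (check_nesting ~inside_nested:true) children

let check_atom_tree atom = check_nesting ~inside_nested:false atom
```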

@snwoods
Contributor Author

snwoods commented May 19, 2025

All testing green for Ring3 BVT+BST and SRIscsiPerVMScalability (the sequence which found this defect)
With worker-pool-size=1:
BVT+BST 217631, SRIscsiPerVMScalability 4310003 and 4309952
With worker-pool-size=25:
BVT+BST 217634, SRIscsiPerVMScalability 4310057
