Stackalloc localloc #112168

Draft: wants to merge 6 commits into main

Conversation

AndyAyersMS
Member

Experiment with turning non-escaping new (nongc)[n] into stackallocs.
Also enable new (nongc)[100] when the allocation site is within a loop, again via stackalloc.

Currently there is no restriction on how big the allocation can be (that will have to change).

The dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Feb 5, 2025
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS
Member Author

AndyAyersMS commented Feb 5, 2025

@hez2010 fyi

I expect this may blow up in some tests with stack overflows; we'll see. Also, I forgot to exclude sites in handlers, so things may blow up there too.

@AndyAyersMS
Member Author

AndyAyersMS commented Feb 5, 2025

Some preliminary notes. @dotnet/jit-contrib @davidwrighton @jkotas interested in your feedback.

This builds on #104906

JIT-introduced stackalloc

Currently the JIT will never introduce a stackalloc into a method, but allowing this may be interesting.

Escape Analysis

The JIT relies on escape analysis to prove that a particular allocation done by a method cannot outlive ("escape") a call to the method. A successful proof requires knowing everything that can possibly happen to that allocation. Proofs of non-escape often founder at call boundaries, since the JIT generally has no knowledge of callee behavior. For instance, x will be considered escaping in M below if the JIT cannot inline Q.

void M()
{
    int[] x = new int[100];
    Q(x);
}

Even when non-escape can be proven, stack allocation may not be possible. For example in the following code snippet, the array assigned to x cannot be stack allocated since the array size is not known to the JIT, and the known-sized arrays assigned to y cannot be stack allocated since the allocation site is in a loop. In these cases the amount of stack growth required is not known at JIT time.

void N(int n)
{
    int[] x = new int[n];
    for (int i = 0; i < n; i++)
    {
        int[] y = new int[100];

        // uses of x and y
    }
}

Limitations like these can be overcome by allowing the JIT to introduce stackalloc into methods. But this comes with other complications:

  • stack space is limited, so these allocations cannot always go on the stack. Large values of n, or even modest values of n when the stack is already close to its limit, can introduce stack overflows into methods that would not have overflowed without stack allocation. So it seems any such allocation must conditionally go on the stack, or somewhere else (say an arena).
  • we generally cannot introduce stackalloc into catch or handler code (finally blocks)
  • objects with GC fields cannot (currently) be handled this way.
    • there is currently no way for a method to describe a runtime-varying number of GC roots to the runtime
    • arrays are heap objects, and writes to arrays require (unchecked) write barriers
    • there is no way to do a store covariance check without a write barrier (easy to fix)
    • many other runtime helpers assume object references are always heap references
    • stack allocated GC roots may end up being treated as so-called "untracked" lifetimes, extending the GC lifetime of the objects they reference
    • diagnostics may become more challenging

None of these seem fundamentally hard to solve, though the cost of checked write barriers might be enough to dissuade us.

If non-escape can be proven but stack allocation is conditional or not possible, the resulting object is still "thread private" and can be optimized more aggressively than if it were a general heap object. There are also widely used idioms (e.g. in Enumerators) where objects clone themselves to provide thread-private access. The JIT does not understand these patterns, but we could work on enhancing the JIT's memory analyses to try to take advantage of this information.
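
For concreteness, here is a rough sketch of the clone-on-demand idiom being referred to, loosely modeled on what compiler-generated iterators do (the type and all names are illustrative, not an actual BCL type):

using System;
using System.Collections;
using System.Collections.Generic;

// Rough sketch of the iterator idiom: GetEnumerator hands back 'this' for the
// first enumeration on the creating thread, and a private clone otherwise, so
// each enumeration ends up with thread-private state.
sealed class CountingSequence : IEnumerable<int>, IEnumerator<int>
{
    private readonly int _count;
    private readonly int _ownerThreadId = Environment.CurrentManagedThreadId;
    private int _state = -2;   // -2 means "not yet enumerated"
    private int _current;

    public CountingSequence(int count) => _count = count;

    public IEnumerator<int> GetEnumerator()
    {
        if (_state == -2 && _ownerThreadId == Environment.CurrentManagedThreadId)
        {
            _state = 0;
            return this;                                    // reuse this instance
        }
        return new CountingSequence(_count) { _state = 0 }; // clone for private access
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    public int Current => _current;
    object IEnumerator.Current => _current;

    public bool MoveNext()
    {
        if (_state >= _count) return false;
        _current = _state++;
        return true;
    }

    public void Reset() => _state = 0;
    public void Dispose() { }
}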

This draft PR introduces stackalloc for non-GC type arrays when the JIT can prove non-escape. This currently has no policy attached. The initial thought for a policy is to leverage the same logic as in TryEnsureSufficientExecutionStack: at each allocation site, check the available stack capacity, and if the stack is not too full, allocate on the stack (perhaps with some additional per-allocation limit), else allocate on the heap.
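
Expressed as source for illustration, the conditional allocation the JIT could emit for a non-escaping new int[n] might look roughly like the following sketch. MaxStackAllocElements is a made-up cap, and the TryEnsureSufficientExecutionStack call stands in for whatever stack-capacity probe the JIT would actually use.

using System;
using System.Runtime.CompilerServices;

static class StackAllocPolicySketch
{
    // Hypothetical per-allocation cap; the real policy is still undecided.
    private const int MaxStackAllocElements = 1024;

    // What "int[] x = new int[n]; ... use x ..." could conceptually become once
    // the JIT has proven the array does not escape this method.
    static int SumFirstN(int n)
    {
        Span<int> x = (uint)n <= MaxStackAllocElements && RuntimeHelpers.TryEnsureSufficientExecutionStack()
            ? stackalloc int[n]   // stack path: zero-initialized by default (localsinit), paid for by this method
            : new int[n];         // heap fallback: n too large or stack too close to its limit

        for (int i = 0; i < x.Length; i++)
            x[i] = i;

        int sum = 0;
        foreach (int v in x)
            sum += v;
        return sum;
    }
}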

These changes may well make the allocating methods slightly slower, as the cost of the array zeroing must now be paid directly by the method rather than incurred by the GC or slow allocation helpers. But they may well make the overall system faster. In some cases the JIT may be able to prove that the zeroing is not necessary, if all elements are written before being read, but that's a ways off (if ever).

Span Captures (not part of this PR)

Escape analysis can possibly also leverage the fact that an allocation may be opaquely captured by a byref-like struct. For instance in

void O()
{
    Span<int> x = new int[100];
    Q(x);
}

the array lifetime cannot exceed O's lifetime and so the array can be safely stack allocated (here "opaquely" means there is no way to extract the captured object from the struct). There is no need to analyze or inline Q.

As above, if the array size is unknown or the allocation site is in a loop, then allocation would require stackalloc and associated policies.

In general, doing this sort of thing requires "field-wise" escape analysis, which is something I intend to work on, but it seems likely that just handling Span might be an easier and valuable special case; a span local has at most one GC reference inside it, so we can likely conflate the object and the reference and just leverage our current analysis.

Enabling this would potentially allow replacing some explicit stackalloc uses in the BCL with completely "safe" alternatives.
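
As a sketch of the kind of replacement meant here (illustrative only; whether the array in the second method actually lands on the stack would depend on the span-capture analysis and the allocation policy discussed above):

using System;

static class SpanCaptureSketch
{
    // Today: an explicit stackalloc, where the author has to pick a size and
    // reason about stack usage by hand.
    static int ChecksumToday(ReadOnlySpan<byte> data)
    {
        Span<byte> scratch = stackalloc byte[64];
        data.Slice(0, Math.Min(data.Length, scratch.Length)).CopyTo(scratch);
        int sum = 0;
        foreach (byte b in scratch) sum = sum * 31 + b;
        return sum;
    }

    // The "safe" alternative: an ordinary array captured only by a span. With
    // span-capture escape analysis the JIT could stack allocate it (subject to
    // policy) without any source-level stackalloc.
    static int ChecksumTomorrow(ReadOnlySpan<byte> data)
    {
        Span<byte> scratch = new byte[64];
        data.Slice(0, Math.Min(data.Length, scratch.Length)).CopyTo(scratch);
        int sum = 0;
        foreach (byte b in scratch) sum = sum * 31 + b;
        return sum;
    }
}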

@jkotas
Member

jkotas commented Feb 5, 2025

The JIT relies on escape analysis to prove that a particular allocation done by a method cannot outlive ("escape") a call to the method

Do you have good examples in the BCL or other real-world code where this kicks in?

replacing some explicit stackalloc uses in the BCL with completely "safe" alternatives.

The typical uses of stackalloc in the BCL are constant-sized stackallocs or stackalloc+ArrayPool combos. Is the unsafe aspect you see in the BCL's stackalloc uses the risk of an unbounded stackalloc slipping through code review?

BCL stackallocs and stackalloc+ArrayPool combos have other safety problems (a typical combo is sketched after this list):

  • The memory is uninitialized. I doubt that we would be willing to pay for initialization of these buffers throughout the BCL.
  • The memory has to be returned to the array pool exactly once.
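
For reference, the combo in question typically looks something like this sketch (names illustrative), with both hazards visible: the buffer contents are not guaranteed to be cleared, and the rented array has to be returned on exactly one path.

using System;
using System.Buffers;

static class StackallocArrayPoolComboSketch
{
    private const int StackAllocThreshold = 256;

    static int Process(ReadOnlySpan<char> input)
    {
        char[]? rented = null;
        Span<char> buffer = input.Length <= StackAllocThreshold
            ? stackalloc char[StackAllocThreshold]                   // uninitialized under [SkipLocalsInit]
            : (rented = ArrayPool<char>.Shared.Rent(input.Length));  // may contain stale data
        try
        {
            input.CopyTo(buffer);
            int sum = 0;
            foreach (char c in buffer.Slice(0, input.Length))
                sum += c;
            return sum;
        }
        finally
        {
            if (rented is not null)
                ArrayPool<char>.Shared.Return(rented);               // must happen exactly once
        }
    }
}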

For the BCL use cases in particular, it may be more interesting to work on #52065 and base this optimization on top of it:

  • It would allow us to replace the stackalloc+ArrayPool combos throughout the BCL
  • JIT would inject code that returns the memory to the pool at the end of the method. Alternatively, we can work with Roslyn to introduce constructs for enforced deterministic destruction so that the cleanup code is injected in IL.
  • This array stackalloc optimization can use the same primitive.

@jkotas
Member

jkotas commented Feb 5, 2025

Span<int> x = new int[100];
Q(x);

If we had a malloca-like API, I think this specific example could be converted to it as an optimization in Roslyn as well.
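
Purely as an illustration of the shape (no such API exists today; the name and signature below are invented), the Roslyn lowering might target something like:

using System;

static class MallocaSketch
{
    // Invented stand-in for a malloca-like primitive: returns a span that a
    // real implementation would place on the stack when there is room and on
    // the heap (or a pool) otherwise. Plain C# can only express the heap
    // fallback; the stack path would need JIT/runtime support.
    static Span<int> AllocSpan(int length) => new int[length];

    // What "Span<int> x = new int[100]; Q(x);" could be lowered into by Roslyn.
    static void O()
    {
        Span<int> x = AllocSpan(100);
        Q(x);
    }

    static void Q(Span<int> x)
    {
        for (int i = 0; i < x.Length; i++)
            x[i] = i;
    }
}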

@hez2010
Contributor

hez2010 commented Feb 5, 2025

Is the unsafe aspect you see in the BCL's stackalloc uses the risk of an unbounded stackalloc slipping through code review?

Sometimes we may need to return a buffer to the caller, so we cannot use stackalloc. If the method manages to get inlined into its caller, this analysis may let us get rid of the heap array allocation.
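
A made-up example of that shape (illustrative names; assumes the JIT inlines the callee and can then prove non-escape):

static class InlineEscapeSketch
{
    // The callee returns a fresh array to its caller, so it cannot use
    // stackalloc itself.
    static int[] MakeDigits(int value)
    {
        int[] digits = new int[10];
        for (int i = 0; value > 0 && i < digits.Length; i++, value /= 10)
            digits[i] = value % 10;
        return digits;
    }

    // Once MakeDigits is inlined here, the array never leaves SumDigits, so
    // escape analysis could turn the heap allocation into a stack one.
    static int SumDigits(int value)
    {
        int[] digits = MakeDigits(value);
        int sum = 0;
        foreach (int d in digits)
            sum += d;
        return sum;
    }
}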

Do you have good examples in the BCL or other real-world code where this kicks in?

Some typical scenarios where this may kick in, once we have support for gcref arrays, are string.Split and Regex.Matches, etc.
