JIT: Make BB_UNITY_WEIGHT 1.0 #112151

Open · wants to merge 4 commits into main
Conversation

amanasifkhalid (Member)

Part of #107749. Change BB_UNITY_WEIGHT and friends from 100 to 1 to avoid scaling normalized block weights up unnecessarily. I wanted this to be a no-diff change, but I could only get so far; I implemented the following quirks to keep diffs down:

  • BB_COLD_WEIGHT has also been reduced by 100x, as our layout optimizations were using this threshold with BB_UNITY_WEIGHT factored in. In a follow-up PR, I think we should increase it back to 0.01: a block is sufficiently cold if its normalized weight suggests it executes only 1% of the time per method invocation. Increasing the amount of code we consider cold should also lessen the amount of work 3-opt needs to do.
  • Changing BB_UNITY_WEIGHT churned CSE significantly, and I think it's because we aren't careful to avoid mixing normalized weights and raw counts during cost/benefit analysis. I've added some scaling quirks to avoid these diffs for now. I might be misunderstanding the utility of this distinction, but I wonder if we can unify weighted and non-weighted counts (LclVarDsc::m_lvRefCnt and LclVarDsc::m_lvRefCntWtd, CSEdsc::csdDefCount and CSEdsc::csdDefWtCnt, etc) after this goes in.
  • On an unrelated note, I noticed some places where we try to normalize a summed weight by dividing it by the entry block's weight. This should be unnecessary, since BasicBlock::getBBWeight already normalizes with fgCalledCount, and it is wrong in cases where the entry block is reachable via loop backedges. I'll try fixing these in a follow-up PR. (See the sketch after this list.)
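
To make the normalization concrete, here is a minimal sketch in C++. getBBWeight, fgCalledCount, and BB_UNITY_WEIGHT are real names from src/coreclr/jit, but the function bodies below are illustrative assumptions, not the JIT's actual code:

```cpp
typedef double weight_t;

const weight_t BB_UNITY_WEIGHT = 1.0; // was 100.0 before this change

// getBBWeight-style normalization: express a block's raw profile count as a
// multiple of the method's entry count (fgCalledCount).
weight_t getNormalizedWeight(weight_t rawCount, weight_t fgCalledCount)
{
    if (fgCalledCount == 0.0)
    {
        return 0.0;
    }
    // With BB_UNITY_WEIGHT == 1.0, the result reads directly as "executions
    // per call"; with 100.0, every normalized weight was scaled up by 100x.
    return (rawCount / fgCalledCount) * BB_UNITY_WEIGHT;
}

// The problem noted in the last bullet: dividing an already-normalized sum by
// the entry block's weight normalizes twice, and the entry block's weight can
// differ from fgCalledCount when the entry is reachable via loop backedges.
weight_t doublyNormalized(weight_t normalizedSum, weight_t entryBlockWeight)
{
    return normalizedSum / entryBlockWeight; // redundant, and wrong for loop heads
}
```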

With these quirks in place, I'm seeing sporadic diffs where floating-point imprecision in block weights manifests as different decisions. For example, we sometimes CSE more or less aggressively because a score lands close to the aggressive threshold, or we churn layout because blocks that were almost cold are now considered cold, etc. This churn seems unavoidable unless/until we expand our usage of profile helpers (Compiler::fgProfileWeightsEqual) to compare weights.
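
For reference, a tolerant comparison along the lines of Compiler::fgProfileWeightsEqual might look like the sketch below; the epsilon value and exact implementation are assumptions, not the JIT's actual code:

```cpp
#include <cmath>

typedef double weight_t;

// Treat weights within epsilon of each other as equal, so tiny floating-point
// churn in block weights doesn't flip heuristic decisions (e.g., a CSE score
// sitting right at the "aggressive" threshold).
bool profileWeightsEqual(weight_t weight1, weight_t weight2, weight_t epsilon = 0.01)
{
    return fabs(weight1 - weight2) <= epsilon;
}
```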

dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Feb 4, 2025
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

amanasifkhalid marked this pull request as ready for review on February 4, 2025, 21:40
amanasifkhalid (Member Author)

/azp run runtime-coreclr libraries-pgo, runtime-coreclr libraries-jitstress


Azure Pipelines successfully started running 2 pipeline(s).

AndyAyersMS (Member)

> I wonder if we can unify weighted and non-weighted counts

These typically mean two different things: unweighted means "how many", and weighted means "how much" or "how frequent". There really should be some kind of explicit scale factor when they're combined.

I recall being annoyed that CSE mixes the two somewhat willy-nilly (2d in #92915 (comment))
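
To illustrate the "explicit scale factor" point, here is a hypothetical sketch; all names below are invented for illustration (compare the real CSEdsc::csdDefCount and CSEdsc::csdDefWtCnt mentioned above), and this is not CSE's actual scoring code:

```cpp
typedef double weight_t;

// Hypothetical candidate record carrying both kinds of counts.
struct CseCandidate
{
    unsigned useCount;   // "how many": static uses in the IR
    weight_t useWtCount; // "how frequent": execution-weighted uses
};

// Combine the two through an explicit, named scale factor instead of adding
// them directly; an implicit mix silently changes meaning whenever the weight
// normalization (e.g., BB_UNITY_WEIGHT) changes.
weight_t scoreCandidate(const CseCandidate& cand, weight_t staticUseScale)
{
    return ((weight_t)cand.useCount * staticUseScale) + cand.useWtCount;
}
```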

amanasifkhalid (Member Author) commented on Feb 5, 2025

libraries-pgo failures are #112196 and #111922. libraries-jitstress failure looks unrelated. Build Analysis is blocked by build timeouts.

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs are quite a bit larger on win-x64 and win-x86, with the former concentrated in libraries.pmi and the latter in libraries.pmi and benchmarks.run_pgo. The jit-analyze summaries of these collections reveal some duplicate methods with diffs, which may be inflating the total churn. Looking at the example diffs, I'm seeing the same patterns I noted above:

Minuscule changes in block weights churning LSRA/layout,

 ;  V00 arg0         [V00,T02] (  3,  6   )   byref  ->  rbx         single-def
 ;  V01 arg1         [V01,T06] (  7,  6   )  double  ->  mm6         ld-addr-op single-def
-;  V02 arg2         [V02,T07] (  7,  5.00)  double  ->  [rsp+0x60]  ld-addr-op single-def
+;  V02 arg2         [V02,T07] (  7,  5   )  double  ->  mm7         ld-addr-op single-def

or more/fewer CSEs.

 ;* V07 tmp5         [V07    ] (  0,  0   )   ubyte  ->  zero-ref    "Inlining Arg"
 ;  V08 tmp6         [V08,T03] (  2,  1.33)   byref  ->  edx         single-def "argument with side effect"
+;  V09 cse0         [V09,T04] (  3,  1   )     int  ->  eax         "CSE #01: moderate"

Thanks!

amanasifkhalid (Member Author)

> These typically mean two different things, eg unweighted means "how many" and weighted means "how much" or "how frequent". There really should be some kind of explicit scale factor when they're combined.

I see; I suppose the removal of BB_UNITY_WEIGHT scaling has made it easier to spot where we mix these two up. We switch between weighted and unweighted counts to drive heuristics depending on our optimization goals -- if we're prioritizing size over speed, then we ought to chase the CSEs with the most uses rather than the hottest ones -- so we'd lose something tangible by collapsing them into one.
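
A simplified sketch of that size-versus-speed selection; the selector below is hypothetical, though LclVarDsc really does track both counts (m_lvRefCnt and m_lvRefCntWtd, as noted above):

```cpp
typedef double weight_t;

// Hypothetical mirror of the two ref counts a local variable carries.
struct LocalVarCounts
{
    unsigned refCnt;    // static reference count: drives size-oriented decisions
    weight_t refCntWtd; // frequency-weighted count: drives speed-oriented decisions
};

// When optimizing for size, a hot use matters no more than a cold one, so the
// raw count is the right signal; otherwise prefer the weighted count.
weight_t refCountForHeuristic(const LocalVarCounts& counts, bool optimizeForSize)
{
    return optimizeForSize ? (weight_t)counts.refCnt : counts.refCntWtd;
}
```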
