[LoopInterchange] Enable it by default (WIP) #124911

sjoerdmeijer · 2025-01-29T11:56:28Z

This is a work in progress patch to enable loop-interchange by default and is a continuation of the RFC:

https://discourse.llvm.org/t/enabling-loop-interchange/82589

Basically, we promised to fix any compile-time and correctness issues in the different components involved here (loop-interchange and dependence analaysis.) before discussing enabling interchange by default. We think are close to complete this; I would like to explain where we are and wanted to check if there are any thoughts or concerns. A quick overview of the correctness and compile-time improvements that we have made include:

Correctness:

Compile-times:

And in terms of remaining work, we think we are very close to fixing these depenence analysis issues:

The compile-time increase with a geomean increase of 0.19% looks good (after committing #124247), I think:

stage1-O3:
Benchmark
kimwitu++        +0.10%
sqlite3          +0.14%
consumer-typeset +0.07%
Bullet           +0.06%
tramp3d-v4       +0.21%
mafft            +0.39%
ClamAVi          +0.06%
lencod           +0.61%
SPASS            +0.17%
7zip             +0.08%
geomean          +0.19%

See also:
http://llvm-compile-time-tracker.com/compare.php?from=19a7fe03b4f58c4f73ea91d5e63bc4c6e61f987b&to=b24f1367d68ee675ea93ecda4939208c6b68ae4b&stat=instructions%3Au

We might want to look into lencod to see if we can improve more, but not sure it is strictly necessary.

@kasuga-fj

This is a work in progress patch to enable loop-interchange by default and is a continuation of the RFC: https://discourse.llvm.org/t/enabling-loop-interchange/82589 Basically, we promised to fix any compile-time and correctness issues in the different components involved here (loop-interchange and dependence analaysis.) before discussing enabling interchange by default. We think are close to complete this; I would like to explain where we are and wanted to check if there are any thoughts or concerns. A quick overview of the correctness and compile-time improvements that we have made include: Correctness: - [LoopInterchange] Remove 'S' Scalar Dependencies (llvm#119345) - [LoopInterchange] Fix overflow in cost calculation (llvm#111807) - [LoopInterchange] Handle LE and GE correctly (PR llvm#124901) @kasuga-fj - [DA] disambiguate evolution of base addresses (llvm#116628) Compile-times: - [LoopInterchange] Constrain number of load/stores in a loop (llvm#118973) - [LoopInterchange] Bail out early if minimum loop nest is not met (llvm#115128) - [LoopInterchange] Hoist isComputableLoopNest() in the control flow (llvm#124247) And in terms of remaining work, we think we are very close to fixing these depenence analysis issues: - [DA] do not handle array accesses of different offsets (llvm#123436) - [DA] Dependence analysis does not handle array accesses of different sizes (llvm#116630) - [DA] use NSW arithmetic llvm#116632 The compile-time increase with a geomean increase of 0.19% looks good (after committing llvm#124247), I think: stage1-O3: Benchmark kimwitu++ +0.10% sqlite3 +0.14% consumer-typeset +0.07% Bullet +0.06% tramp3d-v4 +0.21% mafft +0.39% ClamAVi +0.06% lencod +0.61% SPASS +0.17% 7zip +0.08% geomean +0.19% See also: http://llvm-compile-time-tracker.com/compare.php?from=19a7fe03b4f58c4f73ea91d5e63bc4c6e61f987b&to=b24f1367d68ee675ea93ecda4939208c6b68ae4b&stat=instructions%3Au We might want to look into lencod to see if we can improve more, but not sure it is strictly necessary.

Meinersbur · 2025-01-29T19:03:38Z

Could you provide numbers for performance changes (improvments and regressions), e.g. highlights from llvm-test-suite? Would help illustrating that we also gain something from slighlty increased compile-time.

sjoerdmeijer · 2025-01-30T14:48:48Z

Could you provide numbers for performance changes (improvments and regressions), e.g. highlights from llvm-test-suite? Would help illustrating that we also gain something from slighlty increased compile-time.

Hi @Meinersbur, thanks for the comments, that's a very fair question!

I would like to use two metrics for success here: the number of times loop-interchange triggers in the test-suite, and performance changes, with a higher weight for the former, initially. Let me explain this. We see this as a first enablement step: i.e. we have prioritised correctness and made interchange more conservative, in order to get a first version running by default. Enablement of interchange, loopcache analysis, and data dependence analysis would a big step forward. Even before we made interchange more conservative, interchange was not triggering on our motivating examples. So after this first step, we will follow up with a second optimisation step and lift restrictions to let it trigger on our examples.

The number of times interchange triggers is a good metric for the potential of the pass and how general it is. I have counted 37 occurrences of interchange if external benchmark ffmpeg is included and 22 times without. So that is really good, I think. I will paste the build log with optimisation remarks at the end of this message for more details.

About perf changes: for me a good result is if this shows no performance regressions, and maybe a few minor uplifts here and there. And the reason is what I wrote above, we very much see this as a first enablement step, and know we have to work on making it more optimal. And when I measured the llvm test-suite, this is exactly what I saw: no regression, minor uplifts (in Polybench and its solvers). But that was before the latest interchange patches went in, so I will verify and confirm this with the top of trunk. I would like to add that I have issues with the llvm test-suite for evaluating this kind of codegen changes as they may not be very visible, and I briefly talked about that in my llvm dev conference talk here (sorry for the plug, but it's a link to the exact timestamp). So llvm test-suite performance numbers need to be taken with a pinch of salt, but again, will do confirm.

test-suite-externals/ffmpeg/libavcodec/aacps.c:121:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/aacps_tablegen.h:77:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/dcaenc.c:971:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/dpx.c:700:13: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/dpx.c:665:13: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/ituh263enc.c:786:13: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/mpeg12enc.c:113:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/mpeg4videoenc.c:1176:13: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/mpegaudiodec_template.c:268:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/proresenc_anatoliy.c:525:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavcodec/xface.c:291:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavfilter/af_arnndn.c:1512:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavfilter/vf_datascope.c:1077:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavfilter/vf_datascope.c:678:9: remark: Loop interchanged
test-suite-externals/ffmpeg/libavfilter/vf_photosensitivity.c:182:9: remark: Loop interchanged
MultiSource/Applications/JM/ldecod/block.c:935:5: remark: Loop interchanged
MultiSource/Applications/JM/ldecod/block.c:1265:5: remark: Loop interchanged
MultiSource/Applications/JM/ldecod/macroblock.c:2594:5: remark: Loop interchanged
MultiSource/Applications/JM/lencod/block.c:2139:5: remark: Loop interchanged
MultiSource/Applications/JM/lencod/block.c:2860:5: remark: Loop interchanged
MultiSource/Applications/JM/lencod/block.c:3016:5: remark: Loop interchanged
MultiSource/Applications/obsequi/tables.c:224:7: remark: Loop interchanged
MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/spatscal.c:298:5: remark: Loop interchanged
MultiSource/Benchmarks/nbench/nbench1.c:283:2: remark: Loop interchanged
MultiSource/Benchmarks/nbench/nbench1.c:1906:1: remark: Loop interchanged
MultiSource/Benchmarks/nbench/nbench1.c:1907:2: remark: Loop interchanged
SingleSource/Benchmarks/Linpack/linpack-pc.c:595:17: remark: Loop interchanged
SingleSource/Benchmarks/Linpack/linpack-pc.c:1204:17: remark: Loop interchanged
SingleSource/Benchmarks/Misc/dt.c:15:5: remark: Loop interchanged
SingleSource/Benchmarks/Polybench/datamining/correlation/correlation.c:103:5: remark: Loop interchanged
SingleSource/Benchmarks/Polybench/datamining/covariance/covariance.c:81:5: remark: Loop interchanged
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/cholesky/cholesky.c:63:5: remark: Loop interchanged
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/cholesky/cholesky.c:64:7: remark: Loop interchanged
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/lu/lu.c:51:5: remark: Loop interchanged
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/lu/lu.c:52:7: remark: Loop interchanged
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/ludcmp/ludcmp.c:62:5: remark: Loop interchanged
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/ludcmp/ludcmp.c:63:7: remark: Loop interchanged

Meinersbur · 2025-01-30T15:57:37Z

llvm/lib/Passes/PassBuilderPipelines.cpp

@@ -201,7 +201,7 @@ static cl::opt<bool> RunNewGVN("enable-newgvn", cl::init(false), cl::Hidden,
                               cl::desc("Run the NewGVN pass"));

 static cl::opt<bool> EnableLoopInterchange(
-    "enable-loopinterchange", cl::init(false), cl::Hidden,
+    "enable-loopinterchange", cl::init(true), cl::Hidden,


Consider to also add switches to the clang driver like -fvectorize/-fno-vectorze and -funroll-loops/fno-unroll-loops. GCC's is called -floop-interchange.

Thanks for the suggestion, agree that this would be convenient to have. I will create a separate patch for this.

mshockwave · 2025-01-30T19:11:57Z

Has #116144 (correctness issue) been addressed?

kasuga-fj · 2025-01-31T07:02:29Z

Has #116144 (correctness issue) been addressed?

It would have been fixed with #119345 . As far as I know, there are no known correctness issues on the LoopInterchange side.

sjoerdmeijer · 2025-01-31T10:40:16Z

Has #116144 (correctness issue) been addressed?

It would have been fixed with #119345 . As far as I know, there are no known correctness issues on the LoopInterchange side.

Correct, should have been fixed.
In addition, I have just closed the following bug reports that had the same underlying issue as that one:

sjoerdmeijer · 2025-01-31T13:47:30Z

Re: perf number, @Meinersbur : I indeed see no regressions and 3 improvements of ~70% that are real, I think. There are a couple of minor improvements, but that could be real or noise, but didn't look further into it. I did 3 runs with interchange disabled ("before") and 3 runs with interchange enabled ("after"), and lower is better:

Program                                                 exec_time
                                                        before1   before2 before3 after1 after2 after3 diff
Polybench/linear-algebra/solvers/cholesky/cholesky.test  11.48     11.34   11.20    6.60   6.58   6.63 74.5%
    Polybench/linear-algebra/solvers/ludcmp/ludcmp.test  23.56     23.57   23.55   14.16  14.13  14.08 67.5%
            Polybench/linear-algebra/solvers/lu/lu.test  23.60     23.48   23.69   14.25  14.17  14.21 67.2%

This introduces options -floop-interchange and -fno-loop-interchange to enable/disable the loop-interchange pass. This is part of the work that tries to get that pass enabled by default (llvm#124911), where it was remarked that a user facing option to control this would be convenient to have. The option (name) is the same as GCC's.

kasuga-fj · 2025-02-06T07:27:05Z

I indeed see no regressions and 3 improvements of ~70% that are real, I think. There are a couple of minor improvements, but that could be real or noise, but didn't look further into it. I did 3 runs with interchange disabled ("before") and 3 runs with interchange enabled ("after"), and lower is better:

I tried these three cases on my local and couldn't reproduce this result. I investigated and found that the loops aren't interchanged in all the three cases. I assume that all the loops that were interchanged in your experiment are the same one that initializes the array B in the init_array function, for example the following one in cholesky.c.

https://github.com/llvm/llvm-test-suite/blob/94ca4917f6b02da34902b65b26240c5729d55bcf/SingleSource/Benchmarks/Polybench/linear-algebra/solvers/cholesky/cholesky.c#L62-L65

This loop has S dependence for the outermost loop so that it used to be able to interchange, but not anymore. My guess is that you are using an older version of the compiler that does not have the patches listed in this PR description. With future improvements, these loops should become interchangeable again. I think we should address this issue after the interchange is enabled by default.

This introduces options -floop-interchange and -fno-loop-interchange to enable/disable the loop-interchange pass. This is part of the work that tries to get that pass enabled by default (llvm#124911), where it was remarked that a user facing option to control this would be convenient to have. The option (name) is the same as GCC's.

This introduces options `-floop-interchange` and `-fno-loop-interchange` to enable/disable the loop-interchange pass. This is part of the work that tries to get that pass enabled by default (#124911), where it was remarked that a user facing option to control this would be convenient to have. The option name is the same as GCC's.

This introduces options `-floop-interchange` and `-fno-loop-interchange` to enable/disable the loop-interchange pass. This is part of the work that tries to get that pass enabled by default (llvm#124911), where it was remarked that a user facing option to control this would be convenient to have. The option name is the same as GCC's.

sjoerdmeijer · 2025-02-24T10:53:28Z

This loop has S dependence for the outermost loop so that it used to be able to interchange, but not anymore. My guess is that you are using an older version of the compiler that does not have the patches listed in this PR description. With future improvements, these loops should become interchangeable again. I think we should address this issue after the interchange is enabled by default.

I've checked and you're right. I used a slightly older build, but with all the correctness issues addressed it indeed doesn't trigger anymore on cholesky, lu and ludcmp. It still triggers on other polybench kernels (correlation and covariance), but due to their extremely low runtime and/or isn't on the hot path, doesn't show improvements. I think this indeed shows the potential of interchange, and is something I would like to address after enabling this by default.

kasuga-fj · 2025-02-25T06:39:10Z

I think this indeed shows the potential of interchange

I agree with you.

sjoerdmeijer requested review from nikic, sebpop, Meinersbur, kiranchandramohan, madhur13490, CongzheUalberta and kasuga-fj January 29, 2025 11:56

Meinersbur reviewed Jan 30, 2025

View reviewed changes

sjoerdmeijer mentioned this pull request Feb 5, 2025

[Clang][Driver] Add an option to control loop-interchange #125830

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[LoopInterchange] Enable it by default (WIP) #124911

[LoopInterchange] Enable it by default (WIP) #124911

sjoerdmeijer commented Jan 29, 2025 •

edited by madhur13490

Loading

Meinersbur commented Jan 29, 2025

sjoerdmeijer commented Jan 30, 2025

Meinersbur Jan 30, 2025

sjoerdmeijer Jan 31, 2025

mshockwave commented Jan 30, 2025

kasuga-fj commented Jan 31, 2025

sjoerdmeijer commented Jan 31, 2025

sjoerdmeijer commented Jan 31, 2025

kasuga-fj commented Feb 6, 2025

sjoerdmeijer commented Feb 24, 2025

kasuga-fj commented Feb 25, 2025

[LoopInterchange] Enable it by default (WIP) #124911

Are you sure you want to change the base?

[LoopInterchange] Enable it by default (WIP) #124911

Conversation

sjoerdmeijer commented Jan 29, 2025 • edited by madhur13490 Loading

Meinersbur commented Jan 29, 2025

sjoerdmeijer commented Jan 30, 2025

Meinersbur Jan 30, 2025

Choose a reason for hiding this comment

sjoerdmeijer Jan 31, 2025

Choose a reason for hiding this comment

mshockwave commented Jan 30, 2025

kasuga-fj commented Jan 31, 2025

sjoerdmeijer commented Jan 31, 2025

sjoerdmeijer commented Jan 31, 2025

kasuga-fj commented Feb 6, 2025

sjoerdmeijer commented Feb 24, 2025

kasuga-fj commented Feb 25, 2025

sjoerdmeijer commented Jan 29, 2025 •

edited by madhur13490

Loading