Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #351 from MatthewFluet/c-and-llvm-codegen-updates
Updates to C and LLVM codegens. Highlights: * Add `Machine.Program.rflow` to compute `{returns,raises}To` control flow (654c557) and use in `functor Chunkify` (1b3b7b8) and in Machine IR `Raise/Return` transfers (cf8e487). * Add `chunk-jump-table {false|true}` compile-time option to force generation of a jump table for the chunk switch (8e0dd2d, 5b6439b, 087a5b1). * Add `-chunk-{{must,may}-rto-self,must-rto-sing,must-rto-other}-opt` compile-time options to optimize return/raise transfers (7c10c70, 4d5abde, 4b7c649, c3b9905, 473808f) * Experiment using LLVM's `cc10` (aka, `ghccc`) calling convention (2e26ebd). * Experiment with a new `simple` chunkify strategy (3330cbe, 3d9c499, 138512f, faef164, d1df0de); generally performs about the same as `coalesce4096`, significantly improves `fib` and `tak` (for GCC), slightly improves `hamlet`, but slightly worsens `raytrace`: config command C04 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 C05 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 -chunkify simple C09 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang C10 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang -chunkify simple task_clock [email protected] (2-level) program `C05/C04` `C10/C09` barnes-hut 0.9978 0.9589 boyer 1.064 1.076 checksum 1.051 0.9775 count-graphs 1.005 0.9876 DLXSimulator 1.000 0.9905 even-odd 1.037 0.9989 fft 0.9616 0.9537 fib 0.6689 0.6260 flat-array 1.000 0.9645 hamlet 0.9547 0.9322 imp-for 1.067 1.014 knuth-bendix 1.092 1.031 lexgen 1.031 1.078 life 1.002 0.9911 logic 1.016 1.015 mandelbrot 0.9776 1.030 matrix-multiply 0.9903 0.9844 md5 1.008 0.9940 merge 0.9927 1.062 mlyacc 0.9810 1.024 model-elimination 0.9877 0.9743 mpuz 1.011 1.010 nucleic 1.036 1.030 output1 0.9943 1.021 peek 1.036 1.027 pidigits 1.000 0.9653 psdes-random 1.009 1.014 ratio-regions 0.9985 0.9881 ray 0.9738 0.9601 raytrace 1.101 1.100 simple 0.9620 0.9272 smith-normal-form 0.9690 0.9806 string-concat 0.9610 0.9772 tailfib 1.006 0.9292 tailmerge 0.9847 1.023 tak 0.8264 1.013 tensor 1.010 0.9998 tsp 0.9981 1.010 tyan 1.045 1.027 vector-rev 1.012 0.9891 vector32-concat 0.9495 1.030 vector64-concat 1.098 0.9744 vliw 0.9413 1.019 wc-input1 0.9301 1.098 wc-scanStream 1.114 0.9234 zebra 1.008 1.001 zern 0.9819 1.014 MIN 0.6689 0.6260 GMEAN 0.9940 0.9912 MAX 1.114 1.100 The `simple` chunkify strategy is not (yet) suitable for a self-compile; it can generate excessively large chunks, including one for a self-compile that requires 8min to compile by `gcc`. * Add `expect: WordX.t option` to RSSA and Machine `Switch.T` (911b5d4, e2b27ab, 695320d) and add `-gc-expect {none|false|true}` compile-time option, where `-gc-expect false` should indicate that performing a GC is cold path (823815a); no notable performance impact. * Lots of tweaks to C codegen, ultimately eliminating almost all `c-chunk.h` macros. * Eliminate unused `Machine.Operand.Contents` constructor (006269b). * Make a major refactoring of LLVM codegen (cec30c5). * Implement `Real<N>_qequal` for C codegen (9b7b2bd) and use `is{less,lessequal}` for `Real<N>_l{t,e}` for C codegen (7b55819). * Generalize LLVM type-based alias-analysis (27709ef). * Add `-llvm-aamd scope` for simple `noalias`/`alias.scope` alias-analysis metadata in LLVM codegen (b825f56); no notable performance impact. * Use C99/C11 `inline` for primitive and Basis Library functions (311331c, c864492, 4f2d213). * Add `-codegen-fuse-op-and-chk {false|true}` compile-time option to explicitly fuse adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives in the C and LLVM codegens (6b738b8, 3d1e89c, 68f8512, 82c019f, 61de560, 5363199, 0d46a85). It appears that GCC (and, to a lesser extent) Clang/LLVM do not always successfully fuse adjacent adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives. The performance results reported at #273 and #292 suggest that this does not always have significant impact, but sometimes `-codegen-fuse-op-and-chk true` can have a positive. Unfortunately, it can also have a (significant) negative impact. In `matrix-multiply` and `vector-rev`, fusing can cause GCC to not recognize that an explicit sequence index can be replaced by a stride length; in these benchmarks, it would be nice if MLton eliminated the overflow checks. config command C04 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 C09 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang C11 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 -codegen-fuse-op-and-chk true C15 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang -codegen-fuse-op-and-chk true task_clock [email protected] (2-level) program `C11/C04` `C15/C09` barnes-hut 1.005 0.9925 boyer 1.052 1.013 checksum 1.022 1.028 count-graphs 0.9722 1.002 DLXSimulator 1.004 0.9959 even-odd 0.8768 1.003 fft 0.9592 1.016 fib 0.9732 0.9798 flat-array 0.8148 1.019 hamlet 0.9966 1.030 imp-for 0.8993 0.7985 knuth-bendix 1.008 1.013 lexgen 0.9851 1.043 life 0.9954 1.006 logic 0.9994 1.014 mandelbrot 0.9440 1.013 matrix-multiply 1.336 1.009 md5 0.9604 1.007 merge 0.9675 1.037 mlyacc 1.032 1.029 model-elimination 1.010 1.004 mpuz 1.035 0.9599 nucleic 0.9938 0.9983 output1 0.9278 0.9709 peek 0.9850 1.035 pidigits 0.9702 0.9538 psdes-random 1.017 0.9986 ratio-regions 0.9801 0.9887 ray 0.9795 1.009 raytrace 0.9959 1.026 simple 0.9764 1.010 smith-normal-form 1.002 1.049 string-concat 0.7919 0.9035 tailfib 1.030 1.227 tailmerge 1.017 0.9980 tak 0.9790 0.9988 tensor 0.5258 1.000 tsp 0.9845 1.013 tyan 1.019 0.9739 vector-rev 1.178 1.253 vector32-concat 0.8703 0.9230 vector64-concat 0.8906 0.9038 vliw 0.9921 1.044 wc-input1 1.060 0.9809 wc-scanStream 0.9166 1.040 zebra 1.008 1.020 zern 1.051 1.089 MIN 0.5258 0.7985 GMEAN 0.9720 1.007 MAX 1.336 1.253 Note: the issue with `md5` mentioned in the commit messages are with respect to the `md5` benchmark before 2daaebf. Overall, this simplifies the C and LLVM codegen slightly, although there is little significant performance change: config command C02 /home/mtf/devel/mlton/builds/g89891a411/bin/mlton -codegen c -cc gcc-9 C04 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 C08 /home/mtf/devel/mlton/builds/g89891a411/bin/mlton -codegen llvm -cc clang C09 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang task_clock [email protected] (2-level) program `C04/C02` `C09/C08` barnes-hut 1.036 1.025 boyer 0.9731 1.006 checksum 0.9652 1.002 count-graphs 0.9988 0.9964 DLXSimulator 0.9970 1.023 even-odd 1.002 0.9881 fft 1.026 0.9674 fib 0.9034 0.7846 flat-array 1.014 1.021 hamlet 0.9740 1.010 imp-for 0.9707 0.9908 knuth-bendix 0.9077 0.9777 lexgen 1.048 0.8985 life 1.002 0.9827 logic 1.006 0.9867 mandelbrot 1.000 1.011 matrix-multiply 1.020 0.9957 md5 0.9700 0.9960 merge 0.9974 0.9818 mlyacc 1.003 0.9824 model-elimination 0.9936 0.9817 mpuz 0.9815 0.9466 nucleic 0.9946 1.002 output1 1.007 1.026 peek 0.9832 0.9898 pidigits 0.9950 1.047 psdes-random 1.009 0.9869 ratio-regions 0.9978 0.9725 ray 0.9938 0.9663 raytrace 0.9975 1.032 simple 0.9936 1.000 smith-normal-form 1.038 0.9941 string-concat 1.041 1.014 tailfib 0.9865 0.9741 tailmerge 1.010 1.020 tak 0.9331 0.9041 tensor 0.9938 0.9941 tsp 0.9825 1.004 tyan 0.9960 0.9879 vector-rev 1.014 0.9091 vector32-concat 1.090 0.9016 vector64-concat 0.9994 0.9800 vliw 0.9995 0.9876 wc-input1 0.9685 0.8634 wc-scanStream 1.178 1.105 zebra 0.9857 0.9900 zern 0.9733 0.9890 MIN 0.9034 0.7846 GMEAN 0.9982 0.9815 MAX 1.178 1.105
- Loading branch information