-
Notifications
You must be signed in to change notification settings - Fork 127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new overflow-checking primitives #273
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Previously, the `!` family of primitive operators (`+!` etc.) was an operation that always checked for overflow, whereas the `?` family never checked for overflow. However, as a first step towards transitioning to a model wherein the `?` family is a predicate that checks for overflow, here the semantics are flipped such that the `!` family never checks for overflow, and `?` always does.
This change introduces a new family of `$` operators, which always check their arguments for overflow and raise an appropriate exception. The `?` family of operators is changed to detect overflow (returning a `Bool`), and the `$` family is defined in terms of the corresponding `?` and `!` operators. Previous uses of `?` in the basis library are also corrected to use the new system.
Previously the old primitives were simply deleted; this commit introduces a compile-time switch (`MLton.newOverflow`) to switch between the old and new systems. This should allow for easier development and eventual integration of the new primitives, which can be incrementally implemented on the backend while keeping the old ones intact.
A previous change introduced the use of GCC builtins for overflow checking with the `newOverflow` flavor of checked primitive operations. However, these builtins are compatible only with GCC versions 7 and above, so to improve backwards compatibility a compatibility layer is added using the older builtins for legacy GCC versions. In addition, since much of the code for the new and old primitive operations was shared, this shared code is pulled into the compatibility layer (provided as `_overflow_b` macros).
Previously, the new overflow-checking primitives were implemented by first checking for overflow and then performing the operation. This change refactors the implementation to instead perform and store the operation first, then check for overflow. This will allow the peephole optimization to be more easily implemented for the new overflow primitives in the native codegens.
The native codegens (amd64, x86) previously used the pmd and imul2 instructions incorrectly to check for overflow (based erroneously on the pattern of the corresponding arithmetic operations). This fix changes the implementation to use `pmdcc` for all unsigned operations.
The existing redundantTests code consisted of a map over statements and a separate check for the presence of an arith transfer for the old overflow-checking primitives. Initially an additional map over statements was added to deal with the new overflow-checking primitives; this change merges the two maps into one.
Adds a new peephole optimization pass for the native codegens to eliminate redundant AL instructions. In general, whenever a checked word operation is generated, the `Word_*` and `Word_*CheckP` operations generate distinct assembly instructions. However, only one of these instructions is actually needed, since both perform the same operation and set the same overflow flag, so the generated assembly for a `Word_*CheckP` immediately following an identical `Word_*` operation can be safely removed. Currently, blocks where one `mov` instruction has been hoisted to the front of the block are also recognized.
Previously the `elimALRedundant` optimization was missing several cases because it occurred after the `moveHoist` optimization. Originally, an additional case was added to the `elimALRedundant` optimization to handle cases where one move had been hoisted immediately before the corresponding arithmetic operations, but this did not account for cases where one or both moves were hoisted elsewhere in the block. This change introduces a new group of optimizations in the `PeepholeLivenessBlock` section which execute before `moveHoist`. In the future additional optimizations may be added, but at present only the `elimALRedundant` optimization occurs in this group. In addition, the `elimALRedundant` was missing a branch to check for imul2 operations, which has now been added.
Previously, some redundant `mulCheckP` operations were missed in the main peephole optimization pass due to a previous multiplication of a variable by 2 being replaced by an addition of that variable to itself. This change adds a new peephole optimization to catch such operations.
Like `Word_mulCheck`, `Word_mulCheckP` should not use the `@llvm.{s,u}mul.with.overflow.i64` intrinsics, because they compile to calls to `__mulodi4` and `__udivdi3`, which are provided by compiler-rt and not always by libgcc.
The `elimALRedundant_pseudo` peephole optimization was removed from the amd64 codegen by 56d4362, but should also be removed from the x86 codegen.
Commit f9e1cb6 renamed the `Word_*CheckOverflows` C functions to `Word_*CheckP` C functions.
Addition, subtraction, and negation of 64bit operands on a 32bit platform must first add/sub the low bits and then addc/subb the high bits in order to correctly set the overflow flag. However, this does not yet pass all the regression tests: testing fixed-integer 15a16,21 > Int64: abs ~9223372036854775807 = Overflow <> 9223372036854775807 > Int64: ~ ~9223372036854775807 = Overflow <> 9223372036854775807 > Int64: fromString o toString 9223372036854775807 = Overflow <> 9223372036854775807 > Int64: fromString o toString 9223372036854775806 = Overflow <> 9223372036854775806 > unhandled exception: Overflow > Nonzero exit status. fixed-integer: difference with -type-check true -const MLton.newOverflow true
The `elimDeadDsts` peephole optimization attempts to eliminate instructions whose destinations are dead. There was limited code for preserving instructions that are immediately followed by a `setcc` instruction or a `if` transfer. However, it did not preserve instructions that were immediately followed by an `adc` or `sbb` instruction. This would lead to mis-compilation of `Word64_{add,neg,sub}CheckP` primitives. For example, a `Word64_addCheckP` could initially be translated to: movl src1Lo,tmpLo movl src1Hi,tmpHi addl src2Lo,tmpLo adcl src2Hi,tmpHi seto dst The "tmpLo" destination in the "addl src2Lo,tmpLo" instruction is dead and so the instruction would be eliminated; in turn, the "tmpLo" destination in the "movl src1Lo,tmpLo" instruction is dead and so the instruction would be eliminated. However, although the "tmpHi" destination in the "adcl src2Hi,tmpHi" instruction is dead, the instruction is preserved because of the subsequent "seto" instruction. The end result would be: movl src1Hi,tmpHi adcl src2Hi,tmpHi seto dst which fails to set "dst" appropriately. Now, in addition to preserving instructions that are immediately followed by a `setcc` instruction or an `if` transfer, an instruction immediately preserved by a `adc` or `sbb` instruction is preserved.
A `mov` instruction does not change the condition code flags, so continue looking for flag-dependent instructions past a `mov` instruction.
Previously, a `Word64_negCheckP` was initially translated to: movl $0,tmp subl srcLo,tmp movl $0,tmp sbbl srcHi,tmp seto dst With commits 97315a4 and 3a0a42d, the "subl srcLo,tmp" instruction is not eliminated by the `elimDeadDsts` peephole optimization. However, register allocation unconditionally rewrites `movl $0,dst` as `xorl dst,dst`, leaving code like: xorl tmp,tmp subl srcLo,tmp xorl tmp,tmp sbbl srcHi,tmp However, while a `mov` preserves the condition code flags, a `xorl` instruction clears the CF flag, which leads to incorrect flags being used and set by the `sbb` instruction. Similarly, a `Word64_addCheckP` was initially translated to: movl src1Lo,tmp addl src2Lo,tmp movl src1Hi,tmp adcl src2Hi,tmp seto dst which, if src1Hi were 0, would result in movl src1Lo,tmp addl src2Lo,tmp xorl tmp,tmp adcl src2Hi,tmp which leads to incorrect flags being used and set by the `adc` instruction. The existing 64-bit primitives avoid this problem by performing all of the `mov` instructions before the arithmetic instruction. Now, a `Word64_negCheckP` primitive is initially translated to: movl $0,tmpLo movl $0,tmpHi subl srcLo,tmpLo sbbl srcHi,tmpHi seto dst which can be correctly rewritten to: xorl tmpLo,tmpLo xorl tmpHi,tmpHi subl srcLo,tmpLo sbbl srcHi,tmpHi seto dst and a `Word64_addCheckP` primitive is initially translated to: movl src1Lo,tmpLo movl src1Hi,tmpHi addl src2Lo,tmpLo adcl src2Hi,tmpHi seto dst which, if src1Hi were 0, can be correctly rewritten to: movl src1Lo,tmpLo xorl tmpHi,tmpHi addl src2Lo,tmpLo adcl src2Hi,tmpHi
MatthewFluet
added a commit
that referenced
this pull request
Oct 15, 2018
Add new overflow-checking primitives Motivation MLton currently implements checked arithmetic with specialized primitives. These primitives are exposed to the Basis Library implementation as functions that implicitly raise a primitive exception: val +! = _prim "WordS8_addCheck": int * int -> int; In the XML IR, special care is taken to "remember" the primive exception associated with these primitives in order to implement exceptions correctly. In the SSA, SSA2, RSSA, and Machine IRs, these primitives are treated as transfers, rather than statements. This pull request implements a possibly better implementation of these operations as simple primitives that return a boolean: val +! = _prim "Word8_add": int * int -> int; val +? = _prim "WordS8_addCheckP": int * int -> bool; val +$ = fn (x, y) => let val r = +! (x, y) in if +? (x, y) then raise Overflow else r end This would eliminate the special cases in the XML IR and in the SSA, SSA2, RSSA, and Machine IRs, where the primitives would be treated as statements. Other compilers provide overflow checking via boolean-returning functions: * https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html * https://llvm.org/docs/LangRef.html#arithmetic-with-overflow-intrinsics Implementation This patch refactors the primitive checked-arithmetic operations such that the suffix `?` represents a new overflow-checking predicate, a suffix `!` represents the non-overflow-checking variant, and a suffix `$` represents the overflow-checking variant (mnemonic: `$` for "safe" or "expensive"). The behavior of the new `$`-operations is controlled by a compile-time constant, `MLton.newOverflow`. When set to false (the default), the `$`-operations make use of the old-style implicit-`Overflow` primitives. When set to true, the `$`-operations are implemented as an `if`-expression that branches on the result of the corresponding `?`-operation and either raises the `Overflow` exception or returns the result of the corresponding `!`-operation. Finally, the bare operation is aliased to either the `$`-form (with overflow detection enabled) or the `!`-form (with overflow detection disabled). Essentially: val +! = _prim "Word8_add": int * int -> int; val +? = _prim "WordS8_addCheckP": int * int -> bool; val +$ = if MLton.newOverflow then fn (x, y) => let val r = +! (x, y) in if +? (x, y) then raise Overflow else r end else fn (x, y) => ((_prim "WordS8_addCheckP": int * int -> int;) (x, y)) handle PrimOverflow => raise Overflow val + = if MLton.detectOverflow then +$ else +! Note that the checked-arithmetic using `!`- and `$`-operations is careful to perform the `!`-operation before the `$`-operation. With the native-codegens, a new peephole optimization combines the separate unchecked-arithmetic operation and checked-arithmetic-predicate operation into a single instruction. For the C-codgen, the new checked-arithmetic-predicate primitives are translated to uses of the `__builtin_{add,sub,mul}_overflow` intrinsics (which improves upon the previous explicit conditional checking, but requires gcc 5 or greater). Similarly, for the LLVM-codgen, the new checked-arithmetic-predicate primitives are translated to uses of the `{sadd,uadd,smul,umul,ssub,usub}.with.overflow` intrinsics. For both the C- and LLVM-codegens, it is expected that these intrinsics will be combined with the separate unchecked-arithmetic operation. In addition, the `RedundantTests` optimization has been extended to eliminate the overflow test when adding or subtracting 1 with the new primitives. Performance The native-codegen peephole optimization and `RedundantTests` have been mostly sufficient to keep performance on par with the older checked-arithmetic primitives, and in some cases performance has even significantly improved. Below is a summary of the exceptional runtime ratios in the different codegens (both positive and negative): | Benchmark | Native | LLVM | C | |-----------------|--------|------|------| | even-odd | 1.00 | 1.00 | 1.09 | | hamlet | 0.98 | 0.90 | 0.93 | | imp-for | 0.99 | 1.50 | 0.46 | | lexgen | 0.92 | 1.31 | 1.24 | | matrix-multiply | 0.99 | 1.00 | 0.87 | | md5 | 1.06 | 1.01 | 0.97 | | tensor | 1.01 | 1.00 | 0.57 | No benchmarks were consistently worse or better on all codegen, e.g., the `imp-for` benchmark performed exceptionally badly on the LLVM codegen, but was much faster on the C codegen and about even on the native codegen. For this particular benchmark, the cause of the slowdown with the LLVM codegen has yet to be discovered. Similarly, the cause of the slowdown in `lexgen` with the C- and LLVM-codegens is unknown. For the `md5` benchmark, on the other hand, the cause of the slowdown with the native codegen seems to be a failure to eliminate common subexpressions in certain circumstances, which can potentially be improved in the future. Improvements in the C-codegen are likely to be due to the better overflow checking with builtins. Future work The `CommonSubexp` optimization currently handles `Arith` transfers specially; in particular, the result of an `Arith` transfer can be used in common `Arith` transfers that it dominates. This was done so that code like: (n + m) + (n + m) can be transformed to let val r = n + m in r + r end With the new checked-arithmetic-predicate primitives, the computation of the boolean value may be common-subexpr eliminated, but `Case` discrimination will not. This forces the boolean value to be reified and to be discriminated multiple times (though, perhaps `KnownCase` could eliminate subsequent discriminations). Extending `CommonSubexpr` to propagate flow-sensitive relationships at `Case` transfers to the dominated targets could improve the performance `md5` with `MLton.newOverflow true` and potentially improve performance elsewhere as well (e.g., by propagating the results of comparisons as well). Once all performance slowdowns with `MLton.newOverflow true` have been eliminated, it would be desirable to remove the old-style implicit-`Overflow` primitives and `Arith` transfers. This would eliminate many previous instances of special-case code to handle these primitives and transfers. Finally, it may be worth investigating an implementation of the checked-operations via val +$ = fn (x, y) => let val b = +? (x, y) val r = +! (x, y) in if b then raise Overflow else r end rather than val +$ = fn (x, y) => let val r = +! (x, y) in if +? (x, y) then raise Overflow else r end The advantage of calculating the boolean first is that when `x` (or `y`) is a loop variable and `r` will be passed as the value for the next iteration, then both `x` and `r` could be assigned the same location (pseudo-register or stack slot). `x` and `r` cannot share the same location when the boolean is calculated second, because `x` and `r` are both live at the calculation of the boolean. See #218. This would require reworking the native-codegen peephole optimization.
MatthewFluet
added a commit
that referenced
this pull request
Mar 21, 2019
Remove old-style arithmetic primitives In #273, MLton was extended to support overflow-checking primitives for arithmetic operations. This allowed checked arithmetic operations to be encoded as an unchecked arithmetic operation followed by an overflow-checking operation followed by an `if` that either raises the `Overflow` exception or returns the arithmetic result. However, the overflow primitives (that implicitly raise a special `PrimOverflow` exception in the early ILs and are used in `Arith` transfers in the late ILs) remained in the compiler. This pull request removes the old overflow primitives entirely, along with the special-case code required to support them.
MatthewFluet
added a commit
to MatthewFluet/mlton
that referenced
this pull request
Nov 5, 2019
It appears that GCC (and, to a lesser extent) Clang/LLVM do not always successfully fuse adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives. The performance results reported at MLton#273 and MLton#292 suggest that this does not always have significant impact, but a close look at the `md5` benchmark shows that the native codegen significantly outperforms the C codegen with gcc-9 due to redundant arithmetic computations (one for `Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`). This flag will be used to enable explicit fusing of adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives in the codegens.
MatthewFluet
added a commit
to MatthewFluet/mlton
that referenced
this pull request
Nov 5, 2019
It appears that GCC (and, to a lesser extent) Clang/LLVM do not always successfully fuse adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives. The performance results reported at MLton#273 and MLton#292 suggest that this does not always have significant impact, but a close look at the `md5` benchmark shows that the native codegen significantly outperforms the C codegen with gcc-9 due to redundant arithmetic computations (one for `Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`). These functions compute both the arithmetic result and a boolean indicating overflow (using `__builtin_<op>_overflow`). They will be used for explicit fusing of adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives in the C codegen for `-codegen-fuse-op-and-check true`.
MatthewFluet
added a commit
to MatthewFluet/mlton
that referenced
this pull request
Nov 5, 2019
It appears that GCC (and, to a lesser extent) Clang/LLVM do not always successfully fuse adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives. The performance results reported at MLton#273 and MLton#292 suggest that this does not always have significant impact, but a close look at the `md5` benchmark shows that the native codegen significantly outperforms the C codegen with gcc-9 due to redundant arithmetic computations (one for `Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`). (Note: Because the final md5 state is not used by the `md5` benchmark program, MLton actually optimizes out most of the md5 computation. What is left is a lot of arithmetic from `PackWord32Little.subVec` to check for indices that should raise `Subscript`.) For example, with `-codegen-fuse-op-and-check false` and gcc-9, the `transform` function of `md5` has the following assembly: movl %r9d, %r10d subl $1, %r10d jo .L650 leal -1(%r8), %r10d movl %r10d, %r12d addl %r10d, %edx jo .L650 addl %r10d, %r11d cmpl %eax, %r11d jnb .L656 movl %ebp, %edx addl $1, %edx jo .L659 leal 1(%rcx), %edx movl %edx, %r11d imull %r9d, %r11d jo .L650 imull %r8d, %edx movl %edx, %r11d addl %r10d, %r11d jo .L650 leal (%rdx,%r10), %r11d cmpl %eax, %r11d jnb .L665 What seems to have happened is that gcc has arranged for equivalent values to be in `%r8` and `%r9`. In the first three lines, there is an implementation of `WordS32_subCheckP (X, 1)` using `subl/jo`, while in the fourth line, there is an implementation of `Word32_sub (X, 1)` using `lea` with an offset of `-1`. Notice that `%r10` is used for the result of both, so the fourth line is redundant (the value is already in `%r10`). On the other hand, with `-codegen-fuse-op-and-check true` and gcc-9, the `transform` function of `md5` has the following assembly: movl %r8d, %r9d subl $1, %r9d jo .L645 addl %r9d, %ecx jo .L645 cmpl %edx, %ecx jnb .L651 movl %eax, %ecx addl $1, %ecx jo .L654 imull %r8d, %ecx jo .L645 addl %r9d, %ecx jo .L645 cmpl %edx, %ecx jnb .L660
MatthewFluet
added a commit
that referenced
this pull request
Nov 22, 2019
Updates to C and LLVM codegens. Highlights: * Add `Machine.Program.rflow` to compute `{returns,raises}To` control flow (654c557) and use in `functor Chunkify` (1b3b7b8) and in Machine IR `Raise/Return` transfers (cf8e487). * Add `chunk-jump-table {false|true}` compile-time option to force generation of a jump table for the chunk switch (8e0dd2d, 5b6439b, 087a5b1). * Add `-chunk-{{must,may}-rto-self,must-rto-sing,must-rto-other}-opt` compile-time options to optimize return/raise transfers (7c10c70, 4d5abde, 4b7c649, c3b9905, 473808f) * Experiment using LLVM's `cc10` (aka, `ghccc`) calling convention (2e26ebd). * Experiment with a new `simple` chunkify strategy (3330cbe, 3d9c499, 138512f, faef164, d1df0de); generally performs about the same as `coalesce4096`, significantly improves `fib` and `tak` (for GCC), slightly improves `hamlet`, but slightly worsens `raytrace`: config command C04 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 C05 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 -chunkify simple C09 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang C10 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang -chunkify simple task_clock [email protected] (2-level) program `C05/C04` `C10/C09` barnes-hut 0.9978 0.9589 boyer 1.064 1.076 checksum 1.051 0.9775 count-graphs 1.005 0.9876 DLXSimulator 1.000 0.9905 even-odd 1.037 0.9989 fft 0.9616 0.9537 fib 0.6689 0.6260 flat-array 1.000 0.9645 hamlet 0.9547 0.9322 imp-for 1.067 1.014 knuth-bendix 1.092 1.031 lexgen 1.031 1.078 life 1.002 0.9911 logic 1.016 1.015 mandelbrot 0.9776 1.030 matrix-multiply 0.9903 0.9844 md5 1.008 0.9940 merge 0.9927 1.062 mlyacc 0.9810 1.024 model-elimination 0.9877 0.9743 mpuz 1.011 1.010 nucleic 1.036 1.030 output1 0.9943 1.021 peek 1.036 1.027 pidigits 1.000 0.9653 psdes-random 1.009 1.014 ratio-regions 0.9985 0.9881 ray 0.9738 0.9601 raytrace 1.101 1.100 simple 0.9620 0.9272 smith-normal-form 0.9690 0.9806 string-concat 0.9610 0.9772 tailfib 1.006 0.9292 tailmerge 0.9847 1.023 tak 0.8264 1.013 tensor 1.010 0.9998 tsp 0.9981 1.010 tyan 1.045 1.027 vector-rev 1.012 0.9891 vector32-concat 0.9495 1.030 vector64-concat 1.098 0.9744 vliw 0.9413 1.019 wc-input1 0.9301 1.098 wc-scanStream 1.114 0.9234 zebra 1.008 1.001 zern 0.9819 1.014 MIN 0.6689 0.6260 GMEAN 0.9940 0.9912 MAX 1.114 1.100 The `simple` chunkify strategy is not (yet) suitable for a self-compile; it can generate excessively large chunks, including one for a self-compile that requires 8min to compile by `gcc`. * Add `expect: WordX.t option` to RSSA and Machine `Switch.T` (911b5d4, e2b27ab, 695320d) and add `-gc-expect {none|false|true}` compile-time option, where `-gc-expect false` should indicate that performing a GC is cold path (823815a); no notable performance impact. * Lots of tweaks to C codegen, ultimately eliminating almost all `c-chunk.h` macros. * Eliminate unused `Machine.Operand.Contents` constructor (006269b). * Make a major refactoring of LLVM codegen (cec30c5). * Implement `Real<N>_qequal` for C codegen (9b7b2bd) and use `is{less,lessequal}` for `Real<N>_l{t,e}` for C codegen (7b55819). * Generalize LLVM type-based alias-analysis (27709ef). * Add `-llvm-aamd scope` for simple `noalias`/`alias.scope` alias-analysis metadata in LLVM codegen (b825f56); no notable performance impact. * Use C99/C11 `inline` for primitive and Basis Library functions (311331c, c864492, 4f2d213). * Add `-codegen-fuse-op-and-chk {false|true}` compile-time option to explicitly fuse adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives in the C and LLVM codegens (6b738b8, 3d1e89c, 68f8512, 82c019f, 61de560, 5363199, 0d46a85). It appears that GCC (and, to a lesser extent) Clang/LLVM do not always successfully fuse adjacent adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives. The performance results reported at #273 and #292 suggest that this does not always have significant impact, but sometimes `-codegen-fuse-op-and-chk true` can have a positive. Unfortunately, it can also have a (significant) negative impact. In `matrix-multiply` and `vector-rev`, fusing can cause GCC to not recognize that an explicit sequence index can be replaced by a stride length; in these benchmarks, it would be nice if MLton eliminated the overflow checks. config command C04 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 C09 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang C11 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 -codegen-fuse-op-and-chk true C15 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang -codegen-fuse-op-and-chk true task_clock [email protected] (2-level) program `C11/C04` `C15/C09` barnes-hut 1.005 0.9925 boyer 1.052 1.013 checksum 1.022 1.028 count-graphs 0.9722 1.002 DLXSimulator 1.004 0.9959 even-odd 0.8768 1.003 fft 0.9592 1.016 fib 0.9732 0.9798 flat-array 0.8148 1.019 hamlet 0.9966 1.030 imp-for 0.8993 0.7985 knuth-bendix 1.008 1.013 lexgen 0.9851 1.043 life 0.9954 1.006 logic 0.9994 1.014 mandelbrot 0.9440 1.013 matrix-multiply 1.336 1.009 md5 0.9604 1.007 merge 0.9675 1.037 mlyacc 1.032 1.029 model-elimination 1.010 1.004 mpuz 1.035 0.9599 nucleic 0.9938 0.9983 output1 0.9278 0.9709 peek 0.9850 1.035 pidigits 0.9702 0.9538 psdes-random 1.017 0.9986 ratio-regions 0.9801 0.9887 ray 0.9795 1.009 raytrace 0.9959 1.026 simple 0.9764 1.010 smith-normal-form 1.002 1.049 string-concat 0.7919 0.9035 tailfib 1.030 1.227 tailmerge 1.017 0.9980 tak 0.9790 0.9988 tensor 0.5258 1.000 tsp 0.9845 1.013 tyan 1.019 0.9739 vector-rev 1.178 1.253 vector32-concat 0.8703 0.9230 vector64-concat 0.8906 0.9038 vliw 0.9921 1.044 wc-input1 1.060 0.9809 wc-scanStream 0.9166 1.040 zebra 1.008 1.020 zern 1.051 1.089 MIN 0.5258 0.7985 GMEAN 0.9720 1.007 MAX 1.336 1.253 Note: the issue with `md5` mentioned in the commit messages are with respect to the `md5` benchmark before 2daaebf. Overall, this simplifies the C and LLVM codegen slightly, although there is little significant performance change: config command C02 /home/mtf/devel/mlton/builds/g89891a411/bin/mlton -codegen c -cc gcc-9 C04 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 C08 /home/mtf/devel/mlton/builds/g89891a411/bin/mlton -codegen llvm -cc clang C09 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang task_clock [email protected] (2-level) program `C04/C02` `C09/C08` barnes-hut 1.036 1.025 boyer 0.9731 1.006 checksum 0.9652 1.002 count-graphs 0.9988 0.9964 DLXSimulator 0.9970 1.023 even-odd 1.002 0.9881 fft 1.026 0.9674 fib 0.9034 0.7846 flat-array 1.014 1.021 hamlet 0.9740 1.010 imp-for 0.9707 0.9908 knuth-bendix 0.9077 0.9777 lexgen 1.048 0.8985 life 1.002 0.9827 logic 1.006 0.9867 mandelbrot 1.000 1.011 matrix-multiply 1.020 0.9957 md5 0.9700 0.9960 merge 0.9974 0.9818 mlyacc 1.003 0.9824 model-elimination 0.9936 0.9817 mpuz 0.9815 0.9466 nucleic 0.9946 1.002 output1 1.007 1.026 peek 0.9832 0.9898 pidigits 0.9950 1.047 psdes-random 1.009 0.9869 ratio-regions 0.9978 0.9725 ray 0.9938 0.9663 raytrace 0.9975 1.032 simple 0.9936 1.000 smith-normal-form 1.038 0.9941 string-concat 1.041 1.014 tailfib 0.9865 0.9741 tailmerge 1.010 1.020 tak 0.9331 0.9041 tensor 0.9938 0.9941 tsp 0.9825 1.004 tyan 0.9960 0.9879 vector-rev 1.014 0.9091 vector32-concat 1.090 0.9016 vector64-concat 0.9994 0.9800 vliw 0.9995 0.9876 wc-input1 0.9685 0.8634 wc-scanStream 1.178 1.105 zebra 0.9857 0.9900 zern 0.9733 0.9890 MIN 0.9034 0.7846 GMEAN 0.9982 0.9815 MAX 1.178 1.105
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Motivation
MLton currently implements checked arithmetic with specialized primitives. These primitives are exposed to the Basis Library implementation as functions that implicitly raise a primitive exception:
In the XML IR, special care is taken to "remember" the primive exception associated with these primitives in order to implement exceptions correctly. In the SSA, SSA2, RSSA, and Machine IRs, these primitives are treated as transfers, rather than statements.
This pull request implements a possibly better implementation of these operations as simple primitives that return a boolean:
This would eliminate the special cases in the XML IR and in the SSA, SSA2, RSSA, and Machine IRs, where the primitives would be treated as statements. Other compilers provide overflow checking via boolean-returning functions:
Implementation
This patch refactors the primitive checked-arithmetic operations such that the suffix
?
represents a new overflow-checking predicate, a suffix!
represents the non-overflow-checking variant, and a suffix$
represents the overflow-checking variant (mnemonic:$
for "safe" or "expensive"). The behavior of the new$
-operations is controlled by a compile-time constant,MLton.newOverflow
. When set to false (the default), the$
-operations make use of the old-style implicit-Overflow
primitives. When set to true, the$
-operations are implemented as anif
-expression that branches on the result of the corresponding?
-operation and either raises theOverflow
exception or returns the result of the corresponding!
-operation. Finally, the bare operation is aliased to either the$
-form (with overflow detection enabled) or the!
-form (with overflow detection disabled). Essentially:Note that the checked-arithmetic using
!
- and$
-operations is careful to perform the!
-operation before the$
-operation. With the native-codegens, a new peephole optimization combines the separate unchecked-arithmetic operation and checked-arithmetic-predicate operation into a single instruction. For the C-codgen, the new checked-arithmetic-predicate primitives are translated to usesof the
__builtin_{add,sub,mul}_overflow
intrinsics (which improves upon the previous explicit conditional checking, but requires gcc 5 or greater). Similarly, for the LLVM-codgen, the new checked-arithmetic-predicate primitives are translated to uses of the{sadd,uadd,smul,umul,ssub,usub}.with.overflow
intrinsics. For both the C- and LLVM-codegens, it is expected that these intrinsics will be combined with the separate unchecked-arithmetic operation.In addition, the
RedundantTests
optimization has been extended to eliminate the overflow test when adding or subtracting 1 with the new primitives.Performance
The native-codegen peephole optimization and
RedundantTests
have been mostly sufficient to keep performance on par with the older checked-arithmetic primitives, and in some cases performance has even significantly improved. Below is a summary of the exceptional runtime ratios in the different codegens (both positive and negative):No benchmarks were consistently worse or better on all codegen, e.g., the
imp-for
benchmark performed exceptionally badly on the LLVM codegen, but was much faster on the C codegen and about even on the native codegen. For this particular benchmark, the cause of the slowdown with the LLVM codegen has yet to be discovered. Similarly, the cause of the slowdown inlexgen
with the C- and LLVM-codegens is unknown. For themd5
benchmark, on the other hand, the cause of the slowdown with the native codegen seems to be a failure to eliminate common subexpressions in certain circumstances, which can potentially be improved in the future. Improvements in the C-codegen are likely to be due to the better overflow checking with builtins.Future work
The
CommonSubexp
optimization currently handlesArith
transfers specially; in particular, the result of anArith
transfer can be used in commonArith
transfers that it dominates. This was done so that code like:can be transformed to
With the new checked-arithmetic-predicate primitives, the computation of the boolean value may be common-subexpr eliminated, but
Case
discrimination will not. This forces the boolean value to be reified and to be discriminated multiple times (though, perhapsKnownCase
could eliminate subsequent discriminations). ExtendingCommonSubexpr
to propagate flow-sensitive relationships atCase
transfers to the dominated targets could improve the performancemd5
withMLton.newOverflow true
and potentially improve performance elsewhere as well (e.g., by propagating the results of comparisons as well).Once all performance slowdowns with
MLton.newOverflow true
have been eliminated, it would be desirable to remove the old-style implicit-Overflow
primitives andArith
transfers. This would eliminate many previous instances of special-case code to handle these primitives and transfers.Finally, it may be worth investigating an implementation of the checked-operations via
rather than
The advantage of calculating the boolean first is that when
x
(ory
) is a loop variable andr
will be passed as the value for the next iteration, then bothx
andr
could be assigned the same location (pseudo-register or stack slot).x
andr
cannot share the same location when the boolean is calculated second, becausex
andr
are both live at the calculation of the boolean. See #218. This would require reworking the native-codegen peephole optimization.