Add new overflow-checking primitives #273

daemanos · 2018-08-06T18:37:38Z

Motivation

MLton currently implements checked arithmetic with specialized primitives. These primitives are exposed to the Basis Library implementation as functions that implicitly raise a primitive exception:

val +! = _prim "WordS8_addCheck": int * int -> int;

In the XML IR, special care is taken to "remember" the primive exception associated with these primitives in order to implement exceptions correctly. In the SSA, SSA2, RSSA, and Machine IRs, these primitives are treated as transfers, rather than statements.

This pull request implements a possibly better implementation of these operations as simple primitives that return a boolean:

val +! = _prim "Word8_add": int * int -> int;
val +? = _prim "WordS8_addCheckP": int * int -> bool;
val +$ = fn (x, y) => let val r = +! (x, y)
                      in  if +? (x, y) then raise Overflow else r
                      end

This would eliminate the special cases in the XML IR and in the SSA, SSA2, RSSA, and Machine IRs, where the primitives would be treated as statements. Other compilers provide overflow checking via boolean-returning functions:

Implementation

This patch refactors the primitive checked-arithmetic operations such that the suffix ? represents a new overflow-checking predicate, a suffix ! represents the non-overflow-checking variant, and a suffix $ represents the overflow-checking variant (mnemonic: $ for "safe" or "expensive"). The behavior of the new $-operations is controlled by a compile-time constant, MLton.newOverflow. When set to false (the default), the $-operations make use of the old-style implicit-Overflow primitives. When set to true, the $-operations are implemented as an if-expression that branches on the result of the corresponding ?-operation and either raises the Overflow exception or returns the result of the corresponding !-operation. Finally, the bare operation is aliased to either the $-form (with overflow detection enabled) or the !-form (with overflow detection disabled). Essentially:

val +! = _prim "Word8_add": int * int -> int;
val +? = _prim "WordS8_addCheckP": int * int -> bool;
val +$ = if MLton.newOverflow
         then fn (x, y) => let val r = +! (x, y)
                           in  if +? (x, y) then raise Overflow else r
                           end
         else fn (x, y) => ((_prim "WordS8_addCheckP": int * int -> int;) (x, y))
                           handle PrimOverflow => raise Overflow
val + = if MLton.detectOverflow then +$ else +!

Note that the checked-arithmetic using !- and $-operations is careful to perform the !-operation before the $-operation. With the native-codegens, a new peephole optimization combines the separate unchecked-arithmetic operation and checked-arithmetic-predicate operation into a single instruction. For the C-codgen, the new checked-arithmetic-predicate primitives are translated to uses
of the __builtin_{add,sub,mul}_overflow intrinsics (which improves upon the previous explicit conditional checking, but requires gcc 5 or greater). Similarly, for the LLVM-codgen, the new checked-arithmetic-predicate primitives are translated to uses of the {sadd,uadd,smul,umul,ssub,usub}.with.overflow intrinsics. For both the C- and LLVM-codegens, it is expected that these intrinsics will be combined with the separate unchecked-arithmetic operation.

In addition, the RedundantTests optimization has been extended to eliminate the overflow test when adding or subtracting 1 with the new primitives.

Performance

The native-codegen peephole optimization and RedundantTests have been mostly sufficient to keep performance on par with the older checked-arithmetic primitives, and in some cases performance has even significantly improved. Below is a summary of the exceptional runtime ratios in the different codegens (both positive and negative):

Benchmark	Native	LLVM	C
even-odd	1.00	1.00	1.09
hamlet	0.98	0.90	0.93
imp-for	0.99	1.50	0.46
lexgen	0.92	1.31	1.24
matrix-multiply	0.99	1.00	0.87
md5	1.06	1.01	0.97
tensor	1.01	1.00	0.57

No benchmarks were consistently worse or better on all codegen, e.g., the imp-for benchmark performed exceptionally badly on the LLVM codegen, but was much faster on the C codegen and about even on the native codegen. For this particular benchmark, the cause of the slowdown with the LLVM codegen has yet to be discovered. Similarly, the cause of the slowdown in lexgen with the C- and LLVM-codegens is unknown. For the md5 benchmark, on the other hand, the cause of the slowdown with the native codegen seems to be a failure to eliminate common subexpressions in certain circumstances, which can potentially be improved in the future. Improvements in the C-codegen are likely to be due to the better overflow checking with builtins.

Future work

The CommonSubexp optimization currently handles Arith transfers specially; in particular, the result of an Arith transfer can be used in common Arith transfers that it dominates. This was done so that code like:

(n + m) + (n + m)

can be transformed to

let val r = n + m in r + r end

With the new checked-arithmetic-predicate primitives, the computation of the boolean value may be common-subexpr eliminated, but Case discrimination will not. This forces the boolean value to be reified and to be discriminated multiple times (though, perhaps KnownCase could eliminate subsequent discriminations). Extending CommonSubexpr to propagate flow-sensitive relationships at Case transfers to the dominated targets could improve the performance md5 with MLton.newOverflow true and potentially improve performance elsewhere as well (e.g., by propagating the results of comparisons as well).

Once all performance slowdowns with MLton.newOverflow true have been eliminated, it would be desirable to remove the old-style implicit-Overflow primitives and Arith transfers. This would eliminate many previous instances of special-case code to handle these primitives and transfers.

Finally, it may be worth investigating an implementation of the checked-operations via

val +$ = fn (x, y) => let val b = +? (x, y)
                          val r = +! (x, y)
                      in  if b then raise Overflow else r
                      end

rather than

val +$ = fn (x, y) => let val r = +! (x, y)
                      in  if +? (x, y) then raise Overflow else r
                      end

The advantage of calculating the boolean first is that when x (or y) is a loop variable and r will be passed as the value for the next iteration, then both x and r could be assigned the same location (pseudo-register or stack slot). x and r cannot share the same location when the boolean is calculated second, because x and r are both live at the calculation of the boolean. See #218. This would require reworking the native-codegen peephole optimization.

Previously, the `!` family of primitive operators (`+!` etc.) was an operation that always checked for overflow, whereas the `?` family never checked for overflow. However, as a first step towards transitioning to a model wherein the `?` family is a predicate that checks for overflow, here the semantics are flipped such that the `!` family never checks for overflow, and `?` always does.

This change introduces a new family of `$` operators, which always check their arguments for overflow and raise an appropriate exception. The `?` family of operators is changed to detect overflow (returning a `Bool`), and the `$` family is defined in terms of the corresponding `?` and `!` operators. Previous uses of `?` in the basis library are also corrected to use the new system.

Previously the old primitives were simply deleted; this commit introduces a compile-time switch (`MLton.newOverflow`) to switch between the old and new systems. This should allow for easier development and eventual integration of the new primitives, which can be incrementally implemented on the backend while keeping the old ones intact.

A previous change introduced the use of GCC builtins for overflow checking with the `newOverflow` flavor of checked primitive operations. However, these builtins are compatible only with GCC versions 7 and above, so to improve backwards compatibility a compatibility layer is added using the older builtins for legacy GCC versions. In addition, since much of the code for the new and old primitive operations was shared, this shared code is pulled into the compatibility layer (provided as `_overflow_b` macros).

Previously, the new overflow-checking primitives were implemented by first checking for overflow and then performing the operation. This change refactors the implementation to instead perform and store the operation first, then check for overflow. This will allow the peephole optimization to be more easily implemented for the new overflow primitives in the native codegens.

The native codegens (amd64, x86) previously used the pmd and imul2 instructions incorrectly to check for overflow (based erroneously on the pattern of the corresponding arithmetic operations). This fix changes the implementation to use `pmdcc` for all unsigned operations.

The existing redundantTests code consisted of a map over statements and a separate check for the presence of an arith transfer for the old overflow-checking primitives. Initially an additional map over statements was added to deal with the new overflow-checking primitives; this change merges the two maps into one.

Adds a new peephole optimization pass for the native codegens to eliminate redundant AL instructions. In general, whenever a checked word operation is generated, the `Word_*` and `Word_*CheckP` operations generate distinct assembly instructions. However, only one of these instructions is actually needed, since both perform the same operation and set the same overflow flag, so the generated assembly for a `Word_*CheckP` immediately following an identical `Word_*` operation can be safely removed. Currently, blocks where one `mov` instruction has been hoisted to the front of the block are also recognized.

Previously the `elimALRedundant` optimization was missing several cases because it occurred after the `moveHoist` optimization. Originally, an additional case was added to the `elimALRedundant` optimization to handle cases where one move had been hoisted immediately before the corresponding arithmetic operations, but this did not account for cases where one or both moves were hoisted elsewhere in the block. This change introduces a new group of optimizations in the `PeepholeLivenessBlock` section which execute before `moveHoist`. In the future additional optimizations may be added, but at present only the `elimALRedundant` optimization occurs in this group. In addition, the `elimALRedundant` was missing a branch to check for imul2 operations, which has now been added.

Previously, some redundant `mulCheckP` operations were missed in the main peephole optimization pass due to a previous multiplication of a variable by 2 being replaced by an addition of that variable to itself. This change adds a new peephole optimization to catch such operations.

Like `Word_mulCheck`, `Word_mulCheckP` should not use the `@llvm.{s,u}mul.with.overflow.i64` intrinsics, because they compile to calls to `__mulodi4` and `__udivdi3`, which are provided by compiler-rt and not always by libgcc.

The `elimALRedundant_pseudo` peephole optimization was removed from the amd64 codegen by 56d4362, but should also be removed from the x86 codegen.

Commit f9e1cb6 renamed the `Word_*CheckOverflows` C functions to `Word_*CheckP` C functions.

Addition, subtraction, and negation of 64bit operands on a 32bit platform must first add/sub the low bits and then addc/subb the high bits in order to correctly set the overflow flag. However, this does not yet pass all the regression tests: testing fixed-integer 15a16,21 > Int64: abs ~9223372036854775807 = Overflow <> 9223372036854775807 > Int64: ~ ~9223372036854775807 = Overflow <> 9223372036854775807 > Int64: fromString o toString 9223372036854775807 = Overflow <> 9223372036854775807 > Int64: fromString o toString 9223372036854775806 = Overflow <> 9223372036854775806 > unhandled exception: Overflow > Nonzero exit status. fixed-integer: difference with -type-check true -const MLton.newOverflow true

The `elimDeadDsts` peephole optimization attempts to eliminate instructions whose destinations are dead. There was limited code for preserving instructions that are immediately followed by a `setcc` instruction or a `if` transfer. However, it did not preserve instructions that were immediately followed by an `adc` or `sbb` instruction. This would lead to mis-compilation of `Word64_{add,neg,sub}CheckP` primitives. For example, a `Word64_addCheckP` could initially be translated to: movl src1Lo,tmpLo movl src1Hi,tmpHi addl src2Lo,tmpLo adcl src2Hi,tmpHi seto dst The "tmpLo" destination in the "addl src2Lo,tmpLo" instruction is dead and so the instruction would be eliminated; in turn, the "tmpLo" destination in the "movl src1Lo,tmpLo" instruction is dead and so the instruction would be eliminated. However, although the "tmpHi" destination in the "adcl src2Hi,tmpHi" instruction is dead, the instruction is preserved because of the subsequent "seto" instruction. The end result would be: movl src1Hi,tmpHi adcl src2Hi,tmpHi seto dst which fails to set "dst" appropriately. Now, in addition to preserving instructions that are immediately followed by a `setcc` instruction or an `if` transfer, an instruction immediately preserved by a `adc` or `sbb` instruction is preserved.

A `mov` instruction does not change the condition code flags, so continue looking for flag-dependent instructions past a `mov` instruction.

Previously, a `Word64_negCheckP` was initially translated to: movl $0,tmp subl srcLo,tmp movl $0,tmp sbbl srcHi,tmp seto dst With commits 97315a4 and 3a0a42d, the "subl srcLo,tmp" instruction is not eliminated by the `elimDeadDsts` peephole optimization. However, register allocation unconditionally rewrites `movl $0,dst` as `xorl dst,dst`, leaving code like: xorl tmp,tmp subl srcLo,tmp xorl tmp,tmp sbbl srcHi,tmp However, while a `mov` preserves the condition code flags, a `xorl` instruction clears the CF flag, which leads to incorrect flags being used and set by the `sbb` instruction. Similarly, a `Word64_addCheckP` was initially translated to: movl src1Lo,tmp addl src2Lo,tmp movl src1Hi,tmp adcl src2Hi,tmp seto dst which, if src1Hi were 0, would result in movl src1Lo,tmp addl src2Lo,tmp xorl tmp,tmp adcl src2Hi,tmp which leads to incorrect flags being used and set by the `adc` instruction. The existing 64-bit primitives avoid this problem by performing all of the `mov` instructions before the arithmetic instruction. Now, a `Word64_negCheckP` primitive is initially translated to: movl $0,tmpLo movl $0,tmpHi subl srcLo,tmpLo sbbl srcHi,tmpHi seto dst which can be correctly rewritten to: xorl tmpLo,tmpLo xorl tmpHi,tmpHi subl srcLo,tmpLo sbbl srcHi,tmpHi seto dst and a `Word64_addCheckP` primitive is initially translated to: movl src1Lo,tmpLo movl src1Hi,tmpHi addl src2Lo,tmpLo adcl src2Hi,tmpHi seto dst which, if src1Hi were 0, can be correctly rewritten to: movl src1Lo,tmpLo xorl tmpHi,tmpHi addl src2Lo,tmpLo adcl src2Hi,tmpHi

Add new overflow-checking primitives Motivation MLton currently implements checked arithmetic with specialized primitives. These primitives are exposed to the Basis Library implementation as functions that implicitly raise a primitive exception: val +! = _prim "WordS8_addCheck": int * int -> int; In the XML IR, special care is taken to "remember" the primive exception associated with these primitives in order to implement exceptions correctly. In the SSA, SSA2, RSSA, and Machine IRs, these primitives are treated as transfers, rather than statements. This pull request implements a possibly better implementation of these operations as simple primitives that return a boolean: val +! = _prim "Word8_add": int * int -> int; val +? = _prim "WordS8_addCheckP": int * int -> bool; val +$ = fn (x, y) => let val r = +! (x, y) in if +? (x, y) then raise Overflow else r end This would eliminate the special cases in the XML IR and in the SSA, SSA2, RSSA, and Machine IRs, where the primitives would be treated as statements. Other compilers provide overflow checking via boolean-returning functions: * https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html * https://llvm.org/docs/LangRef.html#arithmetic-with-overflow-intrinsics Implementation This patch refactors the primitive checked-arithmetic operations such that the suffix `?` represents a new overflow-checking predicate, a suffix `!` represents the non-overflow-checking variant, and a suffix `$` represents the overflow-checking variant (mnemonic: `$` for "safe" or "expensive"). The behavior of the new `$`-operations is controlled by a compile-time constant, `MLton.newOverflow`. When set to false (the default), the `$`-operations make use of the old-style implicit-`Overflow` primitives. When set to true, the `$`-operations are implemented as an `if`-expression that branches on the result of the corresponding `?`-operation and either raises the `Overflow` exception or returns the result of the corresponding `!`-operation. Finally, the bare operation is aliased to either the `$`-form (with overflow detection enabled) or the `!`-form (with overflow detection disabled). Essentially: val +! = _prim "Word8_add": int * int -> int; val +? = _prim "WordS8_addCheckP": int * int -> bool; val +$ = if MLton.newOverflow then fn (x, y) => let val r = +! (x, y) in if +? (x, y) then raise Overflow else r end else fn (x, y) => ((_prim "WordS8_addCheckP": int * int -> int;) (x, y)) handle PrimOverflow => raise Overflow val + = if MLton.detectOverflow then +$ else +! Note that the checked-arithmetic using `!`- and `$`-operations is careful to perform the `!`-operation before the `$`-operation. With the native-codegens, a new peephole optimization combines the separate unchecked-arithmetic operation and checked-arithmetic-predicate operation into a single instruction. For the C-codgen, the new checked-arithmetic-predicate primitives are translated to uses of the `__builtin_{add,sub,mul}_overflow` intrinsics (which improves upon the previous explicit conditional checking, but requires gcc 5 or greater). Similarly, for the LLVM-codgen, the new checked-arithmetic-predicate primitives are translated to uses of the `{sadd,uadd,smul,umul,ssub,usub}.with.overflow` intrinsics. For both the C- and LLVM-codegens, it is expected that these intrinsics will be combined with the separate unchecked-arithmetic operation. In addition, the `RedundantTests` optimization has been extended to eliminate the overflow test when adding or subtracting 1 with the new primitives. Performance The native-codegen peephole optimization and `RedundantTests` have been mostly sufficient to keep performance on par with the older checked-arithmetic primitives, and in some cases performance has even significantly improved. Below is a summary of the exceptional runtime ratios in the different codegens (both positive and negative): | Benchmark | Native | LLVM | C | |-----------------|--------|------|------| | even-odd | 1.00 | 1.00 | 1.09 | | hamlet | 0.98 | 0.90 | 0.93 | | imp-for | 0.99 | 1.50 | 0.46 | | lexgen | 0.92 | 1.31 | 1.24 | | matrix-multiply | 0.99 | 1.00 | 0.87 | | md5 | 1.06 | 1.01 | 0.97 | | tensor | 1.01 | 1.00 | 0.57 | No benchmarks were consistently worse or better on all codegen, e.g., the `imp-for` benchmark performed exceptionally badly on the LLVM codegen, but was much faster on the C codegen and about even on the native codegen. For this particular benchmark, the cause of the slowdown with the LLVM codegen has yet to be discovered. Similarly, the cause of the slowdown in `lexgen` with the C- and LLVM-codegens is unknown. For the `md5` benchmark, on the other hand, the cause of the slowdown with the native codegen seems to be a failure to eliminate common subexpressions in certain circumstances, which can potentially be improved in the future. Improvements in the C-codegen are likely to be due to the better overflow checking with builtins. Future work The `CommonSubexp` optimization currently handles `Arith` transfers specially; in particular, the result of an `Arith` transfer can be used in common `Arith` transfers that it dominates. This was done so that code like: (n + m) + (n + m) can be transformed to let val r = n + m in r + r end With the new checked-arithmetic-predicate primitives, the computation of the boolean value may be common-subexpr eliminated, but `Case` discrimination will not. This forces the boolean value to be reified and to be discriminated multiple times (though, perhaps `KnownCase` could eliminate subsequent discriminations). Extending `CommonSubexpr` to propagate flow-sensitive relationships at `Case` transfers to the dominated targets could improve the performance `md5` with `MLton.newOverflow true` and potentially improve performance elsewhere as well (e.g., by propagating the results of comparisons as well). Once all performance slowdowns with `MLton.newOverflow true` have been eliminated, it would be desirable to remove the old-style implicit-`Overflow` primitives and `Arith` transfers. This would eliminate many previous instances of special-case code to handle these primitives and transfers. Finally, it may be worth investigating an implementation of the checked-operations via val +$ = fn (x, y) => let val b = +? (x, y) val r = +! (x, y) in if b then raise Overflow else r end rather than val +$ = fn (x, y) => let val r = +! (x, y) in if +? (x, y) then raise Overflow else r end The advantage of calculating the boolean first is that when `x` (or `y`) is a loop variable and `r` will be passed as the value for the next iteration, then both `x` and `r` could be assigned the same location (pseudo-register or stack slot). `x` and `r` cannot share the same location when the boolean is calculated second, because `x` and `r` are both live at the calculation of the boolean. See #218. This would require reworking the native-codegen peephole optimization.

Remove old-style arithmetic primitives In #273, MLton was extended to support overflow-checking primitives for arithmetic operations. This allowed checked arithmetic operations to be encoded as an unchecked arithmetic operation followed by an overflow-checking operation followed by an `if` that either raises the `Overflow` exception or returns the arithmetic result. However, the overflow primitives (that implicitly raise a special `PrimOverflow` exception in the early ILs and are used in `Arith` transfers in the late ILs) remained in the compiler. This pull request removes the old overflow primitives entirely, along with the special-case code required to support them.

It appears that GCC (and, to a lesser extent) Clang/LLVM do not always successfully fuse adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives. The performance results reported at MLton#273 and MLton#292 suggest that this does not always have significant impact, but a close look at the `md5` benchmark shows that the native codegen significantly outperforms the C codegen with gcc-9 due to redundant arithmetic computations (one for `Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`). This flag will be used to enable explicit fusing of adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives in the codegens.

It appears that GCC (and, to a lesser extent) Clang/LLVM do not always successfully fuse adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives. The performance results reported at MLton#273 and MLton#292 suggest that this does not always have significant impact, but a close look at the `md5` benchmark shows that the native codegen significantly outperforms the C codegen with gcc-9 due to redundant arithmetic computations (one for `Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`). These functions compute both the arithmetic result and a boolean indicating overflow (using `__builtin_<op>_overflow`). They will be used for explicit fusing of adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives in the C codegen for `-codegen-fuse-op-and-check true`.

It appears that GCC (and, to a lesser extent) Clang/LLVM do not always successfully fuse adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives. The performance results reported at MLton#273 and MLton#292 suggest that this does not always have significant impact, but a close look at the `md5` benchmark shows that the native codegen significantly outperforms the C codegen with gcc-9 due to redundant arithmetic computations (one for `Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`). (Note: Because the final md5 state is not used by the `md5` benchmark program, MLton actually optimizes out most of the md5 computation. What is left is a lot of arithmetic from `PackWord32Little.subVec` to check for indices that should raise `Subscript`.) For example, with `-codegen-fuse-op-and-check false` and gcc-9, the `transform` function of `md5` has the following assembly: movl %r9d, %r10d subl $1, %r10d jo .L650 leal -1(%r8), %r10d movl %r10d, %r12d addl %r10d, %edx jo .L650 addl %r10d, %r11d cmpl %eax, %r11d jnb .L656 movl %ebp, %edx addl $1, %edx jo .L659 leal 1(%rcx), %edx movl %edx, %r11d imull %r9d, %r11d jo .L650 imull %r8d, %edx movl %edx, %r11d addl %r10d, %r11d jo .L650 leal (%rdx,%r10), %r11d cmpl %eax, %r11d jnb .L665 What seems to have happened is that gcc has arranged for equivalent values to be in `%r8` and `%r9`. In the first three lines, there is an implementation of `WordS32_subCheckP (X, 1)` using `subl/jo`, while in the fourth line, there is an implementation of `Word32_sub (X, 1)` using `lea` with an offset of `-1`. Notice that `%r10` is used for the result of both, so the fourth line is redundant (the value is already in `%r10`). On the other hand, with `-codegen-fuse-op-and-check true` and gcc-9, the `transform` function of `md5` has the following assembly: movl %r8d, %r9d subl $1, %r9d jo .L645 addl %r9d, %ecx jo .L645 cmpl %edx, %ecx jnb .L651 movl %eax, %ecx addl $1, %ecx jo .L654 imull %r8d, %ecx jo .L645 addl %r9d, %ecx jo .L645 cmpl %edx, %ecx jnb .L660

Updates to C and LLVM codegens. Highlights: * Add `Machine.Program.rflow` to compute `{returns,raises}To` control flow (654c557) and use in `functor Chunkify` (1b3b7b8) and in Machine IR `Raise/Return` transfers (cf8e487). * Add `chunk-jump-table {false|true}` compile-time option to force generation of a jump table for the chunk switch (8e0dd2d, 5b6439b, 087a5b1). * Add `-chunk-{{must,may}-rto-self,must-rto-sing,must-rto-other}-opt` compile-time options to optimize return/raise transfers (7c10c70, 4d5abde, 4b7c649, c3b9905, 473808f) * Experiment using LLVM's `cc10` (aka, `ghccc`) calling convention (2e26ebd). * Experiment with a new `simple` chunkify strategy (3330cbe, 3d9c499, 138512f, faef164, d1df0de); generally performs about the same as `coalesce4096`, significantly improves `fib` and `tak` (for GCC), slightly improves `hamlet`, but slightly worsens `raytrace`: config command C04 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 C05 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 -chunkify simple C09 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang C10 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang -chunkify simple task_clock [email protected] (2-level) program `C05/C04` `C10/C09` barnes-hut 0.9978 0.9589 boyer 1.064 1.076 checksum 1.051 0.9775 count-graphs 1.005 0.9876 DLXSimulator 1.000 0.9905 even-odd 1.037 0.9989 fft 0.9616 0.9537 fib 0.6689 0.6260 flat-array 1.000 0.9645 hamlet 0.9547 0.9322 imp-for 1.067 1.014 knuth-bendix 1.092 1.031 lexgen 1.031 1.078 life 1.002 0.9911 logic 1.016 1.015 mandelbrot 0.9776 1.030 matrix-multiply 0.9903 0.9844 md5 1.008 0.9940 merge 0.9927 1.062 mlyacc 0.9810 1.024 model-elimination 0.9877 0.9743 mpuz 1.011 1.010 nucleic 1.036 1.030 output1 0.9943 1.021 peek 1.036 1.027 pidigits 1.000 0.9653 psdes-random 1.009 1.014 ratio-regions 0.9985 0.9881 ray 0.9738 0.9601 raytrace 1.101 1.100 simple 0.9620 0.9272 smith-normal-form 0.9690 0.9806 string-concat 0.9610 0.9772 tailfib 1.006 0.9292 tailmerge 0.9847 1.023 tak 0.8264 1.013 tensor 1.010 0.9998 tsp 0.9981 1.010 tyan 1.045 1.027 vector-rev 1.012 0.9891 vector32-concat 0.9495 1.030 vector64-concat 1.098 0.9744 vliw 0.9413 1.019 wc-input1 0.9301 1.098 wc-scanStream 1.114 0.9234 zebra 1.008 1.001 zern 0.9819 1.014 MIN 0.6689 0.6260 GMEAN 0.9940 0.9912 MAX 1.114 1.100 The `simple` chunkify strategy is not (yet) suitable for a self-compile; it can generate excessively large chunks, including one for a self-compile that requires 8min to compile by `gcc`. * Add `expect: WordX.t option` to RSSA and Machine `Switch.T` (911b5d4, e2b27ab, 695320d) and add `-gc-expect {none|false|true}` compile-time option, where `-gc-expect false` should indicate that performing a GC is cold path (823815a); no notable performance impact. * Lots of tweaks to C codegen, ultimately eliminating almost all `c-chunk.h` macros. * Eliminate unused `Machine.Operand.Contents` constructor (006269b). * Make a major refactoring of LLVM codegen (cec30c5). * Implement `Real<N>_qequal` for C codegen (9b7b2bd) and use `is{less,lessequal}` for `Real<N>_l{t,e}` for C codegen (7b55819). * Generalize LLVM type-based alias-analysis (27709ef). * Add `-llvm-aamd scope` for simple `noalias`/`alias.scope` alias-analysis metadata in LLVM codegen (b825f56); no notable performance impact. * Use C99/C11 `inline` for primitive and Basis Library functions (311331c, c864492, 4f2d213). * Add `-codegen-fuse-op-and-chk {false|true}` compile-time option to explicitly fuse adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives in the C and LLVM codegens (6b738b8, 3d1e89c, 68f8512, 82c019f, 61de560, 5363199, 0d46a85). It appears that GCC (and, to a lesser extent) Clang/LLVM do not always successfully fuse adjacent adjacent `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives. The performance results reported at #273 and #292 suggest that this does not always have significant impact, but sometimes `-codegen-fuse-op-and-chk true` can have a positive. Unfortunately, it can also have a (significant) negative impact. In `matrix-multiply` and `vector-rev`, fusing can cause GCC to not recognize that an explicit sequence index can be replaced by a stride length; in these benchmarks, it would be nice if MLton eliminated the overflow checks. config command C04 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 C09 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang C11 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 -codegen-fuse-op-and-chk true C15 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang -codegen-fuse-op-and-chk true task_clock [email protected] (2-level) program `C11/C04` `C15/C09` barnes-hut 1.005 0.9925 boyer 1.052 1.013 checksum 1.022 1.028 count-graphs 0.9722 1.002 DLXSimulator 1.004 0.9959 even-odd 0.8768 1.003 fft 0.9592 1.016 fib 0.9732 0.9798 flat-array 0.8148 1.019 hamlet 0.9966 1.030 imp-for 0.8993 0.7985 knuth-bendix 1.008 1.013 lexgen 0.9851 1.043 life 0.9954 1.006 logic 0.9994 1.014 mandelbrot 0.9440 1.013 matrix-multiply 1.336 1.009 md5 0.9604 1.007 merge 0.9675 1.037 mlyacc 1.032 1.029 model-elimination 1.010 1.004 mpuz 1.035 0.9599 nucleic 0.9938 0.9983 output1 0.9278 0.9709 peek 0.9850 1.035 pidigits 0.9702 0.9538 psdes-random 1.017 0.9986 ratio-regions 0.9801 0.9887 ray 0.9795 1.009 raytrace 0.9959 1.026 simple 0.9764 1.010 smith-normal-form 1.002 1.049 string-concat 0.7919 0.9035 tailfib 1.030 1.227 tailmerge 1.017 0.9980 tak 0.9790 0.9988 tensor 0.5258 1.000 tsp 0.9845 1.013 tyan 1.019 0.9739 vector-rev 1.178 1.253 vector32-concat 0.8703 0.9230 vector64-concat 0.8906 0.9038 vliw 0.9921 1.044 wc-input1 1.060 0.9809 wc-scanStream 0.9166 1.040 zebra 1.008 1.020 zern 1.051 1.089 MIN 0.5258 0.7985 GMEAN 0.9720 1.007 MAX 1.336 1.253 Note: the issue with `md5` mentioned in the commit messages are with respect to the `md5` benchmark before 2daaebf. Overall, this simplifies the C and LLVM codegen slightly, although there is little significant performance change: config command C02 /home/mtf/devel/mlton/builds/g89891a411/bin/mlton -codegen c -cc gcc-9 C04 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 C08 /home/mtf/devel/mlton/builds/g89891a411/bin/mlton -codegen llvm -cc clang C09 /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang task_clock [email protected] (2-level) program `C04/C02` `C09/C08` barnes-hut 1.036 1.025 boyer 0.9731 1.006 checksum 0.9652 1.002 count-graphs 0.9988 0.9964 DLXSimulator 0.9970 1.023 even-odd 1.002 0.9881 fft 1.026 0.9674 fib 0.9034 0.7846 flat-array 1.014 1.021 hamlet 0.9740 1.010 imp-for 0.9707 0.9908 knuth-bendix 0.9077 0.9777 lexgen 1.048 0.8985 life 1.002 0.9827 logic 1.006 0.9867 mandelbrot 1.000 1.011 matrix-multiply 1.020 0.9957 md5 0.9700 0.9960 merge 0.9974 0.9818 mlyacc 1.003 0.9824 model-elimination 0.9936 0.9817 mpuz 0.9815 0.9466 nucleic 0.9946 1.002 output1 1.007 1.026 peek 0.9832 0.9898 pidigits 0.9950 1.047 psdes-random 1.009 0.9869 ratio-regions 0.9978 0.9725 ray 0.9938 0.9663 raytrace 0.9975 1.032 simple 0.9936 1.000 smith-normal-form 1.038 0.9941 string-concat 1.041 1.014 tailfib 0.9865 0.9741 tailmerge 1.010 1.020 tak 0.9331 0.9041 tensor 0.9938 0.9941 tsp 0.9825 1.004 tyan 0.9960 0.9879 vector-rev 1.014 0.9091 vector32-concat 1.090 0.9016 vector64-concat 0.9994 0.9800 vliw 0.9995 0.9876 wc-input1 0.9685 0.8634 wc-scanStream 1.178 1.105 zebra 0.9857 0.9900 zern 0.9733 0.9890 MIN 0.9034 0.7846 GMEAN 0.9982 0.9815 MAX 1.178 1.105

daemanos added 30 commits May 23, 2018 12:05

Ignore vim configuration files

d4c412e

Add minimal backend support for new primitives

f9e1cb6

Remove unnecessary cases of Arith transfers

7bb41e5

Add support for new primitives in amd64 codegen

81d8cf8

Use GCC builtins for overflow checking

efef5dc

Add support for LLVM codegen

b05ba13

Correct typo in neg_overflow_b application

8de2d7c

Add overflowCheckTemp storage for x86 codegen

edafb83

Make negCheckP use 0-x for bounds checking

87ea212

Add support for x86 codegen

4a32966

Use GCC builtins in implementation of word prims

a68f209

Add support for new primitives in redundantTests

1822eb4

Add arithmetic identity support for newOverflow

267504c

Use 0-x for negation overflow checking in amd64

e6abe7f

Remove unused variables

72cb10b

Rename negal(64)cc to neg(64)cc

844d11f

Add layout for amd64 chunks/blocks

01c9f12

Add elimALRedundant pass to x86 codegen

81f2c6e

Fix undefined reference in amd64-simplify.fun

5de7724

Replace 0-x in negCheckP with a NEG instruction

47465f0

daemanos and others added 21 commits August 28, 2018 09:31

Fix int-mul-pow2 bug

8fea7b6

Remove debug printing from amd64-simplify.fun

0da9dfe

Remove superfluous elimALRedundant_pseudo optimization

56d4362

Correct initial refactoring of overflow primitives

03e41a0

Correct initial refactoring of unchecked/checked arithmetic fs

724f1d1

Alphabetize symbols in include/amd64-main.h

83d6955

Remove superfluous elimALRedundant_pseudo optimization

a1a4c0f

The `elimALRedundant_pseudo` peephole optimization was removed from the amd64 codegen by 56d4362, but should also be removed from the x86 codegen.

Delete trailing whitespace

df2e9e1

Tweak code formatting and style

cf7c3a7

Refactor to share code for simplifying arithmetic check primitives

939e827

Share code for type-checking comparisons and overflow-checking prims

9666e8a

Code formatting

a720283

Fix #undef Word_negCheckP (had been #undef WordS_negCheckP)

ce4352d

Fix translation of Word_*Check primitives to C functions

b91bb9a

Commit f9e1cb6 renamed the `Word_*CheckOverflows` C functions to `Word_*CheckP` C functions.

Whitespace fixes

5b9520e

Strengthen elimAddSub1 and elimDeadDsts peephole optimizations

3a0a42d

A `mov` instruction does not change the condition code flags, so continue looking for flag-dependent instructions past a `mov` instruction.

Add CHANGELOG entry for new Overflow-checking primitives

2dc3968

MatthewFluet merged commit 2dc3968 into MLton:master Oct 15, 2018

daemanos mentioned this pull request Mar 18, 2019

Remove old-style arithmetic primitives #292

Merged

MatthewFluet mentioned this pull request Nov 22, 2019

C and LLVM codegen updates #351

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add new overflow-checking primitives #273

Add new overflow-checking primitives #273

daemanos commented Aug 6, 2018 •

edited by MatthewFluet

Loading

Add new overflow-checking primitives #273

Add new overflow-checking primitives #273

Conversation

daemanos commented Aug 6, 2018 • edited by MatthewFluet Loading

Motivation

Implementation

Performance

Future work

daemanos commented Aug 6, 2018 •

edited by MatthewFluet

Loading