Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new overflow-checking primitives #273

Merged
merged 57 commits into from
Oct 15, 2018
Merged

Conversation

daemanos
Copy link
Contributor

@daemanos daemanos commented Aug 6, 2018

Motivation

MLton currently implements checked arithmetic with specialized primitives. These primitives are exposed to the Basis Library implementation as functions that implicitly raise a primitive exception:

val +! = _prim "WordS8_addCheck": int * int -> int;

In the XML IR, special care is taken to "remember" the primive exception associated with these primitives in order to implement exceptions correctly. In the SSA, SSA2, RSSA, and Machine IRs, these primitives are treated as transfers, rather than statements.

This pull request implements a possibly better implementation of these operations as simple primitives that return a boolean:

val +! = _prim "Word8_add": int * int -> int;
val +? = _prim "WordS8_addCheckP": int * int -> bool;
val +$ = fn (x, y) => let val r = +! (x, y)
                      in  if +? (x, y) then raise Overflow else r
                      end

This would eliminate the special cases in the XML IR and in the SSA, SSA2, RSSA, and Machine IRs, where the primitives would be treated as statements. Other compilers provide overflow checking via boolean-returning functions:

Implementation

This patch refactors the primitive checked-arithmetic operations such that the suffix ? represents a new overflow-checking predicate, a suffix ! represents the non-overflow-checking variant, and a suffix $ represents the overflow-checking variant (mnemonic: $ for "safe" or "expensive"). The behavior of the new $-operations is controlled by a compile-time constant, MLton.newOverflow. When set to false (the default), the $-operations make use of the old-style implicit-Overflow primitives. When set to true, the $-operations are implemented as an if-expression that branches on the result of the corresponding ?-operation and either raises the Overflow exception or returns the result of the corresponding !-operation. Finally, the bare operation is aliased to either the $-form (with overflow detection enabled) or the !-form (with overflow detection disabled). Essentially:

val +! = _prim "Word8_add": int * int -> int;
val +? = _prim "WordS8_addCheckP": int * int -> bool;
val +$ = if MLton.newOverflow
         then fn (x, y) => let val r = +! (x, y)
                           in  if +? (x, y) then raise Overflow else r
                           end
         else fn (x, y) => ((_prim "WordS8_addCheckP": int * int -> int;) (x, y))
                           handle PrimOverflow => raise Overflow
val + = if MLton.detectOverflow then +$ else +!

Note that the checked-arithmetic using !- and $-operations is careful to perform the !-operation before the $-operation. With the native-codegens, a new peephole optimization combines the separate unchecked-arithmetic operation and checked-arithmetic-predicate operation into a single instruction. For the C-codgen, the new checked-arithmetic-predicate primitives are translated to uses
of the __builtin_{add,sub,mul}_overflow intrinsics (which improves upon the previous explicit conditional checking, but requires gcc 5 or greater). Similarly, for the LLVM-codgen, the new checked-arithmetic-predicate primitives are translated to uses of the {sadd,uadd,smul,umul,ssub,usub}.with.overflow intrinsics. For both the C- and LLVM-codegens, it is expected that these intrinsics will be combined with the separate unchecked-arithmetic operation.

In addition, the RedundantTests optimization has been extended to eliminate the overflow test when adding or subtracting 1 with the new primitives.

Performance

The native-codegen peephole optimization and RedundantTests have been mostly sufficient to keep performance on par with the older checked-arithmetic primitives, and in some cases performance has even significantly improved. Below is a summary of the exceptional runtime ratios in the different codegens (both positive and negative):

Benchmark Native LLVM C
even-odd 1.00 1.00 1.09
hamlet 0.98 0.90 0.93
imp-for 0.99 1.50 0.46
lexgen 0.92 1.31 1.24
matrix-multiply 0.99 1.00 0.87
md5 1.06 1.01 0.97
tensor 1.01 1.00 0.57

No benchmarks were consistently worse or better on all codegen, e.g., the imp-for benchmark performed exceptionally badly on the LLVM codegen, but was much faster on the C codegen and about even on the native codegen. For this particular benchmark, the cause of the slowdown with the LLVM codegen has yet to be discovered. Similarly, the cause of the slowdown in lexgen with the C- and LLVM-codegens is unknown. For the md5 benchmark, on the other hand, the cause of the slowdown with the native codegen seems to be a failure to eliminate common subexpressions in certain circumstances, which can potentially be improved in the future. Improvements in the C-codegen are likely to be due to the better overflow checking with builtins.

Future work

The CommonSubexp optimization currently handles Arith transfers specially; in particular, the result of an Arith transfer can be used in common Arith transfers that it dominates. This was done so that code like:

(n + m) + (n + m)

can be transformed to

let val r = n + m in r + r end

With the new checked-arithmetic-predicate primitives, the computation of the boolean value may be common-subexpr eliminated, but Case discrimination will not. This forces the boolean value to be reified and to be discriminated multiple times (though, perhaps KnownCase could eliminate subsequent discriminations). Extending CommonSubexpr to propagate flow-sensitive relationships at Case transfers to the dominated targets could improve the performance md5 with MLton.newOverflow true and potentially improve performance elsewhere as well (e.g., by propagating the results of comparisons as well).

Once all performance slowdowns with MLton.newOverflow true have been eliminated, it would be desirable to remove the old-style implicit-Overflow primitives and Arith transfers. This would eliminate many previous instances of special-case code to handle these primitives and transfers.

Finally, it may be worth investigating an implementation of the checked-operations via

val +$ = fn (x, y) => let val b = +? (x, y)
                          val r = +! (x, y)
                      in  if b then raise Overflow else r
                      end

rather than

val +$ = fn (x, y) => let val r = +! (x, y)
                      in  if +? (x, y) then raise Overflow else r
                      end

The advantage of calculating the boolean first is that when x (or y) is a loop variable and r will be passed as the value for the next iteration, then both x and r could be assigned the same location (pseudo-register or stack slot). x and r cannot share the same location when the boolean is calculated second, because x and r are both live at the calculation of the boolean. See #218. This would require reworking the native-codegen peephole optimization.

daemanos added 30 commits May 23, 2018 12:05
Previously, the `!` family of primitive operators (`+!` etc.) was an
operation that always checked for overflow, whereas the `?` family never
checked for overflow. However, as a first step towards transitioning to
a model wherein the `?` family is a predicate that checks for overflow,
here the semantics are flipped such that the `!` family never checks for
overflow, and `?` always does.
This change introduces a new family of `$` operators, which always check
their arguments for overflow and raise an appropriate exception. The `?`
family of operators is changed to detect overflow (returning a `Bool`),
and the `$` family is defined in terms of the corresponding `?` and `!`
operators.

Previous uses of `?` in the basis library are also corrected to use the
new system.
Previously the old primitives were simply deleted; this commit
introduces a compile-time switch (`MLton.newOverflow`) to switch between
the old and new systems. This should allow for easier development and
eventual integration of the new primitives, which can be incrementally
implemented on the backend while keeping the old ones intact.
A previous change introduced the use of GCC builtins for overflow
checking with the `newOverflow` flavor of checked primitive operations.
However, these builtins are compatible only with GCC versions 7 and
above, so to improve backwards compatibility a compatibility layer is
added using the older builtins for legacy GCC versions.

In addition, since much of the code for the new and old primitive
operations was shared, this shared code is pulled into the compatibility
layer (provided as `_overflow_b` macros).
Previously, the new overflow-checking primitives were implemented by
first checking for overflow and then performing the operation. This
change refactors the implementation to instead perform and store the
operation first, then check for overflow.

This will allow the peephole optimization to be more easily implemented
for the new overflow primitives in the native codegens.
The native codegens (amd64, x86) previously used the pmd and imul2
instructions incorrectly to check for overflow (based erroneously on the
pattern of the corresponding arithmetic operations). This fix changes
the implementation to use `pmdcc` for all unsigned operations.
The existing redundantTests code consisted of a map over statements and
a separate check for the presence of an arith transfer for the old
overflow-checking primitives. Initially an additional map over
statements was added to deal with the new overflow-checking primitives;
this change merges the two maps into one.
Adds a new peephole optimization pass for the native codegens to
eliminate redundant AL instructions. In general, whenever a checked word
operation is generated, the `Word_*` and `Word_*CheckP` operations
generate distinct assembly instructions. However, only one of these
instructions is actually needed, since both perform the same operation
and set the same overflow flag, so the generated assembly for a
`Word_*CheckP` immediately following an identical `Word_*` operation can
be safely removed.

Currently, blocks where one `mov` instruction has been hoisted to the
front of the block are also recognized.
Previously the `elimALRedundant` optimization was missing several cases
because it occurred after the `moveHoist` optimization. Originally, an
additional case was added to the `elimALRedundant` optimization to
handle cases where one move had been hoisted immediately before the
corresponding arithmetic operations, but this did not account for cases
where one or both moves were hoisted elsewhere in the block.

This change introduces a new group of optimizations in the
`PeepholeLivenessBlock` section which execute before `moveHoist`. In the
future additional optimizations may be added, but at present only the
`elimALRedundant` optimization occurs in this group.

In addition, the `elimALRedundant` was missing a branch to check for
imul2 operations, which has now been added.
Previously, some redundant `mulCheckP` operations were missed in the
main peephole optimization pass due to a previous multiplication of a
variable by 2 being replaced by an addition of that variable to itself.
This change adds a new peephole optimization to catch such operations.
daemanos and others added 21 commits August 28, 2018 09:31
Like `Word_mulCheck`, `Word_mulCheckP` should not use the
`@llvm.{s,u}mul.with.overflow.i64` intrinsics, because they compile to calls to
`__mulodi4` and `__udivdi3`, which are provided by compiler-rt and not always by
libgcc.
The `elimALRedundant_pseudo` peephole optimization was removed from
the amd64 codegen by 56d4362, but should also be removed from the x86
codegen.
Commit f9e1cb6 renamed the `Word_*CheckOverflows` C functions to
`Word_*CheckP` C functions.
Addition, subtraction, and negation of 64bit operands on a 32bit
platform must first add/sub the low bits and then addc/subb the high
bits in order to correctly set the overflow flag.

However, this does not yet pass all the regression tests:

    testing fixed-integer
    15a16,21
    > Int64: abs ~9223372036854775807 = Overflow <> 9223372036854775807
    > Int64: ~ ~9223372036854775807 = Overflow <> 9223372036854775807
    > Int64: fromString o toString 9223372036854775807 = Overflow <> 9223372036854775807
    > Int64: fromString o toString 9223372036854775806 = Overflow <> 9223372036854775806
    > unhandled exception: Overflow
    > Nonzero exit status.
    fixed-integer: difference with -type-check true -const MLton.newOverflow true
The `elimDeadDsts` peephole optimization attempts to eliminate
instructions whose destinations are dead.  There was limited code for
preserving instructions that are immediately followed by a `setcc`
instruction or a `if` transfer.  However, it did not preserve
instructions that were immediately followed by an `adc` or `sbb`
instruction.  This would lead to mis-compilation of
`Word64_{add,neg,sub}CheckP` primitives.  For example, a
`Word64_addCheckP` could initially be translated to:

    movl src1Lo,tmpLo
    movl src1Hi,tmpHi
    addl src2Lo,tmpLo
    adcl src2Hi,tmpHi
    seto dst

The "tmpLo" destination in the "addl src2Lo,tmpLo" instruction is dead
and so the instruction would be eliminated; in turn, the "tmpLo"
destination in the "movl src1Lo,tmpLo" instruction is dead and so the
instruction would be eliminated.  However, although the "tmpHi"
destination in the "adcl src2Hi,tmpHi" instruction is dead, the
instruction is preserved because of the subsequent "seto"
instruction.  The end result would be:

    movl src1Hi,tmpHi
    adcl src2Hi,tmpHi
    seto dst

which fails to set "dst" appropriately.

Now, in addition to preserving instructions that are immediately
followed by a `setcc` instruction or an `if` transfer, an instruction
immediately preserved by a `adc` or `sbb` instruction is preserved.
A `mov` instruction does not change the condition code flags, so continue
looking for flag-dependent instructions past a `mov` instruction.
Previously, a `Word64_negCheckP` was initially translated to:

    movl $0,tmp
    subl srcLo,tmp
    movl $0,tmp
    sbbl srcHi,tmp
    seto dst

With commits 97315a4 and 3a0a42d, the "subl srcLo,tmp" instruction is
not eliminated by the `elimDeadDsts` peephole optimization.  However,
register allocation unconditionally rewrites `movl $0,dst` as
`xorl dst,dst`, leaving code like:

    xorl tmp,tmp
    subl srcLo,tmp
    xorl tmp,tmp
    sbbl srcHi,tmp

However, while a `mov` preserves the condition code flags, a `xorl`
instruction clears the CF flag, which leads to incorrect flags being
used and set by the `sbb` instruction.

Similarly, a `Word64_addCheckP` was initially translated to:

    movl src1Lo,tmp
    addl src2Lo,tmp
    movl src1Hi,tmp
    adcl src2Hi,tmp
    seto dst

which, if src1Hi were 0, would result in

    movl src1Lo,tmp
    addl src2Lo,tmp
    xorl tmp,tmp
    adcl src2Hi,tmp

which leads to incorrect flags being used and set by the `adc` instruction.

The existing 64-bit primitives avoid this problem by performing all of the `mov`
instructions before the arithmetic instruction.

Now, a `Word64_negCheckP` primitive is initially translated to:

    movl $0,tmpLo
    movl $0,tmpHi
    subl srcLo,tmpLo
    sbbl srcHi,tmpHi
    seto dst

which can be correctly rewritten to:

    xorl tmpLo,tmpLo
    xorl tmpHi,tmpHi
    subl srcLo,tmpLo
    sbbl srcHi,tmpHi
    seto dst

and a `Word64_addCheckP` primitive is initially translated to:

    movl src1Lo,tmpLo
    movl src1Hi,tmpHi
    addl src2Lo,tmpLo
    adcl src2Hi,tmpHi
    seto dst

which, if src1Hi were 0, can be correctly rewritten to:

    movl src1Lo,tmpLo
    xorl tmpHi,tmpHi
    addl src2Lo,tmpLo
    adcl src2Hi,tmpHi
@MatthewFluet MatthewFluet merged commit 2dc3968 into MLton:master Oct 15, 2018
MatthewFluet added a commit that referenced this pull request Oct 15, 2018
Add new overflow-checking primitives

Motivation

MLton currently implements checked arithmetic with specialized primitives.
These primitives are exposed to the Basis Library implementation as functions
that implicitly raise a primitive exception:

    val +! = _prim "WordS8_addCheck": int * int -> int;

In the XML IR, special care is taken to "remember" the primive exception
associated with these primitives in order to implement exceptions correctly.  In
the SSA, SSA2, RSSA, and Machine IRs, these primitives are treated as transfers,
rather than statements.

This pull request implements a possibly better implementation of these
operations as simple primitives that return a boolean:

    val +! = _prim "Word8_add": int * int -> int;
    val +? = _prim "WordS8_addCheckP": int * int -> bool;
    val +$ = fn (x, y) => let val r = +! (x, y)
                          in  if +? (x, y) then raise Overflow else r
                          end

This would eliminate the special cases in the XML IR and in the SSA, SSA2, RSSA,
and Machine IRs, where the primitives would be treated as statements.  Other
compilers provide overflow checking via boolean-returning functions:

 * https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html
 * https://llvm.org/docs/LangRef.html#arithmetic-with-overflow-intrinsics

Implementation

This patch refactors the primitive checked-arithmetic operations such that the
suffix `?` represents a new overflow-checking predicate, a suffix `!` represents
the non-overflow-checking variant, and a suffix `$` represents the
overflow-checking variant (mnemonic: `$` for "safe" or "expensive"). The
behavior of the new `$`-operations is controlled by a compile-time constant,
`MLton.newOverflow`.  When set to false (the default), the `$`-operations make
use of the old-style implicit-`Overflow` primitives.  When set to true, the
`$`-operations are implemented as an `if`-expression that branches on the result
of the corresponding `?`-operation and either raises the `Overflow` exception or
returns the result of the corresponding `!`-operation.  Finally, the bare
operation is aliased to either the `$`-form (with overflow detection enabled) or
the `!`-form (with overflow detection disabled).  Essentially:

    val +! = _prim "Word8_add": int * int -> int;
    val +? = _prim "WordS8_addCheckP": int * int -> bool;
    val +$ = if MLton.newOverflow
             then fn (x, y) => let val r = +! (x, y)
                               in  if +? (x, y) then raise Overflow else r
                               end
             else fn (x, y) => ((_prim "WordS8_addCheckP": int * int -> int;) (x, y))
                               handle PrimOverflow => raise Overflow
    val + = if MLton.detectOverflow then +$ else +!

Note that the checked-arithmetic using `!`- and `$`-operations is careful to
perform the `!`-operation before the `$`-operation.  With the native-codegens, a
new peephole optimization combines the separate unchecked-arithmetic operation
and checked-arithmetic-predicate operation into a single instruction.  For the
C-codgen, the new checked-arithmetic-predicate primitives are translated to uses
of the `__builtin_{add,sub,mul}_overflow` intrinsics (which improves upon the
previous explicit conditional checking, but requires gcc 5 or greater).
Similarly, for the LLVM-codgen, the new checked-arithmetic-predicate primitives
are translated to uses of the `{sadd,uadd,smul,umul,ssub,usub}.with.overflow`
intrinsics.  For both the C- and LLVM-codegens, it is expected that these
intrinsics will be combined with the separate unchecked-arithmetic operation.

In addition, the `RedundantTests` optimization has been extended to eliminate
the overflow test when adding or subtracting 1 with the new primitives.

Performance

The native-codegen peephole optimization and `RedundantTests` have been mostly
sufficient to keep performance on par with the older checked-arithmetic
primitives, and in some cases performance has even significantly improved.  Below
is a summary of the exceptional runtime ratios in the different codegens (both
positive and negative):

| Benchmark       | Native | LLVM |    C |
|-----------------|--------|------|------|
| even-odd        |   1.00 | 1.00 | 1.09 |
| hamlet          |   0.98 | 0.90 | 0.93 |
| imp-for         |   0.99 | 1.50 | 0.46 |
| lexgen          |   0.92 | 1.31 | 1.24 |
| matrix-multiply |   0.99 | 1.00 | 0.87 |
| md5             |   1.06 | 1.01 | 0.97 |
| tensor          |   1.01 | 1.00 | 0.57 |

No benchmarks were consistently worse or better on all codegen, e.g., the
`imp-for` benchmark performed exceptionally badly on the LLVM codegen, but was
much faster on the C codegen and about even on the native codegen.  For this
particular benchmark, the cause of the slowdown with the LLVM codegen has yet to
be discovered.  Similarly, the cause of the slowdown in `lexgen` with the C- and
LLVM-codegens is unknown.  For the `md5` benchmark, on the other hand, the cause
of the slowdown with the native codegen seems to be a failure to eliminate
common subexpressions in certain circumstances, which can potentially be
improved in the future.  Improvements in the C-codegen are likely to be due to
the better overflow checking with builtins.

Future work

The `CommonSubexp` optimization currently handles `Arith` transfers specially;
in particular, the result of an `Arith` transfer can be used in common `Arith`
transfers that it dominates.  This was done so that code like:

    (n + m) + (n + m)

can be transformed to

    let val r = n + m in r + r end

With the new checked-arithmetic-predicate primitives, the computation of the
boolean value may be common-subexpr eliminated, but `Case` discrimination will
not.  This forces the boolean value to be reified and to be discriminated
multiple times (though, perhaps `KnownCase` could eliminate subsequent
discriminations).  Extending `CommonSubexpr` to propagate flow-sensitive
relationships at `Case` transfers to the dominated targets could improve the
performance `md5` with `MLton.newOverflow true` and potentially improve
performance elsewhere as well (e.g., by propagating the results of comparisons
as well).

Once all performance slowdowns with `MLton.newOverflow true` have been
eliminated, it would be desirable to remove the old-style implicit-`Overflow`
primitives and `Arith` transfers.  This would eliminate many previous instances
of special-case code to handle these primitives and transfers.

Finally, it may be worth investigating an implementation of the
checked-operations via

    val +$ = fn (x, y) => let val b = +? (x, y)
                              val r = +! (x, y)
                          in  if b then raise Overflow else r
                          end

rather than

    val +$ = fn (x, y) => let val r = +! (x, y)
                          in  if +? (x, y) then raise Overflow else r
                          end

The advantage of calculating the boolean first is that when `x` (or `y`) is a
loop variable and `r` will be passed as the value for the next iteration, then
both `x` and `r` could be assigned the same location (pseudo-register or stack
slot).  `x` and `r` cannot share the same location when the boolean is
calculated second, because `x` and `r` are both live at the calculation of the
boolean.  See #218.  This would require reworking the native-codegen
peephole optimization.
MatthewFluet added a commit that referenced this pull request Mar 21, 2019
Remove old-style arithmetic primitives

In #273, MLton was extended to support
overflow-checking primitives for arithmetic operations. This
allowed checked arithmetic operations to be encoded as an
unchecked arithmetic operation followed by an overflow-checking
operation followed by an `if` that either raises the `Overflow`
exception or returns the arithmetic result.

However, the overflow primitives (that implicitly raise a special
`PrimOverflow` exception in the early ILs and are used in `Arith`
transfers in the late ILs) remained in the compiler.

This pull request removes the old overflow primitives entirely,
along with the special-case code required to support them.
MatthewFluet added a commit to MatthewFluet/mlton that referenced this pull request Nov 5, 2019
It appears that GCC (and, to a lesser extent) Clang/LLVM do not always
successfully fuse adjacent `Word<N>_<op>` and
`Word{S,U}<N>_<op>CheckP` primitives.  The performance results
reported at MLton#273 and
MLton#292 suggest that this does not
always have significant impact, but a close look at the `md5`
benchmark shows that the native codegen significantly outperforms the
C codegen with gcc-9 due to redundant arithmetic computations (one for
`Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`).

This flag will be used to enable explicit fusing of adjacent
`Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives in the
codegens.
MatthewFluet added a commit to MatthewFluet/mlton that referenced this pull request Nov 5, 2019
It appears that GCC (and, to a lesser extent) Clang/LLVM do not always
successfully fuse adjacent `Word<N>_<op>` and
`Word{S,U}<N>_<op>CheckP` primitives.  The performance results
reported at MLton#273 and
MLton#292 suggest that this does not
always have significant impact, but a close look at the `md5`
benchmark shows that the native codegen significantly outperforms the
C codegen with gcc-9 due to redundant arithmetic computations (one for
`Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`).

These functions compute both the arithmetic result and a boolean
indicating overflow (using `__builtin_<op>_overflow`).  They will be
used for explicit fusing of adjacent `Word<N>_<op>` and
`Word{S,U}<N>_<op>CheckP` primitives in the C codegen for
`-codegen-fuse-op-and-check true`.
MatthewFluet added a commit to MatthewFluet/mlton that referenced this pull request Nov 5, 2019
It appears that GCC (and, to a lesser extent) Clang/LLVM do not always
successfully fuse adjacent `Word<N>_<op>` and
`Word{S,U}<N>_<op>CheckP` primitives.  The performance results
reported at MLton#273 and
MLton#292 suggest that this does not
always have significant impact, but a close look at the `md5`
benchmark shows that the native codegen significantly outperforms the
C codegen with gcc-9 due to redundant arithmetic computations (one for
`Word{S,U}<N>_<op>CheckP` and another for `Word<N>_<op>`).

(Note: Because the final md5 state is not used by the `md5` benchmark
program, MLton actually optimizes out most of the md5 computation.
What is left is a lot of arithmetic from `PackWord32Little.subVec` to
check for indices that should raise `Subscript`.)

For example, with `-codegen-fuse-op-and-check false` and gcc-9, the
`transform` function of `md5` has the following assembly:

	movl	%r9d, %r10d
	subl	$1, %r10d
	jo	.L650
	leal	-1(%r8), %r10d
	movl	%r10d, %r12d
	addl	%r10d, %edx
	jo	.L650
	addl	%r10d, %r11d
	cmpl	%eax, %r11d
	jnb	.L656
	movl	%ebp, %edx
	addl	$1, %edx
	jo	.L659
	leal	1(%rcx), %edx
	movl	%edx, %r11d
	imull	%r9d, %r11d
	jo	.L650
	imull	%r8d, %edx
	movl	%edx, %r11d
	addl	%r10d, %r11d
	jo	.L650
	leal	(%rdx,%r10), %r11d
	cmpl	%eax, %r11d
	jnb	.L665

What seems to have happened is that gcc has arranged for equivalent
values to be in `%r8` and `%r9`.  In the first three lines, there is
an implementation of `WordS32_subCheckP (X, 1)` using `subl/jo`, while
in the fourth line, there is an implementation of `Word32_sub (X, 1)`
using `lea` with an offset of `-1`.  Notice that `%r10` is used for
the result of both, so the fourth line is redundant (the value is
already in `%r10`).

On the other hand, with `-codegen-fuse-op-and-check true` and gcc-9,
the `transform` function of `md5` has the following assembly:

	movl	%r8d, %r9d
	subl	$1, %r9d
	jo	.L645
	addl	%r9d, %ecx
	jo	.L645
	cmpl	%edx, %ecx
	jnb	.L651
	movl	%eax, %ecx
	addl	$1, %ecx
	jo	.L654
	imull	%r8d, %ecx
	jo	.L645
	addl	%r9d, %ecx
	jo	.L645
	cmpl	%edx, %ecx
	jnb	.L660
MatthewFluet added a commit that referenced this pull request Nov 22, 2019
Updates to C and LLVM codegens. Highlights:

* Add `Machine.Program.rflow` to compute `{returns,raises}To` control
  flow (654c557) and use in `functor Chunkify` (1b3b7b8) and in
  Machine IR `Raise/Return` transfers (cf8e487).
* Add `chunk-jump-table {false|true}` compile-time option to force
  generation of a jump table for the chunk switch (8e0dd2d,
  5b6439b, 087a5b1).
* Add `-chunk-{{must,may}-rto-self,must-rto-sing,must-rto-other}-opt`
  compile-time options to optimize return/raise transfers (7c10c70,
  4d5abde, 4b7c649, c3b9905, 473808f)
* Experiment using LLVM's `cc10` (aka, `ghccc`) calling convention
  (2e26ebd).
* Experiment with a new `simple` chunkify strategy (3330cbe,
  3d9c499, 138512f, faef164, d1df0de); generally performs
  about the same as `coalesce4096`, significantly improves `fib` and
  `tak` (for GCC), slightly improves `hamlet`, but slightly worsens
  `raytrace`:

  config command                                                                                                                          
  C04    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9                                                           
  C05    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 -chunkify simple                                          
  C09    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang                                                        
  C10    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang -chunkify simple                                       

  task_clock [email protected] (2-level)
  program           `C05/C04` `C10/C09`
  barnes-hut           0.9978    0.9589
  boyer                1.064     1.076 
  checksum             1.051     0.9775
  count-graphs         1.005     0.9876
  DLXSimulator         1.000     0.9905
  even-odd             1.037     0.9989
  fft                  0.9616    0.9537
  fib                  0.6689    0.6260
  flat-array           1.000     0.9645
  hamlet               0.9547    0.9322
  imp-for              1.067     1.014 
  knuth-bendix         1.092     1.031 
  lexgen               1.031     1.078 
  life                 1.002     0.9911
  logic                1.016     1.015 
  mandelbrot           0.9776    1.030 
  matrix-multiply      0.9903    0.9844
  md5                  1.008     0.9940
  merge                0.9927    1.062 
  mlyacc               0.9810    1.024 
  model-elimination    0.9877    0.9743
  mpuz                 1.011     1.010 
  nucleic              1.036     1.030 
  output1              0.9943    1.021 
  peek                 1.036     1.027 
  pidigits             1.000     0.9653
  psdes-random         1.009     1.014 
  ratio-regions        0.9985    0.9881
  ray                  0.9738    0.9601
  raytrace             1.101     1.100 
  simple               0.9620    0.9272
  smith-normal-form    0.9690    0.9806
  string-concat        0.9610    0.9772
  tailfib              1.006     0.9292
  tailmerge            0.9847    1.023 
  tak                  0.8264    1.013 
  tensor               1.010     0.9998
  tsp                  0.9981    1.010 
  tyan                 1.045     1.027 
  vector-rev           1.012     0.9891
  vector32-concat      0.9495    1.030 
  vector64-concat      1.098     0.9744
  vliw                 0.9413    1.019 
  wc-input1            0.9301    1.098 
  wc-scanStream        1.114     0.9234
  zebra                1.008     1.001 
  zern                 0.9819    1.014 
  MIN                  0.6689    0.6260
  GMEAN                0.9940    0.9912
  MAX                  1.114     1.100 

  The `simple` chunkify strategy is not (yet) suitable for a
  self-compile; it can generate excessively large chunks, including
  one for a self-compile that requires 8min to compile by `gcc`.
* Add `expect: WordX.t option` to RSSA and Machine `Switch.T`
  (911b5d4, e2b27ab, 695320d) and add `-gc-expect
  {none|false|true}` compile-time option, where `-gc-expect false`
  should indicate that performing a GC is cold path (823815a); no
  notable performance impact.
* Lots of tweaks to C codegen, ultimately eliminating almost all
  `c-chunk.h` macros.
* Eliminate unused `Machine.Operand.Contents` constructor (006269b).
* Make a major refactoring of LLVM codegen (cec30c5).
* Implement `Real<N>_qequal` for C codegen (9b7b2bd) and use
  `is{less,lessequal}` for `Real<N>_l{t,e}` for C codegen (7b55819).
* Generalize LLVM type-based alias-analysis (27709ef).
* Add `-llvm-aamd scope` for simple `noalias`/`alias.scope`
  alias-analysis metadata in LLVM codegen (b825f56); no notable
  performance impact.
* Use C99/C11 `inline` for primitive and Basis Library functions
  (311331c, c864492, 4f2d213).
* Add `-codegen-fuse-op-and-chk {false|true}` compile-time option to
  explicitly fuse adjacent `Word<N>_<op>` and
  `Word{S,U}<N>_<op>CheckP` primitives in the C and LLVM codegens
  (6b738b8, 3d1e89c, 68f8512, 82c019f, 61de560, 5363199,
  0d46a85).  It appears that GCC (and, to a lesser extent)
  Clang/LLVM do not always successfully fuse adjacent adjacent
  `Word<N>_<op>` and `Word{S,U}<N>_<op>CheckP` primitives.  The
  performance results reported at
  #273 and
  #292 suggest that this does not
  always have significant impact, but sometimes
  `-codegen-fuse-op-and-chk true` can have a positive.  Unfortunately,
  it can also have a (significant) negative impact.  In
  `matrix-multiply` and `vector-rev`, fusing can cause GCC to not
  recognize that an explicit sequence index can be replaced by a
  stride length; in these benchmarks, it would be nice if MLton
  eliminated the overflow checks.

  config command                                                                                                                          
  C04    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9                                                           
  C09    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang                                                        
  C11    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9 -codegen-fuse-op-and-chk true                             
  C15    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang -codegen-fuse-op-and-chk true                          

  task_clock [email protected] (2-level)
  program           `C11/C04` `C15/C09`
  barnes-hut           1.005     0.9925
  boyer                1.052     1.013 
  checksum             1.022     1.028 
  count-graphs         0.9722    1.002 
  DLXSimulator         1.004     0.9959
  even-odd             0.8768    1.003 
  fft                  0.9592    1.016 
  fib                  0.9732    0.9798
  flat-array           0.8148    1.019 
  hamlet               0.9966    1.030 
  imp-for              0.8993    0.7985
  knuth-bendix         1.008     1.013 
  lexgen               0.9851    1.043 
  life                 0.9954    1.006 
  logic                0.9994    1.014 
  mandelbrot           0.9440    1.013 
  matrix-multiply      1.336     1.009 
  md5                  0.9604    1.007 
  merge                0.9675    1.037 
  mlyacc               1.032     1.029 
  model-elimination    1.010     1.004 
  mpuz                 1.035     0.9599
  nucleic              0.9938    0.9983
  output1              0.9278    0.9709
  peek                 0.9850    1.035 
  pidigits             0.9702    0.9538
  psdes-random         1.017     0.9986
  ratio-regions        0.9801    0.9887
  ray                  0.9795    1.009 
  raytrace             0.9959    1.026 
  simple               0.9764    1.010 
  smith-normal-form    1.002     1.049 
  string-concat        0.7919    0.9035
  tailfib              1.030     1.227 
  tailmerge            1.017     0.9980
  tak                  0.9790    0.9988
  tensor               0.5258    1.000 
  tsp                  0.9845    1.013 
  tyan                 1.019     0.9739
  vector-rev           1.178     1.253 
  vector32-concat      0.8703    0.9230
  vector64-concat      0.8906    0.9038
  vliw                 0.9921    1.044 
  wc-input1            1.060     0.9809
  wc-scanStream        0.9166    1.040 
  zebra                1.008     1.020 
  zern                 1.051     1.089 
  MIN                  0.5258    0.7985
  GMEAN                0.9720    1.007 
  MAX                  1.336     1.253 

  Note: the issue with `md5` mentioned in the commit messages are with
  respect to the `md5` benchmark before 2daaebf.

Overall, this simplifies the C and LLVM codegen slightly, although
there is little significant performance change:

config command                                                                                                                          
C02    /home/mtf/devel/mlton/builds/g89891a411/bin/mlton -codegen c -cc gcc-9                                                           
C04    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen c -cc gcc-9                                                           
C08    /home/mtf/devel/mlton/builds/g89891a411/bin/mlton -codegen llvm -cc clang                                                        
C09    /home/mtf/devel/mlton/builds/g098009d49/bin/mlton -codegen llvm -cc clang                                                        

task_clock [email protected] (2-level)
program           `C04/C02` `C09/C08`
barnes-hut           1.036     1.025 
boyer                0.9731    1.006 
checksum             0.9652    1.002 
count-graphs         0.9988    0.9964
DLXSimulator         0.9970    1.023 
even-odd             1.002     0.9881
fft                  1.026     0.9674
fib                  0.9034    0.7846
flat-array           1.014     1.021 
hamlet               0.9740    1.010 
imp-for              0.9707    0.9908
knuth-bendix         0.9077    0.9777
lexgen               1.048     0.8985
life                 1.002     0.9827
logic                1.006     0.9867
mandelbrot           1.000     1.011 
matrix-multiply      1.020     0.9957
md5                  0.9700    0.9960
merge                0.9974    0.9818
mlyacc               1.003     0.9824
model-elimination    0.9936    0.9817
mpuz                 0.9815    0.9466
nucleic              0.9946    1.002 
output1              1.007     1.026 
peek                 0.9832    0.9898
pidigits             0.9950    1.047 
psdes-random         1.009     0.9869
ratio-regions        0.9978    0.9725
ray                  0.9938    0.9663
raytrace             0.9975    1.032 
simple               0.9936    1.000 
smith-normal-form    1.038     0.9941
string-concat        1.041     1.014 
tailfib              0.9865    0.9741
tailmerge            1.010     1.020 
tak                  0.9331    0.9041
tensor               0.9938    0.9941
tsp                  0.9825    1.004 
tyan                 0.9960    0.9879
vector-rev           1.014     0.9091
vector32-concat      1.090     0.9016
vector64-concat      0.9994    0.9800
vliw                 0.9995    0.9876
wc-input1            0.9685    0.8634
wc-scanStream        1.178     1.105 
zebra                0.9857    0.9900
zern                 0.9733    0.9890
MIN                  0.9034    0.7846
GMEAN                0.9982    0.9815
MAX                  1.178     1.105
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants