
x64: Wide operations used for some ALU ops instead of precisely-sized ones #10199

Open
alexcrichton opened this issue Feb 6, 2025 · 4 comments
Labels
cranelift:area:x64 Issues related to x64 codegen

Comments

@alexcrichton
Member

I wanted to make a dedicated issue to continue the discussion from #10110 (comment).

Currently Wasmtime on x64 switches to using only 32/64-bit ALU ops here, ignoring 8/16-bit input types and using the wider operation instead. This leads to what I personally find a confusing pattern where sometimes the type is used and sometimes it isn't. We already have to get everything correct for sunk operands, as those are required to operate at the precise width, and I'm not sure what the benefit is of using a 32-bit instruction rather than an 8-bit instruction.
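For concreteness, a minimal sketch of the two possible lowerings for an 8-bit `band` (assuming the register allocator picked `rax`/`rcx`; not necessarily the exact code Cranelift emits):

```asm
; Exact-width lowering: operate on the low 8 bits only.
and al, cl        ; writes AL; the upper 56 bits of RAX are merged in

; Widened lowering (current behavior): use the 32-bit form for i8/i16.
and eax, ecx      ; writes EAX; the upper 32 bits of RAX are zeroed
```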

In the linked thread @cfallin mentions:

> I don't know about performance implications -- I do recall vaguely something about "partial-register stalls" if one updates only a part of the destination register but maybe it's fine if the instruction already depends on the register as an input. Assuming that's not a concern (@abrown confirm?) then this seems fine with me.

I know this came up with high-latency instructions like sqrt, where the problem we ran into was that a false dependency was created between instructions, since some instructions operate on the full xmm width and some don't. I'm not sure if this is a problem for (what I assume are) low-latency instructions like and. Additionally I'm not sure whether smaller-than-64-bit-width instructions preserve the upper bits or sign/zero-extend (I couldn't figure it out from the docs).
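For reference, the classic shape of that xmm problem looks something like this (a sketch of the well-known pattern, not necessarily our exact lowerings):

```asm
; sqrtss writes only the low 32 bits of xmm0 and merges in the rest, so
; it carries a false dependency on whatever instruction last wrote xmm0.
sqrtss xmm0, xmm1

; Compilers commonly break that dependency with a zeroing idiom first:
xorps  xmm0, xmm0
sqrtss xmm0, xmm1
```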

Otherwise, though, I would expect that even today, where we clamp at 32 bits, we still have this problem. That means we're already emitting smaller-than-register-width operations, which at least in theory can create false dependencies.

Basically I view the current state as a bit of a weird in-between of two worlds we could possibly be in:

  1. Always use exact-width instructions
  2. Always use 64-bit-width instructions unless memory is involved, in which case be precise.

Personally I feel like we should lean towards (1), under the assumption that it doesn't hurt performance.

@alexcrichton added the cranelift:area:x64 label Feb 6, 2025
@cfallin
Member

cfallin commented Feb 6, 2025

I did a bit more digging regarding partial-register stalls. I've found conflicting notes online, and first-principles thinking suggests that `andb %rA, %rB` should be no worse than `andl %rA, %rB` from a data-dependency perspective: both read (some part or all of) rA and rB as sources and write rB. Updating only the low part of rB doesn't change that; said update-merge would require the unchanged high bits of rB as input, but we're already reading rB as input.

However, some sources indicate that a separate merge uop is needed on some microarchitectures when a narrow operation occurs -- internally the instruction is thus executed as the 8-bit AND followed by an "insert these 8 bits into the low 8 bits of the original 64-bit register" micro-instruction.

I've found evidence of this even on very new processors -- on my two-month-old Zen 5 machine (Ryzen 9950X) I see that this code

```asm
bits 64

global main
main:
    mov rax, 10*1000*1000*1000
    xor rcx, rcx

_l:
    and dl, cl
    sub rax, 1
    jnz _l

    ret
```

runs about 5% slower than this code

```asm
bits 64

global main
main:
    mov rax, 10*1000*1000*1000
    xor rcx, rcx

_l:
    and edx, ecx
    sub rax, 1
    jnz _l

    ret
```

(build both with `nasm -felf64 test.asm -o test.o ; gcc test.o -o test`). This indicates that some sort of merging is still happening.

The 32-bit case is different because AMD was wise in defining 32-bit ops to clear the upper bits of a 64-bit register when in long mode. But it seems that for the 8- and 16-bit cases there is merging behavior that we want to avoid. The memory cases are different because the operand size matters for the load width, of course.
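To spell out the architectural rule (these register-write semantics are defined by x86-64 itself):

```asm
mov rax, -1     ; RAX = 0xffffffffffffffff
mov eax, 1      ; 32-bit write: upper 32 bits cleared, RAX = 0x0000000000000001
mov ax, 2       ; 16-bit write: only AX changes, upper 48 bits merged/preserved
mov al, 3       ; 8-bit write: only AL changes, upper 56 bits merged/preserved
```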

This all suggests that we should keep the status quo in our lowering rules, and also maybe reconsider our pattern around SETcc: `setnz al` when reifying a condition into a flag (icmp without branch fusion) is going to have a false dependency on whatever was previously in rax. We can break it by using a zeroing idiom recognized by the renamer, e.g. `xor rax, rax`.
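A sketch of that SETcc pattern and the proposed fix (hypothetical registers, not our exact lowering; note the zeroing idiom has to precede the compare, since `xor` clobbers the flags):

```asm
; Today: setnz writes only AL, merging with -- and thus falsely
; depending on -- whatever previously wrote RAX.
cmp   rdi, rsi
setnz al
movzx eax, al

; With the dependency-breaking idiom placed before the compare:
xor   eax, eax    ; recognized by the renamer; severs the dependency
cmp   rdi, rsi
setnz al
```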

@cfallin
Member

cfallin commented Feb 6, 2025

(For completeness, this would be almost your option 2, except "use 32-bit instructions when the type is 8/16/32-bit"; at least in some edge cases, e.g. divides, this does have an impact on latency vs. the 64-bit width, and I'd prefer to keep a consistent "narrower than 64 bits uses 32 bits" rule rather than make exceptions.)
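For instance, the two divide forms below differ in latency on many microarchitectures (a sketch; exact numbers vary by core):

```asm
; 32-bit divide: EDX:EAX / r32 -- lower latency on most cores.
xor edx, edx
mov eax, edi
div ecx            ; quotient -> eax, remainder -> edx

; 64-bit divide: RDX:RAX / r64 -- considerably slower on many cores,
; so widening a 32-bit divide to 64 bits would cost real latency.
xor edx, edx
mov rax, rdi
div rcx            ; quotient -> rax, remainder -> rdx
```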

@abrown
Contributor

abrown commented Feb 6, 2025

I thought about this for a while during #10110 and my mind went to "better automated benchmarking." If it weren't such a pain to measure, we would probably just make this decision based on real data. Like @cfallin, I suspected we wanted to break the false dependency, but the overall effect is tedious to measure and not always conclusive.

As you all know, I built a bunch of "benchmark results over time" infrastructure for Sightglass some time ago but abandoned the effort once it became clear I would have to maintain a benchmarking server in perpetuity. @jlb6740 tried to get this kind of thing automated into our CI, but the integration friction is likely what discourages its use. What about codspeed, though? Their criterion-compatible shim uses Valgrind for what they claim are stable results, and they're working on a wall-time version as well. Here's an example of the "results over time" they capture: example.

This crazy benchmarking talk may seem like a tangent, but I think measuring "results over time" is what is going to help us most with this kind of issue and with finding other issues of the same sort.

@alexcrichton
Member Author

> The 32-bit case is different because AMD was wise in defining 32-bit ops to clear the upper bits of a 64-bit register when in long mode

Whoa, I had no idea! So both 32-bit and 64-bit instructions sever data dependencies on the previous contents of the register. I'd definitely agree with you that our current lowerings are probably ideal and we shouldn't change them.

> I thought about this for a while during #10110 and my mind went to "better automated benchmarking." ... What about codspeed, though? ... I think measuring "results over time" is what is going to help us most with this kind of issue and with finding other issues of the same sort.

I completely agree with your thinking here, and I would love to see continuous benchmarking over time. I've never used codspeed myself, but if it can run in our own CI and generate stable reports I see no reason not to use it. We'd probably have to think a bit harder about what exactly to benchmark, since right now I believe cargo bench takes a few hours to execute by default and doesn't include anything from Sightglass. That being said, we can likely pretty easily start folding things in and tweaking the defaults to be more useful.
