x64: Wide operations used for some ALU ops instead of precisely-sized ones #10199
I did a bit more digging regarding partial register stalls. I've found conflicting notes online, and first-principles thinking suggests that However some sources indicate that a separate merge uop is needed on some microarchitectures when a narrow operation occurs -- internally this is thus executed as the 8-bit AND, then an "insert 8 bits into low 8 bits of original 64-bit register" microinstruction. I've found evidence of this even on very new processors -- on my two-month-old Zen 5 machine (Ryzen 9950X) I see that this code
runs about 5% slower than this code
(build both with The 32-bit case is different because AMD was wise in defining 32-bit ops to clear the upper bits of a 64-bit register when in long mode. But it seems that for the 8- and 16-bit cases, there is merging behavior that we want to avoid. The memory cases are different because the operand size matters for the load width, of course. This all suggests that we should keep the status quo in lowering rules, and also maybe reconsider our pattern around
(For completeness, this would be almost your option 2, except "use 32-bit instructions when 8/16/32-bit"; at least in some edge cases, e.g. divides, this does have an impact on latency vs. 64-bit width, and I'd prefer to keep a consistent "narrower than 64 uses 32 bits" rule than make exceptions.)
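To make the merging behavior concrete, here is a minimal Rust sketch of how writes of each width update a 64-bit register in long mode. The helper names (`write8`, `write16`, `write32`) are made up for illustration and are not part of Cranelift or Wasmtime. The sketch models architectural semantics only and says nothing about how a given microarchitecture implements the merge.

```rust
// Illustrative model of x86-64 register writes in long mode.
// All function names here are hypothetical, chosen for this sketch.

/// 8-bit write: only bits 0..8 change; the upper 56 bits of the old
/// register value are preserved, so the result depends on `old`.
fn write8(old: u64, value: u8) -> u64 {
    (old & !0xff) | u64::from(value)
}

/// 16-bit write: only bits 0..16 change; the upper 48 bits are preserved,
/// so the result again depends on `old`.
fn write16(old: u64, value: u16) -> u64 {
    (old & !0xffff) | u64::from(value)
}

/// 32-bit write: the result is zero-extended to 64 bits, so the old
/// register contents do not influence the new value at all.
fn write32(_old: u64, value: u32) -> u64 {
    u64::from(value)
}
```

Because `write8` and `write16` read `old`, a narrow write carries a data dependency on the previous register contents, while `write32` does not. That is the property the "narrower than 64 uses 32 bits" rule relies on.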
I thought about this for a while during #10110 and my mind went to "better automated benchmarking." If it wasn't such a pain to measure, we would probably just make this decision based on real data. Like @cfallin, I suspected we wanted to break the false dependency but the overall effect of this is tedious to measure and not always conclusive. As you all know I built a bunch of "benchmark results over time" infrastructure for Sightglass some time ago but abandoned the effort once it became clear I would have to maintain a benchmarking server in perpetuity. @jlb6740 tried to get this kind of thing automated into our CI, but the integration friction is likely what discourages its use. What about codspeed, though? Their criterion-compatible shim uses Valgrind for what they claim are stable results and they're working on a wall-time version as well. Here's an example of the "results over time" they capture: example. This crazy benchmarking talk may seem like a tangent, but I think measuring "results over time" is what is going to help us most with this kind of issue and with finding other issues of the same sort.
Whoa, I had no idea! So both 32-bit and 64-bit instructions sever data dependencies on the previous contents of the register. I'd definitely agree with you, then, that our current lowerings are probably ideal and we shouldn't change them.
I completely agree with your thinking here and I would love to see continuous benchmarking over time. I've never used codspeed myself, but if they can run in our own CI and generate stable reports I see no reason not to use them. We'd probably have to think a bit harder about what exactly to benchmark since right now I believe
I wanted to make a dedicated issue to continue discussion from #10110 (comment)
Currently Wasmtime on x64 switches to using only 32/64-bit ALU ops here, ignoring 8/16-bit types as input and using the wider operation instead. This leads to what I personally find a confusing pattern, where sometimes the type is used and sometimes it isn't. We already have to get everything correct for sunk operands, since those are required to operate on the precise width, and I'm not sure what the benefit is of using a 32-bit instruction rather than an 8-bit instruction.
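As a rough illustration of why using the wider operation is legal for these ALU ops, the Rust sketch below computes an 8-bit add two ways: with a 32-bit add followed by truncation, and with a precisely sized 8-bit add. The function names are hypothetical and this is not Cranelift's lowering code; `alu_width_bits` only restates a "narrower than 64 uses 32 bits" rule as an assumption.

```rust
// Hypothetical helpers for illustration only; not taken from Cranelift.

/// Operand width picked under an assumed "narrower than 64 uses 32 bits" rule.
fn alu_width_bits(ty_bits: u32) -> u32 {
    if ty_bits <= 32 { 32 } else { 64 }
}

/// 8-bit add performed with a 32-bit operation. The upper 24 bits of the
/// inputs may hold arbitrary garbage; only the low 8 bits of the result
/// are kept.
fn add8_via_32(a: u32, b: u32) -> u8 {
    a.wrapping_add(b) as u8
}

/// The same add performed at exactly 8 bits.
fn add8_exact(a: u8, b: u8) -> u8 {
    a.wrapping_add(b)
}
```

For add, sub, and, or, and xor, the low bits of the result depend only on the low bits of the inputs, so both versions agree whatever the upper bits contain. That does not hold for operations such as division or right shifts, where the upper bits feed into the low bits of the result and the inputs need to be extended first.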
In the linked thread @cfallin mentions:
I know this came up with high-latency instructions like sqrt where the problem we ran into was that a false dependency was created between instructions where some instructions operate on the full xmm width and some don't. I'm not sure if this is a problem for (what I assume are) low-latency instructions like
and
. Additionally, I'm not sure if smaller-than-64-bit-width instructions preserve upper bits or sign/zero extend (I couldn't figure it out from the docs). Otherwise, though, I would expect that even today, where we clamp at 32 bits, we still have this problem. That means we're already doing smaller-than-register-width operations which have the theoretical possibility of creating false dependencies.
Basically I view the current state as a bit of a weird in-between of two worlds we could possibly be in:
Personally I feel like we should lean towards (1) under the assumption it doesn't have bad performance.