Replies: 2 comments 3 replies
What are your input layouts? Are you using the vectorizing copy policy?
My full global memory layout is (2, 24, 512, 128) with strides (1572864, 65536, 128, 1). However, after performing the calculations and copying back, the store is not vectorized. Here is my full code if needed.
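As a rough sanity check on that layout, here is a minimal host-side sketch of the alignment arithmetic. It is not CuTe code and not how CuTe decides vector widths internally; `max_vector_bits` is a hypothetical helper, and its rule (every non-contiguous stride must be a multiple of the vector length, the contiguous extent must be divisible by it, and the base pointer must be at least as aligned as the vector) is an assumption made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of CuTe): returns the widest access in bits
// (at most 128) that is aligned for every vector along the stride-1 mode,
// given strides in elements and the alignment known for the base pointer.
template <std::size_t Rank>
int max_vector_bits(const std::array<std::int64_t, Rank>& shape,
                    const std::array<std::int64_t, Rank>& stride,
                    std::size_t elem_bytes,
                    std::size_t base_align_bytes) {
    assert(elem_bytes > 0 && base_align_bytes > 0);
    // Try 16-byte, then 8-byte vectors; stop at a single element.
    for (std::size_t vec_bytes = 16; vec_bytes > elem_bytes; vec_bytes /= 2) {
        const auto vec_elems = static_cast<std::int64_t>(vec_bytes / elem_bytes);
        bool ok = (base_align_bytes % vec_bytes == 0);
        bool has_contiguous_mode = false;
        for (std::size_t i = 0; i < Rank && ok; ++i) {
            if (shape[i] == 1) continue;  // size-1 modes never move the address
            if (stride[i] == 1) {
                // Contiguous mode: its extent must split evenly into vectors.
                has_contiguous_mode = true;
                ok = (shape[i] % vec_elems == 0);
            } else {
                // Any other mode must step by a whole number of vectors.
                ok = (stride[i] % vec_elems == 0);
            }
        }
        if (ok && has_contiguous_mode) return static_cast<int>(vec_bytes * 8);
    }
    return static_cast<int>(elem_bytes * 8);  // fall back to scalar accesses
}
```

For the float layout (2, 24, 512, 128) with strides (1572864, 65536, 128, 1), this gives 128 bits when the base pointer is known to be 16-byte aligned, and 64 bits when only 8-byte alignment is known. So the layout itself does not rule out 128-bit stores; under these assumptions, a pair of 64-bit stores would be consistent with the store path only being able to prove 8-byte alignment, though that is a guess and not a diagnosis of the posted code.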
I’m trying to figure out why I can’t get a proper 128-bit vectorized store from registers to global memory using CuTe. When I load from global memory into registers, the compiler emits a nice LDG.E.128, like:
LDG.E.128 R8, [R8.64]
So far so good.
But when I try to store back to global memory from registers, even though I’m using a float4 (and the data should be 128-bit aligned), the SASS shows two separate instructions:
STG.E.64 [R2.64], R8

STG.E.64 [R2.64+0x8], R10
Not only is this not vectorized, but it also breaks memory coalescing across threads. I thought STG.E.128 was a valid instruction; am I misremembering? Or is there no support for STG.E.128 in CuTe (very unlikely), or am I missing something?