Description
With one thread, the kokkos serial backend (kokkos --serial) runs slower than the standalone serial code (serial).
Looking at profiles, the function kernel_connect (in plugin-PixelTriplets) takes significantly more time in the kokkos version. The instructions-retired counts show that the kokkos version is clearly performing more operations.
The loops do not perform their iterations in the same order.
<outer loop from kokkos>
<kernel_connect function body>
for (int idx = firstCellIndex, nt = (nCells()); idx < nt; idx += leagueSize * blockDim) {
...
for (int j = first; j < numberOfPossibleNeighbors; j += stride) {
The serial version has no outer loop. However, printing the values for idx and j shows the same values get accessed, just in a different order between the versions.
Based on instructions retired, the kokkos version appears to be doing more work (by a factor of 2x or more) in routines like areAlignedRZ. But printing out how many times that routine is called gives the same count in both versions, so the amount of work should be the same.
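The check described above (same idx and j values, different order) can be sketched as follows. This is a minimal illustration and not the code from plugin-PixelTriplets: the names visitStrided, visitFlat and sameVisits are hypothetical, the team/thread decomposition is an assumed stand-in for the kokkos outer loop, and every cell is given the same number of neighbors for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using Visit = std::pair<int, int>;  // one (idx, j) combination

// Hypothetical stand-in for the kokkos-style traversal: each (team, thread)
// starts at its own cell and advances by leagueSize * blockDim.
std::vector<Visit> visitStrided(int nCells, int nNeighbors, int leagueSize, int blockDim) {
  assert(leagueSize > 0 && blockDim > 0);
  std::vector<Visit> visits;
  for (int team = 0; team < leagueSize; ++team) {
    for (int thread = 0; thread < blockDim; ++thread) {
      const int firstCellIndex = team * blockDim + thread;
      for (int idx = firstCellIndex; idx < nCells; idx += leagueSize * blockDim) {
        for (int j = 0; j < nNeighbors; ++j) {
          visits.emplace_back(idx, j);
        }
      }
    }
  }
  return visits;
}

// Hypothetical stand-in for the standalone serial traversal: no outer
// strided loop, cells are visited in increasing order.
std::vector<Visit> visitFlat(int nCells, int nNeighbors) {
  std::vector<Visit> visits;
  for (int idx = 0; idx < nCells; ++idx) {
    for (int j = 0; j < nNeighbors; ++j) {
      visits.emplace_back(idx, j);
    }
  }
  return visits;
}

// True when both traversals touch exactly the same (idx, j) combinations,
// the same number of times, regardless of order.
bool sameVisits(std::vector<Visit> a, std::vector<Visit> b) {
  std::sort(a.begin(), a.end());
  std::sort(b.begin(), b.end());
  return a == b;
}
```

Recording the visits in the real kernels and comparing them with sameVisits would confirm that the call counts match, which narrows the extra instructions down to the cost per call rather than the number of calls.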