Implement a draft for faster reordering #131
Relevant to #78.
When benchmarking the transport equation I realised that reorders, including accumulations (sum_intox), take nearly 50% of the runtime, while a bandwidth-based model estimates 24%. This is because the current implementation achieves only around 10-20% of the available bandwidth, whereas a verbose implementation that describes the mapping rule explicitly for each operation can reach up to 60%.
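To illustrate the difference, here is a minimal C sketch (not the actual solver code; array sizes, names, and the x-to-z mapping are all hypothetical) contrasting a generic reorder driven by a runtime permutation table with a verbose variant that spells out one specific mapping as a plain loop nest the compiler can optimise:

```c
#include <assert.h>
#include <string.h>

/* Illustrative sketch only. Sizes are small so the example runs quickly;
 * in practice the fields would be much larger (e.g. 1024^3). */
enum { NX = 8, NY = 8, NZ = 8 };

/* Generic reorder: dst is laid out with dimension perm[0] fastest,
 * perm[1] next, perm[2] slowest. Flexible and neat, but the indirect
 * indexing is opaque to the compiler and defeats vectorisation. */
static void reorder_generic(double *dst, const double *src, const int perm[3])
{
    const int n[3] = { NX, NY, NZ };
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NY; j++)
            for (int k = 0; k < NZ; k++) {
                const int idx[3] = { i, j, k };
                const int d = (idx[perm[2]] * n[perm[1]] + idx[perm[1]])
                              * n[perm[0]] + idx[perm[0]];
                dst[d] = src[(k * NY + j) * NX + i];
            }
}

/* Verbose reorder: the x-major -> z-major mapping written out explicitly,
 * with the innermost loop streaming through contiguous memory in src. */
static void reorder_x_to_z(double *dst, const double *src)
{
    for (int k = 0; k < NZ; k++)
        for (int j = 0; j < NY; j++)
            for (int i = 0; i < NX; i++)
                dst[(i * NY + j) * NZ + k] = src[(k * NY + j) * NX + i];
}
```

Both produce identical output for the x-to-z case (perm = {2, 1, 0}); the verbose variant just exposes the contiguous inner dimension, which is where the bandwidth gain comes from.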
Ultimately, fixing #40, #66, and #78 improved the time-to-solution for a single 1024^3 transeq call on a single node of ARCHER2 from 8.28 seconds to 4.52 seconds. Without #40 and #66 implemented, the whole transeq took 8.28 seconds, of which the reorders (including accumulations) took 3.68 seconds, accounting for 44% of the total runtime. Fixing this was key to sustaining ~66% of the theoretical peak bandwidth on ARCHER2, which ultimately brought the timings in line with our performance expectations from the bandwidth model.
I think this verbose implementation, describing the mapping explicitly for each reorder direction, improves the runtime of the reorders by around 4x, which is quite important for our target speedup on CPUs.
Happy to investigate a better approach closer in spirit to the current implementation, which is very neat, but I think we need performance close to what this suggested implementation achieves.
Any thoughts?