Implement a draft for faster reordering #131

Draft: wants to merge 1 commit into main

semi-h (Member) commented Dec 3, 2024

Relevant to #78.

When benchmarking the transport equation I realised that reorders, including accumulations (sum_intox), take nearly 50% of the runtime, while the bandwidth-based model estimates 24%. This is because the current implementation achieves only around 10-20% of the available bandwidth, whereas a verbose implementation that spells out the mapping rule explicitly for each operation can reach up to 60%.

Ultimately, fixing #40, #66, and #78 improved the time-to-solution for a single 1024^3 transeq call on a single ARCHER2 node from 8.28 seconds to 4.52 seconds. Before #40 and #66 were implemented, the whole transeq call took 8.28 seconds, of which the reorders including accumulations took 3.68 seconds, or 44% of the total runtime. Fixing this was key to sustaining ~66% of the peak theoretical bandwidth on ARCHER2, which brought the timings in line with what we expect from the bandwidth model.
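As a quick sanity check on these figures, the reorder share and the overall speedup can be recomputed from the three timings quoted above. This is only arithmetic on the quoted numbers; the variable names are illustrative and not taken from the code base.

```python
# Timings quoted above, in seconds, for one 1024^3 transeq call on one node.
total_before = 8.28    # whole transeq before the fixes
reorder_before = 3.68  # reorders including accumulations, before the fixes
total_after = 4.52     # whole transeq after the fixes

# Fraction of the original runtime spent in reorders (about 0.44).
reorder_share = reorder_before / total_before

# Overall speedup of the transeq call (about 1.83x).
speedup = total_before / total_after

print(f"reorder share before: {reorder_share:.1%}")
print(f"overall speedup:      {speedup:.2f}x")
```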

I think this verbose implementation, which describes each mapping explicitly, speeds up reorders by around 4x, which is quite important for our target speedup on CPUs.
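To illustrate the difference between the two styles, here is a minimal sketch in Python. It is not the code in this PR, and it shows the structure rather than the performance. It assumes a hypothetical flat layout in which each direction has its own fastest-varying axis; `flat_index`, `reorder_generic` and `reorder_x2y` are made-up names. The generic version resolves the mapping through a branching index function for every element, while the explicit version hard-codes one mapping as a loop nest with contiguous writes.

```python
DIR_X, DIR_Y, DIR_Z = 0, 1, 2  # hypothetical direction tags


def flat_index(direction, i, j, k, nx, ny, nz):
    """Flat offset of (i, j, k) when `direction` is the fastest-varying axis.

    Assumed layouts (illustrative only):
      x-layout: i fastest, then j, then k
      y-layout: j fastest, then k, then i
      z-layout: k fastest, then i, then j
    """
    if direction == DIR_X:
        return i + nx * (j + ny * k)
    if direction == DIR_Y:
        return j + ny * (k + nz * i)
    if direction == DIR_Z:
        return k + nz * (i + nx * j)
    raise ValueError(f"unknown direction: {direction}")


def reorder_generic(src, dir_from, dir_to, nx, ny, nz):
    """Generic reorder: one loop nest, mapping resolved per element."""
    dst = [0.0] * len(src)
    for k in range(nz):
        for j in range(ny):
            for i in range(nx):
                i_src = flat_index(dir_from, i, j, k, nx, ny, nz)
                i_dst = flat_index(dir_to, i, j, k, nx, ny, nz)
                dst[i_dst] = src[i_src]
    return dst


def reorder_x2y(src, nx, ny, nz):
    """Explicit x-layout -> y-layout reorder with the mapping written out.

    The loop order follows the destination layout, so writes are contiguous
    and the index arithmetic has no branches.
    """
    dst = [0.0] * len(src)
    for i in range(nx):
        for k in range(nz):
            row_out = ny * (k + nz * i)  # start of the contiguous output row
            for j in range(ny):
                dst[row_out + j] = src[i + nx * (j + ny * k)]
    return dst
```

Both functions produce the same result for the x to y case. The explicit form needs one such routine per mapping, which is the verbosity trade-off; in a compiled language it presumably lets the compiler see the strides directly instead of going through a per-element index call.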

Happy to investigate a better approach closer to the current implementation, which is very neat, but I think we need performance close to what this suggested implementation achieves.

Any thoughts?

semi-h linked an issue on Dec 5, 2024 that may be closed by this pull request
Successfully merging this pull request may close these issues.

Investigate OMP reordering performance