RFC: Specialize for non-mixed-dtype in elementwise_util #9388
Base: gh/swolchok/382/head
Conversation
See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9388. ✅ No failures as of commit 85451ea with merge base 1572381.
This is not ready for full review, but I'm leaving it as non-draft because I'd still like directional comments (hence the RFC tag). I have some more PRs that I will add to the stack once everything builds again.
```cpp
/// Return the one output type we are willing to emit specialized code
/// to handle, given a compute type of CTYPE_COMMON and supported
/// output types of out_dtypes.
template <typename CTYPE_COMMON>
```
this should be `CTYPE_COMPUTE`, right?
```cpp
((inputs.first->scalar_type() == compute_type) && ...);
```

```cpp
constexpr ScalarType out_specialized_scalar_type =
    specialized_output_scalar_type<CTYPE_COMPUTE>(out_dtypes);
```
here, `CTYPE_COMPUTE`
Mixed dtype should be uncommon. Here is how we can specialize for the common case. Prepares us to tackle #9241.
Test Plan: automated tests on this PR verify we didn't break the now-deprecated runtime_out_dtypes mode; tests on the next PR will verify that everything works after migration. Also included migration for exactly one operator, op_mul, to verify that the new code compiles.
To check performance, I edited examples/models/toy_model/model.py so that MulModule used inputs of size (3000, 2000) instead of (3, 2). I exported it with

```
python3 -m examples.portable.scripts.export --model_name mul
```

and saved the resulting `mul.pte`. Then I built in release mode with optimized kernels on, but with mul.out removed from kernels/optimized/optimized.yaml, so that we would use the optimized_portable_kernels build of kernels/portable/op_mul.cpp. Finally, I ran 3 trials on my M2 MacBook Pro using

```
cmake-out/executor_runner --model_path mul3kby2k.pte --num_executions 1000 --cpu_threads 2
```
Resulting times for 1000 iterations, in ms:

- Previous diff: 8295, 8187, 8139
- This diff: 2953, 2806, 2861
(For comparison, the actual optimized mul kernel took around 1000 ms to run 1000 iterations, and #9432 later in the stack arrived at similar numbers.)