Speedup DimShuffle and Reshape in C and Numba backends #1226
Conversation
Force-pushed from 4e37832 to 8fb41d6.
Codecov Report

❌ Attention: Patch coverage is 96.55%, which is below the target coverage of 100.00%. You can increase the patch coverage or adjust the target coverage.

Coverage diff against main:

| | main | #1226 | +/- |
| --- | --- | --- | --- |
| Coverage | 81.98% | 81.99% | |
| Files | 188 | 188 | |
| Lines | 48568 | 48523 | -45 |
| Branches | 8677 | 8668 | -9 |
| Hits | 39819 | 39785 | -34 |
| Misses | 6584 | 6579 | -5 |
| Partials | 2165 | 2159 | -6 |
Force-pushed from ebc4263 to b24ee8b.
Force-pushed from b24ee8b to 94c84f8.
Force-pushed from 6c93750 to c243f36.
@numba_basic.numba_njit(inline="always")
def dimshuffle(x):
    return dimshuffle_inner(np.asarray(x), shuffle)

return as_strided(x, shape=new_shape, strides=new_strides)
Is it correct that, since we only allow changing the order of the axes and don't introduce new axes or change the total number of elements, this never produces arrays where the same location in memory is used multiple times?
I think it would be a bad idea to introduce cases where that could happen. The as_strided docs also contain a warning about this: https://numpy.org/devdocs/reference/generated/numpy.lib.stride_tricks.as_strided.html#numpy.lib.stride_tricks.as_strided
If that is in fact not an issue here, I think we should add a small note in the code pointing out why it's not a problem.
This does produce multiple arrays that use the same memory, and it does introduce/remove axes. That's why the Op is marked as a view Op, to signal that it can't be mutated if the original input is needed elsewhere or protected (i.e. provided by the user).
We don't create self-overlapping memory, and we use the original strides, which is what the docs recommend. The user could provide such an input themselves, but that would break any inplace rewrite regardless of this Op.
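To illustrate the point, a minimal sketch in plain NumPy (not PyTensor code; the names are illustrative): permuting the original shape and strides only relabels axes, so every element keeps a distinct address and the result is a plain view of the input.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.arange(12.0).reshape(3, 4)

order = (1, 0)  # a simple transpose expressed as a stride permutation
view = as_strided(
    x,
    shape=tuple(x.shape[i] for i in order),
    strides=tuple(x.strides[i] for i in order),
)

assert np.shares_memory(view, x)  # it is a view, hence DimShuffle is a view Op
assert np.array_equal(view, x.T)  # same elements, axes relabelled

# No self-overlap: every output index maps to a distinct memory offset.
offsets = {sum(i * s for i, s in zip(idx, view.strides)) for idx in np.ndindex(view.shape)}
assert len(offsets) == view.size
```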
Force-pushed from c243f36 to 16123a7.
@jessegrabowski @aseyboldt the new numba DimShuffle is revealing some inplace bug in the failing linalg grad tests. Before, the DimShuffle in the grad graphs was probably forcing a copy with the as_contiguous call inside reshape, which hid the problem. There are 3 options for the source of the bug:
Unfortunately we don't have something like DebugMode for Numba that would identify which Op is the culprit. Can you think of a way to debug this? Here is the compiled grad graph of the cholesky test (I removed the Elemwise inplace optimization, so only SolveTriangular is supposedly allowed to do inplacing):
Running the test suite locally, I see that the …
Force-pushed from 0693144 to e96b6a9.
dimensions = (npy_intp*) malloc(nd_out * sizeof(npy_intp));
strides = (npy_intp*) malloc(nd_out * sizeof(npy_intp));
Actually, I don't know why we're using the ExternalCOp approach for this. It used to be a normal COp before, and then you didn't need this sort of alloc/params stuff...
Found the bug: #1233
Force-pushed from 5714253 to 315a937.
This was caused by 223ee15, which used the generic `PyArray_IntpConverter` to convert the shape numpy vector into a simple C-array for the Reshape operation. There seems to be no need for this change, as the strides were already used correctly. Profiling suggests the previous changes caused a 7.5x slowdown; the benchmark detects only a 2.3x slowdown because of the PyTensor call overhead.
Introduced in e593b0a due to a bug when inputs had zero strides. The bug can be fixed just by removing a block that assumed some `full`/`broadcasting` behavior by the operation, which does not happen in DimShuffle.
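For context, a quick NumPy illustration (not the PyTensor test case; the shapes are made up) of what a zero-stride input looks like: broadcasting produces strides of 0, and a stride-permuting DimShuffle just has to carry those zeros along.

```python
import numpy as np

x = np.broadcast_to(np.arange(4.0), (3, 4))  # broadcasted view: shape (3, 4)
print(x.strides)    # (0, 8): the first axis has stride 0
print(x.T.strides)  # (8, 0): permuting the zero stride still yields a valid view
```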
Force-pushed from 315a937 to 50801a3.
Looks good :-)
TLDR: Make views cheap again!
All backends
- Remove the unused inplace option; less branching is always helpful.
Python backend
- Implementation based on `as_strided` that mimics the old (and now current again) C-backend approach. This is debatable, since the Python implementation has historically been there more for readability than performance; still, it's useful for fast-compiling graphs not to suck. In the end, `as_strided` is actually insanely slow in Python, so I'm back to transpose+reshape (a sketch of that approach follows below). The removal of some checks and unnecessary `asarray` calls provides a small speedup, although this is obviously not critical.
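For reference, a hedged sketch of the transpose+reshape idea in plain NumPy (this is an illustration, not the actual perform implementation; the helper name and the edge-case handling are assumptions):

```python
import numpy as np

def dimshuffle_sketch(x, new_order):
    # Axes of x that are kept, in their requested order ("x" marks a new length-1 axis).
    kept = [d for d in new_order if d != "x"]
    # Axes that are dropped entirely must have length 1, otherwise data would be lost.
    dropped = [d for d in range(x.ndim) if d not in kept]
    assert all(x.shape[d] == 1 for d in dropped)
    # Move the kept axes to the front in the requested order; dropped axes go last.
    x_t = x.transpose(kept + dropped)
    # The reshape only inserts/removes length-1 axes here, so NumPy can usually
    # keep the result a view of the input.
    new_shape = [1 if d == "x" else x.shape[d] for d in new_order]
    return x_t.reshape(new_shape)

x = np.empty((1, 5, 3))
assert dimshuffle_sketch(x, (2, "x", 1)).shape == (3, 1, 5)
```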
Numba backend
- Use `as_strided`, which simplifies the implementation by a ton, with two benefits (see the stride-trick sketch below):
  - The example with non-contiguous arrays in #1111 has no penalty now, which is a 10,000x speedup (the example is exaggerated with a very large array).
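And a hedged sketch of the stride-trick version, written here in plain NumPy rather than as the actual jitted code (names and tuple construction are illustrative): the output shape and strides are read straight off the input, "x" entries get length 1 with an arbitrary stride, and dropped axes must have length 1.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def dimshuffle_strided(x, new_order):
    shape = tuple(1 if d == "x" else x.shape[d] for d in new_order)
    strides = tuple(0 if d == "x" else x.strides[d] for d in new_order)
    return as_strided(x, shape=shape, strides=strides)

x = np.arange(10.0).reshape(1, 5, 2)       # input with a droppable leading axis
out = dimshuffle_strided(x, (2, "x", 1))   # drop axis 0, insert a new axis
assert out.shape == (2, 1, 5)
assert np.shares_memory(out, x)            # a view: no data is copied
```

Because no data is touched, the cost does not depend on how large or how non-contiguous the input is, which is why the #1111 example no longer pays a penalty.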
C backend
This was the original goal of this PR.
- There were two regressions, caused by e593b0a and 223ee15.
- The DimShuffle changes introduced a `PyArray_IntpConverter` to compute a new shape and used two calls to the numpy C-functions `PyArray_Transpose` and `PyArray_NewShape`. I suspect the slowdown comes mostly from the introduction of `PyArray_IntpConverter`, but I couldn't find anything wrong with the simpler logic from Theano times. The bug that motivated those changes had to do with a useless second pass on the dimensions calculation, which is baffling in that DimShuffle does not do any broadcasting behavior other than expand_dims, which the first pass already handles correctly. Removing this loop fixes the original bug.
- The Reshape changes were simpler: they introduced a generic `PyArray_IntpConverter` to convert the shape numpy vector into a simple C-array for the Reshape operation. The concern about strides in the commit message doesn't hold, because the strides were being used directly. I added direct tests for odd strides just in case.
- In the long term, Reshape should take the separate shape entries as scalar inputs instead of a single vector. The user never defines a graph with a shape vector anyway, but with isolated entries, so the whole packing and unpacking is just overhead. This is tracked in #881.
Profiling a simple function suggests the previous changes caused an 8.6-11x slowdown per call for the DimShuffle Op and a 5.4x slowdown for the Reshape Op. The new benchmarks detect only a 3.6x and 2.3x slowdown, respectively, because of the PyTensor call overhead.
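For the record, a hedged sketch of the kind of micro-benchmark behind these numbers (the exact benchmark in the repository may differ; the shape, dimshuffle pattern, and iteration count here are made up):

```python
import timeit

import numpy as np
import pytensor
import pytensor.tensor as pt

x = pt.tensor3("x")
f = pytensor.function([x], x.dimshuffle(2, 0, 1))
f.trust_input = True  # skip Python-level input validation so the per-call Op cost dominates

data = np.random.normal(size=(32, 32, 32)).astype(x.dtype)
print(timeit.timeit(lambda: f(data), number=10_000))
```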