matrix multiplication: optimizations around `matmul2x2or3x3_nonzeroalpha!` #1563
nsajko wants to merge 20 commits.
Lays the groundwork for adding more small-matrix special cases while keeping complexity low.
In the case `n > 3`, which I expect is the most common one, this change reduces branching: one branch instead of two. The output of `code_typed` should also be smaller after this change, because Julia's own optimizer does not do common subexpression elimination; only LLVM does. This also lays the groundwork for adding special cases for `n < 2` without causing extra branching in the `n > 3` case.
All square matrices with fewer than four rows should now be handled in pure Julia, without `ccall`/FFI. One-element vectors are included, too, being treated as 1x1 matrices.
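A minimal sketch of the branching idea described above, assuming square inputs of equal size. `small_square_mul!` and `sketch_mul!` are hypothetical names, not the functions changed in this PR; the real code also handles transposes, symmetric/Hermitian wrappers and the `alpha`/`beta` coefficients, and unrolls the small cases by hand. Matrices with more than three rows pay for a single comparison before reaching the generic `mul!`, while sizes 0 through 3 stay in pure Julia.

```julia
using LinearAlgebra: mul!

# Hypothetical pure-Julia kernel for square matrices with at most three rows.
# A plain triple loop here; a real implementation would unroll each size.
function small_square_mul!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
    n = size(A, 1)
    for j in 1:n, i in 1:n
        acc = zero(eltype(C))
        for k in 1:n
            acc += A[i, k] * B[k, j]
        end
        C[i, j] = acc
    end
    return C
end

# Hypothetical dispatcher: one comparison separates the common large case
# from every small case (n = 0, 1, 2, 3).
function sketch_mul!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
    n = size(A, 1)
    size(A) == size(B) == size(C) == (n, n) ||
        throw(DimensionMismatch("this sketch only handles square matrices of equal size"))
    if n > 3
        return mul!(C, A, B)  # generic path (BLAS for strided Float32/Float64 arrays)
    else
        return small_square_mul!(C, A, B)
    end
end
```

For instance, `sketch_mul!(similar(A), A, A)` with a 3x3 `A` never reaches `mul!`, and `code_typed(sketch_mul!, (Matrix{Float32}, Matrix{Float32}, Matrix{Float32}))` can be used to inspect the typed IR of the sketch.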
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1563      +/-   ##
==========================================
- Coverage   94.33%   94.30%   -0.04%
==========================================
  Files          35       35
  Lines       16007    16043      +36
==========================================
+ Hits        15100    15129      +29
- Misses        907      914       +7
```
Benchmarking

Example: small square matrices, including

Benchmark script `bench.jl`:

```julia
# Usage: julia bench.jl m max_n seed samples evals
# Times the product of a tuple of `m` random n×n Float32 matrices for n = 0:max_n,
# comparing `Matrix` against `FixedSizeMatrixDefault`.
using LinearAlgebra: LinearAlgebra
using BenchmarkTools: @btime
using Random: seed!, rand!
using FixedSizeArrays: FixedSizeMatrixDefault

# Return a tuple of `m` uninitialized n×n arrays of type `typ`.
function square_arrs(typ::Type, m::Int, n::Int)
    ntuple((_ -> typ(undef, (n, n))), m)
end

const m = parse(Int, ARGS[1])        # number of matrices in each product
const max_n = parse(Int, ARGS[2])    # largest matrix size
const seed = parse(Int, ARGS[3])     # RNG seed
const samples = parse(Int, ARGS[4])  # BenchmarkTools samples
const evals = parse(Int, ARGS[5])    # evaluations per sample

@show m
@show max_n
@show seed
@show samples
@show evals

const elt = Float32
const typ_arr = Matrix{elt}
const typ_fsa = FixedSizeMatrixDefault{elt}

global arrs::NTuple{m, typ_arr}
global fsas::NTuple{m, typ_fsa}

for n ∈ 0:max_n
    global arrs
    global fsas
    arrs = square_arrs(typ_arr, m, n)
    fsas = square_arrs(typ_fsa, m, n)
    # Fill the plain matrices with random values, then copy them into the
    # fixed-size ones so both benchmarks multiply the same data.
    seed!(seed)
    foreach(rand!, arrs)
    for i ∈ eachindex(arrs, fsas)
        fsas[i] .= arrs[i]
    end
    print(' ' ^ 2)
    @show n
    print(' ' ^ 2)
    @btime prod($arrs) seconds=Inf samples=samples evals=evals
    print(' ' ^ 2)
    @btime prod($fsas) seconds=Inf samples=samples evals=evals
end
```
dkarrasch left a comment:
LGTM. Do we have tests for the 1x1 case?
Co-authored-by: Daniel Karrasch <daniel.karrasch@posteo.de>
```julia
elseif tA_uc == 'S'
    if isuppercase(tA) # tA == 'S'
        A11 = symmetric(a11, :U)
    else
        A11 = symmetric(a11, :L)
    end
elseif tA_uc == 'H'
    if isuppercase(tA) # tA == 'H'
        A11 = hermitian(a11, :U)
```
These lines are not covered by the added tests? I'm not immediately sure how to proceed. I suppose these cases are handled somewhere else and never dispatch to this part of the code.
For what it's worth, the corresponding parts of the pre-existing 2x2 and 3x3 method bodies also lack complete coverage.
Could it be that these cases never run through the small-matmul branch and always go through the symm/hemm route?
I missed the `Int` elements. Never mind.
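The resolution hinges on element types. As far as I understand, BLAS `symm`/`hemm` are only used for the BLAS element types, which `LinearAlgebra` groups in the internal (non-exported) union `BlasFloat`, so tests with `Int` elements can reach the pure-Julia symmetric/Hermitian branches. The sketch below assumes that routing; the test values are hypothetical, and the checks only compare results, not which code path ran.

```julia
using LinearAlgebra: BlasFloat, Symmetric, Hermitian, mul!

# Int is not a BLAS element type, so symmetric/Hermitian products with Int
# elements cannot be handed to BLAS symm/hemm.
@assert Float64 <: BlasFloat
@assert !(Int <: BlasFloat)

# 1x1 products with Int elements, for both wrappers and both triangles.
a = fill(2, 1, 1)
b = fill(3, 1, 1)
for W in (Symmetric, Hermitian), uplo in (:U, :L)
    A = W(a, uplo)
    @assert A * b == fill(6, 1, 1)
    # Five-argument mul!: C = A*b*alpha + C*beta, here 6*2 + 1*5 = 17.
    C = fill(1, 1, 1)
    mul!(C, A, b, 2, 5)
    @assert C == fill(17, 1, 1)
end
```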
Improvements:
- Possibly slightly speeds up the general case (multiplying non-small matrices), presumably because matrices that are not special-cased now go through fewer branches. However, this speedup seems significant only for fairly small matrices.
- Covers empty matrix products with the special-case pure-Julia code for small matrices, which speeds up that case. This includes matrix products where one of the "matrices" is an `AbstractVector`.
- Covers products of two 1x1 matrices with the special-case pure-Julia code for small matrices, which speeds up that case. This includes matrix products where one of the "matrices" is an `AbstractVector`.

The commit history is tidy for easier review.
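A small sketch of results that the empty and 1x1 products described above should reproduce, using `Float32` as in the benchmark script. The sizes and values are illustrative assumptions; the checks only compare results and do not verify that the pure-Julia path was taken.

```julia
using LinearAlgebra: mul!

T = Float32

# Empty products: 0x0 times 0x0, and 0x0 times a length-0 vector.
@assert size(zeros(T, 0, 0) * zeros(T, 0, 0)) == (0, 0)
@assert length(zeros(T, 0, 0) * zeros(T, 0)) == 0

# 1x1 times 1x1, and 1x1 times a one-element vector (treated as a 1x1 matrix).
a = fill(T(2), 1, 1)
@assert a * fill(T(3), 1, 1) == fill(T(6), 1, 1)
@assert a * T[3] == T[6]

# Five-argument mul! with a nonzero alpha: c = a*v*alpha + c*beta = 6*2 + 1*0.5.
c = T[1]
mul!(c, a, T[3], T(2), T(0.5))
@assert c == T[12.5]
```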