Commit b715d94: add GPU tutorial

1 parent f0762f9 commit b715d94

4 files changed: +160 -8 lines changed

docs/pages.jl

Lines changed: 4 additions & 2 deletions

```diff
@@ -1,9 +1,11 @@
 # Put in a separate page so it can be used by SciMLDocs.jl
 
 pages = ["index.md",
-    "Tutorials" => Any["tutorials/linear.md",
+    "tutorials/linear.md",
+    "Tutorials" => Any[
         "tutorials/caching_interface.md",
-        "tutorials/accelerating_choices.md"],
+        "tutorials/accelerating_choices.md",
+        "tutorials/gpu.md"],
     "Basics" => Any["basics/LinearProblem.md",
         "basics/common_solver_opts.md",
         "basics/OperatorAssumptions.md",
```

docs/src/tutorials/gpu.md (new file)

Lines changed: 135 additions & 0 deletions

# GPU-Accelerated Linear Solving in Julia

LinearSolve.jl provides two ways to GPU-accelerate linear solves:

* Offloading: offloading takes a CPU-based problem, automatically transforms it into a
  GPU-based problem in the background, and returns the solution on the CPU. Thus, using
  offloading requires no change on the part of the user other than choosing an offloading
  solver.
* Array type interface: the array type interface requires that the user define the
  `LinearProblem` using an `AbstractGPUArray` type and choose an appropriate solver
  (or use the default solver). The solution will then be returned as a GPU array type.

The offloading approach has the advantage of being simpler and requiring no change to
existing CPU code, while having the disadvantage of more overhead. In the following
sections we will demonstrate how to use each approach.

!!! warn

    GPUs are not always faster! Your matrices need to be sufficiently large in order for
    GPU acceleration to actually be faster. For offloading it's around 1,000 x 1,000 matrices,
    and for the array type interface it's around 100 x 100. For sparse matrices, it is highly
    dependent on the sparsity pattern and the amount of fill-in.

## GPU-Offloading

GPU offloading is simple: it only requires changing the solver algorithm. Take the
example from the start of the documentation:

```julia
using LinearSolve

A = rand(4, 4)
b = rand(4)
prob = LinearProblem(A, b)
sol = solve(prob)
sol.u
```

This computation can be moved to the GPU as follows:

```julia
using CUDA # Add the GPU library
sol = solve(prob, CudaOffloadFactorization())
sol.u
```

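Because offloading pays for a host-to-device memory transfer on every solve, it is worth checking the crossover point on your own hardware. A rough timing sketch (the use of BenchmarkTools.jl here is our own assumption; exact numbers depend heavily on the GPU):

```julia
using LinearSolve, CUDA, BenchmarkTools

n = 1000  # roughly the size where offloading starts to pay off
A = rand(n, n)
b = rand(n)
prob = LinearProblem(A, b)

@btime solve($prob)                              # default CPU factorization
@btime solve($prob, CudaOffloadFactorization())  # GPU-offloaded factorization
```
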
## GPUArray Interface

For more manual control over the factorization setup, you can use the
[GPUArray interface](https://juliagpu.github.io/GPUArrays.jl/dev/), the most common
instantiation being [CuArray for CUDA-based arrays on NVIDIA GPUs](https://cuda.juliagpu.org/stable/usage/array/).
To use this, we simply send the matrix `A` and the vector `b` over to the GPU and solve:

```julia
using CUDA

A = rand(4, 4) |> cu
b = rand(4) |> cu
prob = LinearProblem(A, b)
sol = solve(prob)
sol.u
```

```
4-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 -27.02665
  16.338171
 -77.650116
 106.335686
```

Notice that the solution is a `CuArray`, and thus one must use `Array(sol.u)` if you wish
to return it to the CPU. This setup does no automated memory transfers and will thus only
move things to the CPU on command.

!!! warn

    Many GPU functionalities, such as `CUDA.cu`, have a built-in preference for `Float32`.
    Generally it is much faster to use 32-bit floating point operations on a GPU than 64-bit
    operations, and thus this is generally the right choice when moving to such platforms.
    However, this change in numerical precision needs to be accounted for in your mathematics,
    as it could lead to instabilities. To disable this, use a constructor that is more
    specific about the bit size, such as `CuArray{Float64}(A)`. Additionally, preferring more
    stable factorization methods, such as `QRFactorization()`, can improve the numerics in
    such cases.

Similarly to other use cases, you can choose the solver, for example:

```julia
sol = solve(prob, QRFactorization())
```

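Combining the two recommendations from the warning above, a minimal sketch of a `Float64` GPU solve with a QR factorization (reusing the 4 x 4 example; the variable names are illustrative) looks like:

```julia
using CUDA, LinearSolve

A = rand(4, 4)                # Float64 on the CPU
b = rand(4)
A_gpu = CuArray{Float64}(A)   # explicit element type avoids the Float32 conversion of `cu`
b_gpu = CuArray{Float64}(b)

prob64 = LinearProblem(A_gpu, b_gpu)
sol = solve(prob64, QRFactorization())
Array(sol.u)                  # copy the result back to the CPU
```
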
## Sparse Matrices on GPUs

Currently, sparse matrix computations on GPUs are only supported for CUDA. This is done using
the `CUDA.CUSPARSE` sublibrary.

```julia
using LinearAlgebra, SparseArrays, CUDA.CUSPARSE
T = Float32
n = 100
A_cpu = sprand(T, n, n, 0.05) + I
x_cpu = zeros(T, n)
b_cpu = rand(T, n)

A_gpu_csr = CuSparseMatrixCSR(A_cpu)
b_gpu = CuVector(b_cpu)
```

In order to solve such problems using a direct method, you must add
[CUDSS.jl](https://github.com/exanauts/CUDSS.jl). This looks like:

```julia
using CUDSS
prob = LinearProblem(A_gpu_csr, b_gpu)
sol = solve(prob, LUFactorization())
```

!!! note

    For now, CUDSS only supports `CuSparseMatrixCSR` type matrices.

Note that `KrylovJL` methods also work with sparse GPU arrays:

```julia
sol = solve(prob, KrylovJL_GMRES())
```

CUSPARSE also provides some GPU-based preconditioners, such as a built-in `ilu`. For example:

```julia
sol = solve(prob, KrylovJL_GMRES(precs = (A, p) -> (CUDA.CUSPARSE.ilu02!(A, 'O'), I)))
```

However, right now CUSPARSE is missing the right `ldiv!` implementation for this to work
in general. See https://github.com/SciML/LinearSolve.jl/issues/341 for details.

docs/src/tutorials/linear.md

Lines changed: 20 additions & 5 deletions

````diff
@@ -1,7 +1,7 @@
 # Getting Started with Solving Linear Systems in Julia
 
-A linear system $$Au=b$$ is specified by defining an `AbstractMatrix`. For the sake
-of simplicity, this tutorial will start by only showcasing concrete matrices.
+A linear system $$Au=b$$ is specified by defining an `AbstractMatrix` or `AbstractSciMLOperator`.
+For the sake of simplicity, this tutorial will start by only showcasing concrete matrices.
 
 The following defines a matrix and a `LinearProblem` which is subsequently solved
 by the default linear solver.
@@ -57,8 +57,23 @@ sol = solve(prob, KrylovJL_GMRES()) # Choosing algorithms is done the same way
 sol.u
 ```
 
+Similarly, structured matrix types, like banded matrices, can be provided using special matrix
+types. While any `AbstractMatrix` type should be compatible via the general Julia interfaces,
+LinearSolve.jl specifically tests with the following cases:
+
+* [BandedMatrices.jl](https://github.com/JuliaLinearAlgebra/BandedMatrices.jl)
+* [BlockDiagonals.jl](https://github.com/JuliaArrays/BlockDiagonals.jl)
+* [CUDA.jl](https://cuda.juliagpu.org/stable/) (CUDA GPU-based dense and sparse matrices)
+* [FastAlmostBandedMatrices.jl](https://github.com/SciML/FastAlmostBandedMatrices.jl)
+* [Metal.jl](https://metal.juliagpu.org/stable/) (Apple M-series GPU-based dense matrices)
+
 ## Using Matrix-Free Operators
 
-`A`, or
-by providing a matrix-free operator for performing `A*x` operations via the
-function `A(u,p,t)` out-of-place and `A(du,u,p,t)` for in-place.
+In many cases where a sparse matrix gets really large, even the sparse representation
+cannot be stored in memory. However, in many such cases, such as with PDE discretizations,
+you may be able to write down a function that directly computes `A*x`. These "matrix-free"
+operators allow the user to define the `Ax=b` problem to be solved by giving only the
+definition of `A*x`, allowing specific solvers (Krylov methods) to act without ever
+constructing the full matrix.
+
+**This will be documented in more detail in the near future.**
````

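To illustrate the structured-matrix support listed in the new section above, here is a minimal sketch with BandedMatrices.jl (the tridiagonal construction and its values are illustrative assumptions, not part of this commit):

```julia
using LinearSolve, BandedMatrices

n = 100
# Diagonally dominant tridiagonal matrix stored with banded structure
A = BandedMatrix(-1 => fill(-1.0, n - 1), 0 => fill(4.0, n), 1 => fill(-1.0, n - 1))
b = rand(n)

prob = LinearProblem(A, b)
sol = solve(prob)  # the default algorithm choice sees the banded storage
sol.u
```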

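Likewise, as a preview of the matrix-free workflow described in the rewritten section, here is a rough sketch using SciMLOperators.jl's `FunctionOperator` (the in-place `op(du, u, p, t)` convention follows the old doc text; the specific tridiagonal action and constructor arguments are illustrative assumptions):

```julia
using LinearSolve, SciMLOperators

n = 100

# In-place matrix-free action: (A*u)[i] = 4u[i] - u[i-1] - u[i+1]
function tridiag_mul!(du, u, p, t)
    for i in eachindex(u)
        du[i] = 4u[i]
        i > firstindex(u) && (du[i] -= u[i - 1])
        i < lastindex(u) && (du[i] -= u[i + 1])
    end
    return du
end

A = FunctionOperator(tridiag_mul!, zeros(n), zeros(n))
b = rand(n)

prob = LinearProblem(A, b)
sol = solve(prob, KrylovJL_GMRES())  # Krylov methods only need `A*x` products
sol.u
```
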
src/factorization.jl

Lines changed: 1 addition & 1 deletion

```diff
@@ -117,7 +117,7 @@ function SciMLBase.solve!(cache::LinearCache, alg::LUFactorization; kwargs...)
     end
     cache.cacheval = fact
 
-    if !LinearAlgebra.issuccess(fact)
+    if hasmethod(LinearAlgebra.issuccess, Tuple{typeof(fact)}) && !LinearAlgebra.issuccess(fact)
         return SciMLBase.build_linear_solution(
             alg, cache.u, nothing, cache; retcode = ReturnCode.Failure)
     end
```
