Commit b715d94: add GPU tutorial

1 parent f0762f9 commit b715d94

4 files changed: +160 -8 lines changed

docs/pages.jl

Lines changed: 4 additions & 2 deletions

```diff
@@ -1,9 +1,11 @@
 # Put in a separate page so it can be used by SciMLDocs.jl
 
 pages = ["index.md",
-    "Tutorials" => Any["tutorials/linear.md",
+    "tutorials/linear.md",
+    "Tutorials" => Any[
         "tutorials/caching_interface.md",
-        "tutorials/accelerating_choices.md"],
+        "tutorials/accelerating_choices.md",
+        "tutorials/gpu.md"],
     "Basics" => Any["basics/LinearProblem.md",
         "basics/common_solver_opts.md",
         "basics/OperatorAssumptions.md",
```

docs/src/tutorials/gpu.md (new file)

Lines changed: 135 additions & 0 deletions

# GPU-Accelerated Linear Solving in Julia

LinearSolve.jl provides two ways to GPU-accelerate linear solves:

* Offloading: offloading takes a CPU-based problem, automatically transforms it into a
  GPU-based problem in the background, and returns the solution on the CPU. Thus, using
  offloading requires no change on the part of the user other than choosing an offloading
  solver.
* Array type interface: the array type interface requires that the user define the
  `LinearProblem` using an `AbstractGPUArray` type and choose an appropriate solver
  (or use the default solver). The solution will then be returned as a GPU array type.

The offloading approach has the advantage of being simpler and requiring no change to
existing CPU code, while having the disadvantage of more overhead. In the following
sections we will demonstrate how to use each approach.

!!! warn

    GPUs are not always faster! Your matrices need to be sufficiently large in order for
    GPU acceleration to actually be faster. For offloading it's around 1,000 x 1,000 matrices,
    and for the array type interface it's around 100 x 100. For sparse matrices, it is highly
    dependent on the sparsity pattern and the amount of fill-in.

## GPU-Offloading

GPU offloading is simple: it only requires changing the solver algorithm. Take the
example from the start of the documentation:

```julia
using LinearSolve

A = rand(4, 4)
b = rand(4)
prob = LinearProblem(A, b)
sol = solve(prob)
sol.u
```

This computation can be moved to the GPU as follows:

```julia
using CUDA # Add the GPU library
sol = solve(prob, CudaOffloadFactorization())
sol.u
```

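Because offloading pays for a host-to-device memory transfer on every solve, it is worth checking the crossover point on your own hardware. A rough timing sketch (the use of BenchmarkTools.jl here is our own assumption; exact numbers depend heavily on the GPU):

```julia
using LinearSolve, CUDA, BenchmarkTools

n = 1000  # roughly the size where offloading starts to pay off
A = rand(n, n)
b = rand(n)
prob = LinearProblem(A, b)

@btime solve($prob)                              # default CPU factorization
@btime solve($prob, CudaOffloadFactorization())  # GPU-offloaded factorization
```
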
## GPUArray Interface

For more manual control over the factorization setup, you can use the
[GPUArray interface](https://juliagpu.github.io/GPUArrays.jl/dev/), the most common
instantiation being [CuArray for CUDA-based arrays on NVIDIA GPUs](https://cuda.juliagpu.org/stable/usage/array/).
To use this, we simply send the matrix `A` and the vector `b` over to the GPU and solve:

```julia
using CUDA

A = rand(4, 4) |> cu
b = rand(4) |> cu
prob = LinearProblem(A, b)
sol = solve(prob)
sol.u
```

```
4-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 -27.02665
  16.338171
 -77.650116
 106.335686
```

Notice that the solution is a `CuArray`, and thus one must use `Array(sol.u)` if you wish
to return it to the CPU. This setup does no automated memory transfers and will thus only
move things to the CPU on command.

!!! warn

    Many GPU functionalities, such as `CUDA.cu`, have a built-in preference for `Float32`.
    Generally it is much faster to use 32-bit floating point operations on a GPU than 64-bit
    operations, and thus this is generally the right choice when moving to such platforms.
    However, this change in numerical precision needs to be accounted for in your mathematics,
    as it could lead to instabilities. To disable this, use a constructor that is more
    specific about the bit size, such as `CuArray{Float64}(A)`. Additionally, preferring more
    stable factorization methods, such as `QRFactorization()`, can improve the numerics in
    such cases.

Similarly to other use cases, you can choose the solver, for example:

```julia
sol = solve(prob, QRFactorization())
```

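Combining the two recommendations from the warning above, a minimal sketch of a `Float64` GPU solve with a QR factorization (reusing the 4 x 4 example; the variable names are illustrative) looks like:

```julia
using CUDA, LinearSolve

A = rand(4, 4)                # Float64 on the CPU
b = rand(4)
A_gpu = CuArray{Float64}(A)   # explicit element type avoids the Float32 conversion of `cu`
b_gpu = CuArray{Float64}(b)

prob64 = LinearProblem(A_gpu, b_gpu)
sol = solve(prob64, QRFactorization())
Array(sol.u)                  # copy the result back to the CPU
```
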
## Sparse Matrices on GPUs

Currently, sparse matrix computations on GPUs are only supported for CUDA. This is done using
the `CUDA.CUSPARSE` sublibrary.

```julia
using LinearAlgebra, SparseArrays, CUDA.CUSPARSE
T = Float32
n = 100
A_cpu = sprand(T, n, n, 0.05) + I
x_cpu = zeros(T, n)
b_cpu = rand(T, n)

A_gpu_csr = CuSparseMatrixCSR(A_cpu)
b_gpu = CuVector(b_cpu)
```

In order to solve such problems using a direct method, you must add
[CUDSS.jl](https://github.com/exanauts/CUDSS.jl). This looks like:

```julia
using CUDSS
prob = LinearProblem(A_gpu_csr, b_gpu)
sol = solve(prob, LUFactorization())
```

!!! note

    For now, CUDSS only supports `CuSparseMatrixCSR` type matrices.

Note that `KrylovJL` methods also work with sparse GPU arrays:

```julia
sol = solve(prob, KrylovJL_GMRES())
```

CUSPARSE also provides some GPU-based preconditioners, such as a built-in `ilu`. For example:

```julia
sol = solve(prob, KrylovJL_GMRES(precs = (A, p) -> (CUDA.CUSPARSE.ilu02!(A, 'O'), I)))
```

However, right now CUSPARSE is missing the right `ldiv!` implementation for this to work
in general. See https://github.com/SciML/LinearSolve.jl/issues/341 for details.

docs/src/tutorials/linear.md

Lines changed: 20 additions & 5 deletions

````diff
@@ -1,7 +1,7 @@
 # Getting Started with Solving Linear Systems in Julia
 
-A linear system $$Au=b$$ is specified by defining an `AbstractMatrix`. For the sake
-of simplicity, this tutorial will start by only showcasing concrete matrices.
+A linear system $$Au=b$$ is specified by defining an `AbstractMatrix` or `AbstractSciMLOperator`.
+For the sake of simplicity, this tutorial will start by only showcasing concrete matrices.
 
 The following defines a matrix and a `LinearProblem` which is subsequently solved
 by the default linear solver.
@@ -57,8 +57,23 @@ sol = solve(prob, KrylovJL_GMRES()) # Choosing algorithms is done the same way
 sol.u
 ```
 
+Similarly, structured matrix types, like banded matrices, can be provided using special matrix
+types. While any `AbstractMatrix` type should be compatible via the general Julia interfaces,
+LinearSolve.jl specifically tests with the following cases:
+
+* [BandedMatrices.jl](https://github.com/JuliaLinearAlgebra/BandedMatrices.jl)
+* [BlockDiagonals.jl](https://github.com/JuliaArrays/BlockDiagonals.jl)
+* [CUDA.jl](https://cuda.juliagpu.org/stable/) (CUDA GPU-based dense and sparse matrices)
+* [FastAlmostBandedMatrices.jl](https://github.com/SciML/FastAlmostBandedMatrices.jl)
+* [Metal.jl](https://metal.juliagpu.org/stable/) (Apple M-series GPU-based dense matrices)
+
 ## Using Matrix-Free Operators
 
-`A`, or
-by providing a matrix-free operator for performing `A*x` operations via the
-function `A(u,p,t)` out-of-place and `A(du,u,p,t)` for in-place.
+In many cases where a sparse matrix gets really large, even the sparse representation
+cannot be stored in memory. However, in many such cases, such as with PDE discretizations,
+you may be able to write down a function that directly computes `A*x`. These "matrix-free"
+operators allow the user to define the `Ax=b` problem to be solved by giving only the
+definition of `A*x`, allowing specific solvers (Krylov methods) to act without ever
+constructing the full matrix.
+
+**This will be documented in more detail in the near future.**
````

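To illustrate the structured-matrix support listed in the new section above, here is a minimal sketch with BandedMatrices.jl (the tridiagonal construction and its values are illustrative assumptions, not part of this commit):

```julia
using LinearSolve, BandedMatrices

n = 100
# Diagonally dominant tridiagonal matrix stored with banded structure
A = BandedMatrix(-1 => fill(-1.0, n - 1), 0 => fill(4.0, n), 1 => fill(-1.0, n - 1))
b = rand(n)

prob = LinearProblem(A, b)
sol = solve(prob)  # the default algorithm choice sees the banded storage
sol.u
```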

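Likewise, as a preview of the matrix-free workflow described in the rewritten section, here is a rough sketch using SciMLOperators.jl's `FunctionOperator` (the in-place `op(du, u, p, t)` convention follows the old doc text; the specific tridiagonal action and constructor arguments are illustrative assumptions):

```julia
using LinearSolve, SciMLOperators

n = 100

# In-place matrix-free action: (A*u)[i] = 4u[i] - u[i-1] - u[i+1]
function tridiag_mul!(du, u, p, t)
    for i in eachindex(u)
        du[i] = 4u[i]
        i > firstindex(u) && (du[i] -= u[i - 1])
        i < lastindex(u) && (du[i] -= u[i + 1])
    end
    return du
end

A = FunctionOperator(tridiag_mul!, zeros(n), zeros(n))
b = rand(n)

prob = LinearProblem(A, b)
sol = solve(prob, KrylovJL_GMRES())  # Krylov methods only need `A*x` products
sol.u
```
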
src/factorization.jl

Lines changed: 1 addition & 1 deletion

```diff
@@ -117,7 +117,7 @@ function SciMLBase.solve!(cache::LinearCache, alg::LUFactorization; kwargs...)
     end
     cache.cacheval = fact
 
-    if !LinearAlgebra.issuccess(fact)
+    if hasmethod(LinearAlgebra.issuccess, Tuple{typeof(fact)}) && !LinearAlgebra.issuccess(fact)
         return SciMLBase.build_linear_solution(
             alg, cache.u, nothing, cache; retcode = ReturnCode.Failure)
     end
```
