Hey Mike! I don't usually monitor the discussion page, so I missed this. I've set up notifications now, though. Statespace models get slow for two reasons:

1. The logp is a Kalman filter: a recursive loop over the timesteps, where each step depends on the output of the previous one, so the loop can't be parallelized.
2. Every step of that loop does dense linear algebra on the hidden states: several matrix multiplications plus a matrix solve, so the per-step cost grows quickly with the number of hidden states.
Now, that's all fine, but then add to this:
Just as an example, the default NUTS settings allow for a max tree depth of 10. The tree depth controls how long the Hamiltonian physics simulation is allowed to carry on; at each tree node you're doing one logp and one gradient evaluation. So in the worst case, to get one sample, you have to do 2^10 = 1024 logp+gradient evaluations. Say you want 2000 samples, you have 100 time steps, and you have 10 hidden states. In the absolute worst case you're inverting a 10x10 matrix 2000 * 1024 * 100 ≈ 200 million times for the logp computation, and since the gradient of a solve itself involves a solve, that's another ~200 million inversions. And that's just the solves; there are also a bunch of matrix multiplications of roughly the same cost.

You might think "wow, that's a lot of linear algebra, a GPU should be great at that", but then the loop comes back to bite you. You can't parallelize the loop over the T timesteps, nor can you parallelize over the 2000 draws (sort of, you can have multiple chains of course, but set that aside).

Ok, that context aside, why does JAX help? It all boils down to the loop. JAX seems to have an extremely good implementation of scan, the loop construct everything above runs inside.

So if you want more speed, what can we do?
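To make the scan point concrete, here's a minimal sketch of a Kalman-filter-style recursion written with `jax.lax.scan`. This is a toy illustration, not the actual pymc statespace implementation, and all the names are made up; the point is that the carry makes the sequential dependency explicit, so step t can't start until step t-1 finishes, no matter how much hardware you have:

```python
import jax
import jax.numpy as jnp


def kalman_logp(y, Z, H, T_mat, Q, a0, P0):
    """Toy Kalman-filter log-likelihood (up to a constant).

    y: (T, p) observations, Z: (p, k) design, H: (p, p) obs noise,
    T_mat: (k, k) transition, Q: (k, k) state noise,
    a0: (k,) and P0: (k, k) initial state mean/covariance.
    """

    def step(carry, y_t):
        a, P = carry                        # predicted state mean/cov at time t
        v = y_t - Z @ a                     # innovation
        F = Z @ P @ Z.T + H                 # innovation covariance
        K = jnp.linalg.solve(F, Z @ P).T    # Kalman gain: one solve per step
        ll_t = -0.5 * (jnp.linalg.slogdet(F)[1] + v @ jnp.linalg.solve(F, v))
        a_filt = a + K @ v
        P_filt = P - K @ F @ K.T
        # one-step-ahead prediction: the next iteration needs this carry,
        # which is why the T iterations are inherently sequential
        a_next = T_mat @ a_filt
        P_next = T_mat @ P_filt @ T_mat.T + Q
        return (a_next, P_next), ll_t

    _, lls = jax.lax.scan(step, (a0, P0), y)
    return lls.sum()
```

Because the whole filter is one scan, `jax.jit` compiles the loop body once and `jax.grad` differentiates straight through it; that fast compiled loop body is where JAX earns its speed, but nothing can make the T iterations run in parallel.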
So. That was a longer answer than perhaps you were hoping for. It also has essentially no actionable help for your situation, aside from the advice to switch to approximate inference (which is indeed my suggestion for you). But maybe this big list will encourage you to contribute :)
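Since I'm pointing you at approximate inference, here's a minimal sketch of what that route looks like with PyMC's built-in ADVI. The toy model is a stand-in for your statespace model, and the iteration counts are arbitrary:

```python
import pymc as pm

with pm.Model() as model:  # stand-in; use your statespace model here
    x = pm.Normal("x", 0, 1, shape=10)
    pm.Normal("y", mu=x.sum(), sigma=1.0, observed=3.0)

    # mean-field VI: one gradient evaluation per optimization step,
    # versus up to 1024 per draw in the worst-case NUTS scenario above
    approx = pm.fit(n=30_000, method="advi")
    idata = approx.sample(1_000)  # draw from the fitted approximation
```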
You may want to try running with numba / nutpie. JAX isn't always that great on CPU. You can benchmark how long a logp_dlogp eval takes and calibrate your expectations from there.
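As a rough illustration of that benchmark (a sketch using a toy model in place of the statespace model, assuming the `compile_logp`/`compile_dlogp` API of PyMC >= 4):

```python
import timeit

import pymc as pm

with pm.Model() as model:  # stand-in; use your statespace model here
    x = pm.Normal("x", 0, 1, shape=10)
    pm.Normal("y", mu=x.sum(), sigma=1.0, observed=3.0)

logp = model.compile_logp()    # compiled log-probability function
dlogp = model.compile_dlogp()  # compiled gradient function
point = model.initial_point()  # dict of starting values

n = 200
dt = timeit.timeit(lambda: (logp(point), dlogp(point)), number=n) / n
print(f"~{dt * 1e6:.0f} us per logp+dlogp evaluation")
```

Multiplying the per-eval time by draws times the typical tree size gives a rough floor on total sampling time; if that number already comes out to hours, no sampler setting will save you.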
I think this is a killer feature, and the draft PR already seems pretty good. What do we need to do to push this forward? I would be happy to help :)
Hi folks:
I've been experimenting with the pymc statespace module. One of the downsides I've read about is that, at present, statespace is incompatible with the faster samplers. That said, I've also read that JAX could be fast if only I had the right setup.
I'm using an AMD GPU and an 8-core AMD CPU (16 threads with hyperthreading), but I'm on a Windows machine. Is there a sure way to speed up JAX, other than having a simple, well-specified model?
For example, I've seen that running a Linux Docker image with a special AMD GPU config could help. I can't afford to get an Nvidia card right now.
I'll point out that this is my current script config:
```python
import jax

# force JAX onto the CPU backend
jax.config.update("jax_platform_name", "cpu")

# must run before JAX initializes its devices
import numpyro
numpyro.set_host_device_count(8)
```
Thank you so much for your time!
-Mike