Online estimation of covariance matrix #749

DanielWicz · 2021-10-29T09:34:29Z

The problem: I'm getting data in batches. At every timestep i I receive a batch B_i = {x_1, x_2, ..., x_m}, where m is the batch size.

No single batch covers the whole population, only part of it, so a covariance matrix built from one batch is highly biased. The variance from batch to batch is high (but not very high).

The goal is to build an online, batch-wise estimate of the covariance matrix: in iteration i I estimate the covariance matrix, and in iteration i+1 I update it. After i iterations, the algorithm should converge to a covariance matrix that represents the unbiased covariance matrix of the whole population.

Do you have an already-implemented algorithm for this?

MaxHalford · 2021-10-29T09:49:19Z

Maintainer

Hello :)

We have the logic to compute an online covariance in stats.Cov. You should be able to use that to calculate the covariance matrix:

import collections
import itertools
import numpy as np
from river import stats

np.random.seed(144_000)
X = np.random.random(size=(1000, 5))

cov = collections.defaultdict(stats.Cov)

for batch in np.split(X, 5):
    for x in batch:
        for i, j in itertools.combinations(range(len(x)), 2):
            cov[i, j].update(x[i], x[j])

print(cov)

defaultdict(river.stats.cov.Cov,
            {(0, 1): Cov: 0.002749,
             (0, 2): Cov: 0.003143,
             (0, 3): Cov: 0.002923,
             (0, 4): Cov: -0.000836,
             (1, 2): Cov: -0.001591,
             (1, 3): Cov: 0.000458,
             (1, 4): Cov: -0.002777,
             (2, 3): Cov: -0.003429,
             (2, 4): Cov: 0.001228,
             (3, 4): Cov: -0.00349})

You can pretty-print this with pandas:

import pandas as pd

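print((
    pd.DataFrame(
        [
            {'i': i, 'j': j, 'cov': c.get()}
            for (i, j), c in cov.items()
        ] +
        # Mirror each pair so the pivoted table is symmetric
        [
            {'i': i, 'j': j, 'cov': c.get()}
            for (j, i), c in cov.items()
        ]
    )
    # pandas >= 2.0 only accepts keyword arguments here
    .pivot(index='i', columns='j', values='cov')
))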

j         0         1         2         3         4
i                                                  
0       NaN  0.002749  0.003143  0.002923 -0.000836
1  0.002749       NaN -0.001591  0.000458 -0.002777
2  0.003143 -0.001591       NaN -0.003429  0.001228
3  0.002923  0.000458 -0.003429       NaN -0.003490
4 -0.000836 -0.002777  0.001228 -0.003490       NaN

There's certainly a way to do this in numpy with mini-batch formulas, but that's not part of River as of yet. The above will work fine if you're not working with big data.
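For reference, here is a minimal sketch of what such a mini-batch update could look like in plain numpy. It is not River code: the class name BatchCov and its methods are hypothetical, and the update rule is the standard pairwise merge of means and sums of squared deviations (often attributed to Chan et al.), which I am assuming is the kind of "mini-batch formula" meant above.

```python
import numpy as np


class BatchCov:
    """Running covariance matrix, updated one mini-batch at a time.

    Hypothetical helper for illustration only; not part of River.
    """

    def __init__(self, n_features):
        self.n = 0                                    # samples seen so far
        self.mean = np.zeros(n_features)              # running mean
        self.M2 = np.zeros((n_features, n_features))  # sum of outer products of deviations

    def update(self, batch):
        batch = np.asarray(batch, dtype=float)
        m = batch.shape[0]
        batch_mean = batch.mean(axis=0)
        centered = batch - batch_mean
        batch_M2 = centered.T @ centered

        # Merge the batch statistics into the running statistics
        delta = batch_mean - self.mean
        total = self.n + m
        self.M2 += batch_M2 + np.outer(delta, delta) * self.n * m / total
        self.mean += delta * m / total
        self.n = total
        return self

    def get(self, ddof=1):
        # ddof=1 gives the usual unbiased sample covariance
        return self.M2 / (self.n - ddof)


rng = np.random.default_rng(0)
X = rng.random((1000, 5))

est = BatchCov(n_features=5)
for batch in np.split(X, 5):
    est.update(batch)

cov_matrix = est.get()  # plain (5, 5) numpy array, diagonal included
```

After all batches have been fed, the result should agree with np.cov(X, rowvar=False) up to floating-point error, and each update costs one matrix product per batch instead of a Python loop over every pair of features.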

5 replies

DanielWicz Oct 29, 2021
Author

Thanks for the quick reply.

> working with big data.

Do you plan to provide a batched, numpy-compatible version, or at least move the for loop into C/C++?

MaxHalford Oct 29, 2021
Maintainer

It's not on the roadmap, no.

MaxHalford Oct 29, 2021
Maintainer

Have you seen this?

DanielWicz Oct 29, 2021
Author

> Have you seen this?

Yes, I saw it. The algorithm implemented there is not as numerically stable as yours, but thanks for looking for one.
PS. Do you know how to convert the covariance matrix to a numpy matrix? What I want to do is:
feed data batches -> compute the covariance matrix -> use it in another algorithm as a numpy matrix. Going through pandas (as above) may be pretty slow.
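One way to skip pandas is to write the pairwise values straight into a preallocated array. The sketch below is only an illustration: pairwise_to_matrix is a hypothetical helper, and it takes a plain mapping of (i, j) -> float so that it does not depend on River. With the defaultdict from the earlier snippet you would pass something like {k: c.get() for k, c in cov.items()}. The sample values are copied from the output printed above.

```python
import numpy as np


def pairwise_to_matrix(pairwise, n_features):
    """Fill a symmetric (n_features, n_features) array from {(i, j): value}.

    Hypothetical helper. Entries that were never tracked stay NaN.
    """
    mat = np.full((n_features, n_features), np.nan)
    for (i, j), value in pairwise.items():
        mat[i, j] = value
        mat[j, i] = value  # covariance is symmetric
    return mat


# A few of the pairwise covariances printed earlier in the thread
pairwise = {(0, 1): 0.002749, (0, 2): 0.003143, (1, 2): -0.001591}
mat = pairwise_to_matrix(pairwise, n_features=3)
```

The diagonal stays NaN here because itertools.combinations only yields pairs with i < j. Iterating with itertools.combinations_with_replacement instead would also track the (i, i) pairs, and the covariance of a feature with itself is its variance.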

eserie Oct 29, 2021

Hi,

You could do it in pandas, no?

Where your code gives:

j         0         1         2         3         4
i                                                  
0       NaN  0.002749  0.003143  0.002923 -0.000836
1  0.002749       NaN -0.001591  0.000458 -0.002777
2  0.003143 -0.001591       NaN -0.003429  0.001228
3  0.002923  0.000458 -0.003429       NaN -0.003490
4 -0.000836 -0.002777  0.001228 -0.003490       NaN

A pandas implementation would give:

pd.DataFrame(X).ewm(alpha=1/1.e6).cov().loc[999]

  	0	1	2	3	4
0	0.081745	0.002748	0.003143	0.002923	-0.000836
1	0.002748	0.083221	-0.001592	0.000457	-0.002777
2	0.003143	-0.001592	0.081277	-0.003429	0.001228
3	0.002923	0.000457	-0.003429	0.089583	-0.003490
4	-0.000836	-0.002777	0.001228	-0.003490	0.082151

If you want to scale to big data (with potential usage of GPU/TPU architectures) and stay in numpy world, you can use JAX + WAX-ML.
Here is a code example:

from wax.modules import EWMCov
from wax.unroll import unroll
import jax
import numpy as onp

online_covariance = unroll(lambda x: EWMCov(alpha=1 / 1.0e6)(x, x))(jax.device_put(X))
print(f"Online covariance shape: {online_covariance.shape}")
pd.DataFrame(onp.array(online_covariance[-1]))

Online covariance shape: (1000, 5, 5)
	0	1	2	3	4
0	0.081633	0.002719	0.003037	0.002832	-0.000767
1	0.002719	0.083103	-0.001421	0.000628	-0.002814
2	0.003037	-0.001421	0.081268	-0.003217	0.001135
3	0.002832	0.000628	-0.003217	0.089631	-0.003615
4	-0.000767	-0.002814	0.001135	-0.003615	0.082033

Splitting per batch could be done as follows:

from wax.modules import EWMA

@unroll
def ewm_batch(X):
    @unroll
    def ewm_first(x):
        return EWMCov(1 / 100000)(x, x)
    return EWMA(1 / 100000)(ewm_first(X))


res = ewm_batch(jax.device_put(np.stack(np.split(X, 5))))
print(res.shape)
pd.DataFrame(res[-1, -1])

(5, 200, 5, 5)
	0	1	2	3	4
0	0.081586	0.002912	0.003182	0.002834	-0.000822
1	0.002912	0.083022	-0.001486	0.000519	-0.002762
2	0.003182	-0.001486	0.081099	-0.003533	0.001189
3	0.002834	0.000519	-0.003533	0.088827	-0.003653
4	-0.000822	-0.002762	0.001189	-0.003653	0.081679

Hope this helps!
