Mixture random cleanup #3364

Merged (16 commits, Feb 16, 2019)

Conversation

@lucianopaz (Contributor):

In PR #3293, I had tried to fix the Mixture.random method (issue #3270). I had mistakenly not added WIP to the PR and it was merged with the tests passing but with some rough edges. This PR addresses one important thing that was poorly implemented in PR #3293: vectorization of comp_dists.random.

The most important changes are:

  1. Compact comp_dists shape inference method.
  2. The _comp_samples method was rewritten to present a uniform interface to distributions.distribution.generate_samples.
  3. size, dist_shape and broadcast_shape are handled more clearly.
  4. I fixed a defect in Multinomial.random, which may be the cause of #3343 (sample_posterior_predictive returns the wrong dimension).

Points that I would like to discuss:

  1. I changed a particular test that was failing, test_sampling.py::TestSamplePPC::test_normal_scalar. The strange thing was that it was also failing for me on master. I made a slight edit to it and would like your opinion.
  2. To give comp_dist.random the raw size, instead of the size that generate_samples combines with dist_shape and broadcast_shape before passing it to the generators, I wrapped it in a generator decorator function, which forced me to ignore a pylint warning. Could you suggest an alternative that avoids this, or something less hackish regarding the size parameter?
  3. Should we add a sample_prior and sample_posterior_predictive test for NormalMixture too?

This PR does not address another point that was lacking from #3293, which is that the LKJCholeskyCorr.random function is not tested. I will work on it and submit a separate PR.

@@ -267,7 +267,7 @@ def random(self, point=None, size=None):
else:
std_norm_shape = mu.shape
standard_normal = np.random.standard_normal(std_norm_shape)
-return mu + np.tensordot(standard_normal, chol, axes=[[-1], [-1]])
+return mu + np.einsum('...ij,...j->...i', chol, standard_normal)
Member:

nice!

Contributor Author:

Thanks! I had not understood tensordot's broadcasting rules well, noticed it started giving weirdly shaped outputs, and found einsum to be the most intuitive way to get the right broadcasting.
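For illustration, a minimal numpy sketch of the shape difference (the batch size and dimensionality here are made up, not taken from the PR):

import numpy as np

# Hypothetical batch of 4 draws from a 3D MvNormal: one Cholesky factor
# per draw and one vector of standard normals per draw.
chol = np.tile(np.eye(3), (4, 1, 1))                 # shape (4, 3, 3)
standard_normal = np.random.standard_normal((4, 3))  # shape (4, 3)

# tensordot contracts the requested axes but concatenates (rather than
# broadcasts) the remaining ones, so the batch axis appears twice.
out_td = np.tensordot(standard_normal, chol, axes=[[-1], [-1]])
print(out_td.shape)  # (4, 4, 3)

# einsum broadcasts the leading '...' axes, pairing each draw with its
# own Cholesky factor.
out_es = np.einsum('...ij,...j->...i', chol, standard_normal)
print(out_es.shape)  # (4, 3)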

self._broadcast_shape = self._comp_dist_shapes
self.is_multidim_comp = True
elif isinstance(comp_dists, Iterable):
if not all((isinstance(comp_dist, Distribution)
Member:

I think this check should be moved to __init__; maybe it could also save a few lines in the if/else here.

Contributor Author:

Ok, I can do that. I thought that it would be a good check to make whenever comp_dists' value was set. I know that we only really change comp_dists during __init__, but I thought that maybe some users could potentially change its values in derived classes or things like that.

samples = []
for dist_shape, comp_dist in zip(comp_dist_shapes,
self.comp_dists):
def generator(*args, **kwargs):
Member:

You are not joking that this is a horrible hack... I think it might be better to overwrite the comp_dist.random method (it's at least more explicit)

Contributor Author:

You mean just changing the comp_dists[i].random method somewhere else? Maybe I could add a _generators attribute during comp_dists.setter, with a wrapper of comp_dists[i].random that takes a _raw_shape parameter too. I'll try it out.
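A rough sketch of that idea (the _generators attribute and the raw_size_ keyword are just the hypothetical names floated above, not existing pymc3 API):

def _wrap_random(comp_dist):
    # Wrapper around comp_dist.random that accepts the raw size requested by
    # Mixture.random, instead of the size that generate_samples combines with
    # dist_shape and broadcast_shape before calling the generator.
    def wrapped(*args, raw_size_=None, **kwargs):
        kwargs['size'] = raw_size_
        return comp_dist.random(*args, **kwargs)
    return wrapped

# Hypothetical use inside the comp_dists setter:
# self._generators = [_wrap_random(d) for d in comp_dists]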

if samples.shape[-1] == 1:
return samples[..., 0]
else:
return samples

def infer_comp_dist_shapes(self, point=None):
Member:

Maybe it is better to call this once in __init__ and save the outputs as properties for the mixture class? What do you think?

Contributor Author:

The problem is that the comp_dists shapes may depend on the values passed in the point dictionary. That's the case for sample_posterior_predictive, at least.

Member:

I see, good point.

…thing less hackish. At least we don't ignore lint anymore.
twiecki added the WIP label Feb 4, 2019
)
# In the logp we assume the last axis holds the mixture components
# so we move the axis to the last dimension
samples = np.moveaxis(samples, 0, -1)
if samples.shape[-1] == 1:
Contributor Author:

This if statement is legacy code that I don't really understand. Why should we do this test? Wouldn't it already be handled by Mixture.random?

Member:

I don't remember either, but at some point there was an error with shape = (100, 1) vs shape = (100,); I'm not even sure we have a test for that.
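For reference, a minimal numpy sketch of the kind of (100, 1) vs (100,) mismatch being described (shapes are illustrative only):

import numpy as np

samples = np.random.standard_normal((100, 1))  # a component drew shape (100, 1)
mixture_shape = (100,)                          # but the Mixture was declared with shape (100,)

# Broadcasting (100, 1) against (100,) silently produces (100, 100)
# instead of the intended (100,).
print((samples + np.zeros(mixture_shape)).shape)  # (100, 100)

# The legacy check drops the trailing length-1 axis to avoid that.
if samples.shape[-1] == 1:
    samples = samples[..., 0]
print(samples.shape)  # (100,)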

Contributor Author:

So the error happens when we try to call random from the mixture and comp_dists has shape (100, 1) but the Mixture has shape (100,)? I'm not sure how the error arises. Did it happen when comp_dists was a multidimensional distribution or when it was a list of distributions? I'd appreciate any thoughts on extra tests to write.

…and median to stack on the last axis (which should allow Mixture to work with multidimensional distributions). Changed docstring a bit. Minor change in _comp_samples to enforce dtype.
@lucianopaz (Contributor Author):

I had forgotten to mention another change this PR implements that turns out to be big: the Mixture's default dtype now depends on theano.config.floatX. This means that if the Mixture is continuous the default dtype will be floatX, and if it is discrete it will be intX (float64 -> int32, float32 -> int16, float16 -> int8). This made some initial tests fail.
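A minimal sketch of the mapping described above (default_mixture_dtype is a hypothetical helper for illustration, not the actual pymc3 code):

import theano

# Continuous mixtures default to theano.config.floatX; discrete mixtures
# use the matching integer width described above.
_INTX_FOR_FLOATX = {'float64': 'int32', 'float32': 'int16', 'float16': 'int8'}

def default_mixture_dtype(is_discrete):
    floatX = theano.config.floatX
    return _INTX_FOR_FLOATX[floatX] if is_discrete else floatX

print(default_mixture_dtype(is_discrete=False))  # e.g. 'float64'
print(default_mixture_dtype(is_discrete=True))   # e.g. 'int32'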

I now pushed a small set of changes that should allow Mixture to handle comp_dists to be a list of multidimensional distributions. I'll try to push some tests for this functionality soon.

@lucianopaz (Contributor Author):

I went down the road of adding support for multidimensional mixtures. I've run some local tests but still need to write a more systematic test suite for this functionality.

@@ -75,27 +87,98 @@ def test_normal_mixture(self):
np.sort(self.norm_mu),
rtol=0.1, atol=0.1)

def test_normal_mixture_nd(self):
nd, ncomp = 3, 5
@pytest.mark.parametrize('nd,ncomp',
Member:

Oh this is nice!

size=None)
comp_dist_shapes = test_sample.shape
broadcast_shape = comp_dist_shapes
includes_mixture = True
Member:

I am not completely sure what the function of includes_mixture is here - I imagine it has something to do with making sure the broadcasting of w is correct. Could you add a small docstring to this function?

Contributor Author:

You're right. Yes, I'll add a docstring. includes_mixture maybe was not the best choice of name, and it may be redundant. It's a bool that indicates whether comp_dists' last axis holds the mixture components or not, but that's the same as testing whether comp_dists is a distribution or a list. Hmm, I'll see how to avoid depending on it.

if isinstance(comp_dists, Distribution):
self._comp_dist_shapes = to_tuple(comp_dists.shape)
self._broadcast_shape = self._comp_dist_shapes
self.is_multidim_comp = True
Member:

The name is a bit confusing - this is only to distinguish between comp_dists being a list or a distribution, right?

Contributor Author:

You're totally right. I'll change this.

# Happens in __init__ when computing self.logp(comp_modes)
self_shape = None
comp_shape = tuple(comp_dists.shape)
if (
Member:

There is no else clause here? I think you should raise an exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! You're right! Thanks for catching that. I'll add an else clause, or restructure the conditions a bit.

@lucianopaz (Contributor Author) commented Feb 13, 2019:

While working on the multidimensional distributions support I came across something that I think is wrong in Mixture.random and wanted to ask your opinion on it, @junpenglao.

Consider the following example:

import numpy as np
import pymc3 as pm


npop = 5
nd = 3
with pm.Model():
    mus = np.tile(np.arange(npop)[None, :], (nd, 1))
    m1 = pm.NormalMixture('m1',
                          w=np.ones(npop) / npop,
                          mu=mus,
                          sigma=1e-5,
                          comp_shape=(nd, npop),
                          shape=nd)
    m2 = pm.NormalMixture('m2',
                          w=np.ones((nd, npop)) / npop,
                          mu=mus,
                          sigma=1e-5,
                          comp_shape=(nd, npop),
                          shape=nd)

m1_val = m1.random(size=100)
m2_val = m2.random(size=100)

print('m1 rows come from the same component? {}'.
      format(all(np.all(np.diff(m1_val) < 1e-3, axis=-1))))
print('m2 rows come from the same component? {}'.
      format(all(np.all(np.diff(m2_val) < 1e-3, axis=-1))))

Both m1_val and m2_val have the correct shape, (100, nd). My concern is: should we expect each of the elements of m1_val to come from a different mixture component, or should all the elements in a row come from the same component? For m2_val I would expect each element to be sampled from an independent mixture component, because the weights w are specified independently for each of the nd elements in the mixture. However, for m1_val I think that this is not correct, and all the elements in each row should come from the same mixture component.

I think that m1 and m2 should match the following models with the latent component variables z1 and z2:

import numpy as np
from theano import tensor as tt
import pymc3 as pm


npop = 5
nd = 3
with pm.Model():
    mus = tt.as_tensor_variable(np.tile(np.arange(npop)[None, :], (nd, 1)))
    z1 = pm.Categorical('z1', p=np.ones(npop) / npop)
    m1 = pm.Normal('m1', mu=mus[..., z1], sigma=1e-5, shape=nd)
    z2 = pm.Categorical('z2', p=np.ones(npop) / npop, shape=nd)
    mu2 = tt.as_tensor_variable([mus[i, z2[i]] for i in range(nd)])
    m2 = pm.Normal('m2', mu=mu2, sigma=1e-5, shape=nd)

m1_val = m1.random(size=100)
m2_val = m2.random(size=100)

print('m1 rows come from the same component? {}'.
      format(all(np.all(np.diff(m1_val) < 1e-3, axis=-1))))
print('m2 rows come from the same component? {}'.
      format(all(np.all(np.diff(m2_val) < 1e-3, axis=-1))))

@junpenglao (Member) commented Feb 13, 2019:

I see what you mean, but each component (regardless of the dimension) should always slice through all the dimensions, which means z2 = pm.Categorical('z2', p=np.ones(npop) / npop, shape=nd) does not make sense, because you want to select the component while keeping each dimension (i.e. not sampling over the dimensions).

For example:

pm.Categorical.dist(p=np.ones(npop) / npop, shape=nd).random()
# ==> array([0, 2, 4])

means that in one realization you select component 1's dimension 1 as dimension 1, component 3's dimension 2 as dimension 2, and component 5's dimension 3 as dimension 3. That is a different model from what I would expect of a mixture model, where each slice of the data (a 3-dimensional vector in this case) is randomly selected from one component of the mixture.
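A small numpy sketch of the distinction, reusing the mus from the example above (illustrative only):

import numpy as np

npop, nd = 5, 3
mus = np.tile(np.arange(npop)[None, :], (nd, 1))   # shape (nd, npop)

# A single scalar label selects one whole nd-dimensional component,
# so every dimension of the draw comes from the same component.
z = np.random.randint(npop)
whole_component = mus[:, z]                         # shape (nd,)

# A vector of labels (shape (nd,)) instead picks each dimension from a
# possibly different component, which is the behaviour questioned above.
z_vec = np.random.randint(npop, size=nd)            # e.g. array([0, 2, 4])
mixed_dimensions = mus[np.arange(nd), z_vec]        # shape (nd,)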

@lucianopaz (Contributor Author):

So it is safe to assume that w should always be a vector with npop elements, right? If w were a matrix, it would mean that the mixture component weights could change across each dimension of the distribution, so the mixture component wouldn't slice through the other dimensions.

@lucianopaz (Contributor Author):

@junpenglao, test_mixture_random_shape actually supplies a 2D array as the Mixture's w. I'm starting to think again that it does make sense for each dimension of the mixture to come from a different component in these cases, because the mixture weights could be different for each of the mixture's dimension elements.

@lucianopaz (Contributor Author):

With the latest commits, the code snippets below produce matching behavior for the mixture and the latent-label models:

import numpy as np
from theano import tensor as tt
import pymc3 as pm


npop = 5
nd = 3
with pm.Model():
    mus = np.tile(np.arange(npop)[None, :], (nd, 1))
    m1 = pm.NormalMixture('m1',
                          w=np.ones(npop) / npop,
                          mu=mus,
                          sigma=1e-5,
                          comp_shape=(nd, npop),
                          shape=nd)
    m2 = pm.NormalMixture('m2',
                          w=np.ones((nd, npop)) / npop,
                          mu=mus,
                          sigma=1e-5,
                          comp_shape=(nd, npop),
                          shape=nd)

m1_val = m1.random(size=100)
m2_val = m2.random(size=100)

print('m1 rows come from the same component? {}'.
      format(all(np.all(np.diff(m1_val) < 1e-3, axis=-1))))  # True
print('m2 rows come from the same component? {}'.
      format(all(np.all(np.diff(m2_val) < 1e-3, axis=-1))))  # False

with pm.Model():
    mus = tt.as_tensor_variable(np.tile(np.arange(npop)[None, :], (nd, 1)))
    z1 = pm.Categorical('z1', p=np.ones(npop) / npop)
    m1 = pm.Normal('m1', mu=mus[..., z1], sigma=1e-5, shape=nd)
    z2 = pm.Categorical('z2', p=np.ones(npop) / npop, shape=nd)
    mu2 = tt.as_tensor_variable([mus[i, z2[i]] for i in range(nd)])
    m2 = pm.Normal('m2', mu=mu2, sigma=1e-5, shape=nd)

m1_val = m1.random(size=100)
m2_val = m2.random(size=100)

print('m1 rows come from the same component? {}'.
      format(all(np.all(np.diff(m1_val) < 1e-3, axis=-1))))  # True
print('m2 rows come from the same component? {}'.
      format(all(np.all(np.diff(m2_val) < 1e-3, axis=-1))))  # False

@junpenglao (Member):

> @junpenglao, test_mixture_random_shape actually supplies a 2D array as the Mixture's w. I'm starting to think again that it does make sense for each dimension of the mixture to come from a different component in these cases, because the mixture weights could be different for each of the mixture's dimension elements.

That's for a different use case, and it only makes sense for repeated measurements. So it would be cases like: 2 different 3-component mixtures stacked together. I think these cases are subtle and potentially easy to misuse; I'm not sure what the best way to handle them would be (oh god, I so much wish we had the concepts of batch shape and sample shape so we could distinguish this explicitly).

So to reason about the 2D w, think of this use case:
One mixture distribution with 3 components, each a 5D Gaussian. When we generate from this mixture we first generate labels with shape (n, 1) or (n,), which gives us something like array([0, 1, 1, 2, ...]); we then index into one 5D Gaussian and generate a vector with shape (5,). The expected shapes would be: mixture components (5, 3), weights (3,), observations (n, 5).

Now we have another mixture, also with 3 components, each a 5D Gaussian, and we stack the two together. The expected shapes would be: mixture components (5, 3, 2), weights (3, 2), observations (n, 5, 2).
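A hedged numpy sketch of that shape bookkeeping (the label-then-index sampling scheme is only illustrative, not the Mixture implementation):

import numpy as np

n, ndim, ncomp, nbatch = 10, 5, 3, 2
comps = np.random.standard_normal((ndim, ncomp, nbatch))  # components: (5, 3, 2)
w = np.ones((ncomp, nbatch)) / ncomp                       # weights: (3, 2)

# One label per draw and per stacked mixture, drawn from that mixture's weights.
labels = np.stack([np.random.choice(ncomp, size=n, p=w[:, b])
                   for b in range(nbatch)], axis=-1)        # (n, 2)

# Each label selects a whole 5D component from its own stacked mixture.
obs = np.stack([comps[:, labels[:, b], b] for b in range(nbatch)],
               axis=-1)                                     # (5, n, 2)
obs = np.moveaxis(obs, 1, 0)                                # observations: (n, 5, 2)
print(obs.shape)  # (10, 5, 2)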

@junpenglao (Member):

And don't get me started on mixtures of mixtures...

@lucianopaz (Contributor Author):

> That's for a different use case, and it only makes sense for repeated measurements. So it would be cases like: 2 different 3-component mixtures stacked together. I think these cases are subtle and potentially easy to misuse; I'm not sure what the best way to handle them would be (oh god, I so much wish we had the concepts of batch shape and sample shape so we could distinguish this explicitly).

But isn't this exactly the same as the z2 categorical case? It is the same as stacking nd independent Mixtures, just like a TFP batch_shape. I agree that if we had that concept in pymc3 it would make this much easier, but I think that, as it stands, if w is a vector, then the Mixture's shape can be interpreted as an event_shape: all the dimensions of the Mixture will come from the same mixture component. If w is an nd-array, the part of the Mixture's shape that matches w.shape[:-1] should be considered a batch_shape, because it should be treated the same as stacking independent mixtures together. What do you think?

@junpenglao (Member):

As long as the components are sampled as a whole, it is fine. In other words, the latent label does not break up the components. I still think that in your example above it doesn't make sense, as each dimension comes from a different component.

-------
List of broadcasted sample arrays

Examples
Member:

This is an amazing doc-string.

@lucianopaz (Contributor Author):

Let me rephrase the model with the latent categories like this:

import numpy as np
from theano import tensor as tt
import pymc3 as pm


npop = 5
nd = 3

with pm.Model():
    mus = tt.as_tensor_variable(np.tile(np.arange(npop)[None, :], (nd, 1)))
    z1 = pm.Categorical('z1', p=np.ones(npop) / npop)
    m1 = pm.Normal('m1', mu=mus[..., z1], sigma=1e-5, shape=nd)
    z2s = [pm.Categorical('z2_{}'.format(i), p=np.ones(npop) / npop)
           for i in range(nd)]
    m2s = [pm.Normal('m2_{}'.format(i), mu=mus[i, z2s[i]], sigma=1e-5)
           for i in range(nd)]
    z2 = pm.Deterministic('z2', tt.stack(z2s))
    m2 = pm.Deterministic('m2', tt.stack(m2s))
    vals = pm.sample_prior_predictive(100, vars=['m1', 'm2'])
    m1_val = vals['m1']
    m2_val = vals['m2']

print('m1 rows come from the same component? {}'.
      format(all(np.all(np.diff(m1_val) < 1e-3, axis=-1))))  # True
print('m2 rows come from the same component? {}'.
      format(all(np.all(np.diff(m2_val) < 1e-3, axis=-1))))  # False

Here I explicitly use the fact that m2 is a stack (or batch, in TFP terms) of independent mixtures. I think this is a valid interpretation of how the Mixture class now handles multidimensional mixture weights w. What is left for this PR to be ready for its final review is some more tests to verify that the random values drawn from the Mixture match what we would expect from the latent categorical model.

@junpenglao (Member):

So it is 3 mixtures, each with 1D components... Well, I guess this could also be a valid use case, but my concern is that it would be too easy to misuse. I will have another look at the docstring.

lucianopaz changed the title from "[WIP] Mixture random cleanup" to "Mixture random cleanup" Feb 15, 2019
@lucianopaz (Contributor Author):

I pushed a few more tests and added some examples to Mixture's docstring. I think that this PR is now ready for a final review.

lucianopaz removed the WIP label Feb 15, 2019
@junpenglao (Member):

LGTM. Great job @lucianopaz.

junpenglao merged commit e3fd56b into pymc-devs:master Feb 16, 2019
@twiecki (Member) commented Feb 16, 2019:

woohoo!

lucianopaz deleted the mixture_random branch February 16, 2019 10:19