Skip to content

proposal for mixture models #53

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
367 changes: 367 additions & 0 deletions designs/0034-mixtures.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,367 @@
- Feature Name: 0034-mixtures
- Start Date: 2024-02-14
- RFC PR:
- Stan Issue:

# Summary

The mixtures feature is designed to allow users to specify mixture
models using a consistent density-based notation without having to
worry about marginalizing the discrete parameter manually or about
working on the log scale.


# Motivation

Users writing their own mixture models has been error prone. A
specific failure mode is vectorization and mixing over the whole
population versus mixing over individuals. The latter is almost
always intended, but the former is what a naive Stan program will
do.


# Guide-level explanation

In both the two-component and multi-component mixture syntax, a new
distribution function will be introduced, which is described in this
section.


## Two-component mixtures.

A two-component normal mixture model will be written in canonical form
as follows.

```stan
y ~ mixture(p,
normal(m[1], s[1]),
normal(m[2], s[2]));
```

Here, `p` is a probability in (0, 1) and `normal(m[1], s[1])` acts
like a lambda binding of the first argument (more on that below).
This notation would be syntactic sugar for the following expression

```stan
target += log_sum_exp(log(p) + normal_lpdf(y | m[1], s[1]),
log1m(p) + normal_lpdf(y | m[2], s[2]));
```

Because this follows distribution syntax, for consistency we would
also define the probability functions `mixture_lpdf` and `mixture_lpmf`
```stan
mixture_lpdf(y | p, lp1, lp2)
=def=
log_sum_exp(log(p) + lp1(y), log1m(p) + lp2(y));
```

Here we have have explicitly used application syntax for `lp1` and
`lp2`, even though it doesn't exist in Stan. For discrete
distributions, we will also need a `mixture_lpmf`.

If Stan had C++ style lambdas, a distribution without an outcome
variable would be defined as follows.

```stan
normal_lpdf(m[1], s[1])
=def=
[y].normal_lpdf(y | m[1], s[1])
```

Even though a distribution without an outcome behaves like a lambda in
the context of mixtures, this will *not* be generally available in
Stan, so that it would not be legal to write the following.

```stan
target += normal(m[1], s[1])(y); // ILLEGAL
```

## Require probability function arguments

Although it would be possible to allow arbitrary functions to
participate in mixtures, in the first release we propose to not allow
arbitrary functions and restrict the arguments after the first to be
probability functions (lpdfs or lpmfs). The workaround in this
case is to take the function and wrap it in an lpdf or lpmf
definition. We may add arbitrary function arguments at a later date
after we introduce lambdas into Stan.

For example, to define a zero-inflated Poisson as a mixture model,
the lpmf that puts all of its mass on 0 can be defined and then used
as one of the mixture components.

```stan
functions {
real zero_delta_lpmf(int y) {
return y == 0 ? 0 : -infinity();
}
}
...
real<lower=0, upper=1> p;
...
model {
y ~ mixture(p, zero_delta, poisson(lam));
```


## Mixing truncated distributions

When two truncated distributions are mixed, it is important to include
the normalization constants.


### Proposal for native truncation syntax
in the situation where `mu`, `sigma`, and `lambda` are parameters.

```stan
real<lower=0> alpha;

alpha ~ mixture(p,
normal(mu, sigma) T[0, ],
exponential(lambda));
```

Using truncation in this way will probably require extra work on
the parser side in order to make `normal(mu, sigma) T[0, ]` a node in
the syntax tree. Stan does *not* currently support statements of the
form

```stan
target += normal_lupdf(alpha | mu, sigma) T[0, ];
```

Instead, you have to do the following.

```stan
target += normal_lupdf(alpha | mu, sigma)
- normal_ccdf(0 | mu, sigma);
```

### Alternative without new truncation syntax

We can avoid having to have the `T[0, ]` syntax by instead requiring
the user to define their own function,

```stan
real normal_lb_lpdf(real alpha, real mu, real sigma, real lb) {
return normal_lupdf(alpha | mu, sigma)
- normal_ccdf(lb | mu, sigma);
}
```

With this, the above would be coded as

```stan
alpha ~ mixture(p,
normal_lb(mu, sigma, 0),
exponential(lambda));
```

We will probably start without implementing the new truncation
syntax.


## Mixing continuous and discrete distributions

It is incoherent to directly mix continuous and discrete
distributions. That is, we do *not* want to do something like the
following.

```stan
int<lower=0> y;

y ~ mixture(p,
poisson(lambda1),
exponential(lambda2));
```

The problem here is that there is no type to assign the mixture. In
these cases of mismatched types, we want to raise a compiler error.

Technically, the continuity can be eliminated by defining a new
discrete distribution that delegates to the exponential by promotion.

```stan
real exponential_int_lpdf(int y, real lambda) {
return exponential_lpdf(y | lambda);
}
```

To make this coherent, the sum of densities of valid `y` needs to be
finite. In this case, the requirement is $\sum_{n \in \mathbb{N}}
\textrm{exponential}(n | \lambda) < \infty.$ This is not something
that can be enforced through Stan, but it should be documented.

## Mixtures with more than two components

Up until now, we have assumed two mixture components and a probability
argument. In general, we want to allow more than two mixture
components. The proposal is to have the first argument be a syntax
whose size determines the number of remaining arguments. For
example, the following would be a legal way to define a 3-component mixture.

```stan
simplex[3] p;
...
y ~ mixture(p, normal(m[1], s[1]),
normal(m[2], s[2]),
student_t(4.5, m[3], s[3]));
```

Run-time error checking is largely a matter of making sure the simplex
argument is both a simplex and the right size for the number of
components, or if it's a probability, checking that there are two
components and that the value is in (0, 1).

In terms of compilation, `mixture` is another variadic function with type
checking.

## Comparison to existing `log_mix` function

This functionality is more general than the existing `log_mix`
functionality, which requires actual log density values as arguments.
In this case, the above mixture would be written as

```stan
simplex[3] p;
...
target += log_mix(p, normal_lpdf(y | m[1], s[1]),
normal_lpdf(y | m[2], s[2]),
student_t_lpdf(y | 4.5, m[3], s[3]));
```

The `log_mix` function is slightly different than the `mixture`
distribution and while it could be deprecated in favor of `mixture`,
it would also make sense to keep it.

Like this proposal, the `log_mix` function requires truncated
distributions or mixtures of discrete and continuous distributions to
be finessed through writing a user-defined function of the appropriate
type ending in `_lpdf` or `_lpmf`.


# Reference-level explanation

The implementation requires `mixture_lpmf` and `mixture_lpdf` to be
coded in C++. These will be variadic functions that accept either (a)
one probability argument and two probability functions without
variates, or (b) one simplex, and a number of probability functions
matching its size.

Derivatives can be handled by autodiff, or an "analytic" derivative
for `mixture_lpmf` and `mixture_lpdf` can be supplied.

## Interactions with other features

There are no interactions with other features.

## Boundary conditions and exceptions

The boundary conditions are simple. If zero mixture components are
supplied, it should throw an exception. This is better than returning
`-infinity` and letting the user try to debug the rejection.
If only a single mixture component is supplied with a size 1 simplex,
a warning should be emitted suggesting simplified usage.

## Include normalizing constants

The distributions have to be compiled with `propto=False` and this
needs to be made clear in the documentation. Statistically,
`propto=False` is required to get the correct result when the mixture
components don't have equal normalizing constants.

## Catching out of bounds exceptions in variates

Stan's probability functions are designed to throw
`std::invalid_argument` exceptions when they encounter illegal
arguments. In this case, we want to catch any exceptions that come
from the variate being mixed being out of bounds. That is, if
we use `exponential(2)` as one of our mixture components, then
we want a negative argument `exponential_lpdf(-1 | 2)` to return
`-infinity` rather than throw an exception.

To handle that, we propose to change the exception thrown in the math
library to a subclass of `std::invalid_argument` which we will call
`stan::math::invalid_variate`. This is a simple change in the math
lib for type of exception thrown and will not break backward
compatibility.

# Drawbacks

It's yet one more kind of ad hoc parsing that doesn't fit into the
type system until we have functions.

## Rationale and alternatives

This design is close to optimal as it matches the way a statistician
might write a mixture in mathematical notation.

The alternative is to just require users to keep using the `log_mix`
function or writing their own custom code.

The impact of inaction here is just that it's harder for users to
write readable and robust mixture models.


## Limitations

This proposal is not intended to address infinite mixtures, like a gamma-Poisson.

## Prior art

Prior art is basically just `log_mix` and some examples of how to code
by hand in the mixtures chapter of the *User's Guide*.

#### PyMC

PyMC provides a distribution class
[`pymc.Mixture`](https://www.pymc.io/projects/docs/en/stable/api/distributions/generated/pymc.Mixture.html)
Here's the first example from their documentation.

```python
with pm.Model() as model:
w = pm.Dirichlet("w", a=np.array([1, 1])) # 2 mixture weights

lam1 = pm.Exponential("lam1", lam=1)
lam2 = pm.Exponential("lam2", lam=1)

# As we just need the logp, rather than add a RV to the model, we need to call `.dist()`
# These two forms are equivalent, but the second benefits from vectorization
components = [
pm.Poisson.dist(mu=lam1),
pm.Poisson.dist(mu=lam2),
]
# `shape=(2,)` indicates 2 mixture components
components = pm.Poisson.dist(mu=pm.math.stack([lam1, lam2]), shape=(2,))

like = pm.Mixture("like", w=w, comp_dists=components,
observed=data)
```

This creates a two-component mixture of Poisson distributions.

#### Turing.jl

`Turing.jl` provides a class
[`MixtureModel`](https://turinglang.org/docs/tutorials/gaussian-mixture-models/).
Here's an example of its use.

```julia
w = [0.5, 0.5]
mu = [-3.5, 0.5]
mixturemodel = MixtureModel([MvNormal(Fill(mu_k, 2), I) for mu_k in mu], w)
```

This creates a two-component mixture of two-dimensional isotropic
Gaussians (`I` is the identity matrix imported from `LinearAlgebra`).

# Resolved questions

1. Should we allow function arguments or force user to write `_lpdf`
or `_lpmf` functions? No.

2. Should we allow one-component mixtures as the boundary condition? Yes.

3. Should `normal(m, s)` be interpreted as a function elsewhere in
Stan. No.