Homogeneous data incorrectly treated #4389

Closed
cluhmann opened this issue Dec 27, 2020 · 23 comments

@cluhmann
Member

Over on Discourse, Argantonio65 reported a problem when trying out a toy coin-flipping example. Playing around with it a bit, it seems that a numpy array consisting exclusively of ones (or zeros) is treated as a single scalar one (or zero), regardless of the actual length of the array. Adding even a single zero (or one, respectively) causes the true length to be acknowledged. Similarly, wrapping the array in a pm.Data object also yields correct behavior.

Here is a condensed version illustrating the insensitivity to sample size when the data is just a vector of ones.

import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt

data = [
        np.ones(1),
        np.ones(10),
        np.ones(100)
       ]

with pm.Model() as model:
    p = pm.Beta('p', alpha=1, beta=1, shape=3)
    y1 = pm.Bernoulli('y1', p=p[0], observed=data[0])
    y2 = pm.Bernoulli('y2', p=p[1], observed=data[1])
    y3 = pm.Bernoulli('y3', p=p[2], observed=data[2])

    trace = pm.sample(8000, step=pm.Metropolis(), random_seed=333, chains=2)

    pm.traceplot(trace)
    plt.show()

This displays the following:

[traceplot screenshot: "wrong"]

Changing the 3 likelihood lines so that they use pm.Data:

    y1 = pm.Bernoulli('y1', p=p[0], observed=pm.Data('d1', data[0]))
    y2 = pm.Bernoulli('y2', p=p[1], observed=pm.Data('d2', data[1]))
    y3 = pm.Bernoulli('y3', p=p[2], observed=pm.Data('d3', data[2]))

yields the expected behavior:

[traceplot screenshot: "data object"]

Augmenting the data with a single tail/failure:

data[1][0] = 0
data[2][0] = 0

also yields the expected behavior.

[traceplot screenshot: "augmented data"]

Versions and main components

  • PyMC3 Version: 3.10
  • Theano Version: 1.0.5
  • Theano-PyMC Version: 1.0.11
  • Python Version: 3.8.5
  • Operating system: Linux
  • How did you install PyMC3: pip
@AlexAndorra
Contributor

Thanks for reporting, @cluhmann! Wanna try doing a PR to solve it? It seems to me you're the man for the job 😉
Just in passing, this vaguely reminds me of #3640 -- maybe two 🐦 with one 💎? 🤷‍♂️

@cluhmann
Member Author

I can certainly look into it. 👍 The first step would obviously be tracking down the source of this problem. Then we can figure out whether these other data/shape issues are related.

@twiecki
Member

twiecki commented Dec 29, 2020

That's bizarre; no idea what could be causing this, but it seems like Theano might be the source. @brandonwillard, any idea?

@brandonwillard
Contributor

From that example alone, I don't see anything indicating a Theano-specific issue.

@brandonwillard
Contributor

Try printing the log-likelihood graphs for the working and non-working models using theano.printing.debugprint.
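
For example, something like this (a sketch using the model and variable names from the reproduction at the top of the thread; model.logpt is the model's joint log-likelihood):

import theano

# Print the log-likelihood graphs of the working and non-working models
# (model and y1 as defined in the first example above):
theano.printing.debugprint(model.logpt)        # whole-model log-likelihood
theano.printing.debugprint(y1.logp_elemwiset)  # elementwise logp of a single RV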

Otherwise, what is pm.Data doing that isn't normally done?

@twiecki
Member

twiecki commented Dec 29, 2020

Then it could be in the code that processes what gets passed to observed. @cluhmann, are you familiar with using a debugger, and can you jump in there?
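
For instance, one quick way to jump in (a sketch, not from the thread; any debugger works):

import pdb

# Break just before the observed kwarg is processed, then step ('s')
# into the Distribution/ObservedRV machinery:
with pm.Model():
    p = pm.Beta('p', alpha=1, beta=1)
    pdb.run("pm.Bernoulli('y', p=p, observed=np.ones(10))", globals(), locals())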

@cluhmann
Member Author

Can't say the debugger is my favourite tool to use, but I can see if I can make any headway. I've already confirmed that the observed data itself does not seem to change, so it seems unrelated to #3640. If I don't seem to be making much progress, I'll report back.

@cluhmann
Member Author

No obvious causes, but maybe my digging can provide some clues for those more intimately familiar with the inner workings of pymc/theano.

To try to triangulate a bit, I focused on three different data scenarios:

with pm.Model() as model:
    p = pm.Beta('p', alpha=1, beta=1, shape=3)

    numpyData = np.zeros(10)
    numpyData2 = np.zeros(10)
    numpyData2[0] = 1
    pmData = pm.Data('pmData', numpyData)

    # data is a numpy array exclusively filled with zeros
    npVar = pm.Bernoulli('npVar', p=p[0], observed=numpyData)
    # data is a numpy array with a single 1
    npVar2 = pm.Bernoulli('npVar2', p=p[1], observed=numpyData2)
    # data is a pm.Data object filled with zeros
    pmVar = pm.Bernoulli('pmVar', p=p[2], observed=pmData)

The first is the problematic version; the other two work.

    print(type(npVar))
    print(type(npVar2))
    print(type(pmVar))

<class 'pymc3.model.ObservedRV'>
<class 'pymc3.model.ObservedRV'>
<class 'pymc3.model.ObservedRV'>

So far so good.

    print(type(npVar.observations))
    print(type(npVar2.observations))
    print(type(pmVar.observations))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'theano.tensor.sharedvar.TensorSharedVariable'>

So the variable defined using pm.Data has observations that are just a theano shared variable, which makes sense because pm.Data is just a wrapper around a TensorSharedVariable. I think this might be because the ObservedRV constructor saves data to self.observations (here) before converting data to a tensor (a couple of lines later). It's not clear to me why the result of as_tensor() isn't saved as self.observations, but I may be missing something important.
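
In other words, the order of operations seems to be roughly the following (a paraphrased sketch of pymc3.model.ObservedRV.__init__, not the verbatim source; names approximate):

# self.observations is assigned before any tensor conversion happens:
self.observations = data                           # raw observed input, stored as-is
data = as_tensor(data, name, model, distribution)  # conversion happens afterwards
# ...so self.observations only ends up holding a Theano variable when the
# input (e.g. a pm.Data shared variable) already was one.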

    print(npVar.observations.dtype)
    print(npVar2.observations.dtype)
    print(pmVar.observations.dtype)

float64
float64
float64

Seems fine.

    print(npVar.dtype)
    print(npVar2.dtype)
    print(pmVar.dtype)

int64
int64
float64

Ok. A bit odd. Presumably this happens when building the ObservedRV out of the pm.Data rather than the numpy array.

    print(npVar.shape.eval())
    print(npVar2.shape.eval())
    print(pmVar.shape.eval())

[10]
[10]
ERROR (theano.gof.opt): Optimization failure due to: LocalOptGroup(local_useless_fill,local_useless_alloc,local_subtensor_make_vector,local_useless_elemwise,local_useless_inc_subtensor,local_useless_slice,local_subtensor_of_alloc,local_useless_inc_subtensor_alloc,local_useless_rebroadcast,local_join_1,local_join_empty,local_join_make_vector,local_useless_switch,local_useless_tile,local_useless_split,local_useless_reshape,local_useless_elemwise_comparison,local_useless_reduce,local_view_op,local_merge_alloc,local_useless_topk)
ERROR (theano.gof.opt): node: ViewOp(Elemwise{Cast{int64}}.0)
ERROR (theano.gof.opt): TRACEBACK:
ERROR (theano.gof.opt): Traceback (most recent call last):
...
[10]

Ok, so obviously there are some difficulties with getting the shape of the pm.Data version of the variable. But in the end all the shapes look to be equivalent. And finally:

    theano.printing.debugprint(npVar)
    theano.printing.debugprint(npVar2)
    theano.printing.debugprint(pmVar)

ViewOp [id A] 'npVar'
|TensorConstant{(10,) of 0} [id B]
ViewOp [id A] 'npVar2'
|TensorConstant{[1 0 0 0 0..0 0 0 0 0]} [id B]
ViewOp [id A] 'pmVar'
|Elemwise{Cast{int64}} [id B] ''
|pmData [id C]

Nothing here seems obviously off to me, though these debugging messages aren't exactly transparent. Is there anything here that sets off alarm bells for anyone? Or anything that suggests directions to go digging?

@twiecki
Member

twiecki commented Dec 30, 2020

The only thing that jumps out at me is |TensorConstant{(10,) of 0} [id B], which seems like some optimization to store arrays of a single repeated value more efficiently. So maybe this could be interpreted as just a length-1 array with a zero in it? I wonder if there is a way to force a TensorConstant (or other tensor array, like a Vector) to test that more directly.
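
One direct way to check that (an aside; this relies on Theano's Constant.data attribute, which holds the wrapped NumPy array):

import numpy as np
import theano.tensor as tt

c = tt.as_tensor(np.zeros(10))
print(c)             # prints the shorthand, roughly: TensorConstant{(10,) of 0.0}
print(c.data.shape)  # (10,) -- the full-length array is still in there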

@brandonwillard
Contributor

brandonwillard commented Dec 30, 2020

TensorConstant{(10,) of 0} represents a NumPy array of zeros with length 10. It's just a printer shorthand.

For example, such terms are generated by expressions like the following:

>>> import numpy as np
>>> import theano.tensor as tt
>>> tt.as_tensor(np.r_[0, 0, 0])
TensorConstant{(3,) of 0}

@brandonwillard
Contributor

I would say that the discrepancy between the dtypes is bad. Also, the float64 dtype of pmVar (i.e. the RV with pm.Data observations) looks simply incorrect, because pmVar is a pm.Bernoulli, and those have finite, discrete support.

This is also the reason why you're seeing that optimization error in Theano.

In other words, it looks like there's a bug in the pm.Data-as-observations process.

N.B.: the following will raise an exception for that error so that one can investigate further with their debugger of choice:

import theano


with theano.config.change_flags(on_opt_error="raise"):
    pmVar.shape.eval()

@ricardoV94
Member

I don't know if this helps, but I noticed that this issue does not happen with the equivalent model using Binomial distributions.

numpyData = np.zeros(10)
numpyData2 = np.zeros(10)
numpyData2[0] = 1

with pm.Model() as model:
    pmData = pm.Data('pmData', numpyData)

    p = pm.Beta('p', alpha=1, beta=1, shape=3)

    # data is a numpy array exclusively filled with zeros
    npVar = pm.Bernoulli('npVar', p=p[0], observed=numpyData)
    # data is a numpy array with a single 1
    npVar2 = pm.Bernoulli('npVar2', p=p[1], observed=numpyData2)
    # data is a pm.Data object filled with zeros
    pmVar = pm.Bernoulli('pmVar', p=p[2], observed=pmData)

model.check_test_point()

p_logodds__   -4.16
npVar         -0.69
npVar2        -6.93
pmVar         -6.93
Name: Log-probability of test_point, dtype: float64
And the Binomial version:

with pm.Model() as model:
    pmData = pm.Data('pmData', numpyData)

    p = pm.Beta('p', alpha=1, beta=1, shape=3)

    # data is a numpy array exclusively filled with zeros
    npVar = pm.Binomial('npVar', n=1, p=p[0], observed=numpyData)
    # data is a numpy array with a single 1
    npVar2 = pm.Binomial('npVar2', n=1, p=p[1], observed=numpyData2)
    # data is a pm.Data object filled with zeros
    pmVar = pm.Binomial('pmVar', n=1, p=p[2], observed=pmData)

model.check_test_point()

p_logodds__   -4.16
npVar         -6.93
npVar2        -6.93
pmVar         -6.93
Name: Log-probability of test_point, dtype: float64
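
The expected values are easy to check by hand: under the test point p = 0.5, each of the 10 observations contributes log(0.5), so npVar's -0.69 in the Bernoulli model is the log-probability of a single data point rather than of all ten:

import numpy as np

print(np.log(0.5))       # -0.693... what the buggy Bernoulli npVar reports
print(10 * np.log(0.5))  # -6.931... the correct total over all 10 observations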

@ricardoV94
Member

ricardoV94 commented Jan 15, 2021

Also, using .dist() outside of a model context raises Theano shape mismatch errors for the homogeneous data, but not for the data with a single 1:

pm.Bernoulli.dist(p=.5).logp(numpyData).eval()

ERROR (theano.gof.opt): Optimization failure due to: constant_folding
ERROR (theano.gof.opt): node: Elemwise{switch,no_inplace}(TensorConstant{(10,) of 1}, TensorConstant{(1,) of -0..1805599453}, TensorConstant{(1,) of -inf})
ERROR (theano.gof.opt): TRACEBACK:
ERROR (theano.gof.opt): Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/theano/gof/opt.py", line 2129, in process_node
    replacements = lopt.transform(node)
  File "/usr/local/lib/python3.6/dist-packages/theano/tensor/opt.py", line 7061, in constant_folding
    required = thunk()
  File "/usr/local/lib/python3.6/dist-packages/theano/gof/op.py", line 884, in rval
    thunk()
  File "/usr/local/lib/python3.6/dist-packages/theano/gof/cc.py", line 1845, in __call__
    raise exc_value.with_traceback(exc_trace)
ValueError: Input dimension mis-match. (input[0].shape[0] = 10, input[1].shape[0] = 1)

array([-0.69314718])

@ricardoV94
Member

ricardoV94 commented Jan 15, 2021

Also, the new logcdf method works fine. Does that mean it might be an issue with the logp method?

Edit: Found it! It's an issue with tt.switch:

print(tt.switch(numpyData, 1, 0).eval())
print(tt.switch(numpyData2, 1, 0).eval())

[0]
[1 0 0 0 0 0 0 0 0 0]

@ricardoV94
Member

ricardoV94 commented Jan 15, 2021

This line (and the equivalent one for the logit_p version) is the culprit:

https://github.com/pymc-devs/pymc3/blob/acb8da03005c28f33f55395b8c18ab73aae771a6/pymc3/distributions/discrete.py#L452

should be tt.switch(tt.gt(value, 0), ...)
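
Roughly, the current line and the proposed fix look like this (paraphrased from the linked source, not verbatim):

# current: the raw (float) observed values are used directly as the condition
tt.switch(value, tt.log(p), tt.log(1 - p))
# proposed: make the condition an explicit boolean comparison
tt.switch(tt.gt(value, 0), tt.log(p), tt.log(1 - p))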

I wonder if this kind of issue might crop up in other places...

@cluhmann
Member Author

Good find! Using integers as booleans is always dicey, but the tt.switch() behavior definitely seems odd.

@ricardoV94
Member

@brandonwillard has already fixed the Theano issue! Do we need to do anything here?

An immediate solution that does not depend on syncing to the next Theano release is to change those 2 lines in the logp to use tt.neq(value, 0).

Even if we do nothing, should we at least add a maintenance release note?

@AlexAndorra
Contributor

@brandonwillard has already fixed the Theano issue!

🎉

If we just switch to the latest Theano-PyMC, it seems to me that we don't have anything to fix here, right?
I'd vote for that, insofar as 3.11 will depend on the latest Theano-PyMC anyway -- won't it, @michaelosthege?

@michaelosthege
Member

@AlexAndorra yes, 3.11 will depend on Theano-PyMC 1.1.

If you want to add a line to the release notes about this, please do so in #4405 and link/close this issue in the commit message.

@AlexAndorra
Contributor

Can't we just open a simple PR that only updates the release notes and links to the fixing PR on the Theano-PyMC repo?

@michaelosthege
Member

Can't we just open a simple PR that only updates the release notes and links to the fixing PR on the Theano-PyMC repo?

Technically, the upgrade to Theano-PyMC 1.1.0 is done in #4405. If you want me to, I can add the line to the release notes there. They will have to be updated anyway.

@AlexAndorra
Contributor

If you're already set up to do it, I'd appreciate that! I can do it too, but getting set up to push to other people's forks is always a bit annoying.

@michaelosthege
Member

Closed by #4405.
