Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 118: First pass at missing data support #107

Merged
merged 9 commits into from
Mar 6, 2024
Merged

Issue 118: First pass at missing data support #107

merged 9 commits into from
Mar 6, 2024

Conversation

seabbs
Copy link
Contributor

@seabbs seabbs commented Mar 1, 2024

This PR adds support for missing data in the observation model by moving from a vectorised to loop approach. It also renames the negative binomial observation model to match the distribution naming scheme.

@seabbs
Copy link
Contributor Author

seabbs commented Mar 5, 2024

@SamuelBrand1 any thoughts here? The issue I had here was in trying to give both a vectorised and non-vectorised likelihood option but perhaps just accepting it can't be vectorised is the way forward for now?

It all seems surprisingly clunky to me as well so perhaps I have just got the wrong end of the stick/.

@SamuelBrand1
Copy link
Collaborator

@SamuelBrand1 any thoughts here? The issue I had here was in trying to give both a vectorised and non-vectorised likelihood option but perhaps just accepting it can't be vectorised is the way forward for now?

It all seems surprisingly clunky to me as well so perhaps I have just got the wrong end of the stick/.

Yeah, the underlying maths here is that vectorised == multivariate distribution and non-vectorised == conditionally independent.

Having a vector with element type Union{Missing, Integer} doesn't make much sense to also use arraydist, whereas we are asking it to be multivariate and conditionally independent at the same time. The reason we want that is because we believe that the performance of the ReverseDiff AD is better that way.

So I'd suggest just trying out a non-vectorised approach and checking if it really does have a big performance hit. The AD systems evolve independent of the PPL and so what might have been true when they wrote their performance tips might not be true now. If it doesn't then a loop is more flexible, missing elements will just work as expected.

@SamuelBrand1
Copy link
Collaborator

SamuelBrand1 commented Mar 6, 2024

Another option is to make the concept of missing part of the generative process; e.g. there is a probability model for observed cases (we have that) and another that filters that and unobserved cases on a day are given value -1 (for example, and maintaining common eltype in what I imagine is an Integer array).

Then the usage would be missing -> -1 in data wrangling.

@seabbs seabbs marked this pull request as ready for review March 6, 2024 14:43
@seabbs
Copy link
Contributor Author

seabbs commented Mar 6, 2024

Another option is to make the concept of missing part of the generative process; e.g. there is a probability model for observed cases (we have that) and another that filters that and unobserved cases on a day are given value -1 (for example, and maintaining common eltype in what I imagine is an Integer array).

Then the usage would be missing -> -1 in data wrangling.

I really don't love this and it feels very clunky but I see it would work and enable vectorisation.

@seabbs seabbs enabled auto-merge March 6, 2024 14:44
@seabbs seabbs linked an issue Mar 6, 2024 that may be closed by this pull request
@seabbs seabbs changed the title First pass at missing data support Issue 118: First pass at missing data support Mar 6, 2024
@seabbs seabbs requested a review from SamuelBrand1 March 6, 2024 17:32
@seabbs
Copy link
Contributor Author

seabbs commented Mar 6, 2024

Note I have had to update the getting started examples to use generated_y_t vs y_t which I think is an internal Turing approach to account for having both observed data (i.e. y_t) and generated data (i.e generated_y_t) which can now happen due to the support for partial missingness?

@SamuelBrand1
Copy link
Collaborator

Do you have any sense that this is noticeably bad for inference time?

@seabbs
Copy link
Contributor Author

seabbs commented Mar 6, 2024

the only real evidence I have for this is running the getting started example. Comparing CI runs there is perhaps a small slowdown (https://github.com/CDCgov/Rt-without-renewal/actions) but it doesn't seem massive (though hard to know how that would scale given so little of that CI check is from inference).

For interest in epinowcast we have this nice benchmark CI that really helps with these kinds of questions: epinowcast/epinowcast#442 (comment)

@SamuelBrand1
Copy link
Collaborator

the only real evidence I have for this is running the getting started example. Comparing CI runs there is perhaps a small slowdown (https://github.com/CDCgov/Rt-without-renewal/actions) but it doesn't seem massive (though hard to know how that would scale given so little of that CI check is from inference).

For interest in epinowcast we have this nice benchmark CI that really helps with these kinds of questions: epinowcast/epinowcast#442 (comment)

Very nice. We can do that with BenchmarkTools I think

@SamuelBrand1
Copy link
Collaborator

Another option is to make the concept of missing part of the generative process; e.g. there is a probability model for observed cases (we have that) and another that filters that and unobserved cases on a day are given value -1 (for example, and maintaining common eltype in what I imagine is an Integer array).

Then the usage would be missing -> -1 in data wrangling.

I really don't love this and it feels very clunky but I see it would work and enable vectorisation.

We could have the not missing (or missing indices) as input-able data?

@SamuelBrand1
Copy link
Collaborator

@seabbs The example script doesn't use a missing / Int mix.

I've tried that out here, but then generated quantities starts to chuck an error. So sadly this doesn't work yet :-(.

@seabbs
Copy link
Contributor Author

seabbs commented Mar 6, 2024

The example script doesn't use a missing / Int mix.

That is why I was using it for benchmarking as it should show the difference in speed for complete data (which is what we care about for the benchmark).

I tested the generated quantities in tests but yes ideally we would have a small example to had partially complete data at some point (an issue and not a priority for now I think).

@seabbs
Copy link
Contributor Author

seabbs commented Mar 6, 2024

e tried that out here, but then generated quantities starts to chuck an error. So sadly this doesn't work yet :-(.

Where did you try this? Shouldn't it be here:

@SamuelBrand1
Copy link
Collaborator

e tried that out here, but then generated quantities starts to chuck an error. So sadly this doesn't work yet :-(.

Where did you try this? Shouldn't it be here:

Ooops I hadn't committed my change

@SamuelBrand1 SamuelBrand1 mentioned this pull request Mar 6, 2024
@seabbs
Copy link
Contributor Author

seabbs commented Mar 6, 2024

I added a small example here as part of the getting started example: e28cc8c

@SamuelBrand1
Copy link
Collaborator

I added a small example here as part of the getting started example: e28cc8c

Nice one. LGTM now.

Copy link
Collaborator

@SamuelBrand1 SamuelBrand1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice.

@seabbs seabbs merged commit 6fd57d8 into main Mar 6, 2024
10 checks passed
@seabbs seabbs deleted the missing-data branch March 6, 2024 22:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add missing data support
2 participants