Issue 118: First pass at missing data support #107

seabbs · 2024-03-01T17:32:01Z

This PR adds support for missing data in the observation model by moving from a vectorised to a loop-based approach. It also renames the negative binomial observation model to match the distribution naming scheme.

seabbs · 2024-03-05T10:17:21Z

@SamuelBrand1 any thoughts here? The issue I had here was in trying to give both a vectorised and non-vectorised likelihood option but perhaps just accepting it can't be vectorised is the way forward for now?

It all seems surprisingly clunky to me as well, so perhaps I have just got the wrong end of the stick.

SamuelBrand1 · 2024-03-06T11:30:05Z

> @SamuelBrand1 any thoughts here? The issue I had here was in trying to give both a vectorised and non-vectorised likelihood option but perhaps just accepting it can't be vectorised is the way forward for now?
>
> It all seems surprisingly clunky to me as well, so perhaps I have just got the wrong end of the stick.

Yeah, the underlying maths here is that vectorised == multivariate distribution and non-vectorised == conditionally independent.

With a vector of element type Union{Missing, Integer}, it doesn't make much sense to also use arraydist, because we would be asking the likelihood to be multivariate and conditionally independent at the same time. The reason we want both is that we believe ReverseDiff AD performs better with the vectorised form.

So I'd suggest just trying out a non-vectorised approach and checking whether it really does have a big performance hit. The AD systems evolve independently of the PPL, so what might have been true when they wrote their performance tips might not be true now. If the hit is small, then a loop is more flexible: missing elements will just work as expected.
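To make the "conditionally independent" reading concrete, here is a minimal sketch in plain Python (not the package's Julia/Turing code). It assumes a negative binomial observation model parameterised by a size `r` and a success probability `p`, and it represents a missing day as `None`; the function names are hypothetical. The loop simply skips missing days, which under conditional independence gives the same likelihood for the observed days as marginalising the missing ones out. A PPL may instead treat the missing values as quantities to sample.

```python
import math
from typing import Optional, Sequence


def neg_bin_logpmf(y: int, r: float, p: float) -> float:
    """Log pmf of a negative binomial with size r and success probability p."""
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(p) + y * math.log1p(-p)
    )


def loop_loglik(ys: Sequence[Optional[int]], r: float, p: float) -> float:
    """Sum per-observation log-likelihoods, skipping missing (None) entries."""
    total = 0.0
    for y in ys:
        if y is None:
            continue  # a missing day contributes nothing to the likelihood
        total += neg_bin_logpmf(y, r, p)
    return total


# Hypothetical daily counts with one missing day.
cases = [3, None, 5, 2]
print(loop_loglik(cases, r=10.0, p=0.5))
```

Because each term is added independently, nothing about the loop cares whether the element type is mixed, which is the flexibility argued for above.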

SamuelBrand1 · 2024-03-06T12:20:30Z

Another option is to make the concept of missing part of the generative process; e.g. there is a probability model for observed cases (we have that) and another that filters its output, so that unobserved cases on a day are given the value -1 (for example, which maintains a common eltype in what I imagine is an Integer array).

Then the usage would be missing -> -1 in data wrangling.
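A rough sketch of that data-wrangling step, again in plain Python rather than the package's Julia code. The sentinel constant, the function names, and the Poisson observation model are all assumptions chosen for illustration; the only idea taken from the comment is that missing days are recoded as -1 so the array keeps a single integer element type.

```python
import math
from typing import List, Optional, Sequence

MISSING_SENTINEL = -1  # hypothetical marker; valid counts are never negative


def encode_missing(ys: Sequence[Optional[int]]) -> List[int]:
    """Recode missing (None) days as the sentinel, keeping an all-int list."""
    return [MISSING_SENTINEL if y is None else y for y in ys]


def poisson_logpmf(y: int, lam: float) -> float:
    """Log pmf of a Poisson distribution with mean lam."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)


def sentinel_loglik(encoded: Sequence[int], lam: float) -> float:
    """Log-likelihood over days that do not carry the sentinel value."""
    return sum(poisson_logpmf(y, lam) for y in encoded if y != MISSING_SENTINEL)


raw = [3, None, 0, 4]
encoded = encode_missing(raw)
print(encoded)
print(sentinel_loglik(encoded, lam=2.5))
```

The cost of this design is that every consumer of the data has to know that -1 means "unobserved" rather than a count.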

seabbs · 2024-03-06T14:44:12Z

> Another option is to make the concept of missing part of the generative process; e.g. there is a probability model for observed cases (we have that) and another that filters that and unobserved cases on a day are given value -1 (for example, and maintaining common eltype in what I imagine is an Integer array).
>
> Then the usage would be missing -> -1 in data wrangling.

I really don't love this and it feels very clunky but I see it would work and enable vectorisation.

seabbs · 2024-03-06T18:12:13Z

Note I have had to update the getting started examples to use generated_y_t rather than y_t. I think this is an internal Turing approach to account for having both observed data (i.e. y_t) and generated data (i.e. generated_y_t), which can now happen due to the support for partial missingness?

SamuelBrand1 · 2024-03-06T18:27:00Z

Do you have any sense that this is noticeably bad for inference time?

seabbs · 2024-03-06T18:47:19Z

The only real evidence I have for this is running the getting started example. Comparing CI runs there is perhaps a small slowdown (https://github.com/CDCgov/Rt-without-renewal/actions) but it doesn't seem massive (though it is hard to know how that would scale given so little of that CI check is from inference).

For interest in epinowcast we have this nice benchmark CI that really helps with these kinds of questions: epinowcast/epinowcast#442 (comment)

SamuelBrand1 · 2024-03-06T19:25:30Z

> the only real evidence I have for this is running the getting started example. Comparing CI runs there is perhaps a small slowdown (https://github.com/CDCgov/Rt-without-renewal/actions) but it doesn't seem massive (though hard to know how that would scale given so little of that CI check is from inference).
>
> For interest in epinowcast we have this nice benchmark CI that really helps with these kinds of questions: epinowcast/epinowcast#442 (comment)

Very nice. We can do that with BenchmarkTools I think

SamuelBrand1 · 2024-03-06T19:48:22Z

> > Another option is to make the concept of missing part of the generative process; e.g. there is a probability model for observed cases (we have that) and another that filters that and unobserved cases on a day are given value -1 (for example, and maintaining common eltype in what I imagine is an Integer array).
> >
> > Then the usage would be missing -> -1 in data wrangling.
>
> I really don't love this and it feels very clunky but I see it would work and enable vectorisation.

We could have the not missing (or missing indices) as input-able data?
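One way to read this suggestion, sketched in plain Python under stated assumptions: the positions of the observed days are computed once during data wrangling and passed in alongside the data, so the likelihood only ever touches a dense subset. The helper names, the Poisson observation model, and the per-day expected values are hypothetical and not taken from the package.

```python
import math
from typing import List, Optional, Sequence


def observed_indices(ys: Sequence[Optional[int]]) -> List[int]:
    """Positions of the days that were actually observed."""
    return [i for i, y in enumerate(ys) if y is not None]


def poisson_logpmf(y: int, lam: float) -> float:
    """Log pmf of a Poisson distribution with mean lam."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)


def indexed_loglik(
    ys: Sequence[Optional[int]],
    expected: Sequence[float],
    obs_idx: Sequence[int],
) -> float:
    """Log-likelihood evaluated only at the supplied observed positions."""
    return sum(poisson_logpmf(ys[i], expected[i]) for i in obs_idx)


cases = [4, None, 7, None]
expected_cases = [3.0, 4.0, 6.0, 8.0]  # hypothetical model output per day
idx = observed_indices(cases)          # computed once, then passed in as data
print(idx)
print(indexed_loglik(cases, expected_cases, idx))
```

Since the indexed subset contains no missing values, it could in principle be handed to a vectorised likelihood without a mixed element type.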

SamuelBrand1 · 2024-03-06T21:26:57Z

@seabbs The example script doesn't use a missing / Int mix.

I've tried that out here, but then generated quantities starts to chuck an error. So sadly this doesn't work yet :-(.

seabbs · 2024-03-06T21:55:57Z

> The example script doesn't use a missing / Int mix.

That is why I was using it for benchmarking as it should show the difference in speed for complete data (which is what we care about for the benchmark).

I tested the generated quantities in tests, but yes, ideally we would have a small example that had partially complete data at some point (an issue and not a priority for now I think).

seabbs · 2024-03-06T21:57:38Z

> I've tried that out here, but then generated quantities starts to chuck an error. So sadly this doesn't work yet :-(.

Where did you try this? Shouldn't it be here:

Rt-without-renewal/EpiAware/docs/src/examples/getting_started.jl

Line 240 in 02fcf74

truth_data = generated_obs

SamuelBrand1 · 2024-03-06T22:00:02Z

> > I've tried that out here, but then generated quantities starts to chuck an error. So sadly this doesn't work yet :-(.
>
> Where did you try this? Shouldn't it be here:
>
> Rt-without-renewal/EpiAware/docs/src/examples/getting_started.jl
>
> Line 240 in 02fcf74
>
> truth_data = generated_obs

Oops, I hadn't committed my change.

seabbs · 2024-03-06T22:23:49Z

I added a small example here as part of the getting started example: e28cc8c

SamuelBrand1 · 2024-03-06T22:25:43Z

> I added a small example here as part of the getting started example: e28cc8c

Nice one. LGTM now.

SamuelBrand1

Very nice.

seabbs force-pushed the missing-data branch from 3f74d0e to e7f0538 Compare March 6, 2024 11:26

seabbs added 4 commits March 6, 2024 13:50

first pass at missing data support

623109b

simplify obs

ac82737

add a test that checks fitting of y_t with and without missing data

0059927

add a test that checks fitting of y_t with and without missing data

0c03621

seabbs force-pushed the missing-data branch from 4c0fd2d to 0c03621 Compare March 6, 2024 13:50

relax type for generated y_t

f80d941

seabbs marked this pull request as ready for review March 6, 2024 14:43

seabbs enabled auto-merge March 6, 2024 14:44

seabbs added 2 commits March 6, 2024 14:48

check tests and make missing y_t an int

19a2c83

update test for y_t as integer

757e94e

seabbs linked an issue Mar 6, 2024 that may be closed by this pull request

Add missing data support #118

Closed

seabbs changed the title ~~First pass at missing data support~~ Issue 118: First pass at missing data support Mar 6, 2024

seabbs requested a review from SamuelBrand1 March 6, 2024 17:32

fix getting started example to use generated_y_t

02fcf74

SamuelBrand1 mentioned this pull request Mar 6, 2024

Observed data API #119

Open

add and test partial missing data in the getting started example

e28cc8c

SamuelBrand1 approved these changes Mar 6, 2024


seabbs merged commit 6fd57d8 into main Mar 6, 2024
10 checks passed

seabbs deleted the missing-data branch March 6, 2024 22:37
