Description
In the terminology of the paper, each step of PClean's SMC initialization algorithm corresponds to a subproblem, and several subproblems together form an increment of the database. Currently, the particle set is collapsed to a single particle after every increment, and that particle is then cloned back to the original number of particles for the following increment. This lets all particles share memory for the "distant past" (on which they are forced to agree) and store only "diffs" against that history, recording the values on which they may disagree. Without this, each particle would need to store its own copy of the latent database, which is not practical for very large datasets. However, this aspect of the algorithm should probably be configurable, via a `collapse_particles=False` option in `InferenceConfig`, which would have each particle store the entire latent database in memory. Future work could explore more memory-efficient representations (I'm sure there are interesting data structures for this sort of thing).
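
To make the memory trade-off concrete, here is a toy Python sketch of the shared-history-plus-diffs idea. It is not PClean's implementation: `Particle`, `run_increments`, and `stored_cells` are hypothetical names, the random draws stand in for real SMC proposals, and the uniform choice of a survivor stands in for weighted resampling. Only the `collapse_particles` flag mirrors the option proposed above.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Particle:
    # Hypothetical toy particle: `shared` is the committed history (one dict
    # object referenced by every particle); `diff` holds this particle's own values.
    shared: dict
    diff: dict = field(default_factory=dict)

    def lookup(self, key):
        # A particle's own diff takes precedence over the shared history.
        return self.diff[key] if key in self.diff else self.shared[key]


def run_increments(increments, num_particles, collapse_particles=True, seed=0):
    rng = random.Random(seed)
    shared = {}
    particles = [Particle(shared) for _ in range(num_particles)]
    for increment in increments:
        for key in increment:  # each key stands in for one subproblem
            for p in particles:
                p.diff[key] = rng.random()  # placeholder for a real proposal
        if collapse_particles:
            # Placeholder for weighted resampling down to a single particle.
            survivor = rng.choice(particles)
            shared.update(survivor.diff)  # commit the survivor's values
            # Clone back up: fresh particles with empty diffs, same shared dict.
            particles = [Particle(shared) for _ in range(num_particles)]
    return shared, particles


def stored_cells(shared, particles):
    # Count stored values: shared history once, plus every particle's diff.
    return len(shared) + sum(len(p.diff) for p in particles)


increments = [["a", "b"], ["c", "d"], ["e", "f"]]
for flag in (True, False):
    shared, particles = run_increments(increments, 4, collapse_particles=flag)
    print(flag, stored_cells(shared, particles))
```

With three increments of two values each and four particles, the collapsed run ends with 6 stored values (one shared history, empty diffs), while the uncollapsed run ends with 24 (every particle holds all six values). In the collapsed case, storage during an increment is bounded by the shared history plus one increment's worth of diffs per particle.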