Data Input Files
The following image is obviously a whiteboard photo. It comes from a discussion of experiences, observations, and proposals about using dataframes that represent the full initial agent state as LASER model inputs.
The top section attempts to capture Jonathan's approach, in which the LASER model ingests a pre-created dataframe of agents: N agents, each of which has a small number of properties, each of which has a value at every timestep, including t=0. In fact, both models do this. In Jonathan's model, turning parameters into a model population state dataframe is treated as a separate, explicit pre-processing step. In Christopher's version, that step is more implicit in the case of a Synthetic Population. In Jonathan's model, which uses an EULA-ified cohort, the model input is split into the modeled agents and the "massively compressed EULA agents", so there are really two files.
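As a rough illustration of the two-file split described above, here is a minimal pandas sketch. The column names (`node`, `age_years`, `compressible`) and the rule for deciding which agents get compressed are assumptions made up for this example; they are not taken from either model.

```python
import pandas as pd

# Hypothetical agent table: one row per agent, one column per property.
agents = pd.DataFrame({
    "node": [0, 0, 0, 1, 1, 1],
    "age_years": [2, 30, 30, 1, 30, 30],
    # Assumption: a boolean flag marks agents that can be compressed.
    "compressible": [False, True, True, False, True, True],
})

# File 1: agents that are still modeled individually.
modeled = agents[~agents["compressible"]].reset_index(drop=True)

# File 2: compressible agents collapsed into counts per (node, age) group.
compressed = (
    agents[agents["compressible"]]
    .groupby(["node", "age_years"])
    .size()
    .reset_index(name="count")
)
```

The compressed table keeps only one row per group with a count, which is where the size reduction would come from in this sketch.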
Note that there are at least 2 and probably 3 types of input populations we want to support:
- Synthetic Populations
Synth-Pops are, as they sound, entirely made-up populations with varying levels of fidelity to real-world detail. In the most extreme case, it's a population of N agents in a single node, with no defined age, sex, or any other realistic demographic properties. Or the agents can be spread mathematically across M nodes. It is desirable to be able to do this by just setting a few parameters. And if the population does need some formulaic (non-data-driven) age structure and expected lifespan, it is nice if that can be done with parameters or code.
- Real World Populations (esp England & Wales, and Northern Nigeria right now)
These are populations of agents matched to input datasets that represent a real place in space and time. In this case we almost certainly do want a full dataframe of agents with properties as our model input.
- Serialized Model State Populations or Serialized Population Files (SPFs)
It can often be desirable to save the population state in the "middle" of a simulation, and use that output file as an input file to a subsequent simulation. If that literal output file can be used as a literal input file, that is a very robust design capability.
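To make the first and third population types concrete, here is a minimal sketch that builds a synthetic population from a few parameters and then round-trips it through a file on disk. The helper name `make_synthetic_population`, the column names, the uniform age distribution, and the choice of CSV are all assumptions for illustration; the actual LASER file format and schema are not specified here.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def make_synthetic_population(n_agents, n_nodes=1, max_age_years=None, seed=0):
    """Build an in-memory agent dataframe from a few parameters (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    population = pd.DataFrame({
        "agent_id": np.arange(n_agents),
        # Spread agents across nodes as evenly as possible.
        "node": np.arange(n_agents) % n_nodes,
    })
    if max_age_years is not None:
        # Formulaic (non-data-driven) age structure: uniform integer ages.
        population["age_years"] = rng.integers(0, max_age_years, size=n_agents)
    return population


population = make_synthetic_population(n_agents=1000, n_nodes=4, max_age_years=80)

# Round trip: the literal output file is used as the literal input file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "population.csv")
    population.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
```

If the reloaded dataframe matches the one that was written, the same file can serve as both a simulation output and a simulation input, which is the property described above.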
This is also represented to some degree in this diagram:
One of the points I'd like to emphasize is that "modeled population as dataframe-on-disk" is explicit for Real World Populations and SPFs (types 2 and 3 above) and implicit for Synthetic Populations (type 1), except that for Synthetic Populations the dataframe doesn't have to be on disk; it can be in memory. So the "Synthetic Population with a few parameters and no files" is really a special case. We do want to support it and make it easy, but from a software point of view, the less bespoke it is the better. My proposal here is that the model itself always ingests population dataframes, and that we have just 2 or 3 pre-processing input pipelines for producing them:
- Load literal population from file-on-disk;
- Create in-memory dataframe from parameters and code.
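The proposal can be sketched as follows: two small pre-processing functions that each return a dataframe, and a model entry point that only ever accepts a dataframe. All function names, the minimal schema, and the placeholder "simulation" are hypothetical and only meant to show the shape of the design.

```python
import io

import pandas as pd

# Assumed minimal schema for this sketch; the real schema is not defined here.
REQUIRED_COLUMNS = {"agent_id", "node"}


def load_population(source):
    """Pipeline 1: read a literal population from a path or file-like object."""
    return pd.read_csv(source)


def create_population(n_agents, n_nodes=1):
    """Pipeline 2: create an in-memory population from parameters and code."""
    ids = list(range(n_agents))
    return pd.DataFrame({"agent_id": ids, "node": [i % n_nodes for i in ids]})


def run_model(population):
    """Placeholder model entry point: it only ever sees a dataframe."""
    missing = REQUIRED_COLUMNS - set(population.columns)
    if missing:
        raise ValueError(f"population is missing columns: {sorted(missing)}")
    # Stand-in for the simulation: report the number of agents per node.
    return population.groupby("node").size().to_dict()


# Both pipelines feed the same entry point.
from_file = load_population(io.StringIO("agent_id,node\n0,0\n1,0\n2,1\n"))
from_params = create_population(n_agents=4, n_nodes=2)
result_file = run_model(from_file)
result_params = run_model(from_params)
```

Because `run_model` does not know which pipeline produced its input, the parameter-only synthetic case stops being bespoke: it is just another way of producing the same dataframe.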