Skip to content

Structured logging

Stefan Piatek edited this page Jul 1, 2020 · 6 revisions

Structured logging

to make use of structured logging, all logging though be done through tlo. (from tlo import logging)

Setting up an analysis script

The recommended way to configure a simulation is to follow the run() template.

Creating the simulation object with the seed and the logging configuration ensures that users don't have to worry about the order of configuration.

def run():
    # To reproduce the results, you need to set the seed for the Simulation instance. The Simulation
    # will seed the random number generators for each module when they are registered.
    # If a seed argument is not given, one is generated. It is output in the log and can be
    # used to reproduce results of a run
    seed = 567

    # By default, all output is recorded at the "INFO" level (and up) to standard out. You can
    # configure the behaviour by passing options to the `log_config` argument of
    # Simulation.
    log_config = {
        "filename": "labour_analysis",   # The prefix for the output file. A timestamp will be added to this.
        "directory": "./outputs",  # The default output path is `./outputs`. Change it here, if necessary
        "custom_levels": {  # Customise the output of specific loggers. They are applied in order:
            "*": logging.WARNING,  # Asterisk matches all loggers - we set the default level to WARNING
            "tlo.methods.labour": logging.INFO,
            "tlo.methods.newborn_outcomes": logging.DEBUG
        }
    }
    # For default configuration, not saving logs to a file
    # log_config = dict()

    # Basic arguments required for the simulation
    start_date = Date(2010, 1, 1)
    end_date = Date(2020, 1, 1)
    pop_size = 1000

    # This creates the Simulation instance for this run. Because we"ve passed the `seed` and
    # `log_config` arguments, these will override the default behaviour.
    sim = Simulation(start_date=start_date, seed=seed, log_config=log_config)

    # Path to the resource files used by the disease and intervention methods
    resources = "./resources"

    # Used to configure health system behaviour
    service_availability = ["*"]

    # We register all modules in a single call to the register method, calling once with multiple
    # objects. This is preferred to registering each module in multiple calls because we will be
    # able to handle dependencies if modules are registered together
    sim.register(
        demography.Demography(resourcefilepath=resources),
        enhanced_lifestyle.Lifestyle(resourcefilepath=resources),
        healthsystem.HealthSystem(resourcefilepath=resources, service_availability=service_availability),
        symptommanager.SymptomManager(resourcefilepath=resources),
        healthseekingbehaviour.HealthSeekingBehaviour(resourcefilepath=resources),
        contraception.Contraception(resourcefilepath=resources),
        labour.Labour(resourcefilepath=resources),
        pregnancy_supervisor.PregnancySupervisor(resourcefilepath=resources),
    )
    sim.make_initial_population(n=pop_size)
    sim.simulate(end_date=end_date)
    return sim


sim = run()
log_df = parse_log_file(sim.log_filepath)

Data types that can be logged

Recommended types

Dictionaries

These are the preferred datatype to be logged. Scalar data types will be preserved, supporting:

  • Numeric types
  • String types
  • Categories (will be converted to strings)
  • Missing values (NaN, NaT, None)
death_summary = {
    'died_count': died,
    'died_ratio': died / len(df),
    'most_common_symptom_on_death': rng.choice(
        ['fever', 'vomiting', 'stomachache', 'sore_throat', 'respiratory_symptoms', 'headache']
    )
}

logger.info(key='death',
            data=death_summary,
            description='Summary of deaths this month')
date died_count died_ratio most_common_symptom_on_death
0 2010-01-01 00:00:00 52 0.0052 respiratory_symptoms
1 2010-02-01 00:00:00 54 0.0054 respiratory_symptoms
2 2010-03-01 00:00:00 77 0.0077 stomachache
3 2010-04-01 00:00:00 41 0.0041 vomiting
4 2010-05-01 00:00:00 90 0.009 sore_throat

Dataframe with a single row

A dataframe with only one row can be logged without any changes required.

# get single row of dataframe (but not a series)
individual = df.head(1)

logger.debug(key='individual_check',
             data=individual,
             description='following an individual through simulation')
date is_alive date_of_birth date_of_death sex mother_id age_exact_years
0 2010-01-01 00:00:00 True 2006-08-09 00:00:00 F -1 3.39774
1 2010-02-01 00:00:00 True 2006-08-09 00:00:00 F -1 3.47988
2 2010-03-01 00:00:00 True 2006-08-09 00:00:00 F -1 3.55654
3 2010-04-01 00:00:00 True 2006-08-09 00:00:00 F -1 3.64142
4 2010-05-01 00:00:00 True 2006-08-09 00:00:00 F -1 3.72355

Fixed length lists

Lists that will always have the same number of items can be logged. To make them fit into the dataframe structure, each position in a list becomes a column.

top_3_symptoms: list = rng.choice(
    ['fever', 'vomiting', 'stomachache', 'sore_throat', 'respiratory_symptoms', 'headache'],
    3).tolist()

logger.info(key='top_symptoms',
            data=top_3_symptoms,
            description='Top 3 most common symptoms in the population')
date item_1 item_2 item_3
0 2010-01-01 00:00:00 sore_throat sore_throat fever
1 2010-02-01 00:00:00 fever stomachache headache
2 2010-03-01 00:00:00 stomachache respiratory_symptoms stomachache
3 2010-04-01 00:00:00 respiratory_symptoms stomachache fever
4 2010-05-01 00:00:00 fever stomachache stomachache

Series

A pandas series can be logged, it is handled just like a list

first_age_years = df.age_years.head(6) # series of 6 individual's age years
logger.info(key='age_years',
            data=first_age_years,
            description='First 6 individuals age in years')
date item_1 item_2 item_3 item_4 item_5 item_6
0 2010-01-01 00:00:00 3 53 11 19 31 4
1 2010-02-01 00:00:00 3 53 11 19 31 5
2 2010-03-01 00:00:00 3 53 11 19 32 5
3 2010-04-01 00:00:00 3 53 11 19 32 5
4 2010-05-01 00:00:00 3 53 11 19 32 5

Other (less recommended) types

You can log other types of data, but you might need to do work after parsing the logfile to convert the datatypes back to the right type.

Variable length list

If you're list varies in length, you log it by wrapping it in a dictionary

number_to_pick = rng.randint(1, 4)
symptoms_at_time = rng.choice(
    ['fever', 'vomiting', 'stomachache', 'sore_throat', 'respiratory_symptoms', 'headache'],
    number_to_pick).tolist()

logger.info(key='variable_selection',
            data={'list_head': symptoms_at_time},
            description='Symptoms an individual experiences at a given time')
date list_head
0 2010-01-01 00:00:00 ['respiratory_symptoms']
1 2010-02-01 00:00:00 ['respiratory_symptoms', 'stomachache', 'respiratory_symptoms']
2 2010-03-01 00:00:00 ['headache']
3 2010-04-01 00:00:00 ['stomachache']
4 2010-05-01 00:00:00 ['fever', 'sore_throat', 'fever']

Dataframe with more than one row

Logging an entire dataframe is possible, this converts it to a multi-index dataframe. The first index in the pair defines the timepoint that the dataframe was logged, the second index is from the dataframe being logged.

date is_alive date_of_birth date_of_death sex mother_id
(0, '0') 2010-01-01 00:00:00 True 2006-08-09T00:00:00 F -1
(0, '1') 2010-01-01 00:00:00 True 1956-10-13T00:00:00 M -1
(0, '2') 2010-01-01 00:00:00 True 1998-12-10T00:00:00 F -1
(1, '0') 2010-02-01 00:00:00 True 2006-08-09T00:00:00 F -1
(1, '1') 2010-02-01 00:00:00 True 1956-10-13T00:00:00 M -1
(1, '2') 2010-02-01 00:00:00 True 1998-12-10T00:00:00 F -1
(2, '0') 2010-03-01 00:00:00 True 2006-08-09T00:00:00 F -1
(2, '1') 2010-03-01 00:00:00 True 1956-10-13T00:00:00 M -1
(2, '2') 2010-03-01 00:00:00 True 1998-12-10T00:00:00 F -1
Clone this wiki locally