Azure batch

  • services:
    • azure batch account
    • azure storage account
    • azure container registry: hosting docker images
    • azure service principal: allows tasks to pull from azure container registry
    • data factory (?): could be useful for parameterised running, but we expect to just upload a script with configuration
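
As a point of reference, a minimal sketch (not from the original notes) of how the batch and storage accounts above might be authenticated from Python; account names, keys and URLs are placeholders, and the exact client classes depend on the SDK versions installed.

from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
from azure.storage.blob import BlockBlobService  # azure-storage-blob v2.x API

# Placeholder account details
BATCH_ACCOUNT_NAME = "mybatchaccount"
BATCH_ACCOUNT_KEY = "<batch-account-key>"
BATCH_ACCOUNT_URL = "https://mybatchaccount.westeurope.batch.azure.com"

# Client used to create pools, jobs and tasks
credentials = SharedKeyCredentials(BATCH_ACCOUNT_NAME, BATCH_ACCOUNT_KEY)
batch_client = BatchServiceClient(credentials, BATCH_ACCOUNT_URL)

# Storage account used for input scripts and for OutputFiles uploaded by tasks
blob_client = BlockBlobService(account_name="mystorageaccount",
                               account_key="<storage-account-key>")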

Useful tools

  • BatchExplorer allows better interaction with pools, jobs, nodes and data storage

notes

Structure of running jobs:

  • Pools

    • Define VM configuration for a job
    • Best practice
      • Pools should have more than one compute node for redundancy on failure
      • Have jobs use pools dynamically: if a job needs to move, point it at a new pool and delete the old pool once the work is complete
      • Resize pools to zero every few months
  • Applications

  • Jobs

    • Set of tasks to be run
    • Best practice
      • 1000 tasks in one job is more efficient than 10 jobs with 100 tasks
      • A job has to be explicitly terminated to be marked complete; the onAllTasksComplete property or maxWallClockTime can do this (see the sketch after this list)
  • Tasks

    • individual scripts/commands
    • Best practice
      • task nodes are ephemeral so any data will be lost unless uploaded to storage via OutputFiles
      • setting a retention time is a good idea for clarity and for cleaning up data
      • Bulk submit collections of up to 100 tasks at a time
      • tasks should build in some retry behaviour to withstand transient failures
  • Images

    • Custom images with OS
    • the storage blob containing the VM?
    • conda from the Linux Data Science VM
      • Windows has Python 3.7
      • Linux has Python 3.5, but f-string support could be installed
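
A minimal sketch of the job and task best practices above, using the azure-batch Python SDK; it assumes a batch_client (as in the earlier sketch), an existing pool called "example-pool", and placeholder task commands.

import azure.batch.models as batchmodels

# Create a job against an existing pool
job = batchmodels.JobAddParameter(
    id="example-job",
    pool_info=batchmodels.PoolInformation(pool_id="example-pool"))
batch_client.job.add(job)

# Bulk-add tasks in chunks of at most 100 per add_collection call
tasks = [
    batchmodels.TaskAddParameter(
        id=f"task-{i}",
        command_line=f"/bin/bash -c 'python run_sample.py --index {i}'")
    for i in range(250)
]
for start in range(0, len(tasks), 100):
    batch_client.task.add_collection("example-job", tasks[start:start + 100])

# Only now set onAllTasksComplete, so the (initially empty) job does not
# terminate before any tasks have been added
batch_client.job.patch(
    "example-job",
    batchmodels.JobPatchParameter(
        on_all_tasks_complete=batchmodels.OnAllTasksComplete.terminate_job))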

options for running your own packages

All of these are defined at the pool level.

  • Define start task
    • Each compute node runs this command as it joins the pool
    • Seems slow and wasteful to run this for each node
  • Create an application package
    • zip file with all dependencies
    • can version these and define which version you want to run
    • Issue with the default version of Python on Azure Batch Linux
    • Seems like a pain to do and redo when updating requirements or applications
  • Use a custom image
    • limit of 2500 dedicated compute nodes or 1000 low priority nodes in a pool
    • can create a VHD and then import it for batch service mode
    • linux image builder or can use Packer directly to build a linux image for user subscription mode
    • Seems like a reasonable option if the framework is stable
  • Use containers
    • can prefetch container images to save on download
    • They suggest storing and tagging the image on azure container registry
      • Higher cost tier allows for private azure registry
    • Can also pull docker images from other repos
    • Most flexible option without too much time spent on node setup
  • Can use docker images or any OCI images.
    • Is there a benefit to Singularity here?
  • VM without RDMA
    • Publisher: microsoft-azure-batch
    • Offer: centos-container
    • Offer: ubuntu-server-container
  • Need to configure the batch pool to run container workloads via ContainerConfiguration settings in the pool's VirtualMachineConfiguration
  • prefetch containers - use an Azure container registry in the same region as the pool
# assumes the azure-batch SDK has been imported, e.g.
#   import azure.batch as batch
#   import azure.batch.models
image_ref_to_use = batch.models.ImageReference(
    publisher='microsoft-azure-batch',
    offer='ubuntu-server-container',
    sku='16-04-lts',
    version='latest')

"""
Specify container configuration, fetching the official Ubuntu container image from Docker Hub.
"""

container_conf = batch.models.ContainerConfiguration(
    container_image_names=['custom_image'])

new_pool = batch.models.PoolAddParameter(
    id=pool_id,
    virtual_machine_configuration=batch.models.VirtualMachineConfiguration(
        image_reference=image_ref_to_use,
        container_configuration=container_conf,
        node_agent_sku_id='batch.node.ubuntu 16.04'),
    vm_size='STANDARD_D1_V2',
    target_dedicated_nodes=1)
...
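
The snippet above only configures the pool. As a hedged continuation (not from the Azure docs), credentials for a private Azure container registry could be supplied via the service principal mentioned earlier, and each task can then run inside the prefetched image; registry, image and credential values below are placeholders.

# Credentials for a private Azure Container Registry (placeholder values)
acr_registry = batch.models.ContainerRegistry(
    registry_server='myregistry.azurecr.io',
    user_name='<service-principal-app-id>',
    password='<service-principal-secret>')

container_conf = batch.models.ContainerConfiguration(
    container_image_names=['myregistry.azurecr.io/custom_image:latest'],
    container_registries=[acr_registry])

# A task then runs its command line inside the prefetched container image
task = batch.models.TaskAddParameter(
    id='sample-task-0',
    command_line='python /app/run_sample.py',
    container_settings=batch.models.TaskContainerSettings(
        image_name='myregistry.azurecr.io/custom_image:latest'))
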
  • Batch Shipyard exists for deploying HPC workloads; maybe try it
    • nice monitoring, task factory based on parameter sweeps, random or custom python generators
    • might be a bit more than we need.

Orchestrating via python API

Running python scripts in batch

Running python script in azure

  • Using the Batch Explorer tool, you can find the data science desktop

data factories

  • select VM with start task for installing requirements
  • use input and output storage blobs for input and output
  • create an azure data factory pipeline to run the python script on inputs and upload outputs

Running docker container, orchestrated by python API

  • Have deployed a simple docker project: https://github.com/stefpiatek/azure_batch-with_docker
    • uses azure container registry for hosting docker images
    • uploads multiple scripts and has each node run one script
    • then a post-processing task runs on one node (this would be the aggregation of runs)

Azure pipelines for building and pushing to container registry

Suggested usage via tlo CLI

Breakdown of simulations:

  • A scenario is essentially an analysis script which defines which modules are included and their settings.
    • Metadata read into the script:
      • Includes which parameters will be overridden by a random draw from a distribution.
      • Uses a seed for the simulation
  • Draw: All random draws for parameters that are overridden
    • 100s of draws for parameters
    • draw index: enumeration of the draw set for use in the script
  • Sample:
    • 1000s of seeds per draw
    • sample index: enumeration

Metadata files follow pattern {scenario}_draw-{draw_index}_sample-{sample_index}.json

Example scenario script:

  • Could make it into a class and have a separate method or attribute for config (e.g. log config, start date, end date, pop size, registering modules)
    • that way we control the order from the run method but maybe that's overkill
# -------------------------------------------------------------
# Name: contraception_example
# Created: 2020-08-11 11:45
# -------------------------------------------------------------
import numpy as np

from tlo import Date, Simulation, logging
from tlo.analysis.utils import parse_log_file
from tlo.methods import (
    contraception,
    demography,
    enhanced_lifestyle,
    healthseekingbehaviour,
    healthsystem,
    labour,
    pregnancy_supervisor,
    symptommanager,
)


# -------------------------------------------------------------
# Configure run
# -------------------------------------------------------------
def run(sim_seed, parameters):
    # By default, all output is recorded at the "INFO" level (and up) to standard out. You can
    # configure the behaviour by passing options to the `log_config` argument of
    # Simulation.
    log_config = {
        "filename": "contraception_example",  # The prefix for the output file. A timestamp will be added to this.
        "custom_levels": {  # Customise the output of specific loggers. They are applied in order:
            "tlo.methods.demography": logging.INFO,
            "tlo.methods.enhanced_lifestyle": logging.INFO
        }
    }
    # For default configuration, uncomment the next line
    # log_config = dict()

    # Basic arguments required for the simulation
    start_date = Date(2010, 1, 1)
    end_date = Date(2050, 1, 1)
    pop_size = 1000

    # This creates the Simulation instance for this run. Because we've passed the `seed` and
    # `log_config` arguments, these will override the default behaviour.
    sim = Simulation(start_date=start_date, seed=sim_seed, log_config=log_config)

    # Path to the resource files used by the disease and intervention methods
    resources = "./resources"

    # Used to configure health system behaviour
    service_availability = ["*"]

    # We register all modules in a single call to the register method, calling once with multiple
    # objects. This is preferred to registering each module in multiple calls because we will be
    # able to handle dependencies if modules are registered together
    sim.register(
        demography.Demography(resourcefilepath=resources),
        enhanced_lifestyle.Lifestyle(resourcefilepath=resources),
        healthsystem.HealthSystem(resourcefilepath=resources, disable=True, service_availability=service_availability),
        symptommanager.SymptomManager(resourcefilepath=resources),
        healthseekingbehaviour.HealthSeekingBehaviour(resourcefilepath=resources),
        contraception.Contraception(resourcefilepath=resources),
        labour.Labour(resourcefilepath=resources),
        pregnancy_supervisor.PregnancySupervisor(resourcefilepath=resources),
    )

    sim.override_parameters(parameters)

    sim.make_initial_population(n=pop_size)
    sim.simulate(end_date=end_date)
    return sim


# -------------------------------------------------------------
# Define how to draw override parameters
# -------------------------------------------------------------

def draw_parameters(rng: np.random.RandomState):
    return {
        demography.Demography: {
            'fraction_of_births_male': rng.randint(500, 520) / 1000
        },
        contraception.Contraception: {
            'r_init_year': 0.125,
            'r_discont_year': rng.exponential(0.1),
        },
    }


# -------------------------------------------------------------
# Interactive running, using a single draw and seed
# -------------------------------------------------------------
if __name__ == '__main__':
    # setup seed and override_parameters
    seed = 1
    rng = np.random.RandomState(seed)
    override_parameters = draw_parameters(rng)
    # run the simulation
    sim = run(seed, override_parameters)

    # read the results
    output = parse_log_file(sim.log_filepath)

Interface for interacting with scenario scripts

We create sample metadata

tlo create-samples contraception_example.py --seed 70 --draws 100 --samples-per-draw 1000

example sample metadata file: contraception_example_draw-5_sample-2.json

{"create_sample_seed": 70
 "sim_seed": 9313386,
 "path": "src/scripts/contraception_example.py",
 "draw_index": 5,
 "sample_index": 2,
 "override_parameters": {"tlo.methods.demography.Demography": {"fraction_of_births_male": 0.505},
                         "tlo.methods.contraception.Contraception": {"r_init_year": 0.125,
                                                                     "r_discont_year": 0.5872725857609721}
                         }
 }
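
Since the tlo CLI above is a proposal, here is a rough sketch (function and helper names are hypothetical, not the actual implementation) of what create-samples could do: enumerate draws and samples, derive a per-sample simulation seed from the top-level seed, and write metadata files following the naming pattern above.

import importlib.util
import json
from pathlib import Path

import numpy as np


def create_samples(scenario_path, seed, n_draws, samples_per_draw):
    """Hypothetical sketch of `tlo create-samples`."""
    # Load the scenario script so its draw_parameters() can be called
    spec = importlib.util.spec_from_file_location("scenario", scenario_path)
    scenario = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(scenario)

    scenario_name = Path(scenario_path).stem
    rng = np.random.RandomState(seed)

    for draw_index in range(n_draws):
        # draw_parameters() keys the overrides by module class; convert the keys
        # to the dotted-path strings used in the metadata files
        overrides = {
            f"{module.__module__}.{module.__name__}": params
            for module, params in scenario.draw_parameters(rng).items()
        }
        for sample_index in range(samples_per_draw):
            metadata = {
                "create_sample_seed": seed,
                "sim_seed": int(rng.randint(0, 2 ** 31 - 1)),
                "path": scenario_path,
                "draw_index": draw_index,
                "sample_index": sample_index,
                "override_parameters": overrides,
            }
            filename = f"{scenario_name}_draw-{draw_index}_sample-{sample_index}.json"
            Path(filename).write_text(json.dumps(metadata, indent=2))

For example, create_samples("src/scripts/contraception_example.py", seed=70, n_draws=100, samples_per_draw=1000) would write the metadata files for the command above.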

You can then run a specific draw and sample, reading in the metadata json and then running the simulation with this data.

tlo run-sample contraception_example.py contraception_example_draw-5_sample-2.json
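
run-sample could then be little more than the following (again a hypothetical sketch, not the actual implementation); note that sim.override_parameters would need to accept the dotted module paths used in the metadata json.

import importlib.util
import json


def run_sample(scenario_path, metadata_path):
    """Hypothetical sketch of `tlo run-sample`."""
    with open(metadata_path) as f:
        metadata = json.load(f)

    # Import the scenario script and hand the drawn seed and parameter
    # overrides to its run() function
    spec = importlib.util.spec_from_file_location("scenario", scenario_path)
    scenario = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(scenario)

    return scenario.run(metadata["sim_seed"], metadata["override_parameters"])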

Running on Azure

After you are happy with a local run, you commit the scenario python file and push this to your branch.

You can log in to a dedicated node in azure, pull your branch, generate samples and run one to make sure that this is working correctly.

When you are ready to run an entire scenario use the tlo CLI:

tlo run-azure-scenario contraception_example.py --seed 70 --draws 70 --samples-per-draw 1000 --branch stef/testing --commit-id 8b71b5be5f293387eb270cffe5a0925b0d97830f

(If no branch is given, master is used; if no commit id is given, the latest commit is used.)

This uses the configuration data in your repository to:

  • create a job for the scenario
  • on startup
    • checkout the correct branch and commit
    • run tlo create-samples with the seed, draws and samples-per-draw
  • each node is assigned a task, or a series of tasks (if we want to cap the number of nodes); each task runs an individual sample, identified by the path to its json file
  • after all tasks are complete, a post-processing task pools/zips the sample json files and output dataframes (sketched below)
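
A hedged sketch of what run-azure-scenario might submit via the Batch Python SDK; pool and job ids, commands and the aggregation script are placeholders, and the job preparation task stands in for the per-node startup step above.

import azure.batch.models as batchmodels

n_draws, samples_per_draw = 70, 1000
n_samples = n_draws * samples_per_draw

# Job preparation task runs on each node before its first task: check out the
# requested branch/commit and generate the sample metadata files
job = batchmodels.JobAddParameter(
    id="contraception_example-job",
    pool_info=batchmodels.PoolInformation(pool_id="tlo-pool"),
    uses_task_dependencies=True,
    job_preparation_task=batchmodels.JobPreparationTask(
        command_line=("/bin/bash -c 'git checkout 8b71b5be && "
                      "tlo create-samples contraception_example.py "
                      "--seed 70 --draws 70 --samples-per-draw 1000'")))
batch_client.job.add(job)

# One task per sample, added in chunks of at most 100
sample_tasks = []
for i in range(n_samples):
    draw, sample = divmod(i, samples_per_draw)
    sample_tasks.append(batchmodels.TaskAddParameter(
        id=str(i),
        command_line=("/bin/bash -c 'tlo run-sample contraception_example.py "
                      f"contraception_example_draw-{draw}_sample-{sample}.json'")))
for start in range(0, n_samples, 100):
    batch_client.task.add_collection(job.id, sample_tasks[start:start + 100])

# Post-processing task that only starts once every sample task has completed
post_task = batchmodels.TaskAddParameter(
    id="post-process",
    command_line="/bin/bash -c 'python aggregate_outputs.py'",  # hypothetical script
    depends_on=batchmodels.TaskDependencies(
        task_id_ranges=[batchmodels.TaskIdRange(start=0, end=n_samples - 1)]))
batch_client.task.add(job.id, post_task)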