Azure batch

  • services:
    • azure batch account
    • azure storage account
    • azure container registry: hosting docker images
    • azure service principal: allows tasks to pull from the azure container registry
    • data factory (?): could be useful for parameterised running, but probably just upload the script with its configuration
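
To let pool nodes pull images from the azure container registry, the service principal's application id and secret can be passed alongside the pool's container configuration. A minimal sketch with the azure-batch Python SDK; the registry name, application id and secret are placeholders.

import azure.batch.models as batchmodels

# Service principal credentials (placeholders) that nodes use to pull from ACR
acr = batchmodels.ContainerRegistry(
    registry_server='<registry-name>.azurecr.io',
    user_name='<service-principal-application-id>',
    password='<service-principal-secret>')

container_conf = batchmodels.ContainerConfiguration(
    container_image_names=['<registry-name>.azurecr.io/<image>:<tag>'],
    container_registries=[acr])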

Useful tools

notes

Structure of running jobs:

  • Pools

    • Define VM configuration for a job
    • Best practice
      • Pools should have more than one compute node for redundancy on failure
      • Have jobs use pools dynamically: when moving jobs, point them at a new pool and delete the old pool once its jobs have completed
      • Resize pools to zero every few months
  • Applications

  • Jobs

    • Set of tasks to be run
    • Best practice
      • 1000 tasks in one job is more efficient than 10 jobs with 100 tasks
      • A job has to be explicitly terminated to complete; the onAllTasksComplete property or maxWallClockTime handles this (see the sketch after this list)
  • Tasks

    • individual scripts/commands
    • Best practice
      • task nodes are ephemeral so any data will be lost unless uploaded to storage via OutputFiles
      • Setting a retention time is a good idea for clarity and for cleaning up task data
      • Bulk submit collections of up to 100 tasks at a time
      • Build in some retries to withstand transient failures
  • Images

    • Custom images with OS
    • the storage blob containing the VM image?
    • conda is available from the linux data science vm image
      • the windows version has python 3.7
      • the linux version has python 3.5, but support for f-strings could be installed
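
As a worked example of the pieces above, a minimal sketch using the azure-batch Python SDK: create a job on an existing pool, bulk-submit tasks in collections of 100 with output upload and a retention time, then have the job terminate itself once all tasks complete. The account, pool id, container URL and script names are placeholders.

from datetime import timedelta

from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

# Placeholder batch account credentials and endpoint
credentials = SharedKeyCredentials('<batch-account>', '<batch-account-key>')
client = BatchServiceClient(credentials, 'https://<batch-account>.<region>.batch.azure.com')

job_id = 'example-job'
client.job.add(batchmodels.JobAddParameter(
    id=job_id,
    pool_info=batchmodels.PoolInformation(pool_id='<pool-id>'),
    constraints=batchmodels.JobConstraints(max_wall_clock_time=timedelta(hours=2))))

# Nodes are ephemeral, so upload results to blob storage via OutputFiles
# (the container URL must include a SAS token with write access)
output_files = [batchmodels.OutputFile(
    file_pattern='output/*.csv',
    destination=batchmodels.OutputFileDestination(
        container=batchmodels.OutputFileBlobContainerDestination(
            container_url='<container-url-with-sas>')),
    upload_options=batchmodels.OutputFileUploadOptions(
        upload_condition=batchmodels.OutputFileUploadCondition.task_success))]

tasks = [batchmodels.TaskAddParameter(
    id='task-{}'.format(i),
    command_line='/bin/bash -c "python3 run.py --index {}"'.format(i),
    output_files=output_files,
    constraints=batchmodels.TaskConstraints(
        retention_time=timedelta(days=1),  # clean task data off the node after a day
        max_task_retry_count=2))           # retry to withstand transient failures
    for i in range(250)]

# Bulk submit in collections of up to 100 tasks per call
for start in range(0, len(tasks), 100):
    client.task.add_collection(job_id, tasks[start:start + 100])

# The job only completes when explicitly terminated; have it terminate itself
# once all tasks are done
client.job.patch(job_id, batchmodels.JobPatchParameter(
    on_all_tasks_complete=batchmodels.OnAllTasksComplete.terminate_job))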

options for running your own packages

All of these are defined at the pool level.

  • Define start task
    • Each compute node runs this command as it joins the pool
    • Seems slow and wasteful to run this for each node (see the start-task sketch at the end of this section)
  • create an application package
    • zip file with all dependencies
    • can version these and define which version you want to run
    • Issue with the default version of Python on azure batch linux
    • Seems like a pain to do and redo when updating requirements or applications
  • Use a custom image
    • limit of 2500 dedicated compute nodes or 1000 low priority nodes in a pool
    • can create a VHD and then import it for batch service mode
    • linux image builder or can use Packer directly to build a linux image for user subscription mode
    • Seems like a reasonable option if the framework is stable
  • Use containers
    • can prefetch container images to save on download
    • They suggest storing and tagging the image on azure container registry
      • Higher cost tier allows for private azure registry
    • Can also pull docker images from other repos
    • Most flexible option without having too much time spent on node setup
  • Can use docker images or any OCI images.
    • Is there a benefit for Singularity here?
  • VM without RDMA
    • Publisher: microsoft-azure-batch
    • Offer: centos-container
    • Offer: ubuntu-server-container
  • need to configure the batch pool to run container workloads via ContainerConfiguration settings in the Pool's VirtualMachineConfiguration
  • prefetch containers - use an Azure container registry in the same region as the pool
import azure.batch as batch
import azure.batch.models  # makes batch.models available

# Marketplace image published by the Batch team with a container runtime preinstalled
image_ref_to_use = batch.models.ImageReference(
    publisher='microsoft-azure-batch',
    offer='ubuntu-server-container',
    sku='16-04-lts',
    version='latest')

# Container configuration: 'custom_image' is a placeholder for your tagged image,
# prefetched onto each node as it joins the pool
container_conf = batch.models.ContainerConfiguration(
    container_image_names=['custom_image'])

new_pool = batch.models.PoolAddParameter(
    id=pool_id,  # assumes pool_id is defined elsewhere
    virtual_machine_configuration=batch.models.VirtualMachineConfiguration(
        image_reference=image_ref_to_use,
        container_configuration=container_conf,
        node_agent_sku_id='batch.node.ubuntu 16.04'),
    vm_size='STANDARD_D1_V2',
    target_dedicated_nodes=1)
...
  • maybe try Batch Shipyard, which exists for deploying HPC workloads
    • nice monitoring, task factory based on parameter sweeps, random or custom python generators
    • might be a bit more than we need.
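
For the start-task option above, a minimal sketch of installing requirements on each node as it joins the pool, assuming an Ubuntu-based image; the requirements blob URL and packages are placeholders.

import azure.batch.models as batchmodels

# Start task runs on every compute node as it joins the pool.
# wait_for_success holds back task scheduling until the install has finished.
start_task = batchmodels.StartTask(
    command_line='/bin/bash -c "apt-get update && apt-get install -y python3-pip '
                 '&& pip3 install -r requirements.txt"',
    resource_files=[batchmodels.ResourceFile(
        # http_url in recent azure-batch versions (older versions use blob_source)
        http_url='<blob-url-with-sas-for-requirements.txt>',
        file_path='requirements.txt')],
    user_identity=batchmodels.UserIdentity(
        auto_user=batchmodels.AutoUserSpecification(
            scope=batchmodels.AutoUserScope.pool,
            elevation_level=batchmodels.ElevationLevel.admin)),
    wait_for_success=True,
    max_task_retry_count=1)

# Attach to the pool definition from the example above:
# new_pool = batch.models.PoolAddParameter(..., start_task=start_task)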

Orchestrating via python API

Running python scripts in batch

Running a python script in azure

  • using the Batch Explorer tool, you can find the data science desktop

data factories

  • select VM with start task for installing requirements
  • use input and output storage blobs for the script's inputs and outputs
  • create an azure data factory pipeline to run the python script on inputs and upload outputs
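
The blob staging either side of the pipeline might look like the sketch below (azure-storage-blob v12); the connection string, container names and file names are placeholders, and the data factory pipeline itself is set up separately.

from azure.storage.blob import BlobServiceClient

# Placeholder storage account connection string and containers
blob_service = BlobServiceClient.from_connection_string('<storage-connection-string>')
inputs = blob_service.get_container_client('inputs')
outputs = blob_service.get_container_client('outputs')

# Stage the script and its configuration before triggering the pipeline
for name in ['run.py', 'config.json']:
    with open(name, 'rb') as handle:
        inputs.upload_blob(name, handle, overwrite=True)

# After the pipeline has run, pull the outputs back down
for blob in outputs.list_blobs():
    with open(blob.name, 'wb') as handle:
        handle.write(outputs.download_blob(blob.name).readall())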

Running docker containers, orchestrated by the python API

  • Have deployed a simple docker project: https://github.com/stefpiatek/azure_batch-with_docker
    • uses azure container registry for hosting docker images
    • uploads multiple scripts and has each node run one script
    • a post-processing task then runs on one node (this would be the aggregation of runs)
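
One way to wire up the "post-processing after all runs" step is with task dependencies; a hedged sketch below (the image name, script names and ids are placeholders rather than the ones used in the linked repo).

import azure.batch.models as batchmodels

image = '<registry-name>.azurecr.io/<image>:<tag>'  # image hosted in ACR
container_settings = batchmodels.TaskContainerSettings(image_name=image)

# The job must opt in to task dependencies for depends_on to work
job = batchmodels.JobAddParameter(
    id='simulation-job',
    pool_info=batchmodels.PoolInformation(pool_id='<pool-id>'),
    uses_task_dependencies=True)

# One task per uploaded script; with the default of one task slot per node,
# each task lands on its own node if the pool is large enough
run_tasks = [batchmodels.TaskAddParameter(
    id='run-{}'.format(i),
    command_line='python run_{}.py'.format(i),
    container_settings=container_settings)
    for i in range(5)]

# Aggregation task only starts once every run task has completed successfully
post_process = batchmodels.TaskAddParameter(
    id='post-process',
    command_line='python aggregate.py',
    container_settings=container_settings,
    depends_on=batchmodels.TaskDependencies(
        task_ids=[task.id for task in run_tasks]))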

Azure pipelines for building and pushing to container registry
