Azure batch
- Services:
  - azure batch account
  - azure storage account
  - azure container registry: hosting docker images
  - azure service principal: allows tasks to pull from the azure container registry
  - data factory (maybe): could be useful for parameterised running, but we probably just need to upload a script with its configuration
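A minimal sketch of connecting to the batch account from Python (the account name, key and URL are placeholders, and shared-key auth is just one option); `batch_client` is reused in the sketches further down:

    from azure.batch import BatchServiceClient
    from azure.batch.batch_auth import SharedKeyCredentials

    # Placeholder credentials - in practice these come from configuration/secrets.
    credentials = SharedKeyCredentials('mybatchaccount', '<batch-account-key>')
    batch_client = BatchServiceClient(
        credentials,
        batch_url='https://mybatchaccount.uksouth.batch.azure.com')  # older SDK versions call this base_url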
Structure of running jobs:
- Pools
  - Define the VM configuration for a job
  - Best practice
    - Pools should have more than one compute node, for redundancy on failure
    - Have jobs use pools dynamically: when moving jobs, move them to a new pool and delete the old pool once complete
    - Resize pools to zero every few months (see the sketch below)
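A sketch of the resize-to-zero housekeeping, assuming the `batch_client` from the sketch above and an illustrative pool id:

    import azure.batch as batch
    import azure.batch.models  # so that batch.models.<...> resolves

    # Shrink an idle pool to zero nodes so it stops accruing compute cost;
    # the pool definition is kept and can be resized back up later.
    batch_client.pool.resize(
        pool_id='tlo-pool',  # illustrative pool id
        pool_resize_parameter=batch.models.PoolResizeParameter(
            target_dedicated_nodes=0,
            target_low_priority_nodes=0))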
- Applications
- Jobs
  - A set of tasks to be run
  - Best practice
    - 1000 tasks in one job is more efficient than 10 jobs with 100 tasks each
    - A job has to be explicitly terminated to be completed; the onAllTasksComplete property or maxWallClockTime does this (see the sketch below)
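A sketch of those job completion settings (onAllTasksComplete and maxWallClockTime), again assuming `batch_client` and illustrative pool/job ids:

    import datetime

    import azure.batch as batch
    import azure.batch.models

    # Terminate (complete) the job automatically once all tasks finish,
    # and cap the total run time via the job constraints.
    job = batch.models.JobAddParameter(
        id='tlo-job',  # illustrative job id
        pool_info=batch.models.PoolInformation(pool_id='tlo-pool'),
        on_all_tasks_complete=batch.models.OnAllTasksComplete.terminate_job,
        constraints=batch.models.JobConstraints(
            max_wall_clock_time=datetime.timedelta(hours=12)))
    batch_client.job.add(job)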
- Tasks
  - Individual scripts/commands
  - Best practice
    - Task nodes are ephemeral, so any data will be lost unless uploaded to storage via OutputFiles
    - Setting a retention time is a good idea, for clarity and for cleaning up data
    - Bulk submit collections of up to 100 tasks at a time
    - Should build in some retry logic to withstand failures (see the sketch below)
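A sketch putting those task practices together (OutputFiles upload, retention time, retries, and bulk submission in chunks of 100); the job id, command line and container SAS URL are placeholders:

    import datetime

    import azure.batch as batch
    import azure.batch.models

    OUTPUT_CONTAINER_SAS = 'https://mystorageaccount.blob.core.windows.net/outputs?<sas-token>'  # placeholder

    tasks = []
    for i in range(250):
        tasks.append(batch.models.TaskAddParameter(
            id=f'run-{i}',
            command_line=f'/bin/bash -c "python run_model.py --draw {i}"',  # illustrative command
            # Upload anything written to ./outputs before the node is reclaimed.
            output_files=[batch.models.OutputFile(
                file_pattern='outputs/*',
                destination=batch.models.OutputFileDestination(
                    container=batch.models.OutputFileBlobContainerDestination(
                        container_url=OUTPUT_CONTAINER_SAS)),
                upload_options=batch.models.OutputFileUploadOptions(
                    upload_condition=batch.models.OutputFileUploadCondition.task_completion))],
            # Retry a couple of times and clean up the task directory after a day.
            constraints=batch.models.TaskConstraints(
                max_task_retry_count=2,
                retention_time=datetime.timedelta(days=1))))

    # add_collection accepts at most 100 tasks per call, so submit in chunks.
    for start in range(0, len(tasks), 100):
        batch_client.task.add_collection('tlo-job', tasks[start:start + 100])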
- Images
  - Custom images with an OS
    - the storage blob containing the VM?
    - conda from the linux data science VM
      - the windows version has python 3.7
      - the linux version has python 3.5, but a newer python (with f-strings) could be installed
All of these are defined at the pool level.
- Define a start task (see the sketch below)
  - Each compute node runs this command as it joins the pool
  - Seems slow and wasteful to run this for each node
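A minimal start task sketch (the install command is illustrative); it would be passed as start_task when building the PoolAddParameter:

    import azure.batch as batch
    import azure.batch.models

    # Runs once on every compute node as it joins the pool - hence the concern
    # above about repeating a slow install per node.
    start_task = batch.models.StartTask(
        command_line='/bin/bash -c "pip install -r requirements.txt"',  # illustrative
        wait_for_success=True,  # do not schedule tasks on a node until this succeeds
        user_identity=batch.models.UserIdentity(
            auto_user=batch.models.AutoUserSpecification(
                scope=batch.models.AutoUserScope.pool,
                elevation_level=batch.models.ElevationLevel.admin)))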
- Create an application package
  - zip file with all dependencies
  - can version these and define which version you want to run
  - issue with the default version of Python on azure batch linux
  - seems like a pain to do and redo when updating requirements or applications
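For completeness, a sketch of referencing a versioned application package on a pool (the package id and version are illustrative):

    import azure.batch as batch
    import azure.batch.models

    # A package uploaded to the batch account, pinned to a version and
    # deployed to every node in the pool.
    app_ref = batch.models.ApplicationPackageReference(
        application_id='tlo_model',  # illustrative package id
        version='0.1')
    # ...then pass application_package_references=[app_ref] in the PoolAddParameter.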
- Use a custom image
  - limit of 2500 dedicated compute nodes or 1000 low-priority nodes in a pool
  - can create a VHD and then import it for batch service mode
  - linux image builder, or Packer directly, can be used to build a linux image for user subscription mode
  - seems like a reasonable option if the framework is stable
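If we went this route, the pool's image reference would point at the image resource rather than a marketplace publisher/offer/sku; a sketch with a placeholder resource id:

    import azure.batch as batch
    import azure.batch.models

    # Reference a custom managed image by its ARM resource id (placeholder below).
    custom_image_ref = batch.models.ImageReference(
        virtual_machine_image_id=(
            '/subscriptions/<subscription-id>/resourceGroups/<resource-group>'
            '/providers/Microsoft.Compute/images/<image-name>'))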
- Use containers
  - can prefetch container images to save on download time
  - they suggest storing and tagging the image in azure container registry
    - a higher cost tier allows for a private azure registry
    - can also pull docker images from other registries
  - most flexible option without too much time spent on node setup
  - can use docker images or any OCI images
    - is there a benefit to singularity here?
  - VM without RDMA
    - Publisher: microsoft-azure-batch
    - Offer: centos-container
    - Offer: ubuntu-server-container
  - need to configure the batch pool to run container workloads via the ContainerConfiguration settings in the pool's VirtualMachineConfiguration
  - prefetch containers: use an azure container registry in the same region as the pool
    import azure.batch as batch
    import azure.batch.models  # so that batch.models.<...> resolves

    # Marketplace image that ships with a container runtime pre-installed.
    image_ref_to_use = batch.models.ImageReference(
        publisher='microsoft-azure-batch',
        offer='ubuntu-server-container',
        sku='16-04-lts',
        version='latest')

    # Container configuration listing the image(s) to prefetch onto each node
    # ('custom_image' stands for our image name/tag in the registry).
    container_conf = batch.models.ContainerConfiguration(
        container_image_names=['custom_image'])

    new_pool = batch.models.PoolAddParameter(
        id=pool_id,  # pool_id defined elsewhere
        virtual_machine_configuration=batch.models.VirtualMachineConfiguration(
            image_reference=image_ref_to_use,
            container_configuration=container_conf,
            node_agent_sku_id='batch.node.ubuntu 16.04'),
        vm_size='STANDARD_D1_V2',
        target_dedicated_nodes=1)
...
- maybe try Batch Shipyard, which exists for deploying HPC workloads
  - nice monitoring, and a task factory based on parameter sweeps, random or custom python generators
  - might be a bit more than we need
- python batch examples
  - ran the first few examples, straightforward
- Running a python script in azure
  - using the batch explorer tool, can find the data science desktop
  - select a VM with a start task for installing requirements
  - use input and output storage blobs for the input and output data
  - create an azure data factory pipeline to run the python script on the inputs and upload the outputs
- Have deployed a simple docker project: https://github.com/stefpiatek/azure_batch-with_docker (sketched at the end of this page)
  - uses azure container registry for hosting docker images
  - uploads multiple scripts and has each node run one script
  - a post-processing task then runs on one node (this would be the aggregation of runs)
- azure pipelines guide
  - need an azure DevOps organisation and to be an admin of the Azure DevOps project
  - create a pipeline using azure-pipelines.yml (DevOps generates one for you)
  - can automatically generate the image tag from the commit id
  - or only build when you've explicitly tagged in git
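A rough sketch of the pattern in that docker project (one container task per script, then a post-processing task that depends on them); the image name, job/pool ids and commands are placeholders, assuming the `batch_client` and container-enabled pool from the sketches above:

    import azure.batch as batch
    import azure.batch.models

    ACR_IMAGE = 'myregistry.azurecr.io/tlo-model:latest'  # placeholder image
    # (for a private registry, the pool's ContainerConfiguration also needs
    # container_registries / service principal credentials so nodes can pull)

    # The job must opt in to task dependencies for the aggregation step.
    batch_client.job.add(batch.models.JobAddParameter(
        id='docker-job',
        pool_info=batch.models.PoolInformation(pool_id='tlo-pool'),
        uses_task_dependencies=True))

    # One container task per uploaded script.
    run_tasks = [
        batch.models.TaskAddParameter(
            id=f'script-{i}',
            command_line=f'python /scripts/script_{i}.py',  # illustrative
            container_settings=batch.models.TaskContainerSettings(image_name=ACR_IMAGE))
        for i in range(3)]
    batch_client.task.add_collection('docker-job', run_tasks)

    # Post-processing/aggregation runs on a single node once all runs finish.
    batch_client.task.add(
        'docker-job',
        batch.models.TaskAddParameter(
            id='aggregate',
            command_line='python /scripts/aggregate.py',  # illustrative
            container_settings=batch.models.TaskContainerSettings(image_name=ACR_IMAGE),
            depends_on=batch.models.TaskDependencies(
                task_ids=[t.id for t in run_tasks])))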