
Define availability set and make all instances part of it #128

Open
cmd-ntrf opened this issue Dec 15, 2020 · 5 comments

@cmd-ntrf
Member

cmd-ntrf commented Dec 15, 2020

I would like to configure Azure to use their InfiniBand nodes. Someone from Azure gave me a custom CentOS 7 image with InfiniBand support baked in. I tested it and things seem to work fine. One piece of advice I was given, though, was:

 In order for the two VMs to be in the same IB fabric you have to first create an availability set with:

$ az vm availability-set create --name <as_name> --resource-group <rg-name> --location eastus --platform-fault-domain-count 1 --platform-update-domain-count 1

Then, when you create each VM, assign them to the availability set with the following option:

$ az vm create […] --availability-set <as_name>

Also do not forget to put them on the same network and subnet.

It doesn't look like availability_set_id is currently being used/configured. I'm not sure what the impact of this would be, but it seems like a reasonable thing to do by default for the execution nodes.

Originally posted by @ocaisa in #127 (comment)

@cmd-ntrf cmd-ntrf self-assigned this Dec 15, 2020
@cmd-ntrf cmd-ntrf added azure enhancement New feature or request labels Dec 15, 2020
@ocaisa
Collaborator

ocaisa commented Dec 16, 2020

I implemented this for some (successful) InfiniBand tests on Azure; it required a simple change in azure/infrastructure.tf:

+# Create an availability set for the execution nodes
+resource "azurerm_availability_set" "avset" {
+  name                = "${var.cluster_name}_availability_set"
+  location            = var.location
+  resource_group_name = local.resource_group_name
+  platform_update_domain_count = 1
+  platform_fault_domain_count  = 1
+}
+
...
@@ -326,15 +341,19 @@ resource "azurerm_linux_virtual_machine" "node" {
   location              = each.value["location"]
   resource_group_name   = local.resource_group_name
   network_interface_ids = [azurerm_network_interface.nodeNIC[each.key].id]
+  availability_set_id   = azurerm_availability_set.avset.id

@cmd-ntrf
Member Author

cmd-ntrf commented Dec 16, 2020

Virtual machine scale sets - In a virtual machine scale set, ensure that you limit the deployment to a single placement group for InfiniBand communication within the scale set.

For example, in a Resource Manager template, set the singlePlacementGroup property to true. Note that the maximum scale set size that can be spun up with the singlePlacementGroup property set to true is capped at 100 VMs by default. If your HPC job scale needs are higher than 100 VMs in a single tenant, you may request an increase by opening an online customer support request at no charge. The limit on the number of VMs in a single scale set can be increased to 300.

Note that when deploying VMs using Availability Sets, the maximum limit is 200 VMs per Availability Set.

Ref: https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-hpc#cluster-configuration-options
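For reference, if the execution nodes were ever deployed as a virtual machine scale set rather than as individual VMs, the singlePlacementGroup property quoted above corresponds to the `single_placement_group` argument of the azurerm provider. A sketch only, not how the nodes are currently deployed; the SKU and `var.node_count` are hypothetical:

```hcl
resource "azurerm_linux_virtual_machine_scale_set" "node" {
  name                = "${var.cluster_name}-vmss"
  location            = var.location
  resource_group_name = local.resource_group_name
  sku                 = "Standard_HB120rs_v2" # illustrative IB-capable size
  instances           = var.node_count        # hypothetical variable

  # Maps to the singlePlacementGroup ARM property: keeps all instances in
  # one placement group so they share an InfiniBand fabric (100 VMs by
  # default, up to 300 via a support request).
  single_placement_group = true

  # admin credentials, os_disk, source_image_reference and
  # network_interface blocks omitted for brevity
}
```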

@cmd-ntrf
Member Author

Because of the maximum limit of VMs per availability set, we will want to define a count for the availability set resource that is a function of the number of compute nodes. This raises the question of what to do with heterogeneous clusters: should all compute instances be part of the same availability set, or should the availability set be defined per instance type?
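One way the count could be derived (a sketch, assuming the execution-node count is available as a hypothetical `var.node_count`; note that nodes split across different availability sets are no longer guaranteed to share an InfiniBand fabric):

```hcl
# Enough availability sets to respect the 200-VM-per-set limit.
locals {
  avset_count = ceil(var.node_count / 200) # var.node_count is hypothetical
}

resource "azurerm_availability_set" "avset" {
  count                        = local.avset_count
  name                         = "${var.cluster_name}_availability_set_${count.index}"
  location                     = var.location
  resource_group_name          = local.resource_group_name
  platform_update_domain_count = 1
  platform_fault_domain_count  = 1
}
```

Each node would then pick its set with something like `availability_set_id = azurerm_availability_set.avset[floor(node_index / 200)].id`, where `node_index` is a per-node ordinal (also hypothetical).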

@ocaisa
Collaborator

ocaisa commented Dec 17, 2020

I would say that by default they should be separate availability sets per instance type, since it's not unlikely that you may run into restrictions on the Azure side, but the user should be able to override this (for example, by explicitly naming the availability sets and providing the same name for different instance types). The use case I imagine is GPU and non-GPU nodes.
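A sketch of what per-type sets with an override could look like (the shape of `var.instances` and the `avset_overrides` variable are hypothetical, not existing Magic Castle variables):

```hcl
# Optional map letting the user force several instance types into the
# same availability set by giving them the same name.
variable "avset_overrides" {
  type    = map(string)
  default = {}
}

# One availability set per distinct instance type.
resource "azurerm_availability_set" "avset" {
  for_each                     = toset([for key, values in var.instances : values.type])
  name                         = lookup(var.avset_overrides, each.key, "${var.cluster_name}_${each.key}_avset")
  location                     = var.location
  resource_group_name          = local.resource_group_name
  platform_update_domain_count = 1
  platform_fault_domain_count  = 1
}
```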

@cmd-ntrf
Member Author

AWS has the same concept under a different name: cluster placement groups.
Ref: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-instances

It would be worth looking at all cloud providers supported by MC and implement it at once for all clouds.
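For AWS, the Terraform equivalent would look something like this (a sketch; the AMI variable and instance type are illustrative):

```hcl
# A cluster placement group packs instances close together for
# low-latency networking (required for EFA).
resource "aws_placement_group" "cluster" {
  name     = "${var.cluster_name}_placement_group"
  strategy = "cluster"
}

resource "aws_instance" "node" {
  ami             = var.image_id   # hypothetical variable
  instance_type   = "c5n.18xlarge" # illustrative EFA-capable type
  placement_group = aws_placement_group.cluster.name
}
```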

cmd-ntrf added a commit that referenced this issue Mar 15, 2022
Make profile::base initialize the sudoer account with ssh keys