Deploy HPCC Systems on Azure under Kubernetes

NOTE: Documentation of this Terraform code, for developers and others who are interested, can be found here.

This is a slightly-opinionated Terraform module for deploying an HPCC Systems cluster on Azure's Kubernetes service (AKS). The goal is to provide a simple method for deploying a cluster from scratch, exposing only the most important options.

The HPCC Systems cluster created by this module uses ephemeral storage by default. This means the storage will be deleted when the cluster is deleted. You can also have persistent storage; see the section titled Persistent Storage, below.

Requirements

  • terraform This is a Terraform module, so you need to have terraform installed on your system. Instructions for downloading and installing terraform can be found at https://www.terraform.io/downloads.html. Do make sure you install a 64-bit version of terraform, as that is needed to accommodate some of the large random numbers used for IDs in the Terraform modules.

  • helm Helm is used to deploy the HPCC Systems processes under Kubernetes. Instructions for downloading and installing Helm are at https://helm.sh/docs/intro/install.

  • kubectl The Kubernetes client (kubectl) is also required so you can inspect and manage the Azure Kubernetes cluster. Instructions for downloading and installing it can be found at https://kubernetes.io/releases/download/. Make sure you have version 1.22.0 or later.

  • Azure CLI To work with Azure, you will need to install the Azure Command Line tools. Instructions can be found at https://docs.microsoft.com/en-us/cli/azure/install-azure-cli. Even if you think you won't be working with Azure, this module does leverage the command line tools to manipulate network security groups within Kubernetes clusters. TL;DR: Make sure you have the command line tools installed.

  • To successfully create everything you will need to have Azure's Contributor role plus access to Microsoft.Authorization/*/Write and Microsoft.Authorization/*/Delete permissions on your subscription. You may have to create a custom role for this. Of course, Azure's Owner role includes everything so if you're the subscription's owner then you're good to go.

  • You need a minimum of 28 vCPUs available on Azure, and aks_serv_node_size must be at least xlarge. The first az command below reports the maximum number of vCPUs you can use, and the second reports the number of vCPUs you have already used in region eastus (replace eastus with the name of the region you are using). Subtract the second result from the first to get the number of vCPUs still available to you.

    • az vm list-usage --location "eastus" -o table|grep "Total Regional vCPUs"|sed "s/  */\t/g"|cut -f5
    • az vm list-usage --location "eastus" -o table|grep "Total Regional vCPUs"|sed "s/  */\t/g"|cut -f4
  • You need to make sure jq and kubelogin are installed on your Linux machine. You can check with the which command, e.g. which jq returns jq's path if it is installed. The following commands install jq and kubelogin, respectively:

    • sudo apt-get install jq
    • sudo az aks install-cli
  • If you run the terraform code on an Azure VM, then that VM must have EncryptionAtHost enabled. You can do this by: 1) stopping your Azure VM; 2) clicking Disk in the Overview of the Azure VM; 3) clicking the Additional Settings tab; 4) selecting the Yes radio button under Encryption at host.
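
The tool and vCPU requirements above can be checked before a deployment with a small preflight sketch like the one below. The function names missing_tools and available_vcpus are hypothetical helpers, not part of this module. The vCPU helper assumes that the table output of az vm list-usage ends each row with the current value followed by the limit, which is the same column order the two az commands above rely on.

```shell
#!/bin/sh
# Hypothetical preflight helpers; not part of this Terraform module.

# Print the name of each listed tool that is NOT found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Read `az vm list-usage --location <region> -o table` output on stdin and
# print limit minus current value for the "Total Regional vCPUs" row.
# Assumption: the last two columns are the current value and the limit,
# in that order.
available_vcpus() {
  awk '/Total Regional vCPUs/ { print $NF - $(NF - 1) }'
}
```

For example, missing_tools terraform helm kubectl az jq kubelogin prints nothing when every tool is installed, and az vm list-usage --location "eastus" -o table | available_vcpus prints the number of vCPUs still available in eastus, which should be 28 or more.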

Installing/Using This Module

  1. If necessary, log in to Azure.
    • From the command line, this is usually accomplished with the az login command.
  2. Clone this repo to your local system and change current directory.
    • git clone https://github.com/hpccsystems-solutions-lab/terraform-azurerm-hpcc-lite.git
    • cd terraform-azurerm-hpcc-lite
  3. Issue terraform init to initialize the Terraform modules.
  4. Issue terraform apply. This command runs terraform init, terraform plan, and terraform apply for each of the required subsystems, i.e. vnet, aks, storage, and hpcc (the storage subsystem is deployed only if you set external_storage_desired=true). The subsystems are deployed in this order: vnet, aks, storage, and hpcc. For each subsystem, terraform creates a plan file, which is stored in the directory ~/tflogs (if this directory doesn't exist, it is created automatically).
  5. Decide how you want to supply option values to the module during invocation. There are three possibilities:
    1. Invoke the terraform apply command and enter values for each option as terraform prompts for it, then enter yes at the final prompt to begin building the cluster.
    2. Recommended: Create a lite.auto.tfvars file containing the values for each option, invoke terraform apply, then enter yes at the final prompt to begin building the cluster. The easiest way to create lite.auto.tfvars is to copy the example file, lite.auto.tfvars.example, and then edit the copy:
      • cp -v lite.auto.tfvars.example lite.auto.tfvars
    3. Use -var arguments on the command line when executing the terraform tool to set each of the values found in the .tfvars file. This method is useful if you are driving the creation of the cluster from a script.
  6. After the Kubernetes cluster is deployed, your local kubectl tool can be used to interact with it. At some point during the deployment kubectl will acquire the login credentials for the cluster and it will be the current context (so any kubectl commands you enter will be directed to that cluster by default).
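
Option 3 in step 5 (driving the build from a script with -var arguments) could be sketched as follows. The helper name tf_apply_command is hypothetical, and the option values in the usage note are illustrative only. The helper prints the command instead of running it, so the line can be reviewed first. It assumes that values contain no single quotes.

```shell
#!/bin/sh
# Hypothetical helper; not part of this Terraform module.
# Print a `terraform apply` command line with one -var argument per
# KEY=VALUE pair given. Nothing is executed.
# Assumption: values contain no single-quote characters.
tf_apply_command() {
  printf 'terraform apply'
  for pair in "$@"; do
    printf " -var '%s'" "$pair"
  done
  printf '\n'
}
```

Calling tf_apply_command 'aks_azure_region=eastus' 'enable_thor=true' prints terraform apply -var 'aks_azure_region=eastus' -var 'enable_thor=true'. Any option that is not supplied this way, and has no default, is still prompted for by terraform.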

At the end of a successful deployment these items are output for aks, hpcc, and vnet:

  • aks
    • Advisor recommendations or 'none', advisor_recommendations.
    • Location of the aks credentials, aks_login.
    • Name of the Azure Kubernetes Service, cluster_name.
    • Resource group where the cluster is deployed, cluster_resource_group_name.
  • hpcc
    • The URL used to access ECL Watch, eclwatch_url.
    • The Azure resource group of the deployment, deployment_resource_group.
    • Whether there is external storage or not, external_storage_config_exists.
  • vnet
    • Advisor recommendations or 'none', advisor_recommendations.
    • ID of private subnet, private_subnet_id.
    • ID of public subnet, public_subnet_id.
    • ID of route table, route_table_id.
    • Route table name, route_table_name.
    • Virtual network resource group name, resource_group_name.
    • Virtual network name, vnet_name.

Available Options

Options have data types. The ones used in this module are:

  • string
    • Typical string enclosed by quotes
    • Example
      • "value"
  • number
    • Integer number; do not quote
    • Example
      • 1234
  • boolean
    • true or false (not quoted)
  • map of string
    • List of key/value pairs, delimited by commas
    • Both key and value should be a quoted string
    • Entire map is enclosed by braces
    • Example with two key/value pairs
      • {"key1" = "value1", "key2" = "value2"}
    • Empty value is {}
  • list of string
    • List of values, delimited by commas
    • A value is a quoted string
    • Entire list is enclosed in brackets
    • Example with two values
      • ["value1", "value2"]
    • Empty value is []

The following options should be set in your lite.auto.tfvars file (or entered interactively, if you choose not to create a file). Only a few of them have default values; the rest are required. The 'Updatable' column indicates whether a change to the option can be successfully applied to an already-running HPCC k8s cluster.

| Option | Type | Description | Updatable |
|---|---|---|---|
| a_record_name | string | Name of the A record, in the DNS zone given below, where the ECL Watch IP address is placed. This A record will be created, so it must not already exist in that DNS zone. Example entry: "my-product". This should be something project-specific rather than something generic. | Y |
| admin_username | string | Username of the administrator of this HPCC Systems cluster. Example entry: "jdoe" | N |
| aks_admin_email | string | Email address of the administrator of this HPCC Systems cluster. Example entry: "[email protected]" | Y |
| aks_admin_ip_cidr_map | map of string | Map of name => CIDR IP addresses that can administrate this AKS. Format is '{"name"="cidr" [, "name"="cidr"]*}'. The 'name' portion must be unique. To add no CIDR addresses, use '{}'. The corporate network and your current IP address will be added automatically, and these addresses will have access to the HPCC cluster as a user. | Y |
| aks_admin_name | string | Name of the administrator of this HPCC Systems cluster. Example entry: "Jane Doe" | Y |
| aks_azure_region | string | The Azure region abbreviation in which to create these resources. Example entry: "eastus" | N |
| aks_dns_zone_name | string | Name of an existing DNS zone. Example entry: "hpcczone.us-hpccsystems-dev.azure.lnrsg.io" | N |
| aks_dns_zone_resource_group_name | string | Name of the resource group of the above DNS zone. Example entry: "app-dns-prod-eastus2" | N |
| aks_enable_roxie | boolean | Enable ROXIE? This will also expose port 8002 on the cluster. Example entry: false | Y |
| aks_logging_monitoring_enabled | boolean | Enable logging and monitoring of the Kubernetes and HPCC cluster? true enables logging and monitoring; false disables them. | N |
| aks_4nodepools | boolean | Determines the number of node pools used: 4 if true, otherwise 2. Default is false. | N |
| aks_nodepools_max_capacity | string | The maximum number of nodes in each HPCC node pool. | N |
| aks_roxie_node_size | string | The VM size for each roxie node in the HPCC Systems cluster. Example format: aks_roxie_node_size="xlarge". | N |
| aks_serv_node_size | string | The VM size for each serv node in the HPCC Systems cluster. Example format: aks_serv_node_size="2xlarge". | N |
| aks_spray_node_size | string | The VM size for each spray node in the HPCC Systems cluster. Example format: aks_spray_node_size="2xlarge". | N |
| aks_thor_node_size | string | The VM size for each thor node in the HPCC Systems cluster. Example format: aks_thor_node_size="2xlarge". | N |
| aks_capacity | map of number | The minimum and maximum number of nodes in each node pool of the HPCC Systems cluster. Example format is '{ roxie_min = 1, roxie_max = 3, serv_min = 1, serv_max = 3, spray_min = 1, spray_max = 3, thor_min = 1, thor_max = 3}'. | N |
| authn_htpasswd_filename | string | If you would like to use htpasswd to authenticate users to the cluster, enter the filename of the htpasswd file. This file should be uploaded to the Azure 'dllsshare' file share so that the HPCC processes can find it, which means persistent storage must be enabled. An empty string indicates that htpasswd is not to be used for authentication. Example entry: "htpasswd.txt" | Y |
| enable_code_security | boolean | Enable code security? If true, only signed ECL code will be allowed to create embedded language functions, use PIPE(), etc. Example entry: false | Y |
| enable_thor | boolean | Set to true if you want a Thor cluster; otherwise set to false. | Y |
| external_storage_desired | boolean | Set to true if you want external storage instead of ephemeral storage; otherwise set to false. | Y |
| extra_tags | map of string | Map of name => value tags that will be associated with the cluster. Format is '{"name"="value" [, "name"="value"]*}'. The 'name' portion must be unique. To add no tags, use '{}'. | Y |
| hpcc_user_ip_cidr_list | list of string | List of explicit CIDR addresses that can access this HPCC Systems cluster. To allow public access, set value to ["0.0.0.0/0"] or []. | Y |
| hpcc_version | string | The version of HPCC Systems to install. Only versions in nn.nn.nn format are supported. | Y |
| my_azure_id | string | Your Azure account object ID. Find this in the Azure portal by going to 'users', then searching for your name and clicking on it. The account object ID is called 'Object ID'. There is a link next to it that lets you copy it. | N |
| storage_data_gb | number | The amount of storage reserved for data, in gigabytes. Must be 1 or more. If a storage account is defined (see below) then this value is ignored. | Y |
| storage_lz_gb | number | The amount of storage reserved for the landing zone, in gigabytes. Must be 1 or more. If a storage account is defined (see below) then this value is ignored. | Y |
| thor_max_jobs | number | The maximum number of simultaneous Thor jobs allowed. Must be 1 or more. | Y |
| thor_num_workers | number | The number of Thor workers to allocate. Must be 1 or more. | Y |
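
As a rough illustration of how the data types above look in a tfvars file, the sketch below writes a small fragment covering one option of each type. The function name write_sample_tfvars is hypothetical. The values are placeholders modeled on the example entries in the table (203.0.113.0/24 is a documentation-only address range). The fragment is deliberately incomplete, so it is not a working lite.auto.tfvars on its own.

```shell
#!/bin/sh
# Hypothetical helper; not part of this Terraform module.
# Write an illustrative, INCOMPLETE tfvars fragment to the path given as $1.
# All values are placeholders; replace them with your own.
write_sample_tfvars() {
  cat > "$1" <<'EOF'
admin_username         = "jdoe"                 # string
aks_azure_region       = "eastus"               # string
thor_num_workers       = 2                      # number (unquoted)
enable_thor            = true                   # boolean (unquoted)
extra_tags             = {"project" = "demo"}   # map of string
aks_admin_ip_cidr_map  = {}                     # empty map
hpcc_user_ip_cidr_list = ["203.0.113.0/24"]     # list of string
EOF
}
```

Writing to a scratch path first, e.g. write_sample_tfvars /tmp/sample.tfvars, avoids overwriting an existing lite.auto.tfvars. The lite.auto.tfvars.example file in the repo remains the better starting point for a full configuration.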

Persistent Storage

To get persistent storage, i.e. storage that is not deleted when the HPCC cluster is deleted, set the variable, external_storage_desired, to true.

Useful Things

  • Useful az cli commands:
    • az account list --output table
      • Shows your current subscriptions and indicates which is the default.
    • az account set --subscription "My_Subscription"
      • Sets the default subscription
  • Useful kubectl commands once the cluster is deployed:
    • kubectl get pods
      • Shows Kubernetes pods for the current cluster.
    • kubectl get services
      • Show the current services running on the pods on the current cluster.
    • kubectl config get-contexts
      • Show the saved kubectl contexts. A context contains login and reference information for a remote Kubernetes cluster. A kubectl command typically relays information about the current context.
    • kubectl config use-context <ContextName>
      • Make <ContextName> context the current context for future kubectl commands.
    • kubectl config unset contexts.<ContextName>
      • Delete context named <ContextName>.
      • Note that when you delete the current context, kubectl does not select another context as the current context. Instead, no context will be current. You must use kubectl config use-context <ContextName> to make another context current.
  • Note that terraform destroy does not delete the kubectl context. You need to use kubectl config unset contexts.<ContextName> to get rid of the context from your local system.
  • If a deployment fails and you want to start over, you have two options:
    • Immediately issue a terraform destroy command and let terraform clean up.
    • Clean up the resources by hand:
      • Delete the Azure resource group manually, such as through the Azure Portal.
        • Note that there are two resource groups, if the deployment got far enough. Examples:
          • app-thhpccplatform-sandbox-eastus-68255
          • mc_tf-zrms-default-aks-1
        • The first one contains the Kubernetes service that created the second one (services that support Kubernetes). So, if you delete only the first resource group, the second resource group will be deleted automatically.
      • Delete all terraform state files using rm *.tfstate*
    • Then, of course, fix whatever caused the deployment to fail.
  • If you want to completely reset terraform, issue rm -rf .terraform* *.tfstate* and then terraform init.
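
The reset described in the last bullet can be wrapped in a small guard so that it only ever runs against a directory you name explicitly. This is a sketch, and clean_tf_state is a hypothetical helper name. It removes the same files as rm -rf .terraform* *.tfstate* and leaves running terraform init to you.

```shell
#!/bin/sh
# Hypothetical helper; not part of this Terraform module.
# Remove Terraform working data (.terraform*) and state files (*.tfstate*)
# from the directory given as $1. Refuses to run unless $1 names an existing
# directory, because the deletion cannot be undone.
clean_tf_state() {
  dir=$1
  if [ -z "$dir" ] || [ ! -d "$dir" ]; then
    echo "usage: clean_tf_state <module-directory>" >&2
    return 1
  fi
  rm -rf "$dir"/.terraform* "$dir"/*.tfstate*
}
```

A typical use would be clean_tf_state . from the cloned repo directory, followed by terraform init. Deleting state files makes terraform forget any resources it created, so check that those resources are already gone from Azure (or delete the resource group by hand) before doing this.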