[WIP] gcve: Add tags for failure domain testing, reprovision for vsphere 8 and add inline comments #8148

Open · wants to merge 13 commits into base: main
54 changes: 46 additions & 8 deletions infra/gcp/terraform/k8s-infra-gcp-gcve/README.md
@@ -2,7 +2,7 @@

The code in `k8s-infra-gcp-gcve` sets up the infra required to allow prow jobs to create VMs on vSphere, e.g. to allow testing of the [Cluster API provider vSphere (CAPV)](https://github.com/kubernetes-sigs/cluster-api-provider-vsphere).

![Overview](./docs/images/overview.jpg)
![Overview](./docs/images/GVCE.drawio.png)
Member Author commented:

> TODO: remove the -pc in the image too


Prow container settings are managed outside of this folder, but understanding those high-level components
helps to understand how `k8s-infra-gcp-gcve` is set up and consumed.
@@ -14,26 +14,64 @@ More specifically, to allow prow jobs to create VMs on vSphere, a few resources are needed:
- a vSphere folder and a vSphere resource pool in which to run VMs during a test.
- a reserved IP range to be used for the test, e.g. for the kube-vip load balancer in a CAPV cluster (VMs instead will get IPs via DHCP).

Also, the network of the prow container is going to be paired to the VMware engine network, thus
Also, the network of the prow container is going to be peered to the VMware engine network, thus
allowing access to both the GCVE management network and the NSX-T network where all the VMs are running.

The `k8s-infra-gcp-gcve` project sets up the infrastructure that actually runs the VMs created from the prow container. There are ther main components of this infrastracture:
The `k8s-infra-gcp-gcve` project sets up the infrastructure that actually runs the VMs created from the prow container.
These are the main components of this infrastructure:

The terraform manifest in this folder, which is applied by test-infra automation (Atlantis), uses the GCP terraform provider for creating.
The terraform manifest in this folder uses the GCP terraform provider for creating:
- A VMware Engine instance
- The network infrastructure required for vSphere and for allowing communication between vSphere and Prow container.
- The network used is `192.168.0.32/21`
- The network used is `192.168.32.0/21`
- Usable Host IP Range: `192.168.32.1 - 192.168.39.254`
- DHCP Range: `192.168.32.11 - 192.168.33.255`
- IPPool for 40 Projects having 16 IPs each: `192.168.35.0 - 192.168.37.127`
- The network infrastructure used for maintenance.
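
A quick sanity check of the IP pool sizing above (back-of-the-envelope shell arithmetic, not part of the managed terraform):

```sh
# 40 Boskos projects x 16 IPs each
echo $(( 40 * 16 ))            # 640 addresses
# 640 addresses starting at 192.168.35.0 cover two full /24s (192.168.35.x and
# 192.168.36.x) plus the first 128 addresses of 192.168.37.x, i.e. the pool
# ends at 192.168.37.127 as stated above.
echo $(( 640 - 2 * 256 - 1 ))  # 127 -> last host offset in the third /24
```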

See [terraform](./docs/terraform.md) for prerequisites.

When ready:

```sh
terraform init
terraform plan # Check diff
terraform apply
```

See inline comments for more details.

The terraform manifest in the `/maintenance-jumphost` uses the GCP terraform provider to setup a jumphost VM to be used to set up vSphere or for maintenance pourposes. See
The terraform manifest in the `/maintenance-jumphost` uses the GCP terraform provider to setup a jumphost VM to be used to set up vSphere or for maintenance purposes. See
- [maintenance-jumphost](./maintenance-jumphost/README.md)

The terraform manifest in the `/vsphere` folder uses the vSphere and the NSX terraform providers to setup e.g. content libraries, templetes, folders,
The terraform manifest in the `/vsphere` folder uses the vSphere and the NSX terraform providers to setup e.g. content libraries, templates, folders,
resource pools and other vSphere components required when running tests. See:
- [vsphere](./vsphere/README.md)
- [vsphere](./vsphere/README.md)

# Working around network issues

There are two known issues:

## Existing limitation: maximum of 64 connections/requests to an internet endpoint

Workaround:

* Route traffic via 192.168.32.8 (see [NSX Gateway VM](./nsx-gateway/)), which tunnels it via the maintenance jumphost (see the sketch below)
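
As an illustration, on a machine inside the workload network this could look roughly as follows (a sketch only; 192.0.2.10 is a placeholder endpoint and the exact setup depends on the environment):

```sh
# Route traffic for one external endpoint via the NSX gateway VM (192.168.32.8)
# instead of the default route, then re-check the path.
sudo ip route add 192.0.2.10/32 via 192.168.32.8
mtr -T -P 443 192.0.2.10
```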

## Packets are dropped without any hop

Example: `mtr -T -P 443 1.2.3.4` shows no hop at all (not even the gateway)

It could be that NSX-T started dropping traffic to that endpoint.

The workaround is documented in [Disabling NSX-T Firewalls](./vsphere/README.md#disabling-nsx-t-firewalls)

## Setting the right MTU

Setting the right MTU is important to avoid connectivity issues due to dropped packets.
For the workload network `k8s-ci` in NSX-T, the correct MTU is configured in [nsx-t.tf](./vsphere/nsx-t.tf) and was determined by using `ping` e.g. via [this bash script](https://gist.githubusercontent.com/penguin2716/e3c2186d0da6b96845fd54a275a2cd71/raw/e4b45c33c99c6c03b200186bf2cb6b1af3d806f5/find_max_mtu.sh).
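
For reference, a minimal sketch of such a probe (same idea as the linked script; assumes Linux `ping` and reuses the placeholder target from the example above):

```sh
# Find the largest ICMP payload that still passes with the don't-fragment bit
# set, then add 28 bytes (20 IP + 8 ICMP header) to estimate the path MTU.
target=1.2.3.4
size=1472
until ping -c 1 -M do -s "${size}" "${target}" >/dev/null 2>&1; do
  size=$((size - 10))
done
echo "max payload ${size} bytes -> MTU ~ $((size + 28))"
```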

# Uploading OVAs

TODO
2 changes: 1 addition & 1 deletion infra/gcp/terraform/k8s-infra-gcp-gcve/docs/boskos.md
@@ -1,6 +1,6 @@
# Boskos

Boskos support resources of type `gcve-vsphere-project` to allow each test run to use a subset of vSphere resources.
Boskos resources of type `gcve-vsphere-project` allow each test run to use a subset of vSphere resources.

Boskos configuration is split into three parts:

216 changes: 0 additions & 216 deletions infra/gcp/terraform/k8s-infra-gcp-gcve/docs/images/GVCE.drawio

This file was deleted.

25 changes: 17 additions & 8 deletions infra/gcp/terraform/k8s-infra-gcp-gcve/docs/terraform.md
@@ -2,13 +2,13 @@

See [README.md](https://github.com/kubernetes/k8s.io/tree/main/infra/gcp/terraform) for a general intro about using terraform in k8s.io.

In order to apply terraform manifests you must be enabled to use the "broadcom-451918" project, please reach to [owners](../OWNERS) in case of need.
In order to apply terraform manifests you must be enabled to use the "broadcom-451918" project, please reach out to [owners](../OWNERS) in case of need.

Quick reference:

Go to the folder of interest
- [maintenance-jumphost](../maintenance-jumphost/)
- [vsphere](../vsphere/)
- [maintenance-jumphost](../maintenance-jumphost/README.md)
- [vsphere](../vsphere/README.md)

Note: the terraform script in the top folder is usually managed by test-infra automation (Atlantis); we don't have to run it manually.

@@ -19,26 +19,35 @@ You can use terraform from your local workstation or via a docker container provided for this purpose:
docker run -it --rm -v $(pwd):/workspace --entrypoint=/bin/bash gcr.io/k8s-staging-infra-tools/k8s-infra:v20241217-f8b07a049
```

From your local workstatin / from inside the terraform container:
From your local workstation / from inside the terraform container:

Log in to GCP to get an authentication token to use with terraform.

@sbueringer (Member) commented on Jun 10, 2025:

> Did l.27 just work for you?
>
> I needed some combination of the following commands to get it to work:
>
> `gcloud config set project xxx`
> `gcloud auth login`
>
> (not sure what the minimum set of commands is)

```bash
gcloud auth application-default login
gcloud auth login
gcloud config set project broadcom-451918
```

Ensure all the env variables expected by the terraform manifest you are planning to run are set:
- [vsphere](../vsphere/)
- [vsphere](../vsphere/README.md)

Ensure the terraform version expected by the terraform manifest you are planning to run is installed (Note: this requires `tfswitch`, which is pre-installed in the docker image; in case of version mismatches, terraform will let you know):

```bash
cd infra/gcp/terraform/k8s-infra-gcp-gcve/
tfswitch
```

Additionally, if applying the vsphere terraform manifest, use the following script to generate `/etc/hosts` entries for vSphere and NSX.

```sh
gcloud vmware private-clouds describe k8s-gcp-gcve-pc --location us-central1-a --format='json' | jq -r '.vcenter.internalIp + " " + .vcenter.fqdn +"\n" + .nsx.internalIp + " " + .nsx.fqdn'
gcloud vmware private-clouds describe k8s-gcp-gcve --location us-central1-a --format='json' | jq -r '.vcenter.internalIp + " " + .vcenter.fqdn +"\n" + .nsx.internalIp + " " + .nsx.fqdn'
```

Add those entry to `/etc/hosts`.
Add those entries to `/etc/hosts`.
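
For example, the generated entries can be appended directly (a convenience one-liner assuming sudo access on the machine where terraform runs):

```sh
gcloud vmware private-clouds describe k8s-gcp-gcve --location us-central1-a --format='json' \
  | jq -r '.vcenter.internalIp + " " + .vcenter.fqdn + "\n" + .nsx.internalIp + " " + .nsx.fqdn' \
  | sudo tee -a /etc/hosts
```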

At this point you are ready to start using `terraform init`, `terraform plan`, `terraform apply` etc.
At this point you are ready to start using `terraform init`, `terraform plan`, `terraform apply` etc.

Notes:
- Terraform state is stored in a gcs bucket with name `k8s-infra-tf-gcp-gcve`, with a folder for each one of the terraform scripts managed in the `k8s-infra-gcp-gcve` folder (gcve, gcve-vcenter, maintenance-jumphost).
@@ -1,14 +1,16 @@
# Wiregard
# Wireguard

Wiregard is used to get a secure and convenient access through the maintenace jump host VM.
Wireguard is used to get secure and convenient access through the maintenance jump host VM.

In order to use wiregard you must be enabled to use the "broadcom-451918" project, please reach to [owners](../OWNERS) in case of need.
In order to use wireguard you must be enabled to use the "broadcom-451918" project, please reach out to [owners](../OWNERS) in case of need.

It is also required to first setup things both on on your local machine and on the GCP side
following instruction below.
It is also required to first set up things both on your local machine and on the GCP side
following the instructions below.

Install wireguard following one of the methods described in https://www.wireguard.com/install/.

Note: On macOS, to use the command-line tool, installation via `brew` is necessary.
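
For reference, this typically amounts to (assuming a working Homebrew installation):

```sh
brew install wireguard-tools
```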

Generate a wireguard keypair using `wg`.

```sh
@@ -58,7 +60,7 @@

Then create a new version of the `maintenance-vm-wireguard-config` secret by appending this entry at the end of the current value [here](https://console.cloud.google.com/security/secret-manager/secret/maintenance-vm-wireguard-config/versions?project=broadcom-451918).

Additionally, if the jumphost VM is up, you might want to add it to the wiregard configuration in the current VM (it is also possible to recreate the jumphost VM, but this is going to change the wireguard enpoint also for other users).
Additionally, if the jumphost VM is up, you might want to add it to the wireguard configuration in the current VM (it is also possible to recreate the jumphost VM, but this is going to change the wireguard endpoint also for other users).

```sh
gcloud compute ssh maintenance-jumphost --zone us-central1-f
@@ -79,22 +81,23 @@ MTU = 1360

[Peer]
PublicKey = $(gcloud secrets versions access --secret maintenance-vm-wireguard-pubkey latest)
AllowedIPs = 192.168.30.0/24, 192.168.32.0/21
AllowedIPs = 192.168.31.0/24, 192.168.32.0/21
Endpoint = $(gcloud compute instances list --format='get(networkInterfaces[0].accessConfigs[0].natIP)' --filter='name=maintenance-jumphost'):51820
PersistentKeepalive = 25
EOF
```

You can then either
- import this file to the wireguard UI (after this, you can remove the file from disk).

- import this file to the wireguard UI (after this, you can remove the file from disk) and activate or deactivate the connection.
- use the file with the wireguard CLI e.g. `wg-quick up ~/wg0.conf`, and when finished `wg-quick down ~/wg0.conf`

## Additional settings

Generate `/etc/hosts` entries for vSphere and NSX; this is required to run the vSphere terraform scripts and it will also make the vSphere and NSX UIs work smoothly.

```sh
gcloud vmware private-clouds describe k8s-gcp-gcve-pc --location us-central1-a --format='json' | jq -r '.vcenter.internalIp + " " + .vcenter.fqdn +"\n" + .nsx.internalIp + " " + .nsx.fqdn'
gcloud vmware private-clouds describe k8s-gcp-gcve --location us-central1-a --format='json' | jq -r '.vcenter.internalIp + " " + .vcenter.fqdn +"\n" + .nsx.internalIp + " " + .nsx.fqdn'
```

Add those entry to `/etc/hosts`.
Add those entries to `/etc/hosts`.
1 change: 1 addition & 0 deletions infra/gcp/terraform/k8s-infra-gcp-gcve/iam.tf
@@ -14,6 +14,7 @@ See the License for the specific language governing permissions and
limitations under the License.
*/

# Ensures admin access for groups and secret access for prow.
module "iam" {
source = "terraform-google-modules/iam/google//modules/projects_iam"
version = "~> 8.1"
8 changes: 3 additions & 5 deletions infra/gcp/terraform/k8s-infra-gcp-gcve/main.tf
@@ -14,19 +14,17 @@ See the License for the specific language governing permissions and
limitations under the License.
*/

locals {
project_id = "broadcom-451918"
}

data "google_project" "project" {
project_id = local.project_id
project_id = var.project_id
}

# Enables all required APIs for this project.
resource "google_project_service" "project" {
project = data.google_project.project.id

for_each = toset([
"compute.googleapis.com",
"essentialcontacts.googleapis.com",
"secretmanager.googleapis.com",
"vmwareengine.googleapis.com"
])
@@ -3,8 +3,8 @@
The maintenance jump host is a VM hosting a wireguard instance for secure and convenient access
to vSphere and NSX from local machines.

Before using wiregard it is required to first setup things both on on your local machine and on the GCP side.
see [wireguard](../docs/wiregard.md)
Before using wireguard it is required to first set up things both on your local machine and on the GCP side;
see [wireguard](../docs/wireguard.md).

The maintenance jump host VM is not required to be always up & running and it can also be recreated if necessary; however, by doing so the IP address of the VM will change and all the
local machine configs have to be updated accordingly.
@@ -14,12 +14,21 @@ See the License for the specific language governing permissions and
limitations under the License.
*/

locals {
project_id = "broadcom-451918"
variable "project_id" {
description = "The project ID to use for the gcve cluster."
default = "broadcom-451918"
type = string
}

# Read the secret from Secret Manager which contains the wireguard server configuration.
data "google_secret_manager_secret_version_access" "wireguard-config" {
project = var.project_id
secret = "maintenance-vm-wireguard-config"
}

# Create the maintenance jumphost which runs SSH and a wireguard server.
resource "google_compute_instance" "jumphost" {
project = local.project_id
project = var.project_id
name = "maintenance-jumphost"
machine_type = "f1-micro"
zone = "us-central1-f"
@@ -33,7 +42,7 @@
network_interface {
network = "maintenance-vpc-network"
subnetwork = "maintenance-subnet"
subnetwork_project = local.project_id
subnetwork_project = var.project_id
access_config {
network_tier = "STANDARD"
}
@@ -43,8 +52,3 @@
user-data = templatefile("${path.module}/cloud-config.yaml.tftpl", { wg0 = base64encode(data.google_secret_manager_secret_version_access.wireguard-config.secret_data) })
}
}

data "google_secret_manager_secret_version_access" "wireguard-config" {
project = local.project_id
secret = "maintenance-vm-wireguard-config"
}
@@ -21,6 +21,7 @@ This file defines:
*/

terraform {
required_version = "1.10.5"

backend "gcs" {
bucket = "k8s-infra-tf-gcp-gcve"
31 changes: 31 additions & 0 deletions infra/gcp/terraform/k8s-infra-gcp-gcve/nsx-gateway/README.md
@@ -0,0 +1,31 @@
# NSX Gateway

TODO: describe what this does
TODO: link from top-level READMEs

The wireguard config will look like:

```ini
[Interface]
PrivateKey = ...
Address = 192.168.29.6/24
PostUp = iptables -t nat -I POSTROUTING -o wg0 -j MASQUERADE
PostDown = iptables -t nat -D POSTROUTING -o wg0 -j MASQUERADE

[Peer]
Endpoint = 192.168.28.3:51820
PublicKey = ...
PersistentKeepalive = 25
# all except private networks
AllowedIPs = 0.0.0.0/5, 8.0.0.0/7, 11.0.0.0/8, 12.0.0.0/6, 16.0.0.0/4, 32.0.0.0/3, 64.0.0.0/2, 128.0.0.0/3, 160.0.0.0/5, 168.0.0.0/6, 172.0.0.0/12, 172.32.0.0/11, 172.64.0.0/10, 172.128.0.0/9, 173.0.0.0/8, 174.0.0.0/7, 176.0.0.0/4, 192.0.0.0/9, 192.128.0.0/11, 192.160.0.0/13, 192.169.0.0/16, 192.170.0.0/15, 192.172.0.0/14, 192.176.0.0/12, 192.192.0.0/10, 193.0.0.0/8, 194.0.0.0/7, 196.0.0.0/6, 200.0.0.0/5, 208.0.0.0/4, 224.0.0.0/3
```
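
Once the config is applied on the gateway VM, the tunnel state can be inspected with the standard wireguard CLI (a generic check, not specific to this setup):

```sh
sudo wg show wg0
```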

To get SSH access to the VM, redeploy using:

```sh
export TF_VAR_ssh_public_key="ssh-rsa ..."
terraform taint vsphere_virtual_machine.gateway_vm
terraform apply
```

Note: Redeployment causes connection issues for running CI jobs.
@@ -0,0 +1,29 @@
#cloud-config

write_files:
- path: /etc/wireguard/wg0.conf
content: "${wg0}"
encoding: b64
permissions: "0600"

- path: /etc/sysctl.d/10-wireguard.conf
content: |
net.ipv4.ip_forward = 1

users:
- name: ubuntu
primary_group: ubuntu
sudo: ALL=(ALL) NOPASSWD:ALL
shell: /bin/bash
groups: sudo, wheel
ssh_import_id: None
lock_passwd: true
ssh_authorized_keys:
- "${ssh_public_key}"

runcmd:
- apt-get update
- apt install wireguard -q -y
- sysctl -p /etc/sysctl.d/10-wireguard.conf
- systemctl enable wg-quick@wg0
- systemctl start wg-quick@wg0