Auto-Scaling for Heimdall Proxies in VMware vSphere

This guide provides a lightweight approach to simulate auto-scaling for Heimdall Proxy virtual machines in a VMware vSphere environment using govc and Bash scripting. It covers VM setup, automation prerequisites, and a CPU-based scaling example to help optimize proxy performance during peak loads.

vSphere vs vCenter: A Simple Comparison Guide

vSphere is the overall VMware virtualization platform—a suite that includes components like ESXi (the hypervisor), vCenter Server, and the vSphere Client.
vCenter, on the other hand, is the centralized management tool within vSphere that lets you manage multiple ESXi hosts and virtual machines from one place.

In short:
vSphere = the whole platform,
vCenter = the control center for managing it.


Prerequisites

  • ESXi host installed on bare metal and reachable on the network.
  • ESXi Host Client URL: https://<ESXi-IP>/ui
    Example (lab): https://192.168.88.185/ui
    Credentials (example): root / <password>

  • vCenter Server Appliance (VCSA) deployed as a VM on the ESXi host.

  • vSphere Client URL (inventory/VM mgmt): https://<vcenter-ip>/ui
    Example (lab): https://192.168.88.176/ui
    SSO Admin (example): administrator@<SSO-domain> / <password>
    Lab example: administrator@heimdall.local / heimdall

💡 Tip: The VAMI (appliance admin portal) is at https://<vcenter-ip>:5480 for health, services, and updates (not for VM/host inventory).


Create a Datacenter and Add the ESXi Host

  1. Log in to the vSphere Client: https://<vcenter-ip>/ui
  2. Right‑click the vCenter object (top of the left tree) → New Datacenter.
    • Name: e.g., ha-datacenter
  3. Right‑click the new Datacenter → Add Host
  4. Enter the ESXi host IP (e.g., 192.168.88.185) and root credentials.
  5. Accept the SSL certificate and complete the wizard (license, lockdown mode, VM placement if prompted).

Validation: You should see the ESXi host under the Datacenter in Hosts and Clusters, and its VMs under VMs and Templates.


What is govc?

To interact with vSphere from the command line, you'll need a tool that can communicate with VMware environments.
govc is a lightweight CLI built on the govmomi Go library, designed for automating and scripting tasks in vSphere environments such as ESXi or vCenter.

Think of govc as kubectl, but for vSphere.


govc installation steps on Linux

Run the following commands to install govc on a Linux machine (e.g., Heimdall Manager):

# Step 1: Download the latest release
curl -LO https://github.com/vmware/govmomi/releases/latest/download/govc_Linux_x86_64.tar.gz

# Step 2: Extract the archive
tar -xvf govc_Linux_x86_64.tar.gz

# Step 3: Move the binary to your PATH
sudo mv govc /usr/local/bin/

# Step 4: Make it executable
sudo chmod +x /usr/local/bin/govc

# Step 5: Verify the installation
govc version
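The download above assumes an x86_64 machine. As a minimal sketch, the release asset can instead be picked from `uname -m`. The arm64 asset name is an assumption based on the govmomi release naming pattern, so confirm it on the releases page; the function names `govc_asset_for_arch` and `govc_download_url` are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: map a CPU architecture (default: `uname -m`) to a govc
# release asset name. The arm64 name is an assumption; verify it on the
# govmomi releases page before relying on it.
govc_asset_for_arch() {
  local arch="${1:-$(uname -m)}"
  case "$arch" in
    x86_64|amd64)  echo "govc_Linux_x86_64.tar.gz" ;;
    aarch64|arm64) echo "govc_Linux_arm64.tar.gz" ;;
    *)
      echo "unsupported architecture: $arch" >&2
      return 1
      ;;
  esac
}

# Build the "latest release" download URL for that asset.
govc_download_url() {
  local asset
  asset="$(govc_asset_for_arch "${1:-}")" || return 1
  echo "https://github.com/vmware/govmomi/releases/latest/download/${asset}"
}

# Usage (sketch):
#   curl -LO "$(govc_download_url)"
#   tar -xzf "$(govc_asset_for_arch)"
```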

Configure govc Environment Variables

govc reads its connection settings from environment variables. Export these once per shell session (or persist them in your shell profile).

# vCenter/ESXi endpoint
export GOVC_URL='https://<vcenter-ip>/sdk'

# SSO/Local credentials
export GOVC_USERNAME='administrator@<SSO-domain>'
export GOVC_PASSWORD='<password>'

# Skip TLS certificate verification, e.g. for self-signed certs (set to 0 if the endpoint has a CA-trusted cert)
export GOVC_INSECURE=1

# Scope to your datacenter (recommended if you have more than one)
export GOVC_DATACENTER='<datacenter>'   # e.g., ha-datacenter
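Because govc reads these variables implicitly, a script can check them before its first govc call instead of failing later with a connection error. A minimal sketch; the function name `require_govc_env` is hypothetical, and GOVC_DATACENTER is treated as optional to match the comment above.

```bash
#!/bin/bash
# Hypothetical helper: report any required govc connection variables that are
# unset or empty. Returns 0 when all are present, 1 otherwise.
require_govc_env() {
  local missing=() var
  for var in GOVC_URL GOVC_USERNAME GOVC_PASSWORD; do
    # ${!var:-} is Bash indirect expansion: the value of the variable named by $var
    if [[ -z "${!var:-}" ]]; then
      missing+=("$var")
    fi
  done
  if (( ${#missing[@]} > 0 )); then
    echo "missing govc settings: ${missing[*]}" >&2
    return 1
  fi
  return 0
}

# Usage (sketch):
#   require_govc_env && govc about
```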

Verify the connection

govc about                 # shows vCenter/ESXi version
govc datacenter.info       # shows your DC details
govc ls /                  # lists inventory roots (DCs)
govc find -type m          # lists all VMs

Expect: the vCenter/ESXi version, a datacenter summary, the inventory roots, and a list of VMs.
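`govc find -name` can match more than one VM, and the auto-scaling script later in this guide simply takes the first match. A minimal sketch of a guard that accepts a name only when exactly one inventory path matches; `unique_vm_path` is a hypothetical name.

```bash
#!/bin/bash
# Hypothetical helper: read inventory paths (one per line, as printed by
# `govc find`) on stdin and print the path only if exactly one VM matched.
unique_vm_path() {
  local paths=() line
  # "|| [[ -n $line ]]" also keeps a final line that has no trailing newline
  while IFS= read -r line || [[ -n "$line" ]]; do
    if [[ -n "$line" ]]; then
      paths+=("$line")
    fi
  done
  case ${#paths[@]} in
    1) printf '%s\n' "${paths[0]}" ;;
    0) echo "no VM matched" >&2; return 1 ;;
    *) echo "ambiguous: ${#paths[@]} VMs matched" >&2; return 1 ;;
  esac
}

# Usage (sketch):
#   VM_PATH="$(govc find -type m -name HDProxy000 | unique_vm_path)" || exit 1
```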

Auto-scaling on vSphere

Auto‑scaling on vSphere can be achieved with custom scripts (e.g., Bash + govc) that monitor a VM’s resource usage and automatically clone or power on additional instances when defined thresholds are exceeded.

Example Auto-Scaling Logic

  1. Monitor CPU and/or memory usage of VM-A.
  2. If usage exceeds a defined threshold for X minutes:
    → Clone a new VM or power on a standby VM (e.g., VM-B) to help balance load.
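Step 2 asks for usage above the threshold for X minutes, not a single spike. A minimal sketch of that check over a rolling history of samples, assuming one integer CPU percentage per check interval; `sustained_over` is a hypothetical name.

```bash
#!/bin/bash
# Hypothetical helper: succeed only if the most recent $2 samples are all
# strictly above threshold $1. Samples are integer percentages, oldest first.
# With one sample every 30 s, "above 80% for 5 minutes" means the last
# 5 * 60 / 30 = 10 samples must all exceed 80.
sustained_over() {
  local threshold=$1 required=$2
  shift 2
  local samples=("$@")
  local n=${#samples[@]} i
  (( n >= required )) || return 1   # not enough history yet
  for (( i = n - required; i < n; i++ )); do
    (( samples[i] > threshold )) || return 1
  done
  return 0
}

# Usage inside a polling loop (sketch):
#   history+=("$cpu_pct")
#   if (( ${#history[@]} > 10 )); then history=("${history[@]:1}"); fi
#   if sustained_over 80 10 "${history[@]}"; then echo "scale out"; fi
```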

Basic Auto-scaling (Bash): CPU-driven Clone

#!/bin/bash
set -euo pipefail

# --- govc environment (set these if the shell/service doesn't inherit your env) ---
# export GOVC_URL='https://<vcenter>/sdk'
# export GOVC_USERNAME='<user>'    # administrator@<SSO-domain>
# export GOVC_PASSWORD='<pass>'
# export GOVC_INSECURE=1           # if you don't have proper CA trust
# export GOVC_DATACENTER='DC'      # optional; /ha-datacenter on standalone ESXi

ACTIVE_PROXY_VM="HDProxy000"
TEMPLATE_PROXY_VM="HDProxyTemplate"
CPU_THRESHOLD=80
CHECK_INTERVAL=30   # seconds between checks when no clone action is taken

# Find the full inventory path (first match) for the ACTIVE VM (monitoring target)
VM_PATH="$(govc find -type m -name "$ACTIVE_PROXY_VM" | head -n1 || true)"
if [[ -z "$VM_PATH" ]]; then
  echo "$(date): ERROR: VM '$ACTIVE_PROXY_VM' not found."
  exit 1
fi

running=true
isClone=false
# Quiet window after a clone: 1800 s (30 min) / CHECK_INTERVAL ticks, i.e. 60 ticks at 30 s
countDown=$(( 1800 / CHECK_INTERVAL ))

trap 'running=false' INT TERM
trap 'echo "$(date): Stopped."' EXIT

while $running; do
  # If we recently cloned, run a countdown and skip cloning
  if [[ "$isClone" == true && $countDown -gt 0 ]]; then
    printf "%s: Clone completed earlier; allowing load to rebalance (~%dm %ds left).\n" \
      "$(date)" $(( (countDown*CHECK_INTERVAL)/60 )) $(( (countDown*CHECK_INTERVAL)%60 ))
    countDown=$((countDown - 1))
    sleep "$CHECK_INTERVAL"
    continue
  fi

  # If countdown finished, reset the flag
  if [[ "$isClone" == true && $countDown -le 0 ]]; then
    isClone=false
    countDown=$(( 1800 / CHECK_INTERVAL ))
  fi

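  # Fetch the latest CPU usage sample. "|| true" keeps a transient govc error
  # from aborting the loop under "set -e"; cpu_pct then falls back to 0 below.
  cpu_pct="$(govc metric.sample -n=1 "$VM_PATH" cpu.usage.average 2>/dev/null \
             | awk '$2=="-" && $3=="cpu.usage.average"{print $(NF-1)}' \
             | awk '{printf "%.0f", $1}' || true)"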
  : "${cpu_pct:=0}"

  echo "$(date): CPU usage for $ACTIVE_PROXY_VM is ${cpu_pct}%"

  if [[ $cpu_pct -gt $CPU_THRESHOLD && "$isClone" == false ]]; then
    NEW_VM="HDProxy$(date +%s)"   # or switch to HDProxy### pattern
    echo "$(date): CPU > ${CPU_THRESHOLD}% — cloning $TEMPLATE_PROXY_VM -> $NEW_VM"
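    if govc vm.clone -vm "$TEMPLATE_PROXY_VM" -on=true "$NEW_VM"; then
      echo "$(date): Clone completed and powered on: $NEW_VM"
      # Start the 30-minute quiet window (no additional clones)
      isClone=true
      countDown=$(( 1800 / CHECK_INTERVAL ))
    else
      # A failed clone would otherwise abort the whole script under "set -e"
      echo "$(date): ERROR: clone of $NEW_VM failed; retrying in ${CHECK_INTERVAL}s..."
      sleep "$CHECK_INTERVAL"
    fi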
  else
    echo "$(date): CPU normal. No action taken. Checking again in ${CHECK_INTERVAL}s..."
    sleep "$CHECK_INTERVAL"
  fi
done

💡 Note:

  • ACTIVE_PROXY_VM="HDProxy000" is the running Heimdall Proxy VM whose CPU usage is continuously monitored.
  • TEMPLATE_PROXY_VM="HDProxyTemplate" is a powered-off Heimdall proxy clone source; when the threshold is exceeded, the script clones from this template and powers on the new VM.
  • When CPU on ACTIVE_PROXY_VM exceeds 80%, a new VM (e.g., HDProxy<timestamp>) is created via govc vm.clone -vm "$TEMPLATE_PROXY_VM" -on=true.
  • The script runs in a loop and can be stopped gracefully; trap 'running=false' INT TERM lets Ctrl+C or a service stop signal exit the loop cleanly.
  • After cloning, the script enters a 30-minute quiet window (no additional clones) and prints a status message each interval while load rebalances; monitoring then continues as normal.
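  • This script simulates basic auto-scaling behavior using govc on vSphere. For simplicity it reacts to a single sample above the threshold rather than the sustained "X minutes" condition described in the auto-scaling logic above.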
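The script's comment mentions an HDProxy### naming pattern as an alternative to timestamps. A minimal sketch, assuming clones are named HDProxy followed by exactly three digits; `next_proxy_name` is a hypothetical name, and other names (such as HDProxyTemplate or timestamp-named clones) are ignored.

```bash
#!/bin/bash
# Hypothetical helper: read existing VM names or inventory paths (one per line)
# on stdin and print the next free "<prefix>NNN" name, zero-padded to 3 digits.
next_proxy_name() {
  local prefix=${1:-HDProxy}
  local re="^${prefix}([0-9]{3})\$"
  local max=-1 name num
  while IFS= read -r name || [[ -n "$name" ]]; do
    name=${name##*/}                  # strip any inventory path
    if [[ $name =~ $re ]]; then
      num=$((10#${BASH_REMATCH[1]}))  # base 10, so "008" is not read as octal
      if (( num > max )); then
        max=$num
      fi
    fi
  done
  printf '%s%03d\n' "$prefix" $((max + 1))
}

# Usage (sketch):
#   NEW_VM="$(govc find -type m | next_proxy_name)"
```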

Sample Output

root@heimdall:/opt/heimdall# ./newhdcronjob.sh 
Tue Aug  5 11:35:45 AM UTC 2025: CPU usage for HDProxy000 is 15%
Tue Aug  5 11:35:45 AM UTC 2025: CPU normal. No action taken. Checking again in 30s...
Tue Aug  5 11:35:48 AM UTC 2025: CPU usage for HDProxy000 is 15%
Tue Aug  5 11:35:48 AM UTC 2025: CPU normal. No action taken. Checking again in 30s...
Tue Aug  5 11:35:51 AM UTC 2025: CPU usage for HDProxy000 is 45%
Tue Aug  5 11:35:51 AM UTC 2025: CPU normal. No action taken. Checking again in 30s...
Tue Aug  5 11:35:54 AM UTC 2025: CPU usage for HDProxy000 is 55%
Tue Aug  5 11:35:54 AM UTC 2025: CPU normal. No action taken. Checking again in 30s...
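The CPU-extraction step of the script can be isolated into a function and tested against a captured line. This sketch keeps the script's own assumption about the `govc metric.sample` text layout (name, instance, metric, value, unit, with `-` as the aggregate instance) and only handles the single value produced by `-n=1`; check the layout against your govc version. `parse_cpu_pct` is a hypothetical name.

```bash
#!/bin/bash
# Hypothetical helper: extract a rounded CPU percentage from
# `govc metric.sample -n=1` text output on stdin. Assumes lines shaped like
#   NAME  INSTANCE  METRIC  VALUE  UNIT
# as the monitoring script's awk filter does. Prints 0 if nothing matches.
parse_cpu_pct() {
  local metric=${1:-cpu.usage.average}
  awk -v m="$metric" '
    $2 == "-" && $3 == m {
      printf "%.0f\n", $(NF-1)   # value is the second-to-last field
      found = 1
      exit                       # END still runs, but prints nothing now
    }
    END { if (!found) print 0 }
  '
}

# Usage (sketch):
#   cpu_pct="$(govc metric.sample -n=1 "$VM_PATH" cpu.usage.average | parse_cpu_pct)"
```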