Guest Agent IP Discovery - Architecture Guide

Date: 2025-11-27
Purpose: Document the guest-agent IP discovery pattern for all scripts

Overview

Scripts that connect to VMs over SSH discover VM IPs dynamically from the QEMU Guest Agent instead of hard-coding IP addresses. The key scripts are already refactored and the remaining ones are being migrated. This provides:

  • Flexibility: VMs can change IPs without breaking scripts
  • Maintainability: No IP addresses scattered throughout the codebase
  • Reliability: Single source of truth (guest agent)
  • Scalability: Easy to add new VMs without updating IP lists

Architecture

Helper Library

Location: scripts/lib/proxmox_vm_helpers.sh

Key Functions:

  • get_vm_ip_from_guest_agent <vmid> - Get IP from guest agent
  • get_vm_ip_or_warn <vmid> <name> - Get IP with warning if unavailable
  • get_vm_ip_or_fallback <vmid> <name> <fallback> - Get IP with fallback
  • ensure_guest_agent_enabled <vmid> - Enable agent in VM config
  • wait_for_guest_agent <vmid> <timeout> - Wait for agent to be ready
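To make the core lookup concrete, here is a minimal sketch of how `get_vm_ip_from_guest_agent` could be written with `qm guest cmd` and `jq`. It is an illustration under assumptions, not the contents of `proxmox_vm_helpers.sh`: the choice of "first non-loopback IPv4 address" is a guess at the intended behaviour.

```shell
# Hypothetical sketch; the real helper in proxmox_vm_helpers.sh may differ.
# Prints the first non-loopback IPv4 address reported by the guest agent,
# or nothing when the agent is disabled, not running, or the VM is off.
get_vm_ip_from_guest_agent() {
  local vmid="$1"
  # network-get-interfaces returns a JSON array of interfaces.
  qm guest cmd "$vmid" network-get-interfaces 2>/dev/null \
    | jq -r '
        [ .[]
          | select(.name != "lo")
          | (."ip-addresses" // [])[]
          | select(."ip-address-type" == "ipv4")
          | ."ip-address"
          | select(startswith("127.") | not)
        ] | .[0] // empty'
}
```

Because the function prints nothing on failure, callers can test the result with `-z` instead of relying on the exit status.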

VM Array Pattern

Before (hard-coded IPs):

VMS=(
    "100 cloudflare-tunnel 192.168.1.60"
    "101 k3s-master 192.168.1.188"
)

After (IP-free):

VMS=(
    "100 cloudflare-tunnel"
    "101 k3s-master"
)

Script Pattern

Before:

read -r vmid name ip <<< "$vm_spec"
ssh "${VM_USER}@${ip}" ...

After:

read -r vmid name <<< "$vm_spec"
ip="$(get_vm_ip_or_warn "$vmid" "$name" || true)"
[[ -z "$ip" ]] && continue
ssh "${VM_USER}@${ip}" ...
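The `|| true` in the pattern above matters when a script runs under `set -e`: if the helper returns a non-zero status for a VM without an IP, the assignment would otherwise abort the whole loop. A minimal sketch of such a wrapper, assuming `get_vm_ip_from_guest_agent` is already sourced and prints either an IP or nothing (the real helper may behave differently):

```shell
# Hypothetical sketch of get_vm_ip_or_warn <vmid> <name>; the real helper may differ.
get_vm_ip_or_warn() {
  local vmid="$1" name="$2" ip
  ip="$(get_vm_ip_from_guest_agent "$vmid" || true)"
  if [ -z "$ip" ]; then
    # Warn on stderr so that stdout carries only the IP address.
    echo "WARN: no IP from guest agent for VM $vmid ($name)" >&2
    return 1
  fi
  printf '%s\n' "$ip"
}
```

Keeping the warning on stderr is what lets callers capture the address with `$(...)` and still see the message in the terminal.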

Bootstrap Problem

The Challenge

Guest-agent IP discovery only works after the QEMU Guest Agent is installed and running in the VM and enabled in the VM config.

Solution: Fallback Pattern

For bootstrap scripts (those that install the QEMU Guest Agent, QGA, itself), use fallback IPs:

# Fallback IPs for bootstrap
declare -A FALLBACK_IPS=(
  ["100"]="192.168.1.60"
  ["101"]="192.168.1.188"
)

# Get IP with fallback
ip="$(get_vm_ip_or_fallback "$vmid" "$name" "${FALLBACK_IPS[$vmid]:-}" || true)"
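The fallback lookup can be pictured as "try the guest agent, then use the supplied address". The sketch below is one way `get_vm_ip_or_fallback` might behave; the warning texts and return codes are assumptions, and `get_vm_ip_from_guest_agent` is assumed to be sourced already.

```shell
# Hypothetical sketch of get_vm_ip_or_fallback <vmid> <name> <fallback>;
# the real helper may differ.
get_vm_ip_or_fallback() {
  local vmid="$1" name="$2" fallback="${3:-}" ip
  ip="$(get_vm_ip_from_guest_agent "$vmid" || true)"
  if [ -n "$ip" ]; then
    printf '%s\n' "$ip"          # agent answered: prefer its address
    return 0
  fi
  if [ -n "$fallback" ]; then
    echo "WARN: guest agent unavailable for VM $vmid ($name); using fallback $fallback" >&2
    printf '%s\n' "$fallback"
    return 0
  fi
  echo "WARN: no IP and no fallback for VM $vmid ($name)" >&2
  return 1
}
```

With this shape, a VM that has no entry in `FALLBACK_IPS` (an empty third argument) is simply skipped by the caller's `-z` check.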

Bootstrap Flow

  1. First Pass: Use fallback IPs to install QGA
  2. After QGA: All subsequent scripts use guest-agent discovery
  3. No More Hard-coded IPs: Once QGA is installed everywhere, the fallback IPs can be removed
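Between the first and second steps of this flow, a script needs to know when the freshly installed agent is actually answering. A minimal sketch of `wait_for_guest_agent`, assuming it polls `qm guest cmd <vmid> ping`; the optional third argument (poll interval in seconds) is a hypothetical addition for illustration and is not part of the documented signature.

```shell
# Hypothetical sketch of wait_for_guest_agent <vmid> <timeout>; the real helper may differ.
# Polls the agent until it answers a ping or the timeout (in seconds) expires.
wait_for_guest_agent() {
  local vmid="$1" timeout="${2:-60}" interval="${3:-2}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if qm guest cmd "$vmid" ping >/dev/null 2>&1; then
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "WARN: guest agent on VM $vmid not ready after ${timeout}s" >&2
  return 1
}
```

A bootstrap script could call this right after installing QGA over the fallback IP, then switch to `get_vm_ip_or_warn` for everything that follows.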

Updated Scripts

Refactored Scripts

  1. scripts/ops/ssh-test-all.sh - Example SSH test script
  2. scripts/deploy/configure-vm-services.sh - Service deployment
  3. scripts/deploy/add-ssh-keys-to-vms.sh - SSH key management
  4. scripts/deploy/verify-cloud-init.sh - Cloud-init verification
  5. scripts/infrastructure/install-qemu-guest-agent.sh - QGA installation (with fallback)

📋 Scripts to Update

All scripts that use hard-coded IPs should be updated:

  • scripts/troubleshooting/diagnose-vm-issues.sh
  • scripts/troubleshooting/test-all-access-paths.sh
  • scripts/deploy/deploy-vms-via-api.sh (IPs are needed at creation time, but discovery can be used afterwards)
  • And many more...

Usage Examples

Example 1: Simple SSH Script

#!/bin/bash
# PROJECT_ROOT and VM_USER are expected to be set by the caller or the environment.
source "$PROJECT_ROOT/scripts/lib/proxmox_vm_helpers.sh"

VMS=(
  "100 cloudflare-tunnel"
  "101 k3s-master"
)

for vm_spec in "${VMS[@]}"; do
  read -r vmid name <<< "$vm_spec"
  ip="$(get_vm_ip_or_warn "$vmid" "$name" || true)"
  [[ -z "$ip" ]] && continue
  
  ssh "${VM_USER}@${ip}" "hostname"
done

Example 2: Bootstrap Script (with Fallback)

#!/bin/bash
source "$PROJECT_ROOT/scripts/lib/proxmox_vm_helpers.sh"

declare -A FALLBACK_IPS=(
  ["100"]="192.168.1.60"
)

# VMS is an IP-free array of "VMID NAME" entries, as in Example 1
for vm_spec in "${VMS[@]}"; do
  read -r vmid name <<< "$vm_spec"
  ip="$(get_vm_ip_or_fallback "$vmid" "$name" "${FALLBACK_IPS[$vmid]:-}" || true)"
  [[ -z "$ip" ]] && continue
  
  # Install and start QGA using the discovered or fallback IP
  ssh "${VM_USER}@${ip}" "sudo apt-get update && sudo apt-get install -y qemu-guest-agent && sudo systemctl enable --now qemu-guest-agent"
done

Example 3: Service Deployment

#!/bin/bash
source "$PROJECT_ROOT/scripts/lib/proxmox_vm_helpers.sh"

declare -A VM_IPS

# Discover all IPs first
for vm_spec in "${VMS[@]}"; do
  read -r vmid name <<< "$vm_spec"
  ip="$(get_vm_ip_or_warn "$vmid" "$name" || true)"
  [[ -n "$ip" ]] && VM_IPS["$vmid"]="$ip"
done

# Use discovered IPs (VMID 102 must be listed in VMS; deploy_gitea is defined by the calling script)
if [[ -n "${VM_IPS[102]:-}" ]]; then
  deploy_gitea "${VM_IPS[102]}"
fi

Prerequisites

On Proxmox Host

  1. jq installed:

    apt update && apt install -y jq
    
  2. Helper library accessible:

    • Scripts run on Proxmox host: Direct access
    • Scripts run remotely: Copy helper or source via SSH

In VMs

  1. QEMU Guest Agent installed:

    sudo apt install -y qemu-guest-agent
    sudo systemctl enable --now qemu-guest-agent
    
  2. Agent enabled in VM config:

    qm set <vmid> --agent enabled=1
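The second prerequisite can be made idempotent by checking the VM config before changing it. A minimal sketch of how `ensure_guest_agent_enabled` might do this, assuming `qm config` prints the option as a line starting with `agent:` (for example `agent: 1` or `agent: enabled=1`):

```shell
# Hypothetical sketch of ensure_guest_agent_enabled <vmid>; the real helper may differ.
# Enables the agent option only when the VM config does not already have it on.
ensure_guest_agent_enabled() {
  local vmid="$1" agent_line
  agent_line="$(qm config "$vmid" 2>/dev/null | grep '^agent:' || true)"
  case "$agent_line" in
    "agent: 1"*|*"enabled=1"*)
      return 0 ;;                 # already enabled, nothing to change
  esac
  qm set "$vmid" --agent enabled=1
}
```

Skipping the `qm set` call when the option is already on keeps repeated runs of a bootstrap script quiet.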
    

Migration Checklist

For each script that uses hard-coded IPs:

  • Remove IPs from VM array (keep only VMID and NAME)
  • Add source for helper library
  • Replace read -r vmid name ip with read -r vmid name
  • Add IP discovery: ip="$(get_vm_ip_or_warn "$vmid" "$name" || true)"
  • Add skip logic: [[ -z "$ip" ]] && continue
  • Test script with guest agent enabled
  • For bootstrap scripts, add fallback IPs
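To find candidates for this checklist, a loose search for IPv4-looking literals is usually enough. The function below is a hypothetical convenience, not part of the helper library; the regex is deliberately simple, so expect false positives such as version strings.

```shell
# Hypothetical helper: list lines in *.sh files under a directory that
# contain something shaped like an IPv4 address (file:line:content).
find_hardcoded_ips() {
  local dir="${1:-scripts}"
  find "$dir" -type f -name '*.sh' \
    -exec grep -HnE '([0-9]{1,3}\.){3}[0-9]{1,3}' {} +
}
```

For example, `find_hardcoded_ips scripts/troubleshooting` would list the lines to review in that directory.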

Benefits

  1. No IP Maintenance: Scripts keep working when VM IPs change
  2. Single Source of Truth: The guest agent reports the IPs the VM is actually using
  3. Easier Testing: Can test with different IPs without code changes
  4. Better Error Handling: Scripts gracefully handle a missing guest agent
  5. Future-Proof: Works with DHCP, dynamic IPs, and multiple interfaces

Troubleshooting

"No IP from guest agent"

Causes:

  • QEMU Guest Agent not installed in VM
  • Agent not enabled in VM config
  • VM not powered on
  • Agent service not running

Fix:

# In VM
sudo apt install -y qemu-guest-agent
sudo systemctl enable --now qemu-guest-agent

# On Proxmox host (the change takes effect after the VM is fully stopped and started)
qm set <vmid> --agent enabled=1
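The causes listed above can be checked in order from the Proxmox host. The function below is a hypothetical diagnostic, not part of the helper library; it assumes `qm status` prints `status: running` for a powered-on VM and that `qm guest cmd <vmid> ping` succeeds only when the agent answers.

```shell
# Hypothetical diagnostic: report the first failing prerequisite for a VM.
diagnose_guest_agent() {
  local vmid="$1"
  if ! qm status "$vmid" 2>/dev/null | grep -q 'status: running'; then
    echo "VM $vmid is not running"; return 1
  fi
  if ! qm config "$vmid" 2>/dev/null | grep -Eq '^agent: (1|.*enabled=1)'; then
    echo "agent not enabled in VM config"; return 1
  fi
  if ! qm guest cmd "$vmid" ping >/dev/null 2>&1; then
    echo "agent not installed or service not running in VM"; return 1
  fi
  echo "guest agent OK"
}
```

The last check cannot tell a missing package from a stopped service; that distinction has to be made inside the VM.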

"jq command not found"

Fix:

apt update && apt install -y jq

Scripts run remotely (not on Proxmox host)

Options:

  1. Copy helper library to remote location
  2. Source via SSH:
    ssh proxmox-host "source /path/to/helpers.sh && get_vm_ip_or_warn 100 test"
    
  3. Use Proxmox API instead of qm commands
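For the third option, a sketch of what an API-based lookup could look like with `curl` and `jq`. Everything configurable here is an assumption: `PVE_HOST`, `PVE_NODE`, and `PVE_API_TOKEN` (in the form `USER@REALM!TOKENID=SECRET`) are hypothetical environment variables, and the function name is invented for illustration. On hosts with a self-signed certificate, `curl` will also need a CA file passed with `--cacert`.

```shell
# Hypothetical sketch: read a VM's first non-loopback IPv4 address through the
# Proxmox REST API instead of qm. PVE_HOST, PVE_NODE and PVE_API_TOKEN are
# assumed environment variables, not something the original scripts define.
get_vm_ip_via_api() {
  local vmid="$1"
  curl -fsS \
    -H "Authorization: PVEAPIToken=${PVE_API_TOKEN}" \
    "https://${PVE_HOST}:8006/api2/json/nodes/${PVE_NODE}/qemu/${vmid}/agent/network-get-interfaces" \
    | jq -r '
        [ .data.result[]?
          | select(.name != "lo")
          | (."ip-addresses" // [])[]
          | select(."ip-address-type" == "ipv4")
          | ."ip-address"
        ] | .[0] // empty'
}
```

This avoids shipping the helper library to the remote machine, at the cost of managing an API token with permission to query the guest agent.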

Status: Helper library created, key scripts refactored. Remaining scripts should follow the same pattern.