# NPMplus High Availability (HA) Setup Guide

**Last Updated:** 2026-01-31  
**Document Version:** 1.0  
**Status:** Active Documentation

---

**Date**: 2026-01-20  
**Status**: Complete HA Architecture Guide  
**Purpose**: Comprehensive guide for deploying High Availability NPMplus architecture

---

## Overview

This guide provides step-by-step instructions for deploying a highly available NPMplus setup to eliminate the single point of failure in the ingress architecture.

### Current Architecture

- **Single NPMplus Instance**: VMID 10233 on r630-01 (192.168.11.166)
- **Single Point of Failure**: All 19+ domains depend on one container
- **No Redundancy**: Container failure = complete ingress outage

### Target HA Architecture

- **Multiple NPMplus Instances**: Primary + Secondary (optionally Tertiary)
- **Shared Storage**: Database and certificates synchronized
- **Load Balancer**: Distributes traffic across instances
- **Automatic Failover**: Health checks and automatic routing

---

## HA Architecture Options

### Option 1: Active-Passive with Keepalived (Recommended for Start)

**Architecture**:

```
Internet
    ↓
Cloudflare DNS → 76.53.10.36
    ↓
UDM Pro Port Forward (80/443)
    ↓
Keepalived Virtual IP (192.168.11.166)
    ├─ Primary NPMplus (VMID 10233) - Active
    └─ Secondary NPMplus (VMID 10234) - Standby
    ↓
Backend VMs
```

**Pros**:

- Simple configuration
- No changes to existing DNS/port forwarding
- Automatic failover
- Single active instance (easier certificate management)

**Cons**:

- Secondary instance idle (no load distribution)
- Requires shared storage for certificates

---

### Option 2: Active-Active with HAProxy Load Balancer

**Architecture**:

```
Internet
    ↓
Cloudflare DNS → 76.53.10.36
    ↓
UDM Pro Port Forward (80/443)
    ↓
HAProxy (192.168.11.166)
    ├─ Primary NPMplus (VMID 10233) - Active
    └─ Secondary NPMplus (VMID 10234) - Active
    ↓
Backend VMs
```

**Pros**:

- Load distribution across instances
- Better resource utilization
- Automatic failover
- Can handle more traffic

**Cons**:

- More complex configuration
- Requires shared storage for database and certificates
- Need to handle SSL termination at HAProxy or NPMplus

---

### Option 3: Active-Active with Shared Database (Advanced)

**Architecture**:

```
Internet
    ↓
Cloudflare DNS → 76.53.10.36
    ↓
UDM Pro Port Forward (80/443)
    ↓
Keepalived Virtual IP (192.168.11.166)
    ├─ Primary NPMplus (VMID 10233)
    └─ Secondary NPMplus (VMID 10234)
    ↓
(Shared Resources)
    ├─ PostgreSQL/MariaDB Database (Shared)
    ├─ NFS/GlusterFS for Certificates (Shared)
    └─ Shared Configuration Storage
    ↓
Backend VMs
```

**Pros**:

- True active-active (both instances serving traffic)
- Shared database ensures configuration sync
- Shared certificate storage

**Cons**:

- Most complex to implement
- Requires external database
- Requires shared file storage (NFS/GlusterFS)
- NPMplus uses SQLite (would need migration)

---

## Recommended Approach: Active-Passive with Keepalived

For the initial HA implementation, **Option 1 (Active-Passive with Keepalived)** is recommended because:

1. Minimal changes to existing architecture
2. Reuses existing NPMplus configuration
3. Easier to implement and test
4. Can be upgraded to active-active later

This guide focuses on **Option 1**, with notes on how to upgrade to **Option 2** later.

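
To make the automatic failover in Option 1 more concrete, here is a simplified model of how an active-passive pair decides which node holds the virtual IP. It is a sketch rather than Keepalived's actual implementation: the function names `effective_priority` and `elect_active`, the base priorities (110 and 100), and the health-check penalty (20) are illustrative assumptions.

```bash
#!/bin/bash
# Simplified model of active-passive election; this is not Keepalived itself.
# The priorities and the penalty below are example values for this sketch.

# effective_priority BASE HEALTHY PENALTY
# Prints BASE when HEALTHY is 1, otherwise BASE minus PENALTY.
effective_priority() {
    local base=$1 healthy=$2 penalty=$3
    if [ "$healthy" -eq 1 ]; then
        echo "$base"
    else
        echo $((base - penalty))
    fi
}

# elect_active PRIMARY_HEALTHY SECONDARY_HEALTHY
# Prints which node would hold the virtual IP in this model.
elect_active() {
    local primary secondary
    primary=$(effective_priority 110 "$1" 20)
    secondary=$(effective_priority 100 "$2" 20)
    # The standby takes over only when its priority is strictly higher.
    if [ "$secondary" -gt "$primary" ]; then
        echo "secondary"
    else
        echo "primary"
    fi
}

elect_active 1 1   # both healthy      -> primary   (110 vs 100)
elect_active 0 1   # primary unhealthy -> secondary (90 vs 100)
elect_active 0 0   # both unhealthy    -> primary   (90 vs 80)
```

In this model the penalty is deliberately larger than the gap between the two base priorities. A smaller penalty would leave the unhealthy primary at or above the standby, and no failover would occur.
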
---

## Prerequisites

### Infrastructure Requirements

- **Primary Proxmox Host**: r630-01 (192.168.11.11) - Existing NPMplus
- **Secondary Proxmox Host**: r630-02 (192.168.11.12) or ml110 (192.168.11.10) - For secondary NPMplus
- **Shared Storage**: NFS or rsync-based synchronization for certificates
- **Network**: Both hosts on same VLAN (192.168.11.0/24)

### Software Requirements

- Keepalived (for virtual IP)
- rsync or NFS (for certificate synchronization)
- Monitoring tools (for health checks)

### Current NPMplus Details

- **VMID**: 10233
- **Host**: r630-01 (192.168.11.11)
- **Container IP**: 192.168.11.166 (eth0)
- **Management Port**: 81
- **Database**: `/data/database.sqlite`
- **Certificates**: `/data/tls/certbot/live/`

---

## Step-by-Step Implementation

### Phase 1: Prepare Secondary NPMplus Instance

#### Step 1.1: Create Secondary NPMplus Container

**Target**: VMID 10234 on r630-02 (192.168.11.12)

```bash
# On Proxmox host (r630-02)
CTID=10234
HOSTNAME="npmplus-secondary"
IP="192.168.11.168"
BRIDGE="vmbr0"

# Download Alpine template
# (confirm the exact template name first with: pveam available --section system)
pveam download local alpine-3.22-default_20241208_amd64.tar.xz

# Create container
pct create $CTID \
  local:vztmpl/alpine-3.22-default_20241208_amd64.tar.xz \
  --hostname $HOSTNAME \
  --memory 1024 \
  --cores 2 \
  --rootfs local-lvm:5 \
  --net0 name=eth0,bridge=$BRIDGE,ip=$IP/24,gw=192.168.11.1 \
  --unprivileged 1 \
  --features nesting=1

# Start container
pct start $CTID

# Wait for container to be ready
sleep 10
```

#### Step 1.2: Install NPMplus on Secondary Instance

```bash
# SSH to Proxmox host
ssh root@192.168.11.12

# Enter container
pct exec 10234 -- ash

# Install dependencies
apk update
apk add --no-cache tzdata gawk yq docker docker-compose curl bash rsync

# Start Docker
rc-service docker start
rc-update add docker default

# Wait for Docker
sleep 5

# Fetch NPMplus compose file
cd /opt
curl -fsSL "https://raw.githubusercontent.com/ZoeyVid/NPMplus/refs/heads/develop/compose.yaml" -o compose.yaml

# Update compose file with timezone and email
TZ="America/New_York"
ACME_EMAIL="nsatoshi2007@hotmail.com"

yq -i "
  .services.npmplus.environment |=
    (map(select(. != \"TZ=*\" and . != \"ACME_EMAIL=*\")) +
    [\"TZ=$TZ\", \"ACME_EMAIL=$ACME_EMAIL\"])
" compose.yaml

# Start NPMplus (its configuration is synced from the primary in Phase 4)
docker compose up -d
```

#### Step 1.3: Configure Secondary Container Network

```bash
# The secondary container should have the static IP assigned in Step 1.1
# VMID 10234: 192.168.11.168 (eth0)

# Verify IP
pct exec 10234 -- ip addr show eth0
```

---

### Phase 2: Set Up Certificate Synchronization

#### Step 2.1: Create Certificate Sync Script

**Location**: `scripts/npmplus/sync-certificates.sh`

```bash
#!/bin/bash
# Synchronize NPMplus certificates from primary to secondary

set -euo pipefail

PRIMARY_HOST="192.168.11.11"
PRIMARY_VMID="10233"
SECONDARY_HOST="192.168.11.12"
SECONDARY_VMID="10234"
CERT_PATH="/data/tls/certbot/live"

# Colors
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
NC='\033[0m'

log_info() { echo -e "${GREEN}[INFO]${NC} $1"; }
log_warn() { echo -e "${YELLOW}[WARN]${NC} $1"; }
log_error() { echo -e "${RED}[ERROR]${NC} $1"; }

log_info "Starting certificate synchronization..."
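# Sync certificates from primary to secondary.
# rsync cannot copy between two remote hosts in one call, so stage the files locally.
# --copy-links stores real files, since certbot normally keeps live/ entries as symlinks into archive/.
# NOTE: the /var/lib/vz/containers/... paths are kept as written; confirm they exist on your hosts.
STAGING_DIR=$(mktemp -d)
trap 'rm -rf "$STAGING_DIR"' EXIT

rsync -avz --delete --copy-links \
  -e "ssh -o StrictHostKeyChecking=no" \
  root@$PRIMARY_HOST:"/var/lib/vz/containers/$PRIMARY_VMID/var/lib/docker/volumes/npmplus_data/_data/tls/certbot/live/" \
  "$STAGING_DIR/"

rsync -avz --delete \
  -e "ssh -o StrictHostKeyChecking=no" \
  "$STAGING_DIR/" \
  root@$SECONDARY_HOST:"/var/lib/vz/containers/$SECONDARY_VMID/var/lib/docker/volumes/npmplus_data/_data/tls/certbot/live/"

log_info "Certificate synchronization complete"
```

**Make executable**:

```bash
chmod +x scripts/npmplus/sync-certificates.sh
```

#### Step 2.2: Set Up Automated Certificate Sync

**Cron Job** (runs every 5 minutes):

```bash
# On primary Proxmox host (r630-01)
crontab -e

# Add:
*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/sync-certificates.sh >> /var/log/npmplus-cert-sync.log 2>&1
```

---

### Phase 3: Set Up Keepalived for Virtual IP

**Note**: 192.168.11.166 is currently the primary container's own eth0 address (see Current NPMplus Details). Two interfaces on the same VLAN cannot hold the same IP, so the primary container needs a different address before Keepalived claims 192.168.11.166 as the virtual IP. This guide does not yet cover that change.

#### Step 3.1: Install Keepalived on Proxmox Hosts

```bash
# On both primary and secondary Proxmox hosts
apt update
apt install -y keepalived
```

#### Step 3.2: Configure Keepalived on Primary Host (r630-01)

**File**: `/etc/keepalived/keepalived.conf`

```bash
vrrp_script chk_npmplus {
    script "/usr/local/bin/check-npmplus-health.sh"
    interval 5
    # Must exceed the priority gap (110 - 100) so a failed check
    # drops this node below its peer; with -10 the priorities would tie
    weight -20
    fall 2
    rise 2
}

vrrp_instance VI_NPMPLUS {
    state MASTER
    interface vmbr0
    virtual_router_id 51
    priority 110
    advert_int 1

    authentication {
        auth_type PASS
        # Keepalived uses only the first 8 characters of auth_pass
        auth_pass npmplus_ha_2024
    }

    virtual_ipaddress {
        192.168.11.166/24
    }

    track_script {
        chk_npmplus
    }

    notify_master "/usr/local/bin/keepalived-notify.sh master"
    notify_backup "/usr/local/bin/keepalived-notify.sh backup"
    notify_fault "/usr/local/bin/keepalived-notify.sh fault"
}
```

#### Step 3.3: Configure Keepalived on Secondary Host (r630-02)

**File**: `/etc/keepalived/keepalived.conf`

```bash
vrrp_script chk_npmplus {
    script "/usr/local/bin/check-npmplus-health.sh"
    interval 5
    # Keep the same weight as on the primary
    weight -20
    fall 2
    rise 2
}

vrrp_instance VI_NPMPLUS {
    state BACKUP
    interface vmbr0
    virtual_router_id 51
    priority 100
    advert_int 1

    authentication {
        auth_type PASS
        # Keepalived uses only the first 8 characters of auth_pass
        auth_pass npmplus_ha_2024
    }

    virtual_ipaddress {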
        192.168.11.166/24
    }

    track_script {
        chk_npmplus
    }

    notify_master "/usr/local/bin/keepalived-notify.sh master"
    notify_backup "/usr/local/bin/keepalived-notify.sh backup"
    notify_fault "/usr/local/bin/keepalived-notify.sh fault"
}
```

#### Step 3.4: Create Health Check Script

**File**: `/usr/local/bin/check-npmplus-health.sh` (on both hosts)

```bash
#!/bin/bash
# Check NPMplus health and return 0 if healthy, 1 if unhealthy

PRIMARY_HOST="192.168.11.11"
PRIMARY_VMID="10233"
SECONDARY_HOST="192.168.11.12"
SECONDARY_VMID="10234"

# Pick the local container based on which Proxmox host this runs on
HOSTNAME=$(hostname)
if [ "$HOSTNAME" = "r630-01" ]; then
    VMID=$PRIMARY_VMID
elif [ "$HOSTNAME" = "r630-02" ]; then
    VMID=$SECONDARY_VMID
else
    exit 1
fi

# Check if container is running
if ! pct status "$VMID" 2>/dev/null | grep -q "running"; then
    exit 1
fi

# Check if the NPMplus Docker container is up and not reporting "unhealthy"
# (a plain match on "healthy" would also match "unhealthy")
STATUS=$(pct exec "$VMID" -- docker ps --filter "name=npmplus" --format "{{.Status}}" 2>/dev/null)
if ! echo "$STATUS" | grep -q "^Up" || echo "$STATUS" | grep -q "unhealthy"; then
    exit 1
fi

# Check if NPMplus web interface responds
if ! pct exec "$VMID" -- curl -s -k -f -o /dev/null --max-time 5 https://localhost:81 >/dev/null 2>&1; then
    exit 1
fi

# All checks passed
exit 0
```

**Make executable**:

```bash
chmod +x /usr/local/bin/check-npmplus-health.sh
```

#### Step 3.5: Create Notification Script

**File**: `/usr/local/bin/keepalived-notify.sh` (on both hosts)

```bash
#!/bin/bash
# Handle Keepalived state changes

STATE=$1
LOGFILE="/var/log/keepalived-notify.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')

case "$STATE" in
    "master")
        echo "[$TIMESTAMP] Transitioned to MASTER - This node now owns VIP 192.168.11.166" >> "$LOGFILE"
        # Optionally: Start services, send alerts, etc.
        ;;
    "backup")
        echo "[$TIMESTAMP] Transitioned to BACKUP - Standby mode" >> "$LOGFILE"
        ;;
    "fault")
        echo "[$TIMESTAMP] Transitioned to FAULT - Health check failed" >> "$LOGFILE"
        # Optionally: Send critical alerts
        ;;
esac
```

**Make executable**:

```bash
chmod +x /usr/local/bin/keepalived-notify.sh
```

#### Step 3.6: Start Keepalived

```bash
# On both hosts
systemctl enable keepalived
systemctl start keepalived

# Verify status
systemctl status keepalived
ip addr show vmbr0 | grep 192.168.11.166
```

---

### Phase 4: Sync Configuration to Secondary

#### Step 4.1: Export Primary Configuration

**Script**: `scripts/npmplus/export-primary-config.sh`

```bash
#!/bin/bash
# Export primary NPMplus configuration

PRIMARY_HOST="192.168.11.11"
PRIMARY_VMID="10233"
BACKUP_DIR="/tmp/npmplus-config-backup-$(date +%Y%m%d_%H%M%S)"

mkdir -p "$BACKUP_DIR"

# Export database
ssh root@$PRIMARY_HOST "pct exec $PRIMARY_VMID -- docker exec npmplus sqlite3 /data/database.sqlite '.dump'" > "$BACKUP_DIR/database.sql"

# Export proxy hosts via API (if available)
NPM_URL="https://192.168.11.166:81"
NPM_EMAIL="nsatoshi2007@hotmail.com"
NPM_PASSWORD="${NPM_PASSWORD:-your-password}"  # Set in the environment or update from .env

TOKEN_RESPONSE=$(curl -s -k -X POST "$NPM_URL/api/tokens" \
  -H "Content-Type: application/json" \
  -d "{\"identity\":\"$NPM_EMAIL\",\"secret\":\"$NPM_PASSWORD\"}")

TOKEN=$(echo "$TOKEN_RESPONSE" | jq -r '.token')

# Stop here if authentication failed; the API exports below need a valid token
if [ -z "$TOKEN" ] || [ "$TOKEN" = "null" ]; then
    echo "ERROR: Authentication failed"
    exit 1
fi

curl -s -k -X GET "$NPM_URL/api/nginx/proxy-hosts" \
  -H "Authorization: Bearer $TOKEN" | jq '.' > "$BACKUP_DIR/proxy_hosts.json"

curl -s -k -X GET "$NPM_URL/api/nginx/certificates" \
  -H "Authorization: Bearer $TOKEN" | jq '.' > "$BACKUP_DIR/certificates.json"

echo "Configuration exported to $BACKUP_DIR"
```

#### Step 4.2: Import Configuration to Secondary

**Script**: `scripts/npmplus/import-secondary-config.sh`

```bash
#!/bin/bash
# Import configuration to secondary NPMplus

SECONDARY_HOST="192.168.11.12"
SECONDARY_VMID="10234"
BACKUP_DIR="${1:-}"  # Path to backup directory from Step 4.1

if [ -z "$BACKUP_DIR" ] || [ ! -d "$BACKUP_DIR" ]; then
    echo "Usage: $0 BACKUP_DIR"
    exit 1
fi

# Copy the database dump to the Proxmox host, then into the container
scp "$BACKUP_DIR/database.sql" root@$SECONDARY_HOST:/tmp/npmplus-database.sql
ssh root@$SECONDARY_HOST "pct push $SECONDARY_VMID /tmp/npmplus-database.sql /tmp/database.sql"

# Import database. The npmplus container must be running for docker exec to work,
# so the existing database is moved aside first and NPMplus is restarted afterwards.
ssh root@$SECONDARY_HOST "pct exec $SECONDARY_VMID -- bash -c '
    docker exec npmplus mv /data/database.sqlite /data/database.sqlite.bak
    docker exec -i npmplus sqlite3 /data/database.sqlite < /tmp/database.sql
'"

# Restart NPMplus so it loads the imported database
ssh root@$SECONDARY_HOST "pct exec $SECONDARY_VMID -- docker restart npmplus"

# Wait for NPMplus to be ready
sleep 10

echo "Configuration imported to secondary NPMplus"
```

---

### Phase 5: Set Up Configuration Sync (Ongoing)

#### Step 5.1: Create Configuration Sync Script

**Script**: `scripts/npmplus/sync-config.sh`

```bash
#!/bin/bash
# Sync NPMplus configuration from primary to secondary

PRIMARY_HOST="192.168.11.11"
PRIMARY_VMID="10233"
SECONDARY_HOST="192.168.11.12"
SECONDARY_VMID="10234"

NPM_URL="https://192.168.11.166:81"
NPM_EMAIL="nsatoshi2007@hotmail.com"
NPM_PASSWORD="${NPM_PASSWORD:-}"  # From .env

if [ -z "$NPM_PASSWORD" ]; then
    echo "ERROR: NPM_PASSWORD not set"
    exit 1
fi

# Authenticate
TOKEN_RESPONSE=$(curl -s -k -X POST "$NPM_URL/api/tokens" \
  -H "Content-Type: application/json" \
  -d "{\"identity\":\"$NPM_EMAIL\",\"secret\":\"$NPM_PASSWORD\"}")

TOKEN=$(echo "$TOKEN_RESPONSE" | jq -r '.token')

if [ -z "$TOKEN" ] || [ "$TOKEN" = "null" ]; then
    echo "ERROR: Authentication failed"
    exit 1
fi

# Export from primary
curl -s -k -X GET "$NPM_URL/api/nginx/proxy-hosts" \
  -H "Authorization: Bearer $TOKEN" > /tmp/proxy_hosts_primary.json

# Get secondary URL (will be different when not active)
SECONDARY_URL="https://192.168.11.168:81"

# For now, manual sync is required
# In future: implement API-based sync or shared database

echo "Manual configuration sync required"
echo "Export from: $NPM_URL"
echo "Import to: $SECONDARY_URL"
```

**Note**: Full automated configuration sync requires either:

- Shared database (PostgreSQL/MariaDB migration)
- API-based sync script (more complex)
- Manual sync process for configuration changes

**For now**: Configuration changes must be manually replicated to secondary.

---

### Phase 6: Testing and Validation

#### Step 6.1: Test Virtual IP Failover

```bash
# On primary host
ip addr show vmbr0 | grep 192.168.11.166
# Should show: 192.168.11.166

# Simulate primary failure
systemctl stop keepalived

# Wait 5-10 seconds
sleep 10

# Check secondary host
ssh root@192.168.11.12 "ip addr show vmbr0 | grep 192.168.11.166"
# Should now show: 192.168.11.166 (VIP moved to secondary)

# Test connectivity
curl -k https://192.168.11.166:81
# Should connect to secondary NPMplus

# Restore primary
systemctl start keepalived

# Wait for failback
sleep 10
```

#### Step 6.2: Test Certificate Access

```bash
# Verify certificates exist on secondary
ssh root@192.168.11.12 "pct exec 10234 -- ls -la /var/lib/docker/volumes/npmplus_data/_data/tls/certbot/live/"

# Test SSL endpoint
curl -vI https://explorer.d-bis.org
# Should show valid certificate
```

#### Step 6.3: Test Proxy Host Functionality

```bash
# Test each domain from external
for domain in explorer.d-bis.org mim4u.org rpc-http-pub.d-bis.org; do
    echo "Testing $domain..."
    curl -I "https://$domain" 2>&1 | grep -E "HTTP|Server"
done
```

---

## Monitoring and Maintenance

### Health Monitoring

**Script**: `scripts/npmplus/monitor-ha-status.sh`

```bash
#!/bin/bash
# Monitor HA status and send alerts if needed

VIP="192.168.11.166"
PRIMARY_HOST="192.168.11.11"
SECONDARY_HOST="192.168.11.12"

# Check who owns VIP (if/elif avoids the && / || chain reporting both hosts)
if ssh root@$PRIMARY_HOST "ip addr show vmbr0 | grep -qF '$VIP/'"; then
    VIP_OWNER="$PRIMARY_HOST"
elif ssh root@$SECONDARY_HOST "ip addr show vmbr0 | grep -qF '$VIP/'"; then
    VIP_OWNER="$SECONDARY_HOST"
else
    VIP_OWNER="UNKNOWN"
fi

echo "VIP $VIP owner: $VIP_OWNER"

# Check Keepalived status on both hosts
PRIMARY_STATUS=$(ssh root@$PRIMARY_HOST "systemctl is-active keepalived" 2>/dev/null || echo "unknown")
SECONDARY_STATUS=$(ssh root@$SECONDARY_HOST "systemctl is-active keepalived" 2>/dev/null || echo "unknown")

echo "Primary Keepalived: $PRIMARY_STATUS"
echo "Secondary Keepalived: $SECONDARY_STATUS"

# Alert if both are down
if [ "$PRIMARY_STATUS" != "active" ] && [ "$SECONDARY_STATUS" != "active" ]; then
    echo "ALERT: Both Keepalived instances are down!"
    # Send alert (email, webhook, etc.)
fi
```

**Cron Job**:

```bash
*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/monitor-ha-status.sh >> /var/log/npmplus-ha-monitor.log 2>&1
```

---

## Upgrading to Active-Active (Future)

To upgrade from Active-Passive to Active-Active:

### Option A: HAProxy Load Balancer

1. Deploy HAProxy on dedicated VM/container (VMID 10235)
2. Configure HAProxy to balance between both NPMplus instances
3. Update UDM Pro port forwarding to point to HAProxy IP
4. Configure shared storage for certificates
5. Implement shared database (PostgreSQL migration)

### Option B: DNS Round-Robin

1. Assign multiple IPs to NPMplus instances
2. Configure DNS round-robin (not recommended for SSL termination)

---

## Troubleshooting

### Issue: VIP not moving to secondary

**Symptoms**: Primary fails but secondary doesn't take over

**Check**:

```bash
# Check Keepalived logs
journalctl -u keepalived -n 50

# Check health check script
/usr/local/bin/check-npmplus-health.sh
echo $?  # Should return 0 if healthy

# Check firewall (VRRP is IP protocol 112, sent to multicast address 224.0.0.18)
iptables -S | grep -Ei -- '-p (vrrp|112)|224\.0\.0\.18'
```

**Solution**: Ensure VRRP traffic (IP protocol 112, multicast address 224.0.0.18) is allowed between hosts.

---

### Issue: Certificates out of sync

**Symptoms**: Secondary shows certificate errors

**Solution**:

```bash
# Manually sync certificates
bash scripts/npmplus/sync-certificates.sh

# Verify sync (inside the secondary container)
ssh root@192.168.11.12 "pct exec 10234 -- ls -la /var/lib/docker/volumes/npmplus_data/_data/tls/certbot/live/"
```

---

### Issue: Configuration mismatch

**Symptoms**: Proxy hosts work on primary but not secondary

**Solution**:

```bash
# Export from primary
bash scripts/npmplus/export-primary-config.sh

# Import to secondary
bash scripts/npmplus/import-secondary-config.sh /tmp/npmplus-config-backup-*
```

---

## Rollback Plan

If HA setup causes issues:

1. **Disable Keepalived on Secondary**:
   ```bash
   # Both commands must run on the secondary host
   ssh root@192.168.11.12 "systemctl stop keepalived && systemctl disable keepalived"
   ```

2. **Ensure Primary Owns VIP**:
   ```bash
   # On primary host
   systemctl restart keepalived
   ip addr show vmbr0 | grep 192.168.11.166
   ```

3. **Stop Secondary NPMplus** (optional):
   ```bash
   ssh root@192.168.11.12 "pct stop 10234"
   ```

4. **Remove Secondary Container** (if not needed):
   ```bash
   ssh root@192.168.11.12 "pct destroy 10234"
   ```

---

## Cost and Resource Impact

### Additional Resources Required

- **Secondary NPMplus Container**: ~1 GB RAM, 5 GB disk, 2 CPU cores
- **Keepalived**: Minimal overhead (< 10 MB RAM)
- **Network**: VRRP multicast traffic (minimal)
- **Storage**: Certificate sync storage (same as primary)

### Maintenance Overhead

- **Certificate Sync**: Automated (every 5 minutes)
- **Configuration Sync**: Manual (when changes made)
- **Monitoring**: Automated (every 5 minutes)

---

## Next Steps

1. **Review and Approve HA Architecture**
2. **Schedule Maintenance Window** (if required)
3. **Create Secondary NPMplus Instance** (Phase 1)
4. **Set Up Certificate Sync** (Phase 2)
5. **Configure Keepalived** (Phase 3)
6. **Sync Configuration** (Phase 4)
7. **Test Failover** (Phase 6)
8. **Enable Monitoring** (Monitoring section)

---

## References

- **Keepalived Documentation**: https://www.keepalived.org/manpage.html
- **NPMplus GitHub**: https://github.com/ZoeyVid/NPMplus
- **VRRP Protocol**: RFC 3768
- **Current Architecture**: `docs/04-configuration/DNS_NPMPLUS_VM_COMPREHENSIVE_ARCHITECTURE.md`

---

**Last Updated**: 2026-01-20  
**Status**: Ready for Implementation  
**Estimated Implementation Time**: 4-6 hours