# Verification Scripts and Documentation - Gaps and TODOs
**Last Updated:** 2026-03-02
**Document Version:** 1.0
**Status:** Active Documentation
---
**Date**: 2026-01-20
**Status**: Gap Analysis Complete
**Purpose**: Identify all placeholders, missing components, and incomplete implementations
**Documentation note (2026-03-02):** Runbook placeholders (e.g. `your-token`, `your-password`) are intentional examples. In production, source values from `.env` only; never commit secrets. [INGRESS_VERIFICATION_RUNBOOK.md](INGRESS_VERIFICATION_RUNBOOK.md) has been updated with a production note in its Prerequisites section. The other runbooks (NPMPLUS_BACKUP_RESTORE, SANKOFA_CUTOVER_PLAN) keep their example placeholders; operators should source real values from `.env` when running commands.
---
## Critical Missing Components
### 1. Missing Script: `scripts/verify/backup-npmplus.sh`
**Status**: ✅ **CREATED** (scripts/verify/backup-npmplus.sh)
**Referenced in**:
- `docs/04-configuration/NPMPLUS_BACKUP_RESTORE.md` (lines 39, 150, 437, 480)
**Required Functionality**:
- Automated backup of NPMplus database (`/data/database.sqlite`)
- Export of proxy hosts via API
- Export of certificates via API
- Certificate file backup from disk
- Compression and timestamping
- Configurable backup destination
**Action Required**: None - the script has been created. Keep it aligned with the backup procedures documented in `NPMPLUS_BACKUP_RESTORE.md`.
---
## Placeholders and TBD Values
### 2. Nginx Config Paths - TBD Values
**Location**: `scripts/verify/verify-backend-vms.sh`
**Status**: ✅ **RESOLVED** - Paths set in scripts/verify/verify-backend-vms.sh:
- VMID 10130: `/etc/nginx/sites-available/dbis-frontend`
- VMID 2400: `/etc/nginx/sites-available/thirdweb-rpc`
**Required Actions** (if paths differ on actual VMs):
1. **VMID 10130 (dbis-frontend)**:
- Determine actual nginx config path
- Common locations: `/etc/nginx/sites-available/dbis-frontend` or `/etc/nginx/sites-available/dbis-admin`
- Update script with actual path
- Verify config exists and is enabled
2. **VMID 2400 (thirdweb-rpc-1)**:
- Determine actual nginx config path
- Common locations: `/etc/nginx/sites-available/thirdweb-rpc` or `/etc/nginx/sites-available/rpc`
- Update script with actual path
- Verify config exists and is enabled
**Impact**: Script will skip nginx config verification for these VMs until resolved.
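A minimal sketch of how the discovery step could be scripted instead of guessed, assuming the candidate paths listed above; run it on the VM itself (or via `ssh`/`pct exec`):

```shell
# Print the first candidate path that actually exists; return 1 if none do.
first_existing() {
  for f in "$@"; do
    if [ -e "$f" ]; then
      echo "$f"
      return 0
    fi
  done
  return 1
}

# Example for VMID 10130:
# first_existing /etc/nginx/sites-available/dbis-frontend /etc/nginx/sites-available/dbis-admin
```

The resolved path can then be pasted into `verify-backend-vms.sh`.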
---
### 3. Sankofa Cutover Plan - Target Placeholders
**Location**: `docs/04-configuration/SANKOFA_CUTOVER_PLAN.md`
**Placeholders to Replace** (once Sankofa services are deployed):
- `<TARGET_IP>` (appears 10 times)
- `<TARGET_PORT>` (appears 10 times)
- `⚠️ TBD` values in table (lines 60-64)
**Domain-Specific Targets Needed**:
| Domain | Current (Wrong) | Target (TBD) |
|--------|----------------|--------------|
| `sankofa.nexus` | 192.168.11.140:80 | `<TARGET_IP>:<TARGET_PORT>` |
| `www.sankofa.nexus` | 192.168.11.140:80 | `<TARGET_IP>:<TARGET_PORT>` |
| `phoenix.sankofa.nexus` | 192.168.11.140:80 | `<TARGET_IP>:<TARGET_PORT>` |
| `www.phoenix.sankofa.nexus` | 192.168.11.140:80 | `<TARGET_IP>:<TARGET_PORT>` |
| `the-order.sankofa.nexus` | 192.168.11.140:80 | `<TARGET_IP>:<TARGET_PORT>` |
**Action Required**: Update placeholders with actual Sankofa service IPs and ports once deployed.
---
## Documentation Placeholders
### 4. Generic Placeholders in Runbooks
**Location**: Multiple files
**Replacements Needed**:
#### `INGRESS_VERIFICATION_RUNBOOK.md`:
- Line 23: `CLOUDFLARE_API_TOKEN="your-token"` → Should reference `.env` file
- Line 25: `CLOUDFLARE_EMAIL="your-email"` → Should reference `.env` file
- Line 26: `CLOUDFLARE_API_KEY="your-key"` → Should reference `.env` file
- Line 31: `NPM_PASSWORD="your-password"` → Should reference `.env` file
- Lines 91, 101, 213: Similar placeholders in examples
**Note**: These are intentional examples, but should be clearly marked as such and reference `.env` file usage.
#### `NPMPLUS_BACKUP_RESTORE.md`:
- Line 84: `NPM_PASSWORD="your-password"` → Example placeholder (acceptable)
- Line 304: `NPM_PASSWORD="your-password"` → Example placeholder (acceptable)
#### `SANKOFA_CUTOVER_PLAN.md`:
- Line 125: `NPM_PASSWORD="your-password"` → Example placeholder (acceptable)
- Line 178: `NPM_PASSWORD="your-password"` → Example placeholder (acceptable)
**Status (2026-03-02):** Addressed. INGRESS_VERIFICATION_RUNBOOK.md now includes a production note in Prerequisites. VERIFICATION_GAPS_AND_TODOS documents that runbooks use example placeholders and production should source from .env.
---
### 5. Source of Truth JSON - Verifier Field
**Location**: `docs/04-configuration/INGRESS_SOURCE_OF_TRUTH.json` (line 5)
**Current**: `"verifier": "operator-name"`
**Expected**: Should be dynamically set by script using `$USER` or actual operator name.
**Status**: ✅ **HANDLED** - The `generate-source-of-truth.sh` script uses `env.USER // "unknown"` which is correct. The example JSON file is just a template.
**Action Required**: None - script implementation is correct.
---
## Implementation Gaps
### 6. Source of Truth Generation - File Path Dependencies
**Location**: `scripts/verify/generate-source-of-truth.sh`
**Potential Issues**:
- Script expects specific output file names from verification scripts
- If verification scripts don't run first, JSON will be empty or have defaults
- No validation that source files exist before parsing
**Expected File Dependencies**:
```bash
$EVIDENCE_DIR/dns-verification-*/all_dns_records.json
$EVIDENCE_DIR/udm-pro-verification-*/verification_results.json
$EVIDENCE_DIR/npmplus-verification-*/proxy_hosts.json
$EVIDENCE_DIR/npmplus-verification-*/certificates.json
$EVIDENCE_DIR/backend-vms-verification-*/all_vms_verification.json
$EVIDENCE_DIR/e2e-verification-*/all_e2e_results.json
```
**Action Required**:
- Add file existence checks before parsing
- Provide clear error messages if dependencies are missing
- Add option to generate partial source-of-truth if some verifications haven't run
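The existence check could look like the following sketch; the `--partial` flag mentioned in the error hint is hypothetical and would need to be implemented alongside it:

```shell
# Fail loudly when an expected evidence file is missing or empty, instead of
# letting the JSON parsing silently produce defaults.
require_file() {
  local f="$1" producer="$2"
  if [ ! -s "$f" ]; then
    echo "ERROR: missing or empty dependency: $f" >&2
    echo "  Run $producer first, or pass --partial to skip this section." >&2
    return 1
  fi
}

# Example usage inside generate-source-of-truth.sh:
# require_file "$EVIDENCE_DIR/npmplus-verification-latest/proxy_hosts.json" export-npmplus-config.sh
```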
---
### 7. Backend VM Verification - Service-Specific Checks
**Location**: `scripts/verify/verify-backend-vms.sh`
**Gaps Identified**:
1. **Besu RPC VMs (2101, 2201)**:
- Script checks for RPC endpoints but doesn't verify Besu-specific health checks
- Should test actual RPC calls (e.g., `eth_chainId`) not just HTTP status
- WebSocket port (8546) verification is minimal
2. **Node.js API VMs (10150, 10151)**:
- Only checks port 3000 is listening
- Doesn't verify API health endpoint exists
- Should test actual API endpoint (e.g., `/health` or `/api/health`)
3. **Blockscout VM (5000)**:
- Checks nginx on port 80 and Blockscout on port 4000
- Should verify Blockscout API is responding (e.g., `/api/health`)
**Action Required**:
- Add service-specific health check functions
- Implement actual RPC/API endpoint testing beyond port checks
- Document expected health check endpoints per service type
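A hedged sketch of the Besu-specific check: issue a real `eth_chainId` call rather than a bare port test. The JSON check is split into a small helper so it can be exercised without a live node:

```shell
# Return 0 if a JSON-RPC response body contains a "result" field.
rpc_has_result() {
  case "$1" in
    *'"result"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Call eth_chainId against a Besu RPC endpoint; fails on timeout or error response.
check_besu_rpc() {
  local url="$1" resp
  resp=$(curl -s -m 5 -X POST -H 'Content-Type: application/json' \
    --data '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' "$url") || return 1
  rpc_has_result "$resp"
}

# check_besu_rpc "http://<vm-ip>:8545"   # substitute the VM's RPC address
```

The same pattern extends to the Node.js `/health` endpoints and the Blockscout API with a plain `curl -sf` call.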
---
### 8. End-to-End Routing - WebSocket Testing
**Location**: `scripts/verify/verify-end-to-end-routing.sh`
**Current Implementation**:
- Basic WebSocket connectivity check using TCP connection test
- Manual `wscat` test recommended but not automated
- No actual WebSocket handshake or message exchange verification
**Gap**:
- WebSocket tests are minimal (just TCP connection)
- No verification that WebSocket protocol upgrade works correctly
- No test of actual RPC WebSocket messages
**Action Required**:
- Add automated WebSocket handshake test (if `wscat` is available)
- Or add clear documentation that WebSocket testing requires manual verification
- Consider adding automated WebSocket test script if `wscat` or `websocat` is installed
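A sketch of the "automate if available" option: detect a WebSocket client and run a handshake test only when one is installed, otherwise report a clear SKIP so the gap is visible in the results:

```shell
# Report which WebSocket client (if any) is available.
have_ws_client() {
  if command -v websocat >/dev/null 2>&1; then echo websocat
  elif command -v wscat >/dev/null 2>&1; then echo wscat
  else echo none
  fi
}

# Attempt an actual handshake + one RPC message when websocat is present.
ws_handshake_test() {
  local url="$1"
  case "$(have_ws_client)" in
    websocat) printf '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}\n' \
                | timeout 10 websocat "$url" ;;
    wscat)    echo "wscat found: run 'wscat -c $url' manually" ;;
    none)     echo "SKIP: no WebSocket client installed; verify $url manually" >&2; return 2 ;;
  esac
}
```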
---
## Configuration Gaps
### 9. Environment Variable Documentation
**Missing**: Comprehensive `.env.example` file listing all required variables
**Required Variables** (from scripts):
```bash
# Cloudflare
CLOUDFLARE_API_TOKEN=
CLOUDFLARE_EMAIL=
CLOUDFLARE_API_KEY=
CLOUDFLARE_ZONE_ID_D_BIS_ORG=
CLOUDFLARE_ZONE_ID_MIM4U_ORG=
CLOUDFLARE_ZONE_ID_SANKOFA_NEXUS=
CLOUDFLARE_ZONE_ID_DEFI_ORACLE_IO=
# Public IP
PUBLIC_IP=76.53.10.36
# NPMplus
NPM_URL=https://192.168.11.166:81
NPM_EMAIL=nsatoshi2007@hotmail.com
NPM_PASSWORD=
NPM_PROXMOX_HOST=192.168.11.11
NPM_VMID=10233
# Proxmox Hosts (for testing)
PROXMOX_HOST_FOR_TEST=192.168.11.11
```
**Action Required**: Create `.env.example` file in project root with all required variables.
---
### 10. Script Dependencies Documentation
**Missing**: List of required system dependencies
**Required Tools** (used across scripts):
- `bash` (4.0+)
- `curl` (for API calls)
- `jq` (for JSON parsing)
- `dig` (for DNS resolution)
- `openssl` (for SSL certificate inspection)
- `ssh` (for remote execution)
- `ss` (for port checking)
- `systemctl` (for service status)
- `sqlite3` (for database backup)
**Optional Tools**:
- `wscat` or `websocat` (for WebSocket testing)
**Action Required**:
- Add dependencies section to `INGRESS_VERIFICATION_RUNBOOK.md`
- Create `scripts/verify/README.md` with installation instructions
- Add dependency check function to `run-full-verification.sh`
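The dependency check function could be as small as the following sketch, run once at the top of `run-full-verification.sh`:

```shell
# Verify each named tool is on PATH; list everything missing in one message.
check_deps() {
  local missing=""
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  if [ -n "$missing" ]; then
    echo "ERROR: missing required tools:$missing" >&2
    return 1
  fi
}

# check_deps bash curl jq dig openssl ssh ss sqlite3
```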
---
## Data Completeness Gaps
### 11. Source of Truth JSON - Hardcoded Values
**Location**: `scripts/verify/generate-source-of-truth.sh` (lines 169-177)
**Current**: NPMplus container info is hardcoded:
```json
  "container": {
    "vmid": 10233,
    "host": "r630-01",
    "host_ip": "192.168.11.11",
    "internal_ips": {
      "eth0": "192.168.11.166",
      "eth1": "192.168.11.167"
    },
    "management_ui": "https://192.168.11.166:81",
    "status": "running"
  }
```
**Gap**: Status should be dynamically determined from verification results.
**Action Required**:
- Make container status dynamic based on `export-npmplus-config.sh` results
- Verify IP addresses are correct (especially `eth1`)
- Document if `eth1` is actually used or is a placeholder
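A sketch of the dynamic-status idea: parse `pct status <vmid>` output instead of hardcoding `"running"`. The parsing is split out so it is testable offline:

```shell
# `pct status 10233` prints a line like "status: running"; extract the value.
parse_pct_status() {
  awk '/^status:/ {print $2}'
}

# Usage on the Proxmox host, feeding the real command:
# STATUS=$(ssh root@192.168.11.11 "pct status 10233" | parse_pct_status)
```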
---
### 12. DNS Verification - Zone ID Lookup
**Location**: `scripts/verify/export-cloudflare-dns-records.sh`
**Current**: Attempts to fetch zone IDs if not provided in `.env`, but has fallback to empty string.
**Potential Issue**: If zone ID lookup fails and `.env` doesn't have zone IDs, script will fail silently or skip zones.
**Action Required**:
- Add validation that zone IDs are set (either from `.env` or from API lookup)
- Fail clearly if zone ID cannot be determined
- Provide helpful error message with instructions
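A sketch of the clear-failure behaviour: refuse to continue with an empty zone ID instead of silently skipping the zone:

```shell
# Fail with an actionable message when a zone ID is empty.
require_zone_id() {
  local var_name="$1" value="$2" zone="$3"
  if [ -z "$value" ]; then
    echo "ERROR: $var_name is unset and API lookup failed for zone $zone" >&2
    echo "  Set it in .env, or check the Cloudflare token's Zone:Read permission." >&2
    return 1
  fi
}

# require_zone_id CLOUDFLARE_ZONE_ID_D_BIS_ORG "$CLOUDFLARE_ZONE_ID_D_BIS_ORG" d-bis.org
```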
---
## Documentation Completeness
### 13. Missing Troubleshooting Sections
**Location**: `docs/04-configuration/INGRESS_VERIFICATION_RUNBOOK.md`
**Current**: Basic troubleshooting section exists (lines 427-468) but could be expanded.
**Missing Topics**:
- What to do if verification scripts fail partially
- How to interpret "unknown" status vs "needs-fix" status
- How to manually verify items that scripts can't automate
- Common Cloudflare API errors and solutions
- Common NPMplus API authentication issues
- SSH connection failures to Proxmox hosts
**Action Required**: Expand troubleshooting section with more scenarios.
---
### 14. Missing Rollback Procedures
**Location**: `docs/04-configuration/SANKOFA_CUTOVER_PLAN.md`
**Current**: Basic rollback steps exist (lines 330-342) but could be more detailed.
**Missing**:
- Automated rollback script reference
- Exact commands to restore previous NPMplus configuration
- How to verify rollback was successful
- Recovery time expectations
**Action Required**:
- Create `scripts/verify/rollback-sankofa-routing.sh` (optional but recommended)
- Or expand manual rollback steps with exact API calls
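A hedged sketch of what the "exact API calls" option could look like, assuming the standard NPM proxy-hosts endpoint; the host IDs and the Bearer token must come from a prior export and login, and the payload builder is the testable piece:

```shell
# Build the JSON body that reverts a proxy host to the previous target.
rollback_payload() {
  printf '{"forward_host":"%s","forward_port":%s}' "$1" "$2"
}

# curl -sk -X PUT "$NPM_URL/api/nginx/proxy-hosts/$HOST_ID" \
#   -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' \
#   -d "$(rollback_payload 192.168.11.140 80)"
```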
---
## Priority Summary
### 🔴 Critical (Must Fix Before Production Use)
1. **Create `scripts/verify/backup-npmplus.sh`** - ✅ Done (script created; see item 1)
2. **Resolve TBD nginx config paths** (VMID 10130, 2400) - ✅ Paths set in script; confirm on the actual VMs
3. **Add file dependency validation** in `generate-source-of-truth.sh`
### 🟡 Important (Should Fix Soon)
4. **Add `.env.example` file** with all required variables
5. **Add dependency checks** to verification scripts
6. **Expand service-specific health checks** for Besu, Node.js, Blockscout
7. **Document WebSocket testing limitations** or automate it
### 🟢 Nice to Have (Can Wait)
8. **Expand troubleshooting section** with more scenarios
9. **Create rollback script** for Sankofa cutover
10. **Add dependency installation guide** to runbook
11. **Make container status dynamic** in source-of-truth generation
---
## Notes
- **Placeholders in examples**: Most "your-password", "your-token" placeholders in documentation are intentional examples and acceptable, but should clearly reference `.env` file usage.
- **Sankofa placeholders**: `<TARGET_IP>` and `<TARGET_PORT>` are expected placeholders until Sankofa services are deployed. These should be updated during cutover.
- **TBD config paths**: These need to be discovered by running verification and inspecting actual VMs.
---
## Additional Items Completed
### 15. NPMplus High Availability (HA) Setup Guide ✅ ADDED
**Status**: ✅ **DOCUMENTATION COMPLETE** - Implementation pending
**Location**: `docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md`
**What Was Added**:
- Complete HA architecture guide (Active-Passive with Keepalived)
- Step-by-step implementation instructions (6 phases)
- Helper scripts: `sync-certificates.sh`, `monitor-ha-status.sh`
- Testing and validation procedures
- Troubleshooting guide
- Rollback plan
- Future upgrade path to Active-Active
**Scripts Created**:
- `scripts/npmplus/sync-certificates.sh` - Synchronize certificates from primary to secondary
- `scripts/npmplus/monitor-ha-status.sh` - Monitor HA status and send alerts
**Impact**: Eliminates single point of failure for NPMplus, enables automatic failover.
---
## NPMplus HA Implementation Tasks
### Phase 1: Prepare Secondary NPMplus Instance
#### Task 1.1: Create Secondary NPMplus Container
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 30 minutes
**Actions Required**:
- [ ] Download Alpine 3.22 template on r630-02
- [ ] Create container VMID 10234 with:
- Hostname: `npmplus-secondary`
- IP: `192.168.11.167/24`
- Memory: 1024 MB
- Cores: 2
- Disk: 5 GB
- Features: nesting=1, unprivileged=1
- [ ] Start container and verify it's running
- [ ] Document container creation in deployment log
**Commands**:
```bash
# On r630-02
CTID=10234
HOSTNAME="npmplus-secondary"
IP="192.168.11.167"
BRIDGE="vmbr0"
pveam download local alpine-3.22-default_20241208_amd64.tar.xz
pct create $CTID \
local:vztmpl/alpine-3.22-default_20241208_amd64.tar.xz \
--hostname $HOSTNAME \
--memory 1024 \
--cores 2 \
--rootfs local-lvm:5 \
--net0 name=eth0,bridge=$BRIDGE,ip=$IP/24,gw=192.168.11.1 \
--unprivileged 1 \
--features nesting=1
pct start $CTID
```
---
#### Task 1.2: Install NPMplus on Secondary Instance
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 45 minutes
**Actions Required**:
- [ ] SSH to r630-02 and enter container
- [ ] Install dependencies: `tzdata`, `gawk`, `yq`, `docker`, `docker-compose`, `curl`, `bash`, `rsync`
- [ ] Start and enable Docker service
- [ ] Download NPMplus compose.yaml from GitHub
- [ ] Configure timezone: `America/New_York`
- [ ] Configure ACME email: `nsatoshi2007@hotmail.com`
- [ ] Start NPMplus container (but don't configure yet - will sync first)
- [ ] Wait for NPMplus to be healthy
- [ ] Retrieve admin password and document it
**Commands**:
```bash
ssh root@192.168.11.12
pct exec 10234 -- ash
apk update
apk add --no-cache tzdata gawk yq docker docker-compose curl bash rsync
rc-service docker start
rc-update add docker default
sleep 5
cd /opt
curl -fsSL "https://raw.githubusercontent.com/ZoeyVid/NPMplus/refs/heads/develop/compose.yaml" -o compose.yaml
TZ="America/New_York"
ACME_EMAIL="nsatoshi2007@hotmail.com"
# Drop any existing TZ=/ACME_EMAIL= entries, then append ours.
# (A literal `. != "TZ=*"` comparison would never match real entries like
# "TZ=Etc/UTC", so stale values would survive; use a regex test instead.)
yq -i "
  .services.npmplus.environment |=
    (map(select(test(\"^(TZ|ACME_EMAIL)=\") | not)) +
     [\"TZ=$TZ\", \"ACME_EMAIL=$ACME_EMAIL\"])
" compose.yaml
docker compose up -d
```
---
#### Task 1.3: Configure Secondary Container Network
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 10 minutes
**Actions Required**:
- [ ] Verify static IP assignment: `192.168.11.167`
- [ ] Verify gateway: `192.168.11.1`
- [ ] Test network connectivity to primary host
- [ ] Test network connectivity to backend VMs
- [ ] Document network configuration
**Commands**:
```bash
pct exec 10234 -- ip addr show eth0
pct exec 10234 -- ping -c 3 192.168.11.11
pct exec 10234 -- ping -c 3 192.168.11.166
```
---
### Phase 2: Set Up Certificate Synchronization
#### Task 2.1: Create Certificate Sync Script
**Status**: ✅ **COMPLETED**
**Location**: `scripts/npmplus/sync-certificates.sh`
**Note**: Script already created, needs testing
**Actions Required**:
- [ ] Test certificate sync script manually
- [ ] Verify certificates sync correctly
- [ ] Verify script handles errors gracefully
- [ ] Document certificate paths for both primary and secondary
---
#### Task 2.2: Set Up Automated Certificate Sync
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 15 minutes
**Actions Required**:
- [ ] Add cron job on primary Proxmox host (r630-01)
- [ ] Configure to run every 5 minutes
- [ ] Set up log rotation for `/var/log/npmplus-cert-sync.log`
- [ ] Test cron job execution
- [ ] Monitor logs for successful syncs
- [ ] Verify certificate count matches between primary and secondary
**Commands**:
```bash
# On r630-01
crontab -e
# Add:
*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/sync-certificates.sh >> /var/log/npmplus-cert-sync.log 2>&1
# Test manually first
bash /home/intlc/projects/proxmox/scripts/npmplus/sync-certificates.sh
```
---
### Phase 3: Set Up Keepalived for Virtual IP
#### Task 3.1: Install Keepalived on Proxmox Hosts
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 10 minutes
**Actions Required**:
- [ ] Install Keepalived on r630-01 (primary)
- [ ] Install Keepalived on r630-02 (secondary)
- [ ] Verify Keepalived installation
- [ ] Check firewall rules for VRRP (multicast 224.0.0.0/8)
**Commands**:
```bash
# On both hosts
apt update
apt install -y keepalived
# Verify installation
keepalived --version
```
---
#### Task 3.2: Configure Keepalived on Primary Host (r630-01)
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 20 minutes
**Actions Required**:
- [ ] Create `/etc/keepalived/keepalived.conf` with MASTER configuration
- [ ] Set virtual_router_id: 51
- [ ] Set priority: 110
- [ ] Configure auth_pass (use secure password)
- [ ] Configure virtual_ipaddress: 192.168.11.166/24
- [ ] Reference health check script path
- [ ] Reference notification script path
- [ ] Verify configuration syntax
- [ ] Document Keepalived configuration
**Files to Create**:
- `/etc/keepalived/keepalived.conf` (see HA guide for full config)
- `/usr/local/bin/check-npmplus-health.sh` (Task 3.4)
- `/usr/local/bin/keepalived-notify.sh` (Task 3.5)
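A sketch of the MASTER configuration derived from the values in the checklist above; the HA guide holds the authoritative version, and `auth_pass` is a placeholder to replace:

```
# /etc/keepalived/keepalived.conf (r630-01, MASTER sketch)
vrrp_script check_npmplus {
    script "/usr/local/bin/check-npmplus-health.sh"
    interval 5
    fall 2
    rise 2
}

vrrp_instance NPMPLUS_VIP {
    state MASTER
    interface vmbr0
    virtual_router_id 51
    priority 110
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass CHANGE_ME
    }
    virtual_ipaddress {
        192.168.11.166/24
    }
    track_script {
        check_npmplus
    }
    notify /usr/local/bin/keepalived-notify.sh
}
```

The BACKUP configuration (Task 3.3) differs only in `state BACKUP` and `priority 100`.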
---
#### Task 3.3: Configure Keepalived on Secondary Host (r630-02)
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 20 minutes
**Actions Required**:
- [ ] Create `/etc/keepalived/keepalived.conf` with BACKUP configuration
- [ ] Set virtual_router_id: 51 (must match primary)
- [ ] Set priority: 100 (lower than primary)
- [ ] Configure auth_pass (must match primary)
- [ ] Configure virtual_ipaddress: 192.168.11.166/24
- [ ] Reference health check script path
- [ ] Reference notification script path
- [ ] Verify configuration syntax
- [ ] Document Keepalived configuration
**Files to Create**:
- `/etc/keepalived/keepalived.conf` (see HA guide for full config)
- `/usr/local/bin/check-npmplus-health.sh` (Task 3.4)
- `/usr/local/bin/keepalived-notify.sh` (Task 3.5)
---
#### Task 3.4: Create Health Check Script
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 30 minutes
**Actions Required**:
- [ ] Create `/usr/local/bin/check-npmplus-health.sh` on both hosts
- [ ] Script should:
- Detect hostname to determine which VMID to check
- Check if container is running
- Check if NPMplus Docker container is healthy
- Check if NPMplus web interface responds (port 81)
- Return exit code 0 if healthy, 1 if unhealthy
- [ ] Make script executable: `chmod +x`
- [ ] Test script manually on both hosts
- [ ] Verify script detects failures correctly
**File**: `/usr/local/bin/check-npmplus-health.sh`
**Details**: See HA guide for full script content
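A sketch of the health check, assuming VMIDs 10233 (r630-01) and 10234 (r630-02) and that the Docker container is named `npmplus` after the compose service; the HA guide holds the full version:

```shell
#!/usr/bin/env bash
# Map the Proxmox hostname to the local NPMplus container VMID.
vmid_for_host() {
  case "$1" in
    r630-01) echo 10233 ;;
    r630-02) echo 10234 ;;
    *) return 1 ;;
  esac
}

# Exit 0 if the LXC is running, the Docker container is healthy, and port 81 answers.
check_health() {
  local vmid
  vmid=$(vmid_for_host "$(hostname)") || return 1
  pct status "$vmid" | grep -q running || return 1
  pct exec "$vmid" -- docker inspect -f '{{.State.Health.Status}}' npmplus 2>/dev/null \
    | grep -q healthy || return 1
  pct exec "$vmid" -- curl -skf -o /dev/null https://127.0.0.1:81 || return 1
}

# check_health   # uncomment when installing; keepalived uses the exit code
```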
---
#### Task 3.5: Create Keepalived Notification Script
**Status**: ⏳ **PENDING**
**Priority**: 🟡 **Important**
**Estimated Time**: 15 minutes
**Actions Required**:
- [ ] Create `/usr/local/bin/keepalived-notify.sh` on both hosts
- [ ] Script should handle states: master, backup, fault
- [ ] Log state changes to `/var/log/keepalived-notify.log`
- [ ] Optional: Send alerts (email, webhook) on fault state
- [ ] Make script executable: `chmod +x`
- [ ] Test script with each state manually
**File**: `/usr/local/bin/keepalived-notify.sh`
**Details**: See HA guide for full script content
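A sketch of the notification handler. Keepalived invokes notify scripts as `<TYPE> <NAME> <STATE>`, with STATE one of MASTER/BACKUP/FAULT:

```shell
#!/usr/bin/env bash
# Format a timestamped log line for a state transition; reject unknown states.
format_transition() {
  case "$1" in
    MASTER|BACKUP|FAULT) echo "$(date '+%F %T') state -> $1" ;;
    *) echo "$(date '+%F %T') unexpected state: $1"; return 1 ;;
  esac
}

# format_transition "$3" >> /var/log/keepalived-notify.log
# On FAULT, this is where an email/webhook alert would be added.
```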
---
#### Task 3.6: Start and Enable Keepalived
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 15 minutes
**Actions Required**:
- [ ] Enable Keepalived service on both hosts
- [ ] Start Keepalived on both hosts
- [ ] Verify Keepalived is running
- [ ] Verify primary host owns VIP (192.168.11.166)
- [ ] Verify secondary host is in BACKUP state
- [ ] Monitor Keepalived logs for any errors
- [ ] Document VIP ownership verification
**Commands**:
```bash
# On both hosts
systemctl enable keepalived
systemctl start keepalived
# Verify status
systemctl status keepalived
# Check VIP ownership (should be on primary)
ip addr show vmbr0 | grep 192.168.11.166
# Check logs
journalctl -u keepalived -f
```
---
### Phase 4: Sync Configuration to Secondary
#### Task 4.1: Export Primary Configuration
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 30 minutes
**Actions Required**:
- [ ] Create export script: `scripts/npmplus/export-primary-config.sh`
- [ ] Export NPMplus SQLite database to SQL dump
- [ ] Export proxy hosts via API (JSON)
- [ ] Export certificates via API (JSON)
- [ ] Create timestamped backup directory
- [ ] Verify all exports completed successfully
- [ ] Document backup location and contents
**Script to Create**: `scripts/npmplus/export-primary-config.sh`
**Details**: See HA guide for full script content
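A sketch of the export script's timestamped-backup step; the database path is assumed from this document and may need adjusting to the Docker volume location shown in Task 6.2, and the API exports would follow the same pattern as `backup-npmplus.sh`:

```shell
# Build a timestamped backup directory path under the given root.
backup_dir_for() {
  printf '%s/npmplus-%s\n' "$1" "$(date +%Y%m%d-%H%M%S)"
}

# BACKUP_DIR=$(backup_dir_for /root/npmplus-backups)
# mkdir -p "$BACKUP_DIR"
# pct exec 10233 -- sqlite3 /data/database.sqlite ".backup /tmp/database.bak" \
#   && pct pull 10233 /tmp/database.bak "$BACKUP_DIR/database.sqlite"
```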
---
#### Task 4.2: Import Configuration to Secondary
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 45 minutes
**Actions Required**:
- [ ] Create import script: `scripts/npmplus/import-secondary-config.sh`
- [ ] Stop NPMplus container on secondary (if running)
- [ ] Copy database SQL dump to secondary
- [ ] Import database dump into secondary NPMplus
- [ ] Restart NPMplus container on secondary
- [ ] Wait for NPMplus to be healthy
- [ ] Verify proxy hosts are configured
- [ ] Verify certificates are accessible
- [ ] Document any manual configuration steps needed
**Script to Create**: `scripts/npmplus/import-secondary-config.sh`
**Details**: See HA guide for full script content
**Note**: Some configuration may need manual replication via API or UI.
---
### Phase 5: Set Up Ongoing Configuration Sync
#### Task 5.1: Create Configuration Sync Script
**Status**: ⏳ **PENDING**
**Priority**: 🟡 **Important**
**Estimated Time**: 45 minutes
**Actions Required**:
- [ ] Create sync script: `scripts/npmplus/sync-config.sh`
- [ ] Authenticate to NPMplus API (primary)
- [ ] Export proxy hosts configuration
- [ ] Implement API-based sync or document manual sync process
- [ ] Add script to automation (if automated sync is possible)
- [ ] Document manual sync procedures for configuration changes
**Script to Create**: `scripts/npmplus/sync-config.sh`
**Note**: Full automated sync requires shared database or complex API sync. For now, manual sync may be required.
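A hedged sketch of the API authentication step, assuming the standard NPM token endpoint and the `NPM_*` variables from `.env`; `extract_token` is a jq-free parser split out so it is testable offline:

```shell
# Pull the "token" field out of the login response body.
extract_token() {
  sed -n 's/.*"token":"\([^"]*\)".*/\1/p'
}

# TOKEN=$(curl -sk "$NPM_URL/api/tokens" -H 'Content-Type: application/json' \
#   -d "{\"identity\":\"$NPM_EMAIL\",\"secret\":\"$NPM_PASSWORD\"}" | extract_token)
# curl -sk "$NPM_URL/api/nginx/proxy-hosts" -H "Authorization: Bearer $TOKEN" > proxy_hosts.json
```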
---
### Phase 6: Testing and Validation
#### Task 6.1: Test Virtual IP Failover
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 30 minutes
**Actions Required**:
- [ ] Verify primary owns VIP before test
- [ ] Simulate primary failure (stop Keepalived or NPMplus container)
- [ ] Verify VIP moves to secondary within 5-10 seconds
- [ ] Test connectivity to VIP from external source
- [ ] Restore primary and verify failback
- [ ] Document failover time (should be < 10 seconds)
- [ ] Test multiple failover scenarios
- [ ] Document test results
**Test Scenarios**:
1. Stop Keepalived on primary
2. Stop NPMplus container on primary
3. Stop entire Proxmox host (if possible in test environment)
4. Network partition (if possible in test environment)
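The failover-time measurement can be scripted with a small helper that parses `ip addr` output for the VIP, so a loop can count how many seconds the move takes:

```shell
# Reads `ip addr show vmbr0` output on stdin; arg 1 is the VIP to look for.
owns_vip() {
  grep -qF "inet ${1}/"
}

# Example loop (run from a third machine while the primary is taken down):
# for i in $(seq 1 20); do
#   ssh root@192.168.11.12 "ip addr show vmbr0" | owns_vip 192.168.11.166 && break
#   sleep 1
# done
```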
---
#### Task 6.2: Test Certificate Access
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 30 minutes
**Actions Required**:
- [ ] Verify certificates exist on secondary (after sync)
- [ ] Test SSL endpoint from external: `curl -vI https://explorer.d-bis.org`
- [ ] Verify certificate is valid and trusted
- [ ] Test multiple domains with SSL
- [ ] Verify certificate expiration dates match
- [ ] Test certificate auto-renewal on secondary (when primary renews)
- [ ] Document certificate test results
**Commands**:
```bash
# Verify certificates on secondary
ssh root@192.168.11.12 "pct exec 10234 -- ls -la /var/lib/docker/volumes/npmplus_data/_data/tls/certbot/live/"
# Test SSL endpoint
curl -vI https://explorer.d-bis.org
curl -vI https://mim4u.org
curl -vI https://rpc-http-pub.d-bis.org
```
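For the "expiration dates match" check, the `notAfter` field can be pulled from each live endpoint and compared; `parse_enddate` strips the `notAfter=` prefix and is the testable piece:

```shell
# Strip the "notAfter=" prefix from openssl's -enddate output.
parse_enddate() {
  cut -d= -f2-
}

# Fetch the expiry date of the certificate served for a domain.
cert_expiry() {
  echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null \
    | openssl x509 -noout -enddate | parse_enddate
}

# Compare with the VIP pointing at primary vs secondary in turn:
# cert_expiry explorer.d-bis.org
```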
---
#### Task 6.3: Test Proxy Host Functionality
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 45 minutes
**Actions Required**:
- [ ] Test each domain from external after failover
- [ ] Verify HTTP to HTTPS redirects work
- [ ] Verify WebSocket connections work (for RPC endpoints)
- [ ] Verify API endpoints respond correctly
- [ ] Test all 19+ domains
- [ ] Document any domains that don't work correctly
- [ ] Test with secondary as active instance
- [ ] Test failback to primary
**Test Domains**:
- All d-bis.org domains (9 domains)
- All mim4u.org domains (4 domains)
- All sankofa.nexus domains (5 domains)
- defi-oracle.io domain (1 domain)
---
### Monitoring and Maintenance
#### Task 7.1: Set Up HA Status Monitoring
**Status**: ✅ **COMPLETED** (script created, needs deployment)
**Priority**: 🟡 **Important**
**Location**: `scripts/npmplus/monitor-ha-status.sh`
**Actions Required**:
- [ ] Add cron job for HA status monitoring (every 5 minutes)
- [ ] Configure log rotation for `/var/log/npmplus-ha-monitor.log`
- [ ] Test monitoring script manually
- [ ] Optional: Integrate with alerting system (email, webhook)
- [ ] Document alert thresholds and escalation procedures
- [ ] Test alert generation
**Commands**:
```bash
# On primary Proxmox host
crontab -e
# Add:
*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/monitor-ha-status.sh >> /var/log/npmplus-ha-monitor.log 2>&1
```
---
#### Task 7.2: Document Manual Failover Procedures
**Status**: ⏳ **PENDING**
**Priority**: 🟡 **Important**
**Estimated Time**: 30 minutes
**Actions Required**:
- [ ] Document step-by-step manual failover procedure
- [ ] Document how to force failover to secondary
- [ ] Document how to force failback to primary
- [ ] Document troubleshooting steps for common issues
- [ ] Create runbook for operations team
- [ ] Test manual failover procedures
- [ ] Review and approve documentation
**Location**: Add to `docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md` troubleshooting section
---
#### Task 7.3: Test All Failover Scenarios
**Status**: ⏳ **PENDING**
**Priority**: 🟡 **Important**
**Estimated Time**: 2 hours
**Actions Required**:
- [ ] Test automatic failover (primary failure)
- [ ] Test automatic failback (primary recovery)
- [ ] Test manual failover (force to secondary)
- [ ] Test manual failback (force to primary)
- [ ] Test partial failure (Keepalived down but NPMplus up)
- [ ] Test network partition scenarios
- [ ] Test during high traffic (if possible)
- [ ] Document all test results
- [ ] Identify and fix any issues found
---
## HA Implementation Summary
### Total Estimated Time
- **Phase 1**: 1.5 hours (container creation and NPMplus installation)
- **Phase 2**: 30 minutes (certificate sync setup)
- **Phase 3**: 2 hours (Keepalived configuration and scripts)
- **Phase 4**: 1.5 hours (configuration export/import)
- **Phase 5**: 45 minutes (ongoing sync setup)
- **Phase 6**: 2 hours (testing and validation)
- **Monitoring**: 1 hour (monitoring setup and documentation)
**Total**: ~9 hours of implementation time
### Prerequisites Checklist
- [ ] Secondary Proxmox host available (r630-02 or ml110)
- [ ] Network connectivity between hosts verified
- [ ] Sufficient resources on secondary host (1 GB RAM, 5 GB disk, 2 CPU cores)
- [ ] SSH access configured between hosts (key-based auth recommended)
- [ ] Maintenance window scheduled
- [ ] Backup of primary NPMplus completed
- [ ] Team notified of maintenance window
### Risk Mitigation
- [ ] Rollback plan documented and tested
- [ ] Primary NPMplus backup verified before changes
- [ ] Test environment available (if possible)
- [ ] Monitoring in place before production deployment
- [ ] Emergency contact list available
---
**Last Updated**: 2026-01-20
**Next Review**: After addressing critical items