# Verification Scripts and Documentation - Gaps and TODOs
**Last Updated:** 2026-03-02
**Document Version:** 1.0
**Status:** Active Documentation
---
**Date**: 2026-01-20
**Status**: Gap Analysis Complete
**Purpose**: Identify all placeholders, missing components, and incomplete implementations
**Documentation note (2026-03-02):** Runbook placeholders (e.g. `your-token`, `your-password`) are intentional examples. In production, source values from `.env` only; never commit secrets. [INGRESS_VERIFICATION_RUNBOOK.md](INGRESS_VERIFICATION_RUNBOOK.md) has been updated with a production note in its Prerequisites section. The other runbooks (NPMPLUS_BACKUP_RESTORE, SANKOFA_CUTOVER_PLAN) keep their example placeholders; operators should source real values from `.env` when running commands.
---
## Critical Missing Components
### 1. Missing Script: `scripts/verify/backup-npmplus.sh`
**Status**: ✅ **CREATED** (scripts/verify/backup-npmplus.sh)
**Referenced in**:
- `docs/04-configuration/NPMPLUS_BACKUP_RESTORE.md` (lines 39, 150, 437, 480)
**Required Functionality**:
- Automated backup of NPMplus database (`/data/database.sqlite`)
- Export of proxy hosts via API
- Export of certificates via API
- Certificate file backup from disk
- Compression and timestamping
- Configurable backup destination
**Action Required**: None - the script has been created. Keep it aligned with the backup procedures documented in `NPMPLUS_BACKUP_RESTORE.md`.
---
## Placeholders and TBD Values
### 2. Nginx Config Paths - TBD Values
**Location**: `scripts/verify/verify-backend-vms.sh`
**Status**: ✅ **RESOLVED** - Paths set in scripts/verify/verify-backend-vms.sh:
- VMID 10130: `/etc/nginx/sites-available/dbis-frontend`
- VMID 2400: `/etc/nginx/sites-available/thirdweb-rpc`
**Required Actions** (if paths differ on actual VMs):
1. **VMID 10130 (dbis-frontend)**:
- Determine actual nginx config path
- Common locations: `/etc/nginx/sites-available/dbis-frontend` or `/etc/nginx/sites-available/dbis-admin`
- Update script with actual path
- Verify config exists and is enabled
2. **VMID 2400 (thirdweb-rpc-1)**:
- Determine actual nginx config path
- Common locations: `/etc/nginx/sites-available/thirdweb-rpc` or `/etc/nginx/sites-available/rpc`
- Update script with actual path
- Verify config exists and is enabled
**Impact**: Script will skip nginx config verification for these VMs until resolved.
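A minimal sketch of how the discovery step could be scripted instead of guessed, assuming the candidate paths listed above; run it on the VM itself (or via `ssh`/`pct exec`):

```shell
# Print the first candidate path that actually exists; return 1 if none do.
first_existing() {
  for f in "$@"; do
    if [ -e "$f" ]; then
      echo "$f"
      return 0
    fi
  done
  return 1
}

# Example for VMID 10130:
# first_existing /etc/nginx/sites-available/dbis-frontend /etc/nginx/sites-available/dbis-admin
```

The resolved path can then be pasted into `verify-backend-vms.sh`.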
---
### 3. Sankofa Cutover Plan - Target Placeholders
**Location**: `docs/04-configuration/SANKOFA_CUTOVER_PLAN.md`
**Placeholders to Replace** (once Sankofa services are deployed):
- `<TARGET_IP>` (appears 10 times)
- `<TARGET_PORT>` (appears 10 times)
- `⚠️ TBD` values in table (lines 60-64)
**Domain-Specific Targets Needed**:
| Domain | Current (Wrong) | Target (TBD) |
|--------|----------------|--------------|
| `sankofa.nexus` | 192.168.11.140:80 | `<TARGET_IP>:<TARGET_PORT>` |
| `www.sankofa.nexus` | 192.168.11.140:80 | `<TARGET_IP>:<TARGET_PORT>` |
| `phoenix.sankofa.nexus` | 192.168.11.140:80 | `<TARGET_IP>:<TARGET_PORT>` |
| `www.phoenix.sankofa.nexus` | 192.168.11.140:80 | `<TARGET_IP>:<TARGET_PORT>` |
| `the-order.sankofa.nexus` | 192.168.11.140:80 | `<TARGET_IP>:<TARGET_PORT>` |
**Action Required**: Update placeholders with actual Sankofa service IPs and ports once deployed.
---
## Documentation Placeholders
### 4. Generic Placeholders in Runbooks
**Location**: Multiple files
**Replacements Needed**:
#### `INGRESS_VERIFICATION_RUNBOOK.md`:
- Line 23: `CLOUDFLARE_API_TOKEN="your-token"` → Should reference `.env` file
- Line 25: `CLOUDFLARE_EMAIL="your-email"` → Should reference `.env` file
- Line 26: `CLOUDFLARE_API_KEY="your-key"` → Should reference `.env` file
- Line 31: `NPM_PASSWORD="your-password"` → Should reference `.env` file
- Lines 91, 101, 213: Similar placeholders in examples
**Note**: These are intentional examples, but should be clearly marked as such and reference `.env` file usage.
#### `NPMPLUS_BACKUP_RESTORE.md`:
- Line 84: `NPM_PASSWORD="your-password"` → Example placeholder (acceptable)
- Line 304: `NPM_PASSWORD="your-password"` → Example placeholder (acceptable)
#### `SANKOFA_CUTOVER_PLAN.md`:
- Line 125: `NPM_PASSWORD="your-password"` → Example placeholder (acceptable)
- Line 178: `NPM_PASSWORD="your-password"` → Example placeholder (acceptable)
**Status (2026-03-02):** Addressed. INGRESS_VERIFICATION_RUNBOOK.md now includes a production note in Prerequisites. VERIFICATION_GAPS_AND_TODOS documents that runbooks use example placeholders and production should source from .env.
---
### 5. Source of Truth JSON - Verifier Field
**Location**: `docs/04-configuration/INGRESS_SOURCE_OF_TRUTH.json` (line 5)
**Current**: `"verifier": "operator-name"`
**Expected**: Should be dynamically set by script using `$USER` or actual operator name.
**Status**: ✅ **HANDLED** - The `generate-source-of-truth.sh` script uses `env.USER // "unknown"` which is correct. The example JSON file is just a template.
**Action Required**: None - script implementation is correct.
---
## Implementation Gaps
### 6. Source of Truth Generation - File Path Dependencies
**Location**: `scripts/verify/generate-source-of-truth.sh`
**Potential Issues**:
- Script expects specific output file names from verification scripts
- If verification scripts don't run first, JSON will be empty or have defaults
- No validation that source files exist before parsing
**Expected File Dependencies**:
```bash
$EVIDENCE_DIR/dns-verification-*/all_dns_records.json
$EVIDENCE_DIR/udm-pro-verification-*/verification_results.json
$EVIDENCE_DIR/npmplus-verification-*/proxy_hosts.json
$EVIDENCE_DIR/npmplus-verification-*/certificates.json
$EVIDENCE_DIR/backend-vms-verification-*/all_vms_verification.json
$EVIDENCE_DIR/e2e-verification-*/all_e2e_results.json
```
**Action Required**:
- Add file existence checks before parsing
- Provide clear error messages if dependencies are missing
- Add option to generate partial source-of-truth if some verifications haven't run
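The existence check could look like the following sketch; the `--partial` flag mentioned in the error hint is hypothetical and would need to be implemented alongside it:

```shell
# Fail loudly when an expected evidence file is missing or empty, instead of
# letting the JSON parsing silently produce defaults.
require_file() {
  local f="$1" producer="$2"
  if [ ! -s "$f" ]; then
    echo "ERROR: missing or empty dependency: $f" >&2
    echo "  Run $producer first, or pass --partial to skip this section." >&2
    return 1
  fi
}

# Example usage inside generate-source-of-truth.sh:
# require_file "$EVIDENCE_DIR/npmplus-verification-latest/proxy_hosts.json" export-npmplus-config.sh
```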
---
### 7. Backend VM Verification - Service-Specific Checks
**Location**: `scripts/verify/verify-backend-vms.sh`
**Gaps Identified**:
1. **Besu RPC VMs (2101, 2201)**:
- Script checks for RPC endpoints but doesn't verify Besu-specific health checks
- Should test actual RPC calls (e.g., `eth_chainId`) not just HTTP status
- WebSocket port (8546) verification is minimal
2. **Node.js API VMs (10150, 10151)**:
- Only checks port 3000 is listening
- Doesn't verify API health endpoint exists
- Should test actual API endpoint (e.g., `/health` or `/api/health`)
3. **Blockscout VM (5000)**:
- Checks nginx on port 80 and Blockscout on port 4000
- Should verify Blockscout API is responding (e.g., `/api/health`)
**Action Required**:
- Add service-specific health check functions
- Implement actual RPC/API endpoint testing beyond port checks
- Document expected health check endpoints per service type
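A hedged sketch of the Besu-specific check: issue a real `eth_chainId` call rather than a bare port test. The JSON check is split into a small helper so it can be exercised without a live node:

```shell
# Return 0 if a JSON-RPC response body contains a "result" field.
rpc_has_result() {
  case "$1" in
    *'"result"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Call eth_chainId against a Besu RPC endpoint; fails on timeout or error response.
check_besu_rpc() {
  local url="$1" resp
  resp=$(curl -s -m 5 -X POST -H 'Content-Type: application/json' \
    --data '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' "$url") || return 1
  rpc_has_result "$resp"
}

# check_besu_rpc "http://<vm-ip>:8545"   # substitute the VM's RPC address
```

The same pattern extends to the Node.js `/health` endpoints and the Blockscout API with a plain `curl -sf` call.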
---
### 8. End-to-End Routing - WebSocket Testing
**Location**: `scripts/verify/verify-end-to-end-routing.sh`
**Current Implementation**:
- Basic WebSocket connectivity check using TCP connection test
- Manual `wscat` test recommended but not automated
- No actual WebSocket handshake or message exchange verification
**Gap**:
- WebSocket tests are minimal (just TCP connection)
- No verification that WebSocket protocol upgrade works correctly
- No test of actual RPC WebSocket messages
**Action Required**:
- Add automated WebSocket handshake test (if `wscat` is available)
- Or add clear documentation that WebSocket testing requires manual verification
- Consider adding automated WebSocket test script if `wscat` or `websocat` is installed
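A sketch of the "automate if available" option: detect a WebSocket client and run a handshake test only when one is installed, otherwise report a clear SKIP so the gap is visible in the results:

```shell
# Report which WebSocket client (if any) is available.
have_ws_client() {
  if command -v websocat >/dev/null 2>&1; then echo websocat
  elif command -v wscat >/dev/null 2>&1; then echo wscat
  else echo none
  fi
}

# Attempt an actual handshake + one RPC message when websocat is present.
ws_handshake_test() {
  local url="$1"
  case "$(have_ws_client)" in
    websocat) printf '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}\n' \
                | timeout 10 websocat "$url" ;;
    wscat)    echo "wscat found: run 'wscat -c $url' manually" ;;
    none)     echo "SKIP: no WebSocket client installed; verify $url manually" >&2; return 2 ;;
  esac
}
```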
---
## Configuration Gaps
### 9. Environment Variable Documentation
**Missing**: Comprehensive `.env.example` file listing all required variables
**Required Variables** (from scripts):
```bash
# Cloudflare
CLOUDFLARE_API_TOKEN=
CLOUDFLARE_EMAIL=
CLOUDFLARE_API_KEY=
CLOUDFLARE_ZONE_ID_D_BIS_ORG=
CLOUDFLARE_ZONE_ID_MIM4U_ORG=
CLOUDFLARE_ZONE_ID_SANKOFA_NEXUS=
CLOUDFLARE_ZONE_ID_DEFI_ORACLE_IO=
# Public IP
PUBLIC_IP=76.53.10.36
# NPMplus
NPM_URL=https://192.168.11.166:81
NPM_EMAIL=nsatoshi2007@hotmail.com
NPM_PASSWORD=
NPM_PROXMOX_HOST=192.168.11.11
NPM_VMID=10233
# Proxmox Hosts (for testing)
PROXMOX_HOST_FOR_TEST=192.168.11.11
```
**Action Required**: Create `.env.example` file in project root with all required variables.
---
### 10. Script Dependencies Documentation
**Missing**: List of required system dependencies
**Required Tools** (used across scripts):
- `bash` (4.0+)
- `curl` (for API calls)
- `jq` (for JSON parsing)
- `dig` (for DNS resolution)
- `openssl` (for SSL certificate inspection)
- `ssh` (for remote execution)
- `ss` (for port checking)
- `systemctl` (for service status)
- `sqlite3` (for database backup)
**Optional Tools**:
- `wscat` or `websocat` (for WebSocket testing)
**Action Required**:
- Add dependencies section to `INGRESS_VERIFICATION_RUNBOOK.md`
- Create `scripts/verify/README.md` with installation instructions
- Add dependency check function to `run-full-verification.sh`
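The dependency check function could be as small as the following sketch, run once at the top of `run-full-verification.sh`:

```shell
# Verify each named tool is on PATH; list everything missing in one message.
check_deps() {
  local missing=""
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  if [ -n "$missing" ]; then
    echo "ERROR: missing required tools:$missing" >&2
    return 1
  fi
}

# check_deps bash curl jq dig openssl ssh ss sqlite3
```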
---
## Data Completeness Gaps
### 11. Source of Truth JSON - Hardcoded Values
**Location**: `scripts/verify/generate-source-of-truth.sh` (lines 169-177)
**Current**: NPMplus container info is hardcoded:
```json
  "container": {
    "vmid": 10233,
    "host": "r630-01",
    "host_ip": "192.168.11.11",
    "internal_ips": {
      "eth0": "192.168.11.166",
      "eth1": "192.168.11.167"
    },
    "management_ui": "https://192.168.11.166:81",
    "status": "running"
  }
```
**Gap**: Status should be dynamically determined from verification results.
**Action Required**:
- Make container status dynamic based on `export-npmplus-config.sh` results
- Verify IP addresses are correct (especially `eth1`)
- Document if `eth1` is actually used or is a placeholder
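A sketch of the dynamic-status idea: parse `pct status <vmid>` output instead of hardcoding `"running"`. The parsing is split out so it is testable offline:

```shell
# `pct status 10233` prints a line like "status: running"; extract the value.
parse_pct_status() {
  awk '/^status:/ {print $2}'
}

# Usage on the Proxmox host, feeding the real command:
# STATUS=$(ssh root@192.168.11.11 "pct status 10233" | parse_pct_status)
```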
---
### 12. DNS Verification - Zone ID Lookup
**Location**: `scripts/verify/export-cloudflare-dns-records.sh`
**Current**: Attempts to fetch zone IDs if not provided in `.env`, but has fallback to empty string.
**Potential Issue**: If zone ID lookup fails and `.env` doesn't have zone IDs, script will fail silently or skip zones.
**Action Required**:
- Add validation that zone IDs are set (either from `.env` or from API lookup)
- Fail clearly if zone ID cannot be determined
- Provide helpful error message with instructions
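A sketch of the clear-failure behaviour: refuse to continue with an empty zone ID instead of silently skipping the zone:

```shell
# Fail with an actionable message when a zone ID is empty.
require_zone_id() {
  local var_name="$1" value="$2" zone="$3"
  if [ -z "$value" ]; then
    echo "ERROR: $var_name is unset and API lookup failed for zone $zone" >&2
    echo "  Set it in .env, or check the Cloudflare token's Zone:Read permission." >&2
    return 1
  fi
}

# require_zone_id CLOUDFLARE_ZONE_ID_D_BIS_ORG "$CLOUDFLARE_ZONE_ID_D_BIS_ORG" d-bis.org
```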
---
## Documentation Completeness
### 13. Missing Troubleshooting Sections
**Location**: `docs/04-configuration/INGRESS_VERIFICATION_RUNBOOK.md`
**Current**: Basic troubleshooting section exists (lines 427-468) but could be expanded.
**Missing Topics**:
- What to do if verification scripts fail partially
- How to interpret "unknown" status vs "needs-fix" status
- How to manually verify items that scripts can't automate
- Common Cloudflare API errors and solutions
- Common NPMplus API authentication issues
- SSH connection failures to Proxmox hosts
**Action Required**: Expand troubleshooting section with more scenarios.
---
### 14. Missing Rollback Procedures
**Location**: `docs/04-configuration/SANKOFA_CUTOVER_PLAN.md`
**Current**: Basic rollback steps exist (lines 330-342) but could be more detailed.
**Missing**:
- Automated rollback script reference
- Exact commands to restore previous NPMplus configuration
- How to verify rollback was successful
- Recovery time expectations
**Action Required**:
- Create `scripts/verify/rollback-sankofa-routing.sh` (optional but recommended)
- Or expand manual rollback steps with exact API calls
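A hedged sketch of what the "exact API calls" option could look like, assuming the standard NPM proxy-hosts endpoint; the host IDs and the Bearer token must come from a prior export and login, and the payload builder is the testable piece:

```shell
# Build the JSON body that reverts a proxy host to the previous target.
rollback_payload() {
  printf '{"forward_host":"%s","forward_port":%s}' "$1" "$2"
}

# curl -sk -X PUT "$NPM_URL/api/nginx/proxy-hosts/$HOST_ID" \
#   -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' \
#   -d "$(rollback_payload 192.168.11.140 80)"
```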
---
## Priority Summary
### 🔴 Critical (Must Fix Before Production Use)
1. **Create `scripts/verify/backup-npmplus.sh`** - ✅ Done (script created; see item 1)
2. **Resolve TBD nginx config paths** (VMID 10130, 2400) - ✅ Paths set in script; confirm on the actual VMs
3. **Add file dependency validation** in `generate-source-of-truth.sh`
### 🟡 Important (Should Fix Soon)
4. **Add `.env.example` file** with all required variables
5. **Add dependency checks** to verification scripts
6. **Expand service-specific health checks** for Besu, Node.js, Blockscout
7. **Document WebSocket testing limitations** or automate it
### 🟢 Nice to Have (Can Wait)
8. **Expand troubleshooting section** with more scenarios
9. **Create rollback script** for Sankofa cutover
10. **Add dependency installation guide** to runbook
11. **Make container status dynamic** in source-of-truth generation
---
## Notes
- **Placeholders in examples**: Most "your-password", "your-token" placeholders in documentation are intentional examples and acceptable, but should clearly reference `.env` file usage.
- **Sankofa placeholders**: `<TARGET_IP>` and `<TARGET_PORT>` are expected placeholders until Sankofa services are deployed. These should be updated during cutover.
- **TBD config paths**: These need to be discovered by running verification and inspecting actual VMs.
---
## Additional Items Completed
### 15. NPMplus High Availability (HA) Setup Guide ✅ ADDED
**Status**: ✅ **DOCUMENTATION COMPLETE** - Implementation pending
**Location**: `docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md`
**What Was Added**:
- Complete HA architecture guide (Active-Passive with Keepalived)
- Step-by-step implementation instructions (6 phases)
- Helper scripts: `sync-certificates.sh`, `monitor-ha-status.sh`
- Testing and validation procedures
- Troubleshooting guide
- Rollback plan
- Future upgrade path to Active-Active
**Scripts Created**:
- `scripts/npmplus/sync-certificates.sh` - Synchronize certificates from primary to secondary
- `scripts/npmplus/monitor-ha-status.sh` - Monitor HA status and send alerts
**Impact**: Eliminates single point of failure for NPMplus, enables automatic failover.
---
## NPMplus HA Implementation Tasks
### Phase 1: Prepare Secondary NPMplus Instance
#### Task 1.1: Create Secondary NPMplus Container
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 30 minutes
**Actions Required**:
- [ ] Download Alpine 3.22 template on r630-02
- [ ] Create container VMID 10234 with:
- Hostname: `npmplus-secondary`
- IP: `192.168.11.167/24`
- Memory: 1024 MB
- Cores: 2
- Disk: 5 GB
- Features: nesting=1, unprivileged=1
- [ ] Start container and verify it's running
- [ ] Document container creation in deployment log
**Commands**:
```bash
# On r630-02
CTID=10234
HOSTNAME="npmplus-secondary"
IP="192.168.11.167"
BRIDGE="vmbr0"
pveam download local alpine-3.22-default_20241208_amd64.tar.xz
pct create $CTID \
local:vztmpl/alpine-3.22-default_20241208_amd64.tar.xz \
--hostname $HOSTNAME \
--memory 1024 \
--cores 2 \
--rootfs local-lvm:5 \
--net0 name=eth0,bridge=$BRIDGE,ip=$IP/24,gw=192.168.11.1 \
--unprivileged 1 \
--features nesting=1
pct start $CTID
```
---
#### Task 1.2: Install NPMplus on Secondary Instance
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 45 minutes
**Actions Required**:
- [ ] SSH to r630-02 and enter container
- [ ] Install dependencies: `tzdata`, `gawk`, `yq`, `docker`, `docker-compose`, `curl`, `bash`, `rsync`
- [ ] Start and enable Docker service
- [ ] Download NPMplus compose.yaml from GitHub
- [ ] Configure timezone: `America/New_York`
- [ ] Configure ACME email: `nsatoshi2007@hotmail.com`
- [ ] Start NPMplus container (but don't configure yet - will sync first)
- [ ] Wait for NPMplus to be healthy
- [ ] Retrieve admin password and document it
**Commands**:
```bash
ssh root@192.168.11.12
pct exec 10234 -- ash
apk update
apk add --no-cache tzdata gawk yq docker docker-compose curl bash rsync
rc-service docker start
rc-update add docker default
sleep 5
cd /opt
curl -fsSL "https://raw.githubusercontent.com/ZoeyVid/NPMplus/refs/heads/develop/compose.yaml" -o compose.yaml
TZ="America/New_York"
ACME_EMAIL="nsatoshi2007@hotmail.com"
# Drop any existing TZ=/ACME_EMAIL= entries, then append ours.
# (A literal `. != "TZ=*"` comparison would never match real entries like
# "TZ=Etc/UTC", so stale values would survive; use a regex test instead.)
yq -i "
  .services.npmplus.environment |=
    (map(select(test(\"^(TZ|ACME_EMAIL)=\") | not)) +
     [\"TZ=$TZ\", \"ACME_EMAIL=$ACME_EMAIL\"])
" compose.yaml
docker compose up -d
```
---
#### Task 1.3: Configure Secondary Container Network
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 10 minutes
**Actions Required**:
- [ ] Verify static IP assignment: `192.168.11.167`
- [ ] Verify gateway: `192.168.11.1`
- [ ] Test network connectivity to primary host
- [ ] Test network connectivity to backend VMs
- [ ] Document network configuration
**Commands**:
```bash
pct exec 10234 -- ip addr show eth0
pct exec 10234 -- ping -c 3 192.168.11.11
pct exec 10234 -- ping -c 3 192.168.11.166
```
---
### Phase 2: Set Up Certificate Synchronization
#### Task 2.1: Create Certificate Sync Script
**Status**: ✅ **COMPLETED**
**Location**: `scripts/npmplus/sync-certificates.sh`
**Note**: Script already created, needs testing
**Actions Required**:
- [ ] Test certificate sync script manually
- [ ] Verify certificates sync correctly
- [ ] Verify script handles errors gracefully
- [ ] Document certificate paths for both primary and secondary
---
#### Task 2.2: Set Up Automated Certificate Sync
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 15 minutes
**Actions Required**:
- [ ] Add cron job on primary Proxmox host (r630-01)
- [ ] Configure to run every 5 minutes
- [ ] Set up log rotation for `/var/log/npmplus-cert-sync.log`
- [ ] Test cron job execution
- [ ] Monitor logs for successful syncs
- [ ] Verify certificate count matches between primary and secondary
**Commands**:
```bash
# On r630-01
crontab -e
# Add:
*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/sync-certificates.sh >> /var/log/npmplus-cert-sync.log 2>&1
# Test manually first
bash /home/intlc/projects/proxmox/scripts/npmplus/sync-certificates.sh
```
---
### Phase 3: Set Up Keepalived for Virtual IP
#### Task 3.1: Install Keepalived on Proxmox Hosts
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 10 minutes
**Actions Required**:
- [ ] Install Keepalived on r630-01 (primary)
- [ ] Install Keepalived on r630-02 (secondary)
- [ ] Verify Keepalived installation
- [ ] Check firewall rules for VRRP (multicast 224.0.0.0/8)
**Commands**:
```bash
# On both hosts
apt update
apt install -y keepalived
# Verify installation
keepalived --version
```
---
#### Task 3.2: Configure Keepalived on Primary Host (r630-01)
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 20 minutes
**Actions Required**:
- [ ] Create `/etc/keepalived/keepalived.conf` with MASTER configuration
- [ ] Set virtual_router_id: 51
- [ ] Set priority: 110
- [ ] Configure auth_pass (use secure password)
- [ ] Configure virtual_ipaddress: 192.168.11.166/24
- [ ] Reference health check script path
- [ ] Reference notification script path
- [ ] Verify configuration syntax
- [ ] Document Keepalived configuration
**Files to Create**:
- `/etc/keepalived/keepalived.conf` (see HA guide for full config)
- `/usr/local/bin/check-npmplus-health.sh` (Task 3.4)
- `/usr/local/bin/keepalived-notify.sh` (Task 3.5)
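A sketch of the MASTER configuration derived from the values in the checklist above; the HA guide holds the authoritative version, and `auth_pass` is a placeholder to replace:

```
# /etc/keepalived/keepalived.conf (r630-01, MASTER sketch)
vrrp_script check_npmplus {
    script "/usr/local/bin/check-npmplus-health.sh"
    interval 5
    fall 2
    rise 2
}

vrrp_instance NPMPLUS_VIP {
    state MASTER
    interface vmbr0
    virtual_router_id 51
    priority 110
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass CHANGE_ME
    }
    virtual_ipaddress {
        192.168.11.166/24
    }
    track_script {
        check_npmplus
    }
    notify /usr/local/bin/keepalived-notify.sh
}
```

The BACKUP configuration (Task 3.3) differs only in `state BACKUP` and `priority 100`.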
---
#### Task 3.3: Configure Keepalived on Secondary Host (r630-02)
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 20 minutes
**Actions Required**:
- [ ] Create `/etc/keepalived/keepalived.conf` with BACKUP configuration
- [ ] Set virtual_router_id: 51 (must match primary)
- [ ] Set priority: 100 (lower than primary)
- [ ] Configure auth_pass (must match primary)
- [ ] Configure virtual_ipaddress: 192.168.11.166/24
- [ ] Reference health check script path
- [ ] Reference notification script path
- [ ] Verify configuration syntax
- [ ] Document Keepalived configuration
**Files to Create**:
- `/etc/keepalived/keepalived.conf` (see HA guide for full config)
- `/usr/local/bin/check-npmplus-health.sh` (Task 3.4)
- `/usr/local/bin/keepalived-notify.sh` (Task 3.5)
---
#### Task 3.4: Create Health Check Script
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 30 minutes
**Actions Required**:
- [ ] Create `/usr/local/bin/check-npmplus-health.sh` on both hosts
- [ ] Script should:
- Detect hostname to determine which VMID to check
- Check if container is running
- Check if NPMplus Docker container is healthy
- Check if NPMplus web interface responds (port 81)
- Return exit code 0 if healthy, 1 if unhealthy
- [ ] Make script executable: `chmod +x`
- [ ] Test script manually on both hosts
- [ ] Verify script detects failures correctly
**File**: `/usr/local/bin/check-npmplus-health.sh`
**Details**: See HA guide for full script content
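A sketch of the health check, assuming VMIDs 10233 (r630-01) and 10234 (r630-02) and that the Docker container is named `npmplus` after the compose service; the HA guide holds the full version:

```shell
#!/usr/bin/env bash
# Map the Proxmox hostname to the local NPMplus container VMID.
vmid_for_host() {
  case "$1" in
    r630-01) echo 10233 ;;
    r630-02) echo 10234 ;;
    *) return 1 ;;
  esac
}

# Exit 0 if the LXC is running, the Docker container is healthy, and port 81 answers.
check_health() {
  local vmid
  vmid=$(vmid_for_host "$(hostname)") || return 1
  pct status "$vmid" | grep -q running || return 1
  pct exec "$vmid" -- docker inspect -f '{{.State.Health.Status}}' npmplus 2>/dev/null \
    | grep -q healthy || return 1
  pct exec "$vmid" -- curl -skf -o /dev/null https://127.0.0.1:81 || return 1
}

# check_health   # uncomment when installing; keepalived uses the exit code
```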
---
#### Task 3.5: Create Keepalived Notification Script
**Status**: ⏳ **PENDING**
**Priority**: 🟡 **Important**
**Estimated Time**: 15 minutes
**Actions Required**:
- [ ] Create `/usr/local/bin/keepalived-notify.sh` on both hosts
- [ ] Script should handle states: master, backup, fault
- [ ] Log state changes to `/var/log/keepalived-notify.log`
- [ ] Optional: Send alerts (email, webhook) on fault state
- [ ] Make script executable: `chmod +x`
- [ ] Test script with each state manually
**File**: `/usr/local/bin/keepalived-notify.sh`
**Details**: See HA guide for full script content
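A sketch of the notification handler. Keepalived invokes notify scripts as `<TYPE> <NAME> <STATE>`, with STATE one of MASTER/BACKUP/FAULT:

```shell
#!/usr/bin/env bash
# Format a timestamped log line for a state transition; reject unknown states.
format_transition() {
  case "$1" in
    MASTER|BACKUP|FAULT) echo "$(date '+%F %T') state -> $1" ;;
    *) echo "$(date '+%F %T') unexpected state: $1"; return 1 ;;
  esac
}

# format_transition "$3" >> /var/log/keepalived-notify.log
# On FAULT, this is where an email/webhook alert would be added.
```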
---
#### Task 3.6: Start and Enable Keepalived
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 15 minutes
**Actions Required**:
- [ ] Enable Keepalived service on both hosts
- [ ] Start Keepalived on both hosts
- [ ] Verify Keepalived is running
- [ ] Verify primary host owns VIP (192.168.11.166)
- [ ] Verify secondary host is in BACKUP state
- [ ] Monitor Keepalived logs for any errors
- [ ] Document VIP ownership verification
**Commands**:
```bash
# On both hosts
systemctl enable keepalived
systemctl start keepalived
# Verify status
systemctl status keepalived
# Check VIP ownership (should be on primary)
ip addr show vmbr0 | grep 192.168.11.166
# Check logs
journalctl -u keepalived -f
```
---
### Phase 4: Sync Configuration to Secondary
#### Task 4.1: Export Primary Configuration
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 30 minutes
**Actions Required**:
- [ ] Create export script: `scripts/npmplus/export-primary-config.sh`
- [ ] Export NPMplus SQLite database to SQL dump
- [ ] Export proxy hosts via API (JSON)
- [ ] Export certificates via API (JSON)
- [ ] Create timestamped backup directory
- [ ] Verify all exports completed successfully
- [ ] Document backup location and contents
**Script to Create**: `scripts/npmplus/export-primary-config.sh`
**Details**: See HA guide for full script content
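A sketch of the export script's timestamped-backup step; the database path is assumed from this document and may need adjusting to the Docker volume location shown in Task 6.2, and the API exports would follow the same pattern as `backup-npmplus.sh`:

```shell
# Build a timestamped backup directory path under the given root.
backup_dir_for() {
  printf '%s/npmplus-%s\n' "$1" "$(date +%Y%m%d-%H%M%S)"
}

# BACKUP_DIR=$(backup_dir_for /root/npmplus-backups)
# mkdir -p "$BACKUP_DIR"
# pct exec 10233 -- sqlite3 /data/database.sqlite ".backup /tmp/database.bak" \
#   && pct pull 10233 /tmp/database.bak "$BACKUP_DIR/database.sqlite"
```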
---
#### Task 4.2: Import Configuration to Secondary
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 45 minutes
**Actions Required**:
- [ ] Create import script: `scripts/npmplus/import-secondary-config.sh`
- [ ] Stop NPMplus container on secondary (if running)
- [ ] Copy database SQL dump to secondary
- [ ] Import database dump into secondary NPMplus
- [ ] Restart NPMplus container on secondary
- [ ] Wait for NPMplus to be healthy
- [ ] Verify proxy hosts are configured
- [ ] Verify certificates are accessible
- [ ] Document any manual configuration steps needed
**Script to Create**: `scripts/npmplus/import-secondary-config.sh`
**Details**: See HA guide for full script content
**Note**: Some configuration may need manual replication via API or UI.
---
### Phase 5: Set Up Ongoing Configuration Sync
#### Task 5.1: Create Configuration Sync Script
**Status**: ⏳ **PENDING**
**Priority**: 🟡 **Important**
**Estimated Time**: 45 minutes
**Actions Required**:
- [ ] Create sync script: `scripts/npmplus/sync-config.sh`
- [ ] Authenticate to NPMplus API (primary)
- [ ] Export proxy hosts configuration
- [ ] Implement API-based sync or document manual sync process
- [ ] Add script to automation (if automated sync is possible)
- [ ] Document manual sync procedures for configuration changes
**Script to Create**: `scripts/npmplus/sync-config.sh`
**Note**: Full automated sync requires shared database or complex API sync. For now, manual sync may be required.
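A hedged sketch of the API authentication step, assuming the standard NPM token endpoint and the `NPM_*` variables from `.env`; `extract_token` is a jq-free parser split out so it is testable offline:

```shell
# Pull the "token" field out of the login response body.
extract_token() {
  sed -n 's/.*"token":"\([^"]*\)".*/\1/p'
}

# TOKEN=$(curl -sk "$NPM_URL/api/tokens" -H 'Content-Type: application/json' \
#   -d "{\"identity\":\"$NPM_EMAIL\",\"secret\":\"$NPM_PASSWORD\"}" | extract_token)
# curl -sk "$NPM_URL/api/nginx/proxy-hosts" -H "Authorization: Bearer $TOKEN" > proxy_hosts.json
```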
---
### Phase 6: Testing and Validation
#### Task 6.1: Test Virtual IP Failover
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 30 minutes
**Actions Required**:
- [ ] Verify primary owns VIP before test
- [ ] Simulate primary failure (stop Keepalived or NPMplus container)
- [ ] Verify VIP moves to secondary within 5-10 seconds
- [ ] Test connectivity to VIP from external source
- [ ] Restore primary and verify failback
- [ ] Document failover time (should be < 10 seconds)
- [ ] Test multiple failover scenarios
- [ ] Document test results
**Test Scenarios**:
1. Stop Keepalived on primary
2. Stop NPMplus container on primary
3. Stop entire Proxmox host (if possible in test environment)
4. Network partition (if possible in test environment)
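The failover-time measurement can be scripted with a small helper that parses `ip addr` output for the VIP, so a loop can count how many seconds the move takes:

```shell
# Reads `ip addr show vmbr0` output on stdin; arg 1 is the VIP to look for.
owns_vip() {
  grep -qF "inet ${1}/"
}

# Example loop (run from a third machine while the primary is taken down):
# for i in $(seq 1 20); do
#   ssh root@192.168.11.12 "ip addr show vmbr0" | owns_vip 192.168.11.166 && break
#   sleep 1
# done
```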
---
#### Task 6.2: Test Certificate Access
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 30 minutes
**Actions Required**:
- [ ] Verify certificates exist on secondary (after sync)
- [ ] Test SSL endpoint from external: `curl -vI https://explorer.d-bis.org`
- [ ] Verify certificate is valid and trusted
- [ ] Test multiple domains with SSL
- [ ] Verify certificate expiration dates match
- [ ] Test certificate auto-renewal on secondary (when primary renews)
- [ ] Document certificate test results
**Commands**:
```bash
# Verify certificates on secondary
ssh root@192.168.11.12 "pct exec 10234 -- ls -la /var/lib/docker/volumes/npmplus_data/_data/tls/certbot/live/"
# Test SSL endpoint
curl -vI https://explorer.d-bis.org
curl -vI https://mim4u.org
curl -vI https://rpc-http-pub.d-bis.org
```
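For the "expiration dates match" check, the `notAfter` field can be pulled from each live endpoint and compared; `parse_enddate` strips the `notAfter=` prefix and is the testable piece:

```shell
# Strip the "notAfter=" prefix from openssl's -enddate output.
parse_enddate() {
  cut -d= -f2-
}

# Fetch the expiry date of the certificate served for a domain.
cert_expiry() {
  echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null \
    | openssl x509 -noout -enddate | parse_enddate
}

# Compare with the VIP pointing at primary vs secondary in turn:
# cert_expiry explorer.d-bis.org
```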
---
#### Task 6.3: Test Proxy Host Functionality
**Status**: ⏳ **PENDING**
**Priority**: 🔴 **Critical**
**Estimated Time**: 45 minutes
**Actions Required**:
- [ ] Test each domain from external after failover
- [ ] Verify HTTP to HTTPS redirects work
- [ ] Verify WebSocket connections work (for RPC endpoints)
- [ ] Verify API endpoints respond correctly
- [ ] Test all 19+ domains
- [ ] Document any domains that don't work correctly
- [ ] Test with secondary as active instance
- [ ] Test failback to primary
**Test Domains**:
- All d-bis.org domains (9 domains)
- All mim4u.org domains (4 domains)
- All sankofa.nexus domains (5 domains)
- defi-oracle.io domain (1 domain)
---
### Monitoring and Maintenance
#### Task 7.1: Set Up HA Status Monitoring
**Status**: ✅ **COMPLETED** (script created, needs deployment)
**Priority**: 🟡 **Important**
**Location**: `scripts/npmplus/monitor-ha-status.sh`
**Actions Required**:
- [ ] Add cron job for HA status monitoring (every 5 minutes)
- [ ] Configure log rotation for `/var/log/npmplus-ha-monitor.log`
- [ ] Test monitoring script manually
- [ ] Optional: Integrate with alerting system (email, webhook)
- [ ] Document alert thresholds and escalation procedures
- [ ] Test alert generation
**Commands**:
```bash
# On primary Proxmox host
crontab -e
# Add:
*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/monitor-ha-status.sh >> /var/log/npmplus-ha-monitor.log 2>&1
```
---
#### Task 7.2: Document Manual Failover Procedures
**Status**: ⏳ **PENDING**
**Priority**: 🟡 **Important**
**Estimated Time**: 30 minutes
**Actions Required**:
- [ ] Document step-by-step manual failover procedure
- [ ] Document how to force failover to secondary
- [ ] Document how to force failback to primary
- [ ] Document troubleshooting steps for common issues
- [ ] Create runbook for operations team
- [ ] Test manual failover procedures
- [ ] Review and approve documentation
**Location**: Add to `docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md` troubleshooting section
---
#### Task 7.3: Test All Failover Scenarios
**Status**: ⏳ **PENDING**
**Priority**: 🟡 **Important**
**Estimated Time**: 2 hours
**Actions Required**:
- [ ] Test automatic failover (primary failure)
- [ ] Test automatic failback (primary recovery)
- [ ] Test manual failover (force to secondary)
- [ ] Test manual failback (force to primary)
- [ ] Test partial failure (Keepalived down but NPMplus up)
- [ ] Test network partition scenarios
- [ ] Test during high traffic (if possible)
- [ ] Document all test results
- [ ] Identify and fix any issues found
---
## HA Implementation Summary
### Total Estimated Time
- **Phase 1**: 1.5 hours (container creation and NPMplus installation)
- **Phase 2**: 30 minutes (certificate sync setup)
- **Phase 3**: 2 hours (Keepalived configuration and scripts)
- **Phase 4**: 1.5 hours (configuration export/import)
- **Phase 5**: 45 minutes (ongoing sync setup)
- **Phase 6**: 2 hours (testing and validation)
- **Monitoring**: 1 hour (monitoring setup and documentation)
**Total**: ~9 hours of implementation time
### Prerequisites Checklist
- [ ] Secondary Proxmox host available (r630-02 or ml110)
- [ ] Network connectivity between hosts verified
- [ ] Sufficient resources on secondary host (1 GB RAM, 5 GB disk, 2 CPU cores)
- [ ] SSH access configured between hosts (key-based auth recommended)
- [ ] Maintenance window scheduled
- [ ] Backup of primary NPMplus completed
- [ ] Team notified of maintenance window
### Risk Mitigation
- [ ] Rollback plan documented and tested
- [ ] Primary NPMplus backup verified before changes
- [ ] Test environment available (if possible)
- [ ] Monitoring in place before production deployment
- [ ] Emergency contact list available
---
**Last Updated**: 2026-01-20
**Next Review**: After addressing critical items