# Verification Scripts and Documentation - Gaps and TODOs

**Last Updated:** 2026-03-02

**Document Version:** 1.0

**Status:** Active Documentation

---

**Date**: 2026-01-20

**Status**: Gap Analysis Complete

**Purpose**: Identify all placeholders, missing components, and incomplete implementations

**Documentation note (2026-03-02):** Runbook placeholders (e.g. `your-token`, `your-password`) are intentional examples. In production, use values from `.env` only; do not commit secrets. [INGRESS_VERIFICATION_RUNBOOK.md](INGRESS_VERIFICATION_RUNBOOK.md) has been updated with a production note in Prerequisites. The other runbooks (NPMPLUS_BACKUP_RESTORE, SANKOFA_CUTOVER_PLAN) keep example placeholders; operators should source values from `.env` when running commands.
---

## Critical Missing Components

### 1. Missing Script: `scripts/verify/backup-npmplus.sh`

**Status**: ✅ **CREATED** (`scripts/verify/backup-npmplus.sh`)

**Referenced in**:
- `docs/04-configuration/NPMPLUS_BACKUP_RESTORE.md` (lines 39, 150, 437, 480)

**Required Functionality**:
- Automated backup of the NPMplus database (`/data/database.sqlite`)
- Export of proxy hosts via API
- Export of certificates via API
- Certificate file backup from disk
- Compression and timestamping
- Configurable backup destination

**Action Required**: None - the script has been created and implements the backup procedures documented in `NPMPLUS_BACKUP_RESTORE.md`.
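A minimal sketch of what such a backup helper could look like (the data-directory layout, API endpoint, and output naming are illustrative assumptions, not the actual script):

```bash
#!/usr/bin/env bash
# Sketch of an NPMplus backup helper; paths and API endpoint are assumptions.
set -euo pipefail

backup_npmplus() {
  local data_dir="$1" dest_dir="$2"
  local stamp work
  stamp=$(date +%Y%m%d-%H%M%S)
  work="$dest_dir/npmplus-backup-$stamp"
  mkdir -p "$work"

  # Database: prefer sqlite3 .backup for a consistent snapshot; fall back to cp
  if command -v sqlite3 >/dev/null 2>&1; then
    sqlite3 "$data_dir/database.sqlite" ".backup '$work/database.sqlite'"
  else
    cp "$data_dir/database.sqlite" "$work/database.sqlite"
  fi

  # API export (hypothetical endpoint; requires a bearer token in NPM_TOKEN)
  if [ -n "${NPM_TOKEN:-}" ]; then
    curl -sk -H "Authorization: Bearer $NPM_TOKEN" \
      "${NPM_URL:-https://192.168.11.166:81}/api/nginx/proxy-hosts" \
      > "$work/proxy_hosts.json"
  fi

  # Compress with a timestamped name, then drop the working directory
  tar -czf "$work.tar.gz" -C "$dest_dir" "npmplus-backup-$stamp"
  rm -rf "$work"
  echo "$work.tar.gz"
}
```

The destination directory is a parameter rather than a constant so the same helper can write to local disk or a mounted backup share.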
---

## Placeholders and TBD Values

### 2. Nginx Config Paths - TBD Values

**Location**: `scripts/verify/verify-backend-vms.sh`

**Status**: ✅ **RESOLVED** - Paths set in `scripts/verify/verify-backend-vms.sh`:
- VMID 10130: `/etc/nginx/sites-available/dbis-frontend`
- VMID 2400: `/etc/nginx/sites-available/thirdweb-rpc`

**Required Actions** (if paths differ on the actual VMs):

1. **VMID 10130 (dbis-frontend)**:
   - Determine the actual nginx config path
   - Common locations: `/etc/nginx/sites-available/dbis-frontend` or `/etc/nginx/sites-available/dbis-admin`
   - Update the script with the actual path
   - Verify the config exists and is enabled

2. **VMID 2400 (thirdweb-rpc-1)**:
   - Determine the actual nginx config path
   - Common locations: `/etc/nginx/sites-available/thirdweb-rpc` or `/etc/nginx/sites-available/rpc`
   - Update the script with the actual path
   - Verify the config exists and is enabled

**Impact**: The script will skip nginx config verification for these VMs until this is resolved.

---
### 3. Sankofa Cutover Plan - Target Placeholders

**Location**: `docs/04-configuration/SANKOFA_CUTOVER_PLAN.md`

**Placeholders to Replace** (once Sankofa services are deployed):
- `<TARGET_IP>` (appears 10 times)
- `<TARGET_PORT>` (appears 10 times)
- `⚠️ TBD` values in the table (lines 60-64)

**Domain-Specific Targets Needed**:

| Domain | Current (Wrong) | Target (TBD) |
|--------|-----------------|--------------|
| `sankofa.nexus` | 192.168.11.140:80 | `<TARGET_IP>:<TARGET_PORT>` |
| `www.sankofa.nexus` | 192.168.11.140:80 | `<TARGET_IP>:<TARGET_PORT>` |
| `phoenix.sankofa.nexus` | 192.168.11.140:80 | `<TARGET_IP>:<TARGET_PORT>` |
| `www.phoenix.sankofa.nexus` | 192.168.11.140:80 | `<TARGET_IP>:<TARGET_PORT>` |
| `the-order.sankofa.nexus` | 192.168.11.140:80 | `<TARGET_IP>:<TARGET_PORT>` |

**Action Required**: Update the placeholders with actual Sankofa service IPs and ports once deployed.

---
## Documentation Placeholders

### 4. Generic Placeholders in Runbooks

**Location**: Multiple files

**Replacements Needed**:

#### `INGRESS_VERIFICATION_RUNBOOK.md`:
- Line 23: `CLOUDFLARE_API_TOKEN="your-token"` → Should reference the `.env` file
- Line 25: `CLOUDFLARE_EMAIL="your-email"` → Should reference the `.env` file
- Line 26: `CLOUDFLARE_API_KEY="your-key"` → Should reference the `.env` file
- Line 31: `NPM_PASSWORD="your-password"` → Should reference the `.env` file
- Lines 91, 101, 213: Similar placeholders in examples

**Note**: These are intentional examples, but they should be clearly marked as such and reference `.env` file usage.

#### `NPMPLUS_BACKUP_RESTORE.md`:
- Line 84: `NPM_PASSWORD="your-password"` → Example placeholder (acceptable)
- Line 304: `NPM_PASSWORD="your-password"` → Example placeholder (acceptable)

#### `SANKOFA_CUTOVER_PLAN.md`:
- Line 125: `NPM_PASSWORD="your-password"` → Example placeholder (acceptable)
- Line 178: `NPM_PASSWORD="your-password"` → Example placeholder (acceptable)

**Status (2026-03-02):** Addressed. `INGRESS_VERIFICATION_RUNBOOK.md` now includes a production note in Prerequisites, and this document records that the runbooks use example placeholders and that production values should be sourced from `.env`.

---
### 5. Source of Truth JSON - Verifier Field

**Location**: `docs/04-configuration/INGRESS_SOURCE_OF_TRUTH.json` (line 5)

**Current**: `"verifier": "operator-name"`

**Expected**: Should be set dynamically by the script using `$USER` or the actual operator name.

**Status**: ✅ **HANDLED** - The `generate-source-of-truth.sh` script uses `env.USER // "unknown"`, which is correct. The example JSON file is just a template.

**Action Required**: None - the script implementation is correct.

---
## Implementation Gaps

### 6. Source of Truth Generation - File Path Dependencies

**Location**: `scripts/verify/generate-source-of-truth.sh`

**Potential Issues**:
- The script expects specific output file names from the verification scripts
- If the verification scripts haven't run first, the JSON will be empty or contain defaults
- No validation that source files exist before parsing

**Expected File Dependencies**:

```bash
$EVIDENCE_DIR/dns-verification-*/all_dns_records.json
$EVIDENCE_DIR/udm-pro-verification-*/verification_results.json
$EVIDENCE_DIR/npmplus-verification-*/proxy_hosts.json
$EVIDENCE_DIR/npmplus-verification-*/certificates.json
$EVIDENCE_DIR/backend-vms-verification-*/all_vms_verification.json
$EVIDENCE_DIR/e2e-verification-*/all_e2e_results.json
```

**Action Required**:
- Add file existence checks before parsing
- Provide clear error messages if dependencies are missing
- Add an option to generate a partial source-of-truth if some verifications haven't run
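A minimal existence check along these lines could sit near the top of the script (a sketch; the function name and messages are illustrative):

```bash
#!/usr/bin/env bash
# Sketch of a dependency gate for generate-source-of-truth.sh.

# Resolve a glob such as "$EVIDENCE_DIR/dns-verification-*/all_dns_records.json";
# print the first match, or fail with a clear message if nothing matches.
require_file() {
  local pattern="$1" f
  # Unquoted expansion performs the glob; an unmatched glob stays literal
  for f in $pattern; do
    if [ -e "$f" ]; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  echo "ERROR: missing verification output: $pattern" >&2
  echo "       run the corresponding verify script first" >&2
  return 1
}
```

A partial-generation mode would call `require_file` per section and emit `"status": "not-verified"` for sections whose inputs are absent, instead of aborting.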
---

### 7. Backend VM Verification - Service-Specific Checks

**Location**: `scripts/verify/verify-backend-vms.sh`

**Gaps Identified**:

1. **Besu RPC VMs (2101, 2201)**:
   - The script checks for RPC endpoints but doesn't run Besu-specific health checks
   - Should test actual RPC calls (e.g., `eth_chainId`), not just HTTP status
   - WebSocket port (8546) verification is minimal

2. **Node.js API VMs (10150, 10151)**:
   - Only checks that port 3000 is listening
   - Doesn't verify an API health endpoint exists
   - Should test an actual API endpoint (e.g., `/health` or `/api/health`)

3. **Blockscout VM (5000)**:
   - Checks nginx on port 80 and Blockscout on port 4000
   - Should verify the Blockscout API is responding (e.g., `/api/health`)

**Action Required**:
- Add service-specific health check functions
- Implement actual RPC/API endpoint testing beyond port checks
- Document expected health check endpoints per service type
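An `eth_chainId` probe for the Besu VMs could take this shape (a sketch; the endpoint URL passed in is whatever the script already resolves per VM):

```bash
#!/usr/bin/env bash
# Sketch of a Besu-specific health check via JSON-RPC eth_chainId.

# Extract the chain ID from an eth_chainId response on stdin; print it in decimal.
parse_chain_id() {
  local hex
  hex=$(jq -er '.result') || return 1
  printf '%d\n' "$hex"
}

# Query a Besu HTTP RPC endpoint, e.g.: check_besu_rpc http://192.168.x.x:8545
check_besu_rpc() {
  curl -s -m 5 -H 'Content-Type: application/json' \
    --data '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' \
    "$1" | parse_chain_id
}
```

Comparing the printed chain ID against the expected value catches a proxy that returns 200 for a misrouted backend, which a bare HTTP status check cannot.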
---

### 8. End-to-End Routing - WebSocket Testing

**Location**: `scripts/verify/verify-end-to-end-routing.sh`

**Current Implementation**:
- Basic WebSocket connectivity check using a TCP connection test
- A manual `wscat` test is recommended but not automated
- No actual WebSocket handshake or message exchange verification

**Gap**:
- WebSocket tests are minimal (just a TCP connection)
- No verification that the WebSocket protocol upgrade works correctly
- No test of actual RPC WebSocket messages

**Action Required**:
- Add an automated WebSocket handshake test (if `wscat` is available)
- Or document clearly that WebSocket testing requires manual verification
- Consider adding an automated WebSocket test script if `wscat` or `websocat` is installed
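The opt-in test could be gated on the client being installed, falling back to a skip message (a sketch using `websocat`; the RPC payload mirrors the HTTP check):

```bash
#!/usr/bin/env bash
# Sketch of an opt-in WebSocket RPC check for verify-end-to-end-routing.sh.

# Send one eth_chainId frame over WebSocket if websocat is available; otherwise
# skip gracefully so the suite still passes without the optional tool.
ws_rpc_check() {
  local url="$1"
  if ! command -v websocat >/dev/null 2>&1; then
    echo "SKIP: websocat not installed; verify $url manually with wscat"
    return 0
  fi
  printf '%s\n' '{"jsonrpc":"2.0","method":"eth_chainId","params":[],"id":1}' \
    | websocat "$url"
}
```

A non-empty JSON reply proves the protocol upgrade and message round-trip, not just the TCP handshake.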
---

## Configuration Gaps

### 9. Environment Variable Documentation

**Missing**: A comprehensive `.env.example` file listing all required variables

**Required Variables** (from scripts):

```bash
# Cloudflare
CLOUDFLARE_API_TOKEN=
CLOUDFLARE_EMAIL=
CLOUDFLARE_API_KEY=
CLOUDFLARE_ZONE_ID_D_BIS_ORG=
CLOUDFLARE_ZONE_ID_MIM4U_ORG=
CLOUDFLARE_ZONE_ID_SANKOFA_NEXUS=
CLOUDFLARE_ZONE_ID_DEFI_ORACLE_IO=

# Public IP
PUBLIC_IP=76.53.10.36

# NPMplus
NPM_URL=https://192.168.11.166:81
NPM_EMAIL=nsatoshi2007@hotmail.com
NPM_PASSWORD=
NPM_PROXMOX_HOST=192.168.11.11
NPM_VMID=10233

# Proxmox Hosts (for testing)
PROXMOX_HOST_FOR_TEST=192.168.11.11
```

**Action Required**: Create a `.env.example` file in the project root with all required variables.

---
### 10. Script Dependencies Documentation

**Missing**: A list of required system dependencies

**Required Tools** (used across scripts):
- `bash` (4.0+)
- `curl` (for API calls)
- `jq` (for JSON parsing)
- `dig` (for DNS resolution)
- `openssl` (for SSL certificate inspection)
- `ssh` (for remote execution)
- `ss` (for port checking)
- `systemctl` (for service status)
- `sqlite3` (for database backup)

**Optional Tools**:
- `wscat` or `websocat` (for WebSocket testing)

**Action Required**:
- Add a dependencies section to `INGRESS_VERIFICATION_RUNBOOK.md`
- Create `scripts/verify/README.md` with installation instructions
- Add a dependency check function to `run-full-verification.sh`
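The dependency check in `run-full-verification.sh` could be as simple as this sketch, which reports every missing tool at once instead of failing on the first:

```bash
#!/usr/bin/env bash
# Sketch of a dependency gate for run-full-verification.sh.

check_dependencies() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing required tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage:
#   check_dependencies bash curl jq dig openssl ssh ss sqlite3 || exit 1
```

Optional tools (`wscat`, `websocat`) would be probed separately so their absence downgrades the WebSocket test to a skip rather than failing the run.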
---

## Data Completeness Gaps

### 11. Source of Truth JSON - Hardcoded Values

**Location**: `scripts/verify/generate-source-of-truth.sh` (lines 169-177)

**Current**: NPMplus container info is hardcoded:

```json
"container": {
  "vmid": 10233,
  "host": "r630-01",
  "host_ip": "192.168.11.11",
  "internal_ips": {
    "eth0": "192.168.11.166",
    "eth1": "192.168.11.167"
  },
  "management_ui": "https://192.168.11.166:81",
  "status": "running"
}
```

**Gap**: The status should be determined dynamically from verification results.

**Action Required**:
- Make the container status dynamic based on `export-npmplus-config.sh` results
- Verify the IP addresses are correct (especially `eth1`)
- Document whether `eth1` is actually used or is a placeholder
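Making the status dynamic could look like this jq sketch: the fixed fields mirror the hardcoded block above, while the status is injected from whatever the verification run reported:

```bash
#!/usr/bin/env bash
# Sketch: build the container block with a status taken from verification
# results instead of a hardcoded "running".

render_container_json() {
  local status="$1"   # e.g. parsed from export-npmplus-config.sh output
  jq -n --arg status "$status" '{
    vmid: 10233,
    host: "r630-01",
    host_ip: "192.168.11.11",
    management_ui: "https://192.168.11.166:81",
    status: $status
  }'
}
```

Passing the value through `--arg` keeps the JSON well-formed even if the upstream status string contains unexpected characters.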
---

### 12. DNS Verification - Zone ID Lookup

**Location**: `scripts/verify/export-cloudflare-dns-records.sh`

**Current**: Attempts to fetch zone IDs if they are not provided in `.env`, but falls back to an empty string.

**Potential Issue**: If the zone ID lookup fails and `.env` doesn't have zone IDs, the script will fail silently or skip zones.

**Action Required**:
- Add validation that zone IDs are set (either from `.env` or from the API lookup)
- Fail clearly if a zone ID cannot be determined
- Provide a helpful error message with instructions
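A validation wrapper could look like this (a sketch; it assumes the standard Cloudflare v4 `GET /zones?name=` lookup, and the function name is illustrative):

```bash
#!/usr/bin/env bash
# Sketch of zone-ID validation for export-cloudflare-dns-records.sh.

# Usage: resolve_zone_id CLOUDFLARE_ZONE_ID_D_BIS_ORG d-bis.org
# Prefer the .env variable; fall back to an API lookup; fail loudly otherwise.
resolve_zone_id() {
  local var="$1" zone="$2" id
  id=$(eval "printf '%s' \"\${$var:-}\"")
  if [ -z "$id" ] && [ -n "${CLOUDFLARE_API_TOKEN:-}" ]; then
    id=$(curl -s -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
      "https://api.cloudflare.com/client/v4/zones?name=$zone" \
      | jq -er '.result[0].id') || id=""
  fi
  if [ -z "$id" ]; then
    echo "ERROR: no zone ID for $zone; set $var in .env or check the API token" >&2
    return 1
  fi
  printf '%s\n' "$id"
}
```

Returning a non-zero status forces the caller to decide explicitly whether to abort or skip the zone, instead of silently exporting nothing.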
---

## Documentation Completeness

### 13. Missing Troubleshooting Sections

**Location**: `docs/04-configuration/INGRESS_VERIFICATION_RUNBOOK.md`

**Current**: A basic troubleshooting section exists (lines 427-468) but could be expanded.

**Missing Topics**:
- What to do if verification scripts fail partially
- How to interpret "unknown" status vs "needs-fix" status
- How to manually verify items that scripts can't automate
- Common Cloudflare API errors and solutions
- Common NPMplus API authentication issues
- SSH connection failures to Proxmox hosts

**Action Required**: Expand the troubleshooting section with more scenarios.

---
### 14. Missing Rollback Procedures

**Location**: `docs/04-configuration/SANKOFA_CUTOVER_PLAN.md`

**Current**: Basic rollback steps exist (lines 330-342) but could be more detailed.

**Missing**:
- An automated rollback script reference
- Exact commands to restore the previous NPMplus configuration
- How to verify the rollback was successful
- Recovery time expectations

**Action Required**:
- Create `scripts/verify/rollback-sankofa-routing.sh` (optional but recommended)
- Or expand the manual rollback steps with exact API calls

---
## Priority Summary

### 🔴 Critical (Must Fix Before Production Use)
1. ✅ **Create `scripts/verify/backup-npmplus.sh`** - Referenced but missing
2. ✅ **Resolve TBD nginx config paths** (VMID 10130, 2400) - Blocks verification
3. ✅ **Add file dependency validation** in `generate-source-of-truth.sh`

### 🟡 Important (Should Fix Soon)
4. **Add a `.env.example` file** with all required variables
5. **Add dependency checks** to verification scripts
6. **Expand service-specific health checks** for Besu, Node.js, Blockscout
7. **Document WebSocket testing limitations** or automate the tests

### 🟢 Nice to Have (Can Wait)
8. **Expand the troubleshooting section** with more scenarios
9. **Create a rollback script** for the Sankofa cutover
10. **Add a dependency installation guide** to the runbook
11. **Make container status dynamic** in source-of-truth generation

---

## Notes

- **Placeholders in examples**: Most "your-password" and "your-token" placeholders in documentation are intentional examples and acceptable, but they should clearly reference `.env` file usage.
- **Sankofa placeholders**: `<TARGET_IP>` and `<TARGET_PORT>` are expected placeholders until Sankofa services are deployed. These should be updated during cutover.
- **TBD config paths**: These need to be discovered by running verification and inspecting the actual VMs.

---
## Additional Items Completed

### 15. NPMplus High Availability (HA) Setup Guide ✅ ADDED

**Status**: ✅ **DOCUMENTATION COMPLETE** - Implementation pending

**Location**: `docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md`

**What Was Added**:
- Complete HA architecture guide (Active-Passive with Keepalived)
- Step-by-step implementation instructions (6 phases)
- Helper scripts: `sync-certificates.sh`, `monitor-ha-status.sh`
- Testing and validation procedures
- Troubleshooting guide
- Rollback plan
- Future upgrade path to Active-Active

**Scripts Created**:
- `scripts/npmplus/sync-certificates.sh` - Synchronize certificates from primary to secondary
- `scripts/npmplus/monitor-ha-status.sh` - Monitor HA status and send alerts

**Impact**: Eliminates the single point of failure for NPMplus and enables automatic failover.

---
## NPMplus HA Implementation Tasks

### Phase 1: Prepare Secondary NPMplus Instance

#### Task 1.1: Create Secondary NPMplus Container

**Status**: ⏳ **PENDING**

**Priority**: 🔴 **Critical**

**Estimated Time**: 30 minutes

**Actions Required**:
- [ ] Download the Alpine 3.22 template on r630-02
- [ ] Create container VMID 10234 with:
  - Hostname: `npmplus-secondary`
  - IP: `192.168.11.167/24`
  - Memory: 1024 MB
  - Cores: 2
  - Disk: 5 GB
  - Features: nesting=1, unprivileged=1
- [ ] Start the container and verify it is running
- [ ] Document container creation in the deployment log

**Commands**:

```bash
# On r630-02
CTID=10234
HOSTNAME="npmplus-secondary"
IP="192.168.11.167"
BRIDGE="vmbr0"

pveam download local alpine-3.22-default_20241208_amd64.tar.xz

pct create $CTID \
  local:vztmpl/alpine-3.22-default_20241208_amd64.tar.xz \
  --hostname $HOSTNAME \
  --memory 1024 \
  --cores 2 \
  --rootfs local-lvm:5 \
  --net0 name=eth0,bridge=$BRIDGE,ip=$IP/24,gw=192.168.11.1 \
  --unprivileged 1 \
  --features nesting=1

pct start $CTID
```

---
#### Task 1.2: Install NPMplus on Secondary Instance

**Status**: ⏳ **PENDING**

**Priority**: 🔴 **Critical**

**Estimated Time**: 45 minutes

**Actions Required**:
- [ ] SSH to r630-02 and enter the container
- [ ] Install dependencies: `tzdata`, `gawk`, `yq`, `docker`, `docker-compose`, `curl`, `bash`, `rsync`
- [ ] Start and enable the Docker service
- [ ] Download the NPMplus compose.yaml from GitHub
- [ ] Configure timezone: `America/New_York`
- [ ] Configure ACME email: `nsatoshi2007@hotmail.com`
- [ ] Start the NPMplus container (but don't configure it yet - configuration will be synced first)
- [ ] Wait for NPMplus to become healthy
- [ ] Retrieve the admin password and document it

**Commands**:

```bash
ssh root@192.168.11.12
pct exec 10234 -- ash

apk update
apk add --no-cache tzdata gawk yq docker docker-compose curl bash rsync

rc-service docker start
rc-update add docker default
sleep 5

cd /opt
curl -fsSL "https://raw.githubusercontent.com/ZoeyVid/NPMplus/refs/heads/develop/compose.yaml" -o compose.yaml

TZ="America/New_York"
ACME_EMAIL="nsatoshi2007@hotmail.com"

yq -i "
  .services.npmplus.environment |=
    (map(select(. != \"TZ=*\" and . != \"ACME_EMAIL=*\")) +
     [\"TZ=$TZ\", \"ACME_EMAIL=$ACME_EMAIL\"])
" compose.yaml

docker compose up -d
```

---
#### Task 1.3: Configure Secondary Container Network

**Status**: ⏳ **PENDING**

**Priority**: 🔴 **Critical**

**Estimated Time**: 10 minutes

**Actions Required**:
- [ ] Verify the static IP assignment: `192.168.11.167`
- [ ] Verify the gateway: `192.168.11.1`
- [ ] Test network connectivity to the primary host
- [ ] Test network connectivity to the backend VMs
- [ ] Document the network configuration

**Commands**:

```bash
pct exec 10234 -- ip addr show eth0
pct exec 10234 -- ping -c 3 192.168.11.11
pct exec 10234 -- ping -c 3 192.168.11.166
```

---
### Phase 2: Set Up Certificate Synchronization

#### Task 2.1: Create Certificate Sync Script

**Status**: ✅ **COMPLETED**

**Location**: `scripts/npmplus/sync-certificates.sh`

**Note**: The script has been created but still needs testing

**Actions Required**:
- [ ] Test the certificate sync script manually
- [ ] Verify certificates sync correctly
- [ ] Verify the script handles errors gracefully
- [ ] Document certificate paths for both primary and secondary
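The manual test reduces to two primitives, copy and count (a sketch; the certbot `live/` layout and `fullchain.pem` naming are assumptions about the NPMplus data volume, not the actual script):

```bash
#!/usr/bin/env bash
# Sketch of the two primitives behind testing sync-certificates.sh.

# Mirror a certificate tree; DEST may be local ("/path") or remote ("host:/path").
sync_certs() {
  rsync -az --delete "$1/" "$2/"
}

# Count leaf certificates so primary and secondary can be compared after a sync.
cert_count() {
  find "$1" -name 'fullchain.pem' 2>/dev/null | wc -l | tr -d ' '
}
```

After a sync, equal `cert_count` values on both sides is a quick sanity check before the deeper expiry-date comparison in Phase 6.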
---

#### Task 2.2: Set Up Automated Certificate Sync

**Status**: ⏳ **PENDING**

**Priority**: 🔴 **Critical**

**Estimated Time**: 15 minutes

**Actions Required**:
- [ ] Add a cron job on the primary Proxmox host (r630-01)
- [ ] Configure it to run every 5 minutes
- [ ] Set up log rotation for `/var/log/npmplus-cert-sync.log`
- [ ] Test the cron job execution
- [ ] Monitor the logs for successful syncs
- [ ] Verify the certificate count matches between primary and secondary

**Commands**:

```bash
# On r630-01
crontab -e

# Add:
*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/sync-certificates.sh >> /var/log/npmplus-cert-sync.log 2>&1

# Test manually first
bash /home/intlc/projects/proxmox/scripts/npmplus/sync-certificates.sh
```

---
### Phase 3: Set Up Keepalived for Virtual IP

#### Task 3.1: Install Keepalived on Proxmox Hosts

**Status**: ⏳ **PENDING**

**Priority**: 🔴 **Critical**

**Estimated Time**: 10 minutes

**Actions Required**:
- [ ] Install Keepalived on r630-01 (primary)
- [ ] Install Keepalived on r630-02 (secondary)
- [ ] Verify the Keepalived installation
- [ ] Check firewall rules for VRRP (IP protocol 112, multicast 224.0.0.18)

**Commands**:

```bash
# On both hosts
apt update
apt install -y keepalived

# Verify installation
keepalived --version
```

---
#### Task 3.2: Configure Keepalived on Primary Host (r630-01)

**Status**: ⏳ **PENDING**

**Priority**: 🔴 **Critical**

**Estimated Time**: 20 minutes

**Actions Required**:
- [ ] Create `/etc/keepalived/keepalived.conf` with the MASTER configuration
- [ ] Set `virtual_router_id`: 51
- [ ] Set `priority`: 110
- [ ] Configure `auth_pass` (use a secure password)
- [ ] Configure `virtual_ipaddress`: 192.168.11.166/24
- [ ] Reference the health check script path
- [ ] Reference the notification script path
- [ ] Verify the configuration syntax
- [ ] Document the Keepalived configuration

**Files to Create**:
- `/etc/keepalived/keepalived.conf` (see the HA guide for the full config)
- `/usr/local/bin/check-npmplus-health.sh` (Task 3.4)
- `/usr/local/bin/keepalived-notify.sh` (Task 3.5)
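For orientation, a MASTER-side `keepalived.conf` consistent with the values above might look like this (a sketch only; the authoritative config lives in the HA guide, and the interface name and password are placeholders):

```
vrrp_script check_npmplus {
    script "/usr/local/bin/check-npmplus-health.sh"
    interval 5
    fall 2
    rise 2
}

vrrp_instance VI_NPMPLUS {
    state MASTER
    interface vmbr0              # bridge carrying 192.168.11.0/24
    virtual_router_id 51
    priority 110                 # secondary uses 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass CHANGE_ME      # from .env; must match on both hosts
    }
    virtual_ipaddress {
        192.168.11.166/24
    }
    track_script {
        check_npmplus
    }
    notify /usr/local/bin/keepalived-notify.sh
}
```

The BACKUP side (Task 3.3) differs only in `state BACKUP` and `priority 100`.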
---

#### Task 3.3: Configure Keepalived on Secondary Host (r630-02)

**Status**: ⏳ **PENDING**

**Priority**: 🔴 **Critical**

**Estimated Time**: 20 minutes

**Actions Required**:
- [ ] Create `/etc/keepalived/keepalived.conf` with the BACKUP configuration
- [ ] Set `virtual_router_id`: 51 (must match the primary)
- [ ] Set `priority`: 100 (lower than the primary)
- [ ] Configure `auth_pass` (must match the primary)
- [ ] Configure `virtual_ipaddress`: 192.168.11.166/24
- [ ] Reference the health check script path
- [ ] Reference the notification script path
- [ ] Verify the configuration syntax
- [ ] Document the Keepalived configuration

**Files to Create**:
- `/etc/keepalived/keepalived.conf` (see the HA guide for the full config)
- `/usr/local/bin/check-npmplus-health.sh` (Task 3.4)
- `/usr/local/bin/keepalived-notify.sh` (Task 3.5)

---
#### Task 3.4: Create Health Check Script

**Status**: ⏳ **PENDING**

**Priority**: 🔴 **Critical**

**Estimated Time**: 30 minutes

**Actions Required**:
- [ ] Create `/usr/local/bin/check-npmplus-health.sh` on both hosts
- [ ] The script should:
  - Detect the hostname to determine which VMID to check
  - Check that the container is running
  - Check that the NPMplus Docker container is healthy
  - Check that the NPMplus web interface responds (port 81)
  - Return exit code 0 if healthy, 1 if unhealthy
- [ ] Make the script executable: `chmod +x`
- [ ] Test the script manually on both hosts
- [ ] Verify the script detects failures correctly

**File**: `/usr/local/bin/check-npmplus-health.sh`

**Details**: See the HA guide for the full script content
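The hostname-to-VMID dispatch at the heart of that script can be sketched like this (VMIDs per this plan; the `pct` probes show the expected shape, not the final script):

```bash
#!/usr/bin/env bash
# Sketch of /usr/local/bin/check-npmplus-health.sh.

# Map a Proxmox hostname to the NPMplus container it should watch.
vmid_for_host() {
  case "$1" in
    r630-01) echo 10233 ;;
    r630-02) echo 10234 ;;
    *) return 1 ;;
  esac
}

check_npmplus_health() {
  local vmid
  vmid=$(vmid_for_host "$(hostname)") || { echo "unknown host" >&2; return 1; }
  # Container running?
  pct status "$vmid" | grep -q running || return 1
  # NPMplus web UI answering on port 81? (self-signed cert, so skip verification)
  pct exec "$vmid" -- wget -q -O /dev/null --no-check-certificate \
    https://127.0.0.1:81 || return 1
  return 0
}
```

Keepalived only looks at the exit code, so every probe funnels into return 0 or 1.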
---

#### Task 3.5: Create Keepalived Notification Script

**Status**: ⏳ **PENDING**

**Priority**: 🟡 **Important**

**Estimated Time**: 15 minutes

**Actions Required**:
- [ ] Create `/usr/local/bin/keepalived-notify.sh` on both hosts
- [ ] The script should handle the states: master, backup, fault
- [ ] Log state changes to `/var/log/keepalived-notify.log`
- [ ] Optional: Send alerts (email, webhook) on the fault state
- [ ] Make the script executable: `chmod +x`
- [ ] Test the script with each state manually

**File**: `/usr/local/bin/keepalived-notify.sh`

**Details**: See the HA guide for the full script content
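Keepalived invokes the notify script with `TYPE NAME STATE` arguments; the logging requirement above can be sketched as (the `NOTIFY_LOG` override is for testing, an assumption of this sketch):

```bash
#!/usr/bin/env bash
# Sketch of /usr/local/bin/keepalived-notify.sh.
# Keepalived calls it as: keepalived-notify.sh <TYPE> <NAME> <STATE>

log_transition() {
  local type="$1" name="$2" state="$3"
  local log="${NOTIFY_LOG:-/var/log/keepalived-notify.log}"
  printf '%s %s %s -> %s\n' "$(date '+%F %T')" "$type" "$name" "$state" >> "$log"
  case "$state" in
    MASTER|BACKUP) ;;                                 # normal transition: log only
    FAULT) echo "ALERT: $name entered FAULT" >&2 ;;   # hook email/webhook here
    *) echo "unexpected state: $state" >&2; return 1 ;;
  esac
}
```

The real script would end with `log_transition "$1" "$2" "$3"`; testing each state manually is then just invoking it with MASTER, BACKUP, and FAULT.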
#### Task 3.6: Start and Enable Keepalived

**Status**: ⏳ **PENDING**

**Priority**: 🔴 **Critical**

**Estimated Time**: 15 minutes

**Actions Required**:
- [ ] Enable the Keepalived service on both hosts
- [ ] Start Keepalived on both hosts
- [ ] Verify Keepalived is running
- [ ] Verify the primary host owns the VIP (192.168.11.166)
- [ ] Verify the secondary host is in the BACKUP state
- [ ] Monitor the Keepalived logs for any errors
- [ ] Document the VIP ownership verification

**Commands**:

```bash
# On both hosts
systemctl enable keepalived
systemctl start keepalived

# Verify status
systemctl status keepalived

# Check VIP ownership (should be on the primary)
ip addr show vmbr0 | grep 192.168.11.166

# Check logs
journalctl -u keepalived -f
```

---
### Phase 4: Sync Configuration to Secondary

#### Task 4.1: Export Primary Configuration

**Status**: ⏳ **PENDING**

**Priority**: 🔴 **Critical**

**Estimated Time**: 30 minutes

**Actions Required**:
- [ ] Create the export script: `scripts/npmplus/export-primary-config.sh`
- [ ] Export the NPMplus SQLite database to an SQL dump
- [ ] Export proxy hosts via API (JSON)
- [ ] Export certificates via API (JSON)
- [ ] Create a timestamped backup directory
- [ ] Verify all exports completed successfully
- [ ] Document the backup location and contents

**Script to Create**: `scripts/npmplus/export-primary-config.sh`

**Details**: See the HA guide for the full script content
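The database step of the export can be sketched as follows (the database path and output layout are assumptions of this sketch):

```bash
#!/usr/bin/env bash
# Sketch of the database step of scripts/npmplus/export-primary-config.sh.
set -euo pipefail

export_npmplus_db() {
  local db="$1" out_dir="$2"
  local stamp dest
  stamp=$(date +%Y%m%d-%H%M%S)
  dest="$out_dir/npmplus-export-$stamp"
  mkdir -p "$dest"
  # .dump produces a portable SQL script that the secondary can replay
  sqlite3 "$db" .dump > "$dest/database.sql"
  echo "$dest"
}

# Replaying on the secondary (Task 4.2) would then be roughly:
#   sqlite3 /path/to/new/database.sqlite < database.sql
```

Dumping to SQL rather than copying the raw file sidesteps SQLite version and page-size differences between the two containers.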
---

#### Task 4.2: Import Configuration to Secondary

**Status**: ⏳ **PENDING**

**Priority**: 🔴 **Critical**

**Estimated Time**: 45 minutes

**Actions Required**:
- [ ] Create the import script: `scripts/npmplus/import-secondary-config.sh`
- [ ] Stop the NPMplus container on the secondary (if running)
- [ ] Copy the database SQL dump to the secondary
- [ ] Import the database dump into the secondary NPMplus
- [ ] Restart the NPMplus container on the secondary
- [ ] Wait for NPMplus to become healthy
- [ ] Verify proxy hosts are configured
- [ ] Verify certificates are accessible
- [ ] Document any manual configuration steps needed

**Script to Create**: `scripts/npmplus/import-secondary-config.sh`

**Details**: See the HA guide for the full script content

**Note**: Some configuration may need manual replication via API or UI.

---
### Phase 5: Set Up Ongoing Configuration Sync

#### Task 5.1: Create Configuration Sync Script

**Status**: ⏳ **PENDING**

**Priority**: 🟡 **Important**

**Estimated Time**: 45 minutes

**Actions Required**:
- [ ] Create the sync script: `scripts/npmplus/sync-config.sh`
- [ ] Authenticate to the NPMplus API (primary)
- [ ] Export the proxy hosts configuration
- [ ] Implement API-based sync or document a manual sync process
- [ ] Add the script to automation (if automated sync is possible)
- [ ] Document manual sync procedures for configuration changes

**Script to Create**: `scripts/npmplus/sync-config.sh`

**Note**: Fully automated sync would require a shared database or a complex API-based sync; for now, manual sync may be required.

---
### Phase 6: Testing and Validation

#### Task 6.1: Test Virtual IP Failover

**Status**: ⏳ **PENDING**

**Priority**: 🔴 **Critical**

**Estimated Time**: 30 minutes

**Actions Required**:
- [ ] Verify the primary owns the VIP before the test
- [ ] Simulate a primary failure (stop Keepalived or the NPMplus container)
- [ ] Verify the VIP moves to the secondary within 5-10 seconds
- [ ] Test connectivity to the VIP from an external source
- [ ] Restore the primary and verify failback
- [ ] Document the failover time (should be < 10 seconds)
- [ ] Test multiple failover scenarios
- [ ] Document the test results

**Test Scenarios**:
1. Stop Keepalived on the primary
2. Stop the NPMplus container on the primary
3. Stop the entire Proxmox host (if possible in a test environment)
4. Network partition (if possible in a test environment)

---
#### Task 6.2: Test Certificate Access

**Status**: ⏳ **PENDING**

**Priority**: 🔴 **Critical**

**Estimated Time**: 30 minutes

**Actions Required**:
- [ ] Verify certificates exist on the secondary (after sync)
- [ ] Test an SSL endpoint from outside: `curl -vI https://explorer.d-bis.org`
- [ ] Verify the certificate is valid and trusted
- [ ] Test multiple domains with SSL
- [ ] Verify the certificate expiration dates match
- [ ] Test certificate auto-renewal on the secondary (when the primary renews)
- [ ] Document the certificate test results

**Commands**:

```bash
# Verify certificates on the secondary
ssh root@192.168.11.12 "pct exec 10234 -- ls -la /var/lib/docker/volumes/npmplus_data/_data/tls/certbot/live/"

# Test SSL endpoints
curl -vI https://explorer.d-bis.org
curl -vI https://mim4u.org
curl -vI https://rpc-http-pub.d-bis.org
```

---
#### Task 6.3: Test Proxy Host Functionality

**Status**: ⏳ **PENDING**

**Priority**: 🔴 **Critical**

**Estimated Time**: 45 minutes

**Actions Required**:
- [ ] Test each domain from outside after failover
- [ ] Verify HTTP-to-HTTPS redirects work
- [ ] Verify WebSocket connections work (for the RPC endpoints)
- [ ] Verify API endpoints respond correctly
- [ ] Test all 19+ domains
- [ ] Document any domains that don't work correctly
- [ ] Test with the secondary as the active instance
- [ ] Test failback to the primary

**Test Domains**:
- All d-bis.org domains (9 domains)
- All mim4u.org domains (4 domains)
- All sankofa.nexus domains (5 domains)
- The defi-oracle.io domain (1 domain)

---
### Monitoring and Maintenance

#### Task 7.1: Set Up HA Status Monitoring

**Status**: ✅ **COMPLETED** (script created, needs deployment)

**Priority**: 🟡 **Important**

**Location**: `scripts/npmplus/monitor-ha-status.sh`

**Actions Required**:
- [ ] Add a cron job for HA status monitoring (every 5 minutes)
- [ ] Configure log rotation for `/var/log/npmplus-ha-monitor.log`
- [ ] Test the monitoring script manually
- [ ] Optional: Integrate with an alerting system (email, webhook)
- [ ] Document alert thresholds and escalation procedures
- [ ] Test alert generation

**Commands**:

```bash
# On the primary Proxmox host
crontab -e

# Add:
*/5 * * * * /home/intlc/projects/proxmox/scripts/npmplus/monitor-ha-status.sh >> /var/log/npmplus-ha-monitor.log 2>&1
```

---
#### Task 7.2: Document Manual Failover Procedures

**Status**: ⏳ **PENDING**

**Priority**: 🟡 **Important**

**Estimated Time**: 30 minutes

**Actions Required**:
- [ ] Document a step-by-step manual failover procedure
- [ ] Document how to force failover to the secondary
- [ ] Document how to force failback to the primary
- [ ] Document troubleshooting steps for common issues
- [ ] Create a runbook for the operations team
- [ ] Test the manual failover procedures
- [ ] Review and approve the documentation

**Location**: Add to the troubleshooting section of `docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md`

---
#### Task 7.3: Test All Failover Scenarios

**Status**: ⏳ **PENDING**

**Priority**: 🟡 **Important**

**Estimated Time**: 2 hours

**Actions Required**:
- [ ] Test automatic failover (primary failure)
- [ ] Test automatic failback (primary recovery)
- [ ] Test manual failover (force to the secondary)
- [ ] Test manual failback (force to the primary)
- [ ] Test partial failure (Keepalived down but NPMplus up)
- [ ] Test network partition scenarios
- [ ] Test during high traffic (if possible)
- [ ] Document all test results
- [ ] Identify and fix any issues found

---
## HA Implementation Summary

### Total Estimated Time
- **Phase 1**: 1.5 hours (container creation and NPMplus installation)
- **Phase 2**: 30 minutes (certificate sync setup)
- **Phase 3**: 2 hours (Keepalived configuration and scripts)
- **Phase 4**: 1.5 hours (configuration export/import)
- **Phase 5**: 45 minutes (ongoing sync setup)
- **Phase 6**: 2 hours (testing and validation)
- **Monitoring**: 1 hour (monitoring setup and documentation)

**Total**: ~9 hours of implementation time

### Prerequisites Checklist
- [ ] Secondary Proxmox host available (r630-02 or ml110)
- [ ] Network connectivity between hosts verified
- [ ] Sufficient resources on the secondary host (1 GB RAM, 5 GB disk, 2 CPU cores)
- [ ] SSH access configured between hosts (key-based auth recommended)
- [ ] Maintenance window scheduled
- [ ] Backup of the primary NPMplus completed
- [ ] Team notified of the maintenance window

### Risk Mitigation
- [ ] Rollback plan documented and tested
- [ ] Primary NPMplus backup verified before changes
- [ ] Test environment available (if possible)
- [ ] Monitoring in place before production deployment
- [ ] Emergency contact list available

---

**Last Updated**: 2026-01-20

**Next Review**: After addressing critical items