Sankofa/docs/deployment/DEPLOYMENT_NEXT_STEPS.md


# Deployment Next Steps
**Date**: 2025-12-09
**Status**: ⚠️ **LOCK ISSUE - MANUAL RESOLUTION REQUIRED**
---
## Current Situation
### ✅ Completed
1. **Provider Configuration**: ✅ Verified and working
2. **VM Resource Created**: ✅ basic-vm-001 (VMID 100)
3. **Deployment Initiated**: ✅ VM created in Proxmox
### ⚠️ Blocking Issue
**VM Lock Timeout**: Configuration update blocked by Proxmox lock file
**Error**: `can't lock file '/var/lock/qemu-server/lock-100.conf' - got timeout`
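When scanning provider logs for this failure, it can help to pull the affected VMID out of the message. The following is a minimal sketch that assumes the error text matches the line quoted above; the function name `lock_timeout_vmid` is hypothetical and not part of the provider.

```bash
# Extract the VMID from a Proxmox lock-timeout message.
# Prints the VMID and returns 0 on a match; returns 1 otherwise.
# Assumption: the message has the exact shape quoted in this document.
lock_timeout_vmid() {
  local msg="$1"
  local re="can't lock file '/var/lock/qemu-server/lock-([0-9]+)\.conf' - got timeout"
  if [[ "$msg" =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
    return 0
  fi
  return 1
}

# Example (not executed here):
#   lock_timeout_vmid "$(kubectl logs ... | grep 'got timeout' | tail -n 1)"
```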
---
## Immediate Action Required
### Step 1: Resolve Lock on Proxmox Node
**Access the Proxmox node and clear the lock:**
```bash
# Connect to Proxmox node (replace with actual IP/hostname)
ssh root@<proxmox-node-ip>
# Check VM status
qm status 100
# Unlock the VM
qm unlock 100
# If unlock doesn't work, remove lock file
rm -f /var/lock/qemu-server/lock-100.conf
# Verify lock is cleared
ls -la /var/lock/qemu-server/lock-100.conf
```
**Note**: If you don't have direct SSH access, reach the node another way:
- Use the node shell in the Proxmox web UI
- Access the node via its console
- Use any other available method to reach the node
### Step 2: Verify Image Availability
**While on the Proxmox node, verify the image exists:**
```bash
# Check for image
find /var/lib/vz/template/iso -name "ubuntu-22.04-cloud.img"
pvesm list local-lvm | grep ubuntu-22.04-cloud
# If missing, download it
cd /var/lib/vz/template/iso
wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img
mv jammy-server-cloudimg-amd64.img ubuntu-22.04-cloud.img
```
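Because the image is renamed after download, `sha256sum -c` cannot match it by name against Ubuntu's published `SHA256SUMS` file. The sketch below looks up the checksum under the upstream file name and compares it with the local file. The function name `verify_image_checksum` is hypothetical, and the check assumes `SHA256SUMS` was fetched from the same `jammy/current/` directory at roughly the same time as the image, since `current` is updated periodically.

```bash
# verify_image_checksum IMAGE SUMS_FILE UPSTREAM_NAME
# Returns 0 if IMAGE matches the hash listed for UPSTREAM_NAME,
# 1 on mismatch, 2 if SUMS_FILE has no entry for UPSTREAM_NAME.
verify_image_checksum() {
  local image="$1" sums="$2" upstream_name="$3"
  local expected actual
  # SHA256SUMS lines look like "<hash> *<name>" or "<hash>  <name>".
  expected=$(awk -v n="$upstream_name" '
    { f = $2; sub(/^\*/, "", f); if (f == n) { print $1; exit } }' "$sums")
  if [ -z "$expected" ]; then
    echo "no checksum entry for $upstream_name" >&2
    return 2
  fi
  actual=$(sha256sum "$image" | awk '{ print $1 }')
  [ "$expected" = "$actual" ]
}

# Example (not executed here):
#   wget https://cloud-images.ubuntu.com/jammy/current/SHA256SUMS
#   verify_image_checksum ubuntu-22.04-cloud.img SHA256SUMS \
#     jammy-server-cloudimg-amd64.img && echo "checksum OK"
```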
### Step 3: Monitor Automatic Retry
**After clearing the lock, the provider will automatically retry:**
```bash
# Watch VM status
kubectl get proxmoxvm basic-vm-001 -w
# Watch provider logs
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=50 -f
```
**Expected Timeline**: 1-5 minutes after the lock is cleared
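Instead of watching by hand, the wait can be bounded with a timeout that matches the expected timeline. This is a minimal sketch: `wait_until` and `vm_is_ready` are hypothetical helper names, and the JSONPath assumes the resource exposes a standard `Ready` entry under `.status.conditions`, which this document implies but does not spell out.

```bash
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds; returns 1 once the timeout has passed.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval"
  done
}

# Succeeds when the Ready condition of the given ProxmoxVM is "True".
# Assumption: the condition lives at .status.conditions[type=Ready].
vm_is_ready() {
  local status
  status=$(kubectl get proxmoxvm "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)
  [ "$status" = "True" ]
}

# Example (not executed here): poll every 10 seconds for up to 5 minutes.
#   wait_until 300 10 vm_is_ready basic-vm-001 || echo "VM not ready in time"
```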
---
## After Lock Resolution
### Expected Sequence
1. **Provider retries** configuration update (automatic)
2. **VM configuration** completes successfully
3. **Image import** (if needed) completes
4. **Boot order** set correctly
5. **Cloud-init** configured
6. **VM boots** successfully
7. **VM reaches "running" state**
8. **IP address assigned**
9. **Ready condition becomes "True"**
### Verification Steps
Once the VM is running:
```bash
# Get VM IP
IP=$(kubectl get proxmoxvm basic-vm-001 -o jsonpath='{.status.networkInterfaces[0].ipAddress}')
# Check cloud-init logs
ssh "admin@$IP" "tail -n 50 /var/log/cloud-init-output.log"
# Verify services
ssh "admin@$IP" "systemctl status qemu-guest-agent chrony unattended-upgrades"
# Test SSH access
ssh "admin@$IP" "hostname && uptime"
```
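If the VM has not reported an address yet, the JSONPath query returns an empty string and the `ssh` commands fail with a confusing error. A small guard can catch that first. This is a sketch with a hypothetical helper name, `is_ipv4`, and it assumes the reported address is IPv4.

```bash
# is_ipv4 STRING
# Returns 0 if STRING is a dotted-quad IPv4 address with octets in 0-255.
is_ipv4() {
  local ip="$1" octet
  local -a octets
  [[ "$ip" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]] || return 1
  IFS=. read -r -a octets <<< "$ip"
  for octet in "${octets[@]}"; do
    # 10# forces base 10 so that octets like "08" are not read as octal.
    (( 10#$octet <= 255 )) || return 1
  done
  return 0
}

# Example (not executed here):
#   is_ipv4 "$IP" || { echo "VM has no IPv4 address yet" >&2; exit 1; }
```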
---
## If Lock Resolution Fails
### Alternative: Delete and Redeploy
If the lock cannot be cleared:
```bash
# 1. Delete Kubernetes resource
kubectl delete proxmoxvm basic-vm-001
# 2. On Proxmox node, force delete VM
ssh root@<proxmox-node> "qm destroy 100 --purge --skiplock"
# 3. Clean up locks
ssh root@<proxmox-node> "rm -f /var/lock/qemu-server/lock-100.conf"
# 4. Wait for cleanup
sleep 10
# 5. Redeploy
kubectl apply -f examples/production/basic-vm.yaml
```
---
## Long-term Solutions
### 1. Code Enhancement
**Add lock handling to provider code:**
- Detect lock errors in `UpdateVM`
- Automatically call `qm unlock` before retry
- Increase timeout for lock operations
- Add exponential backoff for lock retries
**File**: `crossplane-provider-proxmox/pkg/proxmox/client.go`
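The retry behaviour proposed above can be illustrated in shell. This is only a sketch of the backoff logic, not the provider's code, which lives in the Go file named above; `retry_with_backoff` is a hypothetical name, and the doubling factor is one common choice rather than a requirement.

```bash
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after each failed attempt.
retry_with_backoff() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if (( attempt >= max_attempts )); then
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Example (not executed here): up to 5 attempts, waiting 2, 4, 8, 16 seconds.
#   retry_with_backoff 5 2 qm unlock 100
```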
### 2. Pre-deployment Checks
**Add validation before VM creation:**
- Check for existing locks on target node
- Verify no conflicting operations
- Ensure Proxmox node is healthy
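A lock check of this kind could be sketched as follows, to be run on the Proxmox node. It assumes the lock is an advisory `flock` held on the per-VM file under `/var/lock/qemu-server`, so the mere presence of the file is not treated as a held lock. The function name `vm_lock_is_free` is hypothetical, and `flock` here is the util-linux command.

```bash
# vm_lock_is_free VMID [LOCK_DIR]
# Returns 0 if no process currently holds the lock for VMID, 1 if it is held.
# Assumption: the lock is an advisory flock on LOCK_DIR/lock-VMID.conf.
vm_lock_is_free() {
  local vmid="$1" lock_dir="${2:-/var/lock/qemu-server}"
  local lock_file="$lock_dir/lock-$vmid.conf"
  # A missing lock file means nothing can be holding it.
  [ -e "$lock_file" ] || return 0
  # flock -n fails immediately if another process holds the lock;
  # on success the lock is released again as soon as "true" exits.
  flock -n "$lock_file" true
}

# Example (not executed here):
#   vm_lock_is_free 100 || echo "VM 100 is locked by another operation"
```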
### 3. Deployment Strategy
**For full deployment:**
- Deploy VMs sequentially (not in parallel)
- Add delays between deployments (30-60 seconds)
- Monitor each deployment before proceeding
- Implement retry logic with lock handling
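The sequential strategy can be sketched as a loop that applies one manifest at a time and pauses between them. `deploy_sequentially` and the `DEPLOY_CMD` override are hypothetical names introduced for illustration; the default command is the `kubectl apply -f` call already used in this document.

```bash
# deploy_sequentially DELAY_SECONDS MANIFEST...
# Applies each manifest in order, sleeping DELAY_SECONDS between them,
# and stops at the first failure. Set DEPLOY_CMD=echo for a dry run.
deploy_sequentially() {
  local delay="$1"
  shift
  local manifest first=1
  for manifest in "$@"; do
    if (( first )); then
      first=0
    else
      sleep "$delay"
    fi
    # Intentionally unquoted so "kubectl apply -f" splits into words.
    ${DEPLOY_CMD:-kubectl apply -f} "$manifest" || return 1
  done
}

# Example (not executed here): Phase 1 with a 45-second gap.
#   deploy_sequentially 45 nginx-proxy-vm.yaml cloudflare-tunnel-vm.yaml
```

A readiness check between iterations could replace the fixed delay so that each VM is confirmed healthy before the next one starts.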
---
## Full Deployment Plan (After Test Success)
### Phase 1: Infrastructure (2 VMs)
1. nginx-proxy-vm.yaml
2. cloudflare-tunnel-vm.yaml
### Phase 2: SMOM-DBIS-138 Core (8 VMs)
3-6. validator-01 through validator-04
7-10. sentry-01 through sentry-04
### Phase 3: SMOM-DBIS-138 Services (8 VMs)
11-14. rpc-node-01 through rpc-node-04
15. services.yaml
16. blockscout.yaml
17. monitoring.yaml
18. management.yaml
### Phase 4: Phoenix VMs (8 VMs)
19-26. All Phoenix VMs
### Phase 5: Template VMs (2 VMs - Optional)
27. medium-vm.yaml
28. large-vm.yaml
**Total**: 28 additional VMs after the test VM
---
## Summary
### Current Status
- ✅ Provider: Working
- ✅ VM Created: Yes (VMID 100)
- ⚠️ Configuration: Blocked by lock
- ⚠️ State: Stopped
### Required Action
**Manual lock resolution on Proxmox node**
### After Resolution
- Provider will automatically retry
- VM should complete configuration
- VM should boot successfully
- Full deployment can proceed
---
**Last Updated**: 2025-12-09
**Status**: ⚠️ **WAITING FOR MANUAL LOCK RESOLUTION**