# Backup and Recreation Plan
**Date:** January 7, 2026
**Status:** 📋 **PLAN READY FOR IMPLEMENTATION**

---
## Executive Summary
This document outlines a comprehensive plan for:
1. **Setting up automated backups** to prevent future data loss
2. **Recreating lost containers** from their configurations
3. **Restoring data** from backups if available
4. **Best practices** for ongoing backup management
---
## Current Situation
### Containers Status
**Containers with Data ✅ (7 containers):**
- 100, 101, 102, 103, 104, 105, 130 (migrated from r630-02)
**Containers Without Data ❌ (~28 containers):**
- 106, 107, 108 (empty volumes)
- 3000-10151 (empty volumes)
### Data Loss Summary
- **Lost During:** RAID 10 expansion (4→6 disks)
- **Cause:** RAID recreation wiped all data structures
- **Recovery:** Not possible from thin1 (data overwritten)
- **Solution:** Restore from backups or recreate from templates
---
## Part 1: Automated Backup Setup
### Objective
Set up automated daily backups for all containers/VMs to prevent future data loss.
### Implementation
#### Step 1: Create Backup Script
**Script:** `scripts/setup-automated-backups.sh`
**Features:**
- Daily backups at 2 AM
- Snapshot mode (no downtime)
- Gzip compression
- Automatic cleanup (keep 7 days)
- Logging to `/var/log/proxmox-backups/`
**Usage:**
```bash
./scripts/setup-automated-backups.sh
```
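The setup script itself is not reproduced in this plan. As a rough sketch of the kind of cron entry it might install, assuming it drives `vzdump` and writes to the `local` storage, the snippet below prints one candidate line. The `cron_line` helper is hypothetical, and `keep-last=7` only approximates "keep 7 days" when exactly one backup runs per day.

```bash
#!/usr/bin/env bash
# Sketch only: prints a cron entry that a setup script might install.
# Assumption: backups are taken with vzdump into the `local` storage.

LOG_DIR=/var/log/proxmox-backups   # log directory named in this plan

cron_line() {
  # "0 2 * * *" runs every day at 02:00; "root" is the user field used in /etc/cron.d files.
  # Percent signs are escaped (\%) because cron treats a bare % as a newline.
  printf '%s\n' "0 2 * * * root vzdump --all 1 --mode snapshot --compress gzip --storage local --prune-backups keep-last=7 >> ${LOG_DIR}/backup_\$(date +\\%Y\\%m\\%d).log 2>&1"
}

cron_line
```

Printing the line rather than installing it keeps the sketch safe to run anywhere; the output could be reviewed and then placed in a file under `/etc/cron.d/`.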
#### Step 2: Manual Backup Script
**Script:** `/usr/local/bin/manual-backup.sh` (created on r630-01)
**Usage:**
```bash
# Backup specific containers
ssh root@192.168.11.11 "/usr/local/bin/manual-backup.sh 106 107 108"
# Backup all running containers
ssh root@192.168.11.11 "pct list | awk 'NR>1 && \$2==\"running\" {print \$1}' | xargs /usr/local/bin/manual-backup.sh"
```
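The contents of `manual-backup.sh` are not shown here, so the following is only a hypothetical sketch of what such a wrapper could look like, assuming it calls `vzdump` once per VMID. The function names and the `DRY_RUN` switch are illustrative, not taken from the real script.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a manual backup wrapper; the real script on r630-01 may differ.
# With DRY_RUN=1 (the default here) commands are printed instead of executed.

DRY_RUN="${DRY_RUN:-1}"

backup_one() {
  local vmid="$1"
  # Reject anything that is not a plain numeric VMID.
  if [[ ! "$vmid" =~ ^[0-9]+$ ]]; then
    echo "skip: '$vmid' is not a numeric VMID" >&2
    return 1
  fi
  local cmd=(vzdump "$vmid" --mode snapshot --compress gzip --storage local)
  if [[ "$DRY_RUN" == "1" ]]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

backup_many() {
  local failed=0 vmid
  for vmid in "$@"; do
    backup_one "$vmid" || failed=$((failed + 1))
  done
  # Succeed only if every requested backup succeeded.
  [[ "$failed" -eq 0 ]]
}

backup_many 106 107 108
```

Validating each argument before it reaches `vzdump` matters here because the IDs arrive through `ssh` and `xargs`, where a stray token is easy to pass by accident.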
#### Step 3: Backup Storage
**Current Storage:**
- `local` storage: `/var/lib/vz/dump/` (directory storage)
- Capacity: ~536GB available
**Backup Location:**
- `/var/lib/vz/dump/vzdump-lxc-<vmid>-<timestamp>.tar.gz`
- `/var/lib/vz/dump/vzdump-qemu-<vmid>-<timestamp>.vma.gz`
#### Step 4: Backup Schedule
**Automated:**
- **Frequency:** Daily at 2:00 AM
- **Mode:** Snapshot (no downtime)
- **Compression:** Gzip
- **Retention:** 7 days
**Manual:**
- Run anytime using `/usr/local/bin/manual-backup.sh`
---
## Part 2: Container Recreation Plan
### Objective
Recreate containers that lost data, restoring them to a working state.
### Approach
#### Option A: Restore from Backups (If Available)
**Steps:**
1. Check for backups:
```bash
find /var/lib/vz/dump -name "vzdump-*-106-*" -o -name "vzdump-*-107-*" -o -name "vzdump-*-108-*"
```
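2. Restore container (the VMID must be free: destroy the empty container first with `pct destroy <vmid>`, or pass `--force` to overwrite it):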
```bash
pct restore <vmid> <backup_file> --storage thin1
```
3. Start container:
```bash
pct start <vmid>
```
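When several archives exist for one container, `<backup_file>` has to name exactly one of them. A minimal sketch of picking the newest, assuming the default vzdump naming with a `YYYY_MM_DD-HH_MM_SS` timestamp in the file name (the `latest_backup` helper is hypothetical):

```bash
#!/usr/bin/env bash
# Sketch: print the newest vzdump archive for one container, or fail if none exists.

latest_backup() {
  local dir="$1" vmid="$2"
  local newest="" f
  # Assumption: file names embed a sortable timestamp, so the
  # lexicographically last match is also the most recent one.
  for f in "$dir"/vzdump-lxc-"$vmid"-*.tar.gz; do
    [[ -e "$f" ]] || continue      # the glob matched nothing
    [[ "$f" > "$newest" ]] && newest="$f"
  done
  [[ -n "$newest" ]] && printf '%s\n' "$newest"
}

# Example: only print the restore command, do not run it.
if backup="$(latest_backup /var/lib/vz/dump 106)"; then
  echo "pct restore 106 $backup --storage thin1"
else
  echo "no backup found for 106"
fi
```

The trailing `-` after the VMID in the glob keeps container 106 from matching archives that belong to, say, 1060.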
#### Option B: Recreate from Templates (If No Backups)
**Steps:**
1. Use recreation script:
```bash
./scripts/recreate-containers-from-configs.sh 106 107 108
```
2. Script will:
- Read container configuration
- Destroy empty container
- Recreate from template
- Restore configuration
- Create volume on correct storage
3. Manual recreation:
```bash
# Refresh the template index, then download the template if needed
pveam update
pveam download local ubuntu-22.04-standard_22.04-1_amd64.tar.zst
# Recreate container (rootfs size is given in GiB, without a unit suffix)
pct create <vmid> /var/lib/vz/template/cache/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
--storage thin1 --rootfs thin1:10 \
--hostname <hostname> \
--memory <memory> --swap <swap> --cores <cores> \
--net0 name=eth0,bridge=vmbr0,ip=<ip>/24
```
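The "read container configuration" step can be illustrated with a small parser. This is a sketch under the assumption that the configuration was saved in the `key: value` format printed by `pct config <vmid>`, and that only `hostname`, `memory`, `swap` and `cores` are carried over; `config_to_flags` is a hypothetical helper, not part of the recreation script, and the numeric values in the sample are made up.

```bash
#!/usr/bin/env bash
# Sketch: turn saved `pct config` output (key: value lines) into pct create flags.

config_to_flags() {
  local line key value
  local flags=()
  while IFS= read -r line; do
    key="${line%%:*}"      # text before the first colon
    value="${line#*: }"    # text after the first ": "
    case "$key" in
      hostname|memory|swap|cores) flags+=("--$key" "$value") ;;
    esac
  done
  echo "${flags[*]}"
}

# Example input with illustrative values.
config_to_flags <<'EOF'
arch: amd64
cores: 2
hostname: redis-rpc-translator
memory: 2048
swap: 512
EOF
```

Keys outside the allow-list (for example `rootfs` or `net0`) are ignored on purpose, since they reference volumes and addresses that should be reviewed by hand before recreation.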
### Container Recreation Priority
**High Priority (Critical Services):**
1. **106** - redis-rpc-translator
2. **107** - web3signer-rpc-translator
3. **108** - vault-rpc-translator
4. **3000-3003** - ml110 containers
5. **3500** - oracle-publisher-1
6. **3501** - ccip-monitor-1
**Medium Priority:**
7. **5200** - cacti-1
8. **6000** - fabric-1
9. **6400** - indy-1
10. **10100-10151** - dbis containers
**Lower Priority:**
11. **10000-10092** - order containers
12. **10200-10230** - monitoring containers
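
The ranges in this list can be expanded into individual VMIDs before they are handed to a script. A minimal sketch follows; `expand_vmids` is a hypothetical helper, and a range may include IDs that do not exist on the node, so the result should be filtered against `pct list`.

```bash
#!/usr/bin/env bash
# Sketch: expand the VMID ranges from the priority list into individual IDs.

expand_vmids() {
  local spec
  for spec in "$@"; do
    if [[ "$spec" =~ ^([0-9]+)-([0-9]+)$ ]]; then
      seq "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    else
      echo "$spec"
    fi
  done
}

# High-priority group, in the order given above.
HIGH_PRIORITY=(106 107 108 3000-3003 3500 3501)
expand_vmids "${HIGH_PRIORITY[@]}"
```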
---
## Part 3: Data Restoration Procedures
### Step 1: Check for Existing Backups
```bash
# Check all nodes for backups
for node in ml110 r630-01 r630-02; do
echo "=== $node ==="
ssh "root@$node" "find /var/lib/vz/dump -name 'vzdump-*-106-*' -o -name 'vzdump-*-107-*' -o -name 'vzdump-*-108-*'"
done
# List storages that can hold backups (includes Proxmox Backup Server, if configured)
pvesm status --content backup
```
### Step 2: Restore from Backup
```bash
# Copy backup to r630-01
scp root@source:/var/lib/vz/dump/vzdump-lxc-106-*.tar.gz root@192.168.11.11:/var/lib/vz/dump/
# Restore container (the glob must match exactly one archive; add --force if CT 106 still exists)
ssh root@192.168.11.11 "pct restore 106 /var/lib/vz/dump/vzdump-lxc-106-*.tar.gz --storage thin1"
# Start container
ssh root@192.168.11.11 "pct start 106"
```
### Step 3: Verify Restoration
```bash
# Check container status
pct list | grep 106
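# Check container logs (pct has no `logs` subcommand; read the journal inside the container)
pct exec 106 -- journalctl -n 50 --no-pager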
# Test services
pct exec 106 -- systemctl status <service>
```
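For more than a handful of containers, the status check can be automated. The sketch below lists every container that is not running, assuming the usual `pct list` layout with a header line and the status in the second column; `not_running` is a hypothetical helper, and the sample input stands in for real output.

```bash
#!/usr/bin/env bash
# Sketch: report containers that are not running, based on `pct list` output.

not_running() {
  # Reads `pct list` output on stdin: header line, then "VMID Status ..." rows.
  awk 'NR > 1 && $2 != "running" { print $1 }'
}

# Example with sample output; on a node, run `pct list | not_running` instead.
not_running <<'EOF'
VMID       Status     Lock         Name
106        running                 redis-rpc-translator
107        stopped                 web3signer-rpc-translator
EOF
```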
---
## Part 4: Ongoing Backup Management
### Daily Operations
**Automated Backups:**
- Run automatically at 2 AM daily
- Logs available in `/var/log/proxmox-backups/`
- Check logs weekly for errors
**Manual Backups:**
- Before major changes
- Before migrations
- Before updates
### Backup Verification
**Weekly Checks:**
```bash
# Check backup directory
ls -lh /var/lib/vz/dump/
# Check today's backup log
tail -n 50 /var/log/proxmox-backups/backup_$(date +%Y%m%d).log
# Verify backup integrity (tar reads the whole archive and fails if it is corrupt)
tar -tzf /var/lib/vz/dump/vzdump-lxc-106-*.tar.gz > /dev/null && echo "archive OK"
```
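The integrity check can be extended to every container archive in the dump directory. A minimal sketch, assuming gzip-compressed LXC archives as described in this plan (`verify_backups` is a hypothetical helper):

```bash
#!/usr/bin/env bash
# Sketch: test every .tar.gz container backup in a directory end to end.

verify_backups() {
  local dir="$1" bad=0 f
  for f in "$dir"/vzdump-lxc-*.tar.gz; do
    [[ -e "$f" ]] || continue      # the glob matched nothing
    # Listing the archive forces tar to read and decompress all of it,
    # so truncated or corrupt files make it exit non-zero.
    if tar -tzf "$f" > /dev/null 2>&1; then
      echo "OK   $f"
    else
      echo "BAD  $f"
      bad=$((bad + 1))
    fi
  done
  [[ "$bad" -eq 0 ]]
}

verify_backups /var/lib/vz/dump || echo "some archives failed verification"
```

A listing that succeeds shows the archive is readable, not that the restored container works, so periodic test restores are still worthwhile.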
### Backup Retention
**Current Policy:**
- Keep last 7 days of backups
- Cleanup runs automatically after each backup
**Recommended Policy:**
- Daily backups: Keep 7 days
- Weekly backups: Keep 4 weeks
- Monthly backups: Keep 12 months
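
One way to express the recommended policy is through the `--prune-backups` option of `vzdump`, which accepts `keep-daily`, `keep-weekly` and `keep-monthly` counts. The sketch below only builds and prints the command; `prune_spec` is a hypothetical helper, and whether the existing setup script uses this option is an assumption.

```bash
#!/usr/bin/env bash
# Sketch: express the recommended retention as a vzdump --prune-backups value.

prune_spec() {
  local daily="$1" weekly="$2" monthly="$3"
  echo "keep-daily=${daily},keep-weekly=${weekly},keep-monthly=${monthly}"
}

# 7 daily, 4 weekly, 12 monthly, as recommended above.
echo "vzdump --all 1 --mode snapshot --compress gzip --storage local --prune-backups $(prune_spec 7 4 12)"
```

With this approach a single daily job is enough: pruning decides which of the daily archives are retained as the weekly and monthly copies.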
### Backup Storage Management
**Monitor Storage:**
```bash
# Check backup storage usage
df -h /var/lib/vz/dump/
pvesm status | grep local
# Cleanup old backups manually if needed (covers container and VM archives)
find /var/lib/vz/dump \( -name "vzdump-*.tar.gz" -o -name "vzdump-*.vma.gz" \) -mtime +7 -delete
```
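Storage monitoring can be reduced to a threshold check that is easy to run from cron. This is a sketch only: the 80% limit is an arbitrary example value, and `usage_percent` and `check_usage` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Sketch: warn when the filesystem holding the backups crosses a usage threshold.

usage_percent() {
  # `df -P` prints one POSIX-format line per filesystem; column 5 is the use percentage.
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

check_usage() {
  local path="$1" limit="$2" used
  used="$(usage_percent "$path")"
  if (( used >= limit )); then
    echo "WARNING: $path is ${used}% full (limit ${limit}%)"
    return 1
  fi
  echo "OK: $path is ${used}% full"
}

BACKUP_DIR="${BACKUP_DIR:-/var/lib/vz/dump}"
if [[ -d "$BACKUP_DIR" ]]; then
  check_usage "$BACKUP_DIR" 80 || true
fi
```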
---
## Part 5: Implementation Checklist
### Phase 1: Backup Setup ✅
- [ ] Run `scripts/setup-automated-backups.sh`
- [ ] Verify cron job is set up
- [ ] Test manual backup script
- [ ] Verify backup storage has space
- [ ] Run test backup
### Phase 2: Check for Existing Backups
- [ ] Check `/var/lib/vz/dump/` on all nodes
- [ ] Check external backup locations
- [ ] Check Proxmox Backup Server (if configured)
- [ ] Document found backups
### Phase 3: Restore from Backups (If Available)
- [ ] Copy backups to r630-01
- [ ] Restore containers using `pct restore`
- [ ] Start containers
- [ ] Verify containers are working
- [ ] Test services
### Phase 4: Recreate Containers (If No Backups)
- [ ] Prioritize containers by importance
- [ ] Download required templates
- [ ] Run recreation script for each container
- [ ] Restore configurations manually
- [ ] Install applications
- [ ] Restore application data (if available)
### Phase 5: Verify and Document
- [ ] Verify all containers are running
- [ ] Test all services
- [ ] Document restoration process
- [ ] Update backup procedures
- [ ] Schedule regular backup verification
---
## Part 6: Best Practices
### Backup Best Practices
1. **Automated Backups:**
- Set up daily automated backups
- Use snapshot mode for running containers
- Compress backups to save space
- Keep multiple backup copies
2. **Backup Storage:**
- Use separate storage for backups
- Monitor backup storage usage
- Consider off-site backups
- Use Proxmox Backup Server for better management
3. **Backup Testing:**
- Test backup restoration regularly
- Verify backup integrity
- Document restoration procedures
- Keep backup logs
### Container Recreation Best Practices
1. **Before Recreation:**
- Check for backups first
- Document container configurations
- Note any custom settings
- Plan recreation order
2. **During Recreation:**
- Recreate from templates
- Restore configurations
- Install applications
- Restore data if available
3. **After Recreation:**
- Verify containers work
- Test all services
- Update documentation
- Set up backups immediately
---
## Part 7: Recovery Procedures
### Container Recovery Workflow
```
1. Check for Backups
├─ Found? → Restore from Backup
└─ Not Found? → Recreate from Template
2. Restore from Backup
├─ Copy backup to r630-01
├─ Restore using pct restore
├─ Start container
└─ Verify services
3. Recreate from Template
├─ Read container config
├─ Download template
├─ Create container
├─ Restore configuration
├─ Install applications
└─ Restore data (if available)
```
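The first decision in this workflow can be sketched as a small function that reports, per container, whether a backup archive is present. It assumes archives live in `/var/lib/vz/dump` under the vzdump naming used elsewhere in this plan; `recovery_action` is a hypothetical helper and only prints the chosen path.

```bash
#!/usr/bin/env bash
# Sketch of the decision step: restore if a backup exists, otherwise recreate.

recovery_action() {
  local dir="$1" vmid="$2"
  # compgen -G succeeds only if the glob matches at least one file.
  if compgen -G "$dir/vzdump-lxc-$vmid-*" > /dev/null; then
    echo "restore"
  else
    echo "recreate"
  fi
}

for vmid in 106 107 108; do
  echo "$vmid: $(recovery_action /var/lib/vz/dump "$vmid")"
done
```

Keeping the decision separate from the restore and recreate commands makes it possible to review the full plan for all containers before anything is changed.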
### Emergency Recovery
**If backups fail:**
1. Stop affected containers
2. Check backup logs
3. Manually create backup
4. Restore from manual backup
5. Verify restoration
**If recreation fails:**
1. Check container logs
2. Verify template exists
3. Check storage availability
4. Retry recreation
5. Contact support if needed
---
## Part 8: Monitoring and Maintenance
### Backup Monitoring
**Daily:**
- Check backup logs for errors
- Verify backups completed successfully
**Weekly:**
- Review backup storage usage
- Test backup restoration
- Clean up old backups
**Monthly:**
- Review backup policies
- Update backup procedures
- Document any issues
### Container Monitoring
**After Recreation:**
- Monitor container status
- Check service logs
- Verify applications work
- Test functionality
**Ongoing:**
- Regular health checks
- Monitor resource usage
- Update applications
- Maintain backups
---
## Commands Reference
### Backup Commands
```bash
# Setup automated backups
./scripts/setup-automated-backups.sh
# Manual backup
ssh root@192.168.11.11 "/usr/local/bin/manual-backup.sh 106 107 108"
# Check backup status
ssh root@192.168.11.11 "ls -lh /var/lib/vz/dump/"
# View backup logs (the $(...) is escaped so the date is evaluated on the node)
ssh root@192.168.11.11 "tail -f /var/log/proxmox-backups/backup_\$(date +%Y%m%d).log"
```
### Restoration Commands
```bash
# Restore from backup
ssh root@192.168.11.11 "pct restore 106 /var/lib/vz/dump/vzdump-lxc-106-*.tar.gz --storage thin1"
# Recreate from template
./scripts/recreate-containers-from-configs.sh 106 107 108
# Check container status
ssh root@192.168.11.11 "pct list | grep 106"
```
### Verification Commands
```bash
# Check container volumes
ssh root@192.168.11.11 "lvs pve | grep vm-106-disk"
# Check container config
ssh root@192.168.11.11 "pct config 106"
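# Check container logs (pct has no `logs` subcommand; read the journal inside the container)
ssh root@192.168.11.11 "pct exec 106 -- journalctl -n 50 --no-pager"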
```
---
## Next Steps
1. **Immediate:**
- [ ] Run backup setup script
- [ ] Check for existing backups
- [ ] Document found backups
2. **Short-term:**
- [ ] Restore containers from backups (if available)
- [ ] Recreate high-priority containers
- [ ] Verify all services work
3. **Long-term:**
- [ ] Set up Proxmox Backup Server
- [ ] Implement off-site backups
- [ ] Regular backup testing
- [ ] Documentation updates
---
**Status:** 📋 **PLAN READY**
**Next Action:** Run backup setup script and check for existing backups
**Last Updated:** January 7, 2026