This document outlines disaster recovery procedures for the Proxmox infrastructure, including recovery from hardware failures, data loss, network outages, and security incidents.
---
## Recovery Scenarios
### 1. Complete Host Failure
**Scenario:** A Proxmox host (R630 or ML110) fails completely and cannot be recovered.
**Recovery Steps:**
1. **Assess Impact:**
```bash
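# Check cluster membership and quorum from a surviving node
pvecm status
pvecm nodes
# List the guests that were defined on the failed host
ls /etc/pve/nodes/<failed-node>/qemu-server/ /etc/pve/nodes/<failed-node>/lxc/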
```
2. **Recover from Backup:**
- Identify backup location (Proxmox Backup Server or external storage)
- Restore VMs/containers to another host in the cluster
- Verify network connectivity and services
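Step 2 can be sketched as a small helper that picks the restore command from the backup volume ID. This is a minimal sketch, assuming Proxmox Backup Server style volume IDs (`<storage>:backup/vm/...` for VMs, `<storage>:backup/ct/...` for containers). The function name `restore_cmd` and the storage names `pbs` and `local-zfs` are hypothetical. It only prints the command so it can be reviewed before it is run.

```bash
# Print (not run) the restore command for a backup volume ID.
# Assumption: volume IDs look like <storage>:backup/vm/<vmid>/<time> for VMs
# and <storage>:backup/ct/<vmid>/<time> for containers.
restore_cmd() {
    local volid="$1" vmid="$2" target_storage="$3"
    case "$volid" in
        *:backup/vm/*) echo "qmrestore $volid $vmid --storage $target_storage" ;;
        *:backup/ct/*) echo "pct restore $vmid $volid --storage $target_storage" ;;
        *) echo "unrecognized backup volume: $volid" >&2; return 1 ;;
    esac
}

# List available backups first with:
#   pvesm list <backup-storage> --content backup
# Example (hypothetical storage names):
#   restore_cmd "pbs:backup/vm/100/2024-01-01T02:00:00Z" 100 local-zfs
```

Run the printed command on the surviving host that should take over the guest.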
3. **Rejoin Cluster (if host is replaced):**
```bash
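# On a surviving node: remove the failed node first if its name will be reused
pvecm delnode <failed-node-name>
# On the new/repaired host: join via the IP of an existing cluster member
pvecm add <existing-node-ip> --link0 <local-ip-for-cluster-network>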
```
4. **Verify Services:**
- Check all critical services are running
- Verify network connectivity
- Test application functionality
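The service check in step 4 can be partly scripted. This is a minimal sketch, assuming the standard Proxmox VE systemd units (`pve-cluster`, `pvedaemon`, `pveproxy`, `pvestatd`, `corosync`). The function name `check_units` is hypothetical, and application-level tests still have to be done per service.

```bash
# Report each systemd unit as OK or DOWN; returns non-zero if any unit is down.
check_units() {
    local unit rc=0
    for unit in "$@"; do
        if systemctl is-active --quiet "$unit"; then
            echo "OK   $unit"
        else
            echo "DOWN $unit"
            rc=1
        fi
    done
    return "$rc"
}

# Usage on a Proxmox host:
#   check_units pve-cluster pvedaemon pveproxy pvestatd corosync
```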
**Recovery Time Objective (RTO):** 4 hours
**Recovery Point Objective (RPO):** Last backup (typically daily)
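
Because the RPO depends on the age of the newest backup, it can help to check that age before an incident happens. This is a minimal sketch, assuming a 24-hour target for daily backups. The function name `backup_within_rpo` and the example path are hypothetical.

```bash
# Succeeds if the newest backup is no older than max_hours.
# Arguments are Unix timestamps, so the check works with any backup source.
backup_within_rpo() {
    local backup_epoch="$1" now_epoch="$2" max_hours="$3"
    local age_seconds=$(( now_epoch - backup_epoch ))
    [ "$age_seconds" -le $(( max_hours * 3600 )) ]
}

# Example: compare a backup file's mtime (GNU stat) against a 24-hour RPO.
# The path is a placeholder, not a real location.
#   backup_within_rpo "$(stat -c %Y /path/to/backup-file)" "$(date +%s)" 24 \
#       || echo "newest backup is older than the RPO"
```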
---
### 2. Storage Failure
**Scenario:** Storage pool fails (ZFS pool corruption, disk failure, etc.)
**Recovery Steps:**
1. **Immediate Actions:**
- Stop all VMs/containers using affected storage
- Assess extent of damage
- Check backup availability
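To find which guests to stop, one option is to search the guest configuration files for references to the affected storage. This is a minimal sketch, assuming configs are stored as `<vmid>.conf` with disk lines of the form `scsi0: <storage>:<volume>`. The function name `guests_on_storage` and the storage name `local-zfs` are hypothetical.

```bash
# Print the IDs of guests whose config references the given storage.
# Assumption: configs live in <dir>/<vmid>.conf with disk lines such as
#   scsi0: local-zfs:vm-100-disk-0,size=32G
guests_on_storage() {
    local storage="$1"; shift
    local dir conf
    for dir in "$@"; do
        for conf in "$dir"/*.conf; do
            [ -e "$conf" ] || continue
            if grep -Eq "^[a-z]+[0-9]*: ${storage}:" "$conf"; then
                basename "$conf" .conf
            fi
        done
    done
}

# Usage on a Proxmox host:
#   guests_on_storage local-zfs /etc/pve/qemu-server /etc/pve/lxc
# Then stop each guest with `qm shutdown <vmid>` or `pct shutdown <vmid>`.
```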
2. **Storage Recovery:**
```bash
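# For ZFS pools: check the state of imported pools
zpool status
# Import the pool if it is not listed (-f forces import of a pool marked as in use)
zpool import -f <pool-name>
# Verify all data against checksums; follow progress with zpool status
zpool scrub <pool-name>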
```
3. **Data Recovery:**
- Restore from backups if pool cannot be recovered
- Use Proxmox Backup Server if available
- Restore individual VMs/containers as needed
4. **Verification:**
- Verify data integrity
- Test restored VMs/containers
- Document lessons learned
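For the data-integrity check, one option is to confirm that the pool reports `ONLINE` once a scrub has finished. This is a minimal sketch. The function name `pool_healthy` and the pool name `tank` are hypothetical.

```bash
# Succeeds only if the pool's health property is ONLINE.
# `zpool list -H -o health <pool>` prints just the health value,
# for example ONLINE or DEGRADED.
pool_healthy() {
    [ "$(zpool list -H -o health "$1")" = "ONLINE" ]
}

# Usage:
#   pool_healthy tank || zpool status -v tank
```

If the check fails, `zpool status -v` lists the affected devices and any files with unrecoverable errors.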
**RTO:** 8 hours
**RPO:** Last backup
---
### 3. Network Outage
**Scenario:** Complete network failure or misconfiguration
**Recovery Steps:**
1. **Local Access:**
- Use console access (iDRAC, iLO, or physical console)
- Verify Proxmox host is running
- Check network configuration
2. **Network Restoration:**
```bash
# Check network interfaces
ip addr show
ip link show
# Check routing
ip route show
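# Reload networking if needed; with ifupdown2, ifreload -a applies changes
# without the full restart that can detach running guests from their bridges
ifreload -a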
```
3. **VLAN Restoration:**
- Verify VLAN configuration on switches
- Check Proxmox bridge configuration
- Test connectivity between VLANs
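The bridge check can start by listing which bridges are configured as VLAN-aware. This is a minimal sketch, assuming the bridges are defined in `/etc/network/interfaces` with the `bridge-vlan-aware yes` option. The function name `vlan_aware_bridges` is hypothetical.

```bash
# Print the names of bridges marked VLAN-aware in an interfaces file.
vlan_aware_bridges() {
    awk '
        $1 == "iface" { current = $2 }
        $1 == "bridge-vlan-aware" && $2 == "yes" { print current }
    ' "$1"
}

# Usage on a Proxmox host:
#   vlan_aware_bridges /etc/network/interfaces
# Compare with the running state:
#   bridge vlan show
```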
4. **Service Verification:**
- Test internal services
- Verify external connectivity (if applicable)
- Check Cloudflare tunnels (if used)
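A quick reachability sweep can cover the internal and external checks. This is a minimal sketch, assuming Linux `ping` (iputils). The function name `check_reachable` and the example targets are hypothetical.

```bash
# Ping each target once with a 2-second timeout and report the result.
# Returns non-zero if any target is unreachable.
check_reachable() {
    local target rc=0
    for target in "$@"; do
        if ping -c 1 -W 2 "$target" >/dev/null 2>&1; then
            echo "reachable   $target"
        else
            echo "unreachable $target"
            rc=1
        fi
    done
    return "$rc"
}

# Usage (placeholder targets):
#   check_reachable <gateway-ip> <internal-service-ip> <external-host>
# If the tunnel runs as a systemd unit named cloudflared (an assumption):
#   systemctl is-active cloudflared
```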
**RTO:** 2 hours
**RPO:** No data loss (network issue only)
---
### 4. Data Corruption
**Scenario:** VM/container data corruption or accidental deletion
**Recovery Steps:**
1. **Immediate Actions:**
- Stop affected VM/container
- Do not attempt repairs that might worsen corruption
- Document what was lost
2. **Recovery Options:**
- **From Snapshot:** Restore from most recent snapshot
- **From Backup:** Restore from Proxmox Backup Server
- **From External Backup:** Use external backup solution
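
For the snapshot option, the rollback command differs between VMs and containers. This is a minimal sketch that prints the command for review instead of running it. The function name `rollback_cmd` and the snapshot name in the example are hypothetical.

```bash
# Print (not run) the snapshot rollback command for a VM or container.
rollback_cmd() {
    local type="$1" vmid="$2" snap="$3"
    case "$type" in
        vm) echo "qm rollback $vmid $snap" ;;
        ct) echo "pct rollback $vmid $snap" ;;
        *) echo "type must be vm or ct" >&2; return 1 ;;
    esac
}

# List available snapshots first:
#   qm listsnapshot <vmid>     # VMs
#   pct listsnapshot <vmid>    # containers
# Example:
#   rollback_cmd vm 100 pre-upgrade
```

A rollback discards all changes made after the snapshot, so confirm the snapshot predates the corruption before running the printed command.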