# Disaster Recovery Procedures

**Last Updated:** 2025-01-20
**Document Version:** 1.0
**Status:** Active Documentation

---

## Overview

This document outlines disaster recovery procedures for the Proxmox infrastructure, including recovery from hardware failures, data loss, network outages, and security incidents.

---

## Recovery Scenarios

### 1. Complete Host Failure

**Scenario:** A Proxmox host (R630 or ML110) fails completely and cannot be recovered.

**Recovery Steps:**

1. **Assess Impact:**
   ```bash
   # Check cluster quorum and which nodes are still online
   pvecm status
   pvecm nodes
   # List the guests that were defined on the failed host
   ls /etc/pve/nodes/<failed-node>/qemu-server/ /etc/pve/nodes/<failed-node>/lxc/
   ```

2. **Recover from Backup:**
   - Identify backup location (Proxmox Backup Server or external storage)
   - Restore VMs/containers to another host in the cluster
   - Verify network connectivity and services

3. **Rejoin Cluster (if host is replaced):**
   ```bash
   # If the replacement reuses the failed node's name, first remove the
   # stale entry from a surviving node: pvecm delnode <failed-node>
   # On new/repaired host
   pvecm add <existing-cluster-node> --link0 <new-host-address>
   ```

4. **Verify Services:**
   - Check all critical services are running
   - Verify network connectivity
   - Test application functionality

**Recovery Time Objective (RTO):** 4 hours
**Recovery Point Objective (RPO):** Last backup (typically daily)

---

### 2. Storage Failure

**Scenario:** A storage pool fails (ZFS pool corruption, disk failure, etc.).

**Recovery Steps:**

1. **Immediate Actions:**
   - Stop all VMs/containers using affected storage
   - Assess extent of damage
   - Check backup availability

2. **Storage Recovery:**
   ```bash
   # For ZFS pools: check health, force-import the pool, then verify checksums
   zpool status
   zpool import -f <pool-name>
   zpool scrub <pool-name>
   ```

3. **Data Recovery:**
   - Restore from backups if pool cannot be recovered
   - Use Proxmox Backup Server if available
   - Restore individual VMs/containers as needed

4. **Verification:**
   - Verify data integrity
   - Test restored VMs/containers
   - Document lessons learned

**RTO:** 8 hours
**RPO:** Last backup

---

### 3. Network Outage

**Scenario:** Complete network failure or misconfiguration.

**Recovery Steps:**

1. **Local Access:**
   - Use console access (iDRAC, iLO, or physical console)
   - Verify Proxmox host is running
   - Check network configuration

2. **Network Restoration:**
   ```bash
   # Check network interfaces
   ip addr show
   ip link show
   # Check routing
   ip route show
   # Restart networking if needed (or `ifreload -a` if ifupdown2 is installed)
   systemctl restart networking
   ```

3. **VLAN Restoration:**
   - Verify VLAN configuration on switches
   - Check Proxmox bridge configuration
   - Test connectivity between VLANs

4. **Service Verification:**
   - Test internal services
   - Verify external connectivity (if applicable)
   - Check Cloudflare tunnels (if used)

**RTO:** 2 hours
**RPO:** No data loss (network issue only)

---

### 4. Data Corruption

**Scenario:** VM/container data corruption or accidental deletion.

**Recovery Steps:**

1. **Immediate Actions:**
   - Stop affected VM/container
   - Do not attempt repairs that might worsen corruption
   - Document what was lost

2. **Recovery Options:**
   - **From Snapshot:** Restore from most recent snapshot
   - **From Backup:** Restore from Proxmox Backup Server
   - **From External Backup:** Use external backup solution

3. **Restoration:**
   ```bash
   # Restore a VM from PBS (vzdump only creates backups; restores use qmrestore / pct restore)
   qmrestore <backup-volume-id> <vmid> --storage <target-storage>
   # Restore a container from PBS
   pct restore <ctid> <backup-volume-id> --storage <target-storage>
   # Or roll back a VM to a snapshot
   qm rollback <vmid> <snapshot-name>
   ```

4. **Verification:**
   - Verify data integrity
   - Test application functionality
   - Update documentation

**RTO:** 4 hours
**RPO:** Last snapshot/backup

---

### 5. Security Incident

**Scenario:** Security breach, unauthorized access, or malware.

**Recovery Steps:**

1. **Immediate Containment:**
   - Isolate affected systems
   - Disconnect from network if necessary
   - Preserve evidence (logs, snapshots)

2. **Assessment:**
   - Identify scope of breach
   - Determine what was accessed/modified
   - Check for data exfiltration

3. **Recovery:**
   - Restore from known-good backups (pre-incident)
   - Rebuild affected systems if necessary
   - Update all credentials and keys

4. **Hardening:**
   - Review and update security policies
   - Patch vulnerabilities
   - Enhance monitoring

5. **Documentation:**
   - Document incident timeline
   - Update security procedures
   - Conduct post-incident review

**RTO:** 24 hours
**RPO:** Pre-incident state

---

## Backup Strategy

### Backup Schedule

- **Critical VMs/Containers:** Daily backups
- **Standard VMs/Containers:** Weekly backups
- **Configuration:** Daily backups of Proxmox configuration
- **Network Configuration:** Version controlled (Git)

### Backup Locations

1. **Primary:** Proxmox Backup Server (if available)
2. **Secondary:** External storage (NFS, SMB, or USB)
3. **Offsite:** Cloud storage or remote location

### Backup Verification

- Weekly restore tests
- Monthly full disaster recovery drill
- Quarterly review of backup strategy

---

## Recovery Contacts

### Primary Contacts

- **Infrastructure Lead:** [Contact Information]
- **Network Administrator:** [Contact Information]
- **Security Team:** [Contact Information]

### Escalation

- **Level 1:** Infrastructure team (4 hours)
- **Level 2:** Management (8 hours)
- **Level 3:** External support (24 hours)

---

## Testing and Maintenance

### Quarterly DR Drills

1. **Test Scenario:** Simulate host failure
2. **Test Scenario:** Simulate storage failure
3. **Test Scenario:** Simulate network outage
4. **Document Results:** Update procedures based on findings

### Annual Full DR Test

- Complete infrastructure rebuild from backups
- Verify all services
- Update documentation

---

## Related Documentation

- **[BACKUP_AND_RESTORE.md](BACKUP_AND_RESTORE.md)** - Detailed backup procedures
- **[OPERATIONAL_RUNBOOKS.md](OPERATIONAL_RUNBOOKS.md)** - Operational procedures
- **[TROUBLESHOOTING_FAQ.md](../09-troubleshooting/TROUBLESHOOTING_FAQ.md)** - Troubleshooting guide

---

**Last Updated:** 2025-01-20
**Review Cycle:** Quarterly
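
The backup schedule in this document (daily for critical guests, weekly for standard guests) implies a maximum acceptable backup age per tier, which in turn bounds the RPO. The following is a minimal sketch of how that check might be scripted. The function names `max_age_hours` and `backup_is_fresh` are hypothetical, and the 24-hour and 168-hour limits are an assumption derived from "daily" and "weekly"; adjust them to the actual backup windows.

```shell
#!/usr/bin/env bash
# Sketch only: checks whether a backup is recent enough for its tier.
# Tier names and limits are assumptions based on the Backup Schedule
# (critical = daily, standard = weekly); function names are hypothetical.

# Print the maximum acceptable backup age, in hours, for a tier.
max_age_hours() {
  case "$1" in
    critical) echo 24 ;;   # daily backups
    standard) echo 168 ;;  # weekly backups
    *) return 1 ;;         # unknown tier
  esac
}

# Succeed (exit 0) if the backup taken at epoch $2 is within the age
# limit for tier $1, measured at epoch $3 (defaults to the current time).
# Returns 2 for an unknown tier and 1 for a stale or future-dated backup.
backup_is_fresh() {
  local tier="$1" backup_epoch="$2" now_epoch="${3:-$(date +%s)}"
  local limit
  limit="$(max_age_hours "$tier")" || return 2
  local age_seconds=$(( now_epoch - backup_epoch ))
  (( age_seconds >= 0 && age_seconds <= limit * 3600 ))
}
```

One way to use it, assuming GNU `stat` and a backup file path of your own, is `backup_is_fresh critical "$(stat -c %Y /path/to/backup-file)" || echo "backup is stale"`. Passing the current time as an optional third argument keeps the check deterministic, which makes it easy to exercise during restore tests.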