# Disaster Recovery Procedures

**Last Updated:** 2025-01-20
**Document Version:** 1.0
**Status:** Active Documentation

---

## Overview

This document outlines disaster recovery procedures for the Proxmox infrastructure, including recovery from hardware failures, data loss, network outages, and security incidents.

---

## Recovery Scenarios

### 1. Complete Host Failure

**Scenario:** A Proxmox host (R630 or ML110) fails completely and cannot be recovered.

**Recovery Steps:**

1. **Assess Impact:**
   ```bash
   # From a surviving node: check quorum and cluster membership
   pvecm status
   pvecm nodes
   # List all guests with the node each one is assigned to
   pvesh get /cluster/resources --type vm
   ```

2. **Recover from Backup:**
   - Identify backup location (Proxmox Backup Server or external storage)
   - Restore VMs/containers to another host in the cluster
   - Verify network connectivity and services

3. **Rejoin Cluster (if host is replaced):**
   ```bash
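   # On a surviving node, if the host was reinstalled: remove the stale member first
   pvecm delnode <failed-node-name>
   # On the new/repaired host: join via the IP or hostname of an existing member
   pvecm add <existing-node-ip> --link0 <local-cluster-network-ip>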
   ```

4. **Verify Services:**
   - Check all critical services are running
   - Verify network connectivity
   - Test application functionality

**Recovery Time Objective (RTO):** 4 hours
**Recovery Point Objective (RPO):** Last backup (daily for critical guests, weekly for standard guests)
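
To decide what to restore, it helps to list the guests that were configured on the failed node. The following is a minimal sketch, assuming the surviving nodes still have a readable `/etc/pve` (the cluster filesystem), where guest configs live under `nodes/<node>/qemu-server/` and `nodes/<node>/lxc/`. The function name `list_guests_on_node` and the optional second argument are hypothetical, added here only for illustration and dry runs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the guests whose config files sit under a node's
# directory in the cluster filesystem. The root path is overridable so the
# function can be exercised against a copy instead of the live /etc/pve.
list_guests_on_node() {
    local node="$1" root="${2:-/etc/pve}"
    local dir="$root/nodes/$node"
    local conf kind
    for conf in "$dir"/qemu-server/*.conf "$dir"/lxc/*.conf; do
        [ -e "$conf" ] || continue          # skip globs that matched nothing
        kind=vm
        case "$conf" in */lxc/*) kind=ct ;; esac
        printf '%s %s\n' "$kind" "$(basename "$conf" .conf)"
    done
}
```

Calling `list_guests_on_node <failed-node-name>` on a surviving node prints one line per guest, prefixed with `vm` or `ct`, which can then be matched against the available backups.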

---

### 2. Storage Failure

**Scenario:** Storage pool fails (ZFS pool corruption, disk failure, etc.)

**Recovery Steps:**

1. **Immediate Actions:**
   - Stop all VMs/containers using affected storage
   - Assess extent of damage
   - Check backup availability

2. **Storage Recovery:**
   ```bash
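   # For ZFS pools
   zpool status -v               # pool health and per-device errors
   zpool import                  # list pools available for import
   zpool import -f <pool-name>   # force only if the pool is not in use elsewhere
   zpool scrub <pool-name>       # verify checksums across the pool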
   ```

3. **Data Recovery:**
   - Restore from backups if pool cannot be recovered
   - Use Proxmox Backup Server if available
   - Restore individual VMs/containers as needed

4. **Verification:**
   - Verify data integrity
   - Test restored VMs/containers
   - Document lessons learned

**RTO:** 8 hours
**RPO:** Last backup
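
When assessing the extent of damage, a quick filter over pool health can narrow down which pools need attention. This is a rough sketch, assuming ZFS is the storage backend; `unhealthy_pools` is a hypothetical helper name, and it relies only on `zpool list -H -o name,health`, which prints tab-separated values without a header.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every pool whose health is anything other than
# ONLINE (for example DEGRADED, FAULTED, or UNAVAIL).
unhealthy_pools() {
    zpool list -H -o name,health \
        | awk -F'\t' '$2 != "ONLINE" { print $1 ": " $2 }'
}
```

An empty result means every imported pool reports `ONLINE`. Pools that failed to import do not appear in `zpool list` at all, so this check complements `zpool import` rather than replacing it.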

---

### 3. Network Outage

**Scenario:** Complete network failure or misconfiguration

**Recovery Steps:**

1. **Local Access:**
   - Use console access (iDRAC, iLO, or physical console)
   - Verify Proxmox host is running
   - Check network configuration

2. **Network Restoration:**
   ```bash
   # Check network interfaces
   ip addr show
   ip link show

   # Check routing
   ip route show

   # Re-apply /etc/network/interfaces if needed (ifupdown2)
   ifreload -a
   # Fallback on hosts without ifupdown2: systemctl restart networking
   ```

3. **VLAN Restoration:**
   - Verify VLAN configuration on switches
   - Check Proxmox bridge configuration
   - Test connectivity between VLANs

4. **Service Verification:**
   - Test internal services
   - Verify external connectivity (if applicable)
   - Check Cloudflare tunnels (if used)

**RTO:** 2 hours
**RPO:** No data loss (network issue only)
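
From the console, a first connectivity check is often whether the host can reach its default gateway. The sketch below assumes a single IPv4 default route and the iputils version of `ping`; the function names `default_gateway` and `check_gateway` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract the gateway address from the default route.
default_gateway() {
    ip route show default \
        | awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# Hypothetical helper: ping the default gateway and report the result.
check_gateway() {
    local gw
    gw="$(default_gateway)"
    if [ -z "$gw" ]; then
        echo "no default route"
        return 1
    fi
    # -c 3: send three probes; -W 2: wait up to two seconds for each reply
    if ping -c 3 -W 2 "$gw" >/dev/null 2>&1; then
        echo "gateway $gw reachable"
    else
        echo "gateway $gw unreachable"
        return 1
    fi
}
```

A missing default route points at host configuration, whereas an unreachable gateway more often points at bridge, VLAN, or switch problems.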

---

### 4. Data Corruption

**Scenario:** VM/container data corruption or accidental deletion

**Recovery Steps:**

1. **Immediate Actions:**
   - Stop affected VM/container
   - Do not attempt repairs that might worsen corruption
   - Document what was lost

2. **Recovery Options:**
   - **From Snapshot:** Restore from most recent snapshot
   - **From Backup:** Restore from Proxmox Backup Server
   - **From External Backup:** Use external backup solution

3. **Restoration:**
   ```bash
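   # Restore a VM from a backup file or PBS volume ID
   qmrestore <backup-volume-or-file> <vmid> --storage <storage>

   # Restore a container
   pct restore <vmid> <backup-volume-or-file> --storage <storage>

   # Or roll back to a snapshot (use `pct rollback` for containers)
   qm rollback <vmid> <snapshot-name>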
   ```

4. **Verification:**
   - Verify data integrity
   - Test application functionality
   - Update documentation

**RTO:** 4 hours
**RPO:** Last snapshot/backup
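
Before restoring, it is useful to confirm which backup of a guest is the newest. The following sketch assumes the backups are visible through `pvesm list` on a storage whose name is passed in, and that volume IDs embed a sortable timestamp, which is an assumption worth checking against your own storage. `latest_backup` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the volume ID of the newest backup for one guest.
# Assumes the first line of `pvesm list` is a header and that sorting volume
# IDs lexically matches chronological order for a single guest.
latest_backup() {
    local storage="$1" vmid="$2"
    pvesm list "$storage" --content backup --vmid "$vmid" \
        | awk 'NR > 1 { print $1 }' \
        | LC_ALL=C sort \
        | tail -n 1
}
```

The printed volume ID can be passed to the restore command as the backup argument. If the newest backup was taken after the corruption occurred, pick an earlier entry from the full `pvesm list` output instead.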

---

### 5. Security Incident

**Scenario:** Security breach, unauthorized access, or malware

**Recovery Steps:**

1. **Immediate Containment:**
   - Isolate affected systems
   - Disconnect from network if necessary
   - Preserve evidence (logs, snapshots)

2. **Assessment:**
   - Identify scope of breach
   - Determine what was accessed/modified
   - Check for data exfiltration

3. **Recovery:**
   - Restore from known-good backups (pre-incident)
   - Rebuild affected systems if necessary
   - Update all credentials and keys

4. **Hardening:**
   - Review and update security policies
   - Patch vulnerabilities
   - Enhance monitoring

5. **Documentation:**
   - Document incident timeline
   - Update security procedures
   - Conduct post-incident review

**RTO:** 24 hours
**RPO:** Pre-incident state
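
Preserving evidence usually means copying logs off the affected system before any rebuild and recording a checksum so later changes to the copy are detectable. This is a minimal sketch, assuming GNU `tar` and `sha256sum` are available and that the destination lies outside the source directory; `preserve_evidence` is a hypothetical helper, and which directories hold relevant logs depends on the system.

```bash
#!/usr/bin/env bash
# Hypothetical helper: archive a directory of logs and write a SHA-256
# checksum file next to the archive. Prints the archive path on success.
preserve_evidence() {
    local src="$1" dest="$2"
    local stamp name
    stamp="$(date -u +%Y%m%dT%H%M%SZ)"
    name="evidence-$stamp.tar.gz"
    mkdir -p "$dest" || return 1
    tar -czf "$dest/$name" -C "$src" . || return 1
    # Store the checksum with a relative file name so `sha256sum -c`
    # works from inside the destination directory.
    ( cd "$dest" && sha256sum "$name" > "$name.sha256" ) || return 1
    echo "$dest/$name"
}
```

Running `sha256sum -c <archive>.sha256` from the destination directory later confirms the archive is unchanged. Ideally the destination is storage the affected host cannot write to afterwards.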

---

## Backup Strategy

### Backup Schedule

- **Critical VMs/Containers:** Daily backups
- **Standard VMs/Containers:** Weekly backups
- **Configuration:** Daily backups of Proxmox configuration
- **Network Configuration:** Version controlled (Git)

### Backup Locations

1. **Primary:** Proxmox Backup Server (if available)
2. **Secondary:** External storage (NFS, SMB, or USB)
3. **Offsite:** Cloud storage or remote location

### Backup Verification

- Weekly restore tests
- Monthly full disaster recovery drill
- Quarterly review of backup strategy
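
Between restore tests, a simple age check can flag guests whose newest backup is older than the schedule allows (1 day for critical guests, 7 days for standard ones). The sketch below assumes backups are stored as regular files with meaningful modification times, which fits file-based storage but not Proxmox Backup Server datastores; `backup_is_fresh` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the file was modified less than
# max_days * 24 hours ago, fail otherwise (including when it is missing).
backup_is_fresh() {
    local file="$1" max_days="$2"
    # find prints the path only when the age test matches
    [ -n "$(find "$file" -maxdepth 0 -type f -mtime "-$max_days" 2>/dev/null)" ]
}
```

For example, `backup_is_fresh <backup-file> 7 || echo "stale backup"` reports a standard guest whose latest backup missed the weekly window.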

---

## Recovery Contacts

### Primary Contacts

- **Infrastructure Lead:** [Contact Information]
- **Network Administrator:** [Contact Information]
- **Security Team:** [Contact Information]

### Escalation

- **Level 1:** Infrastructure team (4 hours)
- **Level 2:** Management (8 hours)
- **Level 3:** External support (24 hours)

---

## Testing and Maintenance

### Quarterly DR Drills

1. **Test Scenario:** Simulate host failure
2. **Test Scenario:** Simulate storage failure
3. **Test Scenario:** Simulate network outage
4. **Document Results:** Update procedures based on findings

### Annual Full DR Test

- Complete infrastructure rebuild from backups
- Verify all services
- Update documentation

---

## Related Documentation

- **[BACKUP_AND_RESTORE.md](BACKUP_AND_RESTORE.md)** - Detailed backup procedures
- **[OPERATIONAL_RUNBOOKS.md](OPERATIONAL_RUNBOOKS.md)** - Operational procedures
- **[TROUBLESHOOTING_FAQ.md](../09-troubleshooting/TROUBLESHOOTING_FAQ.md)** - Troubleshooting guide

---

**Review Cycle:** Quarterly