docs/03-deployment/DISASTER_RECOVERY.md

# Disaster Recovery Procedures

**Last Updated:** 2025-01-20  
**Document Version:** 1.0  
**Status:** Active Documentation

---

## Overview

This document outlines disaster recovery procedures for the Proxmox infrastructure, including recovery from hardware failures, data loss, network outages, and security incidents.

---

## Recovery Scenarios

### 1. Complete Host Failure

**Scenario:** A Proxmox host (R630 or ML110) fails completely and cannot be recovered.

**Recovery Steps:**

1. **Assess Impact:**
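   ```bash
   # Check cluster quorum and which nodes are still online
   pvecm status
   pvecm nodes

   # List guests that were assigned to the failed host
   ls /etc/pve/nodes/<failed-node>/qemu-server/ /etc/pve/nodes/<failed-node>/lxc/
   ```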

2. **Recover from Backup:**
   - Identify backup location (Proxmox Backup Server or external storage)
   - Restore VMs/containers to another host in the cluster
   - Verify network connectivity and services

3. **Rejoin Cluster (if host is replaced):**
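   ```bash
   # On a surviving node: remove the dead node's entry first if the replacement reuses its name
   pvecm delnode <failed-node>

   # On the new/repaired host: join via the IP of an existing cluster member
   pvecm add <existing-node-ip> --link0 <new-host-ip>
   ```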

4. **Verify Services:**
   - Check all critical services are running
   - Verify network connectivity
   - Test application functionality

**Recovery Time Objective (RTO):** 4 hours  
**Recovery Point Objective (RPO):** Last backup (daily for critical guests, weekly for standard guests)
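
As a minimal sketch of the impact assessment in step 1, the helper below lists the guest IDs whose configuration files sit under a node's directory. It assumes the standard `/etc/pve/nodes/<node>/qemu-server` and `/etc/pve/nodes/<node>/lxc` layout of the cluster filesystem; the function name `list_guests_on_node` and its optional second argument are illustrative, not part of any Proxmox tool.

```bash
# List guest IDs whose config files live under a node's directory.
# Usage: list_guests_on_node <node-name> [nodes-dir]
# nodes-dir defaults to /etc/pve/nodes (assumed cluster filesystem layout).
list_guests_on_node() {
    local node_dir="${2:-/etc/pve/nodes}/$1"
    local kind conf
    for kind in qemu-server lxc; do
        for conf in "$node_dir/$kind"/*.conf; do
            # Skip the unexpanded glob when a directory has no configs
            [ -e "$conf" ] || continue
            printf '%s %s\n' "$kind" "$(basename "$conf" .conf)"
        done
    done
}
```

Run from any surviving cluster node, `list_guests_on_node <failed-node>` prints one line per guest, such as `qemu-server 100`, which can serve as the restore checklist for step 2.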

---

### 2. Storage Failure

**Scenario:** Storage pool fails (ZFS pool corruption, disk failure, etc.)

**Recovery Steps:**

1. **Immediate Actions:**
   - Stop all VMs/containers using affected storage
   - Assess extent of damage
   - Check backup availability

2. **Storage Recovery:**
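   ```bash
   # For ZFS pools: show pool state and any files with errors
   zpool status -v

   # Re-import the pool if it is not listed (-f forces import of a pool last used by another system)
   zpool import -f <pool-name>

   # Verify all data checksums; check progress with `zpool status`
   zpool scrub <pool-name>
   ```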

3. **Data Recovery:**
   - Restore from backups if pool cannot be recovered
   - Use Proxmox Backup Server if available
   - Restore individual VMs/containers as needed

4. **Verification:**
   - Verify data integrity
   - Test restored VMs/containers
   - Document lessons learned

**RTO:** 8 hours  
**RPO:** Last backup
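
To support the damage assessment in step 1, a small filter can reduce the pool list to the pools that need attention. This is a sketch only: `unhealthy_pools` is a hypothetical helper name, and it expects the tab-separated output of `zpool list -H -o name,health` on standard input.

```bash
# Print the names of pools whose health is not ONLINE.
# Reads "name<TAB>health" lines on stdin, as produced by:
#   zpool list -H -o name,health
unhealthy_pools() {
    awk -F '\t' '$2 != "ONLINE" { print $1 }'
}
```

Typical use is `zpool list -H -o name,health | unhealthy_pools`. Empty output means every imported pool reports `ONLINE`; a pool that failed to import does not appear in the list at all and still has to be checked with `zpool import`.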

---

### 3. Network Outage

**Scenario:** Complete network failure or misconfiguration

**Recovery Steps:**

1. **Local Access:**
   - Use console access (iDRAC, iLO, or physical console)
   - Verify Proxmox host is running
   - Check network configuration

2. **Network Restoration:**
   ```bash
   # Check network interfaces
   ip addr show
   ip link show
   
   # Check routing
   ip route show
   
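   # Apply /etc/network/interfaces changes (requires ifupdown2)
   ifreload -a

   # Fallback without ifupdown2; may detach running guests from their bridges
   systemctl restart networking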
   ```

3. **VLAN Restoration:**
   - Verify VLAN configuration on switches
   - Check Proxmox bridge configuration
   - Test connectivity between VLANs

4. **Service Verification:**
   - Test internal services
   - Verify external connectivity (if applicable)
   - Check Cloudflare tunnels (if used)

**RTO:** 2 hours  
**RPO:** No data loss (network issue only)
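
The checks in step 2 can be condensed into two small filters over `ip` output, which is convenient on a console session where scrolling is awkward. Both function names are hypothetical; `has_default_route` expects the output of `ip route show`, and `down_links` expects the brief format of `ip -br link show`.

```bash
# Succeed if the routing table on stdin contains a default route.
# Intended input: ip route show
has_default_route() {
    grep -q '^default '
}

# Print the names of interfaces reported as DOWN.
# Intended input: ip -br link show   (columns: name, state, MAC, flags)
down_links() {
    awk '$2 == "DOWN" { print $1 }'
}
```

For example, `ip route show | has_default_route || echo "no default route"` flags a missing gateway, and `ip -br link show | down_links` lists the interfaces to inspect first.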

---

### 4. Data Corruption

**Scenario:** VM/container data corruption or accidental deletion

**Recovery Steps:**

1. **Immediate Actions:**
   - Stop affected VM/container
   - Do not attempt repairs that might worsen corruption
   - Document what was lost

2. **Recovery Options:**
   - **From Snapshot:** Restore from most recent snapshot
   - **From Backup:** Restore from Proxmox Backup Server
   - **From External Backup:** Use external backup solution

3. **Restoration:**
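   ```bash
   # Restore a VM from a backup (PBS volume ID or vzdump archive path)
   qmrestore <backup-volume> <vmid> --storage <storage>

   # Restore a container from a backup
   pct restore <vmid> <backup-volume> --storage <storage>

   # Or roll back to a snapshot (use `pct rollback` for containers)
   qm rollback <vmid> <snapshot-name>
   ```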

4. **Verification:**
   - Verify data integrity
   - Test application functionality
   - Update documentation

**RTO:** 4 hours  
**RPO:** Last snapshot/backup
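
VMs and containers take their restore arguments in a different order, which is easy to get wrong under pressure. The sketch below only prints the command for review rather than running it; `restore_command` is a hypothetical helper, and the guest type labels `qemu` and `lxc` are a convention chosen for this example.

```bash
# Print (do not run) the restore command for a guest.
# Usage: restore_command <qemu|lxc> <backup-volume> <vmid> <storage>
restore_command() {
    case "$1" in
        qemu) printf 'qmrestore %s %s --storage %s\n' "$2" "$3" "$4" ;;
        lxc)  printf 'pct restore %s %s --storage %s\n' "$3" "$2" "$4" ;;
        *)    echo "unknown guest type: $1" >&2; return 1 ;;
    esac
}
```

Printing first keeps a destructive restore as a deliberate second step: review the output, then paste it into the shell once the target VMID and storage are confirmed.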

---

### 5. Security Incident

**Scenario:** Security breach, unauthorized access, or malware

**Recovery Steps:**

1. **Immediate Containment:**
   - Isolate affected systems
   - Disconnect from network if necessary
   - Preserve evidence (logs, snapshots)

2. **Assessment:**
   - Identify scope of breach
   - Determine what was accessed/modified
   - Check for data exfiltration

3. **Recovery:**
   - Restore from known-good backups (pre-incident)
   - Rebuild affected systems if necessary
   - Update all credentials and keys

4. **Hardening:**
   - Review and update security policies
   - Patch vulnerabilities
   - Enhance monitoring

5. **Documentation:**
   - Document incident timeline
   - Update security procedures
   - Conduct post-incident review

**RTO:** 24 hours  
**RPO:** Pre-incident state
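
For the "preserve evidence" item in step 1, one possible approach is to archive the relevant directory and store a checksum next to it so later tampering can be detected. This is a sketch under assumptions: `preserve_evidence` is a hypothetical helper, and the destination should be storage that the affected system cannot write to afterwards.

```bash
# Archive a directory and record a SHA-256 checksum next to the archive.
# Usage: preserve_evidence <source-dir> <dest-dir> <label>
preserve_evidence() {
    local archive="$2/$3.tar.gz"
    mkdir -p "$2" || return 1
    # -C keeps paths in the archive relative to the source directory
    tar -czf "$archive" -C "$1" . || return 1
    # Run sha256sum inside dest-dir so the checksum file stores a relative name
    ( cd "$2" && sha256sum "$3.tar.gz" > "$3.tar.gz.sha256" )
}
```

A call such as `preserve_evidence /var/log <evidence-dir> <label>` produces `<label>.tar.gz` and `<label>.tar.gz.sha256`; running `sha256sum -c <label>.tar.gz.sha256` in the destination directory later confirms the archive is unchanged.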

---

## Backup Strategy

### Backup Schedule

- **Critical VMs/Containers:** Daily backups
- **Standard VMs/Containers:** Weekly backups
- **Configuration:** Daily backups of Proxmox configuration
- **Network Configuration:** Version controlled (Git)
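
The daily configuration backup could be as simple as a date-stamped tarball. The sketch below assumes the paths worth keeping are passed explicitly (for a Proxmox host, typically `/etc/pve` and `/etc/network/interfaces`); the function name `backup_config` and the `pve-config-` file prefix are illustrative choices.

```bash
# Archive configuration paths into a date-stamped tarball and print its path.
# Usage: backup_config <dest-dir> <path>...
backup_config() {
    local dest="$1"
    shift
    mkdir -p "$dest" || return 1
    local archive="$dest/pve-config-$(date +%Y-%m-%d).tar.gz"
    tar -czf "$archive" "$@" || return 1
    printf '%s\n' "$archive"
}
```

Because the file name contains only the date, a second run on the same day overwrites the first archive, which suits a once-daily schedule.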

### Backup Locations

1. **Primary:** Proxmox Backup Server (if available)
2. **Secondary:** External storage (NFS, SMB, or USB)
3. **Offsite:** Cloud storage or remote location

### Backup Verification

- Weekly restore tests
- Monthly full disaster recovery drill
- Quarterly review of backup strategy
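
Restore tests only help if recent backups exist in the first place. As a rough freshness check for file-based backup storage (NFS, SMB, or USB, not Proxmox Backup Server datastores), the sketch below looks for at least one `vzdump-*` archive modified within a given number of hours. `has_recent_backup` is a hypothetical helper, and the 24-hour threshold in the usage note is an assumption derived from the daily schedule above.

```bash
# Succeed if <dir> holds at least one vzdump archive modified within <max-hours>.
# Usage: has_recent_backup <dir> <max-hours>
has_recent_backup() {
    local recent
    # -mmin -N matches files modified less than N minutes ago
    recent=$(find "$1" -type f -name 'vzdump-*' -mmin "-$(( $2 * 60 ))" | head -n 1)
    [ -n "$recent" ]
}
```

For example, `has_recent_backup <dump-dir> 24 || echo "no backup in the last 24 hours"` could run from a scheduled job and feed whatever alerting is already in place.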

---

## Recovery Contacts

### Primary Contacts

- **Infrastructure Lead:** [Contact Information]
- **Network Administrator:** [Contact Information]
- **Security Team:** [Contact Information]

### Escalation

- **Level 1:** Infrastructure team (4 hours)
- **Level 2:** Management (8 hours)
- **Level 3:** External support (24 hours)

---

## Testing and Maintenance

### Quarterly DR Drills

1. **Test Scenario:** Simulate host failure
2. **Test Scenario:** Simulate storage failure
3. **Test Scenario:** Simulate network outage
4. **Document Results:** Update procedures based on findings

### Annual Full DR Test

- Complete infrastructure rebuild from backups
- Verify all services
- Update documentation

---

## Related Documentation

- **[BACKUP_AND_RESTORE.md](BACKUP_AND_RESTORE.md)** - Detailed backup procedures
- **[OPERATIONAL_RUNBOOKS.md](OPERATIONAL_RUNBOOKS.md)** - Operational procedures
- **[TROUBLESHOOTING_FAQ.md](../09-troubleshooting/TROUBLESHOOTING_FAQ.md)** - Troubleshooting guide

---

**Review Cycle:** Quarterly