
# Disaster Recovery Procedures
**Last Updated:** 2025-01-20
**Document Version:** 1.0
**Status:** Active Documentation
---
## Overview
This document outlines disaster recovery procedures for the Proxmox infrastructure, including recovery from hardware failures, data loss, network outages, and security incidents.
---
## Recovery Scenarios
### 1. Complete Host Failure
**Scenario:** A Proxmox host (R630 or ML110) fails completely and cannot be recovered.
**Recovery Steps:**
1. **Assess Impact:**
```bash
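# Run on a surviving node: check quorum and cluster membership
pvecm status
pvecm nodes
# List all guests with their node to see what ran on the failed host
pvesh get /cluster/resources --type vm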
```
2. **Recover from Backup:**
- Identify backup location (Proxmox Backup Server or external storage)
- Restore VMs/containers to another host in the cluster
- Verify network connectivity and services
3. **Rejoin Cluster (if host is replaced):**
```bash
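# On a surviving node: remove the failed node's stale entry first
pvecm delnode <failed-node-name>
# On the new/repaired host: join via the IP of an existing cluster node
pvecm add <existing-node-ip> --link0 <new-host-ip>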
```
4. **Verify Services:**
- Check all critical services are running
- Verify network connectivity
- Test application functionality
**Recovery Time Objective (RTO):** 4 hours
**Recovery Point Objective (RPO):** Last backup (daily for critical guests, weekly for standard guests)
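
To decide what to restore, it helps to list the guests that were defined on the failed host. The following is a minimal sketch, assuming it runs on a surviving cluster node where `/etc/pve` is still readable. The function name `list_guests_on_node`, the `PVE_NODES_DIR` override, and the node name in the usage comment are hypothetical and exist only for illustration.

```bash
# Sketch: list guest IDs whose configs are stored under a node's directory.
# Assumption: guest configs live in <base>/<node>/qemu-server/<vmid>.conf
# and <base>/<node>/lxc/<vmid>.conf, with <base> = /etc/pve/nodes.
# PVE_NODES_DIR is a hypothetical override so the function can be tried
# against a scratch directory.
list_guests_on_node() {
    node="$1"
    base="${PVE_NODES_DIR:-/etc/pve/nodes}"
    for kind in qemu-server lxc; do
        for conf in "$base/$node/$kind"/*.conf; do
            [ -e "$conf" ] || continue   # glob matched nothing
            echo "$kind $(basename "$conf" .conf)"
        done
    done
}
# Usage (hypothetical node name): list_guests_on_node r630-01
```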
---
### 2. Storage Failure
**Scenario:** Storage pool fails (ZFS pool corruption, disk failure, etc.)
**Recovery Steps:**
1. **Immediate Actions:**
- Stop all VMs/containers using affected storage
- Assess extent of damage
- Check backup availability
2. **Storage Recovery:**
```bash
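# For ZFS pools: show health and any failed devices
zpool status
# Import the pool only if it is not currently imported (-f forces the import)
zpool import -f <pool-name>
# Verify checksums across the pool
zpool scrub <pool-name>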
```
3. **Data Recovery:**
- Restore from backups if pool cannot be recovered
- Use Proxmox Backup Server if available
- Restore individual VMs/containers as needed
4. **Verification:**
- Verify data integrity
- Test restored VMs/containers
- Document lessons learned
**RTO:** 8 hours
**RPO:** Last backup
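
The damage assessment in step 1 can be partly scripted. This sketch filters the tab-separated output of `zpool list -H -o name,health` and reports every pool that is not `ONLINE`. The function name `unhealthy_pools` is hypothetical, and the function reads from standard input so it can be tried without a ZFS host.

```bash
# Sketch: print "<pool> <state>" for every pool whose health is not ONLINE.
# Input: "name<TAB>health" lines, as produced by `zpool list -H -o name,health`.
unhealthy_pools() {
    awk -F '\t' '$2 != "ONLINE" { print $1 " " $2 }'
}
# Usage on a host: zpool list -H -o name,health | unhealthy_pools
```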
---
### 3. Network Outage
**Scenario:** Complete network failure or misconfiguration
**Recovery Steps:**
1. **Local Access:**
- Use console access (iDRAC, iLO, or physical console)
- Verify Proxmox host is running
- Check network configuration
2. **Network Restoration:**
```bash
# Check network interfaces
ip addr show
ip link show
# Check routing
ip route show
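# Re-apply /etc/network/interfaces (ifupdown2, default on current Proxmox VE)
ifreload -a
# Fallback without ifupdown2; can detach running guests from their bridges
systemctl restart networking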
```
3. **VLAN Restoration:**
- Verify VLAN configuration on switches
- Check Proxmox bridge configuration
- Test connectivity between VLANs
4. **Service Verification:**
- Test internal services
- Verify external connectivity (if applicable)
- Check Cloudflare tunnels (if used)
**RTO:** 2 hours
**RPO:** No data loss (network issue only)
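
The connectivity tests in steps 3 and 4 can be run as one pass over a list of targets. This is a rough sketch, not a monitoring tool: `probe` and `check_targets` are hypothetical names, and the addresses in the usage comment are placeholders from a documentation range, not addresses from this environment.

```bash
# Sketch: probe each target once and summarize the result.
# probe() is split out so it can be replaced with another check
# (for example a TCP connect) without touching the loop.
probe() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# Print OK/FAIL per target; return non-zero if any target failed.
check_targets() {
    failed=0
    for target in "$@"; do
        if probe "$target"; then
            echo "OK $target"
        else
            echo "FAIL $target"
            failed=1
        fi
    done
    return "$failed"
}
# Usage (placeholder addresses): check_targets 192.0.2.1 192.0.2.10
```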
---
### 4. Data Corruption
**Scenario:** VM/container data corruption or accidental deletion
**Recovery Steps:**
1. **Immediate Actions:**
- Stop affected VM/container
- Do not attempt repairs that might worsen corruption
- Document what was lost
2. **Recovery Options:**
- **From Snapshot:** Restore from most recent snapshot
- **From Backup:** Restore from Proxmox Backup Server
- **From External Backup:** Use external backup solution
3. **Restoration:**
```bash
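# Restore a VM from a backup archive or PBS volume ID
qmrestore <backup-archive> <vmid> --storage <storage>
# Restore a container
pct restore <vmid> <backup-archive> --storage <storage>
# Or roll back to a snapshot (use `pct rollback` for containers)
qm rollback <vmid> <snapshot-name>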
```
4. **Verification:**
- Verify data integrity
- Test application functionality
- Update documentation
**RTO:** 4 hours
**RPO:** Last snapshot/backup
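
`qmrestore` and `pct restore` take their arguments in a different order, which is easy to get wrong under pressure. The sketch below only prints the command for review and does not run it. The helper name `restore_cmd` and the `vm`/`ct` type keywords are hypothetical.

```bash
# Sketch: print (do not execute) the restore command for a guest.
# Arguments: <type: vm|ct> <backup-archive> <vmid> <storage>
restore_cmd() {
    case "$1" in
        vm) echo "qmrestore $2 $3 --storage $4" ;;
        ct) echo "pct restore $3 $2 --storage $4" ;;
        *)  echo "unknown guest type: $1" >&2; return 1 ;;
    esac
}
# Usage: restore_cmd vm <backup-archive> <vmid> <storage>
```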
---
### 5. Security Incident
**Scenario:** Security breach, unauthorized access, or malware
**Recovery Steps:**
1. **Immediate Containment:**
- Isolate affected systems
- Disconnect from network if necessary
- Preserve evidence (logs, snapshots)
2. **Assessment:**
- Identify scope of breach
- Determine what was accessed/modified
- Check for data exfiltration
3. **Recovery:**
- Restore from known-good backups (pre-incident)
- Rebuild affected systems if necessary
- Update all credentials and keys
4. **Hardening:**
- Review and update security policies
- Patch vulnerabilities
- Enhance monitoring
5. **Documentation:**
- Document incident timeline
- Update security procedures
- Conduct post-incident review
**RTO:** 24 hours
**RPO:** Pre-incident state
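
One way to preserve evidence during containment is to archive the relevant log directory and record a checksum, so later changes to the archive can be detected. This is a minimal sketch under that assumption and not a forensic procedure. The function name `preserve_evidence` and both paths in the usage comment are hypothetical.

```bash
# Sketch: archive a directory and write a SHA-256 checksum beside the archive.
# Arguments: <source-dir> <destination-dir>. Prints the archive path.
# Assumption: the destination is on storage the affected system cannot modify.
preserve_evidence() {
    src="$1"
    dest="$2"
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    archive="$dest/evidence-$stamp.tar.gz"
    mkdir -p "$dest" || return 1
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Store the checksum with a relative file name so `sha256sum -c`
    # works from inside the destination directory.
    ( cd "$dest" && sha256sum "$(basename "$archive")" > "$(basename "$archive").sha256" ) || return 1
    echo "$archive"
}
# Usage (hypothetical paths): preserve_evidence /var/log /mnt/evidence
```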
---
## Backup Strategy
### Backup Schedule
- **Critical VMs/Containers:** Daily backups
- **Standard VMs/Containers:** Weekly backups
- **Configuration:** Daily backups of Proxmox configuration
- **Network Configuration:** Version controlled (Git)
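
The schedule above implies a maximum backup age of roughly 24 hours for critical guests and 168 hours for standard guests. This sketch checks a file-based backup directory against such a limit. It assumes backups are stored as files in a directory, so it does not cover a Proxmox Backup Server datastore. The function name `has_recent_backup` is hypothetical, and `find -mmin` is assumed to be available (GNU and BSD `find` both provide it).

```bash
# Sketch: succeed if <dir> holds at least one file modified in the last <hours>.
# Arguments: <backup-dir> <max-age-hours>
has_recent_backup() {
    dir="$1"
    max_age_min=$(( $2 * 60 ))
    [ -n "$(find "$dir" -type f -mmin "-$max_age_min" -print 2>/dev/null | head -n 1)" ]
}
```

For example, `has_recent_backup /var/lib/vz/dump 24 || echo "no backup in the last 24h"` would flag a missed daily backup, assuming that directory is where the dumps are written.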
### Backup Locations
1. **Primary:** Proxmox Backup Server (if available)
2. **Secondary:** External storage (NFS, SMB, or USB)
3. **Offsite:** Cloud storage or remote location
### Backup Verification
- Weekly restore tests
- Monthly full disaster recovery drill
- Quarterly review of backup strategy
---
## Recovery Contacts
### Primary Contacts
- **Infrastructure Lead:** [Contact Information]
- **Network Administrator:** [Contact Information]
- **Security Team:** [Contact Information]
### Escalation
- **Level 1:** Infrastructure team (4 hours)
- **Level 2:** Management (8 hours)
- **Level 3:** External support (24 hours)
---
## Testing and Maintenance
### Quarterly DR Drills
1. **Test Scenario:** Simulate host failure
2. **Test Scenario:** Simulate storage failure
3. **Test Scenario:** Simulate network outage
4. **Document Results:** Update procedures based on findings
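
Drill results are easier to compare over time when each run is scored against the RTO for its scenario. The sketch below uses the RTO values stated in this document. The scenario keys (`host`, `storage`, `network`, `corruption`, `security`) and the function names are hypothetical.

```bash
# Sketch: map a scenario key to its RTO in hours (values from this document).
rto_hours() {
    case "$1" in
        host)       echo 4 ;;
        storage)    echo 8 ;;
        network)    echo 2 ;;
        corruption) echo 4 ;;
        security)   echo 24 ;;
        *)          return 1 ;;
    esac
}

# Report whether a drill that took <elapsed-minutes> met the scenario's RTO.
# Arguments: <scenario> <elapsed-minutes>
drill_result() {
    rto=$(rto_hours "$1") || { echo "unknown scenario: $1" >&2; return 1; }
    if [ "$2" -le $(( rto * 60 )) ]; then
        echo "PASS $1: ${2}m within ${rto}h RTO"
    else
        echo "MISS $1: ${2}m exceeds ${rto}h RTO"
    fi
}
```

Calling `drill_result network 90` reports a pass, since 90 minutes is within the 2-hour RTO for a network outage.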
### Annual Full DR Test
- Complete infrastructure rebuild from backups
- Verify all services
- Update documentation
---
## Related Documentation
- **[BACKUP_AND_RESTORE.md](BACKUP_AND_RESTORE.md)** - Detailed backup procedures
- **[OPERATIONAL_RUNBOOKS.md](OPERATIONAL_RUNBOOKS.md)** - Operational procedures
- **[TROUBLESHOOTING_FAQ.md](../09-troubleshooting/TROUBLESHOOTING_FAQ.md)** - Troubleshooting guide
---
**Review Cycle:** Quarterly