# Disaster Recovery Procedures

**Last Updated:** 2025-01-20
**Document Version:** 1.0
**Status:** Active Documentation

---

## Overview

This document outlines disaster recovery procedures for the Proxmox infrastructure, including recovery from hardware failures, data loss, network outages, and security incidents.

---

## Recovery Scenarios

### 1. Complete Host Failure

**Scenario:** A Proxmox host (R630 or ML110) fails completely and cannot be recovered.

**Recovery Steps:**

1. **Assess Impact:**
   ```bash
   # Check quorum and node membership to confirm which host is down
   # (guest configs of the failed host remain under /etc/pve/nodes/<failed-node>/)
   pvecm status
   pvecm nodes
   ```

2. **Recover from Backup:**
   - Identify backup location (Proxmox Backup Server or external storage)
   - Restore VMs/containers to another host in the cluster
   - Verify network connectivity and services

3. **Rejoin Cluster (if host is replaced):**
   ```bash
   # On a surviving node, if the replacement reuses the old node name
   pvecm delnode <failed-node-name>
   # On new/repaired host: pass the address of an existing cluster node, not the cluster name
   pvecm add <existing-node-ip> --link0 <local-corosync-ip>
   ```

4. **Verify Services:**
   - Check all critical services are running
   - Verify network connectivity
   - Test application functionality

**Recovery Time Objective (RTO):** 4 hours
**Recovery Point Objective (RPO):** Last backup (typically daily)

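
The impact assessment in step 1 can be sketched as a small helper. This is a minimal sketch, assuming the cluster still has quorum so that `/etc/pve` is readable on a surviving node. The function name `list_guests_on_node` and the node name in the usage comment are hypothetical. It lists the guest configuration files that Proxmox keeps under `/etc/pve/nodes/<node>/`, which shows which VMs and containers were assigned to the failed host.

```bash
#!/usr/bin/env bash
# Sketch only: list guests whose configs live under a node's directory.
# The second argument overrides the config root (default /etc/pve) so the
# function can be tried against a copy of the directory tree.

list_guests_on_node() {
    local node="$1"
    local root="${2:-/etc/pve}"
    local conf kind
    for conf in "$root/nodes/$node"/qemu-server/*.conf "$root/nodes/$node"/lxc/*.conf; do
        [ -e "$conf" ] || continue            # skip globs that matched nothing
        case "$conf" in
            */qemu-server/*) kind="vm" ;;     # QEMU virtual machine
            *)               kind="ct" ;;     # LXC container
        esac
        printf '%s %s\n' "$kind" "$(basename "$conf" .conf)"
    done
}

# Usage (hypothetical node name):
#   list_guests_on_node pve-r630
```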

---

### 2. Storage Failure

**Scenario:** Storage pool fails (ZFS pool corruption, disk failure, etc.)

**Recovery Steps:**

1. **Immediate Actions:**
   - Stop all VMs/containers using affected storage
   - Assess extent of damage
   - Check backup availability

2. **Storage Recovery:**
   ```bash
   # For ZFS pools: show pool health and any faulted devices
   zpool status
   # Force-import only if the pool is not currently imported
   zpool import -f <pool-name>
   # Verify checksums across the pool (scrub is a zpool subcommand, not zfs)
   zpool scrub <pool-name>
   ```

3. **Data Recovery:**
   - Restore from backups if pool cannot be recovered
   - Use Proxmox Backup Server if available
   - Restore individual VMs/containers as needed

4. **Verification:**
   - Verify data integrity
   - Test restored VMs/containers
   - Document lessons learned

**RTO:** 8 hours
**RPO:** Last backup

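
To decide which guests to stop in step 1, it helps to know which ones reference the affected storage. The following is a minimal sketch, assuming guest configs use the usual `<key>: <storage>:<volume>,...` form for disks and mount points. The function name `guests_using_storage` and the storage ID in the usage comment are hypothetical, and the storage ID is interpolated into a regular expression as-is, so IDs containing regex metacharacters would need escaping.

```bash
#!/usr/bin/env bash
# Sketch only: print the IDs of guests whose config references a storage ID.
# The second argument overrides the config root (default /etc/pve).

guests_using_storage() {
    local storage="$1"
    local root="${2:-/etc/pve}"
    # -l prints only the names of files containing a matching volume line.
    grep -lE "^[a-z]+[0-9]*: ${storage}:" \
        "$root"/nodes/*/qemu-server/*.conf "$root"/nodes/*/lxc/*.conf 2>/dev/null \
        | sed -E 's#.*/([0-9]+)\.conf$#\1#' \
        | sort -un
}

# Usage (hypothetical storage ID):
#   guests_using_storage local-zfs
```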

---

### 3. Network Outage

**Scenario:** Complete network failure or misconfiguration

**Recovery Steps:**

1. **Local Access:**
   - Use console access (iDRAC, iLO, or physical console)
   - Verify Proxmox host is running
   - Check network configuration

2. **Network Restoration:**
   ```bash
   # Check network interfaces
   ip addr show
   ip link show

   # Check routing
   ip route show

   # Apply /etc/network/interfaces changes without a full restart (requires ifupdown2)
   ifreload -a
   # Fallback: restart networking if needed (can disconnect running guests from their bridges)
   systemctl restart networking
   ```

3. **VLAN Restoration:**
   - Verify VLAN configuration on switches
   - Check Proxmox bridge configuration
   - Test connectivity between VLANs

4. **Service Verification:**
   - Test internal services
   - Verify external connectivity (if applicable)
   - Check Cloudflare tunnels (if used)

**RTO:** 2 hours
**RPO:** No data loss (network issue only)

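
The connectivity tests in steps 3 and 4 can be scripted so the same check runs after every change. This is a minimal sketch, assuming ICMP is permitted between the host and the targets. The function name `check_reachable`, the `PING_CMD` override, and the addresses in the usage comment are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: ping each target once and report OK/FAIL per host.
# PING_CMD can be overridden, e.g. to dry-run the script without a network.

PING_CMD="${PING_CMD:-ping}"

check_reachable() {
    local failed=0 host
    for host in "$@"; do
        # -c 1: send one probe; -W 2: wait at most two seconds for a reply
        if "$PING_CMD" -c 1 -W 2 "$host" >/dev/null 2>&1; then
            printf 'OK   %s\n' "$host"
        else
            printf 'FAIL %s\n' "$host"
            failed=1
        fi
    done
    return "$failed"      # non-zero if any target was unreachable
}

# Usage (hypothetical gateway addresses, one per VLAN):
#   check_reachable 192.0.2.1 198.51.100.1
```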

---

### 4. Data Corruption

**Scenario:** VM/container data corruption or accidental deletion

**Recovery Steps:**

1. **Immediate Actions:**
   - Stop affected VM/container
   - Do not attempt repairs that might worsen corruption
   - Document what was lost

2. **Recovery Options:**
   - **From Snapshot:** Restore from most recent snapshot
   - **From Backup:** Restore from Proxmox Backup Server
   - **From External Backup:** Use external backup solution

3. **Restoration:**
   ```bash
   # Restore a VM from a backup (PBS volume ID or archive path); vzdump only creates backups
   qmrestore <backup-volume-or-archive> <vmid> --storage <storage>
   # Restore a container from a backup
   pct restore <vmid> <backup-volume-or-archive> --storage <storage>

   # Or roll back to a snapshot (use `pct rollback` for containers)
   qm rollback <vmid> <snapshot-name>
   ```

4. **Verification:**
   - Verify data integrity
   - Test application functionality
   - Update documentation

**RTO:** 4 hours
**RPO:** Last snapshot/backup

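
Because VMs and containers take their arguments in a different order, a dry-run helper can reduce mistakes under pressure. This is a minimal sketch: the function name `restore_cmd` and the `vm`/`ct` labels are hypothetical, and it only prints the command for review rather than running it.

```bash
#!/usr/bin/env bash
# Sketch only: print (do not execute) the restore command for a guest type.
# qmrestore takes <archive> <vmid>; pct restore takes <vmid> <archive>.

restore_cmd() {
    local kind="$1" vmid="$2" archive="$3" storage="$4"
    case "$kind" in
        vm) printf 'qmrestore %s %s --storage %s\n' "$archive" "$vmid" "$storage" ;;
        ct) printf 'pct restore %s %s --storage %s\n' "$vmid" "$archive" "$storage" ;;
        *)  printf 'unknown guest type: %s\n' "$kind" >&2
            return 1 ;;
    esac
}

# Usage (placeholders as in the steps above):
#   restore_cmd vm <vmid> <backup-volume-or-archive> <storage>
```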

---

### 5. Security Incident

**Scenario:** Security breach, unauthorized access, or malware

**Recovery Steps:**

1. **Immediate Containment:**
   - Isolate affected systems
   - Disconnect from network if necessary
   - Preserve evidence (logs, snapshots)

2. **Assessment:**
   - Identify scope of breach
   - Determine what was accessed/modified
   - Check for data exfiltration

3. **Recovery:**
   - Restore from known-good backups (pre-incident)
   - Rebuild affected systems if necessary
   - Update all credentials and keys

4. **Hardening:**
   - Review and update security policies
   - Patch vulnerabilities
   - Enhance monitoring

5. **Documentation:**
   - Document incident timeline
   - Update security procedures
   - Conduct post-incident review

**RTO:** 24 hours
**RPO:** Pre-incident state

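
Evidence preservation in step 1 is easier to defend later if the copy is checksummed at capture time. This is a minimal sketch, assuming GNU `tar` and `sha256sum` are available on the host. The function name `preserve_evidence` and the paths in the usage comment are hypothetical, and the destination should be storage the affected system cannot write to afterwards.

```bash
#!/usr/bin/env bash
# Sketch only: archive a directory and store a SHA-256 checksum next to it.

preserve_evidence() {
    local src="$1" dest="$2"
    local stamp archive name
    stamp="$(date -u +%Y%m%dT%H%M%SZ)"
    archive="$dest/evidence-$stamp.tar.gz"
    name="$(basename "$archive")"
    mkdir -p "$dest" || return 1
    # Archive relative to the parent directory so absolute paths are not embedded.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Record a checksum so later changes to the archive are detectable.
    ( cd "$dest" && sha256sum "$name" > "$name.sha256" ) || return 1
    printf '%s\n' "$archive"
}

# Usage (hypothetical destination path):
#   preserve_evidence /var/log /mnt/evidence
# Verify later with: (cd /mnt/evidence && sha256sum -c evidence-<stamp>.tar.gz.sha256)
```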

---

## Backup Strategy

### Backup Schedule

- **Critical VMs/Containers:** Daily backups
- **Standard VMs/Containers:** Weekly backups
- **Configuration:** Daily backups of Proxmox configuration
- **Network Configuration:** Version controlled (Git)

### Backup Locations

1. **Primary:** Proxmox Backup Server (if available)
2. **Secondary:** External storage (NFS, SMB, or USB)
3. **Offsite:** Cloud storage or remote location

### Backup Verification

- Weekly restore tests
- Monthly full disaster recovery drill
- Quarterly review of backup strategy

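
Between restore tests, a simple freshness check can flag backups that have fallen behind the schedule above. This is a minimal sketch, assuming file-based backups on external storage and GNU `stat`. The function name `backup_is_fresh` and the path in the usage comment are hypothetical. It checks only the age of the file, not whether it restores, so it complements the restore tests rather than replacing them.

```bash
#!/usr/bin/env bash
# Sketch only: succeed if a backup file is no older than a given number of hours.
# Returns 0 when fresh, 1 when too old, 2 when the file does not exist.

backup_is_fresh() {
    local file="$1" max_age_hours="$2"
    local now mtime age_hours
    [ -f "$file" ] || return 2
    now="$(date +%s)"
    mtime="$(stat -c %Y "$file")"          # GNU stat: mtime in epoch seconds
    age_hours=$(( (now - mtime) / 3600 ))
    [ "$age_hours" -le "$max_age_hours" ]
}

# Usage (hypothetical path; 24 hours for a daily schedule, 168 for weekly):
#   backup_is_fresh /mnt/backup/dump/latest.vma.zst 24 || echo "backup is stale"
```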

---

## Recovery Contacts

### Primary Contacts

- **Infrastructure Lead:** [Contact Information]
- **Network Administrator:** [Contact Information]
- **Security Team:** [Contact Information]

### Escalation

- **Level 1:** Infrastructure team (4 hours)
- **Level 2:** Management (8 hours)
- **Level 3:** External support (24 hours)

---

## Testing and Maintenance

### Quarterly DR Drills

1. **Test Scenario:** Simulate host failure
2. **Test Scenario:** Simulate storage failure
3. **Test Scenario:** Simulate network outage
4. **Document Results:** Update procedures based on findings

### Annual Full DR Test

- Complete infrastructure rebuild from backups
- Verify all services
- Update documentation

---

## Related Documentation

- **[BACKUP_AND_RESTORE.md](BACKUP_AND_RESTORE.md)** - Detailed backup procedures
- **[OPERATIONAL_RUNBOOKS.md](OPERATIONAL_RUNBOOKS.md)** - Operational procedures
- **[TROUBLESHOOTING_FAQ.md](../09-troubleshooting/TROUBLESHOOTING_FAQ.md)** - Troubleshooting guide

---

**Last Updated:** 2025-01-20
**Review Cycle:** Quarterly