Disaster Recovery Procedures
Overview
This document outlines procedures for disaster recovery and business continuity for the DBIS Core Lite payment system.
Recovery Objectives
- RTO (Recovery Time Objective): 4 hours
- RPO (Recovery Point Objective): 1 hour (data loss tolerance)
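The RPO can be checked mechanically by comparing the age of the newest backup against the tolerance. The following is a minimal sketch, assuming timestamps are supplied as Unix epoch seconds; the function name `rpo_met` is hypothetical and not part of DBIS Core Lite.

```shell
# Hypothetical helper: succeeds when the newest backup is within the RPO.
# Arguments: <backup_epoch> <now_epoch> [rpo_seconds]  (default 3600 = 1 hour)
rpo_met() {
  local backup_epoch=$1 now_epoch=$2 rpo_seconds=${3:-3600}
  local age=$(( now_epoch - backup_epoch ))
  # A negative age means the backup timestamp is in the future: treat as failure.
  [ "$age" -ge 0 ] && [ "$age" -le "$rpo_seconds" ]
}
```

A monitoring job could call `rpo_met "$last_backup_epoch" "$(date +%s)"` and raise an alert when it fails.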
Backup Strategy
Database Backups
Full Backup:
- Frequency: Daily at 02:00 UTC
- Retention: 30 days
- Location: Secure backup storage
- Format: Compressed SQL dump
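One way the daily full backup might be scripted is sketched below. It assumes PostgreSQL (the recovery steps use `psql`), that `DATABASE_URL` is exported, and that dumps live in a local `backups/` directory named as in the recovery steps. The function names are hypothetical.

```shell
# Hypothetical helper: builds the dump path for a given date (YYYYMMDD).
backup_path() {
  printf 'backups/dbis_core_%s.sql.gz\n' "$1"
}

# Sketch of the daily job; assumes DATABASE_URL is exported and pg_dump is on PATH.
run_full_backup() {
  local target
  target=$(backup_path "$(date -u +%Y%m%d)")
  mkdir -p "$(dirname "$target")"
  # Dump to a temporary name first so a failed dump never looks complete.
  # For plain-format dumps, --compress gzips the whole output file.
  pg_dump --dbname="$DATABASE_URL" --compress=9 --file="$target.partial" || return 1
  mv "$target.partial" "$target"
  # Prune dumps that have aged out of the 30-day retention window.
  find backups -name 'dbis_core_*.sql.gz' -mtime +30 -delete
}
```

Scheduling `run_full_backup` at 02:00 UTC (for example from cron) would match the stated frequency.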
Transaction Log Backups:
- Frequency: Every 15 minutes
- Retention: 7 days
- Used for point-in-time recovery
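If the transaction logs are PostgreSQL WAL segments (an assumption; this document does not name the mechanism), the 15-minute frequency maps to the `archive_timeout` setting. The sketch below prints the relevant `postgresql.conf` settings. The function name and the archive directory argument are hypothetical.

```shell
# Hypothetical helper: prints postgresql.conf settings for continuous WAL archiving.
# Argument: <wal_archive_dir>
wal_archive_settings() {
  local archive_dir=$1
  cat <<EOF
wal_level = replica
archive_mode = on
archive_command = 'test ! -f $archive_dir/%f && cp %p $archive_dir/%f'
archive_timeout = 900
EOF
}
```

`archive_timeout = 900` forces a segment switch at least every 900 seconds, so an archived log is never more than about 15 minutes behind.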
Audit Log Backups
- Frequency: Daily
- Retention: 10 years (compliance requirement)
- Format: CSV export + database dump
Configuration Backups
- All configuration files (env, certificates) backed up daily
- Version controlled in secure repository
Recovery Procedures
Full System Recovery
Prerequisites:
- Access to backup storage
- Database server available
- Application server available
Steps:
# 1. Restore database
gunzip < backups/dbis_core_YYYYMMDD.sql.gz | psql "$DATABASE_URL"
# 2. Run migrations
npm run migrate
# 3. Restore configuration
cp backups/.env.production .env
# 4. Restore certificates
cp -r backups/certs/* ./certs/
# 5. Start application
npm start
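Before the restore step, it may be worth confirming that the chosen dump is intact, because a truncated archive would otherwise fail partway through the import. The following is a minimal sketch; the function name `verify_backup` is hypothetical.

```shell
# Hypothetical pre-flight check: succeeds only if the file exists, is non-empty,
# and passes gzip's built-in integrity test.
verify_backup() {
  local file=$1
  [ -s "$file" ] && gzip -t "$file" 2>/dev/null
}
```

It could guard the restore, for example `verify_backup backups/dbis_core_YYYYMMDD.sql.gz && gunzip < backups/dbis_core_YYYYMMDD.sql.gz | psql --set ON_ERROR_STOP=1 "$DATABASE_URL"`, where `ON_ERROR_STOP` makes `psql` abort on the first SQL error instead of continuing.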
Point-in-Time Recovery
- Restore full backup to recovery server
- Apply transaction logs up to desired point
- Verify data integrity
- Switch traffic to recovered system
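The log-replay step can be sketched as follows for PostgreSQL 12 or later. This assumes the restored full backup is a physical base backup (for example one taken with `pg_basebackup`), because WAL cannot be replayed onto a database restored from a SQL dump. The function name, data directory, and archive directory are hypothetical.

```shell
# Hypothetical helper: configures a restored data directory for point-in-time recovery.
# Arguments: <data_dir> <wal_archive_dir> <target_time>
configure_pitr() {
  local data_dir=$1 archive_dir=$2 target_time=$3
  cat >> "$data_dir/postgresql.conf" <<EOF
restore_command = 'cp $archive_dir/%f %p'
recovery_target_time = '$target_time'
recovery_target_action = 'promote'
EOF
  # An empty recovery.signal file tells PostgreSQL 12+ to start in targeted recovery.
  touch "$data_dir/recovery.signal"
}
```

After `configure_pitr`, starting the server on the recovery host replays archived logs up to the target time and then promotes. Integrity checks should be run before traffic is switched.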
Partial Recovery (Single Table)
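# Restore a specific table. pg_restore needs a custom-format archive
# (created with pg_dump -Fc); it cannot read the plain SQL dumps.
pg_restore -t payments -d dbis_core backups/dbis_core_YYYYMMDD.dump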
Disaster Scenarios
Database Server Failure
Procedure:
- Identify failure (health check, monitoring alerts)
- Activate standby database or restore from backup
- Update connection strings
- Restart application
- Verify operations
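The final verification step can be automated by polling until the database accepts connections again. The following is a minimal sketch with a hypothetical helper, `wait_until_healthy`, that retries an arbitrary probe command.

```shell
# Hypothetical helper: retries a probe command until it succeeds.
# Arguments: <max_attempts> <delay_seconds> <probe command...>
wait_until_healthy() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  while ! "$@"; do
    # Give up once the attempt budget is exhausted.
    [ "$attempt" -ge "$max_attempts" ] && return 1
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
  return 0
}
```

For example, `wait_until_healthy 30 2 pg_isready -d "$DATABASE_URL"` probes up to 30 times, two seconds apart, using PostgreSQL's `pg_isready` utility, and fails if the server never becomes ready.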
Application Server Failure
Procedure:
- Deploy application to backup server
- Update load balancer configuration
- Verify health checks
- Monitor for issues
Network Partition
Procedure:
- Identify affected components
- Route traffic around partition
- Monitor reconciliation for missed transactions
- Reconcile once connectivity is restored
Data Corruption
Procedure:
- Identify corrupted data
- Isolate affected records
- Restore from backup
- Replay transactions if needed
- Verify data integrity
Testing
Disaster Recovery Testing
Schedule:
- Full DR test: Quarterly
- Partial DR test: Monthly
- Backup restore test: Weekly
Test Scenarios:
- Database server failure
- Application server failure
- Network partition
- Data corruption
- Complete site failure
Communication Plan
During disaster:
- Notify technical team immediately
- Activate on-call engineer
- Update status page
- Communicate with stakeholders
Post-Recovery
- Document incident
- Review recovery time and process
- Update procedures if needed
- Conduct post-mortem
- Implement improvements
Contacts
- Primary On-Call: [Contact]
- Secondary On-Call: [Contact]
- Database Team: [Contact]
- Infrastructure Team: [Contact]