dbis_core-lite/docs/deployment/disaster-recovery.md
2026-02-09 21:51:45 -08:00

Disaster Recovery Procedures

Overview

This document outlines procedures for disaster recovery and business continuity for the DBIS Core Lite payment system.

Recovery Objectives

  • RTO (Recovery Time Objective): 4 hours (maximum acceptable downtime)
  • RPO (Recovery Point Objective): 1 hour (data loss tolerance)

Backup Strategy

Database Backups

Full Backup:

  • Frequency: Daily at 02:00 UTC
  • Retention: 30 days
  • Location: Secure backup storage
  • Format: Compressed SQL dump
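
A minimal sketch of how the daily full backup and its 30-day retention might be scripted. It assumes PostgreSQL (the restore steps in this document use `psql`) and that `DATABASE_URL` is set. The `BACKUP_DIR` variable and the function names are illustrative and not part of DBIS Core Lite; the file name follows the `dbis_core_YYYYMMDD.sql.gz` pattern used in the restore step.

```shell
# Hypothetical location; this document only says "secure backup storage".
BACKUP_DIR="${BACKUP_DIR:-backups}"

# Dated file name matching the restore step, e.g. dbis_core_20260209.sql.gz.
backup_name() {
  printf 'dbis_core_%s.sql.gz\n' "$(date -u +%Y%m%d)"
}

# Dump to a temporary name first so a failed dump never leaves a
# truncated file under the final name. --compress makes pg_dump gzip
# its plain-text output.
run_full_backup() {
  target="$BACKUP_DIR/$(backup_name)"
  mkdir -p "$BACKUP_DIR" &&
    pg_dump --compress=9 --file="$target.tmp" --dbname="$DATABASE_URL" &&
    mv "$target.tmp" "$target"
}

# Remove full backups that have aged out of the 30-day retention window.
prune_old_backups() {
  find "$BACKUP_DIR" -type f -name 'dbis_core_*.sql.gz' -mtime +30 -delete
}
```

On a host whose clock is set to UTC, a cron entry of `0 2 * * *` that calls `run_full_backup && prune_old_backups` would match the 02:00 UTC schedule.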

Transaction Log Backups:

  • Frequency: Every 15 minutes
  • Retention: 7 days
  • Used for point-in-time recovery
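
One way to watch the 15-minute log backup cadence against the 1-hour RPO is to alert when the archive directory has no recently written file. The following is a minimal sketch; the function name and the directory in the usage line are hypothetical, and `-mmin` is a GNU/BSD `find` extension rather than POSIX.

```shell
# Succeeds if the directory ($1) holds at least one regular file that was
# modified less than $2 minutes ago; fails otherwise.
has_recent_backup() {
  [ -n "$(find "$1" -type f -mmin "-$2" | head -n 1)" ]
}
```

For example, `has_recent_backup /var/backups/wal 60 || echo "RPO at risk"` would flag an archive that has gone an hour without a new file.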

Audit Log Backups

  • Frequency: Daily
  • Retention: 10 years (compliance requirement)
  • Format: CSV export + database dump

Configuration Backups

  • All configuration files (env, certificates) backed up daily
  • Version controlled in secure repository

Recovery Procedures

Full System Recovery

  1. Prerequisites:

    • Access to backup storage
    • Database server available
    • Application server available
  2. Steps:

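    # 1. Restore database (stop at the first SQL error instead of continuing)
    gunzip -c backups/dbis_core_YYYYMMDD.sql.gz | psql -v ON_ERROR_STOP=1 "$DATABASE_URL"
    
    # 2. Run migrations
    npm run migrate
    
    # 3. Restore configuration
    cp backups/.env.production .env
    
    # 4. Restore certificates (create the target directory; include dotfiles)
    mkdir -p ./certs
    cp -R backups/certs/. ./certs/
    
    # 5. Start application
    npm start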
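
Before the database restore step, it can help to confirm that the chosen backup file is intact, since a truncated dump would otherwise fail partway through the restore. A minimal sketch using gzip's built-in integrity test; the function name is illustrative.

```shell
# Succeeds only if the file ($1) exists, is non-empty, and passes
# gzip's integrity check (gzip -t reads the whole file, writes nothing).
verify_backup() {
  [ -s "$1" ] && gzip -t "$1" 2>/dev/null
}
```

For example, `verify_backup backups/dbis_core_YYYYMMDD.sql.gz || echo "backup unusable, pick an older one"`.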
    

Point-in-Time Recovery

  1. Restore full backup to recovery server
  2. Apply transaction logs up to desired point
  3. Verify data integrity
  4. Switch traffic to recovered system
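
This document does not say how the transaction logs are applied. As a sketch under stated assumptions: with PostgreSQL 12 or later and continuous WAL archiving, step 2 is driven by recovery settings plus a `recovery.signal` marker file in the restored data directory. WAL replay of this kind starts from a physical base backup (for example one taken with `pg_basebackup`); it cannot be applied on top of a database restored from a SQL dump. The function name and all paths are hypothetical.

```shell
# Prepare a restored PostgreSQL 12+ data directory for point-in-time recovery.
#   $1 = data directory restored from a physical base backup (assumption)
#   $2 = directory holding the archived WAL segments (assumption)
#   $3 = recovery target timestamp, e.g. '2026-02-09 12:00:00+00'
configure_pitr() {
  {
    printf "restore_command = 'cp %s/%%f \"%%p\"'\n" "$2"
    printf "recovery_target_time = '%s'\n" "$3"
    printf "recovery_target_action = 'promote'\n"
  } >> "$1/postgresql.conf" &&
    : > "$1/recovery.signal"   # empty marker file that requests targeted recovery
}
```

After `configure_pitr /var/lib/postgresql/data /var/backups/wal '2026-02-09 12:00:00+00'`, starting the server would replay archived WAL up to the target time and then promote.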

Partial Recovery (Single Table)

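# Restore a specific table. pg_restore reads pg_dump archive formats
# (for example custom format, pg_dump -Fc); it cannot read the plain .sql.gz dump.
pg_restore -t payments -d dbis_core backups/dbis_core_YYYYMMDD.dump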

Disaster Scenarios

Database Server Failure

Procedure:

  1. Identify failure (health check, monitoring alerts)
  2. Activate standby database or restore from backup
  3. Update connection strings
  4. Restart application
  5. Verify operations
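
Between activating the standby and restarting the application, it may be useful to wait until the database actually accepts connections. A minimal sketch of a generic retry helper; the helper name and the attempt counts are illustrative, while `pg_isready` is the standard PostgreSQL connection check.

```shell
# Run a command until it succeeds, at most $1 times, sleeping $2 seconds
# between tries. Usage: retry ATTEMPTS DELAY COMMAND [ARGS...]
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    # Give up once the allowed number of attempts has been used.
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}
```

For example, `retry 30 2 pg_isready -d "$DATABASE_URL" && npm start` would poll for up to about a minute before starting the application.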

Application Server Failure

Procedure:

  1. Deploy application to backup server
  2. Update load balancer configuration
  3. Verify health checks
  4. Monitor for issues

Network Partition

Procedure:

  1. Identify affected components
  2. Route traffic around partition
  3. Monitor reconciliation for missed transactions
  4. Reconcile when connectivity is restored
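
Finding missed transactions amounts to a set difference between what one side sent and what the other side recorded. A minimal sketch, assuming each side can export its transaction IDs as a text file with one ID per line; the function name and file names are hypothetical.

```shell
# Print the IDs that appear in the first file but not in the second.
# Input order does not matter; both lists are sorted and de-duplicated first.
missing_ids() {
  a=$(mktemp) && b=$(mktemp) || return 1
  sort -u "$1" > "$a" && sort -u "$2" > "$b" && comm -23 "$a" "$b"
  status=$?
  rm -f "$a" "$b"
  return "$status"
}
```

For example, `missing_ids sent.txt settled.txt` would list the transactions that were sent during the partition but never settled.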

Data Corruption

Procedure:

  1. Identify corrupted data
  2. Isolate affected records
  3. Restore from the most recent backup taken before the corruption occurred
  4. Replay transactions if needed
  5. Verify data integrity

Testing

Disaster Recovery Testing

Schedule:

  • Full DR test: Quarterly
  • Partial DR test: Monthly
  • Backup restore test: Weekly

Test Scenarios:

  1. Database server failure
  2. Application server failure
  3. Network partition
  4. Data corruption
  5. Complete site failure

Communication Plan

During a disaster:

  1. Notify technical team immediately
  2. Activate on-call engineer
  3. Update status page
  4. Communicate with stakeholders

Post-Recovery

  1. Document incident
  2. Review recovery time and process
  3. Update procedures if needed
  4. Conduct post-mortem
  5. Implement improvements

Contacts

  • Primary On-Call: [Contact]
  • Secondary On-Call: [Contact]
  • Database Team: [Contact]
  • Infrastructure Team: [Contact]