the_order/docs/operations/DISASTER_RECOVERY.md

# Disaster Recovery Procedures

**Last Updated**: 2025-01-27
**Status**: Production Ready

## Overview

This document outlines disaster recovery (DR) procedures for The Order platform, including Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO).

## RTO/RPO Definitions

- **RTO (Recovery Time Objective)**: 4 hours
  - Maximum acceptable downtime
  - Time to restore service after a disaster

- **RPO (Recovery Point Objective)**: 1 hour
  - Maximum acceptable data loss, expressed as a window of time
  - Met by the hourly incremental backups described below
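
The RPO can be checked mechanically by comparing the age of the newest backup against the 1 hour limit. The following is a minimal sketch, not part of the existing tooling; the function name `within_rpo` is hypothetical, and timestamps are assumed to be available as Unix epoch seconds.

```bash
# Hypothetical helper: succeeds when the newest backup is no older than the RPO.
# Arguments: backup time and current time as Unix epoch seconds, plus an
# optional RPO in seconds (defaults to 3600, the 1 hour RPO defined above).
within_rpo() {
  local backup_epoch=$1 now_epoch=$2 rpo_seconds=${3:-3600}
  local age=$(( now_epoch - backup_epoch ))
  # A negative age means a clock problem, so treat it as a failure too.
  [ "$age" -ge 0 ] && [ "$age" -le "$rpo_seconds" ]
}
```

With GNU coreutils, a call might look like `within_rpo "$(stat -c %Y "$backup_file")" "$(date +%s)"`, where `$backup_file` is whichever backup was written last.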

## Backup Strategy

### Database Backups
- **Full Backups**: Daily at 02:00 UTC
- **Incremental Backups**: Hourly
- **Retention**: 30 days for full backups, 7 days for incremental
- **Location**: Primary region + cross-region replication
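
The retention periods above could be enforced with a small pruning helper. This is a sketch under assumptions: the function name `prune_backups` is hypothetical, and the file name patterns (`full_backup_*.sql.gz`, `incremental_backup_*.sql.gz`) are taken from the recovery commands later in this document rather than from the actual backup job.

```bash
# Hypothetical helper: delete backups of one kind that are past retention.
# Arguments: backup directory, kind ("full" or "incremental"), days to keep.
# find rounds file age down to whole days, so -mtime +30 matches files that
# are at least 31 days old.
prune_backups() {
  local dir=$1 kind=$2 days=$3
  find "$dir" -maxdepth 1 -type f \
    -name "${kind}_backup_*.sql.gz" -mtime +"$days" -delete
}
```

For the policy above this would be run as `prune_backups /backups full 30` and `prune_backups /backups incremental 7`, ideally only after the newest backup has been verified.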

### Storage Backups
- **Object Storage**: Cross-region replication enabled
- **WORM Storage**: Immutable (write once, read many); stored objects cannot be modified or deleted
- **Backup Frequency**: Real-time replication

### Configuration Backups
- **Infrastructure**: Version controlled in Git
- **Secrets**: Stored in Azure Key Vault with backup
- **Kubernetes Manifests**: Version controlled
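
One way the Key Vault backup could be taken is with the Azure CLI, one backup blob per secret. This is a sketch only: the function name `backup_keyvault_secrets`, the vault name, and the output directory are placeholders, and the document does not state how the existing backup is actually produced.

```bash
# Hypothetical helper: back up every secret in a Key Vault to one file each.
# Arguments: vault name, output directory (both supplied by the caller).
backup_keyvault_secrets() {
  local vault=$1 out_dir=$2 names name
  mkdir -p "$out_dir" || return 1
  # List secret names first so a failed listing aborts before any backup runs.
  names=$(az keyvault secret list --vault-name "$vault" \
    --query '[].name' -o tsv) || return 1
  while IFS= read -r name; do
    [ -n "$name" ] || continue
    az keyvault secret backup --vault-name "$vault" --name "$name" \
      --file "$out_dir/$name.bak" || return 1
  done <<< "$names"
}
```

The resulting files are opaque blobs meant to be restored with `az keyvault secret restore`; they should be stored with the same care as the secrets themselves.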

## Recovery Procedures

### Database Recovery

1. **Identify latest backup**
   ```bash
   # Print the newest full backup by modification time
   ls -t /backups/full_backup_*.sql.gz | head -n 1
   ```

2. **Restore database**
   ```bash
   # Stop at the first SQL error instead of continuing with a partial restore
   gunzip -c backup_file.sql.gz | psql -v ON_ERROR_STOP=1 "$DATABASE_URL"
   ```

3. **Apply incremental backups** (if needed)
   ```bash
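   # Apply in chronological order; the glob sorts by name, so this assumes
   # timestamped file names. Stop at the first backup that fails to apply.
   for backup in incremental_backup_*.sql.gz; do
     gunzip -c "$backup" | psql -v ON_ERROR_STOP=1 "$DATABASE_URL" || break
   done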
   ```
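
Before a restore it can help to confirm that the chosen backup is intact. The sketch below selects the newest full backup and tests the gzip stream; the function names `latest_full_backup` and `verify_backup` are hypothetical, and the file name pattern is the one used in step 1.

```bash
# Hypothetical helper: print the most recently modified full backup in a
# directory, or fail if there is none. Uses the shell's -nt (newer than)
# test rather than parsing ls output.
latest_full_backup() {
  local dir=$1 latest="" f
  for f in "$dir"/full_backup_*.sql.gz; do
    [ -e "$f" ] || continue   # the glob matched nothing
    if [ -z "$latest" ] || [ "$f" -nt "$latest" ]; then
      latest=$f
    fi
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Hypothetical helper: check that a backup decompresses cleanly.
verify_backup() {
  gzip -t "$1"
}
```

A restore could then be gated on both checks: `backup=$(latest_full_backup /backups) && verify_backup "$backup"`, followed by the restore command from step 2. Note that `gzip -t` only detects a damaged archive; it does not validate the SQL inside.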

### Service Recovery

1. **Restore from Git**
   ```bash
   git checkout <last-known-good-commit>
   ```

2. **Rebuild and deploy**
   ```bash
   pnpm build
   kubectl apply -k infra/k8s/overlays/prod
   ```

3. **Verify health**
   ```bash
   kubectl get pods -n the-order-prod
   kubectl logs -f <pod-name> -n the-order-prod
   ```
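
Listing pods shows their state at one moment; waiting on each rollout gives a clearer pass or fail signal. The following sketch assumes the services run as Kubernetes Deployments in the `the-order-prod` namespace used above; the function name `wait_for_rollouts` and the default timeout are hypothetical.

```bash
# Hypothetical helper: wait for every Deployment in a namespace to finish
# rolling out. Returns non-zero as soon as one rollout fails or times out.
# Arguments: namespace, optional per-deployment timeout (default 300s).
wait_for_rollouts() {
  local namespace=$1 timeout=${2:-300s} deployments d
  deployments=$(kubectl get deployments -n "$namespace" -o name) || return 1
  # Resource names contain no whitespace, so word splitting is safe here.
  for d in $deployments; do
    kubectl rollout status "$d" -n "$namespace" --timeout="$timeout" || return 1
  done
}
```

For example, `wait_for_rollouts the-order-prod 600s` blocks until all deployments are available or one of them exceeds ten minutes.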

### Full Disaster Recovery

1. **Assess situation**
   - Identify affected components
   - Determine scope of disaster
   - Notify stakeholders

2. **Activate DR site** (if primary region unavailable)
   - Start services in DR region
   - Restore from backups
   - Switch DNS to DR region once services there are healthy

3. **Data recovery**
   - Restore database from latest backup
   - Restore object storage from replication
   - Verify data integrity

4. **Service restoration**
   - Deploy all services
   - Verify connectivity
   - Run health checks

5. **Validation**
   - Test critical workflows
   - Verify data consistency
   - Monitor for issues

6. **Communication**
   - Update status page
   - Notify users
   - Document incident
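
Because the incident must be documented and the whole procedure has to fit inside the 4 hour RTO, it can be useful to record a timeline while the steps run. The wrapper below is a minimal sketch; the function name `run_step` and the log format are hypothetical.

```bash
# Hypothetical helper: run one recovery step and append its UTC start time,
# name, duration and exit status to a log file for the incident report.
# Arguments: log file, step name, then the command to run.
run_step() {
  local log=$1 name=$2 started t0 status=0
  shift 2
  started=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  t0=$SECONDS                      # bash counter of seconds since shell start
  "$@" || status=$?
  printf '%s\t%s\t%ss\texit=%s\n' \
    "$started" "$name" "$(( SECONDS - t0 ))" "$status" >> "$log"
  return "$status"
}
```

A step from the procedure above might be wrapped as `run_step dr.log "Deploy all services" kubectl apply -k infra/k8s/overlays/prod`. The step's exit status is passed through, so a calling script can stop at the first failure.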

## DR Testing

### Quarterly DR Tests
- Test database restore
- Test service recovery
- Test full DR procedure
- Document results

### Test Scenarios
1. **Database corruption**: Restore from backup
2. **Region failure**: Failover to DR region
3. **Service failure**: Restore from Git + redeploy
4. **Data loss**: Restore from backups

## Monitoring and Alerts

- **Backup failures**: Alert immediately
- **Replication lag**: Alert if > 5 minutes
- **Service health**: Alert if any service is down
- **Storage usage**: Alert if > 80% capacity
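
The numeric thresholds above reduce to simple comparisons. The sketch below covers the storage check and a generic threshold test; both function names are hypothetical, and the actual alerting is presumably handled by the monitoring stack rather than by a script like this.

```bash
# Hypothetical helper: print the used-space percentage (no % sign) of the
# filesystem that holds the given path, read from POSIX-format df output.
storage_usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeed when an integer value is strictly greater
# than its limit.
exceeds() {
  [ "$1" -gt "$2" ]
}
```

The storage alert would then be `exceeds "$(storage_usage_pct /backups)" 80`, and the replication alert `exceeds "$lag_seconds" 300`, where `$lag_seconds` is a lag measurement obtained elsewhere.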

## Contacts

- **On-Call Engineer**: See PagerDuty
- **Database Team**: database-team@the-order.org
- **Infrastructure Team**: infra-team@the-order.org
- **Security Team**: security@the-order.org
