Apply Composer changes: comprehensive API updates, migrations, middleware, and infrastructure improvements

- Add comprehensive database migrations (001-024) for schema evolution
- Enhance API schema with expanded type definitions and resolvers
- Add new middleware: audit logging, rate limiting, MFA enforcement, security, tenant auth
- Implement new services: AI optimization, billing, blockchain, compliance, marketplace
- Add adapter layer for cloud integrations (Cloudflare, Kubernetes, Proxmox, storage)
- Update Crossplane provider with enhanced VM management capabilities
- Add comprehensive test suite for API endpoints and services
- Update frontend components with improved GraphQL subscriptions and real-time updates
- Enhance security configurations and headers (CSP, CORS, etc.)
- Update documentation and configuration files
- Add new CI/CD workflows and validation scripts
- Implement design system improvements and UI enhancements
docs/runbooks/DATA_RETENTION_POLICY.md (new file, 224 lines)
@@ -0,0 +1,224 @@
# Data Retention Policy

## Overview

This document defines data retention policies for the Sankofa Phoenix platform, to ensure compliance with regulatory requirements and to control storage costs.

## Retention Periods

### Application Data

#### User Data
- **Active Users**: Retained indefinitely while the account is active
- **Inactive Users**: Retained for 7 years after last login
- **Deleted Users**: Soft-deleted for 90 days, then permanently deleted
- **User Activity Logs**: 2 years

#### Tenant Data
- **Active Tenants**: Retained indefinitely while the tenant is active
- **Suspended Tenants**: Retained for 1 year after suspension
- **Deleted Tenants**: Soft-deleted for 90 days, then permanently deleted

#### Resource Data
- **Active Resources**: Retained indefinitely
- **Deleted Resources**: Retained for 90 days for recovery purposes
- **Resource History**: 1 year

### Audit and Compliance Data

#### Audit Logs
- **Security Events**: 7 years (compliance requirement)
- **Authentication Logs**: 2 years
- **Authorization Logs**: 2 years
- **Data Access Logs**: 2 years
- **Administrative Actions**: 7 years

#### Compliance Data
- **STIG Compliance Reports**: 7 years
- **RMF Documentation**: 7 years
- **Incident Reports**: 7 years
- **Risk Assessments**: 7 years

### Operational Data

#### Application Logs
- **Application Logs (Loki)**: 30 days
- **Access Logs**: 90 days
- **Error Logs**: 90 days
- **Performance Logs**: 30 days

#### Metrics
- **Prometheus Metrics**: 30 days (raw)
- **Aggregated Metrics**: 1 year
- **Custom Metrics**: 90 days

#### Backups
- **Database Backups**: 7 days (daily), 4 weeks (weekly), 12 months (monthly)
- **Configuration Backups**: 90 days
- **Disaster Recovery Backups**: 7 years

### Blockchain Data

#### Transaction History
- **All Transactions**: Retained indefinitely (immutable)
- **Transaction Logs**: 7 years

#### Smart Contract Data
- **Contract State**: Retained indefinitely
- **Contract Events**: 7 years

## Data Deletion Procedures

### Automated Deletion

#### Scheduled Cleanup Jobs
```bash
# Run the cleanup job daily at 03:00
kubectl create cronjob cleanup-old-data \
  --image=postgres:14-alpine \
  --schedule="0 3 * * *" \
  --restart=OnFailure \
  -- /bin/bash -c "psql $DATABASE_URL -f /scripts/cleanup-old-data.sql"
```

#### Cleanup Scripts
- **User Data Cleanup**: Runs monthly; deletes users inactive for more than 7 years
- **Log Cleanup**: Runs daily; deletes logs older than their retention period
- **Backup Cleanup**: Runs daily; deletes backups older than their retention period

### Manual Deletion

#### User-Requested Deletion
1. User submits a deletion request
2. Account is marked for deletion
3. 30-day grace period for account recovery
4. Data is anonymized after the grace period
5. Permanent deletion after 90 days
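
The grace-period timeline above reduces to a small date calculation. A sketch only: the request date is illustrative, and GNU `date -d` relative syntax is assumed (adjust for BSD/macOS).

```bash
# Sketch: compute the recovery and purge deadlines for a deletion request
# (30-day grace period, permanent deletion at 90 days).
requested_at="2024-01-15"   # illustrative request date
grace_ends=$(date -u -d "$requested_at + 30 days" +%F)
purge_at=$(date -u -d "$requested_at + 90 days" +%F)
echo "recovery possible until: $grace_ends"   # 2024-02-14
echo "permanent deletion on:   $purge_at"     # 2024-04-14
```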

#### Administrative Deletion
1. Admin initiates the deletion
2. Approval is required for sensitive data
3. Data is exported for compliance (if required)
4. Data is deleted according to the retention policy

## Compliance Requirements

### GDPR (General Data Protection Regulation)
- **Right to Erasure**: Users can request deletion of their data
- **Data Portability**: Users can export their data
- **Retention Limitation**: Data is retained only as long as necessary

### SOX (Sarbanes-Oxley Act)
- **Financial Records**: 7 years retention
- **Audit Trails**: 7 years retention

### HIPAA (Health Insurance Portability and Accountability Act)
- **PHI Data**: 6 years minimum retention
- **Access Logs**: 6 years minimum retention

### DoD/MilSpec Compliance
- **Security Logs**: 7 years retention
- **Audit Trails**: 7 years retention
- **Compliance Reports**: 7 years retention

## Implementation

### Database Retention

#### Automated Cleanup Queries
```sql
-- Delete inactive users (7 years)
DELETE FROM users
WHERE last_login < NOW() - INTERVAL '7 years'
  AND status = 'INACTIVE';

-- Archive, then delete, audit logs older than 2 years.
-- Run both statements in one transaction so rows are never
-- deleted before the archive copy is committed.
BEGIN;

INSERT INTO audit_logs_archive
SELECT * FROM audit_logs
WHERE created_at < NOW() - INTERVAL '2 years';

DELETE FROM audit_logs
WHERE created_at < NOW() - INTERVAL '2 years';

COMMIT;
```

### Log Retention

#### Loki Retention Configuration
```yaml
# gitops/apps/monitoring/loki-config.yaml
retention_period: 30d
retention_stream:
  - selector: '{job="api"}'
    period: 90d
  - selector: '{job="portal"}'
    period: 90d
```

#### Prometheus Retention Configuration
```yaml
# gitops/apps/monitoring/prometheus-config.yaml
retention: 30d
retentionSize: 50GB
```

### Backup Retention

#### Backup Cleanup Script

Daily, weekly, and monthly backups are pruned against different thresholds, so they must live in separate directories; running all three `find` commands against a single directory would delete weekly and monthly backups after only 7 days. Assuming backups are written to per-tier subdirectories:

```bash
# Delete backups older than their retention period
find /backups/postgres/daily   -name "*.sql.gz" -mtime +7   -delete
find /backups/postgres/weekly  -name "*.sql.gz" -mtime +28  -delete  # 4 weeks
find /backups/postgres/monthly -name "*.sql.gz" -mtime +365 -delete  # 12 months
```
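
Before trusting `-delete` in a cleanup like the one above, the matching logic can be verified with a dry run against scratch files. A sketch: the scratch directory and GNU `touch -d` are assumptions for illustration.

```bash
# Sketch: dry-run the retention predicate before enabling -delete.
tmp=$(mktemp -d)
touch -d "10 days ago" "$tmp/expired-backup.sql.gz"
touch -d "2 days ago"  "$tmp/recent-backup.sql.gz"

# Same predicate as the daily rule, but without -delete:
matches=$(find "$tmp" -name "*.sql.gz" -mtime +7)
echo "would delete: $matches"
```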

## Data Archival

### Long-Term Storage

#### Archived Data Storage
- **Location**: S3 Glacier or equivalent
- **Format**: Compressed, encrypted archives
- **Retention**: Per compliance requirements
- **Access**: On-demand restoration

#### Archive Process
1. Data is identified for archival
2. Data is compressed and encrypted
3. Data is uploaded to archival storage
4. The index is updated with the archive location
5. The original data is deleted after verification
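
The ordering in steps 2 and 5 matters: the original is deleted only after the archive verifies. A minimal local sketch of that ordering, with illustrative file names; the encrypt and upload steps that would sit in between are omitted here.

```bash
# Sketch: compress -> verify -> only then delete the original.
staging=$(mktemp -d)
echo "2022 audit rows" > "$staging/audit_logs_2022.csv"

# Step 2: compress (encryption and upload would follow)
tar -czf "$staging/audit_logs_2022.tar.gz" -C "$staging" audit_logs_2022.csv
sha256sum "$staging/audit_logs_2022.tar.gz" > "$staging/audit_logs_2022.tar.gz.sha256"

# Step 5: delete the original only after the archive verifies
sha256sum -c "$staging/audit_logs_2022.tar.gz.sha256" \
  && rm "$staging/audit_logs_2022.csv"
```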

## Monitoring and Compliance

### Retention Policy Compliance

#### Automated Checks
- Daily verification of retention policies
- Alerts on data older than its retention period
- Reports on data deletion activity

#### Compliance Reporting
- Monthly retention compliance report
- Quarterly audit of data retention
- Annual compliance review

## Exceptions and Extensions

### Legal Hold
- Data subject to a legal hold cannot be deleted
- A legal hold overrides the retention policy
- Legal holds must be documented
- Data is released once the legal hold is lifted

### Business Requirements
- Extended retention for business-critical data
- Approval is required for extensions
- Extensions are documented and reviewed annually

## Contact

For questions about data retention:
- **Data Protection Officer**: dpo@sankofa.nexus
- **Compliance Team**: compliance@sankofa.nexus
- **Legal Team**: legal@sankofa.nexus
docs/runbooks/ESCALATION_PROCEDURES.md (new file, 239 lines)
@@ -0,0 +1,239 @@
# Escalation Procedures

## Overview

This document defines escalation procedures for incidents, support requests, and operational issues in the Sankofa Phoenix platform.

## Escalation Levels

### Level 1: On-Call Engineer
- **Response Time**: Immediate (P0/P1) or < 1 hour (P2/P3)
- **Responsibilities**:
  - Initial incident triage
  - Basic troubleshooting
  - Service restart/recovery
  - Status updates

### Level 2: Team Lead / Senior Engineer
- **Response Time**: < 15 minutes (P0/P1) or < 2 hours (P2/P3)
- **Responsibilities**:
  - Complex troubleshooting
  - Architecture decisions
  - Code review for hotfixes
  - Customer communication

### Level 3: Engineering Manager
- **Response Time**: < 30 minutes (P0) or < 4 hours (P1)
- **Responsibilities**:
  - Resource allocation
  - Cross-team coordination
  - Business impact assessment
  - Executive communication

### Level 4: CTO / VP Engineering
- **Response Time**: < 1 hour (P0 only)
- **Responsibilities**:
  - Strategic decisions
  - Customer escalation
  - Public communication
  - Resource approval

## Escalation Triggers

### Automatic Escalation
- P0 incident not resolved within 30 minutes
- P1 incident not resolved within 2 hours
- Multiple services affected simultaneously
- Data loss or a security breach detected
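
The two time-based triggers above reduce to a simple threshold check. A sketch only; in practice the incident age would come from the incident tracker, not a hard-coded argument.

```bash
# Sketch: time-based automatic escalation. Succeeds (exit 0) when the
# incident has exceeded its window and should escalate to the next level.
should_escalate() {  # usage: should_escalate <severity> <age-minutes>
  case "$1" in
    P0) [ "$2" -ge 30 ] ;;    # P0: escalate after 30 minutes
    P1) [ "$2" -ge 120 ] ;;   # P1: escalate after 2 hours
    *)  return 1 ;;           # other severities: manual escalation only
  esac
}

should_escalate P0 45 && echo "escalate to next level"
```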

### Manual Escalation
- The on-call engineer requests assistance
- A customer escalates to management
- The issue requires expertise not available at the current level
- Business impact exceeds the agreed threshold

## Escalation Matrix

| Severity | Level 1 | Level 2 | Level 3 | Level 4 |
|----------|---------|---------|---------|---------|
| P0 | Immediate | 15 min | 30 min | 1 hour |
| P1 | 15 min | 30 min | 2 hours | 4 hours |
| P2 | 1 hour | 2 hours | 24 hours | N/A |
| P3 | 4 hours | 24 hours | 1 week | N/A |
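
The matrix can also be encoded in tooling rather than read off the page. A sketch of a lookup for one column (Level 2), with the values taken from the table above; the function name is an assumption.

```bash
# Sketch: Level 2 engagement window per severity, per the escalation matrix.
sla_level2() {
  case "$1" in
    P0) echo "15 min" ;;
    P1) echo "30 min" ;;
    P2) echo "2 hours" ;;
    P3) echo "24 hours" ;;
    *)  echo "unknown severity" ;;
  esac
}

sla_level2 P1   # → 30 min
```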

## Escalation Process

### Step 1: Initial Assessment
1. The on-call engineer receives the alert/notification
2. Assess severity and impact
3. Begin investigation
4. Document findings

### Step 2: Escalation Decision
**Escalate if**:
- The issue is not resolved within its SLA
- Additional expertise is needed
- Customer impact is severe
- Business impact is high
- There is a security concern

**Do NOT escalate if**:
- The issue is being actively worked on
- Resolution is in progress
- Impact is minimal
- A standard procedure can resolve it

### Step 3: Escalation Execution
1. **Notify the next level**:
   - Create an escalation ticket
   - Update the incident channel
   - Call/Slack the next-level contact
   - Provide context and current status

2. **Hand off information**:
   - Incident summary
   - Current status
   - Actions taken
   - Relevant logs/metrics
   - Customer impact

3. **Update tracking**:
   - Update the incident system
   - Update the status page
   - Document the escalation reason

### Step 4: Escalation Resolution
1. The escalated engineer takes ownership
2. The on-call engineer provides support
3. Regular status updates
4. Resolution and post-mortem

## Communication Channels

### Internal Communication
- **Slack/Teams**: `#incident-YYYY-MM-DD-<name>`
- **PagerDuty/Opsgenie**: Automatic escalation
- **Email**: For non-urgent escalations
- **Phone**: For P0 incidents
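
The channel naming convention above can be generated rather than typed by hand, which avoids typos in the date. A sketch; the slug argument and function name are assumptions.

```bash
# Sketch: build the #incident-YYYY-MM-DD-<name> channel name from a slug.
incident_channel() {
  printf '#incident-%s-%s\n' "$(date -u +%F)" "$1"
}

incident_channel api-outage
```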

### External Communication
- **Status Page**: Public updates
- **Customer Notifications**: For affected customers
- **Support Tickets**: Update existing tickets

## Contact Information

### On-Call Rotation
- **Primary**: [Contact Info]
- **Secondary**: [Contact Info]
- **Schedule**: [Link to schedule]

### Escalation Contacts
- **Team Lead**: [Contact Info]
- **Engineering Manager**: [Contact Info]
- **CTO**: [Contact Info]
- **VP Engineering**: [Contact Info]

### Support Contacts
- **Support Team Lead**: [Contact Info]
- **Customer Success**: [Contact Info]

## Escalation Scenarios

### Scenario 1: P0 Service Outage
1. **Detection**: Monitoring alert
2. **Level 1**: On-call engineer investigates (5 min)
3. **Escalation**: If not resolved in 15 min → Level 2
4. **Level 2**: Team lead coordinates (15 min)
5. **Escalation**: If not resolved in 30 min → Level 3
6. **Level 3**: Engineering manager allocates resources
7. **Resolution**: Service restored
8. **Post-Mortem**: Within 24 hours

### Scenario 2: Security Breach
1. **Detection**: Security alert or anomaly
2. **Immediate**: Escalate to Level 3 (bypass Levels 1/2)
3. **Level 3**: Engineering manager + Security team
4. **Escalation**: If there is a data breach → Level 4
5. **Level 4**: CTO + Legal + PR
6. **Resolution**: Contain, investigate, remediate
7. **Post-Mortem**: Within 48 hours

### Scenario 3: Data Loss
1. **Detection**: Backup failure or data corruption
2. **Immediate**: Escalate to Level 2
3. **Level 2**: Team lead + Database team
4. **Escalation**: If data cannot be recovered → Level 3
5. **Level 3**: Engineering manager + Customer Success
6. **Resolution**: Restore from backup or data recovery
7. **Post-Mortem**: Within 24 hours

### Scenario 4: Performance Degradation
1. **Detection**: Performance metrics exceed thresholds
2. **Level 1**: On-call engineer investigates (1 hour)
3. **Escalation**: If not resolved → Level 2
4. **Level 2**: Team lead + Performance team
5. **Resolution**: Optimize or scale resources
6. **Post-Mortem**: If P1/P0, within 48 hours

## Customer Escalation

### Customer Escalation Process
1. **Support receives** the customer escalation
2. **Assess severity**:
   - Technical issue → Engineering
   - Billing issue → Finance
   - Account issue → Customer Success
3. **Notify the appropriate team**
4. **Provide customer updates** every 2 hours (P0/P1)
5. **Resolve and follow up**

### Customer Escalation Contacts
- **Support Escalation**: support-escalation@sankofa.nexus
- **Technical Escalation**: tech-escalation@sankofa.nexus
- **Executive Escalation**: executive-escalation@sankofa.nexus

## Escalation Metrics

### Tracking
- **Escalation Rate**: Percentage of incidents escalated
- **Escalation Time**: Time taken to escalate
- **Resolution Time**: Time to resolve after escalation
- **Customer Satisfaction**: Post-incident surveys

### Goals
- **P0 Escalation**: < 5% of P0 incidents
- **P1 Escalation**: < 10% of P1 incidents
- **Escalation Time**: Within the SLA threshold
- **Resolution Time**: < 2x the normal resolution time

## Best Practices

### Do's
- ✅ Escalate early if unsure
- ✅ Provide complete context
- ✅ Document all actions
- ✅ Communicate frequently
- ✅ Learn from escalations

### Don'ts
- ❌ Escalate without attempting basic troubleshooting first
- ❌ Escalate without context
- ❌ Skip levels unnecessarily
- ❌ Ignore customer escalations
- ❌ Forget to update status

## Review and Improvement

### Monthly Review
- Review escalation patterns
- Identify common causes
- Update procedures
- Train the team on improvements

### Quarterly Review
- Analyze escalation metrics
- Update contact information
- Review and update SLAs
- Improve documentation
docs/runbooks/INCIDENT_RESPONSE.md (new file, 319 lines)
@@ -0,0 +1,319 @@
# Incident Response Runbook

## Overview

This runbook provides step-by-step procedures for responding to incidents in the Sankofa Phoenix platform.

## Incident Severity Levels

### P0 - Critical (Immediate Response)
- Complete service outage
- Data loss or corruption
- Security breach
- **Response Time**: Immediate (< 5 minutes)
- **Resolution Target**: < 1 hour

### P1 - High (Urgent Response)
- Partial service outage affecting multiple users
- Performance degradation > 50%
- Authentication failures
- **Response Time**: < 15 minutes
- **Resolution Target**: < 4 hours

### P2 - Medium (Standard Response)
- A single feature/service degraded
- Performance degradation of 20-50%
- Non-critical errors
- **Response Time**: < 1 hour
- **Resolution Target**: < 24 hours

### P3 - Low (Normal Response)
- Minor issues
- Cosmetic problems
- Non-blocking errors
- **Response Time**: < 4 hours
- **Resolution Target**: < 1 week

## Incident Response Process

### 1. Detection and Triage

#### Detection Sources
- **Monitoring Alerts**: Prometheus/Alertmanager
- **Error Logs**: Loki, application logs
- **User Reports**: Support tickets, status page
- **Health Checks**: Automated health-check failures

#### Initial Triage Steps
```bash
# 1. Check service health
kubectl get pods --all-namespaces | grep -v Running

# 2. Check API health
curl -f https://api.sankofa.nexus/health || echo "API DOWN"

# 3. Check portal health
curl -f https://portal.sankofa.nexus/api/health || echo "PORTAL DOWN"

# 4. Check database connectivity
kubectl exec -it -n api deployment/api -- \
  psql "$DATABASE_URL" -c "SELECT 1" || echo "DB CONNECTION FAILED"

# 5. Check Keycloak
curl -f https://keycloak.sankofa.nexus/health || echo "KEYCLOAK DOWN"
```
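
The five checks above can be wrapped in one loop that prints a triage summary. In this sketch, `probe` is a stand-in for the real curl/kubectl checks so the reporting logic stays self-contained; in practice it would run the commands above. The simulated portal failure is purely for illustration.

```bash
# Sketch: one-pass triage summary. `probe` stands in for the real health
# checks; here it pretends the portal check fails, for illustration.
probe() { [ "$1" != "portal" ]; }

for svc in api portal database keycloak; do
  if probe "$svc"; then
    echo "$svc: OK"
  else
    echo "$svc: DOWN - investigate first"
  fi
done
```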

### 2. Incident Declaration

#### Create an Incident Channel
- Create a dedicated Slack/Teams channel: `#incident-YYYY-MM-DD-<name>`
- Invite: on-call engineer, team lead, product owner
- Post the initial status

#### Incident Template
```
INCIDENT: [Brief Description]
SEVERITY: P0/P1/P2/P3
STATUS: Investigating/Identified/Monitoring/Resolved
START TIME: [Timestamp]
AFFECTED SERVICES: [List]
IMPACT: [User impact description]
```

### 3. Investigation

#### Common Investigation Commands

**Check Pod Status**
```bash
kubectl get pods --all-namespaces -o wide
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100
```

**Check Resource Usage**
```bash
kubectl top nodes
kubectl top pods --all-namespaces
```

**Check the Database**
```bash
# Connection count
kubectl exec -it -n api deployment/api -- \
  psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity;"

# Long-running queries
kubectl exec -it -n api deployment/api -- \
  psql "$DATABASE_URL" -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
```

**Check Logs**
```bash
# Recent errors
kubectl logs -n api deployment/api --tail=500 | grep -i error

# Authentication failures
kubectl logs -n api deployment/api | grep -i "auth.*fail"

# Rate limiting
kubectl logs -n api deployment/api | grep -i "rate limit"
```

**Check Monitoring**
```bash
# Access Grafana
open https://grafana.sankofa.nexus

# Check Prometheus alert rules
kubectl get prometheusrules -n monitoring
```

### 4. Resolution

#### Common Resolution Actions

**Restart a Service**
```bash
kubectl rollout restart deployment/api -n api
kubectl rollout restart deployment/portal -n portal
```

**Scale Up**
```bash
kubectl scale deployment/api --replicas=5 -n api
```

**Roll Back a Deployment**
```bash
# See ROLLBACK_PLAN.md for detailed procedures
kubectl rollout undo deployment/api -n api
```

**Clear Rate Limits** (if needed)
```bash
# Access the Redis/rate-limit store and clear the relevant keys,
# or restart the rate-limit service:
kubectl rollout restart deployment/rate-limit -n api
```

**Database Maintenance**
```bash
# Vacuum the database
kubectl exec -it -n api deployment/api -- \
  psql "$DATABASE_URL" -c "VACUUM ANALYZE;"

# Kill long-running queries
kubectl exec -it -n api deployment/api -- \
  psql "$DATABASE_URL" -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '10 minutes';"
```

### 5. Post-Incident

#### Incident Report Template
```markdown
# Incident Report: [Date] - [Title]

## Summary
[Brief description of the incident]

## Timeline
- [Time] - Incident detected
- [Time] - Investigation started
- [Time] - Root cause identified
- [Time] - Resolution implemented
- [Time] - Service restored

## Root Cause
[Detailed root cause analysis]

## Impact
- **Users Affected**: [Number]
- **Duration**: [Time]
- **Services Affected**: [List]

## Resolution
[Steps taken to resolve]

## Prevention
- [ ] Action item 1
- [ ] Action item 2
- [ ] Action item 3

## Follow-up
- [ ] Update monitoring/alerts
- [ ] Update runbooks
- [ ] Code changes needed
- [ ] Documentation updates
```

## Common Incidents

### API High Latency

**Symptoms**: API response times > 500ms

**Investigation**:
```bash
# Check database query performance
kubectl exec -it -n api deployment/api -- \
  psql "$DATABASE_URL" -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;"

# Check API metrics
curl https://api.sankofa.nexus/metrics | grep http_request_duration
```

**Resolution**:
- Scale API replicas
- Optimize slow queries
- Add database indexes
- Check for N+1 query problems

### Database Connection Pool Exhausted

**Symptoms**: "too many connections" errors

**Investigation**:
```bash
kubectl exec -it -n api deployment/api -- \
  psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity;"
```

**Resolution**:
- Increase the connection pool size
- Kill idle connections
- Scale the database
- Check for connection leaks
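
Pool exhaustion is often just arithmetic: replica count times per-replica pool size must stay below the database's `max_connections`, with headroom for admin sessions. A sketch with assumed numbers; read the real values from the deployment spec and `postgresql.conf`.

```bash
# Sketch: pool-sizing sanity check. All three values are assumptions
# for illustration.
max_connections=100
pool_per_replica=10
replicas=12

needed=$((replicas * pool_per_replica))
if [ "$needed" -ge "$max_connections" ]; then
  echo "at risk: $needed connections needed, limit is $max_connections"
fi
```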

### Authentication Failures

**Symptoms**: Users cannot log in

**Investigation**:
```bash
# Check Keycloak
curl https://keycloak.sankofa.nexus/health
kubectl logs -n keycloak deployment/keycloak --tail=100

# Check API auth logs
kubectl logs -n api deployment/api | grep -i "auth.*fail"
```

**Resolution**:
- Restart Keycloak if needed
- Check the OIDC configuration
- Verify the JWT secret
- Check network connectivity

### Portal Not Loading

**Symptoms**: The portal returns a 500 error or a blank page

**Investigation**:
```bash
# Check portal pods
kubectl get pods -n portal
kubectl logs -n portal deployment/portal --tail=100

# Check portal health
curl https://portal.sankofa.nexus/api/health
```

**Resolution**:
- Restart the portal deployment
- Check environment variables
- Verify Keycloak connectivity
- Check for build errors

## Escalation

### When to Escalate
- P0 incident not resolved within 30 minutes
- P1 incident not resolved within 2 hours
- Additional expertise is needed
- Customer impact is severe

### Escalation Path
1. **On-call Engineer** → Team Lead
2. **Team Lead** → Engineering Manager
3. **Engineering Manager** → CTO/VP Engineering
4. **CTO** → Executive Team

### Emergency Contacts
- **On-call**: [Phone/Slack]
- **Team Lead**: [Phone/Slack]
- **Engineering Manager**: [Phone/Slack]
- **CTO**: [Phone/Slack]

## Communication

### Status Page Updates
- Update the status page during the incident
- Post updates every 30 minutes (P0/P1) or hourly (P2/P3)
- Include: status, affected services, estimated resolution time

### Customer Communication
- For P0/P1: Notify affected customers immediately
- For P2/P3: Include in the next status update
- Be transparent about impact and the resolution timeline
docs/runbooks/PROXMOX_DISASTER_RECOVERY.md (new file, 244 lines)
@@ -0,0 +1,244 @@
# Proxmox Disaster Recovery Procedures

## Overview

This document outlines disaster recovery procedures for Proxmox infrastructure managed by the Crossplane provider.

## Recovery Scenarios

### Scenario 1: Provider Pod Failure

#### Symptoms
- The provider pod is not running
- VM operations are failing
- The ProviderConfig is not working

#### Recovery Steps

1. **Check Pod Status**:
   ```bash
   kubectl get pods -n crossplane-system -l app=crossplane-provider-proxmox
   ```

2. **Restart the Provider**:
   ```bash
   kubectl delete pod -n crossplane-system -l app=crossplane-provider-proxmox
   ```

3. **Verify Recovery**:
   ```bash
   kubectl get pods -n crossplane-system -l app=crossplane-provider-proxmox
   kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=50
   ```

### Scenario 2: Proxmox Node Failure

#### Symptoms
- Cannot connect to Proxmox
- VMs are unreachable
- Provider connection errors

#### Recovery Steps

1. **Verify Node Status**:
   - Check the Proxmox Web UI
   - Verify the node is online
   - Check network connectivity

2. **Check the ProviderConfig**:
   ```bash
   kubectl get providerconfig proxmox-provider-config -o yaml
   ```

3. **Update the Endpoint if Needed**:
   - If the node IP changed, update the ProviderConfig
   - If using a hostname, verify DNS

4. **Test Connectivity**:
   ```bash
   curl -k https://your-proxmox:8006/api2/json/version
   ```

### Scenario 3: Credential Compromise

#### Symptoms
- Authentication failures
- Security alerts
- Unauthorized access

#### Recovery Steps

1. **Revoke the Compromised Credentials**:
   - Log into the Proxmox Web UI
   - Revoke API tokens
   - Change passwords

2. **Create New Credentials**:
   - Create new API tokens
   - Use strong passwords
   - Set appropriate permissions

3. **Update the Kubernetes Secret**:
   ```bash
   kubectl delete secret proxmox-credentials -n crossplane-system
   kubectl create secret generic proxmox-credentials \
     --from-literal=credentials.json='{"username":"root@pam","token":"new-token"}' \
     -n crossplane-system
   ```

4. **Restart the Provider**:
   ```bash
   kubectl delete pod -n crossplane-system -l app=crossplane-provider-proxmox
   ```

### Scenario 4: VM Data Loss

#### Symptoms
- VM not found
- Data missing
- Storage errors

#### Recovery Steps

1. **Check VM Status**:
   ```bash
   kubectl get proxmoxvm <vm-name>
   kubectl describe proxmoxvm <vm-name>
   ```

2. **Check Proxmox Backups**:
   - Log into the Proxmox Web UI
   - Check the backup storage
   - Review the backup schedule

3. **Restore from Backup**:
   - Use the Proxmox backup restore
   - Or recreate the VM from a template

4. **Recreate the VM Resource**:
   ```bash
   # Delete the existing resource
   kubectl delete proxmoxvm <vm-name>

   # Recreate it with the same configuration
   kubectl apply -f <vm-manifest>.yaml
   ```

### Scenario 5: Complete Provider Failure

#### Symptoms
- The provider is not responding
- All VM operations are failing
- ProviderConfig errors

#### Recovery Steps

1. **Check the Provider Deployment**:
   ```bash
   kubectl get deployment -n crossplane-system crossplane-provider-proxmox
   kubectl describe deployment -n crossplane-system crossplane-provider-proxmox
   ```

2. **Redeploy the Provider**:
   ```bash
   kubectl delete deployment -n crossplane-system crossplane-provider-proxmox
   kubectl apply -f crossplane-provider-proxmox/config/provider.yaml
   ```

3. **Verify the ProviderConfig**:
   ```bash
   kubectl get providerconfig
   kubectl describe providerconfig proxmox-provider-config
   ```

4. **Test VM Operations**:
   ```bash
   kubectl get proxmoxvm
   kubectl describe proxmoxvm <test-vm>
   ```

## Backup Procedures

### Provider Configuration Backup

```bash
# Back up the ProviderConfig
kubectl get providerconfig proxmox-provider-config -o yaml > providerconfig-backup.yaml

# Back up the credentials secret (be careful: this file contains live credentials!)
kubectl get secret proxmox-credentials -n crossplane-system -o yaml > credentials-backup.yaml
```
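
These backup files accumulate, so they need pruning of their own. A sketch of keeping only the five most recent copies, assuming timestamped names in a single directory; scratch files stand in for real backups, and `head -n -5` assumes GNU coreutils.

```bash
# Sketch: keep only the five newest providerconfig-<date>.yaml backups.
backup_dir=$(mktemp -d)
for d in 20240101 20240102 20240103 20240104 20240105 20240106 20240107; do
  : > "$backup_dir/providerconfig-$d.yaml"
done

# Sorted names are chronological, so everything except the last five goes.
ls -1 "$backup_dir"/providerconfig-*.yaml | sort | head -n -5 | xargs -r rm
ls "$backup_dir" | wc -l
```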
|
||||
|
||||
### VM Configuration Backup

```bash
# Backup all VM resources
kubectl get proxmoxvm -o yaml > all-vms-backup.yaml

# Backup specific VM
kubectl get proxmoxvm <vm-name> -o yaml > <vm-name>-backup.yaml
```
### Proxmox Backup

1. **Configure Backup Schedule**:
- Log into Proxmox Web UI
- Go to Datacenter → Backup
- Configure backup schedule

2. **Manual Backup**:
- Select VM in Proxmox Web UI
- Click Backup
- Choose backup storage
- Start backup
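The manual backup above has a scriptable equivalent on the node's shell, `vzdump`. A sketch; the VM ID, storage name, and compression choice are placeholders:

```shell
# Snapshot-mode backup of VM 100 to the named backup storage, zstd-compressed
vzdump 100 --storage <backup-storage> --mode snapshot --compress zstd
```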
## Recovery Testing

### Test Provider Recovery

1. **Simulate Failure**:
```bash
kubectl delete pod -n crossplane-system -l app=crossplane-provider-proxmox
```

2. **Verify Auto-Recovery**:
```bash
kubectl get pods -n crossplane-system -l app=crossplane-provider-proxmox
```

3. **Test VM Operations**:
```bash
kubectl get proxmoxvm
```
### Test VM Recovery

1. **Create Test VM**:
```bash
kubectl apply -f test-vm.yaml
```

2. **Delete VM**:
```bash
kubectl delete proxmoxvm test-vm
```

3. **Recreate VM**:
```bash
kubectl apply -f test-vm.yaml
```
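The recreate step can be turned into a pass/fail check by waiting for the resource to become ready. A sketch, assuming the provider sets the standard Crossplane `Ready` condition and a 5-minute provisioning budget:

```shell
kubectl apply -f test-vm.yaml

# Fails (non-zero exit) if the VM is not Ready within the timeout
kubectl wait proxmoxvm/test-vm --for=condition=Ready --timeout=300s
```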
## Prevention

1. **Regular Backups**: Schedule regular backups
2. **Monitoring**: Set up alerts for failures
3. **Documentation**: Keep procedures documented
4. **Testing**: Regularly test recovery procedures
5. **Redundancy**: Use multiple Proxmox nodes

## Related Documentation

- [VM Provisioning Runbook](./PROXMOX_VM_PROVISIONING.md)
- [Troubleshooting Guide](./PROXMOX_TROUBLESHOOTING.md)
- [Deployment Guide](../proxmox/DEPLOYMENT_GUIDE.md)
272
docs/runbooks/PROXMOX_TROUBLESHOOTING.md
Normal file
@@ -0,0 +1,272 @@
# Proxmox Troubleshooting Guide

## Common Issues and Solutions

### Provider Not Connecting

#### Symptoms
- Provider logs show connection errors
- ProviderConfig status is not Ready
- VM creation fails with connection errors

#### Solutions

1. **Verify Endpoint**:
```bash
curl -k https://your-proxmox:8006/api2/json/version
```
2. **Check Credentials**:
```bash
# Secret values are base64-encoded; decode them to verify
kubectl get secret proxmox-credentials -n crossplane-system -o yaml
```

3. **Test Authentication**:
```bash
curl -k -X POST \
  -d "username=root@pam&password=your-password" \
  https://your-proxmox:8006/api2/json/access/ticket
```

4. **Check Provider Logs**:
```bash
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=100
```
### VM Creation Fails

#### Symptoms
- VM resource stuck in Creating state
- Error messages in VM resource status
- No VM appears in Proxmox

#### Solutions

1. **Check VM Resource**:
```bash
kubectl describe proxmoxvm <vm-name>
```

2. **Verify Site Configuration**:
- Site must exist in ProviderConfig
- Endpoint must be reachable
- Node name must match actual Proxmox node
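To see which sites the ProviderConfig actually defines, dump its spec. The exact field path depends on this provider's schema, so the `jsonpath` below is an assumption; fall back to the full YAML if it differs:

```shell
# Full spec (always works)
kubectl get providerconfig proxmox-provider-config -o yaml

# Just the site names, assuming sites live under .spec.sites
kubectl get providerconfig proxmox-provider-config \
  -o jsonpath='{.spec.sites[*].name}'
```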
3. **Check Proxmox Resources**:
- Storage pool must exist
- Network bridge must exist
- OS template must exist
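Each of these can be verified from the Proxmox node's shell with `pvesh`, the CLI wrapper around the Proxmox API. A sketch; node and storage names are placeholders:

```shell
# Storage pools visible to the cluster
pvesh get /storage

# Network interfaces and bridges on a specific node
pvesh get /nodes/<node>/network

# Contents of a storage (templates, ISOs, disk images)
pvesh get /nodes/<node>/storage/<storage>/content
```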
4. **Check Proxmox Logs**:
- Log into Proxmox Web UI
- Check System Log
- Review task history
### VM Status Not Updating

#### Symptoms
- VM status remains unknown
- IP address not populated
- State not reflecting actual VM state

#### Solutions

1. **Check Provider Connectivity**:
```bash
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox | grep -i error
```

2. **Verify VM Exists in Proxmox**:
- Check Proxmox Web UI
- Verify VM ID matches

3. **Check Reconciliation**:
```bash
kubectl get proxmoxvm <vm-name> -o yaml | grep -A 5 conditions
```
### Storage Issues

#### Symptoms
- VM creation fails with storage errors
- "Storage not found" errors
- Insufficient storage errors

#### Solutions

1. **List Available Storage**:
```bash
# Via Proxmox API (the PVEAuthCookie ticket is sent as a cookie, not an Authorization header)
curl -k -b "PVEAuthCookie=<ticket>" \
  https://your-proxmox:8006/api2/json/storage
```

2. **Check Storage Capacity**:
- Log into Proxmox Web UI
- Check Storage section
- Verify available space

3. **Update Storage Name**:
- Verify actual storage pool name
- Update VM manifest if needed
### Network Issues

#### Symptoms
- VM created but no network connectivity
- IP address not assigned
- Network bridge errors

#### Solutions

1. **Verify Network Bridge**:
```bash
# Via Proxmox API
curl -k -b "PVEAuthCookie=<ticket>" \
  https://your-proxmox:8006/api2/json/nodes/ML110-01/network
```

2. **Check Network Configuration**:
- Verify bridge name in VM manifest
- Check bridge exists on node
- Verify bridge is active

3. **Check DHCP**:
- Verify DHCP server is running
- Check network configuration
- Review VM network settings
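If the QEMU guest agent is installed in the VM, the node can query the guest's actual interfaces and addresses directly, which quickly separates DHCP problems from bridge problems. The VM ID is a placeholder:

```shell
# On the Proxmox node: ask the guest agent for interface and IP state
qm agent <vmid> network-get-interfaces
```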
### Authentication Failures

#### Symptoms
- 401 Unauthorized errors
- Authentication failed messages
- Token/ticket errors

#### Solutions

1. **Verify Credentials**:
- Check username format: `user@realm`
- Verify password is correct
- Check token format if using tokens

2. **Test Authentication**:
```bash
# Password auth (returns a ticket and CSRF token)
curl -k -X POST \
  -d "username=root@pam&password=your-password" \
  https://your-proxmox:8006/api2/json/access/ticket

# Token auth uses the PVEAPIToken authorization header
curl -k -H "Authorization: PVEAPIToken=<user>@<realm>!<tokenid>=<secret>" \
  https://your-proxmox:8006/api2/json/version
```
3. **Check Permissions**:
- Verify user has VM creation permissions
- Check token permissions
- Review Proxmox user roles
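Permissions can be inspected from the node's shell with `pveum`. A sketch; the user ID is a placeholder, and `pveum user permissions` requires a reasonably recent Proxmox VE release:

```shell
# List users and their API tokens
pveum user list
pveum user token list <user>@<realm>

# Effective permissions of a user (recent PVE versions)
pveum user permissions <user>@<realm>
```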
### Provider Pod Issues

#### Symptoms
- Provider pod not starting
- Provider pod crashing
- Provider pod in Error state

#### Solutions

1. **Check Pod Status**:
```bash
kubectl get pods -n crossplane-system -l app=crossplane-provider-proxmox
kubectl describe pod -n crossplane-system -l app=crossplane-provider-proxmox
```

2. **Check Pod Logs**:
```bash
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=100
```

3. **Check Image**:
```bash
kubectl get deployment -n crossplane-system crossplane-provider-proxmox -o yaml | grep image
```

4. **Verify Resources**:
```bash
kubectl get deployment -n crossplane-system crossplane-provider-proxmox -o yaml | grep -A 5 resources
```
## Diagnostic Commands

### Check Provider Health
```bash
# Provider status
kubectl get deployment -n crossplane-system crossplane-provider-proxmox

# Provider logs
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=50

# Provider metrics
kubectl port-forward -n crossplane-system deployment/crossplane-provider-proxmox 8080:8080
curl http://localhost:8080/metrics
```

### Check VM Resources
```bash
# List all VMs
kubectl get proxmoxvm

# Get VM details
kubectl get proxmoxvm <vm-name> -o yaml

# Check VM events
kubectl describe proxmoxvm <vm-name>
```

### Check ProviderConfig
```bash
# List ProviderConfigs
kubectl get providerconfig

# Get ProviderConfig details
kubectl get providerconfig proxmox-provider-config -o yaml

# Check ProviderConfig status
kubectl describe providerconfig proxmox-provider-config
```
## Escalation Procedures

### Level 1: Basic Troubleshooting
1. Check provider logs
2. Verify credentials
3. Test connectivity
4. Review VM resource status

### Level 2: Advanced Troubleshooting
1. Check Proxmox Web UI
2. Review Proxmox logs
3. Verify network connectivity
4. Check resource availability

### Level 3: Infrastructure Issues
1. Contact Proxmox administrator
2. Check infrastructure status
3. Review network configuration
4. Verify DNS resolution

## Prevention

1. **Regular Monitoring**: Set up alerts for provider health
2. **Resource Verification**: Verify resources before deployment
3. **Credential Rotation**: Rotate credentials regularly
4. **Backup Configuration**: Backup ProviderConfig and secrets
5. **Documentation**: Keep documentation up to date

## Related Documentation

- [VM Provisioning Runbook](./PROXMOX_VM_PROVISIONING.md)
- [Deployment Guide](../proxmox/DEPLOYMENT_GUIDE.md)
- [Site Mapping](../proxmox/SITE_MAPPING.md)
207
docs/runbooks/PROXMOX_VM_PROVISIONING.md
Normal file
@@ -0,0 +1,207 @@
# Proxmox VM Provisioning Runbook

## Overview

This runbook provides step-by-step procedures for provisioning virtual machines on Proxmox infrastructure using the Crossplane provider.

## Prerequisites

- Kubernetes cluster with Crossplane and Proxmox provider installed
- ProviderConfig configured and ready
- Appropriate permissions to create ProxmoxVM resources
- Access to Proxmox Web UI (for verification)
## Standard VM Provisioning

### Step 1: Create VM Manifest

Create a YAML manifest for the VM:

```yaml
apiVersion: proxmox.sankofa.nexus/v1alpha1
kind: ProxmoxVM
metadata:
  name: my-vm
  namespace: default
spec:
  forProvider:
    node: ML110-01
    name: my-vm
    cpu: 2
    memory: 4Gi
    disk: 50Gi
    storage: local-lvm
    network: vmbr0
    image: ubuntu-22.04-cloud
    site: us-sfvalley
    userData: |
      #cloud-config
      users:
        - name: admin
          groups: sudo
          shell: /bin/bash
          sudo: ['ALL=(ALL) NOPASSWD:ALL']
  providerConfigRef:
    name: proxmox-provider-config
```
### Step 2: Apply Manifest

```bash
kubectl apply -f my-vm.yaml
```

### Step 3: Verify Creation

```bash
# Check VM resource status
kubectl get proxmoxvm my-vm

# Get detailed status
kubectl describe proxmoxvm my-vm
```

Then log into the Proxmox Web UI and verify the VM exists.

### Step 4: Verify VM Status

Wait for the VM to be created and check its status:

```bash
# Watch VM status
kubectl get proxmoxvm my-vm -w

# Check VM ID
kubectl get proxmoxvm my-vm -o jsonpath='{.status.vmId}'

# Check VM state
kubectl get proxmoxvm my-vm -o jsonpath='{.status.state}'

# Check IP address (if available)
kubectl get proxmoxvm my-vm -o jsonpath='{.status.ipAddress}'
```
## Multi-Site VM Provisioning

### Provision VM on Different Site

Update the `site` field in the manifest:

```yaml
spec:
  forProvider:
    site: eu-west-1  # or apac-1 or us-sfvalley
    node: R630-01    # for both eu-west-1 and apac-1
```
## VM Lifecycle Operations

### Start VM

The VM starts automatically after creation. To start it manually, update the VM resource, or use the Proxmox CLI on the node:

```bash
qm start <vmid>
```

### Stop VM

Update the VM resource, use the Proxmox Web UI, or use the Proxmox CLI on the node:

```bash
qm shutdown <vmid>   # graceful shutdown via ACPI/guest agent
qm stop <vmid>       # hard stop
```

### Delete VM

```bash
kubectl delete proxmoxvm my-vm
```
## Troubleshooting

### VM Creation Fails

1. **Check ProviderConfig**:
```bash
kubectl get providerconfig proxmox-provider-config -o yaml
```

2. **Check Provider Logs**:
```bash
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=50
```

3. **Verify Site Configuration**:
- Check if site exists in ProviderConfig
- Verify endpoint is reachable
- Check node name matches actual Proxmox node

4. **Check Proxmox Resources**:
- Verify storage pool exists
- Verify network bridge exists
- Verify OS template exists
### VM Stuck in Creating State

1. **Check VM Resource Events**:
```bash
kubectl describe proxmoxvm my-vm
```

2. **Check Proxmox Web UI**:
- Log into Proxmox
- Check if VM exists
- Check VM status
- Review Proxmox logs

3. **Verify Resources**:
- Check available storage
- Check available memory
- Check node status
### VM Not Getting IP Address

1. **Check Cloud-Init**:
- Verify userData is correct
- Check cloud-init logs in VM

2. **Check Network Configuration**:
- Verify network bridge is correct
- Check DHCP configuration
- Verify VM network interface

3. **Check Guest Agent**:
- Ensure QEMU guest agent is installed
- Verify guest agent is running
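The guest-agent and cloud-init checks can be scripted. A sketch; the VM ID is a placeholder, and the in-guest commands assume a systemd-based distribution:

```shell
# On the Proxmox node: succeeds only if the guest agent responds
qm agent <vmid> ping

# Inside the VM: confirm the agent service and cloud-init completion
systemctl status qemu-guest-agent
cloud-init status --long
```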
## Best Practices

1. **Resource Naming**: Use descriptive names for VMs
2. **Resource Limits**: Set appropriate CPU and memory limits
3. **Storage Planning**: Choose appropriate storage pools
4. **Network Configuration**: Use correct network bridges
5. **Backup Strategy**: Configure backups for important VMs
6. **Monitoring**: Set up monitoring for VM metrics

## Common Configurations

### Small VM (Development)
- CPU: 1-2 cores
- Memory: 2-4 Gi
- Disk: 20-50 Gi

### Medium VM (Staging)
- CPU: 2-4 cores
- Memory: 4-8 Gi
- Disk: 50-100 Gi

### Large VM (Production)
- CPU: 4+ cores
- Memory: 8+ Gi
- Disk: 100+ Gi
## Related Documentation

- [Deployment Guide](../proxmox/DEPLOYMENT_GUIDE.md)
- [Site Mapping](../proxmox/SITE_MAPPING.md)
- [Resource Inventory](../proxmox/RESOURCE_INVENTORY.md)
297
docs/runbooks/ROLLBACK_PLAN.md
Normal file
@@ -0,0 +1,297 @@
# Rollback Plan

## Overview

This document outlines procedures for rolling back deployments in the Sankofa Phoenix platform.

## Rollback Strategy

### GitOps Rollback (Recommended)

All applications are managed via ArgoCD GitOps. Rollbacks should be performed through Git by reverting to a previous commit.

### Manual Rollback

For emergency situations, manual rollbacks can be performed directly in Kubernetes.
## Pre-Rollback Checklist

- [ ] Identify the commit/tag to rollback to
- [ ] Verify the previous version is stable
- [ ] Notify team of rollback
- [ ] Document reason for rollback
- [ ] Check database migration compatibility (if applicable)
## Rollback Procedures

### 1. API Service Rollback

#### GitOps Method
```bash
# 1. Identify the commit to roll back
git log --oneline api/

# 2. Revert the bad commit (checking out an old commit and pushing would leave
#    a detached HEAD; git revert creates a new commit that undoes the change)
cd gitops/apps/api
git revert <bad-commit-hash>
git push origin main

# 3. ArgoCD will automatically sync
# Or manually sync:
argocd app sync api
```
#### Manual Method
```bash
# 1. List deployment history
kubectl rollout history deployment/api -n api

# 2. View specific revision
kubectl rollout history deployment/api -n api --revision=<revision-number>

# 3. Rollback to previous revision
kubectl rollout undo deployment/api -n api

# 4. Or rollback to specific revision
kubectl rollout undo deployment/api -n api --to-revision=<revision-number>

# 5. Monitor rollback
kubectl rollout status deployment/api -n api
```
### 2. Portal Rollback

#### GitOps Method
```bash
cd gitops/apps/portal
git revert <bad-commit-hash>
git push origin main
argocd app sync portal
```

#### Manual Method
```bash
kubectl rollout undo deployment/portal -n portal
kubectl rollout status deployment/portal -n portal
```
### 3. Database Migration Rollback

**⚠️ WARNING**: Database rollbacks require careful planning. Not all migrations are reversible.

#### Check Migration Status
```bash
# Connect to database
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL

# Then, at the psql prompt, check migration history:
SELECT * FROM schema_migrations ORDER BY version DESC LIMIT 10;
```
#### Rollback Migration (if reversible)
```bash
# Run down migration
cd api
npm run db:migrate:down

# Or manually revert SQL
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -f /path/to/rollback.sql
```

#### For Non-Reversible Migrations
1. Create new migration to restore previous state
2. Test in staging first
3. Apply during maintenance window
4. Document data loss risks
### 4. Frontend (Public Site) Rollback

#### GitOps Method
```bash
cd gitops/apps/frontend
git revert <bad-commit-hash>
git push origin main
argocd app sync frontend
```

#### Manual Method
```bash
kubectl rollout undo deployment/frontend -n frontend
kubectl rollout status deployment/frontend -n frontend
```
### 5. Monitoring Stack Rollback

```bash
# Rollback Prometheus
kubectl rollout undo deployment/prometheus-operator -n monitoring

# Rollback Grafana
kubectl rollout undo deployment/grafana -n monitoring

# Rollback Alertmanager
kubectl rollout undo deployment/alertmanager -n monitoring
```

### 6. Keycloak Rollback

```bash
# Rollback Keycloak
kubectl rollout undo deployment/keycloak -n keycloak

# Verify Keycloak health
curl https://keycloak.sankofa.nexus/health
```
## Post-Rollback Verification

### 1. Health Checks
```bash
# API
curl -f https://api.sankofa.nexus/health

# Portal
curl -f https://portal.sankofa.nexus/api/health

# Keycloak
curl -f https://keycloak.sankofa.nexus/health
```

### 2. Functional Testing
```bash
# Run smoke tests
./scripts/smoke-tests.sh

# Test authentication
curl -X POST https://api.sankofa.nexus/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { login(email: \"test@example.com\", password: \"test\") { token } }"}'
```
### 3. Monitoring
- Check Grafana dashboards for errors
- Verify Prometheus metrics are normal
- Check Loki logs for errors

### 4. Database Verification
```bash
# Verify database connectivity
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT 1"

# Check for data integrity issues
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT COUNT(*) FROM users;"
```
## Rollback Scenarios

### Scenario 1: API Breaking Change

**Symptoms**: API returns errors after deployment

**Rollback Steps**:
1. Immediately rollback API deployment
2. Verify API health
3. Check error logs
4. Investigate root cause
5. Fix and redeploy

### Scenario 2: Database Migration Failure

**Symptoms**: Database errors, application crashes

**Rollback Steps**:
1. Stop application deployments
2. Assess migration state
3. Rollback migration if possible
4. Or restore from backup
5. Redeploy previous application version

### Scenario 3: Portal Build Failure

**Symptoms**: Portal shows blank page or errors

**Rollback Steps**:
1. Rollback portal deployment
2. Verify portal loads
3. Check build logs
4. Fix build issues
5. Redeploy

### Scenario 4: Configuration Error

**Symptoms**: Services cannot connect to dependencies

**Rollback Steps**:
1. Revert configuration changes in Git
2. ArgoCD will sync automatically
3. Or manually update ConfigMaps/Secrets
4. Restart affected services
## Rollback Testing

### Staging Rollback Test
```bash
# 1. Deploy new version to staging
argocd app sync api-staging

# 2. Test new version
./scripts/smoke-tests.sh --env=staging

# 3. Simulate rollback
kubectl rollout undo deployment/api -n api-staging

# 4. Verify rollback works
./scripts/smoke-tests.sh --env=staging
```
## Rollback Communication

### Internal Communication
- Notify team in #engineering channel
- Update incident tracking system
- Document in runbook

### External Communication
- Update status page if user-facing
- Notify affected customers if needed
- Post-mortem for P0/P1 incidents

## Prevention

### Pre-Deployment
- [ ] All tests passing
- [ ] Code review completed
- [ ] Staging deployment successful
- [ ] Smoke tests passing
- [ ] Database migrations tested
- [ ] Rollback plan reviewed

### Deployment
- [ ] Deploy to staging first
- [ ] Monitor staging for 24 hours
- [ ] Gradual production rollout (canary)
- [ ] Monitor metrics closely
- [ ] Have rollback plan ready
## Rollback Decision Matrix

| Issue | Severity | Rollback? |
|-------|----------|-----------|
| Complete outage | P0 | Yes, immediately |
| Data corruption | P0 | Yes, immediately |
| Security breach | P0 | Yes, immediately |
| >50% error rate | P1 | Yes, within 15 min |
| Performance >50% degraded | P1 | Yes, within 30 min |
| Single feature broken | P2 | Maybe, assess impact |
| Minor bugs | P3 | No, fix forward |
## Emergency Contacts

- **On-call Engineer**: [Contact]
- **Team Lead**: [Contact]
- **DevOps Lead**: [Contact]