# Operations Runbook

This runbook provides operational procedures for Sankofa Phoenix.

## Table of Contents

1. [Daily Operations](#daily-operations)
2. [Tenant Management](#tenant-management)
3. [Backup Procedures](#backup-procedures)
4. [Incident Response](#incident-response)
5. [Maintenance Windows](#maintenance-windows)
6. [Troubleshooting](#troubleshooting)
7. [Security Audit](#security-audit)
8. [Emergency Contacts](#emergency-contacts)
9. [References](#references)

## Daily Operations

### Health Checks

```bash
# Check all pods
kubectl get pods --all-namespaces

# Check API health
curl https://api.sankofa.nexus/health

# Check Keycloak health
curl https://keycloak.sankofa.nexus/health

# Check database connections
# (single quotes so $DATABASE_URL is expanded inside the pod, not by the local shell)
kubectl exec -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
```

### Monitoring Dashboard Review

1. Review system overview dashboard
2. Check error rates and latency
3. Review billing anomalies
4. Check security events
5. Review Proxmox infrastructure status

### Log Review

```bash
# Recent errors
kubectl logs -n api deployment/api --tail=100 | grep -i error

# Authentication failures
kubectl logs -n api deployment/api | grep -i "auth.*fail"

# Billing issues
kubectl logs -n api deployment/api | grep -i billing
```

## Tenant Management

### Create New Tenant

```bash
# Via GraphQL
mutation {
  createTenant(input: {
    name: "New Tenant"
    domain: "tenant.example.com"
    tier: STANDARD
  }) {
    id
    name
    status
  }
}

# Or via API
curl -X POST https://api.sankofa.nexus/graphql \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { createTenant(...) }"}'
```

### Suspend Tenant

```graphql
# Update tenant status
mutation {
  updateTenant(id: "tenant-id", input: { status: SUSPENDED }) {
    id
    status
  }
}
```

### Delete Tenant

```graphql
# Soft delete (recommended)
mutation {
  updateTenant(id: "tenant-id", input: { status: DELETED }) {
    id
    status
  }
}

# Hard delete (requires confirmation)
# This will delete all tenant resources
```

### Tenant Resource Quotas

```graphql
# Check quota usage
query {
  tenant(id: "tenant-id") {
    quotaLimits {
      compute { vcpu memory instances }
      storage { total perInstance }
    }
    usage {
      totalCost
      byResource {
        resourceId
        cost
      }
    }
  }
}
```

## Backup Procedures

### Database Backups

#### Automated Backups

Backups run daily at 2 AM UTC:

```bash
# Check backup job status
kubectl get cronjob -n api postgres-backup

# View recent backups
kubectl get pvc -n api | grep backup
```

#### Manual Backup

```bash
# Create backup (no -t: a TTY can corrupt redirected output)
kubectl exec -n api deployment/postgres -- \
  pg_dump -U sankofa sankofa > "backup-$(date +%Y%m%d).sql"

# Restore from backup
kubectl exec -i -n api deployment/postgres -- \
  psql -U sankofa sankofa < backup-20240101.sql
```

### Keycloak Backups

```bash
# Export realm configuration (no -t, because the output is redirected to a file)
kubectl exec -n keycloak deployment/keycloak -- \
  /opt/keycloak/bin/kcadm.sh get realms/master \
  --realm master \
  --server http://localhost:8080 \
  --user admin \
  --password "$ADMIN_PASSWORD" > "keycloak-realm-$(date +%Y%m%d).json"
```

### Proxmox Backups

```bash
# Backup VM configuration
# Via Proxmox API or UI
# Store in version control or backup storage
```

### Tenant-Specific Backups

```graphql
# Export tenant data
query {
  tenant(id: "tenant-id") {
    id
    name
    resources {
      id
      name
      type
    }
  }
}

# Backup tenant resources
# Use resource export API or database dump filtered by tenant_id
```

## Incident Response

### Incident Classification

- **P0 - Critical**: System down, data loss, security breach
- **P1 - High**: Major feature broken, performance degradation
- **P2 - Medium**: Minor feature broken, non-critical issues
- **P3 - Low**: Cosmetic issues, minor bugs

### Incident Response Process

1. **Detection**: Monitor alerts, user reports
2. **Triage**: Classify severity, assign owner
3. **Containment**: Isolate affected systems
4. **Investigation**: Root cause analysis
5. **Resolution**: Fix and verify
6. **Post-Mortem**: Document and improve

### Common Incidents

#### API Down

```bash
# Check pod status
kubectl get pods -n api

# Check logs
kubectl logs -n api deployment/api --tail=100

# Restart if needed
kubectl rollout restart deployment/api -n api

# Check database
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT 1"
```

#### Database Connection Issues

```bash
# Check connection pool (no -t, because the output is piped)
kubectl exec -n api deployment/api -- \
  curl -s http://localhost:4000/metrics | grep db_connections

# Restart API to reset connections
kubectl rollout restart deployment/api -n api

# Check database load
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT * FROM pg_stat_activity"
```

#### High Error Rate

```bash
# Check error logs
kubectl logs -n api deployment/api | grep -i error | tail -50

# Check recent deployments
kubectl rollout history deployment/api -n api

# Rollback if needed
kubectl rollout undo deployment/api -n api
```

#### Billing Anomaly

```bash
# Check billing metrics (URL quoted so the shell does not interpret "?")
curl "https://prometheus.sankofa.nexus/api/v1/query?query=sankofa_billing_cost_usd"

# Review recent usage records
query {
  usage(tenantId: "tenant-id", timeRange: {...}) {
    totalCost
    byResource {
      resourceId
      cost
    }
  }
}

# Check for resource leaks
# Note: `resources` is not a built-in kubectl resource type; substitute the relevant kind
kubectl get resources --all-namespaces | grep tenant-id
```

## Maintenance Windows

### Scheduled Maintenance

Maintenance windows are scheduled as follows:

- **Weekly**: Sunday 2-4 AM UTC (low traffic)
- **Monthly**: First Sunday 2-6 AM UTC (major updates)

### Pre-Maintenance Checklist

- [ ] Notify all tenants (24 hours in advance)
- [ ] Create backup of database
- [ ] Create backup of Keycloak
- [ ] Review recent changes
- [ ] Prepare rollback plan
- [ ] Set maintenance mode flag

### Maintenance Mode

```bash
# Enable maintenance mode
kubectl set env deployment/api -n api MAINTENANCE_MODE=true

# Disable maintenance mode
kubectl set env deployment/api -n api MAINTENANCE_MODE=false
```

### Post-Maintenance Checklist

- [ ] Verify all services are up
- [ ] Run health checks
- [ ] Check error rates
- [ ] Verify backups completed
- [ ] Notify tenants of completion
- [ ] Update documentation

## Troubleshooting

### API Not Responding

```bash
# Check pod status
kubectl describe pod -n api -l app=api

# Check logs
kubectl logs -n api -l app=api --tail=100

# Check resource usage
kubectl top pod -n api

# Check network policies
kubectl get networkpolicies -n api
```

### Database Performance Issues

```bash
# Check slow queries (on PostgreSQL 13+, the column is total_exec_time instead of total_time)
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT * FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10"

# Check table sizes
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size FROM pg_tables ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10"

# Analyze tables
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "ANALYZE"
```

### Keycloak Issues

```bash
# Check Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100

# Check database connection
kubectl exec -it -n keycloak deployment/keycloak -- \
  curl http://localhost:8080/health/ready

# Restart Keycloak
kubectl rollout restart deployment/keycloak -n keycloak
```

### Proxmox Integration Issues

```bash
# Check Crossplane provider
kubectl get pods -n crossplane-system | grep proxmox

# Check provider logs
kubectl logs -n crossplane-system deployment/crossplane-provider-proxmox

# Test Proxmox connection
kubectl exec -it -n crossplane-system deployment/crossplane-provider-proxmox -- \
  curl https://proxmox-endpoint:8006/api2/json/version
```

## Security Audit

### Monthly Security Review

1. Review access logs
2. Check for failed authentication attempts
3. Review policy violations
4. Check for unusual API usage
5. Review incident response logs
6. Update security documentation

### Access Review

```graphql
# List all users
query {
  users {
    id
    email
    role
    lastLogin
  }
}

# Review tenant access
query {
  tenant(id: "tenant-id") {
    users {
      id
      email
      role
    }
  }
}
```

## Emergency Contacts

- **On-Call Engineer**: (configure in PagerDuty/Opsgenie)
- **Database Admin**: (configure)
- **Security Team**: (configure)
- **Management**: (configure)

## References

- Monitoring Guide: `docs/MONITORING_GUIDE.md`
- Deployment Guide: `docs/DEPLOYMENT_GUIDE.md`
- Keycloak Guide: `docs/KEYCLOAK_DEPLOYMENT.md`
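
## Appendix: Maintenance Window Check

The schedule under Scheduled Maintenance (weekly on Sunday 2-4 AM UTC, monthly on the first Sunday 2-6 AM UTC) can be turned into a guard that scripts call before running disruptive steps. The following is a minimal sketch, not part of the Sankofa Phoenix tooling: the function name `in_maintenance_window` is hypothetical, and it assumes GNU `date` (for `-d @EPOCH` and the `%-H` / `%-d` formats).

```bash
#!/usr/bin/env bash
# Sketch: decide whether a UTC timestamp falls inside a maintenance window.
# Assumes GNU date; in_maintenance_window is a hypothetical helper name.

# Usage: in_maintenance_window [EPOCH_SECONDS]   (defaults to the current time)
# Returns 0 inside a window, 1 outside.
in_maintenance_window() {
  local epoch="${1:-$(date -u +%s)}"
  local dow hour dom
  dow=$(date -u -d "@${epoch}" +%u)    # day of week: 1=Mon .. 7=Sun
  hour=$(date -u -d "@${epoch}" +%-H)  # hour 0..23, no leading zero
  dom=$(date -u -d "@${epoch}" +%-d)   # day of month 1..31

  # Both windows fall on a Sunday.
  [ "$dow" -eq 7 ] || return 1

  # Monthly window: first Sunday of the month, 02:00-06:00 UTC.
  if [ "$dom" -le 7 ] && [ "$hour" -ge 2 ] && [ "$hour" -lt 6 ]; then
    return 0
  fi

  # Weekly window: every Sunday, 02:00-04:00 UTC.
  [ "$hour" -ge 2 ] && [ "$hour" -lt 4 ]
}

# Example: report whether the current time is inside a window.
if in_maintenance_window; then
  echo "Inside a maintenance window"
else
  echo "Outside maintenance windows"
fi
```

A wrapper script could call this before `kubectl set env ... MAINTENANCE_MODE=true` and refuse to proceed outside a window unless an operator overrides it. Passing an explicit epoch, for example `in_maintenance_window "$(date -u -d '2024-01-07 03:00:00 UTC' +%s)"`, makes the check easy to exercise against known dates.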