# Incident Response Runbook

## Overview

This runbook provides step-by-step procedures for responding to incidents in the Sankofa Phoenix platform.

## Incident Severity Levels

### P0 - Critical (Immediate Response)

- Complete service outage
- Data loss or corruption
- Security breach
- **Response Time**: Immediate (< 5 minutes)
- **Resolution Target**: < 1 hour

### P1 - High (Urgent Response)

- Partial service outage affecting multiple users
- Performance degradation > 50%
- Authentication failures
- **Response Time**: < 15 minutes
- **Resolution Target**: < 4 hours

### P2 - Medium (Standard Response)

- Single feature/service degraded
- Performance degradation 20-50%
- Non-critical errors
- **Response Time**: < 1 hour
- **Resolution Target**: < 24 hours

### P3 - Low (Normal Response)

- Minor issues
- Cosmetic problems
- Non-blocking errors
- **Response Time**: < 4 hours
- **Resolution Target**: < 1 week

## Incident Response Process

### 1. Detection and Triage

#### Detection Sources

- **Monitoring Alerts**: Prometheus/Alertmanager
- **Error Logs**: Loki, application logs
- **User Reports**: Support tickets, status page
- **Health Checks**: Automated health check failures

#### Initial Triage Steps

```bash
# 1. Check service health
kubectl get pods --all-namespaces | grep -v Running

# 2. Check API health
curl -f https://api.sankofa.nexus/health || echo "API DOWN"

# 3. Check portal health
curl -f https://portal.sankofa.nexus/api/health || echo "PORTAL DOWN"

# 4. Check database connectivity
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT 1" || echo "DB CONNECTION FAILED"

# 5. Check Keycloak
curl -f https://keycloak.sankofa.nexus/health || echo "KEYCLOAK DOWN"
```

### 2. Incident Declaration
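Declaring an incident can be partially scripted so the channel name and template are always consistent. The following is a hedged sketch, not an existing tool: the `new_incident` function name is illustrative, and it assumes the `#incident-YYYY-MM-DD-<description>` channel convention described in this section.

```shell
#!/usr/bin/env bash
# new_incident SLUG SEVERITY DESCRIPTION
# Prints the incident channel name and a pre-filled declaration
# template matching the format in this runbook.
# (Illustrative helper; name and output format are assumptions.)
new_incident() {
  local slug="$1" severity="$2" description="$3"
  local today
  today="$(date -u +%Y-%m-%d)"
  echo "CHANNEL: #incident-${today}-${slug}"
  cat <<EOF
INCIDENT: ${description}
SEVERITY: ${severity}
STATUS: Investigating
START TIME: $(date -u +%Y-%m-%dT%H:%M:%SZ)
AFFECTED SERVICES: [List]
IMPACT: [User impact description]
EOF
}

new_incident api-latency P1 "API latency above 500ms"
```

Paste the output into the incident channel as the initial status post, then keep the `STATUS` line updated as the incident progresses.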
#### Create Incident Channel

- Create dedicated Slack/Teams channel: `#incident-YYYY-MM-DD-<description>`
- Invite: On-call engineer, Team lead, Product owner
- Post initial status

#### Incident Template

```
INCIDENT: [Brief Description]
SEVERITY: P0/P1/P2/P3
STATUS: Investigating/Identified/Monitoring/Resolved
START TIME: [Timestamp]
AFFECTED SERVICES: [List]
IMPACT: [User impact description]
```

### 3. Investigation

#### Common Investigation Commands

**Check Pod Status**

```bash
kubectl get pods --all-namespaces -o wide
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100
```

**Check Resource Usage**

```bash
kubectl top nodes
kubectl top pods --all-namespaces
```

**Check Database**

```bash
# Connection count
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"

# Long-running queries
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
```

**Check Logs**

```bash
# Recent errors
kubectl logs -n api deployment/api --tail=500 | grep -i error

# Authentication failures
kubectl logs -n api deployment/api | grep -i "auth.*fail"

# Rate limiting
kubectl logs -n api deployment/api | grep -i "rate limit"
```

**Check Monitoring**

```bash
# Access Grafana
open https://grafana.sankofa.nexus

# Check Prometheus alerts
kubectl get prometheusrules -n monitoring
```

### 4. Resolution
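Some resolution actions below embed a hard-coded threshold in inline SQL. It can be safer to generate the SQL from a parameter and review it before running it against production. A minimal, hedged sketch: the `terminate_sql` helper is an assumption, not an existing script, and the generated statement mirrors the `pg_stat_activity` queries used elsewhere in this runbook.

```shell
#!/usr/bin/env bash
# terminate_sql MINUTES
# Emit SQL that terminates queries active longer than MINUTES.
# pg_terminate_backend is destructive: print and review the SQL
# before feeding it to psql.
terminate_sql() {
  local minutes="$1"
  printf "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND now() - query_start > interval '%s minutes';\n" "$minutes"
}

# Review first:
terminate_sql 10
# Then run inside the API pod, e.g.:
# kubectl exec -it -n api deployment/api -- \
#   psql $DATABASE_URL -c "$(terminate_sql 10)"
```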
#### Common Resolution Actions

**Restart Service**

```bash
kubectl rollout restart deployment/api -n api
kubectl rollout restart deployment/portal -n portal
```

**Scale Up**

```bash
kubectl scale deployment/api --replicas=5 -n api
```

**Rollback Deployment**

```bash
# See ROLLBACK_PLAN.md for detailed procedures
kubectl rollout undo deployment/api -n api
```

**Clear Rate Limits** (if needed)

```bash
# Access Redis/rate limit store and clear keys
# Or restart rate limit service
kubectl rollout restart deployment/rate-limit -n api
```

**Database Maintenance**

```bash
# Vacuum database
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "VACUUM ANALYZE;"

# Kill long-running queries
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '10 minutes';"
```

### 5. Post-Incident

#### Incident Report Template

```markdown
# Incident Report: [Date] - [Title]

## Summary
[Brief description of incident]

## Timeline
- [Time] - Incident detected
- [Time] - Investigation started
- [Time] - Root cause identified
- [Time] - Resolution implemented
- [Time] - Service restored

## Root Cause
[Detailed root cause analysis]

## Impact
- **Users Affected**: [Number]
- **Duration**: [Time]
- **Services Affected**: [List]

## Resolution
[Steps taken to resolve]

## Prevention
- [ ] Action item 1
- [ ] Action item 2
- [ ] Action item 3

## Follow-up
- [ ] Update monitoring/alerts
- [ ] Update runbooks
- [ ] Code changes needed
- [ ] Documentation updates
```

## Common Incidents

### API High Latency

**Symptoms**: API response times > 500ms

**Investigation**:

```bash
# Check database query performance
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;"

# Check API metrics
curl https://api.sankofa.nexus/metrics | grep http_request_duration
```
**Resolution**:

- Scale API replicas
- Optimize slow queries
- Add database indexes
- Check for N+1 query problems

### Database Connection Pool Exhausted

**Symptoms**: "too many connections" errors

**Investigation**:

```bash
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"
```

**Resolution**:

- Increase connection pool size
- Kill idle connections
- Scale database
- Check for connection leaks

### Authentication Failures

**Symptoms**: Users cannot log in

**Investigation**:

```bash
# Check Keycloak
curl https://keycloak.sankofa.nexus/health
kubectl logs -n keycloak deployment/keycloak --tail=100

# Check API auth logs
kubectl logs -n api deployment/api | grep -i "auth.*fail"
```

**Resolution**:

- Restart Keycloak if needed
- Check OIDC configuration
- Verify JWT secret
- Check network connectivity

### Portal Not Loading

**Symptoms**: Portal returns 500 or blank page

**Investigation**:

```bash
# Check portal pods
kubectl get pods -n portal
kubectl logs -n portal deployment/portal --tail=100

# Check portal health
curl https://portal.sankofa.nexus/api/health
```

**Resolution**:

- Restart portal deployment
- Check environment variables
- Verify Keycloak connectivity
- Check build errors

## Escalation

### When to Escalate

- P0 incident not resolved in 30 minutes
- P1 incident not resolved in 2 hours
- Need additional expertise
- Customer impact is severe

### Escalation Path

1. **On-call Engineer** → Team Lead
2. **Team Lead** → Engineering Manager
3. **Engineering Manager** → CTO/VP Engineering
4. **CTO** → Executive Team
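The time-based escalation triggers are mechanical enough to script as a sanity check during an incident. A hedged sketch: the `should_escalate` function is illustrative and only encodes the P0/P1 thresholds stated above; P2/P3 escalation remains a judgment call.

```shell
#!/usr/bin/env bash
# should_escalate SEVERITY ELAPSED_MINUTES
# Returns 0 (escalate) when a P0 has run past 30 minutes or a
# P1 past 120 minutes, per the "When to Escalate" thresholds.
# (Illustrative helper, not an existing on-call tool.)
should_escalate() {
  local severity="$1" elapsed="$2"
  case "$severity" in
    P0) [ "$elapsed" -gt 30 ] ;;
    P1) [ "$elapsed" -gt 120 ] ;;
    *)  return 1 ;;  # P2/P3: no hard time-based trigger
  esac
}

should_escalate P0 45 && echo "Escalate to Team Lead"
# → Escalate to Team Lead
```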
### Emergency Contacts

- **On-call**: [Phone/Slack]
- **Team Lead**: [Phone/Slack]
- **Engineering Manager**: [Phone/Slack]
- **CTO**: [Phone/Slack]

## Communication

### Status Page Updates

- Update status page during incident
- Post updates every 30 minutes (P0/P1) or hourly (P2/P3)
- Include: Status, affected services, estimated resolution time

### Customer Communication

- For P0/P1: Notify affected customers immediately
- For P2/P3: Include in next status update
- Be transparent about impact and resolution timeline
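The status-page cadence can be encoded so reminder tooling and humans agree on the interval. A minimal sketch, assuming only the cadence stated above; the `next_update_minutes` helper is hypothetical:

```shell
#!/usr/bin/env bash
# next_update_minutes SEVERITY
# Minutes until the next status page update, per the
# Communication section: every 30 minutes for P0/P1,
# hourly for P2/P3. (Illustrative helper.)
next_update_minutes() {
  case "$1" in
    P0|P1) echo 30 ;;
    P2|P3) echo 60 ;;
    *)     echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

next_update_minutes P0   # → 30
next_update_minutes P3   # → 60
```

Wiring this into a channel reminder (for example, a scheduled Slack message) keeps update discipline consistent across on-call rotations.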