# Incident Response Runbook
## Overview
This runbook provides step-by-step procedures for responding to incidents in the Sankofa Phoenix platform.
## Incident Severity Levels
### P0 - Critical (Immediate Response)
- Complete service outage
- Data loss or corruption
- Security breach
- **Response Time**: Immediate (< 5 minutes)
- **Resolution Target**: < 1 hour
### P1 - High (Urgent Response)
- Partial service outage affecting multiple users
- Performance degradation > 50%
- Authentication failures
- **Response Time**: < 15 minutes
- **Resolution Target**: < 4 hours
### P2 - Medium (Standard Response)
- Single feature/service degraded
- Performance degradation 20-50%
- Non-critical errors
- **Response Time**: < 1 hour
- **Resolution Target**: < 24 hours
### P3 - Low (Normal Response)
- Minor issues
- Cosmetic problems
- Non-blocking errors
- **Response Time**: < 4 hours
- **Resolution Target**: < 1 week
## Incident Response Process
### 1. Detection and Triage
#### Detection Sources
- **Monitoring Alerts**: Prometheus/Alertmanager
- **Error Logs**: Loki, application logs
- **User Reports**: Support tickets, status page
- **Health Checks**: Automated health check failures
#### Initial Triage Steps
```bash
# 1. Check service health
kubectl get pods --all-namespaces | grep -v Running
# 2. Check API health
curl -f https://api.sankofa.nexus/health || echo "API DOWN"
# 3. Check portal health
curl -f https://portal.sankofa.nexus/api/health || echo "PORTAL DOWN"
# 4. Check database connectivity
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT 1" || echo "DB CONNECTION FAILED"
# 5. Check Keycloak
curl -f https://keycloak.sankofa.nexus/health || echo "KEYCLOAK DOWN"
```
### 2. Incident Declaration
#### Create Incident Channel
- Create dedicated Slack/Teams channel: `#incident-YYYY-MM-DD-<name>`
- Invite: On-call engineer, Team lead, Product owner
- Post initial status
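Channel naming can be scripted so the date format stays consistent across incidents; a small sketch (the slug argument and the Slack webhook are placeholders):

```bash
# Build the incident channel name from today's date and a short slug.
incident_channel() {
  printf '#incident-%s-%s\n' "$(date +%Y-%m-%d)" "$1"
}

# Usage:
#   incident_channel api-outage   # e.g. #incident-2025-12-12-api-outage
# Posting the initial status can then be automated, e.g. via a Slack
# incoming webhook (SLACK_WEBHOOK_URL is a placeholder):
#   curl -X POST -H 'Content-type: application/json' \
#     --data "{\"text\":\"Declared $(incident_channel api-outage)\"}" "$SLACK_WEBHOOK_URL"
```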
#### Incident Template
```
INCIDENT: [Brief Description]
SEVERITY: P0/P1/P2/P3
STATUS: Investigating/Identified/Monitoring/Resolved
START TIME: [Timestamp]
AFFECTED SERVICES: [List]
IMPACT: [User impact description]
```
### 3. Investigation
#### Common Investigation Commands
**Check Pod Status**
```bash
kubectl get pods --all-namespaces -o wide
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100
```
**Check Resource Usage**
```bash
kubectl top nodes
kubectl top pods --all-namespaces
```
**Check Database**
```bash
# Connection count
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"
# Long-running queries
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
```
**Check Logs**
```bash
# Recent errors
kubectl logs -n api deployment/api --tail=500 | grep -i error
# Authentication failures
kubectl logs -n api deployment/api | grep -i "auth.*fail"
# Rate limiting
kubectl logs -n api deployment/api | grep -i "rate limit"
```
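When the error volume is high, grouping identical lines makes the dominant failure stand out; a small helper you can pipe any of the log commands above into (a sketch):

```bash
# Rank distinct error lines by frequency; optional argument = how many to show.
top_errors() {
  grep -i 'error' | sort | uniq -c | sort -rn | head -n "${1:-5}"
}

# Usage: kubectl logs -n api deployment/api --tail=500 | top_errors 10
```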
**Check Monitoring**
```bash
# Access Grafana
open https://grafana.sankofa.nexus
# Check Prometheus alerts
kubectl get prometheusrules -n monitoring
```
### 4. Resolution
#### Common Resolution Actions
**Restart Service**
```bash
kubectl rollout restart deployment/api -n api
kubectl rollout restart deployment/portal -n portal
```
**Scale Up**
```bash
kubectl scale deployment/api --replicas=5 -n api
```
**Rollback Deployment**
```bash
# See ROLLBACK_PLAN.md for detailed procedures
kubectl rollout undo deployment/api -n api
```
**Clear Rate Limits** (if needed)
```bash
# Delete rate-limit counters in Redis (deployment name and key prefix
# are deployment-specific; verify both before running)
kubectl exec -it -n api deployment/redis -- \
sh -c "redis-cli --scan --pattern 'ratelimit:*' | xargs -r redis-cli DEL"
# Or restart the rate limit service
kubectl rollout restart deployment/rate-limit -n api
```
**Database Maintenance**
```bash
# Vacuum database
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "VACUUM ANALYZE;"
# Kill long-running queries
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '10 minutes';"
```
### 5. Post-Incident
#### Incident Report Template
```markdown
# Incident Report: [Date] - [Title]
## Summary
[Brief description of incident]
## Timeline
- [Time] - Incident detected
- [Time] - Investigation started
- [Time] - Root cause identified
- [Time] - Resolution implemented
- [Time] - Service restored
## Root Cause
[Detailed root cause analysis]
## Impact
- **Users Affected**: [Number]
- **Duration**: [Time]
- **Services Affected**: [List]
## Resolution
[Steps taken to resolve]
## Prevention
- [ ] Action item 1
- [ ] Action item 2
- [ ] Action item 3
## Follow-up
- [ ] Update monitoring/alerts
- [ ] Update runbooks
- [ ] Code changes needed
- [ ] Documentation updates
```
## Common Incidents
### API High Latency
**Symptoms**: API response times > 500ms
**Investigation**:
```bash
# Check database query performance
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;"
# Check API metrics
curl https://api.sankofa.nexus/metrics | grep http_request_duration
```
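Mean latency can be estimated from the histogram counters in that metrics output; a sketch, assuming the default `http_request_duration_seconds` metric name used by common Prometheus client libraries:

```bash
# Compute mean request latency (seconds) from Prometheus text-format metrics.
avg_latency() {
  awk '/^http_request_duration_seconds_sum/   { s += $2 }
       /^http_request_duration_seconds_count/ { n += $2 }
       END { if (n > 0) printf "%.3f\n", s / n }'
}

# Usage: curl -s https://api.sankofa.nexus/metrics | avg_latency
```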
**Resolution**:
- Scale API replicas
- Optimize slow queries
- Add database indexes
- Check for N+1 query problems
### Database Connection Pool Exhausted
**Symptoms**: "too many connections" errors
**Investigation**:
```bash
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"
```
**Resolution**:
- Increase connection pool size
- Kill idle connections
- Scale database
- Check for connection leaks
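The idle-connection cleanup mirrors the long-query termination in the Resolution section; a sketch (tune the interval to match your pool's idle timeout):

```bash
# Terminate connections idle for more than 10 minutes
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"
```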
### Authentication Failures
**Symptoms**: Users cannot log in
**Investigation**:
```bash
# Check Keycloak
curl https://keycloak.sankofa.nexus/health
kubectl logs -n keycloak deployment/keycloak --tail=100
# Check API auth logs
kubectl logs -n api deployment/api | grep -i "auth.*fail"
```
**Resolution**:
- Restart Keycloak if needed
- Check OIDC configuration
- Verify JWT secret
- Check network connectivity
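Verifying a JWT usually starts with inspecting a failing token's claims; a helper that decodes the payload segment locally (a sketch; pass in whatever token you copied from a failing request):

```bash
# Decode the payload (second segment) of a JWT to inspect claims like exp/iss.
decode_jwt_payload() {
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the base64 padding that JWT encoding strips
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Usage: decode_jwt_payload "$TOKEN"   # then compare exp against date +%s
```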
### Portal Not Loading
**Symptoms**: Portal returns 500 or blank page
**Investigation**:
```bash
# Check portal pods
kubectl get pods -n portal
kubectl logs -n portal deployment/portal --tail=100
# Check portal health
curl https://portal.sankofa.nexus/api/health
```
**Resolution**:
- Restart portal deployment
- Check environment variables
- Verify Keycloak connectivity
- Check build errors
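The environment-variable check can be made mechanical with a small filter that flags missing keys in an env dump (a sketch; the variable names in the usage line are hypothetical, substitute the portal's actual settings):

```bash
# Report which of the given variable names are absent from an env dump on stdin.
check_env() {
  dump=$(cat)
  missing=""
  for k in "$@"; do
    printf '%s\n' "$dump" | grep -q "^${k}=" || missing="$missing $k"
  done
  if [ -n "$missing" ]; then echo "MISSING:$missing"; else echo "OK"; fi
}

# Usage:
#   kubectl exec -n portal deployment/portal -- env | check_env KEYCLOAK_URL DATABASE_URL
```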
## Escalation
### When to Escalate
- P0 incident not resolved in 30 minutes
- P1 incident not resolved in 2 hours
- Need additional expertise
- Customer impact is severe
### Escalation Path
1. **On-call Engineer** → Team Lead
2. **Team Lead** → Engineering Manager
3. **Engineering Manager** → CTO/VP Engineering
4. **CTO** → Executive Team
### Emergency Contacts
- **On-call**: [Phone/Slack]
- **Team Lead**: [Phone/Slack]
- **Engineering Manager**: [Phone/Slack]
- **CTO**: [Phone/Slack]
## Communication
### Status Page Updates
- Update status page during incident
- Post updates every 30 minutes (P0/P1) or hourly (P2/P3)
- Include: Status, affected services, estimated resolution time
### Customer Communication
- For P0/P1: Notify affected customers immediately
- For P2/P3: Include in next status update
- Be transparent about impact and resolution timeline