# Incident Response Runbook
## Overview
This runbook provides step-by-step procedures for responding to incidents in the Sankofa Phoenix platform.
## Incident Severity Levels
### P0 - Critical (Immediate Response)
- Complete service outage
- Data loss or corruption
- Security breach
- **Response Time**: Immediate (< 5 minutes)
- **Resolution Target**: < 1 hour
### P1 - High (Urgent Response)
- Partial service outage affecting multiple users
- Performance degradation > 50%
- Authentication failures
- **Response Time**: < 15 minutes
- **Resolution Target**: < 4 hours
### P2 - Medium (Standard Response)
- Single feature/service degraded
- Performance degradation 20-50%
- Non-critical errors
- **Response Time**: < 1 hour
- **Resolution Target**: < 24 hours
### P3 - Low (Normal Response)
- Minor issues
- Cosmetic problems
- Non-blocking errors
- **Response Time**: < 4 hours
- **Resolution Target**: < 1 week
## Incident Response Process
### 1. Detection and Triage
#### Detection Sources
- **Monitoring Alerts**: Prometheus/Alertmanager
- **Error Logs**: Loki, application logs
- **User Reports**: Support tickets, status page
- **Health Checks**: Automated health check failures
#### Initial Triage Steps
```bash
# 1. Check service health
kubectl get pods --all-namespaces | grep -v Running
# 2. Check API health
curl -f https://api.sankofa.nexus/health || echo "API DOWN"
# 3. Check portal health
curl -f https://portal.sankofa.nexus/api/health || echo "PORTAL DOWN"
# 4. Check database connectivity
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT 1" || echo "DB CONNECTION FAILED"
# 5. Check Keycloak
curl -f https://keycloak.sankofa.nexus/health || echo "KEYCLOAK DOWN"
```
### 2. Incident Declaration
#### Create Incident Channel
- Create dedicated Slack/Teams channel: `#incident-YYYY-MM-DD-<name>`
- Invite: On-call engineer, Team lead, Product owner
- Post initial status
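Channel naming can be scripted so the date format stays consistent across incidents; a small sketch (the slug argument and the Slack webhook are placeholders):

```bash
# Build the incident channel name from today's date and a short slug.
incident_channel() {
  printf '#incident-%s-%s\n' "$(date +%Y-%m-%d)" "$1"
}

# Usage:
#   incident_channel api-outage   # e.g. #incident-2025-12-12-api-outage
# Posting the initial status can then be automated, e.g. via a Slack
# incoming webhook (SLACK_WEBHOOK_URL is a placeholder):
#   curl -X POST -H 'Content-type: application/json' \
#     --data "{\"text\":\"Declared $(incident_channel api-outage)\"}" "$SLACK_WEBHOOK_URL"
```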
#### Incident Template
```
INCIDENT: [Brief Description]
SEVERITY: P0/P1/P2/P3
STATUS: Investigating/Identified/Monitoring/Resolved
START TIME: [Timestamp]
AFFECTED SERVICES: [List]
IMPACT: [User impact description]
```
### 3. Investigation
#### Common Investigation Commands
**Check Pod Status**
```bash
kubectl get pods --all-namespaces -o wide
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100
```
**Check Resource Usage**
```bash
kubectl top nodes
kubectl top pods --all-namespaces
```
**Check Database**
```bash
# Connection count
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"
# Long-running queries
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
```
**Check Logs**
```bash
# Recent errors
kubectl logs -n api deployment/api --tail=500 | grep -i error
# Authentication failures
kubectl logs -n api deployment/api | grep -i "auth.*fail"
# Rate limiting
kubectl logs -n api deployment/api | grep -i "rate limit"
```
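When the error volume is high, grouping identical lines makes the dominant failure stand out; a small helper you can pipe any of the log commands above into (a sketch):

```bash
# Rank distinct error lines by frequency; optional argument = how many to show.
top_errors() {
  grep -i 'error' | sort | uniq -c | sort -rn | head -n "${1:-5}"
}

# Usage: kubectl logs -n api deployment/api --tail=500 | top_errors 10
```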
**Check Monitoring**
```bash
# Access Grafana
open https://grafana.sankofa.nexus
# Check Prometheus alerts
kubectl get prometheusrules -n monitoring
```
### 4. Resolution
#### Common Resolution Actions
**Restart Service**
```bash
kubectl rollout restart deployment/api -n api
kubectl rollout restart deployment/portal -n portal
```
**Scale Up**
```bash
kubectl scale deployment/api --replicas=5 -n api
```
**Rollback Deployment**
```bash
# See ROLLBACK_PLAN.md for detailed procedures
kubectl rollout undo deployment/api -n api
```
**Clear Rate Limits** (if needed)
```bash
# Delete rate-limit counters in Redis (deployment name and key prefix
# are deployment-specific; verify both before running)
kubectl exec -it -n api deployment/redis -- \
sh -c "redis-cli --scan --pattern 'ratelimit:*' | xargs -r redis-cli DEL"
# Or restart the rate limit service
kubectl rollout restart deployment/rate-limit -n api
```
**Database Maintenance**
```bash
# Vacuum database
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "VACUUM ANALYZE;"
# Kill long-running queries
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '10 minutes';"
```
### 5. Post-Incident
#### Incident Report Template
```markdown
# Incident Report: [Date] - [Title]
## Summary
[Brief description of incident]
## Timeline
- [Time] - Incident detected
- [Time] - Investigation started
- [Time] - Root cause identified
- [Time] - Resolution implemented
- [Time] - Service restored
## Root Cause
[Detailed root cause analysis]
## Impact
- **Users Affected**: [Number]
- **Duration**: [Time]
- **Services Affected**: [List]
## Resolution
[Steps taken to resolve]
## Prevention
- [ ] Action item 1
- [ ] Action item 2
- [ ] Action item 3
## Follow-up
- [ ] Update monitoring/alerts
- [ ] Update runbooks
- [ ] Code changes needed
- [ ] Documentation updates
```
## Common Incidents
### API High Latency
**Symptoms**: API response times > 500ms
**Investigation**:
```bash
# Check database query performance
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;"
# Check API metrics
curl https://api.sankofa.nexus/metrics | grep http_request_duration
```
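Mean latency can be estimated from the histogram counters in that metrics output; a sketch, assuming the default `http_request_duration_seconds` metric name used by common Prometheus client libraries:

```bash
# Compute mean request latency (seconds) from Prometheus text-format metrics.
avg_latency() {
  awk '/^http_request_duration_seconds_sum/   { s += $2 }
       /^http_request_duration_seconds_count/ { n += $2 }
       END { if (n > 0) printf "%.3f\n", s / n }'
}

# Usage: curl -s https://api.sankofa.nexus/metrics | avg_latency
```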
**Resolution**:
- Scale API replicas
- Optimize slow queries
- Add database indexes
- Check for N+1 query problems
### Database Connection Pool Exhausted
**Symptoms**: "too many connections" errors
**Investigation**:
```bash
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"
```
**Resolution**:
- Increase connection pool size
- Kill idle connections
- Scale database
- Check for connection leaks
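The idle-connection cleanup mirrors the long-query termination in the Resolution section; a sketch (tune the interval to match your pool's idle timeout):

```bash
# Terminate connections idle for more than 10 minutes
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"
```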
### Authentication Failures
**Symptoms**: Users cannot log in
**Investigation**:
```bash
# Check Keycloak
curl https://keycloak.sankofa.nexus/health
kubectl logs -n keycloak deployment/keycloak --tail=100
# Check API auth logs
kubectl logs -n api deployment/api | grep -i "auth.*fail"
```
**Resolution**:
- Restart Keycloak if needed
- Check OIDC configuration
- Verify JWT secret
- Check network connectivity
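Verifying a JWT usually starts with inspecting a failing token's claims; a helper that decodes the payload segment locally (a sketch; pass in whatever token you copied from a failing request):

```bash
# Decode the payload (second segment) of a JWT to inspect claims like exp/iss.
decode_jwt_payload() {
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the base64 padding that JWT encoding strips
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Usage: decode_jwt_payload "$TOKEN"   # then compare exp against date +%s
```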
### Portal Not Loading
**Symptoms**: Portal returns 500 or blank page
**Investigation**:
```bash
# Check portal pods
kubectl get pods -n portal
kubectl logs -n portal deployment/portal --tail=100
# Check portal health
curl https://portal.sankofa.nexus/api/health
```
**Resolution**:
- Restart portal deployment
- Check environment variables
- Verify Keycloak connectivity
- Check build errors
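The environment-variable check can be made mechanical with a small filter that flags missing keys in an env dump (a sketch; the variable names in the usage line are hypothetical, substitute the portal's actual settings):

```bash
# Report which of the given variable names are absent from an env dump on stdin.
check_env() {
  dump=$(cat)
  missing=""
  for k in "$@"; do
    printf '%s\n' "$dump" | grep -q "^${k}=" || missing="$missing $k"
  done
  if [ -n "$missing" ]; then echo "MISSING:$missing"; else echo "OK"; fi
}

# Usage:
#   kubectl exec -n portal deployment/portal -- env | check_env KEYCLOAK_URL DATABASE_URL
```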
## Escalation
### When to Escalate
- P0 incident not resolved in 30 minutes
- P1 incident not resolved in 2 hours
- Need additional expertise
- Customer impact is severe
### Escalation Path
1. **On-call Engineer** → Team Lead
2. **Team Lead** → Engineering Manager
3. **Engineering Manager** → CTO/VP Engineering
4. **CTO** → Executive Team
### Emergency Contacts
- **On-call**: [Phone/Slack]
- **Team Lead**: [Phone/Slack]
- **Engineering Manager**: [Phone/Slack]
- **CTO**: [Phone/Slack]
## Communication
### Status Page Updates
- Update status page during incident
- Post updates every 30 minutes (P0/P1) or hourly (P2/P3)
- Include: Status, affected services, estimated resolution time
### Customer Communication
- For P0/P1: Notify affected customers immediately
- For P2/P3: Include in next status update
- Be transparent about impact and resolution timeline