# Incident Response Runbook
## Overview

This runbook provides step-by-step procedures for responding to incidents in the Sankofa Phoenix platform.

## Incident Severity Levels

### P0 - Critical (Immediate Response)

- Complete service outage
- Data loss or corruption
- Security breach
- **Response Time**: Immediate (< 5 minutes)
- **Resolution Target**: < 1 hour

### P1 - High (Urgent Response)

- Partial service outage affecting multiple users
- Performance degradation > 50%
- Authentication failures
- **Response Time**: < 15 minutes
- **Resolution Target**: < 4 hours

### P2 - Medium (Standard Response)

- Single feature/service degraded
- Performance degradation 20-50%
- Non-critical errors
- **Response Time**: < 1 hour
- **Resolution Target**: < 24 hours

### P3 - Low (Normal Response)

- Minor issues
- Cosmetic problems
- Non-blocking errors
- **Response Time**: < 4 hours
- **Resolution Target**: < 1 week

## Incident Response Process

### 1. Detection and Triage

#### Detection Sources

- **Monitoring Alerts**: Prometheus/Alertmanager
- **Error Logs**: Loki, application logs
- **User Reports**: Support tickets, status page
- **Health Checks**: Automated health check failures

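Alerting on health checks typically hinges on a rule like the following sketch, assuming the blackbox exporter probes the health endpoints; the `job` label, group name, and thresholds are illustrative, not taken from this cluster's configuration:

```yaml
# Hypothetical PrometheusRule fragment: page at P0 when the API health probe fails.
groups:
  - name: availability
    rules:
      - alert: APIHealthCheckFailing
        expr: probe_success{job="blackbox", instance="https://api.sankofa.nexus/health"} == 0
        for: 2m
        labels:
          severity: P0
        annotations:
          summary: "API health endpoint has been unreachable for 2 minutes"
```
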
#### Initial Triage Steps

```bash
# 1. Check service health
kubectl get pods --all-namespaces | grep -v Running

# 2. Check API health
curl -f https://api.sankofa.nexus/health || echo "API DOWN"

# 3. Check portal health
curl -f https://portal.sankofa.nexus/api/health || echo "PORTAL DOWN"

# 4. Check database connectivity
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT 1" || echo "DB CONNECTION FAILED"

# 5. Check Keycloak
curl -f https://keycloak.sankofa.nexus/health || echo "KEYCLOAK DOWN"
```

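The steps above can also be wrapped in a small helper so a single run prints a pass/fail summary. A sketch, assuming bash; the `check` function is hypothetical, and the commented invocations at the bottom mirror the health checks above:

```shell
#!/usr/bin/env bash
# Run each triage step quietly and collect failing check names into a summary.
set -u
FAILURES=()

check() {
  # check <name> <command...>: print OK/FAIL and record the name on failure
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name"
    FAILURES+=("$name")
  fi
}

# Example usage, mirroring steps 2, 3, and 5 above:
# check api      curl -fsS https://api.sankofa.nexus/health
# check portal   curl -fsS https://portal.sankofa.nexus/api/health
# check keycloak curl -fsS https://keycloak.sankofa.nexus/health

echo "${#FAILURES[@]} check(s) failing"
```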
### 2. Incident Declaration

#### Create Incident Channel

- Create dedicated Slack/Teams channel: `#incident-YYYY-MM-DD-<name>`
- Invite: On-call engineer, Team lead, Product owner
- Post initial status

#### Incident Template

```
INCIDENT: [Brief Description]
SEVERITY: P0/P1/P2/P3
STATUS: Investigating/Identified/Monitoring/Resolved
START TIME: [Timestamp]
AFFECTED SERVICES: [List]
IMPACT: [User impact description]
```

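Once filled in, the initial status can be posted to the channel from a terminal. A minimal sketch using a Slack incoming webhook; `$SLACK_WEBHOOK_URL` is an assumption and must hold the channel's webhook URL:

```shell
# Build the initial status message; review it, then post it to the channel.
payload='{"text": "INCIDENT: [Brief Description]\nSEVERITY: P1\nSTATUS: Investigating\nSTART TIME: [Timestamp]\nAFFECTED SERVICES: [List]\nIMPACT: [User impact description]"}'
echo "$payload"

# Uncomment to send (requires the channel webhook in $SLACK_WEBHOOK_URL):
# curl -X POST -H 'Content-Type: application/json' -d "$payload" "$SLACK_WEBHOOK_URL"
```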
### 3. Investigation

#### Common Investigation Commands

**Check Pod Status**

```bash
kubectl get pods --all-namespaces -o wide
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100
```

**Check Resource Usage**

```bash
kubectl top nodes
kubectl top pods --all-namespaces
```

**Check Database**

```bash
# Connection count
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"

# Long-running queries
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
```

**Check Logs**

```bash
# Recent errors
kubectl logs -n api deployment/api --tail=500 | grep -i error

# Authentication failures
kubectl logs -n api deployment/api | grep -i "auth.*fail"

# Rate limiting
kubectl logs -n api deployment/api | grep -i "rate limit"
```

**Check Monitoring**

```bash
# Access Grafana
open https://grafana.sankofa.nexus

# Check Prometheus alerts
kubectl get prometheusrules -n monitoring
```

### 4. Resolution

#### Common Resolution Actions

**Restart Service**

```bash
kubectl rollout restart deployment/api -n api
kubectl rollout restart deployment/portal -n portal
```

**Scale Up**

```bash
kubectl scale deployment/api --replicas=5 -n api
```

**Rollback Deployment**

```bash
# See ROLLBACK_PLAN.md for detailed procedures
kubectl rollout undo deployment/api -n api
```

**Clear Rate Limits** (if needed)

```bash
# Access Redis/rate limit store and clear keys
# Or restart rate limit service
kubectl rollout restart deployment/rate-limit -n api
```

**Database Maintenance**

```bash
# Vacuum database
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "VACUUM ANALYZE;"

# Kill long-running queries
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '10 minutes';"
```

### 5. Post-Incident

#### Incident Report Template

```markdown
# Incident Report: [Date] - [Title]

## Summary
[Brief description of incident]

## Timeline
- [Time] - Incident detected
- [Time] - Investigation started
- [Time] - Root cause identified
- [Time] - Resolution implemented
- [Time] - Service restored

## Root Cause
[Detailed root cause analysis]

## Impact
- **Users Affected**: [Number]
- **Duration**: [Time]
- **Services Affected**: [List]

## Resolution
[Steps taken to resolve]

## Prevention
- [ ] Action item 1
- [ ] Action item 2
- [ ] Action item 3

## Follow-up
- [ ] Update monitoring/alerts
- [ ] Update runbooks
- [ ] Code changes needed
- [ ] Documentation updates
```

## Common Incidents

### API High Latency

**Symptoms**: API response times > 500ms

**Investigation**:

```bash
# Check database query performance
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;"

# Check API metrics
curl https://api.sankofa.nexus/metrics | grep http_request_duration
```

**Resolution**:

- Scale API replicas
- Optimize slow queries
- Add database indexes
- Check for N+1 query problems
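
For the first resolution step, scaling can also be automated with a HorizontalPodAutoscaler instead of a fixed replica count. A sketch; the replica bounds and CPU threshold are illustrative, not tuned for this platform:

```bash
# Autoscale the API between 3 and 10 replicas, targeting 70% CPU
kubectl autoscale deployment/api -n api --min=3 --max=10 --cpu-percent=70

# Inspect the resulting HPA
kubectl get hpa -n api
```
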
### Database Connection Pool Exhausted

**Symptoms**: "too many connections" errors

**Investigation**:

```bash
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"
```

**Resolution**:

- Increase connection pool size
- Kill idle connections
- Scale database
- Check for connection leaks
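
The "Kill idle connections" step can use `pg_terminate_backend` scoped to long-idle sessions, mirroring the long-query command under Database Maintenance. A sketch; the 10-minute interval is an assumption to tune against your pool's idle timeout:

```bash
# Terminate client backends idle for more than 10 minutes,
# skipping this session's own backend
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND now() - state_change > interval '10 minutes' AND pid <> pg_backend_pid();"
```
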
### Authentication Failures

**Symptoms**: Users cannot log in

**Investigation**:

```bash
# Check Keycloak
curl https://keycloak.sankofa.nexus/health
kubectl logs -n keycloak deployment/keycloak --tail=100

# Check API auth logs
kubectl logs -n api deployment/api | grep -i "auth.*fail"
```

**Resolution**:

- Restart Keycloak if needed
- Check OIDC configuration
- Verify JWT secret
- Check network connectivity

### Portal Not Loading

**Symptoms**: Portal returns 500 or blank page

**Investigation**:

```bash
# Check portal pods
kubectl get pods -n portal
kubectl logs -n portal deployment/portal --tail=100

# Check portal health
curl https://portal.sankofa.nexus/api/health
```

**Resolution**:

- Restart portal deployment
- Check environment variables
- Verify Keycloak connectivity
- Check build errors

## Escalation

### When to Escalate

- P0 incident not resolved in 30 minutes
- P1 incident not resolved in 2 hours
- Need additional expertise
- Customer impact is severe

### Escalation Path

1. **On-call Engineer** → Team Lead
2. **Team Lead** → Engineering Manager
3. **Engineering Manager** → CTO/VP Engineering
4. **CTO** → Executive Team

### Emergency Contacts

- **On-call**: [Phone/Slack]
- **Team Lead**: [Phone/Slack]
- **Engineering Manager**: [Phone/Slack]
- **CTO**: [Phone/Slack]

## Communication

### Status Page Updates

- Update status page during incident
- Post updates every 30 minutes (P0/P1) or hourly (P2/P3)
- Include: Status, affected services, estimated resolution time

### Customer Communication

- For P0/P1: Notify affected customers immediately
- For P2/P3: Include in next status update
- Be transparent about impact and resolution timeline
|