Apply Composer changes: comprehensive API updates, migrations, middleware, and infrastructure improvements

- Add comprehensive database migrations (001-024) for schema evolution
- Enhance API schema with expanded type definitions and resolvers
- Add new middleware: audit logging, rate limiting, MFA enforcement, security, tenant auth
- Implement new services: AI optimization, billing, blockchain, compliance, marketplace
- Add adapter layer for cloud integrations (Cloudflare, Kubernetes, Proxmox, storage)
- Update Crossplane provider with enhanced VM management capabilities
- Add comprehensive test suite for API endpoints and services
- Update frontend components with improved GraphQL subscriptions and real-time updates
- Enhance security configurations and headers (CSP, CORS, etc.)
- Update documentation and configuration files
- Add new CI/CD workflows and validation scripts
- Implement design system improvements and UI enhancements
426
docs/OPERATIONS_RUNBOOK.md
Normal file
@@ -0,0 +1,426 @@
# Operations Runbook

This runbook provides operational procedures for Sankofa Phoenix.

## Table of Contents

1. [Daily Operations](#daily-operations)
2. [Tenant Management](#tenant-management)
3. [Backup Procedures](#backup-procedures)
4. [Incident Response](#incident-response)
5. [Maintenance Windows](#maintenance-windows)
6. [Troubleshooting](#troubleshooting)
7. [Security Audit](#security-audit)
8. [Emergency Contacts](#emergency-contacts)
9. [References](#references)

## Daily Operations

### Health Checks

```bash
# Check all pods
kubectl get pods --all-namespaces

# Check API health
curl https://api.sankofa.nexus/health

# Check Keycloak health
curl https://keycloak.sankofa.nexus/health

# Check database connections
# (single quotes so that $DATABASE_URL expands inside the pod, not in the local shell)
kubectl exec -it -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
```
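
The endpoint checks above can be wrapped so that each one yields a single pass/fail line. The following is a minimal sketch, not part of the platform tooling: the helper names `classify_http_status` and `probe`, the 5-second timeout, and the rule that only 2xx counts as healthy are all assumptions made for illustration.

```bash
# Map an HTTP status code to a health state.
# Assumption: only 2xx responses count as healthy.
classify_http_status() {
  case "$1" in
    2??) echo healthy ;;
    *)   echo unhealthy ;;
  esac
}

# Probe one URL and print "<state> <code> <url>".
# curl writes 000 as the status code when the connection itself fails;
# "|| true" keeps the function usable under "set -e".
probe() {
  code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$1") || true
  echo "$(classify_http_status "$code") $code $1"
}

# Example:
#   probe https://api.sankofa.nexus/health
#   probe https://keycloak.sankofa.nexus/health
```

Keeping the classification separate from the network call makes the pass/fail rule easy to change without touching the probing logic.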

### Monitoring Dashboard Review

1. Review system overview dashboard
2. Check error rates and latency
3. Review billing anomalies
4. Check security events
5. Review Proxmox infrastructure status

### Log Review

```bash
# Recent errors
kubectl logs -n api deployment/api --tail=100 | grep -i error

# Authentication failures
kubectl logs -n api deployment/api | grep -i "auth.*fail"

# Billing issues
kubectl logs -n api deployment/api | grep -i billing
```
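
For a daily review it can help to reduce the three searches above to one line of counts. This is a rough sketch under stated assumptions: `summarize_logs` is a hypothetical helper, and it simply reuses the same case-insensitive patterns as the `grep` commands.

```bash
# Read log lines on stdin and print one count per category.
# The patterns mirror the grep commands in the Log Review section.
summarize_logs() {
  awk '
    { line = tolower($0) }
    line ~ /error/      { errors++ }
    line ~ /auth.*fail/ { auth++ }
    line ~ /billing/    { billing++ }
    END { printf "errors=%d auth_failures=%d billing=%d\n", errors, auth, billing }
  '
}

# Example (last 24 hours of API logs):
#   kubectl logs -n api deployment/api --since=24h | summarize_logs
```

A line can count toward more than one category, for example a billing error, so the counts are not mutually exclusive.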

## Tenant Management

### Create New Tenant

```bash
# Via GraphQL
mutation {
  createTenant(input: {
    name: "New Tenant"
    domain: "tenant.example.com"
    tier: STANDARD
  }) {
    id
    name
    status
  }
}

# Or via API (curl -d sends form encoding by default, so set the JSON content type)
curl -X POST https://api.sankofa.nexus/graphql \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { createTenant(...) }"}'
```
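
When a GraphQL document is sent with `curl`, it has to be embedded in a JSON string, so its double quotes and line breaks need escaping. The sketch below shows one way to do that; `graphql_payload` is a hypothetical helper, and it assumes the document contains no string values with meaningful newlines or tabs, since it collapses them to spaces.

```bash
# Wrap a GraphQL document in a {"query": "..."} JSON body.
# Newlines and tabs are collapsed to spaces (GraphQL ignores whitespace
# between tokens), then backslashes and double quotes are escaped.
graphql_payload() {
  escaped=$(printf '%s' "$1" | tr '\n\t' '  ' | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"query":"%s"}\n' "$escaped"
}

# Example:
#   curl -X POST https://api.sankofa.nexus/graphql \
#     -H "Authorization: Bearer $TOKEN" \
#     -H "Content-Type: application/json" \
#     -d "$(graphql_payload 'query { tenant(id: "tenant-id") { id status } }')"
```

For anything beyond simple documents, a JSON-aware tool is a safer choice than hand escaping.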

### Suspend Tenant

```bash
# Update tenant status
mutation {
  updateTenant(id: "tenant-id", input: { status: SUSPENDED }) {
    id
    status
  }
}
```

### Delete Tenant

```bash
# Soft delete (recommended)
mutation {
  updateTenant(id: "tenant-id", input: { status: DELETED }) {
    id
    status
  }
}

# Hard delete (requires confirmation)
# This will delete all tenant resources
```
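
The runbook does not specify how a hard delete is confirmed. One common pattern, shown here only as a sketch, is to require the operator to retype the tenant ID before anything destructive runs. `confirm_hard_delete` is a hypothetical helper, and the actual hard-delete step is deliberately left out because it is not documented here.

```bash
# Succeed only when the operator retypes the tenant ID exactly.
# Hypothetical guard; it performs no deletion itself.
confirm_hard_delete() {
  tenant_id=$1
  typed=$2
  if [ -n "$tenant_id" ] && [ "$typed" = "$tenant_id" ]; then
    echo "confirmed: $tenant_id"
    return 0
  fi
  echo "aborted: confirmation did not match" >&2
  return 1
}

# Example (interactive):
#   printf 'Type the tenant ID to confirm: '; read -r answer
#   confirm_hard_delete "tenant-id" "$answer" || exit 1
```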

### Tenant Resource Quotas

```bash
# Check quota usage
query {
  tenant(id: "tenant-id") {
    quotaLimits {
      compute { vcpu memory instances }
      storage { total perInstance }
    }
    usage {
      totalCost
      byResource {
        resourceId
        cost
      }
    }
  }
}
```

## Backup Procedures

### Database Backups

#### Automated Backups

Backups run daily at 2 AM UTC:

```bash
# Check backup job status
kubectl get cronjob -n api postgres-backup

# View recent backups
kubectl get pvc -n api | grep backup
```

#### Manual Backup

```bash
# Create backup
# (no -t: a TTY can inject carriage returns into the redirected dump)
kubectl exec -n api deployment/postgres -- \
  pg_dump -U sankofa sankofa > backup-$(date +%Y%m%d).sql

# Restore from backup
kubectl exec -i -n api deployment/postgres -- \
  psql -U sankofa sankofa < backup-20240101.sql
```
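
A dump that was cut short still produces a file, so it is worth checking the result before relying on it. The sketch below relies on the comment markers that plain-text `pg_dump` output starts and ends with; `verify_backup` is a hypothetical helper, and the 10-line search windows are an assumption.

```bash
# Check that a plain-text pg_dump file is non-empty and carries both the
# opening and the closing marker comments written by pg_dump.
verify_backup() {
  file=$1
  [ -s "$file" ] || { echo "empty or missing: $file" >&2; return 1; }
  head -n 10 "$file" | grep -q 'PostgreSQL database dump' \
    || { echo "missing header: $file" >&2; return 1; }
  tail -n 10 "$file" | grep -q 'PostgreSQL database dump complete' \
    || { echo "missing trailer (truncated dump?): $file" >&2; return 1; }
  echo "ok: $file"
}

# Example:
#   verify_backup "backup-$(date +%Y%m%d).sql"
```

This only detects empty or truncated files. A periodic test restore into a scratch database remains the stronger check.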

### Keycloak Backups

```bash
# Export realm configuration
# (no -t: a TTY can corrupt the redirected JSON)
kubectl exec -n keycloak deployment/keycloak -- \
  /opt/keycloak/bin/kcadm.sh get realms/master \
  --realm master \
  --server http://localhost:8080 \
  --user admin \
  --password "$ADMIN_PASSWORD" > keycloak-realm-$(date +%Y%m%d).json
```

### Proxmox Backups

```bash
# Backup VM configuration
# Via Proxmox API or UI
# Store in version control or backup storage
```

### Tenant-Specific Backups

```bash
# Export tenant data
query {
  tenant(id: "tenant-id") {
    id
    name
    resources {
      id
      name
      type
    }
  }
}

# Backup tenant resources
# Use resource export API or database dump filtered by tenant_id
```

## Incident Response

### Incident Classification

- **P0 - Critical**: System down, data loss, security breach
- **P1 - High**: Major feature broken, performance degradation
- **P2 - Medium**: Minor feature broken, non-critical issues
- **P3 - Low**: Cosmetic issues, minor bugs

### Incident Response Process

1. **Detection**: Monitor alerts, user reports
2. **Triage**: Classify severity, assign owner
3. **Containment**: Isolate affected systems
4. **Investigation**: Root cause analysis
5. **Resolution**: Fix and verify
6. **Post-Mortem**: Document and improve

### Common Incidents

#### API Down

```bash
# Check pod status
kubectl get pods -n api

# Check logs
kubectl logs -n api deployment/api --tail=100

# Restart if needed
kubectl rollout restart deployment/api -n api

# Check database
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT 1"
```

#### Database Connection Issues

```bash
# Check connection pool
kubectl exec -it -n api deployment/api -- \
  curl http://localhost:4000/metrics | grep db_connections

# Restart API to reset connections
kubectl rollout restart deployment/api -n api

# Check database load
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT * FROM pg_stat_activity"
```

#### High Error Rate

```bash
# Check error logs
kubectl logs -n api deployment/api | grep -i error | tail -50

# Check recent deployments
kubectl rollout history deployment/api -n api

# Rollback if needed
kubectl rollout undo deployment/api -n api
```
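
Whether a rollback is "needed" is easier to judge with a number. The sketch below estimates the share of log lines that mention an error and compares it with a limit. Both helpers are hypothetical, the 5% limit in the example is an arbitrary assumption rather than a documented policy, and a line-based ratio is only a rough proxy for the request error rate shown on the dashboards.

```bash
# Print the percentage of stdin lines that contain "error" (case-insensitive).
error_rate() {
  awk '
    { total++; if (tolower($0) ~ /error/) errors++ }
    END { if (total == 0) print "0.0"; else printf "%.1f\n", 100 * errors / total }
  '
}

# Succeed when the first argument is numerically greater than the second.
exceeds_threshold() {
  awk -v rate="$1" -v limit="$2" 'BEGIN { exit !(rate + 0 > limit + 0) }'
}

# Example (assumed 5% limit):
#   rate=$(kubectl logs -n api deployment/api --tail=1000 | error_rate)
#   exceeds_threshold "$rate" 5 && echo "error rate ${rate}%: consider a rollback"
```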

#### Billing Anomaly

```bash
# Check billing metrics (quote the URL so the shell does not interpret "?")
curl 'https://prometheus.sankofa.nexus/api/v1/query?query=sankofa_billing_cost_usd'

# Review recent usage records
query {
  usage(tenantId: "tenant-id", timeRange: {...}) {
    totalCost
    byResource {
      resourceId
      cost
    }
  }
}

# Check for resource leaks
# ("resources" is not a built-in kubectl type; "all" covers core workload types,
# add custom resource kinds as needed)
kubectl get all --all-namespaces | grep tenant-id
```

## Maintenance Windows

### Scheduled Maintenance

Maintenance windows are scheduled:
- **Weekly**: Sunday 2-4 AM UTC (low traffic)
- **Monthly**: First Sunday 2-6 AM UTC (major updates)
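
Automation that should only run inside a window can test the schedule above directly. The following is a sketch under stated assumptions: `maintenance_window` is a hypothetical helper, the end hours are treated as exclusive, and on the first Sunday the longer monthly window is assumed to take the place of the weekly one.

```bash
# Print "monthly", "weekly", or "none" for a given UTC weekday, day of month, and hour.
#   $1: ISO weekday, 1 = Monday ... 7 = Sunday   (date -u +%u)
#   $2: day of month, 01-31                      (date -u +%d)
#   $3: hour, 00-23                              (date -u +%H)
maintenance_window() {
  dow=$1
  dom=${2#0}; dom=${dom:-0}     # strip a leading zero so the value is not read as octal
  hour=${3#0}; hour=${hour:-0}
  if [ "$dow" -ne 7 ]; then
    echo none
    return 0
  fi
  if [ "$dom" -le 7 ] && [ "$hour" -ge 2 ] && [ "$hour" -lt 6 ]; then
    echo monthly                # first Sunday, 02:00-06:00 UTC
  elif [ "$hour" -ge 2 ] && [ "$hour" -lt 4 ]; then
    echo weekly                 # any Sunday, 02:00-04:00 UTC
  else
    echo none
  fi
}

# Example:
#   maintenance_window "$(date -u +%u)" "$(date -u +%d)" "$(date -u +%H)"
```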

### Pre-Maintenance Checklist

- [ ] Notify all tenants (24h advance)
- [ ] Create backup of database
- [ ] Create backup of Keycloak
- [ ] Review recent changes
- [ ] Prepare rollback plan
- [ ] Set maintenance mode flag

### Maintenance Mode

```bash
# Enable maintenance mode
kubectl set env deployment/api -n api MAINTENANCE_MODE=true

# Disable maintenance mode
kubectl set env deployment/api -n api MAINTENANCE_MODE=false
```

### Post-Maintenance Checklist

- [ ] Verify all services are up
- [ ] Run health checks
- [ ] Check error rates
- [ ] Verify backups completed
- [ ] Notify tenants of completion
- [ ] Update documentation

## Troubleshooting

### API Not Responding

```bash
# Check pod status
kubectl describe pod -n api -l app=api

# Check logs
kubectl logs -n api -l app=api --tail=100

# Check resource usage (compare against the configured limits)
kubectl top pod -n api

# Check network policies
kubectl get networkpolicies -n api
```

### Database Performance Issues

```bash
# Check slow queries
# (requires the pg_stat_statements extension; the column is total_exec_time
# on PostgreSQL 13 and later, total_time on earlier versions)
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10"

# Check table sizes
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size FROM pg_tables ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10"

# Analyze tables
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "ANALYZE"
```

### Keycloak Issues

```bash
# Check Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100

# Check database connection
kubectl exec -it -n keycloak deployment/keycloak -- \
  curl http://localhost:8080/health/ready

# Restart Keycloak
kubectl rollout restart deployment/keycloak -n keycloak
```

### Proxmox Integration Issues

```bash
# Check Crossplane provider
kubectl get pods -n crossplane-system | grep proxmox

# Check provider logs
kubectl logs -n crossplane-system deployment/crossplane-provider-proxmox

# Test Proxmox connection
kubectl exec -it -n crossplane-system deployment/crossplane-provider-proxmox -- \
  curl https://proxmox-endpoint:8006/api2/json/version
```

## Security Audit

### Monthly Security Review

1. Review access logs
2. Check for failed authentication attempts
3. Review policy violations
4. Check for unusual API usage
5. Review incident response logs
6. Update security documentation

### Access Review

```bash
# List all users
query {
  users {
    id
    email
    role
    lastLogin
  }
}

# Review tenant access
query {
  tenant(id: "tenant-id") {
    users {
      id
      email
      role
    }
  }
}
```

## Emergency Contacts

- **On-Call Engineer**: (configure in PagerDuty/Opsgenie)
- **Database Admin**: (configure)
- **Security Team**: (configure)
- **Management**: (configure)

## References

- Monitoring Guide: `docs/MONITORING_GUIDE.md`
- Deployment Guide: `docs/DEPLOYMENT_GUIDE.md`
- Keycloak Guide: `docs/KEYCLOAK_DEPLOYMENT.md`