Operations Runbook
This runbook provides operational procedures for Sankofa Phoenix.
Table of Contents
- Daily Operations
- Tenant Management
- Backup Procedures
- Incident Response
- Maintenance Windows
- Troubleshooting
- Security Audit
- Emergency Contacts
- References
Daily Operations
Health Checks
# Check all pods
kubectl get pods --all-namespaces
# Check API health
curl https://api.sankofa.nexus/health
# Check Keycloak health
curl https://keycloak.sankofa.nexus/health
# Check database connections
# (single quotes make $DATABASE_URL expand inside the pod, not in the local shell)
kubectl exec -n api deployment/api -- \
sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
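The endpoint checks above can be wrapped in a small script that exits non-zero when something is unhealthy, so it can run from cron or a CI job. This is a minimal sketch, assuming both health endpoints answer with a 2xx status when healthy (the runbook does not state this). The function names are illustrative.

```shell
# Return success only for 2xx HTTP status codes.
http_ok() {
  case "$1" in
    2??) return 0 ;;
    *)   return 1 ;;
  esac
}

# Print only the HTTP status code of a URL; curl prints 000 when the request fails.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Check every URL passed as an argument and report each result.
# Returns 1 if any endpoint is unhealthy.
check_endpoints() {
  failed=0
  for url in "$@"; do
    code=$(http_status "$url")
    if http_ok "$code"; then
      echo "OK   $url ($code)"
    else
      echo "FAIL $url ($code)"
      failed=1
    fi
  done
  return "$failed"
}
```

Example: `check_endpoints https://api.sankofa.nexus/health https://keycloak.sankofa.nexus/health`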
Monitoring Dashboard Review
- Review system overview dashboard
- Check error rates and latency
- Review billing anomalies
- Check security events
- Review Proxmox infrastructure status
Log Review
# Recent errors
kubectl logs -n api deployment/api --tail=100 | grep -i error
# Authentication failures
kubectl logs -n api deployment/api | grep -i "auth.*fail"
# Billing issues
kubectl logs -n api deployment/api | grep -i billing
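For a daily review, a count per pattern over a fixed time window is often easier to compare than raw grep output. The sketch below reuses the three patterns above. The one-hour window and the function names are assumptions.

```shell
# Count lines on stdin that match a case-insensitive extended regex.
# grep -c exits 1 when nothing matches, so "|| true" keeps the exit status clean.
count_matches() {
  grep -ciE "$1" || true
}

# Print one summary line per pattern for the last hour of API logs.
summarize_api_logs() {
  logs=$(kubectl logs -n api deployment/api --since=1h)
  for pattern in 'error' 'auth.*fail' 'billing'; do
    printf '%-12s %s\n' "$pattern" "$(printf '%s\n' "$logs" | count_matches "$pattern")"
  done
}
```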
Tenant Management
Create New Tenant
# Via GraphQL
mutation {
createTenant(input: {
name: "New Tenant"
domain: "tenant.example.com"
tier: STANDARD
}) {
id
name
status
}
}
# Or via API
curl -X POST https://api.sankofa.nexus/graphql \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"query": "mutation { createTenant(...) }"}'
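Embedding a GraphQL document in a JSON body by hand is error-prone because every double quote has to be escaped. The helper below is one possible way to build the body. It is a sketch, not the project's tooling: `graphql_payload` and `graphql_post` are hypothetical names, and `$TOKEN` is assumed to hold a valid bearer token.

```shell
# Wrap a GraphQL document in a JSON request body.
# Escapes backslashes and double quotes and flattens newlines to spaces;
# other control characters (tabs, for example) are not handled in this sketch.
graphql_payload() {
  escaped=$(printf '%s' "$1" | tr '\n' ' ' | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"query": "%s"}\n' "$escaped"
}

# POST a GraphQL document to the API endpoint used in this runbook.
graphql_post() {
  curl -sS -X POST https://api.sankofa.nexus/graphql \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(graphql_payload "$1")"
}

# Example (same mutation as above):
# graphql_post 'mutation { createTenant(input: { name: "New Tenant", domain: "tenant.example.com", tier: STANDARD }) { id name status } }'
```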
Suspend Tenant
# Update tenant status
mutation {
updateTenant(id: "tenant-id", input: { status: SUSPENDED }) {
id
status
}
}
Delete Tenant
# Soft delete (recommended)
mutation {
updateTenant(id: "tenant-id", input: { status: DELETED }) {
id
status
}
}
# Hard delete (requires confirmation)
# This will delete all tenant resources
Tenant Resource Quotas
# Check quota usage
query {
tenant(id: "tenant-id") {
quotaLimits {
compute { vcpu memory instances }
storage { total perInstance }
}
usage {
totalCost
byResource {
resourceId
cost
}
}
}
}
Backup Procedures
Database Backups
Automated Backups
Backups run daily at 2 AM UTC:
# Check backup job status
kubectl get cronjob -n api postgres-backup
# View recent backups
kubectl get pvc -n api | grep backup
Manual Backup
# Create backup (no -it: an allocated TTY can corrupt the redirected dump)
kubectl exec -n api deployment/postgres -- \
pg_dump -U sankofa sankofa > backup-$(date +%Y%m%d).sql
# Restore from backup
kubectl exec -i -n api deployment/postgres -- \
psql -U sankofa sankofa < backup-20240101.sql
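A dump that was cut short still produces a file, so it is worth checking the result before relying on it. Plain-format `pg_dump` output ends with a `PostgreSQL database dump complete` comment, which the sketch below looks for. The function names and the decision to gzip the file are assumptions.

```shell
# Succeed only if the file is non-empty and its tail contains
# pg_dump's completion marker.
verify_dump() {
  [ -s "$1" ] && tail -n 10 "$1" | grep -q 'PostgreSQL database dump complete'
}

# Dump to a dated file, verify it, then compress it.
backup_database() {
  out="backup-$(date +%Y%m%d).sql"
  kubectl exec -n api deployment/postgres -- \
    pg_dump -U sankofa sankofa > "$out" || return 1
  verify_dump "$out" || { echo "backup $out looks incomplete" >&2; return 1; }
  gzip -f "$out"
}
```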
Keycloak Backups
# Export realm configuration (no -it, so the redirected JSON is not altered by a TTY)
kubectl exec -n keycloak deployment/keycloak -- \
/opt/keycloak/bin/kcadm.sh get realms/master \
--realm master \
--server http://localhost:8080 \
--user admin \
--password $ADMIN_PASSWORD > keycloak-realm-$(date +%Y%m%d).json
Proxmox Backups
# Backup VM configuration
# Via Proxmox API or UI
# Store in version control or backup storage
Tenant-Specific Backups
# Export tenant data
query {
tenant(id: "tenant-id") {
id
name
resources {
id
name
type
}
}
}
# Backup tenant resources
# Use resource export API or database dump filtered by tenant_id
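The "database dump filtered by tenant_id" option could be done with psql's `\copy`. The sketch below builds the command for one table. The table name `resources` in the example is hypothetical, and the real schema may spread tenant data over several tables.

```shell
# Build a psql \copy command that exports one tenant's rows from a table as CSV.
# Both values are validated before being interpolated into SQL.
tenant_copy_command() {
  table=$1
  tenant_id=$2
  case "$table" in
    ''|*[!A-Za-z0-9_]*) echo "invalid table name" >&2; return 1 ;;
  esac
  case "$tenant_id" in
    ''|*[!A-Za-z0-9_-]*) echo "invalid tenant id" >&2; return 1 ;;
  esac
  printf '%s\n' "\\copy (SELECT * FROM $table WHERE tenant_id = '$tenant_id') TO STDOUT WITH CSV HEADER"
}

# Run the export inside the postgres pod and save it locally.
export_tenant_table() {
  cmd=$(tenant_copy_command "$1" "$2") || return 1
  kubectl exec -n api deployment/postgres -- \
    psql -U sankofa sankofa -c "$cmd" > "$2-$1.csv"
}

# Example (hypothetical table name):
# export_tenant_table resources tenant-id
```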
Incident Response
Incident Classification
- P0 - Critical: System down, data loss, security breach
- P1 - High: Major feature broken, performance degradation
- P2 - Medium: Minor feature broken, non-critical issues
- P3 - Low: Cosmetic issues, minor bugs
Incident Response Process
- Detection: Monitor alerts, user reports
- Triage: Classify severity, assign owner
- Containment: Isolate affected systems
- Investigation: Root cause analysis
- Resolution: Fix and verify
- Post-Mortem: Document and improve
Common Incidents
API Down
# Check pod status
kubectl get pods -n api
# Check logs
kubectl logs -n api deployment/api --tail=100
# Restart if needed
kubectl rollout restart deployment/api -n api
# Check database
kubectl exec -it -n api deployment/postgres -- \
psql -U sankofa -c "SELECT 1"
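`kubectl rollout restart` returns as soon as the restart is requested, not when the new pods are ready. During an incident it can help to wait for the rollout to finish before declaring recovery. A minimal sketch follows. The 120s default timeout is an assumption.

```shell
# Restart a deployment and block until the rollout finishes or times out.
# Usage: restart_and_wait <namespace> <deployment> [timeout]
restart_and_wait() {
  namespace=$1
  deployment=$2
  timeout=${3:-120s}   # assumed default; tune to the service's startup time
  kubectl rollout restart "deployment/$deployment" -n "$namespace" &&
    kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout="$timeout"
}

# Example:
# restart_and_wait api api
```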
Database Connection Issues
# Check connection pool
kubectl exec -it -n api deployment/api -- \
curl http://localhost:4000/metrics | grep db_connections
# Restart API to reset connections
kubectl rollout restart deployment/api -n api
# Check database load
kubectl exec -it -n api deployment/postgres -- \
psql -U sankofa -c "SELECT * FROM pg_stat_activity"
High Error Rate
# Check error logs
kubectl logs -n api deployment/api | grep -i error | tail -50
# Check recent deployments
kubectl rollout history deployment/api -n api
# Rollback if needed
kubectl rollout undo deployment/api -n api
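To make "rollback if needed" less of a judgment call, the share of error lines in recent logs can serve as a rough signal. The sketch below measures the percentage of log lines containing "error", which is not the same as the request error rate. The 20% threshold and 10-minute window are hypothetical.

```shell
# Print the integer percentage of stdin lines that contain "error" (any case).
error_percent() {
  awk 'tolower($0) ~ /error/ { e++ }
       END { if (NR == 0) print 0; else print int(e * 100 / NR) }'
}

# Roll back the API deployment when the share of error lines is too high.
rollback_if_noisy() {
  threshold=${1:-20}   # hypothetical threshold, in percent
  pct=$(kubectl logs -n api deployment/api --since=10m | error_percent)
  if [ "$pct" -ge "$threshold" ]; then
    echo "error lines at ${pct}% (threshold ${threshold}%), rolling back"
    kubectl rollout undo deployment/api -n api
  else
    echo "error lines at ${pct}%, below threshold ${threshold}%"
  fi
}
```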
Billing Anomaly
# Check billing metrics
curl "https://prometheus.sankofa.nexus/api/v1/query?query=sankofa_billing_cost_usd"
# Review recent usage records
query {
usage(tenantId: "tenant-id", timeRange: {...}) {
totalCost
byResource {
resourceId
cost
}
}
}
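# Check for resource leaks
# ("resources" is not a built-in kubectl type; this assumes a custom resource
# with that name is installed, otherwise substitute the relevant kinds)
kubectl get resources --all-namespaces | grep tenant-id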
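PromQL expressions with label selectors contain characters that must be URL-encoded. Letting curl do the encoding with `-G --data-urlencode` avoids hand-escaping. In the sketch below, the `tenant_id` label in the example is an assumption about how the metric is labelled.

```shell
# Run an instant PromQL query against the Prometheus HTTP API.
# -G sends the data as URL query parameters; --data-urlencode encodes the expression.
prom_query() {
  curl -sS -G https://prometheus.sankofa.nexus/api/v1/query \
    --data-urlencode "query=$1"
}

# Example (hypothetical label name):
# prom_query 'sankofa_billing_cost_usd{tenant_id="tenant-id"}'
```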
Maintenance Windows
Scheduled Maintenance
Maintenance windows are scheduled:
- Weekly: Sunday 2-4 AM UTC (low traffic)
- Monthly: First Sunday 2-6 AM UTC (major updates)
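Automation that performs disruptive work can guard itself by checking whether the current UTC time falls inside one of these windows. This is a minimal sketch, assuming "first Sunday" means the Sunday that falls on day 1 to 7 of the month and that each window's end hour is exclusive.

```shell
# Strip one leading zero so values such as 08 are never treated as octal.
strip_zero() {
  v=${1#0}
  echo "${v:-0}"
}

# Weekly window: Sunday (weekday 7), 02:00 up to but not including 04:00 UTC.
# Usage: in_weekly_window <weekday 1-7, Monday=1> <hour 00-23>
in_weekly_window() {
  hour=$(strip_zero "$2")
  [ "$1" -eq 7 ] && [ "$hour" -ge 2 ] && [ "$hour" -lt 4 ]
}

# Monthly window: first Sunday, 02:00 up to but not including 06:00 UTC.
# Usage: in_monthly_window <weekday> <day of month> <hour>
in_monthly_window() {
  day=$(strip_zero "$2")
  hour=$(strip_zero "$3")
  [ "$1" -eq 7 ] && [ "$day" -le 7 ] && [ "$hour" -ge 2 ] && [ "$hour" -lt 6 ]
}

# Check the current UTC time against both windows.
in_maintenance_window_now() {
  dow=$(date -u +%u)
  in_weekly_window "$dow" "$(date -u +%H)" ||
    in_monthly_window "$dow" "$(date -u +%d)" "$(date -u +%H)"
}
```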
Pre-Maintenance Checklist
- Notify all tenants (24h advance)
- Create backup of database
- Create backup of Keycloak
- Review recent changes
- Prepare rollback plan
- Set maintenance mode flag
Maintenance Mode
# Enable maintenance mode
kubectl set env deployment/api -n api MAINTENANCE_MODE=true
# Disable maintenance mode
kubectl set env deployment/api -n api MAINTENANCE_MODE=false
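Changing an environment variable on a deployment triggers a rolling update, so the flag only takes full effect once the new pods are ready. The wrapper below waits for that. It is a sketch; the function name and the 120s timeout are assumptions.

```shell
# Toggle maintenance mode and wait for the resulting rollout to complete.
# Usage: set_maintenance on|off
set_maintenance() {
  case "$1" in
    on)  value=true ;;
    off) value=false ;;
    *)   echo "usage: set_maintenance on|off" >&2; return 2 ;;
  esac
  kubectl set env deployment/api -n api "MAINTENANCE_MODE=$value" &&
    kubectl rollout status deployment/api -n api --timeout=120s
}
```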
Post-Maintenance Checklist
- Verify all services are up
- Run health checks
- Check error rates
- Verify backups completed
- Notify tenants of completion
- Update documentation
Troubleshooting
API Not Responding
# Check pod status
kubectl describe pod -n api -l app=api
# Check logs
kubectl logs -n api -l app=api --tail=100
# Check resource limits
kubectl top pod -n api
# Check network policies
kubectl get networkpolicies -n api
Database Performance Issues
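# Check slow queries (requires the pg_stat_statements extension;
# on PostgreSQL 13 and later the column is total_exec_time, not total_time)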
kubectl exec -it -n api deployment/postgres -- \
psql -U sankofa -c "SELECT * FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10"
# Check table sizes
kubectl exec -it -n api deployment/postgres -- \
psql -U sankofa -c "SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size FROM pg_tables ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10"
# Analyze tables
kubectl exec -it -n api deployment/postgres -- \
psql -U sankofa -c "ANALYZE"
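The `pg_stat_statements` timing column was renamed from `total_time` to `total_exec_time` in PostgreSQL 13, so a slow-query check that must work across versions can pick the column from `server_version_num`. This is a sketch with illustrative function names; it assumes the extension is installed.

```shell
# Map a server_version_num value (for example 130004) to the timing column name.
stat_time_column() {
  if [ "$1" -ge 130000 ]; then
    echo total_exec_time
  else
    echo total_time
  fi
}

# List the ten statements with the highest cumulative execution time.
slow_queries() {
  version=$(kubectl exec -n api deployment/postgres -- \
    psql -U sankofa -At -c "SHOW server_version_num") || return 1
  column=$(stat_time_column "$version")
  kubectl exec -n api deployment/postgres -- \
    psql -U sankofa -c "SELECT query, calls, $column FROM pg_stat_statements ORDER BY $column DESC LIMIT 10"
}
```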
Keycloak Issues
# Check Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100
# Check database connection
kubectl exec -it -n keycloak deployment/keycloak -- \
curl http://localhost:8080/health/ready
# Restart Keycloak
kubectl rollout restart deployment/keycloak -n keycloak
Proxmox Integration Issues
# Check Crossplane provider
kubectl get pods -n crossplane-system | grep proxmox
# Check provider logs
kubectl logs -n crossplane-system deployment/crossplane-provider-proxmox
# Test Proxmox connection
kubectl exec -it -n crossplane-system deployment/crossplane-provider-proxmox -- \
curl https://proxmox-endpoint:8006/api2/json/version
Security Audit
Monthly Security Review
- Review access logs
- Check for failed authentication attempts
- Review policy violations
- Check for unusual API usage
- Review incident response logs
- Update security documentation
Access Review
# List all users
query {
users {
id
email
role
lastLogin
}
}
# Review tenant access
query {
tenant(id: "tenant-id") {
users {
id
email
role
}
}
}
Emergency Contacts
- On-Call Engineer: (configure in PagerDuty/Opsgenie)
- Database Admin: (configure)
- Security Team: (configure)
- Management: (configure)
References
- Monitoring Guide: docs/MONITORING_GUIDE.md
- Deployment Guide: docs/DEPLOYMENT_GUIDE.md
- Keycloak Guide: docs/KEYCLOAK_DEPLOYMENT.md