# Operations Runbook

This runbook provides operational procedures for Sankofa Phoenix.

## Table of Contents

1. [Daily Operations](#daily-operations)
2. [Tenant Management](#tenant-management)
3. [Backup Procedures](#backup-procedures)
4. [Incident Response](#incident-response)
5. [Maintenance Windows](#maintenance-windows)
6. [Troubleshooting](#troubleshooting)
7. [Security Audit](#security-audit)
8. [Emergency Contacts](#emergency-contacts)
9. [References](#references)

## Daily Operations

### Health Checks

```bash
# Check all pods
kubectl get pods --all-namespaces

# Check API health
curl https://api.sankofa.nexus/health

# Check Keycloak health
curl https://keycloak.sankofa.nexus/health

# Check database connections
# (sh -c makes $DATABASE_URL expand inside the pod, not in the local shell)
kubectl exec -it -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
```

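The endpoint checks above can be wrapped in a loop that reports every failing URL instead of stopping at the first one. The sketch below is illustrative only: `is_healthy_code` and `check_endpoints` are hypothetical helper names, and it assumes any 2xx response counts as healthy.

```bash
# Hypothetical helper: succeed only for 2xx HTTP status codes.
is_healthy_code() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Probe each URL passed as an argument, print one line per endpoint,
# and return non-zero if any endpoint is unhealthy.
check_endpoints() {
  failed=0
  for url in "$@"; do
    # -s hides progress, -o discards the body, -w prints the status code.
    # curl prints 000 when the connection itself fails.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
    if is_healthy_code "$code"; then
      echo "OK   $code $url"
    else
      echo "FAIL $code $url"
      failed=1
    fi
  done
  return "$failed"
}
```

For example, `check_endpoints https://api.sankofa.nexus/health https://keycloak.sankofa.nexus/health` prints one status line per endpoint and exits non-zero if either check fails, which makes it usable from a cron job or CI step.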
### Monitoring Dashboard Review

1. Review system overview dashboard
2. Check error rates and latency
3. Review billing anomalies
4. Check security events
5. Review Proxmox infrastructure status

### Log Review

```bash
# Recent errors
kubectl logs -n api deployment/api --tail=100 | grep -i error

# Authentication failures
kubectl logs -n api deployment/api | grep -i "auth.*fail"

# Billing issues
kubectl logs -n api deployment/api | grep -i billing
```

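Grepping the full log history can be slow and noisy. A rough sketch of a bounded summary follows; it uses `kubectl logs --since` to limit the window to one hour. `count_matches` and `summarize_api_logs` are hypothetical helper names, and the one-hour window is an assumption to adjust as needed.

```bash
# Hypothetical helper: count stdin lines matching a case-insensitive
# extended regular expression. Prints 0 when nothing matches.
count_matches() {
  grep -Eic -- "$1" || true
}

# Summarize the last hour of API logs using the same patterns as above.
summarize_api_logs() {
  logs=$(kubectl logs -n api deployment/api --since=1h)
  echo "errors: $(printf '%s\n' "$logs" | count_matches 'error')"
  echo "auth failures: $(printf '%s\n' "$logs" | count_matches 'auth.*fail')"
  echo "billing: $(printf '%s\n' "$logs" | count_matches 'billing')"
}
```

The counts give a quick signal during the daily review; a non-zero figure is a prompt to read the matching lines, not a diagnosis in itself.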
## Tenant Management

### Create New Tenant

```graphql
# Via GraphQL
mutation {
  createTenant(input: {
    name: "New Tenant"
    domain: "tenant.example.com"
    tier: STANDARD
  }) {
    id
    name
    status
  }
}
```

```bash
# Or via API
curl -X POST https://api.sankofa.nexus/graphql \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { createTenant(...) }"}'
```

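Hand-writing the JSON body for the API call is error-prone because the mutation has to be escaped inside a JSON string. One possible approach, sketched below, passes the tenant fields as GraphQL variables instead. The helper names are hypothetical, and the input type name `CreateTenantInput` is an assumption that must be checked against the actual schema.

```bash
# Escape backslashes and double quotes so a value can sit inside a JSON string.
# Sketch only: control characters such as newlines are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body for createTenant using GraphQL variables.
# ASSUMPTION: the input type is named CreateTenantInput; verify in the schema.
build_create_tenant_payload() {
  name=$(json_escape "$1")
  domain=$(json_escape "$2")
  tier=$(json_escape "$3")
  printf '%s' '{"query":"mutation ($input: CreateTenantInput!) { createTenant(input: $input) { id name status } }",'
  printf '"variables":{"input":{"name":"%s","domain":"%s","tier":"%s"}}}\n' \
    "$name" "$domain" "$tier"
}

# Send the mutation; expects TOKEN to hold a valid bearer token.
create_tenant() {
  build_create_tenant_payload "$@" |
    curl -sS -X POST https://api.sankofa.nexus/graphql \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      --data-binary @-
}
```

A call such as `create_tenant "New Tenant" tenant.example.com STANDARD` would then send the same mutation as the GraphQL example, with the values carried in the `variables` object rather than spliced into the query text.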
### Suspend Tenant

```graphql
# Update tenant status
mutation {
  updateTenant(id: "tenant-id", input: { status: SUSPENDED }) {
    id
    status
  }
}
```

### Delete Tenant

```graphql
# Soft delete (recommended)
mutation {
  updateTenant(id: "tenant-id", input: { status: DELETED }) {
    id
    status
  }
}

# Hard delete (requires confirmation)
# This will delete all tenant resources
```

### Tenant Resource Quotas

```graphql
# Check quota usage
query {
  tenant(id: "tenant-id") {
    quotaLimits {
      compute { vcpu memory instances }
      storage { total perInstance }
    }
    usage {
      totalCost
      byResource {
        resourceId
        cost
      }
    }
  }
}
```

## Backup Procedures

### Database Backups

#### Automated Backups

Backups run daily at 2 AM UTC:

```bash
# Check backup job status
kubectl get cronjob -n api postgres-backup

# View recent backups
kubectl get pvc -n api | grep backup
```

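Listing the CronJob shows that it exists, not that it succeeded recently. A minimal freshness check, sketched below, reads the CronJob's `status.lastSuccessfulTime` field. The helper names are hypothetical, the 26-hour threshold (daily schedule plus a two-hour grace period) is an assumption, and the timestamp conversion relies on GNU `date -d`.

```bash
# Pure helper: succeed when the last success is older than max_age seconds.
is_stale() {
  last_epoch=$1
  now_epoch=$2
  max_age=$3
  [ $((now_epoch - last_epoch)) -gt "$max_age" ]
}

# Report whether the postgres-backup CronJob has succeeded recently.
check_backup_freshness() {
  ts=$(kubectl get cronjob postgres-backup -n api \
    -o jsonpath='{.status.lastSuccessfulTime}')
  if [ -z "$ts" ]; then
    echo "postgres-backup has no recorded successful run"
    return 1
  fi
  last=$(date -u -d "$ts" +%s)   # GNU date syntax
  now=$(date -u +%s)
  # ASSUMPTION: daily schedule plus a 2-hour grace period.
  if is_stale "$last" "$now" $((26 * 3600)); then
    echo "STALE: last successful backup at $ts"
    return 1
  fi
  echo "OK: last successful backup at $ts"
}
```

Running `check_backup_freshness` as part of the daily review catches a silently failing job before the backup gap grows.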
#### Manual Backup

```bash
# Create backup (no -t: a TTY can corrupt the redirected dump)
kubectl exec -n api deployment/postgres -- \
  pg_dump -U sankofa sankofa > backup-$(date +%Y%m%d).sql

# Restore from backup
kubectl exec -i -n api deployment/postgres -- \
  psql -U sankofa sankofa < backup-20240101.sql
```

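A dump that was cut short by a dropped connection still leaves a file on disk, so it is worth checking before relying on it. The sketch below assumes the plain-text format that `pg_dump` writes by default, which ends with a `PostgreSQL database dump complete` comment. `verify_dump` is a hypothetical helper name.

```bash
# Succeed only if the file is non-empty and carries pg_dump's completion marker.
# Applies to plain-text dumps only (the pg_dump default).
verify_dump() {
  dump_file=$1
  if [ ! -s "$dump_file" ]; then
    echo "empty or missing: $dump_file"
    return 1
  fi
  if ! tail -n 20 "$dump_file" | grep -q 'PostgreSQL database dump complete'; then
    echo "truncated (no completion marker): $dump_file"
    return 1
  fi
  echo "looks complete: $dump_file"
}
```

For example, `verify_dump "backup-$(date +%Y%m%d).sql"` can run immediately after the manual backup. This only detects truncation; a test restore into a scratch database remains the stronger check.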
### Keycloak Backups

```bash
# Export realm configuration
# (--no-config lets kcadm.sh take credentials inline without a prior login;
#  no -t, so the redirected JSON is not altered by a TTY)
kubectl exec -n keycloak deployment/keycloak -- \
  /opt/keycloak/bin/kcadm.sh get realms/master \
  --no-config \
  --realm master \
  --server http://localhost:8080 \
  --user admin \
  --password "$ADMIN_PASSWORD" > keycloak-realm-$(date +%Y%m%d).json
```

### Proxmox Backups

```bash
# Backup VM configuration
# Via Proxmox API or UI
# Store in version control or backup storage
```

### Tenant-Specific Backups

```graphql
# Export tenant data
query {
  tenant(id: "tenant-id") {
    id
    name
    resources {
      id
      name
      type
    }
  }
}

# Backup tenant resources
# Use resource export API or database dump filtered by tenant_id
```

## Incident Response

### Incident Classification

- **P0 - Critical**: System down, data loss, security breach
- **P1 - High**: Major feature broken, performance degradation
- **P2 - Medium**: Minor feature broken, non-critical issues
- **P3 - Low**: Cosmetic issues, minor bugs

### Incident Response Process

1. **Detection**: Monitor alerts, user reports
2. **Triage**: Classify severity, assign owner
3. **Containment**: Isolate affected systems
4. **Investigation**: Root cause analysis
5. **Resolution**: Fix and verify
6. **Post-Mortem**: Document and improve

### Common Incidents

#### API Down

```bash
# Check pod status
kubectl get pods -n api

# Check logs
kubectl logs -n api deployment/api --tail=100

# Restart if needed
kubectl rollout restart deployment/api -n api

# Check database
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT 1"
```

#### Database Connection Issues

```bash
# Check connection pool
kubectl exec -it -n api deployment/api -- \
  curl http://localhost:4000/metrics | grep db_connections

# Restart API to reset connections
kubectl rollout restart deployment/api -n api

# Check database load
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT * FROM pg_stat_activity"
```

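The full `pg_stat_activity` output is wide and hard to scan during an incident. A rough sketch of a grouped view follows; it counts sessions per `state`, which makes leaked `idle in transaction` sessions easy to spot. Both function names are hypothetical, and labelling rows with a NULL state as `background` is a choice made for this example.

```bash
# Print "state|count" lines, one per connection state.
# -A gives unaligned output (fields separated by "|"), -t drops headers.
db_connections_by_state() {
  kubectl exec -i -n api deployment/postgres -- \
    psql -U sankofa -At -c \
    "SELECT coalesce(state, 'background'), count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC"
}

# Read "state|count" lines on stdin and print the count for one state
# (0 if the state is absent).
count_for_state() {
  awk -F'|' -v want="$1" '$1 == want { n = $2 } END { print n + 0 }'
}
```

For example, `db_connections_by_state | count_for_state 'idle in transaction'` prints a single number that can be compared against the pool size before deciding whether to restart the API.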
#### High Error Rate

```bash
# Check error logs
kubectl logs -n api deployment/api | grep -i error | tail -50

# Check recent deployments
kubectl rollout history deployment/api -n api

# Rollback if needed
kubectl rollout undo deployment/api -n api
```

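`kubectl rollout undo` without arguments returns to the previous revision, which may itself be a bad release. A minimal sketch for rolling back to a specific revision from `kubectl rollout history` and waiting for the result is shown below. `is_valid_revision` and `rollback_api` are hypothetical helper names, and the five-minute timeout is an assumption.

```bash
# Succeed only for a positive integer revision number.
is_valid_revision() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) [ "$1" -gt 0 ] ;;
  esac
}

# Roll the API deployment back to the given revision and wait until the
# rollout finishes or the timeout expires.
rollback_api() {
  revision=$1
  if ! is_valid_revision "$revision"; then
    echo "usage: rollback_api <revision from 'kubectl rollout history'>" >&2
    return 2
  fi
  kubectl rollout undo deployment/api -n api --to-revision="$revision" &&
    kubectl rollout status deployment/api -n api --timeout=5m
}
```

For example, `rollback_api 7` restores revision 7 and returns non-zero if the pods do not become ready in time, so the error rate should only be re-checked after it succeeds.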
#### Billing Anomaly

```bash
# Check billing metrics (URL quoted so the shell does not treat "?" as a glob)
curl "https://prometheus.sankofa.nexus/api/v1/query?query=sankofa_billing_cost_usd"
```

```graphql
# Review recent usage records
query {
  usage(tenantId: "tenant-id", timeRange: {...}) {
    totalCost
    byResource {
      resourceId
      cost
    }
  }
}
```

```bash
# Check for resource leaks
# ("all" covers core workload types only; for Crossplane managed resources,
#  use "kubectl get managed" instead)
kubectl get all --all-namespaces | grep tenant-id
```

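PromQL expressions with spaces, parentheses, or braces must be URL-encoded before they go into the query string. A small sketch follows; it lets curl do the encoding with `-G` and `--data-urlencode`. `prom_query` is a hypothetical helper name.

```bash
# Run an instant query against the Prometheus HTTP API.
# -G sends the data as a URL query string; --data-urlencode encodes the value.
prom_query() {
  curl -sS -G 'https://prometheus.sankofa.nexus/api/v1/query' \
    --data-urlencode "query=$1"
}
```

For example, `prom_query 'topk(5, sankofa_billing_cost_usd)'` returns the five highest-cost series as JSON without any manual escaping.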
## Maintenance Windows

### Scheduled Maintenance

Maintenance windows are scheduled:

- **Weekly**: Sunday 2-4 AM UTC (low traffic)
- **Monthly**: First Sunday 2-6 AM UTC (major updates)

### Pre-Maintenance Checklist

- [ ] Notify all tenants (24h advance)
- [ ] Create backup of database
- [ ] Create backup of Keycloak
- [ ] Review recent changes
- [ ] Prepare rollback plan
- [ ] Set maintenance mode flag

### Maintenance Mode

```bash
# Enable maintenance mode
kubectl set env deployment/api -n api MAINTENANCE_MODE=true

# Disable maintenance mode
kubectl set env deployment/api -n api MAINTENANCE_MODE=false
```

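Changing an environment variable on a Deployment triggers a rolling restart, so the flag is not in effect everywhere until the rollout finishes. The sketch below toggles the flag, waits for the rollout, and prints the resulting value. `set_maintenance_mode` is a hypothetical helper name, and the five-minute timeout is an assumption.

```bash
# Toggle maintenance mode, wait for the rollout, then confirm the value.
set_maintenance_mode() {
  case "$1" in
    true|false) ;;
    *)
      echo "usage: set_maintenance_mode true|false" >&2
      return 2
      ;;
  esac
  kubectl set env deployment/api -n api "MAINTENANCE_MODE=$1" &&
    kubectl rollout status deployment/api -n api --timeout=5m &&
    kubectl set env deployment/api -n api --list | grep '^MAINTENANCE_MODE='
}
```

For example, `set_maintenance_mode true` at the start of a window and `set_maintenance_mode false` at the end each finish by echoing the value now set on the Deployment.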
### Post-Maintenance Checklist

- [ ] Verify all services are up
- [ ] Run health checks
- [ ] Check error rates
- [ ] Verify backups completed
- [ ] Notify tenants of completion
- [ ] Update documentation

## Troubleshooting

### API Not Responding

```bash
# Check pod status
kubectl describe pod -n api -l app=api

# Check logs
kubectl logs -n api -l app=api --tail=100

# Check resource limits
kubectl top pod -n api

# Check network policies
kubectl get networkpolicies -n api
```

### Database Performance Issues

```bash
# Check slow queries (requires the pg_stat_statements extension;
# on PostgreSQL 12 and earlier the column is total_time)
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10"

# Check table sizes
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size FROM pg_tables ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10"

# Analyze tables
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "ANALYZE"
```

### Keycloak Issues

```bash
# Check Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100

# Check database connection
kubectl exec -it -n keycloak deployment/keycloak -- \
  curl http://localhost:8080/health/ready

# Restart Keycloak
kubectl rollout restart deployment/keycloak -n keycloak
```

### Proxmox Integration Issues

```bash
# Check Crossplane provider
kubectl get pods -n crossplane-system | grep proxmox

# Check provider logs
kubectl logs -n crossplane-system deployment/crossplane-provider-proxmox

# Test Proxmox connection
kubectl exec -it -n crossplane-system deployment/crossplane-provider-proxmox -- \
  curl https://proxmox-endpoint:8006/api2/json/version
```

## Security Audit

### Monthly Security Review

1. Review access logs
2. Check for failed authentication attempts
3. Review policy violations
4. Check for unusual API usage
5. Review incident response logs
6. Update security documentation

### Access Review

```graphql
# List all users
query {
  users {
    id
    email
    role
    lastLogin
  }
}

# Review tenant access
query {
  tenant(id: "tenant-id") {
    users {
      id
      email
      role
    }
  }
}
```

## Emergency Contacts

- **On-Call Engineer**: (configure in PagerDuty/Opsgenie)
- **Database Admin**: (configure)
- **Security Team**: (configure)
- **Management**: (configure)

## References

- Monitoring Guide: `docs/MONITORING_GUIDE.md`
- Deployment Guide: `docs/DEPLOYMENT_GUIDE.md`
- Keycloak Guide: `docs/KEYCLOAK_DEPLOYMENT.md`