Operations Runbook

Last Updated: 2025-01-09

This runbook provides operational procedures for Sankofa Phoenix.

Table of Contents

  1. Daily Operations
  2. Tenant Management
  3. Backup Procedures
  4. Incident Response
  5. Maintenance Windows
  6. Troubleshooting
  7. Security Audit
  8. Emergency Contacts
  9. References

Daily Operations

Health Checks

# Check all pods
kubectl get pods --all-namespaces

# Check API health
curl https://api.sankofa.nexus/health

# Check Keycloak health
curl https://keycloak.sankofa.nexus/health

# Check database connections
# (sh -c makes $DATABASE_URL expand inside the pod, not in your local shell)
kubectl exec -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
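
The HTTP checks above can be wrapped so that a failing endpoint produces a non-zero exit code, which makes them usable from cron or CI. This is a minimal sketch: the function names `check_http` and `run_health_checks` and the 10-second timeout are assumptions, not part of the Sankofa tooling.

```shell
# check_http URL: print OK/FAIL for one endpoint; return non-zero on failure.
# Hypothetical helper, not part of the Sankofa tooling.
check_http() {
  # -f: fail on HTTP >= 400, -sS: quiet but still show errors,
  # --max-time 10: assumed time budget per endpoint
  if curl -fsS --max-time 10 -o /dev/null "$1"; then
    echo "OK   $1"
  else
    echo "FAIL $1"
    return 1
  fi
}

# run_health_checks URL...: check every endpoint, report each result,
# and return 1 if any endpoint failed.
run_health_checks() {
  failed=0
  for url in "$@"; do
    check_http "$url" || failed=1
  done
  return "$failed"
}
```

For example: `run_health_checks https://api.sankofa.nexus/health https://keycloak.sankofa.nexus/health`. Every endpoint is checked even after a failure, so one run shows the full picture.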

Monitoring Dashboard Review

  1. Review system overview dashboard
  2. Check error rates and latency
  3. Review billing anomalies
  4. Check security events
  5. Review Proxmox infrastructure status

Log Review

# Recent errors
kubectl logs -n api deployment/api --tail=100 | grep -i error

# Authentication failures
kubectl logs -n api deployment/api | grep -i "auth.*fail"

# Billing issues
kubectl logs -n api deployment/api | grep -i billing
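
For a quick daily overview, the three greps above can be folded into a single pass over the log stream. The sketch below uses the same patterns. The function name `summarize_logs` and the output format are assumptions.

```shell
# summarize_logs: read log lines on stdin and print one count per category.
# Patterns mirror the grep commands above (case-insensitive).
summarize_logs() {
  awk '
    { line = tolower($0) }
    line ~ /error/      { errors++ }
    line ~ /auth.*fail/ { auth++ }
    line ~ /billing/    { billing++ }
    END { printf "errors=%d auth_failures=%d billing=%d\n", errors, auth, billing }
  '
}
```

It can be fed from kubectl, for example `kubectl logs -n api deployment/api --since=1h | summarize_logs`. A line that matches several patterns is counted once in each category.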

Tenant Management

Create New Tenant

# Via GraphQL
mutation {
  createTenant(input: {
    name: "New Tenant"
    domain: "tenant.example.com"
    tier: STANDARD
  }) {
    id
    name
    status
  }
}

# Or via API
curl -X POST https://api.sankofa.nexus/graphql \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { createTenant(...) }"}'
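
Embedding a GraphQL document in a JSON string by hand is error-prone, because every double quote in the mutation has to be escaped. The following sketch builds the request body from an unescaped GraphQL string. The helper names `graphql_body` and `graphql_post` are hypothetical, and the escaping covers only backslashes, double quotes, newlines and tabs.

```shell
# graphql_body QUERY: wrap a GraphQL document in a JSON {"query": "..."} body.
# Newlines and tabs become spaces; backslashes and double quotes are escaped.
graphql_body() {
  escaped=$(printf '%s' "$1" | tr '\n\t' '  ' | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"query": "%s"}' "$escaped"
}

# graphql_post QUERY: send the document to the API endpoint used above.
# Assumes $TOKEN holds a valid bearer token.
graphql_post() {
  curl -fsS -X POST https://api.sankofa.nexus/graphql \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(graphql_body "$1")"
}
```

With this in place, the mutation can be passed in its readable form, for example `graphql_post 'mutation { createTenant(input: { name: "New Tenant", domain: "tenant.example.com", tier: STANDARD }) { id name status } }'`.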

Suspend Tenant

# Update tenant status
mutation {
  updateTenant(id: "tenant-id", input: { status: SUSPENDED }) {
    id
    status
  }
}

Delete Tenant

# Soft delete (recommended)
mutation {
  updateTenant(id: "tenant-id", input: { status: DELETED }) {
    id
    status
  }
}

# Hard delete (requires confirmation)
# This will delete all tenant resources

Tenant Resource Quotas

# Check quota usage
query {
  tenant(id: "tenant-id") {
    quotaLimits {
      compute { vcpu memory instances }
      storage { total perInstance }
    }
    usage {
      totalCost
      byResource {
        resourceId
        cost
      }
    }
  }
}

Backup Procedures

Database Backups

Automated Backups

Backups run daily at 2 AM UTC:

# Check backup job status
kubectl get cronjob -n api postgres-backup

# View recent backups
kubectl get pvc -n api | grep backup

Manual Backup

# Create backup (no -t: a TTY would add carriage returns to the dump)
kubectl exec -n api deployment/postgres -- \
  pg_dump -U sankofa sankofa > backup-$(date +%Y%m%d).sql

# Restore from backup
kubectl exec -i -n api deployment/postgres -- \
  psql -U sankofa sankofa < backup-20240101.sql
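
A dump that was cut short (for example by a dropped connection) still leaves a file behind, so it is worth checking the file before relying on it. This sketch assumes the plain-text format produced by the `pg_dump` command above, which ends with a "PostgreSQL database dump complete" comment. The function name `verify_backup` is hypothetical.

```shell
# verify_backup FILE: basic sanity checks on a plain-text pg_dump file.
# Returns 0 if the file is non-empty and ends with the completion marker.
verify_backup() {
  [ -s "$1" ] || { echo "backup missing or empty: $1" >&2; return 1; }
  # pg_dump writes this comment near the end once the dump has finished
  tail -n 10 "$1" | grep -q 'PostgreSQL database dump complete' \
    || { echo "backup looks truncated: $1" >&2; return 1; }
  echo "backup OK: $1"
}
```

For example: `verify_backup backup-20240101.sql`. This only detects missing or truncated files. A test restore into a scratch database remains the real proof that a backup is usable.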

Keycloak Backups

# Export realm configuration
# (--no-config passes credentials inline; no -t so a TTY does not alter the JSON)
kubectl exec -n keycloak deployment/keycloak -- \
  /opt/keycloak/bin/kcadm.sh get realms/master \
  --no-config \
  --realm master \
  --server http://localhost:8080 \
  --user admin \
  --password "$ADMIN_PASSWORD" > keycloak-realm-$(date +%Y%m%d).json

Proxmox Backups

# Backup VM configuration
# Via Proxmox API or UI
# Store in version control or backup storage

Tenant-Specific Backups

# Export tenant data
query {
  tenant(id: "tenant-id") {
    id
    name
    resources {
      id
      name
      type
    }
  }
}

# Backup tenant resources
# Use resource export API or database dump filtered by tenant_id

Incident Response

Incident Classification

  • P0 - Critical: System down, data loss, security breach
  • P1 - High: Major feature broken, performance degradation
  • P2 - Medium: Minor feature broken, non-critical issues
  • P3 - Low: Cosmetic issues, minor bugs

Incident Response Process

  1. Detection: Monitor alerts, user reports
  2. Triage: Classify severity, assign owner
  3. Containment: Isolate affected systems
  4. Investigation: Root cause analysis
  5. Resolution: Fix and verify
  6. Post-Mortem: Document and improve

Common Incidents

API Down

# Check pod status
kubectl get pods -n api

# Check logs
kubectl logs -n api deployment/api --tail=100

# Restart if needed
kubectl rollout restart deployment/api -n api

# Check database
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT 1"

Database Connection Issues

# Check connection pool
kubectl exec -it -n api deployment/api -- \
  curl http://localhost:4000/metrics | grep db_connections

# Restart API to reset connections
kubectl rollout restart deployment/api -n api

# Check database load
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT * FROM pg_stat_activity"
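
The full `pg_stat_activity` listing is hard to scan during an incident. A grouped count per connection state usually shows faster whether the pool is exhausted or sessions are stuck idle in a transaction. This is a sketch under the same namespace, deployment and user as the commands above. The function name `db_connections_by_state` is hypothetical.

```shell
# db_connections_by_state: print "<state> <count>" per connection state,
# busiest state first. Background processes have no state and show as "none".
db_connections_by_state() {
  # -A: unaligned output, -t: rows only, -F ' ': space as field separator
  kubectl exec -n api deployment/postgres -- \
    psql -U sankofa -At -F ' ' -c \
    "SELECT coalesce(state, 'none'), count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC"
}
```

A large number of `idle in transaction` sessions typically points at the application holding transactions open, rather than at database load.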

High Error Rate

# Check error logs
kubectl logs -n api deployment/api | grep -i error | tail -50

# Check recent deployments
kubectl rollout history deployment/api -n api

# Rollback if needed
kubectl rollout undo deployment/api -n api
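
Before deciding on a rollback, it helps to know what share of recent log lines are errors rather than only seeing the last 50 of them. The helper below is a rough sketch that treats any line containing "error" (case-insensitive) as an error line, the same heuristic as the grep above. The name `error_rate` is hypothetical.

```shell
# error_rate: read log lines on stdin and print the share that contain "error".
# Returns 1 if there is no input at all.
error_rate() {
  awk '
    tolower($0) ~ /error/ { errors++ }
    END {
      if (NR == 0) { print "no log lines"; exit 1 }
      printf "%d/%d lines (%.1f%%) contain \"error\"\n", errors, NR, 100 * errors / NR
    }
  '
}
```

For example: `kubectl logs -n api deployment/api --since=15m | error_rate`. Comparing the figure before and after `kubectl rollout undo` gives a quick indication of whether the rollback helped.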

Billing Anomaly

# Check billing metrics (URL quoted so the shell does not treat ? as a glob)
curl "https://prometheus.sankofa.nexus/api/v1/query?query=sankofa_billing_cost_usd"

# Review recent usage records
query {
  usage(tenantId: "tenant-id", timeRange: {...}) {
    totalCost
    byResource {
      resourceId
      cost
    }
  }
}

# Check for resource leaks
# ("resources" is not a built-in kubectl type; "all" covers core workload kinds only)
kubectl get all --all-namespaces | grep tenant-id
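
PromQL expressions with label matchers contain braces, quotes and spaces, which must be URL-encoded. Letting curl do the encoding avoids hand-built query strings. This is a minimal sketch: the function name `prom_query` is hypothetical, and the `tenant_id` label in the usage example is an assumption about how the metric is labeled.

```shell
# prom_query EXPR: run an instant PromQL query and print the raw JSON response.
prom_query() {
  # -G sends the data as a URL query string; --data-urlencode escapes the expression
  curl -fsS -G "https://prometheus.sankofa.nexus/api/v1/query" \
    --data-urlencode "query=$1"
}
```

For example: `prom_query 'sankofa_billing_cost_usd{tenant_id="tenant-id"}'`.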

Maintenance Windows

Scheduled Maintenance

Maintenance windows follow a fixed schedule:

  • Weekly: Sunday 2-4 AM UTC (low traffic)
  • Monthly: First Sunday 2-6 AM UTC (major updates)
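
Scripts that perform disruptive work can guard themselves with a check against this schedule. The sketch below encodes the two windows as a pure function of weekday, day of month and hour, all in UTC. The function name `in_maintenance_window` is hypothetical, and "first Sunday" is read as a Sunday falling on day 1 to 7 of the month.

```shell
# in_maintenance_window DOW DOM HOUR
#   DOW:  day of week, 0 = Sunday (as printed by `date -u +%w`)
#   DOM:  day of month, 1-31
#   HOUR: hour of day in UTC, 0-23
# Returns 0 inside a maintenance window, 1 otherwise.
in_maintenance_window() {
  dow=$1 dom=$2 hour=$3
  [ "$dow" -eq 0 ] || return 1     # both windows fall on Sundays
  if [ "$dom" -le 7 ]; then        # first Sunday of the month: 02:00-06:00
    [ "$hour" -ge 2 ] && [ "$hour" -lt 6 ]
  else                             # any other Sunday: 02:00-04:00
    [ "$hour" -ge 2 ] && [ "$hour" -lt 4 ]
  fi
}
```

A guard in a script could then read `in_maintenance_window "$(date -u +%w)" "$(date -u +%d)" "$(date -u +%H)" || exit 1`.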

Pre-Maintenance Checklist

  • Notify all tenants (24h advance)
  • Create backup of database
  • Create backup of Keycloak
  • Review recent changes
  • Prepare rollback plan
  • Set maintenance mode flag

Maintenance Mode

# Enable maintenance mode
kubectl set env deployment/api -n api MAINTENANCE_MODE=true

# Disable maintenance mode
kubectl set env deployment/api -n api MAINTENANCE_MODE=false
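
Changing an environment variable with `kubectl set env` triggers a rolling update of the deployment, so the flag is not effective until the new pods are ready. The following sketch toggles the flag and waits for the rollout. The function name `set_maintenance_mode` and the 120-second timeout are assumptions.

```shell
# set_maintenance_mode true|false: toggle the flag and wait for the rollout.
set_maintenance_mode() {
  case "$1" in
    true|false) ;;
    *) echo "usage: set_maintenance_mode true|false" >&2; return 2 ;;
  esac
  kubectl set env deployment/api -n api "MAINTENANCE_MODE=$1" || return 1
  # The env change restarts the pods; block until the new ones are ready
  kubectl rollout status deployment/api -n api --timeout=120s
}
```

For example, `set_maintenance_mode true` before the window starts and `set_maintenance_mode false` once the post-maintenance checks pass.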

Post-Maintenance Checklist

  • Verify all services are up
  • Run health checks
  • Check error rates
  • Verify backups completed
  • Notify tenants of completion
  • Update documentation

Troubleshooting

API Not Responding

# Check pod status
kubectl describe pod -n api -l app=api

# Check logs
kubectl logs -n api -l app=api --tail=100

# Check resource limits
kubectl top pod -n api

# Check network policies
kubectl get networkpolicies -n api

Database Performance Issues

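# Check slow queries (requires the pg_stat_statements extension;
# on PostgreSQL 13+ the column is total_exec_time instead of total_time)
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT * FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10"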

# Check table sizes
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size FROM pg_tables ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10"

# Analyze tables
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "ANALYZE"

Keycloak Issues

# Check Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100

# Check database connection
kubectl exec -it -n keycloak deployment/keycloak -- \
  curl http://localhost:8080/health/ready

# Restart Keycloak
kubectl rollout restart deployment/keycloak -n keycloak

Proxmox Integration Issues

# Check Crossplane provider
kubectl get pods -n crossplane-system | grep proxmox

# Check provider logs
kubectl logs -n crossplane-system deployment/crossplane-provider-proxmox

# Test Proxmox connection
kubectl exec -it -n crossplane-system deployment/crossplane-provider-proxmox -- \
  curl https://proxmox-endpoint:8006/api2/json/version

Security Audit

Monthly Security Review

  1. Review access logs
  2. Check for failed authentication attempts
  3. Review policy violations
  4. Check for unusual API usage
  5. Review incident response logs
  6. Update security documentation

Access Review

# List all users
query {
  users {
    id
    email
    role
    lastLogin
  }
}

# Review tenant access
query {
  tenant(id: "tenant-id") {
    users {
      id
      email
      role
    }
  }
}

Emergency Contacts

  • On-Call Engineer: (configure in PagerDuty/Opsgenie)
  • Database Admin: (configure)
  • Security Team: (configure)
  • Management: (configure)

References

  • Monitoring Guide: docs/MONITORING_GUIDE.md
  • Deployment Guide: docs/DEPLOYMENT_GUIDE.md
  • Keycloak Guide: docs/KEYCLOAK_DEPLOYMENT.md