# Operations Runbook

This runbook provides operational procedures for Sankofa Phoenix.

## Table of Contents

1. [Daily Operations](#daily-operations)
2. [Tenant Management](#tenant-management)
3. [Backup Procedures](#backup-procedures)
4. [Incident Response](#incident-response)
5. [Maintenance Windows](#maintenance-windows)
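6. [Troubleshooting](#troubleshooting)
7. [Security Audit](#security-audit)
8. [Emergency Contacts](#emergency-contacts)
9. [References](#references)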

## Daily Operations

### Health Checks

```bash
# Check all pods
kubectl get pods --all-namespaces

# Check API health
curl https://api.sankofa.nexus/health

# Check Keycloak health
curl https://keycloak.sankofa.nexus/health

# Check database connections
# (sh -c makes $DATABASE_URL expand inside the pod, not in your local shell)
kubectl exec -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
```
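
The checks above can be wrapped in a small script that prints one pass/fail line per check and returns non-zero if any check fails. This is a minimal sketch: `run_check` and `run_all_checks` are hypothetical helper names, not part of the Sankofa Phoenix tooling, and the database check assumes the API container ships `sh` and `psql`.

```bash
# Sketch only: run_check and run_all_checks are illustrative helper names.

# Run one named check; print OK/FAIL and return the command's success or failure.
run_check() {
  name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name"
    return 1
  fi
}

# Run every check even if an earlier one fails; return 1 if any failed.
run_all_checks() {
  failed=0
  run_check "api health" curl -fsS https://api.sankofa.nexus/health || failed=1
  run_check "keycloak health" curl -fsS https://keycloak.sankofa.nexus/health || failed=1
  run_check "database" kubectl exec -n api deployment/api -- \
    sh -c 'psql "$DATABASE_URL" -c "SELECT 1"' || failed=1
  return "$failed"
}
```

Calling `run_all_checks || echo "one or more health checks failed"` runs all three checks. `curl -f` is used so that an HTTP error status counts as a failure rather than a success.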

### Monitoring Dashboard Review

1. Review system overview dashboard
2. Check error rates and latency
3. Review billing anomalies
4. Check security events
5. Review Proxmox infrastructure status

### Log Review

```bash
# Recent errors
kubectl logs -n api deployment/api --tail=100 | grep -i error

# Authentication failures
kubectl logs -n api deployment/api | grep -i "auth.*fail"

# Billing issues
kubectl logs -n api deployment/api | grep -i billing
```
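
For a quick overview, the three greps above can be folded into one pass that counts matches per category. A minimal sketch, assuming the same case-insensitive patterns; `summarize_logs` is a hypothetical helper name.

```bash
# Sketch only: summarize_logs is an illustrative helper name.
# Reads log lines on stdin and prints one count per pattern used above.
summarize_logs() {
  awk '
    { line = tolower($0) }
    line ~ /error/      { errors++ }
    line ~ /auth.*fail/ { auth++ }
    line ~ /billing/    { billing++ }
    END {
      printf "errors=%d auth_failures=%d billing=%d\n", errors, auth, billing
    }
  '
}
```

For example, `kubectl logs -n api deployment/api --since=1h | summarize_logs` summarizes the last hour. A line that matches more than one pattern is counted in each matching category.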

## Tenant Management

### Create New Tenant

```bash
# Via GraphQL
mutation {
  createTenant(input: {
    name: "New Tenant"
    domain: "tenant.example.com"
    tier: STANDARD
  }) {
    id
    name
    status
  }
}

# Or via API (the JSON body needs an explicit Content-Type header)
curl -X POST https://api.sankofa.nexus/graphql \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { createTenant(...) }"}'
```
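
Embedding a multi-line mutation in the `-d` string by hand is error-prone because inner double quotes must be escaped. The sketch below shows one way to build the JSON envelope in plain shell. `graphql_payload` and `graphql_post` are hypothetical helper names. The escaping handles only backslashes, double quotes, and newlines, so it is not a general JSON encoder.

```bash
# Sketch only: graphql_payload and graphql_post are illustrative helper names.

# Wrap a GraphQL document in the {"query": "..."} JSON envelope.
# Escapes backslashes and double quotes and folds newlines into spaces.
graphql_payload() {
  escaped=$(printf '%s' "$1" | tr '\n' ' ' | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"query":"%s"}\n' "$escaped"
}

# POST a GraphQL document to the API using the bearer token in $TOKEN.
graphql_post() {
  graphql_payload "$1" | curl -fsS -X POST https://api.sankofa.nexus/graphql \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    --data-binary @-
}
```

With these helpers, `graphql_post 'query { tenant(id: "tenant-id") { id status } }'` sends the query without manual quoting.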

### Suspend Tenant

```graphql
# Update tenant status
mutation {
  updateTenant(id: "tenant-id", input: { status: SUSPENDED }) {
    id
    status
  }
}
```

### Delete Tenant

```graphql
# Soft delete (recommended)
mutation {
  updateTenant(id: "tenant-id", input: { status: DELETED }) {
    id
    status
  }
}

# Hard delete (requires confirmation)
# This will delete all tenant resources
```
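
The runbook does not show the hard-delete call itself, so the sketch below covers only the confirmation step: the operator must retype the tenant ID before anything destructive runs. `confirm_hard_delete` is a hypothetical helper name.

```bash
# Sketch only: confirm_hard_delete is an illustrative helper name.
# Asks the operator to retype the tenant ID; returns 0 only on an exact match.
confirm_hard_delete() {
  tenant_id=$1
  printf 'Type the tenant ID (%s) to confirm hard delete: ' "$tenant_id" >&2
  IFS= read -r answer || return 1
  [ "$answer" = "$tenant_id" ]
}
```

A caller would guard the destructive step with `confirm_hard_delete "tenant-id" || exit 1`. Retyping the ID, rather than answering yes or no, makes it harder to confirm the wrong tenant by reflex.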

### Tenant Resource Quotas

```graphql
# Check quota usage
query {
  tenant(id: "tenant-id") {
    quotaLimits {
      compute { vcpu memory instances }
      storage { total perInstance }
    }
    usage {
      totalCost
      byResource {
        resourceId
        cost
      }
    }
  }
}
```

## Backup Procedures

### Database Backups

#### Automated Backups

Backups run daily at 2 AM UTC:

```bash
# Check backup job status
kubectl get cronjob -n api postgres-backup

# List backup volumes (PVCs with "backup" in the name)
kubectl get pvc -n api | grep backup
```

#### Manual Backup

```bash
# Create backup (no -t: a TTY would alter line endings in the redirected dump)
kubectl exec -n api deployment/postgres -- \
  pg_dump -U sankofa sankofa > backup-$(date +%Y%m%d).sql

# Restore from backup
kubectl exec -i -n api deployment/postgres -- \
  psql -U sankofa sankofa < backup-20240101.sql
```
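
Before relying on a manual dump, it is worth checking that it is not empty or truncated. A plain-format `pg_dump` ends with a `PostgreSQL database dump complete` comment, which the sketch below looks for. `verify_dump` is a hypothetical helper name, and the check does not validate the SQL itself.

```bash
# Sketch only: verify_dump is an illustrative helper name.
# Fails if the file is missing, empty, or lacks the pg_dump completion trailer.
verify_dump() {
  file=$1
  if [ ! -s "$file" ]; then
    echo "FAIL: $file is missing or empty" >&2
    return 1
  fi
  if ! tail -n 10 "$file" | grep -q 'PostgreSQL database dump complete'; then
    echo "FAIL: $file has no completion trailer (truncated dump?)" >&2
    return 1
  fi
  echo "OK: $file"
}
```

For example, `verify_dump "backup-$(date +%Y%m%d).sql"` can run right after the dump. A restore into a scratch database remains the only full test of a backup.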

### Keycloak Backups

```bash
# Export realm configuration
# (no -t so the redirected JSON stays clean; $ADMIN_PASSWORD is read from your local shell)
kubectl exec -n keycloak deployment/keycloak -- \
  /opt/keycloak/bin/kcadm.sh get realms/master \
  --realm master \
  --server http://localhost:8080 \
  --user admin \
  --password "$ADMIN_PASSWORD" > keycloak-realm-$(date +%Y%m%d).json
```

### Proxmox Backups

```bash
# Backup VM configuration
# Via Proxmox API or UI
# Store in version control or backup storage
```

### Tenant-Specific Backups

```graphql
# Export tenant data
query {
  tenant(id: "tenant-id") {
    id
    name
    resources {
      id
      name
      type
    }
  }
}

# Backup tenant resources
# Use resource export API or database dump filtered by tenant_id
```

## Incident Response

### Incident Classification

- **P0 - Critical**: System down, data loss, security breach
- **P1 - High**: Major feature broken, performance degradation
- **P2 - Medium**: Minor feature broken, non-critical issues
- **P3 - Low**: Cosmetic issues, minor bugs

### Incident Response Process

1. **Detection**: Monitor alerts, user reports
2. **Triage**: Classify severity, assign owner
3. **Containment**: Isolate affected systems
4. **Investigation**: Root cause analysis
5. **Resolution**: Fix and verify
6. **Post-Mortem**: Document and improve

### Common Incidents

#### API Down

```bash
# Check pod status
kubectl get pods -n api

# Check logs
kubectl logs -n api deployment/api --tail=100

# Restart if needed
kubectl rollout restart deployment/api -n api

# Check database
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT 1"
```

#### Database Connection Issues

```bash
# Check connection pool (-s hides curl's progress meter; grep runs locally)
kubectl exec -n api deployment/api -- \
  curl -s http://localhost:4000/metrics | grep db_connections

# Restart API to reset connections
kubectl rollout restart deployment/api -n api

# Check database load
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT * FROM pg_stat_activity"
```

#### High Error Rate

```bash
# Check error logs
kubectl logs -n api deployment/api | grep -i error | tail -50

# Check recent deployments
kubectl rollout history deployment/api -n api

# Rollback if needed
kubectl rollout undo deployment/api -n api
```
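
To decide whether a rollback is warranted, it can help to turn the log grep into a rough error percentage. A minimal sketch, assuming log lines are a usable proxy for requests. `error_rate_pct` and `should_rollback` are hypothetical helper names, and the 5% default threshold is an assumption, not a documented Sankofa Phoenix policy.

```bash
# Sketch only: error_rate_pct and should_rollback are illustrative helper names;
# the default threshold of 5 percent is an assumption.

# Read log lines on stdin and print the integer percentage that mention "error".
error_rate_pct() {
  awk '
    tolower($0) ~ /error/ { errors++ }
    END { if (NR == 0) print 0; else print int(errors * 100 / NR) }
  '
}

# Return 0 (rollback suggested) when the percentage exceeds the threshold.
should_rollback() {
  pct=$1
  threshold=${2:-5}
  [ "$pct" -gt "$threshold" ]
}
```

A typical use is `pct=$(kubectl logs -n api deployment/api --tail=1000 | error_rate_pct)` followed by `should_rollback "$pct" && echo "consider kubectl rollout undo"`. The dashboard error rate is the better signal when it is available.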

#### Billing Anomaly

```bash
# Check billing metrics (quote the URL so the shell does not treat "?" as a glob)
curl 'https://prometheus.sankofa.nexus/api/v1/query?query=sankofa_billing_cost_usd'

# Review recent usage records
query {
  usage(tenantId: "tenant-id", timeRange: {...}) {
    totalCost
    byResource {
      resourceId
      cost
    }
  }
}

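# Check for resource leaks
# Note: "resources" must exist as a resource type in the cluster (for example via a CRD);
# with Crossplane installed, `kubectl get managed` lists all managed resources.
kubectl get resources --all-namespaces | grep tenant-id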
```
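
PromQL expressions with parentheses, braces, or quotes need URL encoding, which is easy to get wrong by hand. The sketch below lets `curl` do the encoding. `prom_query` is a hypothetical helper name; the endpoint is the one used above.

```bash
# Sketch only: prom_query is an illustrative helper name.
# -G sends the --data-urlencode value as a URL query string, so PromQL
# containing braces, quotes, or spaces is encoded correctly.
prom_query() {
  curl -fsS -G "https://prometheus.sankofa.nexus/api/v1/query" \
    --data-urlencode "query=$1"
}
```

For example, `prom_query 'sum(sankofa_billing_cost_usd)'` returns the summed value of the billing metric as JSON.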

## Maintenance Windows

### Scheduled Maintenance

Maintenance windows are scheduled as follows:

- **Weekly**: Sunday 2-4 AM UTC (low traffic)
- **Monthly**: First Sunday 2-6 AM UTC (major updates)
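
Scripts that perform disruptive work can check the schedule above before they run. A minimal sketch, assuming "First Sunday" means the Sunday that falls on day 1 to 7 of the month; `in_maintenance_window` is a hypothetical helper name.

```bash
# Sketch only: in_maintenance_window is an illustrative helper name.
# Arguments: ISO weekday (1=Mon .. 7=Sun), day of month, and hour, all in UTC.
# Returns 0 inside the weekly (Sun 02-04) or monthly (first Sun 02-06) window.
in_maintenance_window() {
  dow=$1
  dom=${2#0}    # normalize "08" to "8"
  hour=${3#0}
  [ "$dow" -eq 7 ] || return 1
  if [ "$dom" -le 7 ]; then
    end=6   # first Sunday of the month: monthly window
  else
    end=4   # any other Sunday: weekly window
  fi
  [ "$hour" -ge 2 ] && [ "$hour" -lt "$end" ]
}
```

To test the current time, call `in_maintenance_window "$(date -u +%u)" "$(date -u +%d)" "$(date -u +%H)"`. Taking the date parts as arguments keeps the function easy to test with fixed values.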

### Pre-Maintenance Checklist

- [ ] Notify all tenants (24h advance)
- [ ] Create backup of database
- [ ] Create backup of Keycloak
- [ ] Review recent changes
- [ ] Prepare rollback plan
- [ ] Set maintenance mode flag

### Maintenance Mode

```bash
# Enable maintenance mode (changing the env var triggers a rolling restart of the API)
kubectl set env deployment/api -n api MAINTENANCE_MODE=true

# Disable maintenance mode
kubectl set env deployment/api -n api MAINTENANCE_MODE=false
```
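
A common failure is forgetting to turn maintenance mode off after the work is done. The sketch below wraps a command so that the flag is reset whenever the command returns, whether it succeeded or not. `with_maintenance_mode` is a hypothetical helper name, and the 120-second rollout timeout is an assumption.

```bash
# Sketch only: with_maintenance_mode is an illustrative helper name.
# Enables maintenance mode, waits for the rollout, runs the given command,
# then disables maintenance mode again even if the command returned non-zero.
with_maintenance_mode() {
  kubectl set env deployment/api -n api MAINTENANCE_MODE=true || return 1
  status=0
  # If the rollout does not complete, the command is skipped.
  kubectl rollout status deployment/api -n api --timeout=120s && "$@" || status=$?
  kubectl set env deployment/api -n api MAINTENANCE_MODE=false
  return "$status"
}
```

For example, `with_maintenance_mode ./run-migrations.sh` (a placeholder script name) runs the script inside the window. The wrapper does not handle signals, so an interrupted run still needs a manual reset.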

### Post-Maintenance Checklist

- [ ] Verify all services are up
- [ ] Run health checks
- [ ] Check error rates
- [ ] Verify backups completed
- [ ] Notify tenants of completion
- [ ] Update documentation

## Troubleshooting

### API Not Responding

```bash
# Check pod status
kubectl describe pod -n api -l app=api

# Check logs
kubectl logs -n api -l app=api --tail=100

# Check resource usage (requires metrics-server; compare against the limits shown by describe)
kubectl top pod -n api

# Check network policies
kubectl get networkpolicies -n api
```

### Database Performance Issues

```bash
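# Check slow queries (requires the pg_stat_statements extension;
# on PostgreSQL 12 and earlier the column is total_time)
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10"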

# Check table sizes
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size FROM pg_tables ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10"

# Analyze tables
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "ANALYZE"
```

### Keycloak Issues

```bash
# Check Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100

# Check database connection
kubectl exec -it -n keycloak deployment/keycloak -- \
  curl http://localhost:8080/health/ready

# Restart Keycloak
kubectl rollout restart deployment/keycloak -n keycloak
```

### Proxmox Integration Issues

```bash
# Check Crossplane provider
kubectl get pods -n crossplane-system | grep proxmox

# Check provider logs
kubectl logs -n crossplane-system deployment/crossplane-provider-proxmox

# Test Proxmox connection
kubectl exec -it -n crossplane-system deployment/crossplane-provider-proxmox -- \
  curl https://proxmox-endpoint:8006/api2/json/version
```

## Security Audit

### Monthly Security Review

1. Review access logs
2. Check for failed authentication attempts
3. Review policy violations
4. Check for unusual API usage
5. Review incident response logs
6. Update security documentation

### Access Review

```graphql
# List all users
query {
  users {
    id
    email
    role
    lastLogin
  }
}

# Review tenant access
query {
  tenant(id: "tenant-id") {
    users {
      id
      email
      role
    }
  }
}
```
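
Part of an access review is finding accounts that have not logged in recently. A minimal sketch, assuming the `users` query result has been flattened to `email<TAB>lastLogin` lines with ISO 8601 UTC timestamps; `stale_users` is a hypothetical helper name.

```bash
# Sketch only: stale_users is an illustrative helper name. Assumes input lines
# of the form "email<TAB>lastLogin" with ISO 8601 UTC timestamps, which
# compare correctly as plain strings.
# Prints the email of every user who never logged in or last logged in
# before the cutoff date.
stale_users() {
  cutoff=$1
  awk -F '\t' -v cutoff="$cutoff" \
    '$2 == "" || ($2 "") < (cutoff "") { print $1 }'
}
```

For example, `stale_users 2024-01-01 < users.tsv` (a placeholder file name) lists accounts with no login since the start of 2024.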

## Emergency Contacts

- **On-Call Engineer**: (configure in PagerDuty/Opsgenie)
- **Database Admin**: (configure)
- **Security Team**: (configure)
- **Management**: (configure)

## References

- Monitoring Guide: `docs/MONITORING_GUIDE.md`
- Deployment Guide: `docs/DEPLOYMENT_GUIDE.md`
- Keycloak Guide: `docs/KEYCLOAK_DEPLOYMENT.md`