Sankofa/docs/TROUBLESHOOTING_GUIDE.md
defiQUG 9daf1fd378 Apply Composer changes: comprehensive API updates, migrations, middleware, and infrastructure improvements
2025-12-12 18:01:35 -08:00

Troubleshooting Guide

Common issues and solutions for Sankofa Phoenix.

Table of Contents

  1. API Issues
  2. Database Issues
  3. Authentication Issues
  4. Resource Provisioning
  5. Billing Issues
  6. Performance Issues
  7. Deployment Issues
  8. Getting Help
  9. Common Error Messages

API Issues

API Not Responding

Symptoms:

  • 503 Service Unavailable
  • Connection timeout
  • Health check fails

Diagnosis:

# Check pod status
kubectl get pods -n api

# Check logs
kubectl logs -n api deployment/api --tail=100

# Check service
kubectl get svc -n api api

Solutions:

  1. Restart API deployment:

    kubectl rollout restart deployment/api -n api
    
  2. Check resource limits:

    kubectl describe pod -n api -l app=api
    
  3. Verify database connection:

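    # sh -c makes $DATABASE_URL expand inside the pod, not in your local shell
    # (assumes psql is installed in the API image)
    kubectl exec -it -n api deployment/api -- \
      sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'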
    

GraphQL Query Errors

Symptoms:

  • GraphQL errors in response
  • "Internal server error"
  • Query timeouts

Diagnosis:

# Check API logs for errors
kubectl logs -n api deployment/api | grep -i error

# Test GraphQL endpoint
curl -X POST https://api.sankofa.nexus/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ health { status } }"}'

Solutions:

  1. Check query syntax
  2. Verify authentication token
  3. Check database query performance
  4. Review resolver logs

Rate Limiting

Symptoms:

  • 429 Too Many Requests
  • Rate limit headers present

Solutions:

  1. Implement request batching
  2. Use subscriptions for real-time updates
  3. Ask an administrator to raise the rate limit
  4. Implement client-side caching
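
Whichever of these you adopt, a client should also back off when a 429 arrives. The sketch below shows one way to do that from a shell script. It assumes the API returns a `Retry-After` header expressed in seconds, which this guide does not confirm, and that `TOKEN` holds a bearer token. The function names `backoff_delay` and `graphql_with_retry` and the `MAX_DELAY` / `MAX_ATTEMPTS` variables are illustrative, not part of Sankofa Phoenix.

```shell
#!/bin/sh
# Illustrative retry helper for 429 responses (not part of the Sankofa tooling).

# Print the wait in seconds for attempt N (1-based): 1, 2, 4, ... capped at MAX_DELAY.
backoff_delay() {
  n=$1
  max=${MAX_DELAY:-60}
  delay=1
  i=1
  while [ "$i" -lt "$n" ]; do
    delay=$((delay * 2))
    if [ "$delay" -ge "$max" ]; then
      delay=$max
      break
    fi
    i=$((i + 1))
  done
  echo "$delay"
}

# POST a GraphQL request body ($1), retrying on 429 up to MAX_ATTEMPTS times.
graphql_with_retry() {
  body=$1
  attempt=1
  hdrs=$(mktemp)
  out=$(mktemp)
  while [ "$attempt" -le "${MAX_ATTEMPTS:-5}" ]; do
    status=$(curl -s -o "$out" -D "$hdrs" -w '%{http_code}' \
      -X POST https://api.sankofa.nexus/graphql \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $TOKEN" \
      -d "$body")
    if [ "$status" != "429" ]; then
      cat "$out"
      rm -f "$hdrs" "$out"
      return 0
    fi
    # Prefer the server's Retry-After value (assumed to be seconds) when present.
    wait_s=$(tr -d '\r' < "$hdrs" | awk 'tolower($1) == "retry-after:" { print $2; exit }')
    case $wait_s in
      ''|*[!0-9]*) wait_s=$(backoff_delay "$attempt") ;;
    esac
    sleep "$wait_s"
    attempt=$((attempt + 1))
  done
  rm -f "$hdrs" "$out"
  return 1
}
```

Example call: `graphql_with_retry '{"query": "{ health { status } }"}'`. The cap keeps a long outage from producing multi-minute sleeps; tune it to whatever limit window the API actually enforces.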

Database Issues

Connection Pool Exhausted

Symptoms:

  • "Too many connections" errors
  • Slow query responses
  • Database connection timeouts

Diagnosis:

# Check active connections
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT count(*) FROM pg_stat_activity"

# Check connection pool metrics
curl -s https://api.sankofa.nexus/metrics | grep db_connections

Solutions:

  1. Increase connection pool size:

    env:
      - name: DB_POOL_SIZE
        value: "30"
    
  2. Close idle connections:

    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE state = 'idle'
      AND state_change < NOW() - INTERVAL '5 minutes'
      AND datname = current_database()
      AND pid <> pg_backend_pid();
    
  3. Restart API to reset connections
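
When raising `DB_POOL_SIZE`, keep in mind that the pools of all API replicas together must stay below PostgreSQL's `max_connections` (100 by default), or the "Too many connections" error simply moves to the database. The rough check below assumes every replica opens a full pool and that a few connections are held back for admin and monitoring sessions. The function name and the default of 5 reserved connections are illustrative assumptions.

```shell
# Print the largest per-replica pool size that fits under max_connections.
# $1 = PostgreSQL max_connections
# $2 = number of API replicas
# $3 = connections reserved for admin/monitoring sessions (assumed default: 5)
max_pool_size() {
  max_conn=$1
  replicas=$2
  reserved=${3:-5}
  if [ "$replicas" -le 0 ]; then
    echo "replica count must be positive" >&2
    return 1
  fi
  # Integer division rounds down, which errs on the safe side.
  echo $(( (max_conn - reserved) / replicas ))
}

# Example: max_connections=100 with 3 API replicas
max_pool_size 100 3   # prints 31
```

Run `SHOW max_connections;` in psql to get the real limit before choosing a value.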

Slow Queries

Symptoms:

  • High query latency
  • Timeout errors
  • High database CPU usage

Diagnosis:

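-- Find slow queries (requires the pg_stat_statements extension;
-- on PostgreSQL 12 and earlier the column is mean_time)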
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

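-- Check table sizes, largest first (format('%I.%I') quotes mixed-case names safely)
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass)) AS size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass) DESC;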

Solutions:

  1. Add database indexes:

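    -- CONCURRENTLY avoids blocking writes while the index builds;
    -- it cannot run inside a transaction block
    CREATE INDEX CONCURRENTLY idx_resources_tenant_id ON resources(tenant_id);
    CREATE INDEX CONCURRENTLY idx_resources_status ON resources(status);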
    
  2. Analyze tables:

    ANALYZE resources;
    
  3. Optimize queries

  4. Consider read replicas for heavy read workloads

Database Lock Issues

Symptoms:

  • Queries hanging
  • "Lock timeout" errors
  • Deadlock errors

Solutions:

  1. Check for long-running transactions:

    SELECT pid, state, query, now() - xact_start AS duration
    FROM pg_stat_activity
    WHERE state IN ('active', 'idle in transaction') AND xact_start IS NOT NULL
    ORDER BY duration DESC;
    
  2. Terminate blocking sessions with pg_terminate_backend(pid), if it is safe to do so

  3. Review transaction isolation levels

  4. Break up large transactions

Authentication Issues

Token Expired

Symptoms:

  • 401 Unauthorized
  • "Token expired" error
  • Keycloak errors

Solutions:

  1. Refresh token via Keycloak
  2. Re-authenticate
  3. Check token expiration settings in Keycloak
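
Before changing Keycloak settings, it helps to confirm that the token really has expired. Keycloak access tokens are JWTs, so the `exp` claim can be read straight from the payload. The sketch below decodes it without verifying the signature, so it is for diagnosis only. The function names are illustrative, and `base64 -d` assumes GNU coreutils or a compatible `base64`.

```shell
# Decode the payload (second dot-separated segment) of a JWT.
# No signature verification is performed.
jwt_payload() {
  # Convert base64url characters back to standard base64
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops '=' padding; restore it so base64 -d accepts the input
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Print the numeric "exp" claim (seconds since the Unix epoch).
jwt_exp() {
  jwt_payload "$1" |
    sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Print the seconds remaining before expiry (negative if already expired).
jwt_seconds_left() {
  exp=$(jwt_exp "$1")
  if [ -z "$exp" ]; then
    echo "no exp claim found" >&2
    return 1
  fi
  echo $(( exp - $(date +%s) ))
}
```

For example, `jwt_seconds_left "$TOKEN"` printing a negative number means the token has expired and should be refreshed. A positive number alongside a "Token expired" error points toward clock skew between the client and the server.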

Invalid Token

Symptoms:

  • 401 Unauthorized
  • "Invalid token" error

Diagnosis:

# Verify Keycloak is accessible
curl https://keycloak.sankofa.nexus/health

# Check Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100

Solutions:

  1. Verify token format
  2. Check Keycloak client configuration
  3. Verify token signature
  4. Check clock synchronization

Permission Denied

Symptoms:

  • 403 Forbidden
  • "Access denied" error

Solutions:

  1. Verify user role in Keycloak
  2. Check tenant context
  3. Review RBAC policies
  4. Verify resource ownership

Resource Provisioning

VM Creation Fails

Symptoms:

  • Resource stuck in PENDING
  • Proxmox errors
  • Crossplane errors

Diagnosis:

# Check Crossplane provider
kubectl get pods -n crossplane-system | grep proxmox

# Check ProxmoxVM resource
kubectl describe proxmoxvm -n default test-vm

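# Check Proxmox connectivity (any HTTP response, even 401, shows the endpoint is reachable;
# add -k if Proxmox uses a self-signed certificate)
kubectl exec -it -n crossplane-system deployment/crossplane-provider-proxmox -- \
  curl https://proxmox-endpoint:8006/api2/json/version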

Solutions:

  1. Verify Proxmox credentials
  2. Check Proxmox node availability
  3. Verify resource quotas
  4. Check Crossplane provider logs

Resource Update Fails

Symptoms:

  • Update mutation fails
  • Resource not updating
  • Status mismatch

Solutions:

  1. Check resource state
  2. Verify update permissions
  3. Review resource constraints
  4. Check for conflicting updates

Billing Issues

Incorrect Costs

Symptoms:

  • Unexpected charges
  • Missing usage records
  • Cost discrepancies

Diagnosis:

-- Check usage records
SELECT * FROM usage_records
WHERE tenant_id = 'tenant-id'
ORDER BY timestamp DESC
LIMIT 100;

-- Check billing calculations
SELECT * FROM invoices
WHERE tenant_id = 'tenant-id'
ORDER BY created_at DESC;

Solutions:

  1. Review usage records
  2. Verify pricing configuration
  3. Check for duplicate records
  4. Recalculate costs if needed
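
Checking for duplicate records can be done offline on a CSV export of `usage_records`. The sketch below assumes each row has the layout `tenant_id,resource_id,timestamp,quantity`. Only `tenant_id` and `timestamp` appear in this guide; `resource_id`, `quantity`, and the function name are assumptions to adapt to the real schema.

```shell
# Read CSV rows "tenant_id,resource_id,timestamp,quantity" on stdin (assumed layout)
# and print "count,tenant_id,resource_id,timestamp" for every key seen more than once.
find_duplicate_usage() {
  awk -F, '
    { key = $1 "," $2 "," $3; seen[key]++ }
    END { for (k in seen) if (seen[k] > 1) print seen[k] "," k }
  ' | sort
}
```

An empty result means no two rows share the same tenant, resource, and timestamp. Any line that is printed identifies a usage event that may have been billed more than once.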

Budget Alerts Not Triggering

Symptoms:

  • Budget exceeded but no alert
  • Alerts not sent

Diagnosis:

-- Check budget status
SELECT * FROM budgets
WHERE tenant_id = 'tenant-id';

-- Check alert configuration
SELECT * FROM billing_alerts
WHERE tenant_id = 'tenant-id' AND enabled = true;

Solutions:

  1. Verify alert configuration
  2. Check alert evaluation schedule
  3. Review notification channels
  4. Test alert manually

Invoice Generation Fails

Symptoms:

  • Invoice creation error
  • Missing line items
  • PDF generation fails

Solutions:

  1. Check usage records exist
  2. Verify billing period
  3. Check PDF service
  4. Review invoice template

Performance Issues

High Latency

Symptoms:

  • Slow API responses
  • Timeout errors
  • High P95 latency

Diagnosis:

# Check API metrics
curl -s https://api.sankofa.nexus/metrics | grep request_duration

# Check database performance
kubectl exec -it -n api deployment/postgres -- \
  psql -U sankofa -c "SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10"

Solutions:

  1. Add caching layer
  2. Optimize database queries
  3. Scale API horizontally
  4. Look for N+1 query patterns
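
To judge whether a change actually helped, compare P95 latency before and after. If request durations are available as plain numbers, one per line (for example, extracted from access logs), the nearest-rank P95 can be computed as sketched below. The function name `p95` and the input format are assumptions.

```shell
# Print the 95th percentile (nearest-rank method) of the numbers on stdin, one per line.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # rank = ceil(0.95 * NR), computed with integer arithmetic
      r = int((NR * 95 + 99) / 100)
      print v[r]
    }
  '
}
```

The nearest-rank method always returns a value that was actually observed, which makes the result easy to trace back to a specific slow request.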

High Memory Usage

Symptoms:

  • OOM kills
  • Pod restarts
  • Memory warnings

Solutions:

  1. Increase memory limits
  2. Investigate memory leaks
  3. Optimize data structures
  4. Implement pagination

High CPU Usage

Symptoms:

  • Slow responses
  • CPU throttling
  • Pod evictions

Solutions:

  1. Scale horizontally
  2. Optimize algorithms
  3. Add caching
  4. Review expensive operations

Deployment Issues

Pods Not Starting

Symptoms:

  • Pods in Pending/CrashLoopBackOff
  • Image pull errors
  • Init container failures

Diagnosis:

# Check pod status
kubectl describe pod -n api <pod-name>

# Check events
kubectl get events -n api --sort-by='.lastTimestamp'

# Check logs
kubectl logs -n api <pod-name>

Solutions:

  1. Check image availability
  2. Verify resource requests/limits
  3. Check node resources
  4. Review init container logs

Service Not Accessible

Symptoms:

  • Service unreachable
  • DNS resolution fails
  • Ingress errors

Diagnosis:

# Check service
kubectl get svc -n api

# Check ingress
kubectl describe ingress -n api api

# Test service directly (port-forward blocks, so run curl in a second terminal)
kubectl port-forward -n api svc/api 8080:80
curl http://localhost:8080/health

Solutions:

  1. Verify service selector matches pods
  2. Check ingress configuration
  3. Verify DNS records
  4. Check network policies

Configuration Issues

Symptoms:

  • Wrong environment variables
  • Missing secrets
  • ConfigMap errors

Solutions:

  1. Verify environment variables:

    kubectl exec -n api deployment/api -- env | grep -E "DB_|KEYCLOAK_"
    
  2. Check secrets:

    kubectl get secrets -n api
    
  3. Review ConfigMaps:

    kubectl get configmaps -n api
    

Getting Help

Logs

# API logs
kubectl logs -n api deployment/api --tail=100 -f

# Database logs
kubectl logs -n api deployment/postgres --tail=100

# Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100

# Crossplane logs
kubectl logs -n crossplane-system deployment/crossplane-provider-proxmox --tail=100

Metrics

# Prometheus queries
curl 'https://prometheus.sankofa.nexus/api/v1/query?query=up'

# Grafana dashboards
# Access: https://grafana.sankofa.nexus

Support

  • Documentation: See docs/ directory
  • Operations Runbook: docs/OPERATIONS_RUNBOOK.md
  • API Documentation: docs/API_DOCUMENTATION.md

Common Error Messages

"Database connection failed"

  • Check database pod status
  • Verify connection string
  • Check network policies

"Authentication required"

  • Verify token in request
  • Check token expiration
  • Verify Keycloak is accessible

"Quota exceeded"

  • Review tenant quotas
  • Check resource usage
  • Request quota increase

"Resource not found"

  • Verify resource ID
  • Check tenant context
  • Review access permissions

"Internal server error"

  • Check application logs
  • Review error details
  • Check system resources