Troubleshooting Guide
Common issues and solutions for Sankofa Phoenix.
Table of Contents
- API Issues
- Database Issues
- Authentication Issues
- Resource Provisioning
- Billing Issues
- Performance Issues
- Deployment Issues
- Getting Help
- Common Error Messages
API Issues
API Not Responding
Symptoms:
- 503 Service Unavailable
- Connection timeout
- Health check fails
Diagnosis:
# Check pod status
kubectl get pods -n api
# Check logs
kubectl logs -n api deployment/api --tail=100
# Check service
kubectl get svc -n api api
Solutions:
- Restart API deployment:
  kubectl rollout restart deployment/api -n api
- Check resource limits:
  kubectl describe pod -n api -l app=api
- Verify database connection:
  # Run psql through a shell inside the pod so that $DATABASE_URL is expanded there, not by your local shell
  kubectl exec -it -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
GraphQL Query Errors
Symptoms:
- GraphQL errors in response
- "Internal server error"
- Query timeouts
Diagnosis:
# Check API logs for errors
kubectl logs -n api deployment/api | grep -i error
# Test GraphQL endpoint
curl -X POST https://api.sankofa.nexus/graphql \
-H "Content-Type: application/json" \
-d '{"query": "{ health { status } }"}'
Solutions:
- Check query syntax
- Verify authentication token
- Check database query performance
- Review resolver logs
Rate Limiting
Symptoms:
- 429 Too Many Requests
- Rate limit headers present
Solutions:
- Implement request batching
- Use subscriptions for real-time updates
- Request a rate limit increase from an administrator
- Implement client-side caching
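Until the limit is raised or the request volume drops, a client can back off and retry instead of failing on the first 429. The following is a minimal sketch, assuming curl is available on the client. backoff_delay and graphql_with_retry are hypothetical helper names, and the 1-second base delay, 60-second cap, and 5 attempts are illustrative values, not documented limits of this API. No Authorization header is sent; add one if the query needs it.

```shell
# Hypothetical helper: seconds to wait before retry number $1 (1-based).
# The delay doubles on each attempt and is capped at $2 seconds (default 60).
backoff_delay() {
  attempt=$1
  cap=${2:-60}
  delay=1
  i=1
  while [ "$i" -lt "$attempt" ]; do
    delay=$((delay * 2))
    if [ "$delay" -ge "$cap" ]; then
      delay=$cap
      break
    fi
    i=$((i + 1))
  done
  echo "$delay"
}

# Hypothetical helper: POST a GraphQL body and retry only while the API answers 429.
# $1 = endpoint URL, $2 = JSON body, $3 = maximum attempts (default 5).
graphql_with_retry() {
  url=$1
  body=$2
  max=${3:-5}
  out=$(mktemp)
  n=1
  while [ "$n" -le "$max" ]; do
    status=$(curl -s -o "$out" -w '%{http_code}' -X POST "$url" \
      -H "Content-Type: application/json" -d "$body")
    if [ "$status" != "429" ]; then
      # Any other status (success or a different error) is returned to the caller.
      cat "$out"
      rm -f "$out"
      return 0
    fi
    sleep "$(backoff_delay "$n")"
    n=$((n + 1))
  done
  rm -f "$out"
  echo "still rate limited after $max attempts" >&2
  return 1
}

# Usage:
#   graphql_with_retry https://api.sankofa.nexus/graphql '{"query": "{ health { status } }"}'
```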
Database Issues
Connection Pool Exhausted
Symptoms:
- "Too many connections" errors
- Slow query responses
- Database connection timeouts
Diagnosis:
# Check active connections
kubectl exec -it -n api deployment/postgres -- \
psql -U sankofa -c "SELECT count(*) FROM pg_stat_activity"
# Check connection pool metrics
curl https://api.sankofa.nexus/metrics | grep db_connections
Solutions:
- Increase connection pool size:
  env:
    - name: DB_POOL_SIZE
      value: "30"
- Close idle connections:
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle'
    AND state_change < NOW() - INTERVAL '5 minutes';
- Restart API to reset connections
Slow Queries
Symptoms:
- High query latency
- Timeout errors
- Database CPU high
Diagnosis:
-- Find slow queries
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Check table sizes
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
Solutions:
- Add database indexes:
  CREATE INDEX idx_resources_tenant_id ON resources(tenant_id);
  CREATE INDEX idx_resources_status ON resources(status);
- Analyze tables:
  ANALYZE resources;
- Optimize queries
- Consider read replicas for heavy read workloads
Database Lock Issues
Symptoms:
- Queries hanging
- "Lock timeout" errors
- Deadlock errors
Solutions:
- Check for long-running transactions:
  -- Includes sessions that are 'idle in transaction', which hold locks without running a query
  SELECT pid, state, query, now() - xact_start AS duration
  FROM pg_stat_activity
  WHERE xact_start IS NOT NULL
  ORDER BY duration DESC;
- Terminate blocking queries (if safe)
- Review transaction isolation levels
- Break up large transactions
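Before terminating anything, it helps to see which session is actually blocking which. The following is a sketch, assuming the postgres deployment in the api namespace and the sankofa user shown in the Database Issues commands above. show_blockers and cancel_backend are hypothetical helper names. pg_blocking_pids() requires PostgreSQL 9.6 or later.

```shell
# Hypothetical helper: list blocked sessions next to the sessions blocking them.
show_blockers() {
  sql='SELECT blocked.pid AS blocked_pid,
       blocking.pid AS blocking_pid,
       blocking.state AS blocking_state,
       now() - blocking.xact_start AS blocking_duration,
       blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));'
  kubectl exec -n api deployment/postgres -- psql -U sankofa -c "$sql"
}

# Hypothetical helper: cancel the current query of backend $1.
# Try this first; escalate to pg_terminate_backend only if the cancel has no effect.
cancel_backend() {
  case $1 in
    ''|*[!0-9]*)
      echo "usage: cancel_backend <pid>" >&2
      return 2
      ;;
  esac
  kubectl exec -n api deployment/postgres -- \
    psql -U sankofa -c "SELECT pg_cancel_backend($1);"
}

# Usage:
#   show_blockers
#   cancel_backend 12345   # a blocking_pid taken from the output above
```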
Authentication Issues
Token Expired
Symptoms:
- 401 Unauthorized
- "Token expired" error
- Keycloak errors
Solutions:
- Refresh token via Keycloak
- Re-authenticate
- Check token expiration settings in Keycloak
Invalid Token
Symptoms:
- 401 Unauthorized
- "Invalid token" error
Diagnosis:
# Verify Keycloak is accessible
curl https://keycloak.sankofa.nexus/health
# Check Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100
Solutions:
- Verify token format
- Check Keycloak client configuration
- Verify token signature
- Check clock synchronization
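To check the token format and rule out clock skew, decode the token payload and compare its exp claim with the local clock. The following is a minimal sketch, assuming the token is a standard three-segment JWT and that base64 accepts -d (GNU coreutils, recent macOS). jwt_payload and jwt_exp are hypothetical helper names. This only inspects the claims; it does not verify the signature.

```shell
# Hypothetical helper: print the decoded payload (second segment) of a JWT.
jwt_payload() {
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits '=' padding; restore it before decoding.
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Hypothetical helper: print the numeric exp claim (seconds since the epoch).
jwt_exp() {
  jwt_payload "$1" |
    sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Usage: a token whose exp is earlier than "now" has expired; an exp only a few
# seconds away from "now" on a freshly issued token suggests clock skew.
#   jwt_payload "$TOKEN"
#   echo "expires: $(jwt_exp "$TOKEN")  now: $(date +%s)"
```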
Permission Denied
Symptoms:
- 403 Forbidden
- "Access denied" error
Solutions:
- Verify user role in Keycloak
- Check tenant context
- Review RBAC policies
- Verify resource ownership
Resource Provisioning
VM Creation Fails
Symptoms:
- Resource stuck in PENDING
- Proxmox errors
- Crossplane errors
Diagnosis:
# Check Crossplane provider
kubectl get pods -n crossplane-system | grep proxmox
# Check ProxmoxVM resource
kubectl describe proxmoxvm -n default test-vm
# Check Proxmox connectivity
kubectl exec -it -n crossplane-system deployment/crossplane-provider-proxmox -- \
curl https://proxmox-endpoint:8006/api2/json/version
Solutions:
- Verify Proxmox credentials
- Check Proxmox node availability
- Verify resource quotas
- Check Crossplane provider logs
Resource Update Fails
Symptoms:
- Update mutation fails
- Resource not updating
- Status mismatch
Solutions:
- Check resource state
- Verify update permissions
- Review resource constraints
- Check for conflicting updates
Billing Issues
Incorrect Costs
Symptoms:
- Unexpected charges
- Missing usage records
- Cost discrepancies
Diagnosis:
-- Check usage records
SELECT * FROM usage_records
WHERE tenant_id = 'tenant-id'
ORDER BY timestamp DESC
LIMIT 100;
-- Check billing calculations
SELECT * FROM invoices
WHERE tenant_id = 'tenant-id'
ORDER BY created_at DESC;
Solutions:
- Review usage records
- Verify pricing configuration
- Check for duplicate records
- Recalculate costs if needed
Budget Alerts Not Triggering
Symptoms:
- Budget exceeded but no alert
- Alerts not sent
Diagnosis:
-- Check budget status
SELECT * FROM budgets
WHERE tenant_id = 'tenant-id';
-- Check alert configuration
SELECT * FROM billing_alerts
WHERE tenant_id = 'tenant-id' AND enabled = true;
Solutions:
- Verify alert configuration
- Check alert evaluation schedule
- Review notification channels
- Test alert manually
Invoice Generation Fails
Symptoms:
- Invoice creation error
- Missing line items
- PDF generation fails
Solutions:
- Check usage records exist
- Verify billing period
- Check PDF service
- Review invoice template
Performance Issues
High Latency
Symptoms:
- Slow API responses
- Timeout errors
- High P95 latency
Diagnosis:
# Check API metrics
curl https://api.sankofa.nexus/metrics | grep request_duration
# Check database performance
kubectl exec -it -n api deployment/postgres -- \
psql -U sankofa -c "SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10"
Solutions:
- Add caching layer
- Optimize database queries
- Scale API horizontally
- Review N+1 query problems
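To confirm a latency change before and after one of these fixes, sample a single endpoint and compute the P95 yourself. The following is a sketch, assuming curl on the client and the port-forwarded health endpoint used under Service Not Accessible. sample_latency and p95 are hypothetical helper names, and the percentile uses the nearest-rank method.

```shell
# Hypothetical helper: print the total time in seconds of $2 sequential GETs to $1.
sample_latency() {
  url=$1
  count=${2:-20}
  i=0
  while [ "$i" -lt "$count" ]; do
    curl -s -o /dev/null -w '%{time_total}\n' "$url"
    i=$((i + 1))
  done
}

# Hypothetical helper: nearest-rank 95th percentile of the numbers on stdin.
p95() {
  LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR)
      print v[rank]
    }'
}

# Usage:
#   sample_latency http://localhost:8080/health 50 | p95
```

Sequential requests from one client measure per-request latency only; they do not reproduce load, so compare the result with the request_duration metrics rather than replacing them.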
High Memory Usage
Symptoms:
- OOM kills
- Pod restarts
- Memory warnings
Solutions:
- Increase memory limits
- Review memory leaks
- Optimize data structures
- Implement pagination
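Before raising limits, confirm that the restarts really are OOM kills and see what the current limits are. The following is a sketch, assuming the api namespace used throughout this guide. oom_killed_pods is a hypothetical helper name.

```shell
# Hypothetical helper: names of pods in namespace $1 (default: api) with a
# container whose previous termination reason was OOMKilled.
oom_killed_pods() {
  ns=${1:-api}
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' |
    awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}

# Usage: print the configured limits of each affected pod.
#   for pod in $(oom_killed_pods api); do
#     kubectl get pod "$pod" -n api -o jsonpath='{.spec.containers[*].resources.limits}{"\n"}'
#   done
```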
High CPU Usage
Symptoms:
- Slow responses
- CPU throttling
- Pod evictions
Solutions:
- Scale horizontally
- Optimize algorithms
- Add caching
- Review expensive operations
Deployment Issues
Pods Not Starting
Symptoms:
- Pods in Pending/CrashLoopBackOff
- Image pull errors
- Init container failures
Diagnosis:
# Check pod status
kubectl describe pod -n api <pod-name>
# Check events
kubectl get events -n api --sort-by='.lastTimestamp'
# Check logs
kubectl logs -n api <pod-name>
Solutions:
- Check image availability
- Verify resource requests/limits
- Check node resources
- Review init container logs
Service Not Accessible
Symptoms:
- Service unreachable
- DNS resolution fails
- Ingress errors
Diagnosis:
# Check service
kubectl get svc -n api
# Check ingress
kubectl describe ingress -n api api
# Test service directly (port-forward keeps running; run curl from a second terminal)
kubectl port-forward -n api svc/api 8080:80
curl http://localhost:8080/health
Solutions:
- Verify service selector matches pods
- Check ingress configuration
- Verify DNS records
- Check network policies
Configuration Issues
Symptoms:
- Wrong environment variables
- Missing secrets
- ConfigMap errors
Solutions:
- Verify environment variables:
  kubectl exec -n api deployment/api -- env | grep -E "DB_|KEYCLOAK_"
- Check secrets:
  kubectl get secrets -n api
- Review ConfigMaps:
  kubectl get configmaps -n api
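Listing secrets shows that they exist, not that they contain the keys the API expects. The following sketch compares the keys of one secret against a required list without printing any values. secret_keys and missing_keys are hypothetical helper names, api-secrets is a placeholder secret name, and the required keys in the usage line are examples taken from variables mentioned elsewhere in this guide.

```shell
# Hypothetical helper: print the key names stored in secret $1
# (namespace $2, default: api) without printing the values.
secret_keys() {
  kubectl get secret "$1" -n "${2:-api}" \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}'
}

# Hypothetical helper: print each required key ($@) missing from the list on stdin.
missing_keys() {
  present=$(cat)
  for key in "$@"; do
    printf '%s\n' "$present" | grep -qxF -- "$key" || echo "$key"
  done
}

# Usage (api-secrets is a placeholder secret name):
#   secret_keys api-secrets | missing_keys DATABASE_URL DB_POOL_SIZE
```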
Getting Help
Logs
# API logs
kubectl logs -n api deployment/api --tail=100 -f
# Database logs
kubectl logs -n api deployment/postgres --tail=100
# Keycloak logs
kubectl logs -n keycloak deployment/keycloak --tail=100
# Crossplane logs
kubectl logs -n crossplane-system deployment/crossplane-provider-proxmox --tail=100
Metrics
# Prometheus queries
curl 'https://prometheus.sankofa.nexus/api/v1/query?query=up'
# Grafana dashboards
# Access: https://grafana.sankofa.nexus
Support
- Documentation: See docs/ directory
- Operations Runbook: docs/OPERATIONS_RUNBOOK.md
- API Documentation: docs/API_DOCUMENTATION.md
Common Error Messages
"Database connection failed"
- Check database pod status
- Verify connection string
- Check network policies
"Authentication required"
- Verify token in request
- Check token expiration
- Verify Keycloak is accessible
"Quota exceeded"
- Review tenant quotas
- Check resource usage
- Request quota increase
"Resource not found"
- Verify resource ID
- Check tenant context
- Review access permissions
"Internal server error"
- Check application logs
- Review error details
- Check system resources