Apply Composer changes: comprehensive API updates, migrations, middleware, and infrastructure improvements

- Add comprehensive database migrations (001-024) for schema evolution
- Enhance API schema with expanded type definitions and resolvers
- Add new middleware: audit logging, rate limiting, MFA enforcement, security, tenant auth
- Implement new services: AI optimization, billing, blockchain, compliance, marketplace
- Add adapter layer for cloud integrations (Cloudflare, Kubernetes, Proxmox, storage)
- Update Crossplane provider with enhanced VM management capabilities
- Add comprehensive test suite for API endpoints and services
- Update frontend components with improved GraphQL subscriptions and real-time updates
- Enhance security configurations and headers (CSP, CORS, etc.)
- Update documentation and configuration files
- Add new CI/CD workflows and validation scripts
- Implement design system improvements and UI enhancements
2025-12-12 18:01:35 -08:00
parent e01131efaf
commit 9daf1fd378
968 changed files with 160890 additions and 1092 deletions
--- a/docs/MONITORING_GUIDE.md
+++ b/docs/MONITORING_GUIDE.md
@@ -0,0 +1,337 @@
+# Monitoring and Observability Guide
+
+This guide covers monitoring setup, Grafana dashboards, and observability for Sankofa Phoenix.
+
+## Overview
+
+Sankofa Phoenix uses a comprehensive monitoring stack:
+- **Prometheus**: Metrics collection and storage
+- **Grafana**: Visualization and dashboards
+- **Loki**: Log aggregation
+- **Alertmanager**: Alert routing and notification
+
+## Tenant-Aware Metrics
+
+All Sankofa metrics carry a `tenant_id` label so that queries, dashboards, and alerts can be scoped to a single tenant.
+
+### Metric Naming Convention
+
+```
+sankofa_<component>_<metric>_<unit>{tenant_id="<id>",...}
+```
+
+Examples:
+- `sankofa_api_requests_total{tenant_id="tenant-1",method="POST",status="200"}`
+- `sankofa_billing_cost_usd{tenant_id="tenant-1",service="compute"}`
+- `sankofa_proxmox_vm_cpu_usage_percent{tenant_id="tenant-1",vm_id="101"}`
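+
+A naming convention is easier to keep when it is checked automatically. The sketch below is one way to lint metric names in CI; the function name `is_sankofa_metric` and the regular expression are assumptions for illustration, not part of the project.
+
+```shell
+#!/usr/bin/env bash
+# Hypothetical lint helper: checks a metric name against the
+# sankofa_<component>_<metric>_<unit> convention described above.
+# The regex is an assumption, not a rule enforced by the project.
+is_sankofa_metric() {
+  local name="$1"
+  # "sankofa_" followed by at least two lowercase snake_case segments
+  local re='^sankofa_[a-z][a-z0-9]*(_[a-z0-9]+)+$'
+  [[ "$name" =~ $re ]]
+}
+
+for metric in sankofa_api_requests_total sankofa_billing_cost_usd api_requests_total; do
+  if is_sankofa_metric "$metric"; then
+    echo "ok      $metric"
+  else
+    echo "invalid $metric"
+  fi
+done
+```
+
+The loop accepts the first two names and rejects `api_requests_total`, which lacks the `sankofa_` prefix.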
+
+## Grafana Dashboards
+
+### 1. System Overview Dashboard
+
+**Location**: `grafana/dashboards/system-overview.json`
+
+**Metrics**:
+- API request rate and latency
+- Database connection pool usage
+- Keycloak authentication rate
+- System resource usage (CPU, memory, disk)
+
+**Panels**:
+- Request rate (requests/sec)
+- P95 latency (ms)
+- Error rate (%)
+- Active connections
+- Authentication success rate
+
+### 2. Tenant Dashboard
+
+**Location**: `grafana/dashboards/tenant-overview.json`
+
+**Metrics**:
+- Tenant resource usage
+- Tenant cost tracking
+- Tenant API usage
+- Tenant user activity
+
+**Panels**:
+- Resource usage by tenant
+- Cost breakdown by tenant
+- API calls by tenant
+- Active users by tenant
+
+### 3. Billing Dashboard
+
+**Location**: `grafana/dashboards/billing.json`
+
+**Metrics**:
+- Real-time cost tracking
+- Cost by service/resource
+- Budget vs actual spend
+- Cost forecast
+- Billing anomalies
+
+**Panels**:
+- Current month cost
+- Cost trend (7d, 30d)
+- Top resources by cost
+- Budget utilization
+- Anomaly detection alerts
+
+### 4. Proxmox Infrastructure Dashboard
+
+**Location**: `grafana/dashboards/proxmox-infrastructure.json`
+
+**Metrics**:
+- VM status and health
+- Node resource usage
+- Storage utilization
+- Network throughput
+- VM creation/deletion rate
+
+**Panels**:
+- VM status overview
+- Node CPU/memory usage
+- Storage pool usage
+- Network I/O
+- VM lifecycle events
+
+### 5. Security Dashboard
+
+**Location**: `grafana/dashboards/security.json`
+
+**Metrics**:
+- Authentication events
+- Failed login attempts
+- Policy violations
+- Incident response metrics
+- Audit log events
+
+**Panels**:
+- Authentication success/failure rate
+- Policy violations by severity
+- Incident response time
+- Audit log volume
+- Security events timeline
+
+## Prometheus Configuration
+
+### Scrape Configs
+
+```yaml
+scrape_configs:
+  - job_name: 'sankofa-api'
+    kubernetes_sd_configs:
+      - role: pod
+        namespaces:
+          names:
+            - api
+    relabel_configs:
+      - source_labels: [__meta_kubernetes_pod_label_app]
+        action: keep
+        regex: api
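+    metric_relabel_configs:
+      # As written this copies tenant_id onto itself (a pass-through);
+      # change regex/replacement here to normalize tenant IDs
+      - source_labels: [tenant_id]
+        target_label: tenant_id
+        regex: '(.+)'
+        replacement: '${1}'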
+
+  - job_name: 'proxmox'
+    static_configs:
+      - targets:
+          - proxmox-exporter:9091
+    relabel_configs:
+      - source_labels: [__address__]
+        target_label: instance
+```
+
+### Recording Rules
+
+```yaml
+groups:
+  - name: sankofa_rules
+    interval: 30s
+    rules:
+      - record: sankofa:api:requests:rate5m
+        expr: rate(sankofa_api_requests_total[5m])
+      
+      - record: sankofa:billing:cost:rate1h
+        expr: rate(sankofa_billing_cost_usd[1h])
+      
+      - record: sankofa:proxmox:vm:count
+        expr: count(sankofa_proxmox_vm_info) by (tenant_id)
+```
+
+## Alerting Rules
+
+### Critical Alerts
+
+```yaml
+groups:
+  - name: sankofa_critical
+    interval: 30s
+    rules:
+      - alert: HighErrorRate
+        expr: rate(sankofa_api_requests_total{status=~"5.."}[5m]) > 0.1
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "High error rate detected"
+          description: "Error rate is {{ $value }} errors/sec"
+      
+      - alert: DatabaseConnectionPoolExhausted
+        expr: sankofa_db_connections_active / sankofa_db_connections_max > 0.9
+        for: 2m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Database connection pool nearly exhausted"
+      
+      - alert: BudgetExceeded
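+        # Cost is reported per service/resource and budget per tenant, so aggregate before dividing
+        expr: sum by (tenant_id) (sankofa_billing_cost_usd) / on (tenant_id) sankofa_billing_budget_usd > 1.0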
+        for: 1h
+        labels:
+          severity: warning
+        annotations:
+          summary: "Budget exceeded for tenant {{ $labels.tenant_id }}"
+      
+      - alert: ProxmoxNodeDown
+        expr: up{job="proxmox"} == 0
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Proxmox node {{ $labels.instance }} is down"
+```
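+
+Rule files like the one above can be validated before they reach Prometheus. A minimal sketch, assuming `promtool` is installed and the rules live in `monitoring/alerts/` with a `.yaml` or `.yml` extension; `check_alert_rules` is a hypothetical helper name.
+
+```shell
+#!/usr/bin/env bash
+# Validate every rule file in a directory with promtool before deploying.
+# The default directory comes from the References section; the file
+# extensions are an assumption about how the rule files are named.
+# The function body runs in a subshell so nullglob does not leak out.
+check_alert_rules() (
+  dir="${1:-monitoring/alerts}"
+  status=0
+  shopt -s nullglob
+  for file in "$dir"/*.yaml "$dir"/*.yml; do
+    promtool check rules "$file" || status=1
+  done
+  exit "$status"
+)
+
+# Usage:
+#   check_alert_rules monitoring/alerts
+```
+
+The function returns non-zero if any file fails validation, so it can gate a CI job.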
+
+### Billing Anomaly Detection
+
+```yaml
+groups:
+  - name: sankofa_billing_anomalies
+    interval: 1h
+    rules:
+      - alert: CostAnomalyDetected
+        expr: |
+          (
+            sankofa_billing_cost_usd
+            - predict_linear(sankofa_billing_cost_usd[7d], 3600)
+          ) / predict_linear(sankofa_billing_cost_usd[7d], 3600) > 0.5
+        for: 2h
+        labels:
+          severity: warning
+        annotations:
+          summary: "Unusual cost increase detected for tenant {{ $labels.tenant_id }}"
+```
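+
+`predict_linear(v[7d], 3600)` fits a least-squares line to the last seven days of samples and extrapolates it one hour ahead. The alert compares the current cost with that forecast and fires when the relative deviation stays above 0.5 for two hours. The arithmetic can be sketched as follows; the numbers are illustrative, not real billing data, and `cost_deviation` is a hypothetical helper.
+
+```shell
+#!/usr/bin/env bash
+# Relative deviation of the actual cost from the predicted cost,
+# mirroring (actual - predicted) / predicted in the rule above.
+cost_deviation() {
+  LC_ALL=C awk -v actual="$1" -v predicted="$2" \
+    'BEGIN { printf "%.2f\n", (actual - predicted) / predicted }'
+}
+
+cost_deviation 150 90   # prints 0.67, above the 0.5 threshold
+cost_deviation 100 90   # prints 0.11, below the threshold
+```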
+
+## Real-Time Cost Tracking
+
+### Metrics Exposed
+
+- `sankofa_billing_cost_usd{tenant_id, service, resource_id}` - Current cost
+- `sankofa_billing_cost_rate_usd_per_hour{tenant_id}` - Cost rate
+- `sankofa_billing_budget_usd{tenant_id}` - Budget limit
+- `sankofa_billing_budget_utilization_percent{tenant_id}` - Budget usage %
+
+### Grafana Query Example
+
+```promql
+# Current month cost by tenant
+sum(sankofa_billing_cost_usd) by (tenant_id)
+
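+# Projected 7-day spend at the last hour's rate (rate() is per second)
+rate(sankofa_billing_cost_usd[1h]) * 3600 * 24 * 7
+
+# Budget utilization (%) per tenant
+sum by (tenant_id) (sankofa_billing_cost_usd) / on (tenant_id) sankofa_billing_budget_usd * 100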
+```
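+
+The same expressions can be run outside Grafana through the Prometheus HTTP API, which is handy for scripts and quick checks. A minimal sketch, assuming the Prometheus URL listed under Access is reachable without additional authentication; `prom_query` and `PROM_URL` are hypothetical names.
+
+```shell
+#!/usr/bin/env bash
+# Assumed endpoint; override PROM_URL for a port-forward or another environment
+PROM_URL="${PROM_URL:-https://prometheus.sankofa.nexus}"
+
+# Run an instant PromQL query through the /api/v1/query endpoint
+prom_query() {
+  curl -sS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
+}
+
+# Usage (needs network access to Prometheus):
+#   prom_query 'sum(sankofa_billing_cost_usd) by (tenant_id)'
+```
+
+The response is JSON; the series are under `data.result`.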
+
+## Log Aggregation
+
+### Loki Configuration
+
+Logs are collected with tenant context:
+
+```yaml
+clients:
+  - url: http://loki:3100/loki/api/v1/push
+    tenant_id: ${TENANT_ID}
+```
+
+### Log Labels
+
+- `tenant_id`: Tenant identifier
+- `service`: Service name (api, portal, etc.)
+- `level`: Log level (info, warn, error)
+- `component`: Component name
+
+### Log Queries
+
+```logql
+# Errors for a specific tenant
+{tenant_id="tenant-1", level="error"}
+
+# API errors (set the query time range to the last hour)
+{service="api", level="error"} | json
+
+# Authentication failures
+{component="auth"} | json | status="failed"
+```
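+
+Time bounds for a LogQL query are supplied by the client (the Grafana time picker or the API's `start` and `end` parameters) rather than in the query text. The sketch below queries the last hour for one tenant through Loki's HTTP API. It assumes the Loki URL from the push configuration above and that multi-tenancy is enabled, in which case Loki reads the tenant from the `X-Scope-OrgID` header; `loki_last_hour` is a hypothetical helper.
+
+```shell
+#!/usr/bin/env bash
+# Assumed Loki base URL, taken from the push URL above
+LOKI_URL="${LOKI_URL:-http://loki:3100}"
+
+# Query one tenant's logs for the last hour.
+# start/end are Unix timestamps in nanoseconds.
+loki_last_hour() {
+  local tenant="$1" query="$2"
+  local end start
+  end="$(date +%s)"
+  start="$(( end - 3600 ))"
+  curl -sS -G "${LOKI_URL}/loki/api/v1/query_range" \
+    -H "X-Scope-OrgID: ${tenant}" \
+    --data-urlencode "query=${query}" \
+    --data-urlencode "start=${start}000000000" \
+    --data-urlencode "end=${end}000000000"
+}
+
+# Usage (needs network access to Loki):
+#   loki_last_hour tenant-1 '{service="api", level="error"}'
+```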
+
+## Deployment
+
+### Install Monitoring Stack
+
+```bash
+# Add the prometheus-community Helm repo
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+helm repo update
+
+# Install kube-prometheus-stack
+helm install monitoring prometheus-community/kube-prometheus-stack \
+  --namespace monitoring \
+  --create-namespace \
+  --values grafana/values.yaml
+
+# Dashboards are raw Grafana JSON, not Kubernetes manifests;
+# load them as ConfigMaps as described under "Import Dashboards"
+```
+
+### Import Dashboards
+
+```bash
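+# Import all dashboards as ConfigMaps
+for dashboard in grafana/dashboards/*.json; do
+  name="$(basename "$dashboard" .json)"
+  kubectl create configmap "$name" \
+    --from-file="$dashboard" \
+    --namespace=monitoring \
+    --dry-run=client -o yaml | kubectl apply -f -
+  # Assumes the chart's default sidecar settings, which load ConfigMaps labeled grafana_dashboard=1
+  kubectl label configmap "$name" grafana_dashboard=1 \
+    --namespace=monitoring --overwrite
+done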
+```
+
+## Access
+
+- **Grafana**: https://grafana.sankofa.nexus
+- **Prometheus**: https://prometheus.sankofa.nexus
+- **Alertmanager**: https://alertmanager.sankofa.nexus
+
+Default credentials (change immediately):
+- Username: `admin`
+- Password: (from secret `monitoring-grafana`)
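+
+The password can be read from the secret with `kubectl`. A minimal sketch, assuming the stack was installed into the `monitoring` namespace as shown under Deployment and that the secret uses the Grafana Helm chart's default `admin-password` key; `grafana_admin_password` is a hypothetical helper.
+
+```shell
+#!/usr/bin/env bash
+# Print the Grafana admin password stored in the monitoring-grafana secret.
+# The admin-password key is the Grafana Helm chart default; treat it as an
+# assumption if values.yaml overrides the admin credentials.
+grafana_admin_password() {
+  kubectl get secret monitoring-grafana \
+    --namespace monitoring \
+    -o jsonpath='{.data.admin-password}' | base64 --decode
+}
+
+# Usage (needs cluster access):
+#   grafana_admin_password
+```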
+
+## Best Practices
+
+1. **Tenant Isolation**: Filter tenant-facing queries and dashboards by `tenant_id`
+2. **Retention**: Set Prometheus and Loki retention periods explicitly rather than relying on defaults
+3. **Cardinality**: Avoid high-cardinality labels such as request or session IDs
+4. **Alerts**: Alert on the critical metrics above (error rate, connection pool, node availability)
+5. **Dashboards**: Create tenant-specific dashboards
+6. **Cost Tracking**: Review billing metrics and budget utilization regularly
+7. **Anomaly Detection**: Enable the billing anomaly alert rules
+
+## References
+
+- Dashboard definitions: `grafana/dashboards/`
+- Prometheus config: `monitoring/prometheus/`
+- Alert rules: `monitoring/alerts/`
+