Sankofa/docs/MONITORING_GUIDE.md

Monitoring and Observability Guide

This guide covers monitoring setup, Grafana dashboards, and observability for Sankofa Phoenix.

Overview

The Sankofa Phoenix monitoring stack has four components:

  • Prometheus: Metrics collection and storage
  • Grafana: Visualization and dashboards
  • Loki: Log aggregation
  • Alertmanager: Alert routing and notification

Tenant-Aware Metrics

Every metric carries a tenant_id label so that queries, dashboards, and alerts can be scoped to a single tenant.

Metric Naming Convention

sankofa_<component>_<metric>_<unit>{tenant_id="<id>",...}

Examples:

  • sankofa_api_requests_total{tenant_id="tenant-1",method="POST",status="200"}
  • sankofa_billing_cost_usd{tenant_id="tenant-1",service="compute"}
  • sankofa_proxmox_vm_cpu_usage_percent{tenant_id="tenant-1",vm_id="101"}
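
The convention can be checked mechanically, for example in a unit test or a CI step. The following is a minimal Python sketch and not part of the Sankofa codebase: the `check_metric` helper and the regular expression are illustrative assumptions (lowercase snake_case tokens, with at least a component, a metric, and a unit after the `sankofa_` prefix).

```python
import re

# Assumed reading of sankofa_<component>_<metric>_<unit>: the prefix followed
# by at least three lowercase tokens, each starting with a letter.
METRIC_NAME = re.compile(r"^sankofa(_[a-z][a-z0-9]*){3,}$")


def check_metric(name, labels):
    """Return a list of problems; an empty list means the series conforms."""
    problems = []
    if not METRIC_NAME.match(name):
        problems.append(
            f"name {name!r} does not match sankofa_<component>_<metric>_<unit>"
        )
    # Every series is expected to carry a non-empty tenant_id label.
    if not labels.get("tenant_id"):
        problems.append("missing tenant_id label")
    return problems


if __name__ == "__main__":
    print(check_metric("sankofa_api_requests_total", {"tenant_id": "tenant-1"}))
    print(check_metric("api_requests_total", {"method": "POST"}))
```

The check is purely syntactic: it cannot tell whether the last token is a meaningful unit, only that the name has the expected shape.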

Grafana Dashboards

1. System Overview Dashboard

Location: grafana/dashboards/system-overview.json

Metrics:

  • API request rate and latency
  • Database connection pool usage
  • Keycloak authentication rate
  • System resource usage (CPU, memory, disk)

Panels:

  • Request rate (requests/sec)
  • P95 latency (ms)
  • Error rate (%)
  • Active connections
  • Authentication success rate

2. Tenant Dashboard

Location: grafana/dashboards/tenant-overview.json

Metrics:

  • Tenant resource usage
  • Tenant cost tracking
  • Tenant API usage
  • Tenant user activity

Panels:

  • Resource usage by tenant
  • Cost breakdown by tenant
  • API calls by tenant
  • Active users by tenant

3. Billing Dashboard

Location: grafana/dashboards/billing.json

Metrics:

  • Real-time cost tracking
  • Cost by service/resource
  • Budget vs actual spend
  • Cost forecast
  • Billing anomalies

Panels:

  • Current month cost
  • Cost trend (7d, 30d)
  • Top resources by cost
  • Budget utilization
  • Anomaly detection alerts

4. Proxmox Infrastructure Dashboard

Location: grafana/dashboards/proxmox-infrastructure.json

Metrics:

  • VM status and health
  • Node resource usage
  • Storage utilization
  • Network throughput
  • VM creation/deletion rate

Panels:

  • VM status overview
  • Node CPU/memory usage
  • Storage pool usage
  • Network I/O
  • VM lifecycle events

5. Security Dashboard

Location: grafana/dashboards/security.json

Metrics:

  • Authentication events
  • Failed login attempts
  • Policy violations
  • Incident response metrics
  • Audit log events

Panels:

  • Authentication success/failure rate
  • Policy violations by severity
  • Incident response time
  • Audit log volume
  • Security events timeline

Prometheus Configuration

Scrape Configs

scrape_configs:
  - job_name: 'sankofa-api'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - api
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: api
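    metric_relabel_configs:
      # Identity mapping: tenant_id is already set by the application, so this
      # rule leaves the label unchanged and can be dropped.
      - source_labels: [tenant_id]
        target_label: tenant_id
        regex: '(.+)'
        replacement: '${1}'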

  - job_name: 'proxmox'
    static_configs:
      - targets:
          - proxmox-exporter:9091
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

Recording Rules

groups:
  - name: sankofa_rules
    interval: 30s
    rules:
      - record: sankofa:api:requests:rate5m
        expr: rate(sankofa_api_requests_total[5m])
      
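      # sankofa_billing_cost_usd reports the current cost (a gauge), so use
      # deriv() instead of rate(), which is only meaningful for counters.
      - record: sankofa:billing:cost:deriv1h
        expr: deriv(sankofa_billing_cost_usd[1h])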
      
      - record: sankofa:proxmox:vm:count
        expr: count(sankofa_proxmox_vm_info) by (tenant_id)

Alerting Rules

Critical Alerts

groups:
  - name: sankofa_critical
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(sankofa_api_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"
      
      - alert: DatabaseConnectionPoolExhausted
        expr: sankofa_db_connections_active / sankofa_db_connections_max > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"
      
      - alert: BudgetExceeded
        expr: sum by (tenant_id) (sankofa_billing_cost_usd) / max by (tenant_id) (sankofa_billing_budget_usd) > 1.0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Budget exceeded for tenant {{ $labels.tenant_id }}"
      
      - alert: ProxmoxNodeDown
        expr: up{job="proxmox"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Proxmox node {{ $labels.instance }} is down"

Billing Anomaly Detection

  - name: sankofa_billing_anomalies
    interval: 1h
    rules:
      - alert: CostAnomalyDetected
        expr: |
          (
            sankofa_billing_cost_usd
            - predict_linear(sankofa_billing_cost_usd[7d], 3600)
          ) / predict_linear(sankofa_billing_cost_usd[7d], 3600) > 0.5
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: "Unusual cost increase detected for tenant {{ $labels.tenant_id }}"
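
To make the arithmetic behind CostAnomalyDetected concrete, here is a small Python sketch. It mirrors the idea of `predict_linear` (a least-squares line through the samples, extrapolated forward) rather than the exact implementation in Prometheus. The function names and sample values are made up for illustration.

```python
def predict_linear(samples, eval_time, horizon_s):
    """Fit a least-squares line through (timestamp, value) pairs and
    return its value at eval_time + horizon_s."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    return mean_v + slope * (eval_time + horizon_s - mean_t)


def is_cost_anomaly(current, predicted, threshold=0.5):
    """Relative deviation of the current cost from the prediction,
    compared against the 0.5 threshold used in the alert expression."""
    if predicted <= 0:
        # Sketch-only guard; PromQL would yield Inf or NaN here instead.
        return False
    return (current - predicted) / predicted > threshold


if __name__ == "__main__":
    # Hypothetical hourly cost samples: (unix seconds, USD).
    history = [(0, 100.0), (3600, 110.0), (7200, 120.0)]
    predicted = predict_linear(history, eval_time=7200, horizon_s=3600)
    print(predicted, is_cost_anomaly(200.0, predicted))
```

With three hourly samples of 100, 110, and 120, the line extrapolates to about 130 one hour later. A current cost of 200 deviates by (200 - 130) / 130 ≈ 0.54, which is above the 0.5 threshold, while a current cost of 150 (≈ 0.15) is not.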

Real-Time Cost Tracking

Metrics Exposed

  • sankofa_billing_cost_usd{tenant_id, service, resource_id} - Current cost
  • sankofa_billing_cost_rate_usd_per_hour{tenant_id} - Cost rate
  • sankofa_billing_budget_usd{tenant_id} - Budget limit
  • sankofa_billing_budget_utilization_percent{tenant_id} - Budget usage %

Grafana Query Example

# Current month cost by tenant
sum(sankofa_billing_cost_usd) by (tenant_id)

# Cost change over the last 7 days by tenant (the cost metric is a gauge, so use delta rather than rate)
sum by (tenant_id) (delta(sankofa_billing_cost_usd[7d]))

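# Budget utilization (%) by tenant; aggregate first so both sides share only the tenant_id label
sum by (tenant_id) (sankofa_billing_cost_usd) / max by (tenant_id) (sankofa_billing_budget_usd) * 100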
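
The same queries can be run outside Grafana through the Prometheus HTTP API (`/api/v1/query`). The sketch below uses only the Python standard library. The base URL `http://prometheus:9090` and the helper names are assumptions, and the query reads the `sankofa_billing_budget_utilization_percent` metric listed above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_utilization(response_body):
    """Map tenant_id -> value from an instant-query (vector) response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        series["metric"].get("tenant_id", ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


def fetch_utilization(base_url="http://prometheus:9090"):  # hypothetical address
    url = build_query_url(base_url, "sankofa_billing_budget_utilization_percent")
    with urlopen(url, timeout=10) as response:
        return parse_utilization(response.read().decode("utf-8"))
```

Calling `fetch_utilization()` needs network access to a running Prometheus. The two helper functions can be exercised without one, which keeps them easy to test.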

Log Aggregation

Loki Configuration

Logs are collected with tenant context:

clients:
  - url: http://loki:3100/loki/api/v1/push
    tenant_id: ${TENANT_ID}

Log Labels

  • tenant_id: Tenant identifier
  • service: Service name (api, portal, etc.)
  • level: Log level (info, warn, error)
  • component: Component name
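
For services written in Python, one way to emit logs carrying these fields is a JSON formatter on the standard `logging` module. This is a sketch under assumptions: `TenantJsonFormatter` and `make_logger` are hypothetical names, and which JSON fields end up as Loki labels depends on how the log shipper is configured.

```python
import json
import logging

# The guide lists info, warn, error; Python names its level "WARNING".
LEVEL_NAMES = {"WARNING": "warn"}


class TenantJsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the label fields."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        level = LEVEL_NAMES.get(record.levelname, record.levelname.lower())
        return json.dumps({
            # tenant_id and component arrive via logging's `extra` argument.
            "tenant_id": getattr(record, "tenant_id", "unknown"),
            "service": self.service,
            "level": level,
            "component": getattr(record, "component", record.name),
            "message": record.getMessage(),
        })


def make_logger(service):
    handler = logging.StreamHandler()
    handler.setFormatter(TenantJsonFormatter(service))
    logger = logging.getLogger(service)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    log = make_logger("api")
    log.error("login failed", extra={"tenant_id": "tenant-1", "component": "auth"})
```

Keeping free-form data such as the message in the JSON body, not in labels, keeps the label set small.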

Log Queries

# Errors for a specific tenant
{tenant_id="tenant-1", level="error"}

# API errors (LogQL has no now(); set the query time range to the last hour instead)
{service="api", level="error"}

# Authentication failures
{component="auth"} | json | status="failed"
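
In LogQL the time window is a parameter of the request, not part of the query text. The sketch below builds a request URL for Loki's `query_range` endpoint covering the last hour. The base URL `http://loki:3100` comes from the push URL in the configuration above, and the helper name and defaults are assumptions.

```python
import time
from urllib.parse import urlencode


def build_loki_range_url(base_url, logql, window_s=3600, now_s=None, limit=100):
    """URL for /loki/api/v1/query_range over the last `window_s` seconds."""
    end_s = int(time.time()) if now_s is None else int(now_s)
    params = {
        "query": logql,
        # Loki accepts start and end as Unix epoch timestamps in nanoseconds.
        "start": (end_s - window_s) * 1_000_000_000,
        "end": end_s * 1_000_000_000,
        "limit": limit,
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    print(build_loki_range_url("http://loki:3100", '{service="api", level="error"}'))
```

When Loki runs in multi-tenant mode, the request also needs an `X-Scope-OrgID` header naming the tenant whose logs are being read.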

Deployment

Install Monitoring Stack

# Add the prometheus-community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values grafana/values.yaml

# The dashboard files are raw Grafana JSON, not Kubernetes manifests, so they are
# loaded as ConfigMaps (see Import Dashboards) rather than with kubectl apply.

Import Dashboards

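# Import all dashboards as ConfigMaps. The grafana_dashboard=1 label is what the
# Grafana sidecar in kube-prometheus-stack watches for by default; adjust it if
# grafana/values.yaml overrides the sidecar label.
for dashboard in grafana/dashboards/*.json; do
  kubectl create configmap "$(basename "$dashboard" .json)" \
    --from-file="$dashboard" \
    --namespace=monitoring \
    --dry-run=client -o yaml \
    | kubectl label --local -f - grafana_dashboard=1 -o yaml \
    | kubectl apply -f -
done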

Access

Default credentials (change immediately):

  • Username: admin
  • Password: read from the admin-password key of the monitoring-grafana secret in the monitoring namespace

Best Practices

  1. Tenant Isolation: Filter every tenant-facing query and dashboard by tenant_id
  2. Retention: Set explicit retention periods for Prometheus metrics and Loki logs instead of relying on defaults
  3. Cardinality: Avoid high-cardinality labels; each distinct label value creates a new time series
  4. Alerts: Set up alerting for critical metrics
  5. Dashboards: Create tenant-specific dashboards
  6. Cost Tracking: Monitor billing metrics closely
  7. Anomaly Detection: Enable anomaly detection for billing
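
The cardinality advice can be made concrete with a rough estimate: the number of time series for one metric is bounded by the product of the distinct values of each label. The sketch below is illustrative only; the label counts are invented, not measured.

```python
from math import prod


def max_series(label_value_counts):
    """Upper bound on time series for one metric: the product of the
    number of distinct values per label (1 for a metric with no labels)."""
    return prod(label_value_counts.values())


if __name__ == "__main__":
    # Hypothetical counts for sankofa_api_requests_total.
    print(max_series({"tenant_id": 50, "method": 5, "status": 10}))
    # Adding a per-resource label multiplies the bound again.
    print(max_series({"tenant_id": 50, "service": 8, "resource_id": 2000}))
```

The bound is pessimistic because not every label combination occurs in practice, but it shows why one unbounded label (such as a resource or request identifier) dominates the total.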

References

  • Dashboard definitions: grafana/dashboards/
  • Prometheus config: monitoring/prometheus/
  • Alert rules: monitoring/alerts/