
# Monitoring and Observability Guide

This guide covers monitoring setup, Grafana dashboards, and observability for Sankofa Phoenix.

## Overview

Sankofa Phoenix uses the following monitoring stack:
- **Prometheus**: Metrics collection and storage
- **Grafana**: Visualization and dashboards
- **Loki**: Log aggregation
- **Alertmanager**: Alert routing and notification

## Tenant-Aware Metrics

Every Sankofa metric carries a `tenant_id` label, so queries, dashboards, and alerts can be scoped to a single tenant.

### Metric Naming Convention

```
sankofa_<component>_<metric>_<unit>{tenant_id="<id>",...}
```

Examples:
- `sankofa_api_requests_total{tenant_id="tenant-1",method="POST",status="200"}`
- `sankofa_billing_cost_usd{tenant_id="tenant-1",service="compute"}`
- `sankofa_proxmox_vm_cpu_usage_percent{tenant_id="tenant-1",vm_id="101"}`
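
As an illustration of the convention, here is a minimal Python sketch that builds a metric name from its parts and renders one sample in the Prometheus text exposition format. The helpers `metric_name` and `format_sample` are hypothetical and not part of the Sankofa codebase, and the sketch assumes every name part is lowercase snake case.

```python
import re

# Assumption: each name part is lowercase snake case (letters, digits, underscores).
_PART = re.compile(r"^[a-z][a-z0-9_]*$")


def metric_name(component: str, metric: str, unit: str) -> str:
    """Build sankofa_<component>_<metric>_<unit>, rejecting malformed parts."""
    for part in (component, metric, unit):
        if not _PART.match(part):
            raise ValueError(f"invalid metric name part: {part!r}")
    return f"sankofa_{component}_{metric}_{unit}"


def _escape(value: str) -> str:
    """Escape a label value as the text exposition format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample line; a tenant_id label is mandatory."""
    if "tenant_id" not in labels:
        raise ValueError("every sample must carry a tenant_id label")
    rendered = ",".join(f'{k}="{_escape(v)}"' for k, v in sorted(labels.items()))
    return f"{name}{{{rendered}}} {value}"
```

Rejecting samples without `tenant_id` at the point of emission is one way to keep the tenant-isolation guarantee from depending on every caller remembering the label.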

## Grafana Dashboards

### 1. System Overview Dashboard

**Location**: `grafana/dashboards/system-overview.json`

**Metrics**:
- API request rate and latency
- Database connection pool usage
- Keycloak authentication rate
- System resource usage (CPU, memory, disk)

**Panels**:
- Request rate (requests/sec)
- P95 latency (ms)
- Error rate (%)
- Active connections
- Authentication success rate

### 2. Tenant Dashboard

**Location**: `grafana/dashboards/tenant-overview.json`

**Metrics**:
- Tenant resource usage
- Tenant cost tracking
- Tenant API usage
- Tenant user activity

**Panels**:
- Resource usage by tenant
- Cost breakdown by tenant
- API calls by tenant
- Active users by tenant

### 3. Billing Dashboard

**Location**: `grafana/dashboards/billing.json`

**Metrics**:
- Real-time cost tracking
- Cost by service/resource
- Budget vs actual spend
- Cost forecast
- Billing anomalies

**Panels**:
- Current month cost
- Cost trend (7d, 30d)
- Top resources by cost
- Budget utilization
- Anomaly detection alerts

### 4. Proxmox Infrastructure Dashboard

**Location**: `grafana/dashboards/proxmox-infrastructure.json`

**Metrics**:
- VM status and health
- Node resource usage
- Storage utilization
- Network throughput
- VM creation/deletion rate

**Panels**:
- VM status overview
- Node CPU/memory usage
- Storage pool usage
- Network I/O
- VM lifecycle events

### 5. Security Dashboard

**Location**: `grafana/dashboards/security.json`

**Metrics**:
- Authentication events
- Failed login attempts
- Policy violations
- Incident response metrics
- Audit log events

**Panels**:
- Authentication success/failure rate
- Policy violations by severity
- Incident response time
- Audit log volume
- Security events timeline

## Prometheus Configuration

### Scrape Configs

```yaml
scrape_configs:
  - job_name: 'sankofa-api'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - api
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: api
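    metric_relabel_configs:
      # As written this copies tenant_id onto itself and changes nothing;
      # point source_labels at the exporter's label if it uses a different name.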
      - source_labels: [tenant_id]
        target_label: tenant_id
        regex: '(.+)'
        replacement: '${1}'

  - job_name: 'proxmox'
    static_configs:
      - targets:
          - proxmox-exporter:9091
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
```

### Recording Rules

```yaml
groups:
  - name: sankofa_rules
    interval: 30s
    rules:
      - record: sankofa:api:requests:rate5m
        expr: rate(sankofa_api_requests_total[5m])

      - record: sankofa:billing:cost:rate1h
        expr: rate(sankofa_billing_cost_usd[1h])

      - record: sankofa:proxmox:vm:count
        expr: count(sankofa_proxmox_vm_info) by (tenant_id)
```
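
The `rate5m` rule stores a per-second rate of a counter. The following Python sketch shows roughly what that calculation does with raw samples, including the handling of counter resets. It is a simplification: Prometheus additionally extrapolates to the window boundaries, which this sketch omits, and the function name `simple_rate` is hypothetical.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter over (timestamp_seconds, value) samples.

    Simplified: handles counter resets, but omits the window-boundary
    extrapolation that Prometheus applies in rate().
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero: count the new value in full.
        increase += (cur - prev) if cur >= prev else cur
    return increase / elapsed
```

Because a drop in value is treated as a reset, this kind of calculation is only meaningful for counters; applying it to a gauge that legitimately decreases would overstate the increase.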

## Alerting Rules

### Critical Alerts

```yaml
groups:
  - name: sankofa_critical
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(sankofa_api_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: DatabaseConnectionPoolExhausted
        expr: sankofa_db_connections_active / sankofa_db_connections_max > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"

      - alert: BudgetExceeded
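        # Cost series carry service/resource_id labels while the budget only has
        # tenant_id, so aggregate per tenant before dividing.
        expr: sum by (tenant_id) (sankofa_billing_cost_usd) / on (tenant_id) sankofa_billing_budget_usd > 1.0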
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Budget exceeded for tenant {{ $labels.tenant_id }}"

      - alert: ProxmoxNodeDown
        expr: up{job="proxmox"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Proxmox node {{ $labels.instance }} is down"
```

### Billing Anomaly Detection

```yaml
groups:
  - name: sankofa_billing_anomalies
    interval: 1h
    rules:
      - alert: CostAnomalyDetected
        expr: |
          (
            sankofa_billing_cost_usd
            - predict_linear(sankofa_billing_cost_usd[7d], 3600)
          ) / predict_linear(sankofa_billing_cost_usd[7d], 3600) > 0.5
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: "Unusual cost increase detected for tenant {{ $labels.tenant_id }}"
```
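
To make the anomaly expression easier to reason about, the Python sketch below reproduces its arithmetic: fit a least-squares line through the cost history, predict the value one hour ahead, and flag the tenant when the current cost exceeds the prediction by more than 50%. It rests on several assumptions: the prediction is anchored at the last sample's timestamp rather than the rule evaluation time, the `for: 2h` hold period is not modelled, and the function names are hypothetical.

```python
def predict_linear(samples: list[tuple[float, float]], horizon_s: float) -> float:
    """Least-squares prediction horizon_s seconds after the last sample.

    samples are (timestamp_seconds, value) pairs in time order.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_s)


def is_cost_anomaly(
    samples: list[tuple[float, float]],
    current_cost: float,
    horizon_s: float = 3600.0,
    threshold: float = 0.5,
) -> bool:
    """True when current cost exceeds the predicted cost by more than threshold."""
    predicted = predict_linear(samples, horizon_s)
    if predicted <= 0:
        return False  # relative deviation is undefined for a non-positive baseline
    return (current_cost - predicted) / predicted > threshold
```

With an hourly history that grows by 1 USD per hour from 100 USD over 24 samples, the one-hour-ahead prediction is 124 USD, so a current cost of 200 USD would be flagged while 130 USD would not.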

## Real-Time Cost Tracking

### Metrics Exposed

- `sankofa_billing_cost_usd{tenant_id, service, resource_id}` - Current cost
- `sankofa_billing_cost_rate_usd_per_hour{tenant_id}` - Cost rate
- `sankofa_billing_budget_usd{tenant_id}` - Budget limit
- `sankofa_billing_budget_utilization_percent{tenant_id}` - Budget usage %

### Grafana Query Examples

```promql
# Current month cost by tenant
sum(sankofa_billing_cost_usd) by (tenant_id)

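# Projected 7-day spend, extrapolated from the last hour
# (rate() returns USD per second, hence the factor of 3600)
rate(sankofa_billing_cost_usd[1h]) * 3600 * 24 * 7

# Budget utilization (%), aggregated per tenant so both sides match on tenant_id
sum by (tenant_id) (sankofa_billing_cost_usd) / on (tenant_id) sankofa_billing_budget_usd * 100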
```
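
Budget utilization is a tenant's total cost divided by its budget. Because the cost series carry `service` and `resource_id` labels while the budget series only carries `tenant_id`, costs have to be summed per tenant before dividing. The Python sketch below works through that arithmetic on in-memory data; the function name and the input shapes are assumptions made for illustration.

```python
from collections import defaultdict


def budget_utilization(
    cost_series: list[tuple[dict[str, str], float]],
    budgets: dict[str, float],
) -> dict[str, float]:
    """Return budget utilization in percent per tenant.

    cost_series holds (labels, cost_usd) pairs; budgets maps tenant_id to USD.
    """
    totals: dict[str, float] = defaultdict(float)
    for labels, cost in cost_series:
        totals[labels["tenant_id"]] += cost  # equivalent of sum by (tenant_id)
    return {
        tenant: total / budgets[tenant] * 100
        for tenant, total in totals.items()
        # Tenants without a budget drop out, like unmatched series in PromQL.
        # Zero budgets are skipped here to avoid dividing by zero.
        if budgets.get(tenant, 0) > 0
    }
```

For a tenant with 600 USD of compute and 150 USD of storage against a 1000 USD budget, the result is 75%.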

## Log Aggregation

### Loki Configuration

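Logs are shipped to Loki with tenant context. The `clients` block below belongs to the log shipper's configuration (the format matches Promtail, which only expands `${TENANT_ID}` when started with `-config.expand-env=true`):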

```yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
    tenant_id: ${TENANT_ID}
```
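
For services that push logs directly, the same tenant context travels in the `X-Scope-OrgID` HTTP header of Loki's push API. The Python sketch below only builds such a request using the standard library and does not send it. The function name `build_push_request` is hypothetical, and the base URL is whatever the deployment exposes.

```python
import json
import time
import urllib.request


def build_push_request(
    base_url: str,
    tenant_id: str,
    labels: dict[str, str],
    lines: list[str],
) -> urllib.request.Request:
    """Build (but do not send) a Loki push request scoped to one tenant."""
    now_ns = str(time.time_ns())  # Loki expects nanosecond timestamps as strings
    payload = {
        "streams": [
            {"stream": labels, "values": [[now_ns, line] for line in lines]}
        ]
    }
    return urllib.request.Request(
        url=f"{base_url}/loki/api/v1/push",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Scope-OrgID": tenant_id,  # tenant selector in multi-tenant Loki
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the push; keeping construction separate makes the payload easy to inspect in tests.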

### Log Labels

- `tenant_id`: Tenant identifier
- `service`: Service name (api, portal, etc.)
- `level`: Log level (info, warn, error)
- `component`: Component name
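
One way a service could emit log lines that carry these fields is a JSON formatter for Python's standard `logging` module, sketched below. This is an assumption about the application side, not a description of the existing Sankofa services: `JsonFormatter` and `make_logger` are hypothetical names, and it is assumed that the log shipper promotes the JSON fields to Loki labels.

```python
import json
import logging
import sys
from typing import TextIO

# Map Python level names onto the level values listed above.
_LEVELS = {"WARNING": "warn", "CRITICAL": "error"}


class JsonFormatter(logging.Formatter):
    """Format records as one JSON object per line with the documented fields."""

    def __init__(self, service: str, component: str) -> None:
        super().__init__()
        self.service = service
        self.component = component

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            # tenant_id is supplied per call through logging's `extra` argument.
            "tenant_id": getattr(record, "tenant_id", "unknown"),
            "service": self.service,
            "level": _LEVELS.get(record.levelname, record.levelname.lower()),
            "component": self.component,
            "message": record.getMessage(),
        })


def make_logger(service: str, component: str, stream: TextIO = sys.stdout) -> logging.Logger:
    """Return a logger that writes tenant-aware JSON lines to stream."""
    logger = logging.getLogger(f"sankofa.{service}.{component}")
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service, component))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `make_logger("api", "auth").error("login failed", extra={"tenant_id": "tenant-1"})` then produces a line whose fields line up with the labels above.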

### Log Queries

```logql
# Errors for a specific tenant
{tenant_id="tenant-1", level="error"}

# API errors (set the query time range to the last hour; LogQL has no now() function)
{service="api", level="error"} | json

# Authentication failures
{component="auth"} | json | status="failed"
```

## Deployment

### Install Monitoring Stack

```bash
# Add the prometheus-community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values grafana/values.yaml

# Dashboards are plain Grafana JSON, not Kubernetes manifests, so they cannot be
# applied directly with kubectl; load them as ConfigMaps (see Import Dashboards)
```

### Import Dashboards

```bash
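# Create one ConfigMap per dashboard, labeled for the Grafana sidecar
# (grafana_dashboard is the kube-prometheus-stack default label; adjust if overridden)
for dashboard in grafana/dashboards/*.json; do
  kubectl create configmap "$(basename "$dashboard" .json)" \
    --from-file="$dashboard" \
    --namespace=monitoring \
    --dry-run=client -o yaml |
    kubectl label --local -f - grafana_dashboard=1 -o yaml |
    kubectl apply --namespace=monitoring -f -
done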
```

## Access

- **Grafana**: https://grafana.sankofa.nexus
- **Prometheus**: https://prometheus.sankofa.nexus
- **Alertmanager**: https://alertmanager.sankofa.nexus

Default credentials (change immediately):
- Username: `admin`
- Password: (from secret `monitoring-grafana`)

## Best Practices

1. **Tenant Isolation**: Always filter metrics by tenant_id
2. **Retention**: Configure appropriate retention periods
3. **Cardinality**: Avoid high-cardinality labels such as user IDs or request IDs; each distinct label value creates a new time series
4. **Alerts**: Set up alerting for critical metrics
5. **Dashboards**: Create tenant-specific dashboards
6. **Cost Tracking**: Monitor billing metrics closely
7. **Anomaly Detection**: Enable anomaly detection for billing
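
The cardinality guideline can be checked with simple arithmetic: the worst-case number of time series for one metric is the product of the distinct values of each label. The Python sketch below computes that bound and lists labels above a chosen limit. Both function names and the limit of 1000 distinct values are illustrative assumptions, not project policy.

```python
from math import prod


def estimated_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def high_cardinality_labels(label_cardinalities: dict[str, int], limit: int = 1000) -> list[str]:
    """Return labels whose distinct-value count exceeds limit, sorted by name."""
    return sorted(name for name, count in label_cardinalities.items() if count > limit)
```

For example, 200 tenants, 5 methods, and 10 status codes give at most 10,000 series, while adding a `user_id` label with 50,000 values raises the bound to 500 million.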

## References

- Dashboard definitions: `grafana/dashboards/`
- Prometheus config: `monitoring/prometheus/`
- Alert rules: `monitoring/alerts/`