# Monitoring and Observability Guide

This guide covers monitoring setup, Grafana dashboards, and observability for Sankofa Phoenix.

## Overview

Sankofa Phoenix uses a comprehensive monitoring stack:
- **Prometheus**: Metrics collection and storage
- **Grafana**: Visualization and dashboards
- **Loki**: Log aggregation
- **Alertmanager**: Alert routing and notification

## Tenant-Aware Metrics

All metrics are tagged with tenant IDs for multi-tenant isolation.

### Metric Naming Convention

```
sankofa_<component>_<metric>_<unit>{tenant_id="<id>",...}
```

Examples:
- `sankofa_api_requests_total{tenant_id="tenant-1",method="POST",status="200"}`
- `sankofa_billing_cost_usd{tenant_id="tenant-1",service="compute"}`
- `sankofa_proxmox_vm_cpu_usage_percent{tenant_id="tenant-1",vm_id="101"}`

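
The naming convention above can be checked mechanically. The following is a minimal sketch, assuming names are lowercase snake_case; `metric_name` and `series` are hypothetical helpers for illustration and are not part of the Sankofa Phoenix codebase.

```python
import re

# Hypothetical helpers for illustration only.
_PART = re.compile(r"^[a-z][a-z0-9_]*$")


def metric_name(component: str, metric: str, unit: str) -> str:
    """Build sankofa_<component>_<metric>_<unit> from lowercase snake_case parts."""
    for part in (component, metric, unit):
        if not _PART.match(part):
            raise ValueError(f"invalid metric name part: {part!r}")
    return f"sankofa_{component}_{metric}_{unit}"


def series(name: str, tenant_id: str, **labels: str) -> str:
    """Render a series string with tenant_id first, then the other labels sorted."""
    pairs = [("tenant_id", tenant_id)] + sorted(labels.items())
    body = ",".join(f'{key}="{value}"' for key, value in pairs)
    return f"{name}{{{body}}}"


name = metric_name("api", "requests", "total")
print(series(name, "tenant-1", method="POST", status="200"))
# sankofa_api_requests_total{tenant_id="tenant-1",method="POST",status="200"}
```

Making `tenant_id` a required argument, rather than an optional label, is one way to keep untagged series from being emitted by accident.
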
## Grafana Dashboards

### 1. System Overview Dashboard

**Location**: `grafana/dashboards/system-overview.json`

**Metrics**:
- API request rate and latency
- Database connection pool usage
- Keycloak authentication rate
- System resource usage (CPU, memory, disk)

**Panels**:
- Request rate (requests/sec)
- P95 latency (ms)
- Error rate (%)
- Active connections
- Authentication success rate

### 2. Tenant Dashboard

**Location**: `grafana/dashboards/tenant-overview.json`

**Metrics**:
- Tenant resource usage
- Tenant cost tracking
- Tenant API usage
- Tenant user activity

**Panels**:
- Resource usage by tenant
- Cost breakdown by tenant
- API calls by tenant
- Active users by tenant

### 3. Billing Dashboard

**Location**: `grafana/dashboards/billing.json`

**Metrics**:
- Real-time cost tracking
- Cost by service/resource
- Budget vs actual spend
- Cost forecast
- Billing anomalies

**Panels**:
- Current month cost
- Cost trend (7d, 30d)
- Top resources by cost
- Budget utilization
- Anomaly detection alerts

### 4. Proxmox Infrastructure Dashboard

**Location**: `grafana/dashboards/proxmox-infrastructure.json`

**Metrics**:
- VM status and health
- Node resource usage
- Storage utilization
- Network throughput
- VM creation/deletion rate

**Panels**:
- VM status overview
- Node CPU/memory usage
- Storage pool usage
- Network I/O
- VM lifecycle events

### 5. Security Dashboard

**Location**: `grafana/dashboards/security.json`

**Metrics**:
- Authentication events
- Failed login attempts
- Policy violations
- Incident response metrics
- Audit log events

**Panels**:
- Authentication success/failure rate
- Policy violations by severity
- Incident response time
- Audit log volume
- Security events timeline

## Prometheus Configuration

### Scrape Configs

```yaml
scrape_configs:
  - job_name: 'sankofa-api'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - api
    relabel_configs:
      # Keep only pods labeled app=api
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: api
    metric_relabel_configs:
      # Carry the tenant_id label through unchanged
      - source_labels: [tenant_id]
        target_label: tenant_id
        regex: '(.+)'
        replacement: '${1}'

  - job_name: 'proxmox'
    static_configs:
      - targets:
          - proxmox-exporter:9091
    relabel_configs:
      # Use the scrape address as the instance label
      - source_labels: [__address__]
        target_label: instance
```

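
The `keep` action in the `sankofa-api` job is worth a closer look, because Prometheus anchors relabel regexes at both ends: `regex: api` matches a pod labeled `app=api` but not `app=api-gateway`. The sketch below imitates that behavior in Python under stated assumptions: `keep_target` is a hypothetical name, and Python's `re` stands in for the RE2 syntax Prometheus actually uses.

```python
import re


def keep_target(
    labels: dict[str, str],
    source_labels: list[str],
    regex: str,
    separator: str = ";",
) -> bool:
    """Imitate relabel action `keep`.

    The values of the source labels are joined with the separator, and the
    target is kept only if the whole joined string matches the regex.
    """
    value = separator.join(labels.get(name, "") for name in source_labels)
    return re.fullmatch(regex, value) is not None


source = ["__meta_kubernetes_pod_label_app"]
print(keep_target({"__meta_kubernetes_pod_label_app": "api"}, source, "api"))          # True
print(keep_target({"__meta_kubernetes_pod_label_app": "api-gateway"}, source, "api"))  # False
```

A missing label is treated as an empty string here, so pods without an `app` label are dropped as well.
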
### Recording Rules

```yaml
groups:
  - name: sankofa_rules
    interval: 30s
    rules:
      - record: sankofa:api:requests:rate5m
        expr: rate(sankofa_api_requests_total[5m])

      - record: sankofa:billing:cost:rate1h
        expr: rate(sankofa_billing_cost_usd[1h])

      - record: sankofa:proxmox:vm:count
        expr: count(sankofa_proxmox_vm_info) by (tenant_id)
```

## Alerting Rules

### Critical Alerts

```yaml
groups:
  - name: sankofa_critical
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(sankofa_api_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: DatabaseConnectionPoolExhausted
        expr: sankofa_db_connections_active / sankofa_db_connections_max > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"

      - alert: BudgetExceeded
        # Cost carries service/resource_id labels and the budget does not,
        # so aggregate cost per tenant and match on tenant_id only.
        expr: sum by (tenant_id) (sankofa_billing_cost_usd) / on (tenant_id) sankofa_billing_budget_usd > 1.0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Budget exceeded for tenant {{ $labels.tenant_id }}"

      - alert: ProxmoxNodeDown
        expr: up{job="proxmox"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Proxmox node {{ $labels.instance }} is down"
```

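
Each rule above has a `for:` duration, which delays firing until the expression has been true on every evaluation for that long. The sketch below models that pending-to-firing behavior for a single alert series; `AlertState` is a hypothetical class for illustration, not Prometheus code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Tracks one alert series across rule evaluations."""

    for_seconds: float
    active_since: Optional[float] = None  # time the condition first became true

    def evaluate(self, now: float, condition: bool) -> str:
        """Return 'inactive', 'pending', or 'firing' for this evaluation."""
        if not condition:
            self.active_since = None  # a false evaluation resets the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# HighErrorRate uses `for: 5m` with a 30s evaluation interval.
state = AlertState(for_seconds=300)
for t in range(0, 331, 30):
    print(t, state.evaluate(t, condition=True))
```

With these numbers the alert stays pending from 0s through 270s and reports firing from 300s onward, which is why a short error spike does not page anyone.
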
### Billing Anomaly Detection

```yaml
groups:
  - name: sankofa_billing_anomalies
    interval: 1h
    rules:
      - alert: CostAnomalyDetected
        # Fires when current cost is more than 50% above the value that the
        # 7-day linear trend predicts for one hour ahead.
        expr: |
          (
            sankofa_billing_cost_usd
            - predict_linear(sankofa_billing_cost_usd[7d], 3600)
          ) / predict_linear(sankofa_billing_cost_usd[7d], 3600) > 0.5
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: "Unusual cost increase detected for tenant {{ $labels.tenant_id }}"
```

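
To make the anomaly expression concrete, here is a rough Python sketch of the same arithmetic: fit a least-squares line through the cost samples, extrapolate one hour ahead, and compare the current value against that prediction. The function names and the sample data are assumptions for illustration; Prometheus computes `predict_linear` internally and this is only an approximation of the idea.

```python
def predict_linear(samples: list[tuple[float, float]], now: float, horizon: float) -> float:
    """Least-squares line through (timestamp, value) samples, evaluated at now + horizon."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = [t - now for t, _ in samples]  # time relative to the evaluation time
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    if var == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = cov / var
    intercept = mean_y - slope * mean_x  # fitted value at `now`
    return intercept + slope * horizon


def cost_anomaly(
    samples: list[tuple[float, float]],
    now: float,
    threshold: float = 0.5,
    horizon: float = 3600.0,
) -> bool:
    """True when the latest cost exceeds the predicted cost by more than `threshold`."""
    current = samples[-1][1]
    predicted = predict_linear(samples, now, horizon)
    return (current - predicted) / predicted > threshold


# Made-up data: cost grows by 1 USD per hour for 7 days, sampled hourly.
hour = 3600.0
steady = [(h * hour, 10.0 + h) for h in range(169)]
now = steady[-1][0]
print(cost_anomaly(steady, now))                        # False
print(cost_anomaly(steady[:-1] + [(now, 400.0)], now))  # True
```

Steady growth, however steep, does not trigger the rule because the trend line absorbs it; only a jump well above the trend does.
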
## Real-Time Cost Tracking

### Metrics Exposed

- `sankofa_billing_cost_usd{tenant_id, service, resource_id}` - Current cost
- `sankofa_billing_cost_rate_usd_per_hour{tenant_id}` - Cost rate
- `sankofa_billing_budget_usd{tenant_id}` - Budget limit
- `sankofa_billing_budget_utilization_percent{tenant_id}` - Budget usage %

### Grafana Query Example

```promql
# Current month cost by tenant
sum(sankofa_billing_cost_usd) by (tenant_id)

# Cost trend (7 days): rate() is per second, so scale it to a 7-day window
rate(sankofa_billing_cost_usd[1h]) * 3600 * 24 * 7

# Budget utilization (%): aggregate cost per tenant so both sides match on tenant_id
sum(sankofa_billing_cost_usd) by (tenant_id) / on (tenant_id) sankofa_billing_budget_usd * 100
```

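
The budget utilization figure is a per-tenant calculation: cost is reported per service and resource, while the budget is a single value per tenant, so costs are summed per tenant before dividing. A small worked sketch of that arithmetic follows; `budget_utilization` and the sample figures are assumptions for illustration.

```python
from collections import defaultdict


def budget_utilization(
    costs: list[tuple[str, str, float]],
    budgets: dict[str, float],
) -> dict[str, float]:
    """Return budget usage in percent per tenant.

    costs holds (tenant_id, service, cost_usd) rows. Tenants without a
    positive budget are skipped, much like unmatched series in a PromQL
    binary operation.
    """
    totals: dict[str, float] = defaultdict(float)
    for tenant_id, _service, cost_usd in costs:
        totals[tenant_id] += cost_usd
    return {
        tenant_id: total / budgets[tenant_id] * 100
        for tenant_id, total in totals.items()
        if budgets.get(tenant_id)
    }


costs = [
    ("tenant-1", "compute", 30.0),
    ("tenant-1", "storage", 20.0),
    ("tenant-2", "compute", 150.0),
]
budgets = {"tenant-1": 100.0, "tenant-2": 100.0}
print(budget_utilization(costs, budgets))
# {'tenant-1': 50.0, 'tenant-2': 150.0}
```

In this example `tenant-2` is at 150%, which is the condition the `BudgetExceeded` alert is meant to catch.
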
## Log Aggregation

### Loki Configuration

Logs are collected with tenant context:

```yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
    tenant_id: ${TENANT_ID}
```

### Log Labels

- `tenant_id`: Tenant identifier
- `service`: Service name (api, portal, etc.)
- `level`: Log level (info, warn, error)
- `component`: Component name

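
On the application side, one way to supply these fields is to emit each log line as JSON. The sketch below uses only the Python standard library; `TenantJsonFormatter` is a hypothetical class, and how the log shipper turns these fields into Loki labels is not specified in this guide, so treat the field-to-label mapping as an assumption.

```python
import json
import logging

# Python calls the level WARNING; the label values above use "warn".
_LEVEL_NAMES = {"WARNING": "warn"}


class TenantJsonFormatter(logging.Formatter):
    """Format log records as JSON carrying tenant_id, service, level, and component."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "tenant_id": getattr(record, "tenant_id", "unknown"),
            "service": self.service,
            "level": _LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            "component": getattr(record, "component", record.name),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(TenantJsonFormatter(service="api"))
logger = logging.getLogger("auth")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Per-request context is passed through `extra`.
logger.error("login failed", extra={"tenant_id": "tenant-1", "component": "auth"})
```

Falling back to `"unknown"` when no tenant is attached keeps such lines visible in queries instead of dropping them silently.
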
### Log Queries

```logql
# Errors for a specific tenant
{tenant_id="tenant-1", level="error"}

# API errors in the last hour
# (LogQL has no now() function; set the query time range to the last 1h instead)
{service="api", level="error"} | json

# Authentication failures
{component="auth"} | json | status="failed"
```

## Deployment

### Install Monitoring Stack

```bash
# Add Prometheus Operator Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values grafana/values.yaml

# Apply custom dashboards
kubectl apply -f grafana/dashboards/
```

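### Import Dashboards

```bash
# Import all dashboards as ConfigMaps
for dashboard in grafana/dashboards/*.json; do
  name="$(basename "$dashboard" .json)"
  kubectl create configmap "$name" \
    --from-file="$dashboard" \
    --namespace=monitoring \
    --dry-run=client -o yaml | kubectl apply -f -
  # Assumption: the Grafana sidecar discovers ConfigMaps labeled grafana_dashboard=1
  kubectl label configmap "$name" grafana_dashboard=1 \
    --namespace=monitoring --overwrite
done
```
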
## Access

- **Grafana**: https://grafana.sankofa.nexus
- **Prometheus**: https://prometheus.sankofa.nexus
- **Alertmanager**: https://alertmanager.sankofa.nexus

Default credentials (change immediately):
- Username: `admin`
- Password: (from secret `monitoring-grafana`)

## Best Practices

1. **Tenant Isolation**: Always filter metrics by tenant_id
2. **Retention**: Configure appropriate retention periods
3. **Cardinality**: Avoid high-cardinality labels
4. **Alerts**: Set up alerting for critical metrics
5. **Dashboards**: Create tenant-specific dashboards
6. **Cost Tracking**: Monitor billing metrics closely
7. **Anomaly Detection**: Enable anomaly detection for billing

## References

- Dashboard definitions: `grafana/dashboards/`
- Prometheus config: `monitoring/prometheus/`
- Alert rules: `monitoring/alerts/`