Apply Composer changes: comprehensive API updates, migrations, middleware, and infrastructure improvements

- Add comprehensive database migrations (001-024) for schema evolution
- Enhance API schema with expanded type definitions and resolvers
- Add new middleware: audit logging, rate limiting, MFA enforcement, security, tenant auth
- Implement new services: AI optimization, billing, blockchain, compliance, marketplace
- Add adapter layer for cloud integrations (Cloudflare, Kubernetes, Proxmox, storage)
- Update Crossplane provider with enhanced VM management capabilities
- Add comprehensive test suite for API endpoints and services
- Update frontend components with improved GraphQL subscriptions and real-time updates
- Enhance security configurations and headers (CSP, CORS, etc.)
- Update documentation and configuration files
- Add new CI/CD workflows and validation scripts
- Implement design system improvements and UI enhancements
337
docs/MONITORING_GUIDE.md
Normal file
@@ -0,0 +1,337 @@
# Monitoring and Observability Guide

This guide covers monitoring setup, Grafana dashboards, and observability for Sankofa Phoenix.

## Overview

Sankofa Phoenix uses a comprehensive monitoring stack:
- **Prometheus**: Metrics collection and storage
- **Grafana**: Visualization and dashboards
- **Loki**: Log aggregation
- **Alertmanager**: Alert routing and notification

## Tenant-Aware Metrics

Every metric carries a `tenant_id` label so that queries, dashboards, and alerts can be scoped to a single tenant.

### Metric Naming Convention

```
sankofa_<component>_<metric>_<unit>{tenant_id="<id>",...}
```

Examples:
- `sankofa_api_requests_total{tenant_id="tenant-1",method="POST",status="200"}`
- `sankofa_billing_cost_usd{tenant_id="tenant-1",service="compute"}`
- `sankofa_proxmox_vm_cpu_usage_percent{tenant_id="tenant-1",vm_id="101"}`
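
The naming convention above can be checked mechanically before a metric is exposed. The following is a minimal Python sketch, not code from the Sankofa Phoenix repository: the function name `format_sample` and the exact regular expression are assumptions made for illustration. It rejects names without the `sankofa_` prefix or a `tenant_id` label and renders one sample in the Prometheus text format.

```python
import re

# sankofa_<component>_<metric>_<unit>: the prefix plus at least three
# lowercase segments (assumed reading of the convention).
_NAME_RE = re.compile(r"^sankofa(?:_[a-z][a-z0-9]*){3,}$")


def _escape(value: str) -> str:
    """Escape backslash, double quote and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample as a Prometheus text exposition line.

    Raises ValueError if the name does not match the convention or the
    tenant_id label is missing or empty.
    """
    if not _NAME_RE.match(name):
        raise ValueError(f"metric name does not follow the convention: {name!r}")
    if not labels.get("tenant_id"):
        raise ValueError("every metric must carry a tenant_id label")
    # tenant_id first, remaining labels in sorted order for stable output.
    ordered = [("tenant_id", labels["tenant_id"])] + sorted(
        (k, v) for k, v in labels.items() if k != "tenant_id"
    )
    body = ",".join(f'{k}="{_escape(v)}"' for k, v in ordered)
    return f"{name}{{{body}}} {value}"
```

Calling `format_sample("sankofa_api_requests_total", {"tenant_id": "tenant-1", "method": "POST", "status": "200"}, 42)` yields the first example line above followed by the value `42`.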

## Grafana Dashboards

### 1. System Overview Dashboard

**Location**: `grafana/dashboards/system-overview.json`

**Metrics**:
- API request rate and latency
- Database connection pool usage
- Keycloak authentication rate
- System resource usage (CPU, memory, disk)

**Panels**:
- Request rate (requests/sec)
- P95 latency (ms)
- Error rate (%)
- Active connections
- Authentication success rate

### 2. Tenant Dashboard

**Location**: `grafana/dashboards/tenant-overview.json`

**Metrics**:
- Tenant resource usage
- Tenant cost tracking
- Tenant API usage
- Tenant user activity

**Panels**:
- Resource usage by tenant
- Cost breakdown by tenant
- API calls by tenant
- Active users by tenant

### 3. Billing Dashboard

**Location**: `grafana/dashboards/billing.json`

**Metrics**:
- Real-time cost tracking
- Cost by service/resource
- Budget vs actual spend
- Cost forecast
- Billing anomalies

**Panels**:
- Current month cost
- Cost trend (7d, 30d)
- Top resources by cost
- Budget utilization
- Anomaly detection alerts

### 4. Proxmox Infrastructure Dashboard

**Location**: `grafana/dashboards/proxmox-infrastructure.json`

**Metrics**:
- VM status and health
- Node resource usage
- Storage utilization
- Network throughput
- VM creation/deletion rate

**Panels**:
- VM status overview
- Node CPU/memory usage
- Storage pool usage
- Network I/O
- VM lifecycle events

### 5. Security Dashboard

**Location**: `grafana/dashboards/security.json`

**Metrics**:
- Authentication events
- Failed login attempts
- Policy violations
- Incident response metrics
- Audit log events

**Panels**:
- Authentication success/failure rate
- Policy violations by severity
- Incident response time
- Audit log volume
- Security events timeline

## Prometheus Configuration

### Scrape Configs

```yaml
scrape_configs:
  - job_name: 'sankofa-api'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - api
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: api
    metric_relabel_configs:
      - source_labels: [tenant_id]
        target_label: tenant_id
        regex: '(.+)'
        replacement: '${1}'

  - job_name: 'proxmox'
    static_configs:
      - targets:
          - proxmox-exporter:9091
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
```

### Recording Rules

```yaml
groups:
  - name: sankofa_rules
    interval: 30s
    rules:
      - record: sankofa:api:requests:rate5m
        expr: rate(sankofa_api_requests_total[5m])

      # rate() assumes a monotonically increasing series; if
      # sankofa_billing_cost_usd is a gauge that can decrease, use deriv().
      - record: sankofa:billing:cost:rate1h
        expr: rate(sankofa_billing_cost_usd[1h])

      - record: sankofa:proxmox:vm:count
        expr: count(sankofa_proxmox_vm_info) by (tenant_id)
```

## Alerting Rules

### Critical Alerts

```yaml
groups:
  - name: sankofa_critical
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(sankofa_api_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: DatabaseConnectionPoolExhausted
        expr: sankofa_db_connections_active / sankofa_db_connections_max > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"

      # Cost is summed per tenant first: it carries service and resource_id
      # labels that the budget series lacks, so a bare division matches nothing.
      - alert: BudgetExceeded
        expr: sum by (tenant_id) (sankofa_billing_cost_usd) / max by (tenant_id) (sankofa_billing_budget_usd) > 1.0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Budget exceeded for tenant {{ $labels.tenant_id }}"

      - alert: ProxmoxNodeDown
        expr: up{job="proxmox"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Proxmox node {{ $labels.instance }} is down"
```

### Billing Anomaly Detection

```yaml
groups:
  - name: sankofa_billing_anomalies
    interval: 1h
    rules:
      - alert: CostAnomalyDetected
        # Fires when the current cost exceeds the linear forecast
        # (7-day window, 1 hour ahead) by more than 50%.
        expr: |
          (
            sankofa_billing_cost_usd
            - predict_linear(sankofa_billing_cost_usd[7d], 3600)
          ) / predict_linear(sankofa_billing_cost_usd[7d], 3600) > 0.5
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: "Unusual cost increase detected for tenant {{ $labels.tenant_id }}"
```
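
To make the anomaly expression easier to reason about, here is a small Python sketch of the same arithmetic. It is an illustration under stated assumptions, not the Prometheus implementation: `predict_linear` is reimplemented as an ordinary least-squares fit, and the function and parameter names (`is_cost_anomaly`, `horizon_s`, `threshold`) are hypothetical.

```python
def predict_linear(
    samples: list[tuple[float, float]], now: float, horizon_s: float
) -> float:
    """Fit a least-squares line through (timestamp, value) samples and
    evaluate it at now + horizon_s."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    return mean_v + slope * (now + horizon_s - mean_t)


def is_cost_anomaly(
    samples: list[tuple[float, float]],
    now: float,
    current_cost: float,
    threshold: float = 0.5,
) -> bool:
    """Mirror the alert: (current - predicted) / predicted > threshold,
    with the prediction taken one hour (3600 s) ahead."""
    predicted = predict_linear(samples, now, 3600)
    if predicted == 0:
        # PromQL would yield +Inf or NaN here; this sketch reports no anomaly.
        return False
    return (current_cost - predicted) / predicted > threshold
```

With a cost that has grown by 1 USD per hour from 100 to 102, the forecast one hour ahead is 103, so a current cost of 160 is flagged (about 55% above forecast) while 150 is not (about 46%).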

## Real-Time Cost Tracking

### Metrics Exposed

- `sankofa_billing_cost_usd{tenant_id, service, resource_id}` - Current cost
- `sankofa_billing_cost_rate_usd_per_hour{tenant_id}` - Cost rate
- `sankofa_billing_budget_usd{tenant_id}` - Budget limit
- `sankofa_billing_budget_utilization_percent{tenant_id}` - Budget usage %
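
### Grafana Query Example

```promql
# Current month cost by tenant
sum(sankofa_billing_cost_usd) by (tenant_id)

# Projected 7-day spend, extrapolated from the last hour
# (rate() is per second, so scale by the number of seconds in 7 days)
rate(sankofa_billing_cost_usd[1h]) * 3600 * 24 * 7

# Budget utilization (%) per tenant; cost is summed first because it carries
# service and resource_id labels that the budget series does not
sum by (tenant_id) (sankofa_billing_cost_usd) / max by (tenant_id) (sankofa_billing_budget_usd) * 100
```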
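
Because `sankofa_billing_cost_usd` is reported per `service` and `resource_id` while `sankofa_billing_budget_usd` is reported per tenant only, budget utilization has to be computed on per-tenant totals. The Python sketch below shows that aggregation on plain data; the function name `budget_utilization` and the input shapes are assumptions for illustration, not an API of the billing service.

```python
from collections import defaultdict


def budget_utilization(
    cost_samples: list[tuple[dict[str, str], float]],
    budgets: dict[str, float],
) -> dict[str, float]:
    """Return budget utilization in percent, keyed by tenant_id.

    cost_samples holds (labels, value) pairs as they would be scraped;
    budgets maps tenant_id to the budget in USD. Tenants without cost
    samples or with a non-positive budget are left out.
    """
    totals: dict[str, float] = defaultdict(float)
    for labels, value in cost_samples:
        # Collapse the service and resource_id dimensions.
        totals[labels["tenant_id"]] += value
    return {
        tenant: totals[tenant] / budget * 100
        for tenant, budget in budgets.items()
        if budget > 0 and tenant in totals
    }
```

For example, a tenant with 30 USD of compute and 20 USD of storage against a 200 USD budget comes out at 25% utilization.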

## Log Aggregation

### Loki Configuration

Logs are shipped to Loki with tenant context. The `clients` block below belongs to the log collector (Promtail) configuration:

```yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
    # Expanded from the environment when Promtail runs with -config.expand-env=true
    tenant_id: ${TENANT_ID}
```

### Log Labels

- `tenant_id`: Tenant identifier
- `service`: Service name (api, portal, etc.)
- `level`: Log level (info, warn, error)
- `component`: Component name
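
As an illustration of how an application could emit these fields, the Python sketch below formats each log record as one JSON object per line. It uses only the standard `logging` and `json` modules; the class name `TenantJsonFormatter` and the `"unknown"` fallback are assumptions, and how the fields become Loki labels depends on the collector pipeline, which this guide does not show.

```python
import json
import logging


class TenantJsonFormatter(logging.Formatter):
    """Format log records as JSON lines carrying the label fields above."""

    def __init__(self, service: str, component: str) -> None:
        super().__init__()
        self.service = service
        self.component = component

    def format(self, record: logging.LogRecord) -> str:
        # Python names the level WARNING; the guide lists it as "warn".
        level = "warn" if record.levelname == "WARNING" else record.levelname.lower()
        return json.dumps(
            {
                # tenant_id is expected via logger.<level>(..., extra={"tenant_id": ...})
                "tenant_id": getattr(record, "tenant_id", "unknown"),
                "service": self.service,
                "level": level,
                "component": self.component,
                "message": record.getMessage(),
            }
        )
```

A handler configured with `handler.setFormatter(TenantJsonFormatter("api", "auth"))` would then emit lines that the `| json` stage in a LogQL query can parse.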

### Log Queries

```logql
# Errors for a specific tenant
{tenant_id="tenant-1", level="error"}

# API errors in the last hour: LogQL has no now() function, so set the
# query time range (for example "Last 1 hour" in Grafana) instead
{service="api", level="error"} | json

# Authentication failures
{component="auth"} | json | status="failed"
```

## Deployment

### Install Monitoring Stack

```bash
# Add the prometheus-community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values grafana/values.yaml

# Apply custom dashboards (this only works for Kubernetes manifests; for raw
# Grafana JSON files use the ConfigMap loop under "Import Dashboards")
kubectl apply -f grafana/dashboards/
```

### Import Dashboards

```bash
# Import all dashboards as ConfigMaps
for dashboard in grafana/dashboards/*.json; do
  name=$(basename "$dashboard" .json)
  kubectl create configmap "$name" \
    --from-file="$dashboard" \
    --namespace=monitoring \
    --dry-run=client -o yaml | kubectl apply -f -
  # Assumption: the Grafana sidecar uses the chart's default dashboard label.
  kubectl label configmap "$name" grafana_dashboard=1 \
    --namespace=monitoring --overwrite
done
```

## Access

- **Grafana**: https://grafana.sankofa.nexus
- **Prometheus**: https://prometheus.sankofa.nexus
- **Alertmanager**: https://alertmanager.sankofa.nexus

Default credentials (change immediately):
- Username: `admin`
- Password: (from secret `monitoring-grafana`)

## Best Practices

1. **Tenant Isolation**: Always filter metrics by `tenant_id`
2. **Retention**: Configure appropriate retention periods
3. **Cardinality**: Avoid high-cardinality labels
4. **Alerts**: Set up alerting for critical metrics
5. **Dashboards**: Create tenant-specific dashboards
6. **Cost Tracking**: Monitor billing metrics closely
7. **Anomaly Detection**: Enable anomaly detection for billing
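
The cardinality advice can be made concrete with a quick estimate: the worst-case number of time series for one metric is the product of the number of distinct values of each label. The Python sketch below computes that bound; the function name `max_series` and the label counts in the usage note are hypothetical.

```python
from math import prod


def max_series(label_values: dict[str, int]) -> int:
    """Upper bound on time series for a single metric.

    label_values maps each label name to its number of distinct values;
    the bound is their product (1 for a metric without labels).
    """
    return prod(label_values.values())
```

With 50 tenants, 5 HTTP methods, and 10 status codes the bound is 2,500 series. Adding a label with 1,000 distinct values, such as a per-request or per-resource identifier, raises it to 2,500,000, which is why such labels are best kept out of metrics.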

## References

- Dashboard definitions: `grafana/dashboards/`
- Prometheus config: `monitoring/prometheus/`
- Alert rules: `monitoring/alerts/`