Update documentation structure and enhance .gitignore

- Added generated index files and report directories to .gitignore to prevent unnecessary tracking of transient files.
- Updated README links to reflect new documentation paths for better navigation.
- Improved documentation organization by ensuring all links point to the correct locations, enhancing user experience and accessibility.
2025-12-12 21:18:55 -08:00
parent 664707d912
commit fe0365757a
106 changed files with 4666 additions and 2294 deletions
--- a/docs/guides/BUILD_AND_DEPLOY_INSTRUCTIONS.md
+++ b/docs/guides/BUILD_AND_DEPLOY_INSTRUCTIONS.md
@@ -0,0 +1,152 @@
+# Build and Deploy Instructions
+
+**Date**: 2025-12-11  
+**Status**: ✅ **CODE FIXED - NEEDS IMAGE LOADING**
+
+---
+
+## Build Status
+
+✅ **Provider code fixed and built successfully**
+- Fixed compilation errors
+- Added `findVMNode` function
+- Fixed variable scoping issue
+- Image built: `crossplane-provider-proxmox:latest`
+
+---
+
+## Deployment Steps
+
+### 1. Build Provider Image
+
+```bash
+cd crossplane-provider-proxmox
+docker build -t crossplane-provider-proxmox:latest .
+```
+
+✅ **COMPLETE**
+
+### 2. Load Image into Kind Cluster
+
+**Required**: `kind` command must be installed
+
+```bash
+kind load docker-image crossplane-provider-proxmox:latest --name sankofa
+```
+
+⚠️ **PENDING**: `kind` command not available in current environment
+
+**Alternative Methods**:
+
+#### Option A: Install kind
+```bash
+# Install kind
+curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
+chmod +x ./kind
+sudo mv ./kind /usr/local/bin/kind
+
+# Then load image
+kind load docker-image crossplane-provider-proxmox:latest --name sankofa
+```
+
+#### Option B: Use Registry
+```bash
+# Tag and push to registry
+docker tag crossplane-provider-proxmox:latest <registry>/crossplane-provider-proxmox:latest
+docker push <registry>/crossplane-provider-proxmox:latest
+
+# Update provider.yaml to use registry image
+# Change imagePullPolicy from "Never" to "Always" or "IfNotPresent"
+```
+
+#### Option C: Manual Copy (Advanced)
+```bash
+# Save image to file
+docker save crossplane-provider-proxmox:latest -o provider-image.tar
+
+# Copy to kind node and load (kind names the node container <cluster-name>-control-plane)
+docker cp provider-image.tar sankofa-control-plane:/tmp/
+docker exec sankofa-control-plane ctr -n=k8s.io images import /tmp/provider-image.tar
+```
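+
+Which option applies depends on what is available on the build machine. A minimal sketch of that decision, assuming a hypothetical helper name `choose_load_method` and a registry host supplied by the caller (neither is part of the provider repository):
+
+```bash
+# Hypothetical helper: report which image-loading option applies.
+# $1 = name of the kind binary to look for
+# $2 = registry host, empty if no registry is available
+choose_load_method() {
+  if command -v "$1" >/dev/null 2>&1; then
+    echo "kind"        # Option A or the default step: kind is installed
+  elif [ -n "$2" ]; then
+    echo "registry"    # Option B: tag and push to the registry
+  else
+    echo "manual"      # Option C: docker save, docker cp, ctr import
+  fi
+}
+
+# Example: print the option that applies on this machine
+choose_load_method kind "${REGISTRY:-}"
+```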
+
+### 3. Restart Provider
+
+```bash
+kubectl rollout restart deployment/crossplane-provider-proxmox -n crossplane-system
+kubectl rollout status deployment/crossplane-provider-proxmox -n crossplane-system
+```
+
+✅ **COMPLETE** (but using old image until step 2 is done)
+
+### 4. Verify Deployment
+
+```bash
+kubectl get pods -n crossplane-system -l app=crossplane-provider-proxmox
+kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=20
+```
+
+---
+
+## Current Status
+
+### ✅ Completed
+1. Code fixes applied
+2. Provider image built
+3. Templates updated to cloud image format
+4. Provider deployment restarted
+
+### ⏳ Pending
+1. **Load image into kind cluster** (requires `kind` command)
+2. Test VM creation with new provider
+
+---
+
+## Next Steps
+
+1. **Install kind** or use alternative image loading method
+2. **Load image** into cluster
+3. **Restart provider** (if not already done)
+4. **Test VM 100** creation
+5. **Verify** task monitoring works
+
+---
+
+## Verification
+
+After loading image and restarting:
+
+1. **Check provider logs** for task monitoring:
+   ```bash
+   kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox | grep -i "task\|importdisk\|upid"
+   ```
+
+2. **Deploy VM 100**:
+   ```bash
+   kubectl apply -f examples/production/vm-100.yaml
+   ```
+
+3. **Monitor creation**:
+   ```bash
+   kubectl get proxmoxvm vm-100 -w
+   ```
+
+4. **Check Proxmox**:
+   ```bash
+   qm status 100
+   qm config 100
+   ```
+
+---
+
+## Expected Behavior
+
+With the fixed provider:
+- ✅ Provider waits for `importdisk` task to complete
+- ✅ No lock timeouts
+- ✅ VM configured correctly after import
+- ✅ Boot disk attached properly
+
+---
+
+**Status**: ⏳ **AWAITING IMAGE LOAD INTO CLUSTER**
+
--- a/docs/guides/CODE_DOCUMENTATION_GUIDE.md
+++ b/docs/guides/CODE_DOCUMENTATION_GUIDE.md
@@ -0,0 +1,174 @@
+# Code Documentation Guide
+
+This guide outlines the standards and best practices for documenting code in the Sankofa Phoenix project.
+
+## JSDoc Standards
+
+### Function Documentation
+
+All public functions should include JSDoc comments with:
+
+- Description of what the function does
+- `@param` tags for each parameter
+- `@returns` tag describing the return value
+- `@throws` tags for exceptions that may be thrown
+- `@example` tag with usage example (for complex functions)
+
+**Example:**
+
+```typescript
+/**
+ * Authenticate a user and return a JWT token
+ * 
+ * @param email - User email address
+ * @param password - User password
+ * @returns Authentication payload with JWT token and user information
+ * @throws {AuthenticationError} If credentials are invalid
+ * @example
+ * ```typescript
+ * const result = await login('user@example.com', 'password123');
+ * console.log(result.token); // JWT token
+ * ```
+ */
+export async function login(email: string, password: string): Promise<AuthPayload> {
+  // implementation
+}
+```
+
+### Class Documentation
+
+Classes should include:
+
+- Description of the class purpose
+- `@example` tag showing basic usage
+
+**Example:**
+
+```typescript
+/**
+ * Proxmox VE Infrastructure Adapter
+ * 
+ * Implements the InfrastructureAdapter interface for Proxmox VE infrastructure.
+ * Provides resource discovery, creation, update, deletion, metrics, and health checks.
+ * 
+ * @example
+ * ```typescript
+ * const adapter = new ProxmoxAdapter({
+ *   apiUrl: 'https://proxmox.example.com:8006',
+ *   apiToken: 'token-id=...'
+ * });
+ * const resources = await adapter.discoverResources();
+ * ```
+ */
+export class ProxmoxAdapter implements InfrastructureAdapter {
+  // implementation
+}
+```
+
+### Interface Documentation
+
+Complex interfaces should include documentation:
+
+```typescript
+/**
+ * Resource filter criteria for querying resources
+ * 
+ * @property type - Filter by resource type (e.g., 'VM', 'CONTAINER')
+ * @property status - Filter by resource status (e.g., 'RUNNING', 'STOPPED')
+ * @property siteId - Filter by site ID
+ * @property tenantId - Filter by tenant ID
+ */
+export interface ResourceFilter {
+  type?: string
+  status?: string
+  siteId?: string
+  tenantId?: string
+}
+```
+
+### Method Documentation
+
+Class methods should follow the same pattern as functions:
+
+```typescript
+/**
+ * Discover all resources across all Proxmox nodes
+ * 
+ * @returns Array of normalized resources (VMs) from all nodes
+ * @throws {Error} If API connection fails or nodes cannot be retrieved
+ * @example
+ * ```typescript
+ * const resources = await adapter.discoverResources();
+ * console.log(`Found ${resources.length} VMs`);
+ * ```
+ */
+async discoverResources(): Promise<NormalizedResource[]> {
+  // implementation
+}
+```
+
+## Inline Comments
+
+### When to Use Inline Comments
+
+- **Complex logic**: Explain non-obvious algorithms or business rules
+- **Workarounds**: Document temporary fixes or known issues
+- **Performance optimizations**: Explain why a particular approach was chosen
+- **Business rules**: Document domain-specific logic
+
+### Comment Style
+
+```typescript
+// Good: Explains why, not what
+// Tenant-aware filtering (superior to Azure multi-tenancy)
+if (context.tenantContext) {
+  // System admins can see all resources
+  if (context.tenantContext.isSystemAdmin) {
+    // No filtering needed
+  } else if (context.tenantContext.tenantId) {
+    // Filter by tenant ID
+    query += ` AND r.tenant_id = $${paramCount}`
+  }
+}
+
+// Bad: States the obvious
+// Loop through nodes
+for (const node of nodes) {
+  // Get VMs
+  const vms = await this.getVMs(node.node)
+}
+```
+
+## TODO Comments
+
+Use TODO comments for known improvements:
+
+```typescript
+// TODO: Add rate limiting to prevent API abuse
+// TODO: Implement caching for frequently accessed resources
+// FIXME: This workaround should be removed when upstream issue is fixed
+```
+
+## Documentation Checklist
+
+When adding new code, ensure:
+
+- [ ] Public functions have JSDoc comments
+- [ ] Complex private functions have inline comments
+- [ ] Classes have class-level documentation
+- [ ] Interfaces have documentation for complex types
+- [ ] Examples are provided for public APIs
+- [ ] Error cases are documented with `@throws`
+- [ ] Complex algorithms have explanatory comments
+- [ ] Business rules are documented
+
+## Tools
+
+- **TypeScript**: Built-in JSDoc support
+- **VS Code**: JSDoc snippets and IntelliSense
+- **TSDoc**: Standard for TypeScript documentation comments
+
+---
+
+**Last Updated**: 2025-01-09
+
--- a/docs/guides/CONTRIBUTING.md
+++ b/docs/guides/CONTRIBUTING.md
@@ -0,0 +1,77 @@
+# Contributing to Sankofa
+
+**Last Updated**: 2025-01-09
+
+Thank you for your interest in contributing to Sankofa! This document provides guidelines and instructions for contributing to the Sankofa ecosystem and Sankofa Phoenix cloud platform.
+
+## Code of Conduct
+
+- Be respectful and inclusive
+- Welcome newcomers and help them learn
+- Focus on constructive feedback
+- Respect different viewpoints and experiences
+
+## Getting Started
+
+1. Fork the repository
+2. Clone your fork: `git clone https://github.com/yourusername/Sankofa.git`
+3. Create a branch: `git checkout -b feature/your-feature-name`
+4. Make your changes
+5. Commit your changes: `git commit -m "Add your feature"`
+6. Push to your fork: `git push origin feature/your-feature-name`
+7. Open a Pull Request
+
+## Development Setup
+
+See [DEVELOPMENT.md](./DEVELOPMENT.md) for detailed setup instructions.
+
+## Pull Request Process
+
+1. Ensure your code follows the project's style guidelines
+2. Add tests for new features
+3. Ensure all tests pass: `pnpm test`
+4. Update documentation as needed
+5. Ensure your branch is up to date with the main branch
+6. Submit your PR with a clear description
+
+## Coding Standards
+
+### TypeScript/JavaScript
+
+- Use TypeScript for all new code
+- Follow the existing code style
+- Use meaningful variable and function names
+- Add JSDoc comments for public APIs
+- Avoid `any` types - use proper typing
+
+### React Components
+
+- Use functional components with hooks
+- Keep components small and focused
+- Extract reusable logic into custom hooks
+- Use proper prop types or TypeScript interfaces
+
+### Git Commits
+
+- Use clear, descriptive commit messages
+- Follow conventional commits format when possible
+- Keep commits focused on a single change
+
+## Testing
+
+- Write tests for all new features
+- Ensure existing tests still pass
+- Aim for >80% code coverage
+- Test both success and error cases
+
+## Documentation
+
+- Update README.md if needed
+- Add JSDoc comments for new functions
+- Update API documentation for backend changes
+- Keep architecture docs up to date
+
+## Questions?
+
+Feel free to open an issue for questions or reach out to the maintainers.
+
--- a/docs/guides/DEVELOPMENT.md
+++ b/docs/guides/DEVELOPMENT.md
@@ -0,0 +1,184 @@
+# Development Guide
+
+**Last Updated**: 2025-01-09
+
+This guide will help you set up your development environment for Sankofa Phoenix.
+
+## Prerequisites
+
+- Node.js 18+ and pnpm (or npm/yarn)
+- PostgreSQL 14+ (for API)
+- Go 1.21+ (for Crossplane provider)
+- Docker (optional, for local services)
+
+## Initial Setup
+
+### 1. Clone the Repository
+
+```bash
+git clone https://github.com/sankofa/Sankofa.git
+cd Sankofa
+```
+
+### 2. Install Dependencies
+
+```bash
+# Main application
+pnpm install
+
+# Portal
+cd portal
+npm install
+cd ..
+
+# API
+cd api
+npm install
+cd ..
+
+# Crossplane Provider
+cd crossplane-provider-proxmox
+go mod download
+cd ..
+```
+
+### 3. Set Up Environment Variables
+
+Create `.env.local` files:
+
+```bash
+# Root .env.local
+cp .env.example .env.local
+
+# Portal .env.local
+cd portal
+cp .env.example .env.local
+cd ..
+
+# API .env.local
+cd api
+cp .env.example .env.local
+cd ..
+```
+
+### 4. Set Up Database
+
+```bash
+# Create database
+createdb sankofa
+
+# Run migrations
+cd api
+npm run db:migrate
+```
+
+## Running the Application
+
+### Development Mode
+
+```bash
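+# Each dev server runs in the foreground: use one terminal per service,
+# starting from the repository root.
+
+# Main app (port 3000)
+pnpm dev
+
+# Portal (port 3001)
+(cd portal && npm run dev)
+
+# API (port 4000)
+(cd api && npm run dev)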
+```
+
+### Running Tests
+
+```bash
+# Main app tests
+pnpm test
+
+# Portal tests (subshell keeps the working directory at the repository root)
+(cd portal && npm test)
+
+# Crossplane provider tests
+(cd crossplane-provider-proxmox && go test ./...)
+```
+
+## Project Structure
+
+```
+Sankofa/
+├── src/              # Main Next.js app
+├── portal/           # Portal application
+├── api/              # GraphQL API server
+├── crossplane-provider-proxmox/  # Crossplane provider
+├── gitops/           # GitOps configurations
+├── cloudflare/       # Cloudflare configs
+└── docs/             # Documentation
+```
+
+## Common Tasks
+
+### Adding a New Component
+
+1. Create component in `src/components/`
+2. Add tests in `src/components/**/*.test.tsx`
+3. Export from appropriate index file
+4. Update Storybook (if applicable)
+
+### Adding a New API Endpoint
+
+1. Add GraphQL type definition in `api/src/schema/typeDefs.ts`
+2. Add resolver in `api/src/schema/resolvers.ts`
+3. Add service logic in `api/src/services/`
+4. Add tests
+
+### Database Migrations
+
+```bash
+cd api
+# Create migration
+npm run db:migrate:create migration-name
+
+# Run migrations
+npm run db:migrate
+```
+
+## Debugging
+
+### Frontend
+
+- Use React DevTools
+- Check browser console
+- Use Next.js debug mode: `NODE_OPTIONS='--inspect' pnpm dev`
+
+### Backend
+
+- Use VS Code debugger
+- Check API logs
+- Use GraphQL Playground at `http://localhost:4000/graphql`
+
+## Code Quality
+
+### Linting
+
+```bash
+pnpm lint
+```
+
+### Type Checking
+
+```bash
+pnpm type-check
+```
+
+### Formatting
+
+```bash
+pnpm format
+```
+
+## Troubleshooting
+
+See [TROUBLESHOOTING.md](./TROUBLESHOOTING.md) for common issues and solutions.
+
--- a/docs/guides/FORCE_UNLOCK_INSTRUCTIONS.md
+++ b/docs/guides/FORCE_UNLOCK_INSTRUCTIONS.md
@@ -0,0 +1,134 @@
+# Force Unlock VM Instructions
+
+**Date**: 2025-12-09  
+**Issue**: `qm unlock 100` is timing out
+
+---
+
+## Problem
+
+The `qm unlock` command is timing out, which indicates:
+- A stuck process is holding the lock
+- The lock file is corrupted or in an invalid state
+- Another operation is blocking the unlock
+
+---
+
+## Solution: Force Unlock
+
+### Option 1: Use the Script (Recommended)
+
+**On Proxmox Node (root@ml110-01)**:
+
+```bash
+# Copy the script to the Proxmox node
+# Or run commands manually (see Option 2)
+
+# Run the script
+bash force-unlock-vm-proxmox.sh 100
+```
+
+### Option 2: Manual Commands
+
+**On Proxmox Node (root@ml110-01)**:
+
+```bash
+# 1. Check for stuck processes
+ps aux | grep -E 'qm|qemu' | grep 100
+
+# 2. Check lock file
+ls -la /var/lock/qemu-server/lock-100.conf
+cat /var/lock/qemu-server/lock-100.conf 2>/dev/null
+
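+# 3. Kill stuck processes (if found)
+# Caution: these patterns match any command line containing "100" after
+# "qm" or "qemu" (for example VM 1000), and the second one also matches the
+# running QEMU process of VM 100 itself. Review the output of step 1 first.
+pkill -9 -f 'qm.*100'
+pkill -9 -f 'qemu.*100'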
+
+# 4. Wait a moment
+sleep 2
+
+# 5. Force remove lock file
+rm -f /var/lock/qemu-server/lock-100.conf
+
+# 6. Verify lock is gone
+ls -la /var/lock/qemu-server/lock-100.conf
+# Should show: No such file or directory
+
+# 7. Check VM status
+qm status 100
+
+# 8. Try unlock again (should work now)
+qm unlock 100
+```
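+
+The commands above hard-code VMID 100. When adapting them for another VM, it may help to validate the ID before building the lock path, so that an empty or malformed argument never reaches `rm -f`. A minimal sketch, assuming a hypothetical helper name `vm_lock_path` (it is not part of `force-unlock-vm-proxmox.sh` as far as this guide shows):
+
+```bash
+# Hypothetical helper: print the lock file path for a numeric VMID.
+# Rejects empty or non-numeric input and returns a non-zero status.
+vm_lock_path() {
+  case "$1" in
+    ''|*[!0-9]*)
+      echo "invalid VMID: $1" >&2
+      return 1
+      ;;
+  esac
+  printf '/var/lock/qemu-server/lock-%s.conf\n' "$1"
+}
+
+# Example: resolve the path first, then inspect it before removing anything
+lock_file="$(vm_lock_path 100)" && echo "$lock_file"
+```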
+
+---
+
+## If Lock Persists
+
+### Check for Other Issues
+
+```bash
+# Check if VM is in a transitional state
+qm status 100
+
+# Check VM configuration
+qm config 100
+
+# Check for other locks
+ls -la /var/lock/qemu-server/lock-*.conf
+
+# Check system resources
+df -h
+free -h
+```
+
+### Nuclear Option: Restart Proxmox Services
+
+**⚠️ WARNING: This will affect all VMs on the node**
+
+```bash
+# Only if absolutely necessary
+systemctl restart pve-cluster
+systemctl restart pvedaemon
+```
+
+---
+
+## After Successful Unlock
+
+1. **Monitor VM Status**:
+   ```bash
+   qm status 100
+   ```
+
+2. **Check Provider Logs** (from Kubernetes):
+   ```bash
+   kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=50 -f
+   ```
+
+3. **Watch VM Resource**:
+   ```bash
+   kubectl get proxmoxvm basic-vm-001 -w
+   ```
+
+4. **Expected Outcome**:
+   - Provider will retry within 1 minute
+   - VM configuration will complete
+   - VM will boot successfully
+
+---
+
+## Prevention
+
+To prevent this issue in the future:
+
+1. **Ensure proper VM shutdown** before operations
+2. **Wait for operations to complete** before starting new ones
+3. **Monitor for stuck processes** regularly
+4. **Implement lock timeout handling** in provider code (already added)
+
+---
+
+**Last Updated**: 2025-12-09  
+**Status**: ⚠️ **MANUAL FORCE UNLOCK REQUIRED**
+
--- a/docs/guides/KEYCLOAK_DEPLOYMENT.md
+++ b/docs/guides/KEYCLOAK_DEPLOYMENT.md
@@ -0,0 +1,217 @@
+# Keycloak Deployment Guide
+
+**Last Updated**: 2025-01-09
+
+This guide covers deploying and configuring Keycloak for the Sankofa Phoenix platform.
+
+## Prerequisites
+
+- Kubernetes cluster with admin access
+- kubectl configured
+- Helm 3.x installed
+- PostgreSQL database (for Keycloak persistence)
+- Domain name configured (e.g., `keycloak.sankofa.nexus`)
+
+## Deployment Steps
+
+### 1. Deploy Keycloak via Helm
+
+```bash
+# Add the Bitnami Helm repository (hosts the Keycloak chart)
+helm repo add bitnami https://charts.bitnami.com/bitnami
+helm repo update
+
+# Create namespace
+kubectl create namespace keycloak
+
+# Deploy Keycloak
+helm install keycloak bitnami/keycloak \
+  --namespace keycloak \
+  --set auth.adminUser=admin \
+  --set auth.adminPassword=$(openssl rand -base64 32) \
+  --set postgresql.enabled=true \
+  --set postgresql.auth.postgresPassword=$(openssl rand -base64 32) \
+  --set ingress.enabled=true \
+  --set ingress.hostname=keycloak.sankofa.nexus \
+  --set ingress.tls=true \
+  --set ingress.certManager=true \
+  --set service.type=ClusterIP \
+  --set service.port=8080
+```
+
+### 2. Configure Keycloak Clients
+
+Apply the client configuration:
+
+```bash
+kubectl apply -f gitops/apps/keycloak/keycloak-clients.yaml
+```
+
+Or configure manually via Keycloak Admin Console:
+
+#### Portal Client
+- **Client ID**: `portal-client`
+- **Client Protocol**: `openid-connect`
+- **Access Type**: `confidential`
+- **Valid Redirect URIs**: 
+  - `https://portal.sankofa.nexus/*`
+  - `http://localhost:3000/*` (for development)
+- **Web Origins**: `+`
+- **Standard Flow Enabled**: Yes
+- **Direct Access Grants Enabled**: Yes
+
+#### API Client
+- **Client ID**: `api-client`
+- **Client Protocol**: `openid-connect`
+- **Access Type**: `confidential`
+- **Service Accounts Enabled**: Yes
+- **Standard Flow Enabled**: Yes
+
+### 3. Configure Multi-Realm Support
+
+For multi-tenant support, create realms per tenant:
+
+```bash
+# Create realm for tenant
+kubectl exec -it -n keycloak deployment/keycloak -- \
+  /opt/bitnami/keycloak/bin/kcadm.sh create realms \
+  -s realm=tenant-1 \
+  -s enabled=true \
+  --no-config \
+  --server http://localhost:8080 \
+  --realm master \
+  --user admin \
+  --password $(kubectl get secret keycloak-admin -n keycloak -o jsonpath='{.data.password}' | base64 -d)
+```
+
+### 4. Configure Identity Providers
+
+#### LDAP/Active Directory
+1. Navigate to User Federation in Keycloak Admin Console (LDAP and Active Directory are configured there, not under Identity Providers)
+2. Add LDAP provider
+3. Configure connection settings:
+   - **Vendor**: Active Directory (or other)
+   - **Connection URL**: `ldap://your-ldap-server:389`
+   - **Users DN**: `ou=Users,dc=example,dc=com`
+   - **Bind DN**: `cn=admin,dc=example,dc=com`
+   - **Bind Credential**: (stored in secret)
+
+#### SAML Providers
+1. Add SAML 2.0 provider
+2. Configure:
+   - **Entity ID**: Your SAML entity ID
+   - **SSO URL**: Your SAML SSO endpoint
+   - **Signing Certificate**: Your SAML signing certificate
+
+### 5. Enable Blockchain Identity Verification
+
+For blockchain-based identity verification:
+
+1. Install Keycloak Identity Provider plugin (if available)
+2. Configure blockchain connection:
+   - **Blockchain RPC URL**: `https://besu.sankofa.nexus:8545`
+   - **Contract Address**: (deployed identity contract)
+   - **Private Key**: (stored in Kubernetes Secret)
+
+### 6. Configure Environment Variables
+
+Update API service environment variables:
+
+```yaml
+env:
+  - name: KEYCLOAK_URL
+    value: "https://keycloak.sankofa.nexus"
+  - name: KEYCLOAK_REALM
+    value: "master"  # or tenant-specific realm
+  - name: KEYCLOAK_CLIENT_ID
+    value: "api-client"
+  - name: KEYCLOAK_CLIENT_SECRET
+    valueFrom:
+      secretKeyRef:
+        name: keycloak-client-secret
+        key: api-client-secret
+```
+
+### 7. Set Up Secrets
+
+Create Kubernetes secrets for client credentials:
+
+```bash
+# Create secret for API client
+kubectl create secret generic keycloak-client-secret \
+  --from-literal=api-client-secret=$(openssl rand -base64 32) \
+  --namespace keycloak
+
+# Create secret for portal client
+kubectl create secret generic keycloak-portal-secret \
+  --from-literal=portal-client-secret=$(openssl rand -base64 32) \
+  --namespace keycloak
+```
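+
+Because the commands above generate each value inline, the secret is never shown and has to be read back from the cluster before it can be entered in the Keycloak client settings. A minimal sketch that captures the value in a variable first, assuming a hypothetical helper name `gen_secret` and a fallback to `/dev/urandom` when `openssl` is not installed:
+
+```bash
+# Hypothetical helper: print 32 random bytes as a single base64 line.
+gen_secret() {
+  if command -v openssl >/dev/null 2>&1; then
+    openssl rand -base64 32
+  else
+    head -c 32 /dev/urandom | base64
+  fi | tr -d '\n'
+}
+
+# Generate once, then reuse the same value wherever it is needed
+API_CLIENT_SECRET="$(gen_secret)"
+
+# kubectl create secret generic keycloak-client-secret \
+#   --from-literal=api-client-secret="$API_CLIENT_SECRET" \
+#   --namespace keycloak
+```
+
+The same variable can then be pasted into the client's credentials in Keycloak, which keeps the two copies of the secret in sync.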
+
+### 8. Configure Cloudflare Access
+
+If using Cloudflare Zero Trust:
+
+1. Configure Cloudflare Access application for Keycloak
+2. Set domain: `keycloak.sankofa.nexus`
+3. Configure access policies (see `cloudflare/access-policies.yaml`)
+4. Require MFA for admin access
+
+### 9. Verify Deployment
+
+```bash
+# Check Keycloak pods
+kubectl get pods -n keycloak
+
+# Check Keycloak service
+kubectl get svc -n keycloak
+
+# Test Keycloak health
+curl https://keycloak.sankofa.nexus/health
+
+# Access Admin Console
+# https://keycloak.sankofa.nexus/admin
+```
+
+### 10. Post-Deployment Configuration
+
+1. **Change Admin Password**: Change default admin password immediately
+2. **Configure Email**: Set up SMTP for password reset emails
+3. **Enable MFA**: Configure TOTP and backup codes
+4. **Set Up Themes**: Customize Keycloak themes for branding
+5. **Configure Events**: Set up event listeners for audit logging
+6. **Backup Configuration**: Export realm configuration regularly
+
+## Troubleshooting
+
+### Keycloak Not Starting
+- Check PostgreSQL connection
+- Verify resource limits
+- Check logs: `kubectl logs -n keycloak deployment/keycloak`
+
+### Client Authentication Failing
+- Verify client secret matches
+- Check redirect URIs are correct
+- Verify realm name matches
+
+### Multi-Realm Issues
+- Ensure realm names match tenant IDs
+- Verify realm is enabled
+- Check realm configuration
+
+## Security Best Practices
+
+1. **Use Strong Passwords**: Generate strong passwords for all accounts
+2. **Enable MFA**: Require MFA for admin and privileged users
+3. **Rotate Secrets**: Regularly rotate client secrets
+4. **Monitor Access**: Enable audit logging
+5. **Use HTTPS**: Always use TLS for Keycloak
+6. **Limit Admin Access**: Restrict admin console access via Cloudflare Access
+7. **Backup Regularly**: Export and backup realm configurations
+
+## References
+
+- [Keycloak Documentation](https://www.keycloak.org/documentation)
+- [Keycloak Helm Chart](https://github.com/bitnami/charts/tree/main/bitnami/keycloak)
+- Client configuration: `gitops/apps/keycloak/keycloak-clients.yaml`
+
--- a/docs/guides/MIGRATION_GUIDE.md
+++ b/docs/guides/MIGRATION_GUIDE.md
@@ -0,0 +1,237 @@
+# Migration Guide
+
+**Last Updated**: 2025-01-09
+
+## Overview
+
+This guide provides instructions for migrating between versions of Sankofa Phoenix and migrating from other platforms.
+
+## Table of Contents
+
+- [API Version Migration](#api-version-migration)
+- [Database Migration](#database-migration)
+- [Configuration Migration](#configuration-migration)
+- [Azure Migration](#azure-migration)
+- [Deployment Migration](#deployment-migration)
+
+---
+
+## API Version Migration
+
+### Migrating Between API Versions
+
+See [API Versioning Guide](./api/API_VERSIONING.md) for detailed API migration instructions.
+
+### Quick Steps
+
+1. Review API changelog for breaking changes
+2. Update client code to use new API version
+3. Test all API interactions
+4. Deploy updated client code
+5. Monitor for issues
+
+---
+
+## Database Migration
+
+### Schema Migrations
+
+Database migrations are run through npm scripts in the `api` package:
+
+```bash
+# Run migrations
+cd api
+npm run db:migrate
+
+# Rollback if needed
+npm run db:rollback
+```
+
+### Manual Migration Steps
+
+1. **Backup Database**: Always backup before migration
+   ```bash
+   pg_dump sankofa > backup_$(date +%Y%m%d).sql
+   ```
+
+2. **Run Migrations**: Execute migration scripts
+   ```bash
+   npm run db:migrate
+   ```
+
+3. **Verify Migration**: Check migration status
+   ```bash
+   npm run db:migrate:status
+   ```
+
+4. **Test Application**: Verify application functionality
+5. **Monitor**: Watch for errors post-migration
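+
+The first three steps can be chained so that migrations never run when the backup fails. A minimal sketch, assuming hypothetical function names `backup_file_name` and `migrate_with_backup`, and that it is run from the `api` directory with access to the `sankofa` database:
+
+```bash
+# Same naming scheme as the pg_dump command in step 1
+backup_file_name() {
+  printf 'backup_%s.sql\n' "$(date +%Y%m%d)"
+}
+
+# Hypothetical wrapper: each step runs only if the previous one succeeded.
+# Note that a failed pg_dump can still leave an empty backup file behind.
+migrate_with_backup() {
+  pg_dump sankofa > "$(backup_file_name)" &&
+    npm run db:migrate &&
+    npm run db:migrate:status
+}
+
+# Example: show the file name a backup taken today would use
+backup_file_name
+```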
+
+### Data Migration
+
+For data migrations:
+
+1. **Export Data**: Export from source
+2. **Transform Data**: Apply necessary transformations
+3. **Import Data**: Import to new schema
+4. **Validate**: Verify data integrity
+5. **Update References**: Update any code references
+
+---
+
+## Configuration Migration
+
+### Environment Variables
+
+When updating configuration:
+
+1. **Review Changes**: Check configuration changes in release notes
+2. **Update `.env` Files**: Update environment variables
+3. **Test Configuration**: Verify configuration is correct
+4. **Deploy**: Deploy updated configuration
+
+### Configuration Files
+
+```bash
+# Backup current configuration
+cp .env.local .env.local.backup
+
+# Update configuration
+# Edit .env.local with new values
+
+# Verify configuration
+npm run config:validate
+```
+
+---
+
+## Azure Migration
+
+### From Azure to Sankofa Phoenix
+
+See [Azure Migration Guide](./tenants/AZURE_MIGRATION.md) for comprehensive Azure migration instructions.
+
+### Key Migration Areas
+
+1. **Identity**: Migrate from Azure AD to Keycloak
+2. **Resources**: Migrate VMs and resources
+3. **Networking**: Update network configurations
+4. **Storage**: Migrate data and storage
+5. **Applications**: Update application configurations
+
+---
+
+## Deployment Migration
+
+### Upgrading Deployment
+
+1. **Review Release Notes**: Check for breaking changes
+2. **Update Dependencies**: Update package versions
+3. **Run Tests**: Ensure all tests pass
+4. **Deploy**: Follow deployment procedures
+5. **Verify**: Confirm deployment success
+
+### Rolling Back Deployment
+
+1. **Identify Issue**: Determine what needs rollback
+2. **Stop Services**: Stop affected services
+3. **Restore Previous Version**: Deploy previous version
+4. **Restore Database** (if needed): Restore database backup
+5. **Verify**: Confirm rollback success
+
+---
+
+## Common Migration Scenarios
+
+### Scenario 1: Minor Version Update
+
+**Steps:**
+1. Review changelog
+2. Update dependencies
+3. Run tests
+4. Deploy
+5. Verify
+
+### Scenario 2: Major Version Update
+
+**Steps:**
+1. Review migration guide for major version
+2. Backup all data
+3. Update configuration
+4. Run database migrations
+5. Update code for breaking changes
+6. Test thoroughly
+7. Deploy in staging first
+8. Deploy to production
+9. Monitor closely
+
+### Scenario 3: Platform Migration
+
+**Steps:**
+1. Plan migration timeline
+2. Set up new platform
+3. Migrate data
+4. Migrate applications
+5. Update DNS/configurations
+6. Test thoroughly
+7. Cutover
+8. Monitor and verify
+
+---
+
+## Migration Checklist
+
+### Pre-Migration
+
+- [ ] Review migration documentation
+- [ ] Backup all data
+- [ ] Test migration in staging
+- [ ] Notify stakeholders
+- [ ] Schedule migration window
+
+### During Migration
+
+- [ ] Execute migration steps
+- [ ] Monitor progress
+- [ ] Verify each step
+- [ ] Document any issues
+
+### Post-Migration
+
+- [ ] Verify all functionality
+- [ ] Test critical paths
+- [ ] Monitor for errors
+- [ ] Update documentation
+- [ ] Communicate completion
+
+---
+
+## Troubleshooting
+
+### Common Issues
+
+1. **Migration Fails**: Check logs, rollback if needed
+2. **Data Loss**: Restore from backup
+3. **Configuration Errors**: Verify environment variables
+4. **Service Downtime**: Check service status and logs
+
+### Getting Help
+
+- Check [Troubleshooting Guide](./TROUBLESHOOTING_GUIDE.md)
+- Review migration documentation
+- Check logs for specific errors
+- Contact support if needed
+
+---
+
+## Related Documentation
+
+- [API Versioning Guide](./api/API_VERSIONING.md)
+- [Deployment Guide](./DEPLOYMENT.md)
+- [Troubleshooting Guide](./TROUBLESHOOTING_GUIDE.md)
+- [Azure Migration Guide](./tenants/AZURE_MIGRATION.md)
+
+---
+
+**Note**: Always backup data before performing migrations. Test migrations in a staging environment first.
+
--- a/docs/guides/MONITORING_GUIDE.md
+++ b/docs/guides/MONITORING_GUIDE.md
@@ -0,0 +1,339 @@
+# Monitoring and Observability Guide
+
+**Last Updated**: 2025-01-09
+
+This guide covers monitoring setup, Grafana dashboards, and observability for Sankofa Phoenix.
+
+## Overview
+
+Sankofa Phoenix uses a comprehensive monitoring stack:
+- **Prometheus**: Metrics collection and storage
+- **Grafana**: Visualization and dashboards
+- **Loki**: Log aggregation
+- **Alertmanager**: Alert routing and notification
+
+## Tenant-Aware Metrics
+
+All metrics are tagged with tenant IDs for multi-tenant isolation.
+
+### Metric Naming Convention
+
+```
+sankofa_<component>_<metric>_<unit>{tenant_id="<id>",...}
+```
+
+Examples:
+- `sankofa_api_requests_total{tenant_id="tenant-1",method="POST",status="200"}`
+- `sankofa_billing_cost_usd{tenant_id="tenant-1",service="compute"}`
+- `sankofa_proxmox_vm_cpu_usage_percent{tenant_id="tenant-1",vm_id="101"}`
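+
+A quick way to check names against this convention is a pattern match. The sketch below assumes a hypothetical helper name `is_sankofa_metric` and one reading of the convention: the `sankofa_` prefix, a component, and at least one further lowercase segment. It does not verify that the last segment is a real unit.
+
+```bash
+# Hypothetical helper: succeed if $1 looks like sankofa_<component>_<metric>[_<unit>]
+is_sankofa_metric() {
+  printf '%s\n' "$1" | grep -Eq '^sankofa_[a-z][a-z0-9]*(_[a-z0-9]+)+$'
+}
+
+# Example: report on one of the names listed above
+if is_sankofa_metric "sankofa_billing_cost_usd"; then
+  echo "name follows the convention"
+fi
+```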
+
+## Grafana Dashboards
+
+### 1. System Overview Dashboard
+
+**Location**: `grafana/dashboards/system-overview.json`
+
+**Metrics**:
+- API request rate and latency
+- Database connection pool usage
+- Keycloak authentication rate
+- System resource usage (CPU, memory, disk)
+
+**Panels**:
+- Request rate (requests/sec)
+- P95 latency (ms)
+- Error rate (%)
+- Active connections
+- Authentication success rate
+
+### 2. Tenant Dashboard
+
+**Location**: `grafana/dashboards/tenant-overview.json`
+
+**Metrics**:
+- Tenant resource usage
+- Tenant cost tracking
+- Tenant API usage
+- Tenant user activity
+
+**Panels**:
+- Resource usage by tenant
+- Cost breakdown by tenant
+- API calls by tenant
+- Active users by tenant
+
+### 3. Billing Dashboard
+
+**Location**: `grafana/dashboards/billing.json`
+
+**Metrics**:
+- Real-time cost tracking
+- Cost by service/resource
+- Budget vs actual spend
+- Cost forecast
+- Billing anomalies
+
+**Panels**:
+- Current month cost
+- Cost trend (7d, 30d)
+- Top resources by cost
+- Budget utilization
+- Anomaly detection alerts
+
+### 4. Proxmox Infrastructure Dashboard
+
+**Location**: `grafana/dashboards/proxmox-infrastructure.json`
+
+**Metrics**:
+- VM status and health
+- Node resource usage
+- Storage utilization
+- Network throughput
+- VM creation/deletion rate
+
+**Panels**:
+- VM status overview
+- Node CPU/memory usage
+- Storage pool usage
+- Network I/O
+- VM lifecycle events
+
+### 5. Security Dashboard
+
+**Location**: `grafana/dashboards/security.json`
+
+**Metrics**:
+- Authentication events
+- Failed login attempts
+- Policy violations
+- Incident response metrics
+- Audit log events
+
+**Panels**:
+- Authentication success/failure rate
+- Policy violations by severity
+- Incident response time
+- Audit log volume
+- Security events timeline
+
+## Prometheus Configuration
+
+### Scrape Configs
+
+```yaml
+scrape_configs:
+  - job_name: 'sankofa-api'
+    kubernetes_sd_configs:
+      - role: pod
+        namespaces:
+          names:
+            - api
+    relabel_configs:
+      - source_labels: [__meta_kubernetes_pod_label_app]
+        action: keep
+        regex: api
+    metric_relabel_configs:
+      - source_labels: [tenant_id]
+        target_label: tenant_id
+        regex: '(.+)'
+        replacement: '${1}'
+
+  - job_name: 'proxmox'
+    static_configs:
+      - targets:
+          - proxmox-exporter:9091
+    relabel_configs:
+      - source_labels: [__address__]
+        target_label: instance
+```
+
+### Recording Rules
+
+```yaml
+groups:
+  - name: sankofa_rules
+    interval: 30s
+    rules:
+      - record: sankofa:api:requests:rate5m
+        expr: rate(sankofa_api_requests_total[5m])
+      
+      - record: sankofa:billing:cost:rate1h
+        expr: rate(sankofa_billing_cost_usd[1h])
+      
+      - record: sankofa:proxmox:vm:count
+        expr: count(sankofa_proxmox_vm_info) by (tenant_id)
+```
+
+## Alerting Rules
+
+### Critical Alerts
+
+```yaml
+groups:
+  - name: sankofa_critical
+    interval: 30s
+    rules:
+      - alert: HighErrorRate
+        expr: rate(sankofa_api_requests_total{status=~"5.."}[5m]) > 0.1
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "High error rate detected"
+          description: "Error rate is {{ $value }} errors/sec"
+      
+      - alert: DatabaseConnectionPoolExhausted
+        expr: sankofa_db_connections_active / sankofa_db_connections_max > 0.9
+        for: 2m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Database connection pool nearly exhausted"
+      
+      - alert: BudgetExceeded
+        expr: sankofa_billing_cost_usd / sankofa_billing_budget_usd > 1.0
+        for: 1h
+        labels:
+          severity: warning
+        annotations:
+          summary: "Budget exceeded for tenant {{ $labels.tenant_id }}"
+      
+      - alert: ProxmoxNodeDown
+        expr: up{job="proxmox"} == 0
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Proxmox node {{ $labels.instance }} is down"
+```
+
+### Billing Anomaly Detection
+
+```yaml
+groups:
+  - name: sankofa_billing_anomalies
+    interval: 1h
+    rules:
+      - alert: CostAnomalyDetected
+        expr: |
+          (
+            sankofa_billing_cost_usd
+            - predict_linear(sankofa_billing_cost_usd[7d], 3600)
+          ) / predict_linear(sankofa_billing_cost_usd[7d], 3600) > 0.5
+        for: 2h
+        labels:
+          severity: warning
+        annotations:
+          summary: "Unusual cost increase detected for tenant {{ $labels.tenant_id }}"
+```
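+
+The expression above compares the current cost with a linear extrapolation of the last seven days. The following is a minimal sketch of that arithmetic, assuming a simple least-squares fit similar in spirit to `predict_linear`; the `Sample` shape and all function names are illustrative and are not part of the platform code.
+
+```typescript
+// Sketch only: least-squares extrapolation plus the relative-deviation check
+// used by the CostAnomalyDetected rule. Names and types are assumptions.
+interface Sample {
+  t: number; // Unix timestamp in seconds
+  v: number; // metric value, e.g. cost in USD
+}
+
+// Fit v = slope * t + intercept over the samples and evaluate the line
+// `secondsAhead` seconds after the last sample.
+function predictLinear(samples: Sample[], secondsAhead: number): number {
+  if (samples.length < 2) {
+    throw new Error("need at least two samples to fit a line");
+  }
+  const n = samples.length;
+  const meanT = samples.reduce((sum, s) => sum + s.t, 0) / n;
+  const meanV = samples.reduce((sum, s) => sum + s.v, 0) / n;
+  let covariance = 0;
+  let variance = 0;
+  for (const s of samples) {
+    covariance += (s.t - meanT) * (s.v - meanV);
+    variance += (s.t - meanT) ** 2;
+  }
+  const slope = variance === 0 ? 0 : covariance / variance;
+  const intercept = meanV - slope * meanT;
+  const lastT = samples[n - 1].t;
+  return slope * (lastT + secondsAhead) + intercept;
+}
+
+// (current - predicted) / predicted, the quantity compared against 0.5.
+function relativeDeviation(current: number, predicted: number): number {
+  return (current - predicted) / predicted;
+}
+
+// True when the current value exceeds the prediction by more than `threshold`.
+function isCostAnomaly(samples: Sample[], current: number, threshold = 0.5): boolean {
+  const predicted = predictLinear(samples, 3600);
+  return relativeDeviation(current, predicted) > threshold;
+}
+```
+
+A threshold of `0.5` means the rule fires only when the observed cost is more than 50% above the extrapolated value, which is why a steady, predictable increase does not trigger it.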
+
+## Real-Time Cost Tracking
+
+### Metrics Exposed
+
+- `sankofa_billing_cost_usd{tenant_id, service, resource_id}` - Current cost
+- `sankofa_billing_cost_rate_usd_per_hour{tenant_id}` - Cost rate
+- `sankofa_billing_budget_usd{tenant_id}` - Budget limit
+- `sankofa_billing_budget_utilization_percent{tenant_id}` - Budget usage %
+
+### Grafana Query Example
+
+```promql
+# Current month cost by tenant
+sum(sankofa_billing_cost_usd) by (tenant_id)
+
+# Projected 7-day cost from the last hour's rate (rate() is per second)
+rate(sankofa_billing_cost_usd[1h]) * 3600 * 24 * 7
+
+# Budget utilization
+sankofa_billing_cost_usd / sankofa_billing_budget_usd * 100
+```
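+
+The same expressions can be sent to the Prometheus HTTP API outside Grafana. The sketch below builds an instant-query URL for `GET /api/v1/query`; the helper name is an assumption, and the base URL is the Prometheus address listed in this guide.
+
+```typescript
+// Sketch: build an instant-query URL for the Prometheus HTTP API.
+// buildInstantQueryUrl is a hypothetical helper, not part of the platform code.
+function buildInstantQueryUrl(baseUrl: string, promql: string, time?: Date): string {
+  const url = new URL("/api/v1/query", baseUrl);
+  // searchParams handles the URL encoding of spaces and operators.
+  url.searchParams.set("query", promql);
+  if (time) {
+    // Prometheus accepts RFC 3339 timestamps or Unix seconds for `time`.
+    url.searchParams.set("time", time.toISOString());
+  }
+  return url.toString();
+}
+
+const budgetUtilization =
+  "sankofa_billing_cost_usd / sankofa_billing_budget_usd * 100";
+const queryUrl = buildInstantQueryUrl("https://prometheus.sankofa.nexus", budgetUtilization);
+```
+
+Encoding the expression through `searchParams` avoids the quoting problems that appear when a PromQL expression containing spaces or `/` is pasted directly into a URL.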
+
+## Log Aggregation
+
+### Loki Configuration
+
+Logs are collected with tenant context:
+
+```yaml
+clients:
+  - url: http://loki:3100/loki/api/v1/push
+    tenant_id: ${TENANT_ID}
+```
+
+### Log Labels
+
+- `tenant_id`: Tenant identifier
+- `service`: Service name (api, portal, etc.)
+- `level`: Log level (info, warn, error)
+- `component`: Component name
+
+### Log Queries
+
+```logql
+# Errors for a specific tenant
+{tenant_id="tenant-1", level="error"}
+
+# API errors (select "Last 1 hour" in the Grafana time picker; LogQL has no now() function)
+{service="api", level="error"}
+
+# Authentication failures
+{component="auth"} | json | status="failed"
+```
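+
+Selectors like the ones above can also be assembled programmatically from the labels listed under Log Labels. This is a minimal sketch; `buildStreamSelector` is a hypothetical helper and only covers equality matchers.
+
+```typescript
+// Sketch: build a LogQL stream selector from label/value pairs.
+// The helper name is an assumption; only `=` matchers are supported.
+function buildStreamSelector(labels: Record<string, string>): string {
+  const matchers = Object.entries(labels).map(([name, value]) => {
+    // Escape backslashes and double quotes inside the quoted value.
+    const escaped = value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
+    return `${name}="${escaped}"`;
+  });
+  return `{${matchers.join(", ")}}`;
+}
+
+const tenantErrors = buildStreamSelector({ tenant_id: "tenant-1", level: "error" });
+```
+
+Escaping the values matters when a label value comes from user input, since an unescaped quote would otherwise terminate the matcher early.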
+
+## Deployment
+
+### Install Monitoring Stack
+
+```bash
+# Add Prometheus Operator Helm repo
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+helm repo update
+
+# Install kube-prometheus-stack
+helm install monitoring prometheus-community/kube-prometheus-stack \
+  --namespace monitoring \
+  --create-namespace \
+  --values grafana/values.yaml
+
+# Dashboards are Grafana JSON, not Kubernetes manifests, so they cannot be
+# applied directly; load them as ConfigMaps (see Import Dashboards)
+```
+
+### Import Dashboards
+
+```bash
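+# Import all dashboards as ConfigMaps
+for dashboard in grafana/dashboards/*.json; do
+  name="$(basename "$dashboard" .json)"
+  kubectl create configmap "$name" \
+    --from-file="$dashboard" \
+    --namespace=monitoring \
+    --dry-run=client -o yaml | kubectl apply -f -
+  # The Grafana sidecar in kube-prometheus-stack discovers ConfigMaps by label
+  # (grafana_dashboard by default; confirm against your values.yaml)
+  kubectl label configmap "$name" grafana_dashboard=1 \
+    --namespace=monitoring --overwrite
+done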
+```
+
+## Access
+
+- **Grafana**: https://grafana.sankofa.nexus
+- **Prometheus**: https://prometheus.sankofa.nexus
+- **Alertmanager**: https://alertmanager.sankofa.nexus
+
+Default credentials (change immediately):
+- Username: `admin`
+- Password: (from secret `monitoring-grafana`)
+
+## Best Practices
+
+1. **Tenant Isolation**: Always filter metrics by tenant_id
+2. **Retention**: Configure appropriate retention periods
+3. **Cardinality**: Avoid high-cardinality labels
+4. **Alerts**: Set up alerting for critical metrics
+5. **Dashboards**: Create tenant-specific dashboards
+6. **Cost Tracking**: Monitor billing metrics closely
+7. **Anomaly Detection**: Enable anomaly detection for billing
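+
+To make the cardinality point concrete: the worst-case number of time series for one metric is the product of the distinct values of each label. The sketch below uses the label names from the billing metrics in this guide; the counts are hypothetical.
+
+```typescript
+// Sketch: upper bound on time series for a single metric.
+// The label counts below are made-up numbers for illustration.
+function estimateSeriesCount(labelCardinalities: Record<string, number>): number {
+  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1);
+}
+
+// 50 tenants * 10 services * 2000 resources = 1,000,000 series in the worst case
+const costSeries = estimateSeriesCount({ tenant_id: 50, service: 10, resource_id: 2000 });
+```
+
+A label such as `resource_id` dominates the product, so it is usually the first candidate to drop or aggregate away in recording rules.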
+
+## References
+
+- Dashboard definitions: `grafana/dashboards/`
+- Prometheus config: `monitoring/prometheus/`
+- Alert rules: `monitoring/alerts/`
+
--- a/docs/guides/OPERATIONS_RUNBOOK.md
+++ b/docs/guides/OPERATIONS_RUNBOOK.md
@@ -0,0 +1,428 @@
+# Operations Runbook
+
+**Last Updated**: 2025-01-09
+
+This runbook provides operational procedures for Sankofa Phoenix.
+
+## Table of Contents
+
+1. [Daily Operations](#daily-operations)
+2. [Tenant Management](#tenant-management)
+3. [Backup Procedures](#backup-procedures)
+4. [Incident Response](#incident-response)
+5. [Maintenance Windows](#maintenance-windows)
+6. [Troubleshooting](#troubleshooting)
+
+## Daily Operations
+
+### Health Checks
+
+```bash
+# Check all pods
+kubectl get pods --all-namespaces
+
+# Check API health
+curl https://api.sankofa.nexus/health
+
+# Check Keycloak health
+curl https://keycloak.sankofa.nexus/health
+
+# Check database connections (sh -c so $DATABASE_URL expands inside the pod)
+kubectl exec -n api deployment/api -- \
+  sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
+```
+
+### Monitoring Dashboard Review
+
+1. Review system overview dashboard
+2. Check error rates and latency
+3. Review billing anomalies
+4. Check security events
+5. Review Proxmox infrastructure status
+
+### Log Review
+
+```bash
+# Recent errors
+kubectl logs -n api deployment/api --tail=100 | grep -i error
+
+# Authentication failures
+kubectl logs -n api deployment/api | grep -i "auth.*fail"
+
+# Billing issues
+kubectl logs -n api deployment/api | grep -i billing
+```
+
+## Tenant Management
+
+### Create New Tenant
+
+```bash
+# Via GraphQL
+mutation {
+  createTenant(input: {
+    name: "New Tenant"
+    domain: "tenant.example.com"
+    tier: STANDARD
+  }) {
+    id
+    name
+    status
+  }
+}
+
+# Or via API
+curl -X POST https://api.sankofa.nexus/graphql \
+  -H "Authorization: Bearer $TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"query": "mutation { createTenant(...) }"}'
+```
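+
+When calling the API from a script, passing the tenant fields as GraphQL variables avoids hand-escaping the mutation inside a JSON string. The following is a sketch only: the `CreateTenantInput` type name is an assumption and should be checked against the actual schema, and `buildCreateTenantBody` is a hypothetical helper.
+
+```typescript
+// Sketch: JSON body for the createTenant mutation using GraphQL variables.
+// `CreateTenantInput` is an assumed input type name; verify it in the schema.
+interface CreateTenantInput {
+  name: string;
+  domain: string;
+  tier: string; // enum value such as "STANDARD"
+}
+
+function buildCreateTenantBody(input: CreateTenantInput): string {
+  const query = `
+    mutation CreateTenant($input: CreateTenantInput!) {
+      createTenant(input: $input) {
+        id
+        name
+        status
+      }
+    }
+  `;
+  // JSON.stringify takes care of quoting and escaping the values.
+  return JSON.stringify({ query, variables: { input } });
+}
+
+const body = buildCreateTenantBody({
+  name: "New Tenant",
+  domain: "tenant.example.com",
+  tier: "STANDARD",
+});
+```
+
+The resulting string can be sent as the request body of the `POST /graphql` call shown above, together with the `Authorization` and `Content-Type: application/json` headers.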
+
+### Suspend Tenant
+
+```graphql
+# Update tenant status
+mutation {
+  updateTenant(id: "tenant-id", input: { status: SUSPENDED }) {
+    id
+    status
+  }
+}
+```
+
+### Delete Tenant
+
+```graphql
+# Soft delete (recommended)
+mutation {
+  updateTenant(id: "tenant-id", input: { status: DELETED }) {
+    id
+    status
+  }
+}
+
+# Hard delete (requires confirmation)
+# This will delete all tenant resources
+```
+
+### Tenant Resource Quotas
+
+```graphql
+# Check quota usage
+query {
+  tenant(id: "tenant-id") {
+    quotaLimits {
+      compute { vcpu memory instances }
+      storage { total perInstance }
+    }
+    usage {
+      totalCost
+      byResource {
+        resourceId
+        cost
+      }
+    }
+  }
+}
+```
+
+## Backup Procedures
+
+### Database Backups
+
+#### Automated Backups
+
+Backups run daily at 2 AM UTC:
+
+```bash
+# Check backup job status
+kubectl get cronjob -n api postgres-backup
+
+# View recent backups
+kubectl get pvc -n api | grep backup
+```
+
+#### Manual Backup
+
+```bash
+# Create backup (no -t: a TTY would corrupt the redirected dump)
+kubectl exec -n api deployment/postgres -- \
+  pg_dump -U sankofa sankofa > backup-$(date +%Y%m%d).sql
+
+# Restore from backup
+kubectl exec -i -n api deployment/postgres -- \
+  psql -U sankofa sankofa < backup-20240101.sql
+```
+
+### Keycloak Backups
+
+```bash
+# Export realm configuration (no -t: a TTY would corrupt the redirected output)
+kubectl exec -n keycloak deployment/keycloak -- \
+  /opt/keycloak/bin/kcadm.sh get realms/master \
+  --no-config \
+  --realm master \
+  --server http://localhost:8080 \
+  --user admin \
+  --password "$ADMIN_PASSWORD" > keycloak-realm-$(date +%Y%m%d).json
+```
+
+### Proxmox Backups
+
+```bash
+# Backup VM configuration
+# Via Proxmox API or UI
+# Store in version control or backup storage
+```
+
+### Tenant-Specific Backups
+
+```graphql
+# Export tenant data
+query {
+  tenant(id: "tenant-id") {
+    id
+    name
+    resources {
+      id
+      name
+      type
+    }
+  }
+}
+
+# Backup tenant resources
+# Use resource export API or database dump filtered by tenant_id
+```
+
+## Incident Response
+
+### Incident Classification
+
+- **P0 - Critical**: System down, data loss, security breach
+- **P1 - High**: Major feature broken, performance degradation
+- **P2 - Medium**: Minor feature broken, non-critical issues
+- **P3 - Low**: Cosmetic issues, minor bugs
+
+### Incident Response Process
+
+1. **Detection**: Monitor alerts, user reports
+2. **Triage**: Classify severity, assign owner
+3. **Containment**: Isolate affected systems
+4. **Investigation**: Root cause analysis
+5. **Resolution**: Fix and verify
+6. **Post-Mortem**: Document and improve
+
+### Common Incidents
+
+#### API Down
+
+```bash
+# Check pod status
+kubectl get pods -n api
+
+# Check logs
+kubectl logs -n api deployment/api --tail=100
+
+# Restart if needed
+kubectl rollout restart deployment/api -n api
+
+# Check database
+kubectl exec -it -n api deployment/postgres -- \
+  psql -U sankofa -c "SELECT 1"
+```
+
+#### Database Connection Issues
+
+```bash
+# Check connection pool
+kubectl exec -it -n api deployment/api -- \
+  curl http://localhost:4000/metrics | grep db_connections
+
+# Restart API to reset connections
+kubectl rollout restart deployment/api -n api
+
+# Check database load
+kubectl exec -it -n api deployment/postgres -- \
+  psql -U sankofa -c "SELECT * FROM pg_stat_activity"
+```
+
+#### High Error Rate
+
+```bash
+# Check error logs
+kubectl logs -n api deployment/api | grep -i error | tail -50
+
+# Check recent deployments
+kubectl rollout history deployment/api -n api
+
+# Rollback if needed
+kubectl rollout undo deployment/api -n api
+```
+
+#### Billing Anomaly
+
+```bash
+# Check billing metrics (quote the URL so the shell does not interpret "?")
+curl 'https://prometheus.sankofa.nexus/api/v1/query?query=sankofa_billing_cost_usd'
+
+# Review recent usage records
+query {
+  usage(tenantId: "tenant-id", timeRange: {...}) {
+    totalCost
+    byResource {
+      resourceId
+      cost
+    }
+  }
+}
+
+# Check for resource leaks
+kubectl get resources --all-namespaces | grep tenant-id
+```
+
+## Maintenance Windows
+
+### Scheduled Maintenance
+
+Maintenance windows are scheduled:
+- **Weekly**: Sunday 2-4 AM UTC (low traffic)
+- **Monthly**: First Sunday 2-6 AM UTC (major updates)
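+
+The schedule above can be checked in code, for example before an automated job starts disruptive work. This is a minimal sketch; `isInMaintenanceWindow` is a hypothetical helper that encodes only the two windows listed here.
+
+```typescript
+// Sketch: check whether a UTC instant falls in a scheduled maintenance window.
+// The function name is an assumption; the schedule mirrors the list above.
+function isInMaintenanceWindow(at: Date): boolean {
+  const isSunday = at.getUTCDay() === 0;
+  if (!isSunday) return false;
+  const hour = at.getUTCHours();
+  // The first Sunday of a month always falls on day 1 through 7.
+  const isFirstSunday = at.getUTCDate() <= 7;
+  // Monthly window runs 02:00-06:00 UTC, weekly window 02:00-04:00 UTC.
+  const endHour = isFirstSunday ? 6 : 4;
+  return hour >= 2 && hour < endHour;
+}
+```
+
+All comparisons use the UTC accessors on `Date`, so the result does not depend on the time zone of the machine running the check.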
+
+### Pre-Maintenance Checklist
+
+- [ ] Notify all tenants (24h advance)
+- [ ] Create backup of database
+- [ ] Create backup of Keycloak
+- [ ] Review recent changes
+- [ ] Prepare rollback plan
+- [ ] Set maintenance mode flag
+
+### Maintenance Mode
+
+```bash
+# Enable maintenance mode
+kubectl set env deployment/api -n api MAINTENANCE_MODE=true
+
+# Disable maintenance mode
+kubectl set env deployment/api -n api MAINTENANCE_MODE=false
+```
+
+### Post-Maintenance Checklist
+
+- [ ] Verify all services are up
+- [ ] Run health checks
+- [ ] Check error rates
+- [ ] Verify backups completed
+- [ ] Notify tenants of completion
+- [ ] Update documentation
+
+## Troubleshooting
+
+### API Not Responding
+
+```bash
+# Check pod status
+kubectl describe pod -n api -l app=api
+
+# Check logs
+kubectl logs -n api -l app=api --tail=100
+
+# Check resource limits
+kubectl top pod -n api
+
+# Check network policies
+kubectl get networkpolicies -n api
+```
+
+### Database Performance Issues
+
+```bash
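+# Check slow queries (requires the pg_stat_statements extension;
+# the column is total_exec_time on PostgreSQL 13+, total_time on older versions)
+kubectl exec -it -n api deployment/postgres -- \
+  psql -U sankofa -c "SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10"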
+
+# Check table sizes
+kubectl exec -it -n api deployment/postgres -- \
+  psql -U sankofa -c "SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size FROM pg_tables ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10"
+
+# Analyze tables
+kubectl exec -it -n api deployment/postgres -- \
+  psql -U sankofa -c "ANALYZE"
+```
+
+### Keycloak Issues
+
+```bash
+# Check Keycloak logs
+kubectl logs -n keycloak deployment/keycloak --tail=100
+
+# Check database connection
+kubectl exec -it -n keycloak deployment/keycloak -- \
+  curl http://localhost:8080/health/ready
+
+# Restart Keycloak
+kubectl rollout restart deployment/keycloak -n keycloak
+```
+
+### Proxmox Integration Issues
+
+```bash
+# Check Crossplane provider
+kubectl get pods -n crossplane-system | grep proxmox
+
+# Check provider logs
+kubectl logs -n crossplane-system deployment/crossplane-provider-proxmox
+
+# Test Proxmox connection
+kubectl exec -it -n crossplane-system deployment/crossplane-provider-proxmox -- \
+  curl https://proxmox-endpoint:8006/api2/json/version
+```
+
+## Security Audit
+
+### Monthly Security Review
+
+1. Review access logs
+2. Check for failed authentication attempts
+3. Review policy violations
+4. Check for unusual API usage
+5. Review incident response logs
+6. Update security documentation
+
+### Access Review
+
+```graphql
+# List all users
+query {
+  users {
+    id
+    email
+    role
+    lastLogin
+  }
+}
+
+# Review tenant access
+query {
+  tenant(id: "tenant-id") {
+    users {
+      id
+      email
+      role
+    }
+  }
+}
+```
+
+## Emergency Contacts
+
+- **On-Call Engineer**: (configure in PagerDuty/Opsgenie)
+- **Database Admin**: (configure)
+- **Security Team**: (configure)
+- **Management**: (configure)
+
+## References
+
+- Monitoring Guide: `docs/MONITORING_GUIDE.md`
+- Deployment Guide: `docs/DEPLOYMENT_GUIDE.md`
+- Keycloak Guide: `docs/KEYCLOAK_DEPLOYMENT.md`
+
--- a/docs/guides/PNPM_MIGRATION_GUIDE.md
+++ b/docs/guides/PNPM_MIGRATION_GUIDE.md
@@ -0,0 +1,136 @@
+# pnpm Migration Guide
+
+This guide explains the package management setup for the Sankofa Phoenix project.
+
+## Current Status
+
+The project supports both **pnpm** (recommended) and **npm** (fallback) for package management.
+
+- **Root**: Uses `pnpm` with `pnpm-lock.yaml`
+- **API**: Supports both `pnpm` and `npm` (via `.npmrc` configuration)
+- **Portal**: Supports both `pnpm` and `npm` (via `.npmrc` configuration)
+
+## Why pnpm?
+
+pnpm offers several advantages:
+
+1. **Disk Space Efficiency**: Shared dependency store across projects
+2. **Speed**: Faster installation due to content-addressable storage
+3. **Strict Dependency Resolution**: Prevents phantom dependencies
+4. **Better Monorepo Support**: Excellent for managing multiple packages
+
+## Installation
+
+### Using pnpm (Recommended)
+
+```bash
+# Install pnpm globally
+npm install -g pnpm
+
+# Or using corepack (Node.js 16.13+)
+corepack enable
+corepack prepare pnpm@latest --activate
+
+# Install dependencies
+pnpm install
+
+# In API directory (subshell, so the working directory is unchanged afterwards)
+(cd api && pnpm install)
+
+# In Portal directory
+(cd portal && pnpm install)
+```
+
+### Using npm (Fallback)
+
+```bash
+# Install dependencies with npm
+npm install
+
+# In API directory (subshell, so the working directory is unchanged afterwards)
+(cd api && npm install)
+
+# In Portal directory
+(cd portal && npm install)
+```
+
+## CI/CD
+
+The CI/CD pipeline (`.github/workflows/ci.yml`) supports both package managers:
+
+```yaml
+- name: Install dependencies
+  # --frozen-lockfile is a pnpm flag, so try pnpm first and fall back to npm
+  run: pnpm install --frozen-lockfile || npm install
+```
+
+This ensures CI works regardless of which package manager is used locally.
+
+## Migration Steps (Optional)
+
+If you want to fully migrate to pnpm:
+
+1. **Remove package-lock.json files** (if any exist):
+   ```bash
+   find . -name "package-lock.json" -not -path "*/node_modules/*" -delete
+   ```
+
+2. **Install with pnpm**:
+   ```bash
+   pnpm install
+   ```
+
+3. **Verify installation**:
+   ```bash
+   pnpm list
+   ```
+
+4. **Update CI/CD** (optional):
+   - The current CI already supports both, so no changes needed
+   - You can make it pnpm-only if desired
+
+## Benefits of Current Setup
+
+The current flexible setup provides:
+
+- ✅ **Backward Compatibility**: Works with both package managers
+- ✅ **Team Flexibility**: Team members can use their preferred tool
+- ✅ **CI Resilience**: CI works with either package manager
+- ✅ **Gradual Migration**: Can migrate at own pace
+
+## Recommended Practice
+
+While both are supported, we recommend:
+
+- **Local Development**: Use `pnpm` for better performance
+- **CI/CD**: Current setup (both supported) is fine
+- **Documentation**: Update to reflect pnpm as primary, npm as fallback
+
+## Troubleshooting
+
+### Module not found errors
+
+If you encounter module resolution issues:
+
+1. Delete `node_modules` and lock file
+2. Reinstall with your chosen package manager:
+   ```bash
+   rm -rf node_modules package-lock.json
+   pnpm install  # or npm install
+   ```
+
+### Lock file conflicts
+
+If you see conflicts between `package-lock.json` and `pnpm-lock.yaml`:
+
+- Use `.gitignore` to exclude `package-lock.json` (already configured)
+- Team should agree on primary package manager
+- Document choice in README
+
+---
+
+**Last Updated**: 2025-01-09
+
--- a/docs/guides/QUICK_INSTALL_GUEST_AGENT.md
+++ b/docs/guides/QUICK_INSTALL_GUEST_AGENT.md
@@ -0,0 +1,70 @@
+# Quick Guide: Install Guest Agent via Proxmox Console
+
+## Problem
+VMs are not accessible via SSH from your current network location. Use Proxmox Web UI console instead.
+
+## Solution: Proxmox Web UI Console
+
+### Access Proxmox Web UI
+
+**Site 1:** https://192.168.11.10:8006  
+**Site 2:** https://192.168.11.11:8006
+
+### For Each VM (14 total):
+
+1. **Open VM Console:**
+   - Click on the VM in Proxmox Web UI
+   - Click **"Console"** button
+   - Console opens in browser
+
+2. **Login:**
+   - Username: `admin`
+   - Password: (your VM password)
+
+3. **Install Guest Agent:**
+   ```bash
+   sudo apt-get update
+   sudo apt-get install -y qemu-guest-agent
+   sudo systemctl enable qemu-guest-agent
+   sudo systemctl start qemu-guest-agent
+   sudo systemctl status qemu-guest-agent
+   ```
+
+4. **Verify:**
+   - Should see: `active (running)`
+
+### After Installing on All VMs
+
+Run verification:
+```bash
+./scripts/verify-guest-agent-complete.sh
+./scripts/check-all-vm-ips.sh
+```
+
+## VM List
+
+**Site 1 (8 VMs):**
+- 136: nginx-proxy-vm
+- 139: smom-management
+- 141: smom-rpc-node-01
+- 142: smom-rpc-node-02
+- 145: smom-sentry-01
+- 146: smom-sentry-02
+- 150: smom-validator-01
+- 151: smom-validator-02
+
+**Site 2 (6 VMs):**
+- 101: smom-rpc-node-03
+- 104: smom-validator-04
+- 137: cloudflare-tunnel-vm
+- 138: smom-blockscout
+- 144: smom-rpc-node-04
+- 148: smom-sentry-04
+
+## Expected Result
+
+Once guest agent is running:
+- ✅ Proxmox can automatically detect IP addresses
+- ✅ IP assignment capability fully functional
+- ✅ All guest agent features available
+
--- a/docs/guides/README.md
+++ b/docs/guides/README.md
@@ -0,0 +1,15 @@
+# Guides
+
+This directory contains step-by-step guides and how-to documentation.
+
+## Contents
+
+- **[Build and Deploy Instructions](BUILD_AND_DEPLOY_INSTRUCTIONS.md)** - Instructions for building and deploying the system
+- **[Force Unlock Instructions](FORCE_UNLOCK_INSTRUCTIONS.md)** - Instructions for force unlocking resources
+- **[Quick Install Guest Agent](QUICK_INSTALL_GUEST_AGENT.md)** - Quick installation guide for guest agent
+- **[Enable Guest Agent Manual](enable-guest-agent-manual.md)** - Manual steps for enabling guest agent
+
+---
+
+**Last Updated**: 2025-01-09
+
--- a/docs/guides/TESTING.md
+++ b/docs/guides/TESTING.md
@@ -0,0 +1,293 @@
+# Testing Guide
+
+**Last Updated**: 2025-01-09
+
+## Overview
+
+This guide covers testing strategies, test suites, and best practices for the Sankofa Phoenix platform.
+
+## Test Structure
+
+```
+api/
+  src/
+    services/
+      __tests__/
+        *.test.ts          # Unit tests for services
+    adapters/
+      __tests__/
+        *.test.ts          # Adapter tests
+    schema/
+      __tests__/
+        *.test.ts          # GraphQL resolver tests
+
+src/
+  components/
+    __tests__/
+      *.test.tsx           # Component tests
+  lib/
+    __tests__/
+      *.test.ts            # Utility tests
+
+blockchain/
+  tests/
+    *.test.ts              # Smart contract tests
+```
+
+## Running Tests
+
+### Frontend Tests
+
+```bash
+npm test                    # Run all frontend tests
+npm test -- --ui           # Run with Vitest UI
+npm test -- --coverage     # Generate coverage report
+```
+
+### Backend Tests
+
+```bash
+cd api
+npm test                    # Run all API tests
+npm test -- --coverage     # Generate coverage report
+```
+
+### Blockchain Tests
+
+```bash
+cd blockchain
+npm test                    # Run smart contract tests
+```
+
+### E2E Tests
+
+```bash
+npm run test:e2e           # Run end-to-end tests
+```
+
+## Test Types
+
+### 1. Unit Tests
+
+Test individual functions and methods in isolation.
+
+**Example: Resource Service Test**
+
+```typescript
+import { describe, it, expect } from 'vitest'
+import { getResources } from '../resource'
+import { createMockContext } from '../../test-utils' // helper from "Test Helpers"; this path is an assumption
+
+describe('getResources', () => {
+  it('should return resources', async () => {
+    const mockContext = createMockContext()
+    const result = await getResources(mockContext)
+    expect(result).toBeDefined()
+  })
+})
+```
+
+### 2. Integration Tests
+
+Test interactions between multiple components.
+
+**Example: GraphQL Resolver Test**
+
+```typescript
+import { describe, it, expect } from 'vitest'
+import { createTestSchema } from '../schema'
+import { graphql } from 'graphql'
+
+describe('Resource Resolvers', () => {
+  it('should query resources', async () => {
+    const query = `
+      query {
+        resources {
+          id
+          name
+        }
+      }
+    `
+    const result = await graphql({ schema: createTestSchema(), source: query })
+    expect(result.data).toBeDefined()
+  })
+})
+```
+
+### 3. Component Tests
+
+Test React components in isolation.
+
+**Example: ResourceList Component Test**
+
+```typescript
+import { describe, it, expect } from 'vitest'
+import { render, screen, waitFor } from '@testing-library/react'
+import { ResourceList } from '../ResourceList'
+
+describe('ResourceList', () => {
+  it('should render resources', async () => {
+    render(<ResourceList />)
+    await waitFor(() => {
+      expect(screen.getByText('Test Resource')).toBeInTheDocument()
+    })
+  })
+})
+```
+
+### 4. E2E Tests
+
+Test complete user workflows.
+
+**Example: Resource Provisioning E2E**
+
+```typescript
+import { test, expect } from '@playwright/test'
+
+test('should provision resource', async ({ page }) => {
+  await page.goto('/resources')
+  await page.click('text=Provision Resource')
+  await page.fill('[name="name"]', 'test-resource')
+  await page.selectOption('[name="type"]', 'VM')
+  await page.click('text=Create')
+  
+  await expect(page.locator('text=test-resource')).toBeVisible()
+})
+```
+
+## Test Coverage Goals
+
+- **Unit Tests**: >80% coverage
+- **Integration Tests**: >60% coverage
+- **Component Tests**: >70% coverage
+- **E2E Tests**: Critical user paths covered
+
+## Mocking
+
+### Mock Database
+
+```typescript
+const mockDb = {
+  query: vi.fn().mockResolvedValue({ rows: [] }),
+}
+```
+
+### Mock GraphQL Client
+
+```typescript
+vi.mock('@/lib/graphql/client', () => ({
+  apolloClient: {
+    query: vi.fn(),
+    mutate: vi.fn(),
+  },
+}))
+```
+
+### Mock Provider APIs
+
+```typescript
+global.fetch = vi.fn().mockResolvedValue({
+  ok: true,
+  json: async () => ({ data: [] }),
+})
+```
+
+## Test Utilities
+
+### Test Helpers
+
+```typescript
+// test-utils.tsx
+export function createMockContext(): Context {
+  return {
+    db: createMockDb(),
+    user: {
+      id: 'test-user',
+      email: 'test@sankofa.nexus',
+      name: 'Test User',
+      role: 'ADMIN',
+    },
+  }
+}
+```
+
+### Test Data Factories
+
+```typescript
+export function createMockResource(overrides = {}) {
+  return {
+    id: 'resource-1',
+    name: 'Test Resource',
+    type: 'VM',
+    status: 'RUNNING',
+    ...overrides,
+  }
+}
+```
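+
+A typed variant of the factory catches misspelled override keys at compile time. The sketch below assumes a `Resource` shape inferred from the fields used in this guide; the real type in the codebase may differ, and `createMockResources` is a hypothetical addition.
+
+```typescript
+// Sketch: typed variant of the factory above.
+// The Resource shape is an assumption based on the fields shown in this guide.
+interface Resource {
+  id: string;
+  name: string;
+  type: string;
+  status: string;
+}
+
+function createMockResource(overrides: Partial<Resource> = {}): Resource {
+  return {
+    id: "resource-1",
+    name: "Test Resource",
+    type: "VM",
+    status: "RUNNING",
+    ...overrides,
+  };
+}
+
+// Sequential ids keep the fixtures deterministic across runs.
+function createMockResources(count: number, overrides: Partial<Resource> = {}): Resource[] {
+  return Array.from({ length: count }, (_, i) =>
+    createMockResource({ id: `resource-${i + 1}`, ...overrides }),
+  );
+}
+
+const stopped = createMockResource({ status: "STOPPED" });
+const fleet = createMockResources(3, { type: "CONTAINER" });
+```
+
+Using `Partial<Resource>` keeps every field optional in the override while the return type still guarantees a complete object.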
+
+## CI/CD Integration
+
+Tests run automatically on:
+
+- **Pull Requests**: All test suites
+- **Main Branch**: All tests + coverage reports
+- **Releases**: Full test suite + E2E tests
+
+## Best Practices
+
+1. **Write tests before fixing bugs** (TDD approach)
+2. **Test edge cases and error conditions**
+3. **Keep tests independent and isolated**
+4. **Use descriptive test names**
+5. **Mock external dependencies**
+6. **Clean up after tests**
+7. **Maintain test coverage**
+
+## Performance Testing
+
+### Load Testing
+
+```bash
+# Use k6 for load testing
+k6 run tests/load/api-load-test.js
+```
+
+### Stress Testing
+
+```bash
+# Test API under load
+artillery run tests/stress/api-stress.yml
+```
+
+## Security Testing
+
+- **Dependency scanning**: `npm audit`
+- **SAST**: SonarQube analysis
+- **DAST**: OWASP ZAP scans
+- **Penetration testing**: Quarterly assessments
+
+## Test Reports
+
+Test reports are generated in:
+- `coverage/` - Coverage reports
+- `test-results/` - Test execution results
+- `playwright-report/` - E2E test reports
+
+## Troubleshooting Tests
+
+### Tests Timing Out
+
+- Check for unclosed connections
+- Verify mocks are properly reset
+- Increase timeout values if needed
+
+### Flaky Tests
+
+- Ensure tests are deterministic
+- Fix race conditions
+- Use proper wait conditions
+
+### Database Test Issues
+
+- Ensure test database is isolated
+- Clean up test data after each test
+- Use transactions for isolation
+
--- a/docs/guides/TEST_EXAMPLES.md
+++ b/docs/guides/TEST_EXAMPLES.md
@@ -0,0 +1,314 @@
+# Test Examples and Patterns
+
+This document provides examples and patterns for writing tests in the Sankofa Phoenix project.
+
+## Unit Tests
+
+### Testing Service Functions
+
+```typescript
+// api/src/services/auth.test.ts
+import { describe, it, expect, vi, beforeEach } from 'vitest'
+import { login } from './auth'
+import { getDb } from '../db'
+import { AppErrors } from '../lib/errors'
+
+// Mock dependencies
+vi.mock('../db')
+vi.mock('../lib/errors')
+
+describe('auth service', () => {
+  beforeEach(() => {
+    vi.clearAllMocks()
+  })
+
+  it('should authenticate valid user', async () => {
+    const mockDb = {
+      query: vi.fn().mockResolvedValue({
+        rows: [{
+          id: '1',
+          email: 'user@example.com',
+          name: 'Test User',
+          password_hash: '$2a$10$hashed',
+          role: 'USER',
+          created_at: new Date(),
+          updated_at: new Date(),
+        }]
+      })
+    }
+    
+    vi.mocked(getDb).mockReturnValue(mockDb as any)
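+    // Mock bcrypt.compare to return true. Vitest hoists vi.mock() to the top of the
+    // file, so this applies to the whole module; declaring it at top level is clearer.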
+    vi.mock('bcryptjs', () => ({
+      compare: vi.fn().mockResolvedValue(true)
+    }))
+
+    const result = await login('user@example.com', 'password123')
+    
+    expect(result).toHaveProperty('token')
+    expect(result.user.email).toBe('user@example.com')
+  })
+
+  it('should throw error for invalid credentials', async () => {
+    const mockDb = {
+      query: vi.fn().mockResolvedValue({
+        rows: []
+      })
+    }
+    
+    vi.mocked(getDb).mockReturnValue(mockDb as any)
+
+    await expect(login('invalid@example.com', 'wrong')).rejects.toThrow()
+  })
+})
+```
+
+### Testing GraphQL Resolvers
+
+```typescript
+// api/src/schema/resolvers.test.ts
+import { describe, it, expect, vi } from 'vitest'
+import { resolvers } from './resolvers'
+import * as resourceService from '../services/resource'
+
+vi.mock('../services/resource')
+
+describe('GraphQL resolvers', () => {
+  it('should return resources', async () => {
+    const mockContext = {
+      user: { id: '1', email: 'test@example.com', role: 'USER' },
+      db: {} as any,
+      tenantContext: null
+    }
+
+    const mockResources = [
+      { id: '1', name: 'Resource 1', type: 'VM', status: 'RUNNING' }
+    ]
+
+    vi.mocked(resourceService.getResources).mockResolvedValue(mockResources as any)
+
+    const result = await resolvers.Query.resources({}, {}, mockContext)
+    
+    expect(result).toEqual(mockResources)
+    expect(resourceService.getResources).toHaveBeenCalledWith(mockContext, undefined)
+  })
+})
+```
+
+### Testing Adapters
+
+```typescript
+// api/src/adapters/proxmox/adapter.test.ts
+import { describe, it, expect, vi, beforeEach } from 'vitest'
+import { ProxmoxAdapter } from './adapter'
+
+// Mock fetch
+global.fetch = vi.fn()
+
+describe('ProxmoxAdapter', () => {
+  let adapter: ProxmoxAdapter
+
+  beforeEach(() => {
+    adapter = new ProxmoxAdapter({
+      apiUrl: 'https://proxmox.example.com:8006',
+      apiToken: 'test-token'
+    })
+    vi.clearAllMocks()
+  })
+
+  it('should discover resources', async () => {
+    vi.mocked(fetch)
+      .mockResolvedValueOnce({
+        ok: true,
+        json: async () => ({
+          data: [{ node: 'node1' }]
+        })
+      } as Response)
+      .mockResolvedValueOnce({
+        ok: true,
+        json: async () => ({
+          data: [
+            { vmid: 100, name: 'vm-100', status: 'running' }
+          ]
+        })
+      } as Response)
+
+    const resources = await adapter.discoverResources()
+    
+    expect(resources).toHaveLength(1)
+    expect(resources[0].name).toBe('vm-100')
+  })
+
+  it('should handle API errors', async () => {
+    vi.mocked(fetch).mockResolvedValueOnce({
+      ok: false,
+      status: 401,
+      statusText: 'Unauthorized',
+      text: async () => 'Authentication failed'
+    } as Response)
+
+    await expect(adapter.discoverResources()).rejects.toThrow()
+  })
+})
+```
+
+## Integration Tests
+
+### Testing Database Operations
+
+```typescript
+// api/src/services/resource.integration.test.ts
+import { describe, it, expect, beforeAll, afterAll } from 'vitest'
+import { getDb } from '../db'
+import { createResource, getResource } from './resource'
+
+describe('resource service integration', () => {
+  let db: any
+  let context: any
+
+  beforeAll(async () => {
+    db = getDb()
+    context = {
+      user: { id: 'test-user', role: 'ADMIN' },
+      db,
+      tenantContext: null
+    }
+  })
+
+  afterAll(async () => {
+    // Cleanup test data
+    await db.query('DELETE FROM resources WHERE name LIKE $1', ['test-%'])
+    await db.end()
+  })
+
+  it('should create and retrieve resource', async () => {
+    const input = {
+      name: 'test-vm',
+      type: 'VM',
+      siteId: 'test-site'
+    }
+
+    const created = await createResource(context, input)
+    expect(created.name).toBe('test-vm')
+
+    const retrieved = await getResource(context, created.id)
+    expect(retrieved.id).toBe(created.id)
+    expect(retrieved.name).toBe('test-vm')
+  })
+})
+```
+
+## E2E Tests
+
+### Testing API Endpoints
+
+```typescript
+// e2e/api.test.ts
+import { describe, it, expect, beforeAll } from 'vitest'
+import { request } from './helpers'
+
+describe('API E2E tests', () => {
+  let authToken: string
+
+  beforeAll(async () => {
+    // Login to get token
+    const response = await request('/graphql', {
+      method: 'POST',
+      body: JSON.stringify({
+        query: `
+          mutation {
+            login(email: "test@example.com", password: "test123") {
+              token
+            }
+          }
+        `
+      })
+    })
+    
+    const data = await response.json()
+    authToken = data.data.login.token
+  })
+
+  it('should get resources', async () => {
+    const response = await request('/graphql', {
+      method: 'POST',
+      headers: {
+        'Authorization': `Bearer ${authToken}`
+      },
+      body: JSON.stringify({
+        query: `
+          query {
+            resources {
+              id
+              name
+              type
+            }
+          }
+        `
+      })
+    })
+
+    const data = await response.json()
+    expect(data.data.resources).toBeInstanceOf(Array)
+  })
+})
+```
+
+## React Component Tests
+
+```typescript
+// portal/src/components/Dashboard.test.tsx
+import { describe, it, expect, vi } from 'vitest'
+import { render, screen, waitFor } from '@testing-library/react'
+import { Dashboard } from './Dashboard'
+
+vi.mock('../lib/crossplane-client', () => ({
+  createCrossplaneClient: () => ({
+    getVMs: vi.fn().mockResolvedValue([
+      { id: '1', name: 'vm-1', status: 'running' }
+    ])
+  })
+}))
+
+describe('Dashboard', () => {
+  it('should render VM list', async () => {
+    render(<Dashboard />)
+    
+    await waitFor(() => {
+      expect(screen.getByText('vm-1')).toBeInTheDocument()
+    })
+  })
+})
+```
+
+## Best Practices
+
+1. **Use descriptive test names**: Describe what is being tested
+2. **Arrange-Act-Assert pattern**: Structure tests clearly
+3. **Mock external dependencies**: Don't rely on real external services
+4. **Test error cases**: Verify error handling
+5. **Clean up test data**: Remove data created during tests
+6. **Use fixtures**: Create reusable test data
+7. **Test edge cases**: Include boundary conditions
+8. **Keep tests isolated**: Tests should not depend on each other
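+
+Practices 5 and 6 can be combined in a small fixture factory. The sketch below is an assumption, not code from the repository: `makeResourceInput` and its counter are hypothetical names, and the `test-` prefix is chosen only so that the `LIKE 'test-%'` cleanup in the integration test above would also catch rows created from these inputs.
+
+```typescript
+// Hypothetical fixture factory. Field names mirror the input object used in
+// the integration test above (name, type, siteId).
+interface ResourceInput {
+  name: string
+  type: string
+  siteId: string
+}
+
+let fixtureCounter = 0
+
+// Returns a fresh input with a unique, `test-`-prefixed name.
+// Callers override only the fields a given test cares about.
+function makeResourceInput(overrides: Partial<ResourceInput> = {}): ResourceInput {
+  fixtureCounter += 1
+  return {
+    name: `test-vm-${fixtureCounter}`,
+    type: 'VM',
+    siteId: 'test-site',
+    ...overrides
+  }
+}
+```
+
+A test can then call `makeResourceInput({ siteId: 'other-site' })` and assert on the one field it changed, which keeps the arrange step short and avoids name collisions between tests.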
+
+## Running Tests
+
+```bash
+# Run all tests
+pnpm test
+
+# Run tests in watch mode
+pnpm test:watch
+
+# Run tests with coverage
+pnpm test:coverage
+
+# Run specific test file
+pnpm test path/to/test/file.test.ts
+```
+
+---
+
+**Last Updated**: 2025-01-09
+
--- a/docs/guides/TROUBLESHOOTING_GUIDE.md
+++ b/docs/guides/TROUBLESHOOTING_GUIDE.md
@@ -0,0 +1,523 @@
+# Troubleshooting Guide
+
+**Last Updated**: 2025-01-09
+
+Common issues and solutions for Sankofa Phoenix.
+
+## Table of Contents
+
+1. [API Issues](#api-issues)
+2. [Database Issues](#database-issues)
+3. [Authentication Issues](#authentication-issues)
+4. [Resource Provisioning](#resource-provisioning)
+5. [Billing Issues](#billing-issues)
+6. [Performance Issues](#performance-issues)
+7. [Deployment Issues](#deployment-issues)
+8. [Getting Help](#getting-help)
+9. [Common Error Messages](#common-error-messages)
+
+## API Issues
+
+### API Not Responding
+
+**Symptoms:**
+- 503 Service Unavailable
+- Connection timeout
+- Health check fails
+
+**Diagnosis:**
+```bash
+# Check pod status
+kubectl get pods -n api
+
+# Check logs
+kubectl logs -n api deployment/api --tail=100
+
+# Check service
+kubectl get svc -n api api
+```
+
+**Solutions:**
+1. Restart API deployment:
+   ```bash
+   kubectl rollout restart deployment/api -n api
+   ```
+
+2. Check resource limits:
+   ```bash
+   kubectl describe pod -n api -l app=api
+   ```
+
+3. Verify database connection:
+   ```bash
+   # Single quotes make $DATABASE_URL expand inside the pod, not in your local shell
+   kubectl exec -it -n api deployment/api -- \
+     sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
+   ```
+
+### GraphQL Query Errors
+
+**Symptoms:**
+- GraphQL errors in response
+- "Internal server error"
+- Query timeouts
+
+**Diagnosis:**
+```bash
+# Check API logs for errors
+kubectl logs -n api deployment/api | grep -i error
+
+# Test GraphQL endpoint
+curl -X POST https://api.sankofa.nexus/graphql \
+  -H "Content-Type: application/json" \
+  -d '{"query": "{ health { status } }"}'
+```
+
+**Solutions:**
+1. Check query syntax
+2. Verify authentication token
+3. Check database query performance
+4. Review resolver logs
+
+### Rate Limiting
+
+**Symptoms:**
+- 429 Too Many Requests
+- Rate limit headers present
+
+**Solutions:**
+1. Implement request batching
+2. Use subscriptions for real-time updates
+3. Request rate limit increase (admin)
+4. Implement client-side caching
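+
+On the client side, a 429 is usually handled by waiting and retrying rather than failing immediately. The sketch below computes how long to wait before each retry. It assumes the API may send the standard `Retry-After` header (this guide does not say which rate limit headers are returned), and the function name, base delay, and cap are illustrative values rather than project settings.
+
+```typescript
+// Delay in milliseconds before retry number `attempt` (0-based).
+// `retryAfterSeconds` is the parsed Retry-After header, if the server sent one.
+function backoffDelayMs(
+  attempt: number,
+  retryAfterSeconds?: number,
+  baseMs: number = 500,
+  maxMs: number = 30000
+): number {
+  if (retryAfterSeconds !== undefined && retryAfterSeconds >= 0) {
+    // Server guidance wins over the local estimate, but is still capped.
+    return Math.min(retryAfterSeconds * 1000, maxMs)
+  }
+  // Exponential growth: base, 2x base, 4x base, ... capped at maxMs.
+  return Math.min(baseMs * Math.pow(2, attempt), maxMs)
+}
+```
+
+Adding a small random jitter to the returned value is a common refinement so that many clients do not retry at the same instant.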
+
+## Database Issues
+
+### Connection Pool Exhausted
+
+**Symptoms:**
+- "Too many connections" errors
+- Slow query responses
+- Database connection timeouts
+
+**Diagnosis:**
+```bash
+# Check active connections
+kubectl exec -it -n api deployment/postgres -- \
+  psql -U sankofa -c "SELECT count(*) FROM pg_stat_activity"
+
+# Check connection pool metrics
+curl https://api.sankofa.nexus/metrics | grep db_connections
+```
+
+**Solutions:**
+1. Increase connection pool size:
+   ```yaml
+   env:
+     - name: DB_POOL_SIZE
+       value: "30"
+   ```
+
+2. Close idle connections:
+   ```sql
+   SELECT pg_terminate_backend(pid)
+   FROM pg_stat_activity
+   WHERE state = 'idle' AND state_change < NOW() - INTERVAL '5 minutes';
+   ```
+
+3. Restart API to reset connections
+
+### Slow Queries
+
+**Symptoms:**
+- High query latency
+- Timeout errors
+- Database CPU high
+
+**Diagnosis:**
+```sql
+-- Find slow queries
+SELECT query, mean_exec_time, calls
+FROM pg_stat_statements
+ORDER BY mean_exec_time DESC
+LIMIT 10;
+
+-- Check table sizes
+SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
+FROM pg_tables
+ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
+```
+
+**Solutions:**
+1. Add database indexes:
+   ```sql
+   CREATE INDEX idx_resources_tenant_id ON resources(tenant_id);
+   CREATE INDEX idx_resources_status ON resources(status);
+   ```
+
+2. Analyze tables:
+   ```sql
+   ANALYZE resources;
+   ```
+
+3. Optimize queries
+4. Consider read replicas for heavy read workloads
+
+### Database Lock Issues
+
+**Symptoms:**
+- Queries hanging
+- "Lock timeout" errors
+- Deadlock errors
+
+**Solutions:**
+1. Check for long-running transactions:
+   ```sql
+   SELECT pid, state, query, now() - xact_start AS duration
+   FROM pg_stat_activity
+   WHERE state IN ('active', 'idle in transaction') AND xact_start IS NOT NULL
+   ORDER BY duration DESC;
+   ```
+
+2. Terminate blocking queries (if safe)
+3. Review transaction isolation levels
+4. Break up large transactions
+
+## Authentication Issues
+
+### Token Expired
+
+**Symptoms:**
+- 401 Unauthorized
+- "Token expired" error
+- Keycloak errors
+
+**Solutions:**
+1. Refresh token via Keycloak
+2. Re-authenticate
+3. Check token expiration settings in Keycloak
+
+### Invalid Token
+
+**Symptoms:**
+- 401 Unauthorized
+- "Invalid token" error
+
+**Diagnosis:**
+```bash
+# Verify Keycloak is accessible
+curl https://keycloak.sankofa.nexus/health
+
+# Check Keycloak logs
+kubectl logs -n keycloak deployment/keycloak --tail=100
+```
+
+**Solutions:**
+1. Verify token format
+2. Check Keycloak client configuration
+3. Verify token signature
+4. Check clock synchronization
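+
+Clock drift matters because JWT access tokens carry time-based claims: `exp` (expiry) and `nbf` (not before), both in seconds since the Unix epoch. A validator whose clock is ahead rejects a fresh token as expired, and one whose clock is behind rejects it as not yet valid. The sketch below shows the comparison with a tolerance; `checkTimeClaims` and the 30-second default are hypothetical, not the project's validation code.
+
+```typescript
+type TimeCheck = 'ok' | 'expired' | 'not-yet-valid'
+
+interface TimeClaims {
+  exp?: number // expiry, seconds since the Unix epoch
+  nbf?: number // not-before, seconds since the Unix epoch
+}
+
+// Compares the token's time claims with the validator's clock,
+// allowing `skewSeconds` of drift in either direction.
+function checkTimeClaims(
+  claims: TimeClaims,
+  nowSeconds: number,
+  skewSeconds: number = 30
+): TimeCheck {
+  if (claims.exp !== undefined && nowSeconds - skewSeconds >= claims.exp) {
+    return 'expired'
+  }
+  if (claims.nbf !== undefined && nowSeconds + skewSeconds < claims.nbf) {
+    return 'not-yet-valid'
+  }
+  return 'ok'
+}
+```
+
+If a token that was issued seconds ago already evaluates as `expired` or `not-yet-valid`, compare the system time on the API pods and the Keycloak host before changing any client configuration.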
+
+### Permission Denied
+
+**Symptoms:**
+- 403 Forbidden
+- "Access denied" error
+
+**Solutions:**
+1. Verify user role in Keycloak
+2. Check tenant context
+3. Review RBAC policies
+4. Verify resource ownership
+
+## Resource Provisioning
+
+### VM Creation Fails
+
+**Symptoms:**
+- Resource stuck in PENDING
+- Proxmox errors
+- Crossplane errors
+
+**Diagnosis:**
+```bash
+# Check Crossplane provider
+kubectl get pods -n crossplane-system | grep proxmox
+
+# Check ProxmoxVM resource
+kubectl describe proxmoxvm -n default test-vm
+
+# Check Proxmox connectivity
+kubectl exec -it -n crossplane-system deployment/crossplane-provider-proxmox -- \
+  curl https://proxmox-endpoint:8006/api2/json/version
+```
+
+**Solutions:**
+1. Verify Proxmox credentials
+2. Check Proxmox node availability
+3. Verify resource quotas
+4. Check Crossplane provider logs
+
+### Resource Update Fails
+
+**Symptoms:**
+- Update mutation fails
+- Resource not updating
+- Status mismatch
+
+**Solutions:**
+1. Check resource state
+2. Verify update permissions
+3. Review resource constraints
+4. Check for conflicting updates
+
+## Billing Issues
+
+### Incorrect Costs
+
+**Symptoms:**
+- Unexpected charges
+- Missing usage records
+- Cost discrepancies
+
+**Diagnosis:**
+```sql
+-- Check usage records
+SELECT * FROM usage_records
+WHERE tenant_id = 'tenant-id'
+ORDER BY timestamp DESC
+LIMIT 100;
+
+-- Check billing calculations
+SELECT * FROM invoices
+WHERE tenant_id = 'tenant-id'
+ORDER BY created_at DESC;
+```
+
+**Solutions:**
+1. Review usage records
+2. Verify pricing configuration
+3. Check for duplicate records
+4. Recalculate costs if needed
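+
+Step 3 can be scripted once the rows from the diagnosis query are exported. The sketch below groups records by a composite key and returns the groups with more than one member. The record shape is an assumption: only `tenant_id` and `timestamp` appear in the queries above, so `resourceId` and `metric` are hypothetical fields that should be replaced with the real columns.
+
+```typescript
+// Assumed record shape; adjust to the actual usage_records columns.
+interface UsageRecord {
+  tenantId: string
+  resourceId: string // hypothetical field
+  metric: string // hypothetical field
+  timestamp: string // ISO 8601
+}
+
+// Groups records by (tenant, resource, metric, timestamp) and returns
+// only the groups that contain more than one record.
+function findDuplicateUsage(records: UsageRecord[]): UsageRecord[][] {
+  const groups = new Map<string, UsageRecord[]>()
+  for (const record of records) {
+    const key = [record.tenantId, record.resourceId, record.metric, record.timestamp].join('|')
+    const group = groups.get(key)
+    if (group) {
+      group.push(record)
+    } else {
+      groups.set(key, [record])
+    }
+  }
+  const duplicates: UsageRecord[][] = []
+  groups.forEach(group => {
+    if (group.length > 1) duplicates.push(group)
+  })
+  return duplicates
+}
+```
+
+An empty result means no two records share the same key, which points the investigation toward pricing configuration instead.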
+
+### Budget Alerts Not Triggering
+
+**Symptoms:**
+- Budget exceeded but no alert
+- Alerts not sent
+
+**Diagnosis:**
+```sql
+-- Check budget status
+SELECT * FROM budgets
+WHERE tenant_id = 'tenant-id';
+
+-- Check alert configuration
+SELECT * FROM billing_alerts
+WHERE tenant_id = 'tenant-id' AND enabled = true;
+```
+
+**Solutions:**
+1. Verify alert configuration
+2. Check alert evaluation schedule
+3. Review notification channels
+4. Test alert manually
+
+### Invoice Generation Fails
+
+**Symptoms:**
+- Invoice creation error
+- Missing line items
+- PDF generation fails
+
+**Solutions:**
+1. Check usage records exist
+2. Verify billing period
+3. Check PDF service
+4. Review invoice template
+
+## Performance Issues
+
+### High Latency
+
+**Symptoms:**
+- Slow API responses
+- Timeout errors
+- High P95 latency
+
+**Diagnosis:**
+```bash
+# Check API metrics
+curl https://api.sankofa.nexus/metrics | grep request_duration
+
+# Check database performance
+kubectl exec -it -n api deployment/postgres -- \
+  psql -U sankofa -c "SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10"
+```
+
+**Solutions:**
+1. Add caching layer
+2. Optimize database queries
+3. Scale API horizontally
+4. Review N+1 query problems
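+
+An N+1 problem typically appears when a resolver fetches a related row once per parent item, so a list of 100 resources triggers 100 extra queries. The usual fix is to collect the keys and fetch them in one call, which is the idea DataLoader-style libraries implement. The sketch below is a minimal illustration; `loadMany` and the `fetchMany` callback are hypothetical names, and the callback stands in for a single batched database query.
+
+```typescript
+// Resolves many keys with one call to `fetchMany` instead of one call per key.
+async function loadMany<K, V>(
+  keys: K[],
+  fetchMany: (uniqueKeys: K[]) => Promise<Map<K, V>>
+): Promise<(V | undefined)[]> {
+  // Deduplicate so each key is requested at most once.
+  const uniqueKeys = Array.from(new Set(keys))
+  const found = await fetchMany(uniqueKeys)
+  // Preserve the caller's order, including repeated keys.
+  return keys.map(key => found.get(key))
+}
+```
+
+Keys that the batched fetch does not return map to `undefined`, so the caller decides whether a missing row is an error.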
+
+### High Memory Usage
+
+**Symptoms:**
+- OOM kills
+- Pod restarts
+- Memory warnings
+
+**Solutions:**
+1. Increase memory limits
+2. Review memory leaks
+3. Optimize data structures
+4. Implement pagination
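+
+Pagination limits memory because the service never holds a full result set at once. The sketch below illustrates cursor-based paging over rows already sorted by `id`. It works on an in-memory array purely for demonstration; `pageAfter` and `Page` are hypothetical names, and in a real service the filter and limit would be part of the database query.
+
+```typescript
+interface Page<T> {
+  items: T[]
+  nextCursor: string | null
+}
+
+// Returns up to `limit` rows whose id sorts after `afterId`.
+// `rows` must already be sorted by id in ascending order.
+function pageAfter<T extends { id: string }>(
+  rows: T[],
+  afterId: string | null,
+  limit: number
+): Page<T> {
+  const start = afterId === null ? 0 : rows.findIndex(row => row.id > afterId)
+  if (start === -1) {
+    return { items: [], nextCursor: null }
+  }
+  const items = rows.slice(start, start + limit)
+  const hasMore = start + items.length < rows.length
+  const last = items[items.length - 1]
+  // The cursor is the id of the last row returned, or null on the final page.
+  return { items, nextCursor: hasMore && last ? last.id : null }
+}
+```
+
+A client keeps calling with the previous `nextCursor` until it receives `null`.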
+
+### High CPU Usage
+
+**Symptoms:**
+- Slow responses
+- CPU throttling
+- Pod evictions
+
+**Solutions:**
+1. Scale horizontally
+2. Optimize algorithms
+3. Add caching
+4. Review expensive operations
+
+## Deployment Issues
+
+### Pods Not Starting
+
+**Symptoms:**
+- Pods in Pending/CrashLoopBackOff
+- Image pull errors
+- Init container failures
+
+**Diagnosis:**
+```bash
+# Check pod status
+kubectl describe pod -n api <pod-name>
+
+# Check events
+kubectl get events -n api --sort-by='.lastTimestamp'
+
+# Check logs
+kubectl logs -n api <pod-name>
+```
+
+**Solutions:**
+1. Check image availability
+2. Verify resource requests/limits
+3. Check node resources
+4. Review init container logs
+
+### Service Not Accessible
+
+**Symptoms:**
+- Service unreachable
+- DNS resolution fails
+- Ingress errors
+
+**Diagnosis:**
+```bash
+# Check service
+kubectl get svc -n api
+
+# Check ingress
+kubectl describe ingress -n api api
+
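+# Test service directly (port-forward blocks, so run it in a separate terminal)
+kubectl port-forward -n api svc/api 8080:80
+
+# In another terminal, while the port-forward is running
+curl http://localhost:8080/health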
+```
+
+**Solutions:**
+1. Verify service selector matches pods
+2. Check ingress configuration
+3. Verify DNS records
+4. Check network policies
+
+### Configuration Issues
+
+**Symptoms:**
+- Wrong environment variables
+- Missing secrets
+- ConfigMap errors
+
+**Solutions:**
+1. Verify environment variables:
+   ```bash
+   kubectl exec -n api deployment/api -- env | grep -E "DB_|KEYCLOAK_"
+   ```
+
+2. Check secrets:
+   ```bash
+   kubectl get secrets -n api
+   ```
+
+3. Review ConfigMaps:
+   ```bash
+   kubectl get configmaps -n api
+   ```
+
+## Getting Help
+
+### Logs
+
+```bash
+# API logs
+kubectl logs -n api deployment/api --tail=100 -f
+
+# Database logs
+kubectl logs -n api deployment/postgres --tail=100
+
+# Keycloak logs
+kubectl logs -n keycloak deployment/keycloak --tail=100
+
+# Crossplane logs
+kubectl logs -n crossplane-system deployment/crossplane-provider-proxmox --tail=100
+```
+
+### Metrics
+
+```bash
+# Prometheus queries
+curl 'https://prometheus.sankofa.nexus/api/v1/query?query=up'
+
+# Grafana dashboards
+# Access: https://grafana.sankofa.nexus
+```
+
+### Support
+
+- **Documentation**: See `docs/` directory
+- **Operations Runbook**: `docs/OPERATIONS_RUNBOOK.md`
+- **API Documentation**: `docs/API_DOCUMENTATION.md`
+
+## Common Error Messages
+
+### "Database connection failed"
+- Check database pod status
+- Verify connection string
+- Check network policies
+
+### "Authentication required"
+- Verify token in request
+- Check token expiration
+- Verify Keycloak is accessible
+
+### "Quota exceeded"
+- Review tenant quotas
+- Check resource usage
+- Request quota increase
+
+### "Resource not found"
+- Verify resource ID
+- Check tenant context
+- Review access permissions
+
+### "Internal server error"
+- Check application logs
+- Review error details
+- Check system resources
+
--- a/docs/guides/enable-guest-agent-manual.md
+++ b/docs/guides/enable-guest-agent-manual.md
@@ -0,0 +1,153 @@
+# Enable Guest Agent on VMs
+
+## Automated Scripts (Recommended)
+
+The project includes automated scripts for managing guest agent:
+
+### Enable Guest Agent
+
+```bash
+./scripts/enable-guest-agent-existing-vms.sh
+```
+
+This script will:
+- Automatically discover all nodes on each Proxmox site
+- Automatically discover all VMs on each node
+- Check if guest agent is already enabled
+- Enable guest agent on VMs that need it
+- Provide detailed summary statistics
+
+### Verify Guest Agent Status
+
+```bash
+./scripts/verify-guest-agent.sh
+```
+
+This script will:
+- List all VMs with their guest agent status
+- Show which VMs have guest agent enabled/disabled
+- Provide per-node and per-site summaries
+- Display VM names and VMIDs for easy identification
+
+## Manual Instructions (Alternative)
+
+If the automated script doesn't work, you can use Proxmox CLI via SSH.
+
+## Site 1 (ml110-01) - 192.168.11.10
+
+### Step 1: Connect to Proxmox Host
+```bash
+ssh root@192.168.11.10
+```
+
+### Step 2: Enable Guest Agent for All VMs
+```bash
+for vmid in 118 132 133 127 128 123 124 121; do
+  echo "Enabling guest agent on VMID $vmid..."
+  qm set $vmid --agent 1
+  echo "✅ VMID $vmid done"
+done
+```
+
+### Step 3: Verify (Optional)
+```bash
+for vmid in 118 132 133 127 128 123 124 121; do
+  agent=$(qm config $vmid | grep '^agent:' | cut -d: -f2 | tr -d ' ')
+  echo "VMID $vmid: agent=${agent:-not set}"
+done
+```
+
+### Step 4: Exit
+```bash
+exit
+```
+
+## Site 2 (r630-01) - 192.168.11.11
+
+### Step 1: Connect to Proxmox Host
+```bash
+ssh root@192.168.11.11
+```
+
+### Step 2: Enable Guest Agent for All VMs
+```bash
+for vmid in 119 134 135 122 129 130 125 126 131 120; do
+  echo "Enabling guest agent on VMID $vmid..."
+  qm set $vmid --agent 1
+  echo "✅ VMID $vmid done"
+done
+```
+
+### Step 3: Verify (Optional)
+```bash
+for vmid in 119 134 135 122 129 130 125 126 131 120; do
+  agent=$(qm config $vmid | grep '^agent:' | cut -d: -f2 | tr -d ' ')
+  echo "VMID $vmid: agent=${agent:-not set}"
+done
+```
+
+### Step 4: Exit
+```bash
+exit
+```
+
+## Quick One-Liners (Alternative)
+
+If you have SSH key-based authentication set up, you can run these one-liners:
+
+```bash
+# Site 1
+ssh root@192.168.11.10 "for vmid in 118 132 133 127 128 123 124 121; do qm set \$vmid --agent 1; done"
+
+# Site 2
+ssh root@192.168.11.11 "for vmid in 119 134 135 122 129 130 125 126 131 120; do qm set \$vmid --agent 1; done"
+```
+
+## VMID Reference
+
+### Site 1 (ml110-01)
+- 118: nginx-proxy-vm
+- 132: smom-validator-01
+- 133: smom-validator-02
+- 127: smom-sentry-01
+- 128: smom-sentry-02
+- 123: smom-rpc-node-01
+- 124: smom-rpc-node-02
+- 121: smom-management
+
+### Site 2 (r630-01)
+- 119: cloudflare-tunnel-vm
+- 134: smom-validator-03
+- 135: smom-validator-04
+- 122: smom-sentry-03
+- 129: smom-sentry-04
+- 130: smom-rpc-node-03
+- 125: smom-rpc-node-04
+- 126: smom-services
+- 131: smom-blockscout
+- 120: smom-monitoring
+
+## Next Steps
+
+After enabling guest agent in Proxmox:
+
+1. **Wait for VMs to get IP addresses** (if they don't have them yet)
+2. **Install guest agent package in each VM** (if not already installed):
+   ```bash
+   ssh admin@<vm-ip>
+   sudo apt-get update
+   sudo apt-get install -y qemu-guest-agent
+   sudo systemctl enable qemu-guest-agent
+   sudo systemctl start qemu-guest-agent
+   ```
+
+## Automatic Guest Agent Enablement
+
+**New VMs** created with the updated Crossplane provider will automatically have guest agent enabled in Proxmox configuration. The provider code has been updated to set `agent=1` for all new VMs, cloned VMs, and when updating existing VMs.
+
+The guest agent package (`qemu-guest-agent`) is also automatically installed via cloud-init userData in the VM manifests, so new VMs will have both:
+1. Guest agent enabled in Proxmox config (`agent=1`)
+2. Guest agent package installed and running in the OS
+
+For existing VMs, use the automated script or follow the manual instructions above.
+