Update documentation structure and enhance .gitignore
- Added generated index files and report directories to .gitignore to prevent unnecessary tracking of transient files.
- Updated README links to reflect new documentation paths for better navigation.
- Improved documentation organization by ensuring all links point to the correct locations, enhancing user experience and accessibility.
152
docs/guides/BUILD_AND_DEPLOY_INSTRUCTIONS.md
Normal file
@@ -0,0 +1,152 @@
# Build and Deploy Instructions

**Date**: 2025-12-11
**Status**: ✅ **CODE FIXED - NEEDS IMAGE LOADING**

---

## Build Status

✅ **Provider code fixed and built successfully**
- Fixed compilation errors
- Added `findVMNode` function
- Fixed variable scoping issue
- Image built: `crossplane-provider-proxmox:latest`

---

## Deployment Steps

### 1. Build Provider Image

```bash
cd crossplane-provider-proxmox
docker build -t crossplane-provider-proxmox:latest .
```

✅ **COMPLETE**

### 2. Load Image into Kind Cluster

**Required**: `kind` command must be installed

```bash
kind load docker-image crossplane-provider-proxmox:latest --name sankofa
```

⚠️ **PENDING**: `kind` command not available in current environment
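
Before choosing between the load command and the alternatives, it can help to confirm which tools are actually on the `PATH`. The following is a minimal sketch, not part of the repository; `require_cmd` is a hypothetical helper name.

```bash
# Hypothetical helper: succeeds only if the named command is on PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "missing required command: $1" >&2
  return 1
}

# Decide which path to follow based on whether kind is installed.
if require_cmd kind; then
  echo "kind found: run the kind load command above"
else
  echo "kind not found: use one of the alternative methods"
fi
```

The same check works for `docker` and `kubectl`, which the later steps also assume.
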

**Alternative Methods**:

#### Option A: Install kind
```bash
# Install kind
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

# Then load image
kind load docker-image crossplane-provider-proxmox:latest --name sankofa
```

#### Option B: Use Registry
```bash
# Tag and push to registry
docker tag crossplane-provider-proxmox:latest <registry>/crossplane-provider-proxmox:latest
docker push <registry>/crossplane-provider-proxmox:latest

# Update provider.yaml to use registry image
# Change imagePullPolicy from "Never" to "Always" or "IfNotPresent"
```
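
The `imagePullPolicy` change described in the comments above could be scripted roughly as follows. This is a sketch under assumptions: `set_pull_policy` is a hypothetical helper, and the layout of `provider.yaml` is not shown in this guide, so the demo runs against a throwaway file. Updating the image reference to `<registry>/...` is left as a manual edit.

```bash
# Hypothetical helper: rewrite every imagePullPolicy value in a manifest.
# Usage: set_pull_policy <file> <policy>
set_pull_policy() {
  spp_file="$1"
  spp_policy="$2"
  # Write to a temp file and move it into place (portable across GNU and BSD sed).
  sed "s/^\([[:space:]]*imagePullPolicy:[[:space:]]*\).*/\1${spp_policy}/" "$spp_file" > "${spp_file}.tmp" \
    && mv "${spp_file}.tmp" "$spp_file"
}

# Demo against a throwaway file rather than the real provider.yaml.
demo_manifest="$(mktemp)"
printf 'spec:\n  image: crossplane-provider-proxmox:latest\n  imagePullPolicy: Never\n' > "$demo_manifest"
set_pull_policy "$demo_manifest" IfNotPresent
cat "$demo_manifest"
rm -f "$demo_manifest"
```
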

#### Option C: Manual Copy (Advanced)
```bash
# Save image to file
docker save crossplane-provider-proxmox:latest -o provider-image.tar

# Copy to kind node and load
# (kind names node containers <cluster-name>-control-plane; confirm with: kind get nodes --name sankofa)
docker cp provider-image.tar sankofa-control-plane:/tmp/
docker exec sankofa-control-plane ctr -n=k8s.io images import /tmp/provider-image.tar
```

### 3. Restart Provider

```bash
kubectl rollout restart deployment/crossplane-provider-proxmox -n crossplane-system
kubectl rollout status deployment/crossplane-provider-proxmox -n crossplane-system
```

✅ **COMPLETE** (but using old image until step 2 is done)

### 4. Verify Deployment

```bash
kubectl get pods -n crossplane-system -l app=crossplane-provider-proxmox
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=20
```

---

## Current Status

### ✅ Completed
1. Code fixes applied
2. Provider image built
3. Templates updated to cloud image format
4. Provider deployment restarted

### ⏳ Pending
1. **Load image into kind cluster** (requires `kind` command)
2. Test VM creation with new provider

---

## Next Steps

1. **Install kind** or use alternative image loading method
2. **Load image** into cluster
3. **Restart provider** (if not already done)
4. **Test VM 100** creation
5. **Verify** task monitoring works

---

## Verification

After loading image and restarting:

1. **Check provider logs** for task monitoring:
   ```bash
   kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox | grep -i "task\|importdisk\|upid"
   ```

2. **Deploy VM 100**:
   ```bash
   kubectl apply -f examples/production/vm-100.yaml
   ```

3. **Monitor creation**:
   ```bash
   kubectl get proxmoxvm vm-100 -w
   ```

4. **Check Proxmox**:
   ```bash
   qm status 100
   qm config 100
   ```
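
The log filter from step 1 can be wrapped in a small function so the same pattern is reused across checks, for example as `kubectl logs ... | filter_task_logs`. This is a sketch only: `filter_task_logs` is a hypothetical name, and the sample lines are invented for illustration rather than taken from real provider output.

```bash
# Hypothetical helper: keep only log lines that mention task monitoring.
# -E uses extended regular expressions, so the alternation needs no backslashes.
filter_task_logs() {
  grep -iE 'task|importdisk|upid'
}

# Invented sample input; real provider log lines will differ.
printf '%s\n' \
  'starting importdisk for vm 100' \
  'reconcile complete' \
  'waiting on UPID:node:0001' | filter_task_logs
```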

---

## Expected Behavior

With the fixed provider:
- ✅ Provider waits for `importdisk` task to complete
- ✅ No lock timeouts
- ✅ VM configured correctly after import
- ✅ Boot disk attached properly

---

**Status**: ⏳ **AWAITING IMAGE LOAD INTO CLUSTER**

174
docs/guides/CODE_DOCUMENTATION_GUIDE.md
Normal file
@@ -0,0 +1,174 @@
# Code Documentation Guide

This guide outlines the standards and best practices for documenting code in the Sankofa Phoenix project.

## JSDoc Standards

### Function Documentation

All public functions should include JSDoc comments with:

- Description of what the function does
- `@param` tags for each parameter
- `@returns` tag describing the return value
- `@throws` tags for exceptions that may be thrown
- `@example` tag with usage example (for complex functions)

**Example:**

```typescript
/**
 * Authenticate a user and return JWT token
 *
 * @param email - User email address
 * @param password - User password
 * @returns Authentication payload with JWT token and user information
 * @throws {AuthenticationError} If credentials are invalid
 * @example
 * ```typescript
 * const result = await login('user@example.com', 'password123');
 * console.log(result.token); // JWT token
 * ```
 */
export async function login(email: string, password: string): Promise<AuthPayload> {
  // implementation
}
```

### Class Documentation

Classes should include:

- Description of the class purpose
- `@example` tag showing basic usage

**Example:**

```typescript
/**
 * Proxmox VE Infrastructure Adapter
 *
 * Implements the InfrastructureAdapter interface for Proxmox VE infrastructure.
 * Provides resource discovery, creation, update, deletion, metrics, and health checks.
 *
 * @example
 * ```typescript
 * const adapter = new ProxmoxAdapter({
 *   apiUrl: 'https://proxmox.example.com:8006',
 *   apiToken: 'token-id=...'
 * });
 * const resources = await adapter.discoverResources();
 * ```
 */
export class ProxmoxAdapter implements InfrastructureAdapter {
  // implementation
}
```

### Interface Documentation

Complex interfaces should include documentation:

```typescript
/**
 * Resource filter criteria for querying resources
 *
 * @property type - Filter by resource type (e.g., 'VM', 'CONTAINER')
 * @property status - Filter by resource status (e.g., 'RUNNING', 'STOPPED')
 * @property siteId - Filter by site ID
 * @property tenantId - Filter by tenant ID
 */
export interface ResourceFilter {
  type?: string
  status?: string
  siteId?: string
  tenantId?: string
}
```

### Method Documentation

Class methods should follow the same pattern as functions:

```typescript
/**
 * Discover all resources across all Proxmox nodes
 *
 * @returns Array of normalized resources (VMs) from all nodes
 * @throws {Error} If API connection fails or nodes cannot be retrieved
 * @example
 * ```typescript
 * const resources = await adapter.discoverResources();
 * console.log(`Found ${resources.length} VMs`);
 * ```
 */
async discoverResources(): Promise<NormalizedResource[]> {
  // implementation
}
```

## Inline Comments

### When to Use Inline Comments

- **Complex logic**: Explain non-obvious algorithms or business rules
- **Workarounds**: Document temporary fixes or known issues
- **Performance optimizations**: Explain why a particular approach was chosen
- **Business rules**: Document domain-specific logic

### Comment Style

```typescript
// Good: Explains why, not what
// Tenant-aware filtering (superior to Azure multi-tenancy)
if (context.tenantContext) {
  // System admins can see all resources
  if (context.tenantContext.isSystemAdmin) {
    // No filtering needed
  } else if (context.tenantContext.tenantId) {
    // Filter by tenant ID
    query += ` AND r.tenant_id = $${paramCount}`
  }
}

// Bad: States the obvious
// Loop through nodes
for (const node of nodes) {
  // Get VMs
  const vms = await this.getVMs(node.node)
}
```

## TODO Comments

Use TODO comments for known improvements and FIXME comments for known defects or temporary workarounds:

```typescript
// TODO: Add rate limiting to prevent API abuse
// TODO: Implement caching for frequently accessed resources
// FIXME: This workaround should be removed when upstream issue is fixed
```

## Documentation Checklist

When adding new code, ensure:

- [ ] Public functions have JSDoc comments
- [ ] Complex private functions have inline comments
- [ ] Classes have class-level documentation
- [ ] Interfaces have documentation for complex types
- [ ] Examples are provided for public APIs
- [ ] Error cases are documented with `@throws`
- [ ] Complex algorithms have explanatory comments
- [ ] Business rules are documented

## Tools

- **TypeScript**: Built-in JSDoc support
- **VS Code**: JSDoc snippets and IntelliSense
- **TSDoc**: Standard for TypeScript documentation comments

---

**Last Updated**: 2025-01-09

77
docs/guides/CONTRIBUTING.md
Normal file
@@ -0,0 +1,77 @@
# Contributing to Sankofa

**Last Updated**: 2025-01-09

Thank you for your interest in contributing to Sankofa! This document provides guidelines and instructions for contributing to the Sankofa ecosystem and Sankofa Phoenix cloud platform.

## Code of Conduct

- Be respectful and inclusive
- Welcome newcomers and help them learn
- Focus on constructive feedback
- Respect different viewpoints and experiences

## Getting Started

1. Fork the repository
2. Clone your fork: `git clone https://github.com/yourusername/Sankofa.git`
3. Create a branch: `git checkout -b feature/your-feature-name`
4. Make your changes
5. Commit your changes: `git commit -m "Add your feature"`
6. Push to your fork: `git push origin feature/your-feature-name`
7. Open a Pull Request

## Development Setup

See [DEVELOPMENT.md](./DEVELOPMENT.md) for detailed setup instructions.

## Pull Request Process

1. Ensure your code follows the project's style guidelines
2. Add tests for new features
3. Ensure all tests pass: `pnpm test`
4. Update documentation as needed
5. Ensure your branch is up to date with the main branch
6. Submit your PR with a clear description

## Coding Standards

### TypeScript/JavaScript

- Use TypeScript for all new code
- Follow the existing code style
- Use meaningful variable and function names
- Add JSDoc comments for public APIs
- Avoid `any` types - use proper typing

### React Components

- Use functional components with hooks
- Keep components small and focused
- Extract reusable logic into custom hooks
- Use proper prop types or TypeScript interfaces

### Git Commits

- Use clear, descriptive commit messages
- Follow conventional commits format when possible
- Keep commits focused on a single change

## Testing

- Write tests for all new features
- Ensure existing tests still pass
- Aim for >80% code coverage
- Test both success and error cases

## Documentation

- Update README.md if needed
- Add JSDoc comments for new functions
- Update API documentation for backend changes
- Keep architecture docs up to date

## Questions?

Feel free to open an issue for questions or reach out to the maintainers.

184
docs/guides/DEVELOPMENT.md
Normal file
@@ -0,0 +1,184 @@
# Development Guide

**Last Updated**: 2025-01-09

This guide will help you set up your development environment for Sankofa Phoenix.

## Prerequisites

- Node.js 18+ and pnpm (or npm/yarn)
- PostgreSQL 14+ (for API)
- Go 1.21+ (for Crossplane provider)
- Docker (optional, for local services)
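
The minimum versions above can be checked from a script before starting setup. The following is a sketch, not part of the repository: `version_ge` is a hypothetical helper that compares only the major and minor components, and only the Node.js check is shown because `node --version` prints a bare `vMAJOR.MINOR.PATCH` string.

```bash
# Hypothetical helper: succeeds when version $1 is >= version $2.
# Only major and minor components are compared; missing minors count as 0.
version_ge() {
  vg_a_major="${1%%.*}"
  vg_b_major="${2%%.*}"
  vg_a_minor=0
  vg_b_minor=0
  case "$1" in *.*) vg_a_minor="${1#*.}"; vg_a_minor="${vg_a_minor%%.*}" ;; esac
  case "$2" in *.*) vg_b_minor="${2#*.}"; vg_b_minor="${vg_b_minor%%.*}" ;; esac
  if [ "$vg_a_major" -ne "$vg_b_major" ]; then
    [ "$vg_a_major" -gt "$vg_b_major" ]
  else
    [ "$vg_a_minor" -ge "$vg_b_minor" ]
  fi
}

# Check Node.js only if it is installed; strip the leading "v" first.
if command -v node >/dev/null 2>&1; then
  node_version="$(node --version | sed 's/^v//')"
  if version_ge "$node_version" 18; then
    echo "Node.js $node_version is new enough"
  else
    echo "Node.js $node_version is older than 18" >&2
  fi
fi
```
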

## Initial Setup

### 1. Clone the Repository

```bash
git clone https://github.com/sankofa/Sankofa.git
cd Sankofa
```

### 2. Install Dependencies

```bash
# Main application
pnpm install

# Portal
cd portal
npm install
cd ..

# API
cd api
npm install
cd ..

# Crossplane Provider
cd crossplane-provider-proxmox
go mod download
cd ..
```

### 3. Set Up Environment Variables

Create `.env.local` files:

```bash
# Root .env.local
cp .env.example .env.local

# Portal .env.local
cd portal
cp .env.example .env.local
cd ..

# API .env.local
cd api
cp .env.example .env.local
cd ..
```
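
Re-running the `cp` commands above overwrites any values already edited in `.env.local`. A guarded variant might look like the sketch below; `init_env_file` is a hypothetical helper, and the directory list simply mirrors the three locations named in this step.

```bash
# Hypothetical helper: copy .env.example to .env.local in a directory,
# leaving any existing .env.local untouched.
init_env_file() {
  ie_dir="$1"
  if [ ! -f "$ie_dir/.env.example" ]; then
    echo "skip $ie_dir: no .env.example" >&2
    return 1
  fi
  if [ -f "$ie_dir/.env.local" ]; then
    echo "keep $ie_dir/.env.local (already exists)"
    return 0
  fi
  cp "$ie_dir/.env.example" "$ie_dir/.env.local"
  echo "created $ie_dir/.env.local"
}

# Run from the repository root.
for dir in . portal api; do
  init_env_file "$dir" || true
done
```
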

### 4. Set Up Database

```bash
# Create database
createdb sankofa

# Run migrations
cd api
npm run db:migrate
```

## Running the Application

### Development Mode

```bash
# Run each service in its own terminal, starting from the repository root.

# Main app (port 3000)
pnpm dev

# Portal (port 3001)
(cd portal && npm run dev)

# API (port 4000)
(cd api && npm run dev)
```

### Running Tests

```bash
# Run from the repository root; the subshells keep the working directory unchanged.

# Main app tests
pnpm test

# Portal tests
(cd portal && npm test)

# Crossplane provider tests
(cd crossplane-provider-proxmox && go test ./...)
```

## Project Structure

```
Sankofa/
├── src/                          # Main Next.js app
├── portal/                       # Portal application
├── api/                          # GraphQL API server
├── crossplane-provider-proxmox/  # Crossplane provider
├── gitops/                       # GitOps configurations
├── cloudflare/                   # Cloudflare configs
└── docs/                         # Documentation
```

## Common Tasks

### Adding a New Component

1. Create component in `src/components/`
2. Add tests in `src/components/**/*.test.tsx`
3. Export from appropriate index file
4. Update Storybook (if applicable)

### Adding a New API Endpoint

1. Add GraphQL type definition in `api/src/schema/typeDefs.ts`
2. Add resolver in `api/src/schema/resolvers.ts`
3. Add service logic in `api/src/services/`
4. Add tests

### Database Migrations

```bash
cd api
# Create migration
npm run db:migrate:create migration-name

# Run migrations
npm run db:migrate
```

## Debugging

### Frontend

- Use React DevTools
- Check browser console
- Use Next.js debug mode: `NODE_OPTIONS='--inspect' pnpm dev`

### Backend

- Use VS Code debugger
- Check API logs
- Use GraphQL Playground at `http://localhost:4000/graphql`

## Code Quality

### Linting

```bash
pnpm lint
```

### Type Checking

```bash
pnpm type-check
```

### Formatting

```bash
pnpm format
```

## Troubleshooting

See [TROUBLESHOOTING.md](./TROUBLESHOOTING.md) for common issues and solutions.

134
docs/guides/FORCE_UNLOCK_INSTRUCTIONS.md
Normal file
@@ -0,0 +1,134 @@
# Force Unlock VM Instructions

**Date**: 2025-12-09
**Issue**: `qm unlock 100` is timing out

---

## Problem

The `qm unlock` command is timing out, which indicates one of the following:
- A stuck process is holding the lock
- The lock file is corrupted or in an invalid state
- Another operation is blocking the unlock

---

## Solution: Force Unlock

### Option 1: Use the Script (Recommended)

**On Proxmox Node (root@ml110-01)**:

```bash
# Copy the script to the Proxmox node
# Or run commands manually (see Option 2)

# Run the script
bash force-unlock-vm-proxmox.sh 100
```

### Option 2: Manual Commands

**On Proxmox Node (root@ml110-01)**:

```bash
# 1. Check for stuck processes (-w matches 100 as a whole word, so VM 1000 is not listed)
ps aux | grep -E 'qm|qemu|kvm' | grep -w 100

# 2. Check lock file
ls -la /var/lock/qemu-server/lock-100.conf
cat /var/lock/qemu-server/lock-100.conf 2>/dev/null

# 3. Kill stuck processes (if found)
# The pattern matches the VM ID as a separate argument (Proxmox typically starts
# VMs as `kvm -id <vmid>`), so other VMs such as 1000 or 2100 are not killed.
# Killing the kvm process stops the VM itself; review the pgrep output first.
pgrep -af '(qm|qemu|kvm).*[[:space:]]100([[:space:]]|$)'
pkill -9 -f '(qm|qemu|kvm).*[[:space:]]100([[:space:]]|$)'

# 4. Wait a moment
sleep 2

# 5. Force remove lock file
rm -f /var/lock/qemu-server/lock-100.conf

# 6. Verify lock is gone
ls -la /var/lock/qemu-server/lock-100.conf
# Should show: No such file or directory

# 7. Check VM status
qm status 100

# 8. Try unlock again (should work now)
qm unlock 100
```
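
The manual steps hard-code VM ID 100. A parameterized version of the lock-file removal might look like the sketch below. The helper names `lock_path` and `remove_lock` are hypothetical, and this is not the content of `force-unlock-vm-proxmox.sh`, which is not shown in this guide. `LOCK_DIR` defaults to the directory used in the steps above and can be pointed at a scratch directory for a dry run.

```bash
# Directory taken from the manual steps; override LOCK_DIR for a dry run.
LOCK_DIR="${LOCK_DIR:-/var/lock/qemu-server}"

# Hypothetical helper: print the lock file path for a numeric VM ID.
lock_path() {
  case "$1" in
    ''|*[!0-9]*) echo "invalid VM ID: $1" >&2; return 1 ;;
  esac
  printf '%s/lock-%s.conf\n' "$LOCK_DIR" "$1"
}

# Hypothetical helper: remove the lock file if it exists and report the result.
remove_lock() {
  rl_path="$(lock_path "$1")" || return 1
  if [ -e "$rl_path" ]; then
    rm -f -- "$rl_path" && echo "removed $rl_path"
  else
    echo "no lock file at $rl_path"
  fi
}
```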

---

## If Lock Persists

### Check for Other Issues

```bash
# Check if VM is in a transitional state
qm status 100

# Check VM configuration
qm config 100

# Check for other locks
ls -la /var/lock/qemu-server/lock-*.conf

# Check system resources
df -h
free -h
```

### Nuclear Option: Restart Proxmox Services

**⚠️ WARNING: This will affect all VMs on the node**

```bash
# Only if absolutely necessary
systemctl restart pve-cluster
systemctl restart pvedaemon
```

---

## After Successful Unlock

1. **Monitor VM Status**:
   ```bash
   qm status 100
   ```

2. **Check Provider Logs** (from Kubernetes):
   ```bash
   kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=50 -f
   ```

3. **Watch VM Resource**:
   ```bash
   kubectl get proxmoxvm basic-vm-001 -w
   ```

4. **Expected Outcome**:
   - Provider will retry within 1 minute
   - VM configuration will complete
   - VM will boot successfully

---

## Prevention

To prevent this issue in the future:

1. **Ensure proper VM shutdown** before operations
2. **Wait for operations to complete** before starting new ones
3. **Monitor for stuck processes** regularly
4. **Implement lock timeout handling** in provider code (already added)

---

**Last Updated**: 2025-12-09
**Status**: ⚠️ **MANUAL FORCE UNLOCK REQUIRED**

217
docs/guides/KEYCLOAK_DEPLOYMENT.md
Normal file
@@ -0,0 +1,217 @@
# Keycloak Deployment Guide

**Last Updated**: 2025-01-09

This guide covers deploying and configuring Keycloak for the Sankofa Phoenix platform.

## Prerequisites

- Kubernetes cluster with admin access
- kubectl configured
- Helm 3.x installed
- PostgreSQL database (for Keycloak persistence)
- Domain name configured (e.g., `keycloak.sankofa.nexus`)

## Deployment Steps

### 1. Deploy Keycloak via Helm

```bash
# Add the Bitnami Helm repository (provides the Keycloak chart)
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Create namespace
kubectl create namespace keycloak

# Deploy Keycloak
helm install keycloak bitnami/keycloak \
  --namespace keycloak \
  --set auth.adminUser=admin \
  --set auth.adminPassword=$(openssl rand -base64 32) \
  --set postgresql.enabled=true \
  --set postgresql.auth.postgresPassword=$(openssl rand -base64 32) \
  --set ingress.enabled=true \
  --set ingress.hostname=keycloak.sankofa.nexus \
  --set ingress.tls=true \
  --set ingress.certManager=true \
  --set service.type=ClusterIP \
  --set service.port=8080
```
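
The inline `$(openssl rand -base64 32)` calls generate passwords that are never shown to the operator. One way to keep them available for later steps is to capture them in variables first. This is a sketch under assumptions: `gen_secret` and the two variable names are hypothetical, and the `/dev/urandom` fallback is only there for machines without `openssl`.

```bash
# Hypothetical helper: 32 random bytes, base64-encoded (44 characters).
gen_secret() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -base64 32
  else
    head -c 32 /dev/urandom | base64
  fi
}

KEYCLOAK_ADMIN_PASSWORD="$(gen_secret)"
POSTGRES_PASSWORD="$(gen_secret)"

# Pass the captured values to Helm instead of the inline $(...) calls, for example:
#   --set auth.adminPassword="$KEYCLOAK_ADMIN_PASSWORD"
#   --set postgresql.auth.postgresPassword="$POSTGRES_PASSWORD"
echo "admin password length: ${#KEYCLOAK_ADMIN_PASSWORD}"
```

Store the captured values in a secret manager rather than in shell history or a plain file.
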

### 2. Configure Keycloak Clients

Apply the client configuration:

```bash
kubectl apply -f gitops/apps/keycloak/keycloak-clients.yaml
```

Or configure manually via Keycloak Admin Console:

#### Portal Client
- **Client ID**: `portal-client`
- **Client Protocol**: `openid-connect`
- **Access Type**: `confidential`
- **Valid Redirect URIs**:
  - `https://portal.sankofa.nexus/*`
  - `http://localhost:3000/*` (for development)
- **Web Origins**: `+`
- **Standard Flow Enabled**: Yes
- **Direct Access Grants Enabled**: Yes

#### API Client
- **Client ID**: `api-client`
- **Client Protocol**: `openid-connect`
- **Access Type**: `confidential`
- **Service Accounts Enabled**: Yes
- **Standard Flow Enabled**: Yes

### 3. Configure Multi-Realm Support

For multi-tenant support, create realms per tenant:

```bash
# Create realm for tenant
# (the Bitnami chart runs Keycloak as a StatefulSet, so target statefulset/keycloak)
kubectl exec -it -n keycloak statefulset/keycloak -- \
  /opt/bitnami/keycloak/bin/kcadm.sh create realms \
  -s realm=tenant-1 \
  -s enabled=true \
  --no-config \
  --server http://localhost:8080 \
  --realm master \
  --user admin \
  --password $(kubectl get secret keycloak-admin -n keycloak -o jsonpath='{.data.password}' | base64 -d)
```

### 4. Configure Identity Providers

#### LDAP/Active Directory
1. Navigate to User Federation in the Keycloak Admin Console (LDAP is configured there rather than under Identity Providers)
2. Add LDAP provider
3. Configure connection settings:
   - **Vendor**: Active Directory (or other)
   - **Connection URL**: `ldap://your-ldap-server:389`
   - **Users DN**: `ou=Users,dc=example,dc=com`
   - **Bind DN**: `cn=admin,dc=example,dc=com`
   - **Bind Credential**: (stored in secret)

#### SAML Providers
1. Add SAML 2.0 provider
2. Configure:
   - **Entity ID**: Your SAML entity ID
   - **SSO URL**: Your SAML SSO endpoint
   - **Signing Certificate**: Your SAML signing certificate

### 5. Enable Blockchain Identity Verification

For blockchain-based identity verification:

1. Install Keycloak Identity Provider plugin (if available)
2. Configure blockchain connection:
   - **Blockchain RPC URL**: `https://besu.sankofa.nexus:8545`
   - **Contract Address**: (deployed identity contract)
   - **Private Key**: (stored in Kubernetes Secret)

### 6. Configure Environment Variables

Update API service environment variables:

```yaml
env:
  - name: KEYCLOAK_URL
    value: "https://keycloak.sankofa.nexus"
  - name: KEYCLOAK_REALM
    value: "master"  # or tenant-specific realm
  - name: KEYCLOAK_CLIENT_ID
    value: "api-client"
  - name: KEYCLOAK_CLIENT_SECRET
    valueFrom:
      secretKeyRef:
        name: keycloak-client-secret
        key: api-client-secret
```
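
`KEYCLOAK_URL` and `KEYCLOAK_REALM` together determine the OpenID Connect endpoints a client will call. The sketch below derives the discovery URL from those two values. The helper names are hypothetical, and the path layout assumes a Keycloak version that serves realms without the legacy `/auth` prefix.

```bash
# Hypothetical helper: issuer URL for a realm (trailing slash on the base is dropped).
keycloak_issuer() {
  printf '%s/realms/%s\n' "${1%/}" "$2"
}

# Hypothetical helper: OpenID Connect discovery document URL for a realm.
keycloak_discovery_url() {
  printf '%s/.well-known/openid-configuration\n' "$(keycloak_issuer "$1" "$2")"
}

keycloak_discovery_url "https://keycloak.sankofa.nexus" "master"
```

Fetching that URL with `curl -fsS` once Keycloak is running is a quick way to confirm that the URL and realm values match a realm that actually exists.
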

### 7. Set Up Secrets

Create Kubernetes secrets for client credentials:

```bash
# Create secret for API client
kubectl create secret generic keycloak-client-secret \
  --from-literal=api-client-secret=$(openssl rand -base64 32) \
  --namespace keycloak

# Create secret for portal client
kubectl create secret generic keycloak-portal-secret \
  --from-literal=portal-client-secret=$(openssl rand -base64 32) \
  --namespace keycloak
```

### 8. Configure Cloudflare Access

If using Cloudflare Zero Trust:

1. Configure Cloudflare Access application for Keycloak
2. Set domain: `keycloak.sankofa.nexus`
3. Configure access policies (see `cloudflare/access-policies.yaml`)
4. Require MFA for admin access

### 9. Verify Deployment

```bash
# Check Keycloak pods
kubectl get pods -n keycloak

# Check Keycloak service
kubectl get svc -n keycloak

# Test Keycloak health
curl https://keycloak.sankofa.nexus/health

# Access Admin Console
# https://keycloak.sankofa.nexus/admin
```

### 10. Post-Deployment Configuration

1. **Change Admin Password**: Change default admin password immediately
2. **Configure Email**: Set up SMTP for password reset emails
3. **Enable MFA**: Configure TOTP and backup codes
4. **Set Up Themes**: Customize Keycloak themes for branding
5. **Configure Events**: Set up event listeners for audit logging
6. **Backup Configuration**: Export realm configuration regularly

## Troubleshooting

### Keycloak Not Starting
- Check PostgreSQL connection
- Verify resource limits
- Check logs: `kubectl logs -n keycloak statefulset/keycloak`

### Client Authentication Failing
- Verify client secret matches
- Check redirect URIs are correct
- Verify realm name matches

### Multi-Realm Issues
- Ensure realm names match tenant IDs
- Verify realm is enabled
- Check realm configuration

## Security Best Practices

1. **Use Strong Passwords**: Generate strong passwords for all accounts
2. **Enable MFA**: Require MFA for admin and privileged users
3. **Rotate Secrets**: Regularly rotate client secrets
4. **Monitor Access**: Enable audit logging
5. **Use HTTPS**: Always use TLS for Keycloak
6. **Limit Admin Access**: Restrict admin console access via Cloudflare Access
7. **Backup Regularly**: Export and backup realm configurations

## References

- [Keycloak Documentation](https://www.keycloak.org/documentation)
- [Keycloak Helm Chart](https://github.com/bitnami/charts/tree/main/bitnami/keycloak)
- Client configuration: `gitops/apps/keycloak/keycloak-clients.yaml`

237
docs/guides/MIGRATION_GUIDE.md
Normal file
@@ -0,0 +1,237 @@
# Migration Guide

**Last Updated**: 2025-01-09

## Overview

This guide provides instructions for migrating between versions of Sankofa Phoenix and migrating from other platforms.

## Table of Contents

- [API Version Migration](#api-version-migration)
- [Database Migration](#database-migration)
- [Configuration Migration](#configuration-migration)
- [Azure Migration](#azure-migration)
- [Deployment Migration](#deployment-migration)

---

## API Version Migration

### Migrating Between API Versions

See [API Versioning Guide](./api/API_VERSIONING.md) for detailed API migration instructions.

### Quick Steps

1. Review API changelog for breaking changes
2. Update client code to use new API version
3. Test all API interactions
4. Deploy updated client code
5. Monitor for issues

---

## Database Migration

### Schema Migrations

Database migrations are managed automatically:

```bash
# Run migrations
cd api
npm run db:migrate

# Rollback if needed
npm run db:rollback
```

### Manual Migration Steps

1. **Backup Database**: Always backup before migration
   ```bash
   pg_dump sankofa > backup_$(date +%Y%m%d).sql
   ```

2. **Run Migrations**: Execute migration scripts
   ```bash
   npm run db:migrate
   ```

3. **Verify Migration**: Check migration status
   ```bash
   npm run db:migrate:status
   ```

4. **Test Application**: Verify application functionality
5. **Monitor**: Watch for errors post-migration
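
Steps 1 and 2 above can be chained so that migrations only run once a usable backup exists. The sketch below shows the guard on its own; `backup_file_name` and `backup_is_usable` are hypothetical helpers, and the `pg_dump` and `npm` calls appear only in a comment because they need a live database.

```bash
# Hypothetical helper: dated backup filename in the same format as step 1.
backup_file_name() {
  printf 'backup_%s.sql\n' "$(date +%Y%m%d)"
}

# Hypothetical guard: succeeds only if the dump file exists and is non-empty.
backup_is_usable() {
  [ -s "$1" ]
}

# Intended flow (not executed here):
#   f="$(backup_file_name)"
#   pg_dump sankofa > "$f" && backup_is_usable "$f" && npm run db:migrate
backup_file_name
```

A non-empty file is only a coarse check. It catches a failed or empty dump but does not prove that the backup can be restored.
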

### Data Migration

For data migrations:

1. **Export Data**: Export from source
2. **Transform Data**: Apply necessary transformations
3. **Import Data**: Import to new schema
4. **Validate**: Verify data integrity
5. **Update References**: Update any code references

---

## Configuration Migration

### Environment Variables

When updating configuration:

1. **Review Changes**: Check configuration changes in release notes
2. **Update `.env` Files**: Update environment variables
3. **Test Configuration**: Verify configuration is correct
4. **Deploy**: Deploy updated configuration

### Configuration Files

```bash
# Backup current configuration
cp .env.local .env.local.backup

# Update configuration
# Edit .env.local with new values

# Verify configuration
npm run config:validate
```

---

|
||||
## Azure Migration
|
||||
|
||||
### From Azure to Sankofa Phoenix
|
||||
|
||||
See [Azure Migration Guide](./tenants/AZURE_MIGRATION.md) for comprehensive Azure migration instructions.
|
||||
|
||||
### Key Migration Areas
|
||||
|
||||
1. **Identity**: Migrate from Azure AD to Keycloak
|
||||
2. **Resources**: Migrate VMs and resources
|
||||
3. **Networking**: Update network configurations
|
||||
4. **Storage**: Migrate data and storage
|
||||
5. **Applications**: Update application configurations
|
||||
|
||||
---
|
||||
|
||||
## Deployment Migration
|
||||
|
||||
### Upgrading Deployment
|
||||
|
||||
1. **Review Release Notes**: Check for breaking changes
|
||||
2. **Update Dependencies**: Update package versions
|
||||
3. **Run Tests**: Ensure all tests pass
|
||||
4. **Deploy**: Follow deployment procedures
|
||||
5. **Verify**: Confirm deployment success
|
||||
|
||||
### Rolling Back Deployment
|
||||
|
||||
1. **Identify Issue**: Determine what needs rollback
|
||||
2. **Stop Services**: Stop affected services
|
||||
3. **Restore Previous Version**: Deploy previous version
|
||||
4. **Restore Database** (if needed): Restore database backup
|
||||
5. **Verify**: Confirm rollback success
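
If the service runs as a Kubernetes Deployment, steps 3 and 5 above might look like the following sketch. The deployment and namespace names are passed in by the caller, and the `KUBECTL` override is an illustrative convenience; a database restore (step 4) is deliberately left out because it depends on how the backup was taken.

```bash
# Overridable so the function can be exercised without a cluster (assumption).
KUBECTL="${KUBECTL:-kubectl}"

# Roll a Deployment back to its previous revision and wait for it to settle.
rollback_deployment() {
  local deployment="$1" namespace="$2"
  # Restore the previous version.
  "$KUBECTL" rollout undo "deployment/$deployment" -n "$namespace" || return 1
  # Verify: block until the rollout completes or the timeout expires.
  "$KUBECTL" rollout status "deployment/$deployment" -n "$namespace" --timeout=120s
}
```

A call such as `rollback_deployment api api` assumes a Deployment named `api` in a namespace named `api`; substitute the names used in your environment.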
|
||||
|
||||
---
|
||||
|
||||
## Common Migration Scenarios
|
||||
|
||||
### Scenario 1: Minor Version Update
|
||||
|
||||
**Steps:**
|
||||
1. Review changelog
|
||||
2. Update dependencies
|
||||
3. Run tests
|
||||
4. Deploy
|
||||
5. Verify
|
||||
|
||||
### Scenario 2: Major Version Update
|
||||
|
||||
**Steps:**
|
||||
1. Review migration guide for major version
|
||||
2. Backup all data
|
||||
3. Update configuration
|
||||
4. Run database migrations
|
||||
5. Update code for breaking changes
|
||||
6. Test thoroughly
|
||||
7. Deploy in staging first
|
||||
8. Deploy to production
|
||||
9. Monitor closely
|
||||
|
||||
### Scenario 3: Platform Migration
|
||||
|
||||
**Steps:**
|
||||
1. Plan migration timeline
|
||||
2. Set up new platform
|
||||
3. Migrate data
|
||||
4. Migrate applications
|
||||
5. Update DNS/configurations
|
||||
6. Test thoroughly
|
||||
7. Cutover
|
||||
8. Monitor and verify
|
||||
|
||||
---
|
||||
|
||||
## Migration Checklist
|
||||
|
||||
### Pre-Migration
|
||||
|
||||
- [ ] Review migration documentation
|
||||
- [ ] Backup all data
|
||||
- [ ] Test migration in staging
|
||||
- [ ] Notify stakeholders
|
||||
- [ ] Schedule migration window
|
||||
|
||||
### During Migration
|
||||
|
||||
- [ ] Execute migration steps
|
||||
- [ ] Monitor progress
|
||||
- [ ] Verify each step
|
||||
- [ ] Document any issues
|
||||
|
||||
### Post-Migration
|
||||
|
||||
- [ ] Verify all functionality
|
||||
- [ ] Test critical paths
|
||||
- [ ] Monitor for errors
|
||||
- [ ] Update documentation
|
||||
- [ ] Communicate completion
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
1. **Migration Fails**: Check logs, rollback if needed
|
||||
2. **Data Loss**: Restore from backup
|
||||
3. **Configuration Errors**: Verify environment variables
|
||||
4. **Service Downtime**: Check service status and logs
|
||||
|
||||
### Getting Help
|
||||
|
||||
- Check [Troubleshooting Guide](./TROUBLESHOOTING_GUIDE.md)
|
||||
- Review migration documentation
|
||||
- Check logs for specific errors
|
||||
- Contact support if needed
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [API Versioning Guide](./api/API_VERSIONING.md)
|
||||
- [Deployment Guide](./DEPLOYMENT.md)
|
||||
- [Troubleshooting Guide](./TROUBLESHOOTING_GUIDE.md)
|
||||
- [Azure Migration Guide](./tenants/AZURE_MIGRATION.md)
|
||||
|
||||
---
|
||||
|
||||
**Note**: Always back up data before performing migrations. Test migrations in a staging environment first.
|
||||
|
||||
339
docs/guides/MONITORING_GUIDE.md
Normal file
@@ -0,0 +1,339 @@
|
||||
# Monitoring and Observability Guide
|
||||
|
||||
**Last Updated**: 2025-01-09
|
||||
|
||||
This guide covers monitoring setup, Grafana dashboards, and observability for Sankofa Phoenix.
|
||||
|
||||
## Overview
|
||||
|
||||
Sankofa Phoenix uses a comprehensive monitoring stack:
|
||||
- **Prometheus**: Metrics collection and storage
|
||||
- **Grafana**: Visualization and dashboards
|
||||
- **Loki**: Log aggregation
|
||||
- **Alertmanager**: Alert routing and notification
|
||||
|
||||
## Tenant-Aware Metrics
|
||||
|
||||
All metrics are tagged with tenant IDs for multi-tenant isolation.
|
||||
|
||||
### Metric Naming Convention
|
||||
|
||||
```
|
||||
sankofa_<component>_<metric>_<unit>{tenant_id="<id>",...}
|
||||
```
|
||||
|
||||
Examples:
|
||||
- `sankofa_api_requests_total{tenant_id="tenant-1",method="POST",status="200"}`
|
||||
- `sankofa_billing_cost_usd{tenant_id="tenant-1",service="compute"}`
|
||||
- `sankofa_proxmox_vm_cpu_usage_percent{tenant_id="tenant-1",vm_id="101"}`
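
A naming convention like this can be checked mechanically, for instance in a lint step. The sketch below is an assumption inferred from the examples above: it requires the `sankofa_` prefix followed by at least three lowercase segments (component, metric, unit) and does not validate labels.

```bash
# Succeed when the name looks like sankofa_<component>_<metric>_<unit>.
is_valid_metric_name() {
  # Prefix, one component segment, then two or more further segments.
  local re='^sankofa_[a-z0-9]+(_[a-z0-9]+){2,}$'
  [[ "$1" =~ $re ]]
}
```

With this pattern `sankofa_billing_cost_usd` passes, while `billing_cost_usd` and `sankofa_api` are rejected.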
|
||||
|
||||
## Grafana Dashboards
|
||||
|
||||
### 1. System Overview Dashboard
|
||||
|
||||
**Location**: `grafana/dashboards/system-overview.json`
|
||||
|
||||
**Metrics**:
|
||||
- API request rate and latency
|
||||
- Database connection pool usage
|
||||
- Keycloak authentication rate
|
||||
- System resource usage (CPU, memory, disk)
|
||||
|
||||
**Panels**:
|
||||
- Request rate (requests/sec)
|
||||
- P95 latency (ms)
|
||||
- Error rate (%)
|
||||
- Active connections
|
||||
- Authentication success rate
|
||||
|
||||
### 2. Tenant Dashboard
|
||||
|
||||
**Location**: `grafana/dashboards/tenant-overview.json`
|
||||
|
||||
**Metrics**:
|
||||
- Tenant resource usage
|
||||
- Tenant cost tracking
|
||||
- Tenant API usage
|
||||
- Tenant user activity
|
||||
|
||||
**Panels**:
|
||||
- Resource usage by tenant
|
||||
- Cost breakdown by tenant
|
||||
- API calls by tenant
|
||||
- Active users by tenant
|
||||
|
||||
### 3. Billing Dashboard
|
||||
|
||||
**Location**: `grafana/dashboards/billing.json`
|
||||
|
||||
**Metrics**:
|
||||
- Real-time cost tracking
|
||||
- Cost by service/resource
|
||||
- Budget vs actual spend
|
||||
- Cost forecast
|
||||
- Billing anomalies
|
||||
|
||||
**Panels**:
|
||||
- Current month cost
|
||||
- Cost trend (7d, 30d)
|
||||
- Top resources by cost
|
||||
- Budget utilization
|
||||
- Anomaly detection alerts
|
||||
|
||||
### 4. Proxmox Infrastructure Dashboard
|
||||
|
||||
**Location**: `grafana/dashboards/proxmox-infrastructure.json`
|
||||
|
||||
**Metrics**:
|
||||
- VM status and health
|
||||
- Node resource usage
|
||||
- Storage utilization
|
||||
- Network throughput
|
||||
- VM creation/deletion rate
|
||||
|
||||
**Panels**:
|
||||
- VM status overview
|
||||
- Node CPU/memory usage
|
||||
- Storage pool usage
|
||||
- Network I/O
|
||||
- VM lifecycle events
|
||||
|
||||
### 5. Security Dashboard
|
||||
|
||||
**Location**: `grafana/dashboards/security.json`
|
||||
|
||||
**Metrics**:
|
||||
- Authentication events
|
||||
- Failed login attempts
|
||||
- Policy violations
|
||||
- Incident response metrics
|
||||
- Audit log events
|
||||
|
||||
**Panels**:
|
||||
- Authentication success/failure rate
|
||||
- Policy violations by severity
|
||||
- Incident response time
|
||||
- Audit log volume
|
||||
- Security events timeline
|
||||
|
||||
## Prometheus Configuration
|
||||
|
||||
### Scrape Configs
|
||||
|
||||
```yaml
scrape_configs:
  - job_name: 'sankofa-api'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - api
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: api
    metric_relabel_configs:
      - source_labels: [tenant_id]
        target_label: tenant_id
        regex: '(.+)'
        replacement: '${1}'

  - job_name: 'proxmox'
    static_configs:
      - targets:
          - proxmox-exporter:9091
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
```
|
||||
|
||||
### Recording Rules
|
||||
|
||||
```yaml
groups:
  - name: sankofa_rules
    interval: 30s
    rules:
      - record: sankofa:api:requests:rate5m
        expr: rate(sankofa_api_requests_total[5m])

      - record: sankofa:billing:cost:rate1h
        expr: rate(sankofa_billing_cost_usd[1h])

      - record: sankofa:proxmox:vm:count
        expr: count(sankofa_proxmox_vm_info) by (tenant_id)
```
|
||||
|
||||
## Alerting Rules
|
||||
|
||||
### Critical Alerts
|
||||
|
||||
```yaml
groups:
  - name: sankofa_critical
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(sankofa_api_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: DatabaseConnectionPoolExhausted
        expr: sankofa_db_connections_active / sankofa_db_connections_max > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"

      - alert: BudgetExceeded
        expr: sankofa_billing_cost_usd / sankofa_billing_budget_usd > 1.0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Budget exceeded for tenant {{ $labels.tenant_id }}"

      - alert: ProxmoxNodeDown
        expr: up{job="proxmox"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Proxmox node {{ $labels.instance }} is down"
```
|
||||
|
||||
### Billing Anomaly Detection
|
||||
|
||||
```yaml
- name: sankofa_billing_anomalies
  interval: 1h
  rules:
    - alert: CostAnomalyDetected
      expr: |
        (
          sankofa_billing_cost_usd
          - predict_linear(sankofa_billing_cost_usd[7d], 3600)
        ) / predict_linear(sankofa_billing_cost_usd[7d], 3600) > 0.5
      for: 2h
      labels:
        severity: warning
      annotations:
        summary: "Unusual cost increase detected for tenant {{ $labels.tenant_id }}"
```
|
||||
|
||||
## Real-Time Cost Tracking
|
||||
|
||||
### Metrics Exposed
|
||||
|
||||
- `sankofa_billing_cost_usd{tenant_id, service, resource_id}` - Current cost
|
||||
- `sankofa_billing_cost_rate_usd_per_hour{tenant_id}` - Cost rate
|
||||
- `sankofa_billing_budget_usd{tenant_id}` - Budget limit
|
||||
- `sankofa_billing_budget_utilization_percent{tenant_id}` - Budget usage %
|
||||
|
||||
### Grafana Query Example
|
||||
|
||||
```promql
|
||||
# Current month cost by tenant
|
||||
sum(sankofa_billing_cost_usd) by (tenant_id)
|
||||
|
||||
# Projected 7-day cost at the current hourly rate (rate() returns a per-second value)
rate(sankofa_billing_cost_usd[1h]) * 3600 * 24 * 7
|
||||
|
||||
# Budget utilization
|
||||
sankofa_billing_cost_usd / sankofa_billing_budget_usd * 100
|
||||
```
|
||||
|
||||
## Log Aggregation
|
||||
|
||||
### Loki Configuration
|
||||
|
||||
Logs are collected with tenant context:
|
||||
|
||||
```yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
    tenant_id: ${TENANT_ID}
```
|
||||
|
||||
### Log Labels
|
||||
|
||||
- `tenant_id`: Tenant identifier
|
||||
- `service`: Service name (api, portal, etc.)
|
||||
- `level`: Log level (info, warn, error)
|
||||
- `component`: Component name
|
||||
|
||||
### Log Queries
|
||||
|
||||
```logql
|
||||
# Errors for a specific tenant
|
||||
{tenant_id="tenant-1", level="error"}
|
||||
|
||||
# API errors (set the query time range to the last hour; LogQL has no timestamp filter)
{service="api", level="error"} | json
|
||||
|
||||
# Authentication failures
|
||||
{component="auth"} | json | status="failed"
|
||||
```
|
||||
|
||||
## Deployment
|
||||
|
||||
### Install Monitoring Stack
|
||||
|
||||
```bash
|
||||
# Add Prometheus Operator Helm repo
|
||||
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
|
||||
helm repo update
|
||||
|
||||
# Install kube-prometheus-stack
|
||||
helm install monitoring prometheus-community/kube-prometheus-stack \
|
||||
--namespace monitoring \
|
||||
--create-namespace \
|
||||
--values grafana/values.yaml
|
||||
|
||||
# Apply custom dashboards
|
||||
kubectl apply -f grafana/dashboards/
|
||||
```
|
||||
|
||||
### Import Dashboards
|
||||
|
||||
```bash
|
||||
# Import all dashboards
|
||||
for dashboard in grafana/dashboards/*.json; do
|
||||
kubectl create configmap $(basename $dashboard .json) \
|
||||
--from-file=$dashboard \
|
||||
--namespace=monitoring \
|
||||
--dry-run=client -o yaml | kubectl apply -f -
|
||||
done
|
||||
```
|
||||
|
||||
## Access
|
||||
|
||||
- **Grafana**: https://grafana.sankofa.nexus
|
||||
- **Prometheus**: https://prometheus.sankofa.nexus
|
||||
- **Alertmanager**: https://alertmanager.sankofa.nexus
|
||||
|
||||
Default credentials (change immediately):
|
||||
- Username: `admin`
|
||||
- Password: (from secret `monitoring-grafana`)
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Tenant Isolation**: Always filter metrics by tenant_id
|
||||
2. **Retention**: Configure appropriate retention periods
|
||||
3. **Cardinality**: Avoid high-cardinality labels
|
||||
4. **Alerts**: Set up alerting for critical metrics
|
||||
5. **Dashboards**: Create tenant-specific dashboards
|
||||
6. **Cost Tracking**: Monitor billing metrics closely
|
||||
7. **Anomaly Detection**: Enable anomaly detection for billing
|
||||
|
||||
## References
|
||||
|
||||
- Dashboard definitions: `grafana/dashboards/`
|
||||
- Prometheus config: `monitoring/prometheus/`
|
||||
- Alert rules: `monitoring/alerts/`
|
||||
|
||||
428
docs/guides/OPERATIONS_RUNBOOK.md
Normal file
@@ -0,0 +1,428 @@
|
||||
# Operations Runbook
|
||||
|
||||
**Last Updated**: 2025-01-09
|
||||
|
||||
This runbook provides operational procedures for Sankofa Phoenix.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Daily Operations](#daily-operations)
|
||||
2. [Tenant Management](#tenant-management)
|
||||
3. [Backup Procedures](#backup-procedures)
|
||||
4. [Incident Response](#incident-response)
|
||||
5. [Maintenance Windows](#maintenance-windows)
|
||||
6. [Troubleshooting](#troubleshooting)
7. [Security Audit](#security-audit)
8. [Emergency Contacts](#emergency-contacts)
|
||||
|
||||
## Daily Operations
|
||||
|
||||
### Health Checks
|
||||
|
||||
```bash
|
||||
# Check all pods
|
||||
kubectl get pods --all-namespaces
|
||||
|
||||
# Check API health
|
||||
curl https://api.sankofa.nexus/health
|
||||
|
||||
# Check Keycloak health
|
||||
curl https://keycloak.sankofa.nexus/health
|
||||
|
||||
# Check database connections
|
||||
kubectl exec -it -n api deployment/api -- \
|
||||
psql $DATABASE_URL -c "SELECT 1"
|
||||
```
|
||||
|
||||
### Monitoring Dashboard Review
|
||||
|
||||
1. Review system overview dashboard
|
||||
2. Check error rates and latency
|
||||
3. Review billing anomalies
|
||||
4. Check security events
|
||||
5. Review Proxmox infrastructure status
|
||||
|
||||
### Log Review
|
||||
|
||||
```bash
|
||||
# Recent errors
|
||||
kubectl logs -n api deployment/api --tail=100 | grep -i error
|
||||
|
||||
# Authentication failures
|
||||
kubectl logs -n api deployment/api | grep -i "auth.*fail"
|
||||
|
||||
# Billing issues
|
||||
kubectl logs -n api deployment/api | grep -i billing
|
||||
```
|
||||
|
||||
## Tenant Management
|
||||
|
||||
### Create New Tenant
|
||||
|
||||
```bash
|
||||
# Via GraphQL
|
||||
mutation {
|
||||
createTenant(input: {
|
||||
name: "New Tenant"
|
||||
domain: "tenant.example.com"
|
||||
tier: STANDARD
|
||||
}) {
|
||||
id
|
||||
name
|
||||
status
|
||||
}
|
||||
}
|
||||
|
||||
# Or via API
|
||||
curl -X POST https://api.sankofa.nexus/graphql \
|
||||
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
|
||||
-d '{"query": "mutation { createTenant(...) }"}'
|
||||
```
|
||||
|
||||
### Suspend Tenant
|
||||
|
||||
```bash
|
||||
# Update tenant status
|
||||
mutation {
|
||||
updateTenant(id: "tenant-id", input: { status: SUSPENDED }) {
|
||||
id
|
||||
status
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Delete Tenant
|
||||
|
||||
```bash
|
||||
# Soft delete (recommended)
|
||||
mutation {
|
||||
updateTenant(id: "tenant-id", input: { status: DELETED }) {
|
||||
id
|
||||
status
|
||||
}
|
||||
}
|
||||
|
||||
# Hard delete (requires confirmation)
|
||||
# This will delete all tenant resources
|
||||
```
|
||||
|
||||
### Tenant Resource Quotas
|
||||
|
||||
```bash
|
||||
# Check quota usage
|
||||
query {
|
||||
tenant(id: "tenant-id") {
|
||||
quotaLimits {
|
||||
compute { vcpu memory instances }
|
||||
storage { total perInstance }
|
||||
}
|
||||
usage {
|
||||
totalCost
|
||||
byResource {
|
||||
resourceId
|
||||
cost
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Backup Procedures
|
||||
|
||||
### Database Backups
|
||||
|
||||
#### Automated Backups
|
||||
|
||||
Backups run daily at 2 AM UTC:
|
||||
|
||||
```bash
|
||||
# Check backup job status
|
||||
kubectl get cronjob -n api postgres-backup
|
||||
|
||||
# View recent backups
|
||||
kubectl get pvc -n api | grep backup
|
||||
```
|
||||
|
||||
#### Manual Backup
|
||||
|
||||
```bash
|
||||
# Create backup
|
||||
kubectl exec -n api deployment/postgres -- \
|
||||
pg_dump -U sankofa sankofa > backup-$(date +%Y%m%d).sql
|
||||
|
||||
# Restore from backup
|
||||
kubectl exec -i -n api deployment/postgres -- \
|
||||
psql -U sankofa sankofa < backup-20240101.sql
|
||||
```
|
||||
|
||||
### Keycloak Backups
|
||||
|
||||
```bash
|
||||
# Export realm configuration
|
||||
kubectl exec -n keycloak deployment/keycloak -- \
|
||||
/opt/keycloak/bin/kcadm.sh get realms/master \
|
||||
--realm master \
|
||||
--server http://localhost:8080 \
|
||||
--user admin \
|
||||
--password $ADMIN_PASSWORD > keycloak-realm-$(date +%Y%m%d).json
|
||||
```
|
||||
|
||||
### Proxmox Backups
|
||||
|
||||
```bash
|
||||
# Backup VM configuration
|
||||
# Via Proxmox API or UI
|
||||
# Store in version control or backup storage
|
||||
```
|
||||
|
||||
### Tenant-Specific Backups
|
||||
|
||||
```bash
|
||||
# Export tenant data
|
||||
query {
|
||||
tenant(id: "tenant-id") {
|
||||
id
|
||||
name
|
||||
resources {
|
||||
id
|
||||
name
|
||||
type
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# Backup tenant resources
|
||||
# Use resource export API or database dump filtered by tenant_id
|
||||
```
|
||||
|
||||
## Incident Response
|
||||
|
||||
### Incident Classification
|
||||
|
||||
- **P0 - Critical**: System down, data loss, security breach
|
||||
- **P1 - High**: Major feature broken, performance degradation
|
||||
- **P2 - Medium**: Minor feature broken, non-critical issues
|
||||
- **P3 - Low**: Cosmetic issues, minor bugs
|
||||
|
||||
### Incident Response Process
|
||||
|
||||
1. **Detection**: Monitor alerts, user reports
|
||||
2. **Triage**: Classify severity, assign owner
|
||||
3. **Containment**: Isolate affected systems
|
||||
4. **Investigation**: Root cause analysis
|
||||
5. **Resolution**: Fix and verify
|
||||
6. **Post-Mortem**: Document and improve
|
||||
|
||||
### Common Incidents
|
||||
|
||||
#### API Down
|
||||
|
||||
```bash
|
||||
# Check pod status
|
||||
kubectl get pods -n api
|
||||
|
||||
# Check logs
|
||||
kubectl logs -n api deployment/api --tail=100
|
||||
|
||||
# Restart if needed
|
||||
kubectl rollout restart deployment/api -n api
|
||||
|
||||
# Check database
|
||||
kubectl exec -it -n api deployment/postgres -- \
|
||||
psql -U sankofa -c "SELECT 1"
|
||||
```
|
||||
|
||||
#### Database Connection Issues
|
||||
|
||||
```bash
|
||||
# Check connection pool
|
||||
kubectl exec -it -n api deployment/api -- \
|
||||
curl http://localhost:4000/metrics | grep db_connections
|
||||
|
||||
# Restart API to reset connections
|
||||
kubectl rollout restart deployment/api -n api
|
||||
|
||||
# Check database load
|
||||
kubectl exec -it -n api deployment/postgres -- \
|
||||
psql -U sankofa -c "SELECT * FROM pg_stat_activity"
|
||||
```
|
||||
|
||||
#### High Error Rate
|
||||
|
||||
```bash
|
||||
# Check error logs
|
||||
kubectl logs -n api deployment/api | grep -i error | tail -50
|
||||
|
||||
# Check recent deployments
|
||||
kubectl rollout history deployment/api -n api
|
||||
|
||||
# Rollback if needed
|
||||
kubectl rollout undo deployment/api -n api
|
||||
```
|
||||
|
||||
#### Billing Anomaly
|
||||
|
||||
```bash
|
||||
# Check billing metrics
|
||||
curl 'https://prometheus.sankofa.nexus/api/v1/query?query=sankofa_billing_cost_usd'
|
||||
|
||||
# Review recent usage records
|
||||
query {
|
||||
usage(tenantId: "tenant-id", timeRange: {...}) {
|
||||
totalCost
|
||||
byResource {
|
||||
resourceId
|
||||
cost
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# Check for resource leaks
|
||||
kubectl get all --all-namespaces | grep tenant-id
|
||||
```
|
||||
|
||||
## Maintenance Windows
|
||||
|
||||
### Scheduled Maintenance
|
||||
|
||||
Maintenance windows are scheduled:
|
||||
- **Weekly**: Sunday 2-4 AM UTC (low traffic)
|
||||
- **Monthly**: First Sunday 2-6 AM UTC (major updates)
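
Automation that should only run inside these windows can check the current UTC time first. The helpers below are a sketch with illustrative names; they encode the two windows exactly as listed above and take the day of week as `date +%u` reports it (1 = Monday, 7 = Sunday).

```bash
# Weekly window: Sunday 02:00-03:59 UTC.
in_weekly_window() {
  local dow="$1" hour="$2"
  [ "$dow" -eq 7 ] && [ "$hour" -ge 2 ] && [ "$hour" -lt 4 ]
}

# Monthly window: first Sunday of the month, 02:00-05:59 UTC.
in_monthly_window() {
  local dow="$1" day_of_month="$2" hour="$3"
  [ "$dow" -eq 7 ] && [ "$day_of_month" -le 7 ] && [ "$hour" -ge 2 ] && [ "$hour" -lt 6 ]
}
```

To test against the current time, pass the values from `date`, for example `in_weekly_window "$(date -u +%u)" "$(date -u +%H)"`.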
|
||||
|
||||
### Pre-Maintenance Checklist
|
||||
|
||||
- [ ] Notify all tenants (24h advance)
|
||||
- [ ] Create backup of database
|
||||
- [ ] Create backup of Keycloak
|
||||
- [ ] Review recent changes
|
||||
- [ ] Prepare rollback plan
|
||||
- [ ] Set maintenance mode flag
|
||||
|
||||
### Maintenance Mode
|
||||
|
||||
```bash
|
||||
# Enable maintenance mode
|
||||
kubectl set env deployment/api -n api MAINTENANCE_MODE=true
|
||||
|
||||
# Disable maintenance mode
|
||||
kubectl set env deployment/api -n api MAINTENANCE_MODE=false
|
||||
```
|
||||
|
||||
### Post-Maintenance Checklist
|
||||
|
||||
- [ ] Verify all services are up
|
||||
- [ ] Run health checks
|
||||
- [ ] Check error rates
|
||||
- [ ] Verify backups completed
|
||||
- [ ] Notify tenants of completion
|
||||
- [ ] Update documentation
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### API Not Responding
|
||||
|
||||
```bash
|
||||
# Check pod status
|
||||
kubectl describe pod -n api -l app=api
|
||||
|
||||
# Check logs
|
||||
kubectl logs -n api -l app=api --tail=100
|
||||
|
||||
# Check resource limits
|
||||
kubectl top pod -n api
|
||||
|
||||
# Check network policies
|
||||
kubectl get networkpolicies -n api
|
||||
```
|
||||
|
||||
### Database Performance Issues
|
||||
|
||||
```bash
|
||||
# Check slow queries
|
||||
kubectl exec -it -n api deployment/postgres -- \
|
||||
psql -U sankofa -c "SELECT * FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10"
|
||||
|
||||
# Check table sizes
|
||||
kubectl exec -it -n api deployment/postgres -- \
|
||||
psql -U sankofa -c "SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size FROM pg_tables ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10"
|
||||
|
||||
# Analyze tables
|
||||
kubectl exec -it -n api deployment/postgres -- \
|
||||
psql -U sankofa -c "ANALYZE"
|
||||
```
|
||||
|
||||
### Keycloak Issues
|
||||
|
||||
```bash
|
||||
# Check Keycloak logs
|
||||
kubectl logs -n keycloak deployment/keycloak --tail=100
|
||||
|
||||
# Check database connection
|
||||
kubectl exec -it -n keycloak deployment/keycloak -- \
|
||||
curl http://localhost:8080/health/ready
|
||||
|
||||
# Restart Keycloak
|
||||
kubectl rollout restart deployment/keycloak -n keycloak
|
||||
```
|
||||
|
||||
### Proxmox Integration Issues
|
||||
|
||||
```bash
|
||||
# Check Crossplane provider
|
||||
kubectl get pods -n crossplane-system | grep proxmox
|
||||
|
||||
# Check provider logs
|
||||
kubectl logs -n crossplane-system deployment/crossplane-provider-proxmox
|
||||
|
||||
# Test Proxmox connection
|
||||
kubectl exec -it -n crossplane-system deployment/crossplane-provider-proxmox -- \
|
||||
curl https://proxmox-endpoint:8006/api2/json/version
|
||||
```
|
||||
|
||||
## Security Audit
|
||||
|
||||
### Monthly Security Review
|
||||
|
||||
1. Review access logs
|
||||
2. Check for failed authentication attempts
|
||||
3. Review policy violations
|
||||
4. Check for unusual API usage
|
||||
5. Review incident response logs
|
||||
6. Update security documentation
|
||||
|
||||
### Access Review
|
||||
|
||||
```bash
|
||||
# List all users
|
||||
query {
|
||||
users {
|
||||
id
|
||||
email
|
||||
role
|
||||
lastLogin
|
||||
}
|
||||
}
|
||||
|
||||
# Review tenant access
|
||||
query {
|
||||
tenant(id: "tenant-id") {
|
||||
users {
|
||||
id
|
||||
email
|
||||
role
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Emergency Contacts
|
||||
|
||||
- **On-Call Engineer**: (configure in PagerDuty/Opsgenie)
|
||||
- **Database Admin**: (configure)
|
||||
- **Security Team**: (configure)
|
||||
- **Management**: (configure)
|
||||
|
||||
## References
|
||||
|
||||
- Monitoring Guide: `docs/guides/MONITORING_GUIDE.md`
|
||||
- Deployment Guide: `docs/DEPLOYMENT_GUIDE.md`
|
||||
- Keycloak Guide: `docs/KEYCLOAK_DEPLOYMENT.md`
|
||||
|
||||
136
docs/guides/PNPM_MIGRATION_GUIDE.md
Normal file
@@ -0,0 +1,136 @@
|
||||
# pnpm Migration Guide
|
||||
|
||||
This guide explains the package management setup for the Sankofa Phoenix project.
|
||||
|
||||
## Current Status
|
||||
|
||||
The project supports both **pnpm** (recommended) and **npm** (fallback) for package management.
|
||||
|
||||
- **Root**: Uses `pnpm` with `pnpm-lock.yaml`
|
||||
- **API**: Supports both `pnpm` and `npm` (via `.npmrc` configuration)
|
||||
- **Portal**: Supports both `pnpm` and `npm` (via `.npmrc` configuration)
|
||||
|
||||
## Why pnpm?
|
||||
|
||||
pnpm offers several advantages:
|
||||
|
||||
1. **Disk Space Efficiency**: Shared dependency store across projects
|
||||
2. **Speed**: Faster installation due to content-addressable storage
|
||||
3. **Strict Dependency Resolution**: Prevents phantom dependencies
|
||||
4. **Better Monorepo Support**: Excellent for managing multiple packages
|
||||
|
||||
## Installation
|
||||
|
||||
### Using pnpm (Recommended)
|
||||
|
||||
```bash
|
||||
# Install pnpm globally
|
||||
npm install -g pnpm
|
||||
|
||||
# Or using corepack (Node.js 16.13+)
|
||||
corepack enable
|
||||
corepack prepare pnpm@latest --activate
|
||||
|
||||
# Install dependencies
|
||||
pnpm install
|
||||
|
||||
# In API directory
|
||||
cd api
|
||||
pnpm install
|
||||
|
||||
# In Portal directory
|
||||
cd ../portal
|
||||
pnpm install
|
||||
```
|
||||
|
||||
### Using npm (Fallback)
|
||||
|
||||
```bash
|
||||
# Install dependencies with npm
|
||||
npm install
|
||||
|
||||
# In API directory
|
||||
cd api
|
||||
npm install
|
||||
|
||||
# In Portal directory
|
||||
cd ../portal
|
||||
npm install
|
||||
```
|
||||
|
||||
## CI/CD
|
||||
|
||||
The CI/CD pipeline (`.github/workflows/ci.yml`) supports both package managers:
|
||||
|
||||
```yaml
- name: Install dependencies
  # --frozen-lockfile is a pnpm flag; npm is only the fallback
  run: pnpm install --frozen-lockfile || npm install
```
|
||||
|
||||
This ensures CI works regardless of which package manager is used locally.
|
||||
|
||||
## Migration Steps (Optional)
|
||||
|
||||
If you want to fully migrate to pnpm:
|
||||
|
||||
1. **Remove package-lock.json files** (if any exist):
|
||||
```bash
|
||||
find . -name "package-lock.json" -not -path "*/node_modules/*" -delete
|
||||
```
|
||||
|
||||
2. **Install with pnpm**:
|
||||
```bash
|
||||
pnpm install
|
||||
```
|
||||
|
||||
3. **Verify installation**:
|
||||
```bash
|
||||
pnpm list
|
||||
```
|
||||
|
||||
4. **Update CI/CD** (optional):
|
||||
- The current CI already supports both, so no changes needed
|
||||
- You can make it pnpm-only if desired
|
||||
|
||||
## Benefits of Current Setup
|
||||
|
||||
The current flexible setup provides:
|
||||
|
||||
- ✅ **Backward Compatibility**: Works with both package managers
|
||||
- ✅ **Team Flexibility**: Team members can use their preferred tool
|
||||
- ✅ **CI Resilience**: CI works with either package manager
|
||||
- ✅ **Gradual Migration**: Teams can migrate at their own pace
|
||||
|
||||
## Recommended Practice
|
||||
|
||||
While both are supported, we recommend:
|
||||
|
||||
- **Local Development**: Use `pnpm` for better performance
|
||||
- **CI/CD**: Current setup (both supported) is fine
|
||||
- **Documentation**: Update to reflect pnpm as primary, npm as fallback
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Module not found errors
|
||||
|
||||
If you encounter module resolution issues:
|
||||
|
||||
1. Delete `node_modules` and lock file
|
||||
2. Reinstall with your chosen package manager:
|
||||
```bash
|
||||
rm -rf node_modules package-lock.json
|
||||
pnpm install # or npm install
|
||||
```
|
||||
|
||||
### Lock file conflicts
|
||||
|
||||
If you see conflicts between `package-lock.json` and `pnpm-lock.yaml`:
|
||||
|
||||
- Use `.gitignore` to exclude `package-lock.json` (already configured)
|
||||
- Team should agree on primary package manager
|
||||
- Document choice in README
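
Scripts that need to work with either tool can pick the package manager from the lock file that is present. This is a sketch with an illustrative function name; preferring pnpm when both or neither lock file exists is an assumption based on the recommendation in this guide.

```bash
# Print "pnpm" or "npm" depending on which lock file is found in a directory.
detect_package_manager() {
  local dir="${1:-.}"
  if [ -f "$dir/pnpm-lock.yaml" ]; then
    echo "pnpm"            # pnpm wins when both lock files exist
  elif [ -f "$dir/package-lock.json" ]; then
    echo "npm"
  else
    echo "pnpm"            # assumed default: the recommended tool
  fi
}
```

For example, `"$(detect_package_manager api)" install` would run the matching installer for the `api` directory.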
|
||||
|
||||
---
|
||||
|
||||
**Last Updated**: 2025-01-09
|
||||
|
||||
70
docs/guides/QUICK_INSTALL_GUEST_AGENT.md
Normal file
@@ -0,0 +1,70 @@
|
||||
# Quick Guide: Install Guest Agent via Proxmox Console
|
||||
|
||||
## Problem
|
||||
VMs are not accessible via SSH from your current network location. Use Proxmox Web UI console instead.
|
||||
|
||||
## Solution: Proxmox Web UI Console
|
||||
|
||||
### Access Proxmox Web UI
|
||||
|
||||
**Site 1:** https://192.168.11.10:8006
|
||||
**Site 2:** https://192.168.11.11:8006
|
||||
|
||||
### For Each VM (14 total):
|
||||
|
||||
1. **Open VM Console:**
|
||||
- Click on the VM in Proxmox Web UI
|
||||
- Click **"Console"** button
|
||||
- Console opens in browser
|
||||
|
||||
2. **Login:**
|
||||
- Username: `admin`
|
||||
- Password: (your VM password)
|
||||
|
||||
3. **Install Guest Agent:**
|
||||
```bash
|
||||
sudo apt-get update
|
||||
sudo apt-get install -y qemu-guest-agent
|
||||
sudo systemctl enable qemu-guest-agent
|
||||
sudo systemctl start qemu-guest-agent
|
||||
sudo systemctl status qemu-guest-agent
|
||||
```
|
||||
|
||||
4. **Verify:**
|
||||
- Should see: `active (running)`
|
||||
|
||||
### After Installing on All VMs
|
||||
|
||||
Run verification:
|
||||
```bash
|
||||
./scripts/verify-guest-agent-complete.sh
|
||||
./scripts/check-all-vm-ips.sh
|
||||
```
|
||||
|
||||
## VM List
|
||||
|
||||
**Site 1 (8 VMs):**
|
||||
- 136: nginx-proxy-vm
|
||||
- 139: smom-management
|
||||
- 141: smom-rpc-node-01
|
||||
- 142: smom-rpc-node-02
|
||||
- 145: smom-sentry-01
|
||||
- 146: smom-sentry-02
|
||||
- 150: smom-validator-01
|
||||
- 151: smom-validator-02
|
||||
|
||||
**Site 2 (6 VMs):**
|
||||
- 101: smom-rpc-node-03
|
||||
- 104: smom-validator-04
|
||||
- 137: cloudflare-tunnel-vm
|
||||
- 138: smom-blockscout
|
||||
- 144: smom-rpc-node-04
|
||||
- 148: smom-sentry-04
|
||||
|
||||
## Expected Result
|
||||
|
||||
Once guest agent is running:
|
||||
- ✅ Proxmox can automatically detect IP addresses
|
||||
- ✅ IP assignment capability fully functional
|
||||
- ✅ All guest agent features available
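
If you have shell access on a Proxmox node (for example through the node's shell in the Web UI), the agent can also be checked from the host side with `qm guest cmd <vmid> ping`. The loop below is only a sketch and is not the project's verification script; the function name and the `QM` override are illustrative.

```bash
# Overridable so the loop can be tried without a Proxmox host (assumption).
QM="${QM:-qm}"

# Ping the guest agent of each VM ID given; return non-zero if any fails.
check_guest_agents() {
  local vmid failed=0
  for vmid in "$@"; do
    if "$QM" guest cmd "$vmid" ping >/dev/null 2>&1; then
      echo "$vmid: guest agent responding"
    else
      echo "$vmid: guest agent NOT responding"
      failed=1
    fi
  done
  return "$failed"
}
```

On Site 1 this could be called as `check_guest_agents 136 139 141 142 145 146 150 151`. Each VM ID must exist on the node where the command runs.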
|
||||
|
||||
15
docs/guides/README.md
Normal file
@@ -0,0 +1,15 @@
|
||||
# Guides
|
||||
|
||||
This directory contains step-by-step guides and how-to documentation.
|
||||
|
||||
## Contents
|
||||
|
||||
- **[Build and Deploy Instructions](BUILD_AND_DEPLOY_INSTRUCTIONS.md)** - Instructions for building and deploying the system
|
||||
- **[Force Unlock Instructions](FORCE_UNLOCK_INSTRUCTIONS.md)** - Instructions for force unlocking resources
|
||||
- **[Quick Install Guest Agent](QUICK_INSTALL_GUEST_AGENT.md)** - Quick installation guide for guest agent
|
||||
- **[Enable Guest Agent Manual](enable-guest-agent-manual.md)** - Manual steps for enabling guest agent
|
||||
|
||||
---
|
||||
|
||||
**Last Updated**: 2025-01-09
|
||||
|
||||
293
docs/guides/TESTING.md
Normal file
@@ -0,0 +1,293 @@
|
||||
# Testing Guide
|
||||
|
||||
**Last Updated**: 2025-01-09
|
||||
|
||||
## Overview
|
||||
|
||||
This guide covers testing strategies, test suites, and best practices for the Sankofa Phoenix platform.
|
||||
|
||||
## Test Structure
|
||||
|
||||
```
api/
  src/
    services/
      __tests__/
        *.test.ts       # Unit tests for services
    adapters/
      __tests__/
        *.test.ts       # Adapter tests
    schema/
      __tests__/
        *.test.ts       # GraphQL resolver tests

src/
  components/
    __tests__/
      *.test.tsx        # Component tests
  lib/
    __tests__/
      *.test.ts         # Utility tests

blockchain/
  tests/
    *.test.ts           # Smart contract tests
```
|
||||
|
||||
## Running Tests
|
||||
|
||||
### Frontend Tests
|
||||
|
||||
```bash
|
||||
npm test # Run all frontend tests
|
||||
npm test -- --ui # Run with Vitest UI
|
||||
npm test -- --coverage # Generate coverage report
|
||||
```
|
||||
|
||||
### Backend Tests
|
||||
|
||||
```bash
|
||||
cd api
|
||||
npm test # Run all API tests
|
||||
npm test -- --coverage # Generate coverage report
|
||||
```
|
||||
|
||||
### Blockchain Tests
|
||||
|
||||
```bash
|
||||
cd blockchain
|
||||
npm test # Run smart contract tests
|
||||
```
|
||||
|
||||
### E2E Tests
|
||||
|
||||
```bash
|
||||
npm run test:e2e # Run end-to-end tests
|
||||
```
|
||||
|
||||
## Test Types
|
||||
|
||||
### 1. Unit Tests
|
||||
|
||||
Test individual functions and methods in isolation.
|
||||
|
||||
**Example: Resource Service Test**
|
||||
|
||||
```typescript
import { describe, it, expect } from 'vitest'
import { getResources } from '../services/resource'
// createMockContext is the helper shown under "Test Helpers";
// the import path is an assumption, adjust it to where test-utils.tsx lives
import { createMockContext } from '../test-utils'

describe('getResources', () => {
  it('should return resources', async () => {
    const mockContext = createMockContext()
    const result = await getResources(mockContext)
    expect(result).toBeDefined()
  })
})
```
|
||||
|
||||
### 2. Integration Tests
|
||||
|
||||
Test interactions between multiple components.
|
||||
|
||||
**Example: GraphQL Resolver Test**
|
||||
|
||||
```typescript
import { describe, it, expect } from 'vitest'
import { createTestSchema } from '../schema'
import { graphql } from 'graphql'

describe('Resource Resolvers', () => {
  it('should query resources', async () => {
    const query = `
      query {
        resources {
          id
          name
        }
      }
    `
    // graphql-js v16 removed positional arguments, so pass an options object
    const result = await graphql({ schema: createTestSchema(), source: query })
    expect(result.errors).toBeUndefined()
    expect(result.data).toBeDefined()
  })
})
```
|
||||
|
||||
### 3. Component Tests
|
||||
|
||||
Test React components in isolation.
|
||||
|
||||
**Example: ResourceList Component Test**
|
||||
|
||||
```typescript
import { describe, it, expect } from 'vitest'
// waitFor must be imported alongside render and screen
import { render, screen, waitFor } from '@testing-library/react'
import { ResourceList } from '../ResourceList'

describe('ResourceList', () => {
  it('should render resources', async () => {
    render(<ResourceList />)
    await waitFor(() => {
      expect(screen.getByText('Test Resource')).toBeInTheDocument()
    })
  })
})
```
|
||||
|
||||
### 4. E2E Tests
|
||||
|
||||
Test complete user workflows.
|
||||
|
||||
**Example: Resource Provisioning E2E**
|
||||
|
||||
```typescript
|
||||
import { test, expect } from '@playwright/test'
|
||||
|
||||
test('should provision resource', async ({ page }) => {
|
||||
await page.goto('/resources')
|
||||
await page.click('text=Provision Resource')
|
||||
await page.fill('[name="name"]', 'test-resource')
|
||||
await page.selectOption('[name="type"]', 'VM')
|
||||
await page.click('text=Create')
|
||||
|
||||
await expect(page.locator('text=test-resource')).toBeVisible()
|
||||
})
|
||||
```
|
||||
|
||||
## Test Coverage Goals
|
||||
|
||||
- **Unit Tests**: >80% coverage
|
||||
- **Integration Tests**: >60% coverage
|
||||
- **Component Tests**: >70% coverage
|
||||
- **E2E Tests**: Critical user paths covered
|
||||
|
||||
## Mocking
|
||||
|
||||
### Mock Database
|
||||
|
||||
```typescript
|
||||
const mockDb = {
|
||||
query: vi.fn().mockResolvedValue({ rows: [] }),
|
||||
}
|
||||
```
|
||||
|
||||
### Mock GraphQL Client
|
||||
|
||||
```typescript
|
||||
vi.mock('@/lib/graphql/client', () => ({
|
||||
apolloClient: {
|
||||
query: vi.fn(),
|
||||
mutate: vi.fn(),
|
||||
},
|
||||
}))
|
||||
```
|
||||
|
||||
### Mock Provider APIs
|
||||
|
||||
```typescript
|
||||
global.fetch = vi.fn().mockResolvedValue({
|
||||
ok: true,
|
||||
json: async () => ({ data: [] }),
|
||||
})
|
||||
```
|
||||
|
||||
## Test Utilities
|
||||
|
||||
### Test Helpers
|
||||
|
||||
```typescript
|
||||
// test-utils.tsx
|
||||
export function createMockContext(): Context {
|
||||
return {
|
||||
db: createMockDb(),
|
||||
user: {
|
||||
id: 'test-user',
|
||||
email: 'test@sankofa.nexus',
|
||||
name: 'Test User',
|
||||
role: 'ADMIN',
|
||||
},
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Test Data Factories
|
||||
|
||||
```typescript
|
||||
export function createMockResource(overrides = {}) {
|
||||
return {
|
||||
id: 'resource-1',
|
||||
name: 'Test Resource',
|
||||
type: 'VM',
|
||||
status: 'RUNNING',
|
||||
...overrides,
|
||||
}
|
||||
}
|
||||
```
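
As a possible extension of the factory above, a typed variant can hand out unique IDs so that tests creating several resources stay independent of each other. This is a minimal sketch: the `Resource` interface and the `resetResourceSequence` helper are assumptions made for illustration, not the project's actual types.

```typescript
// Hypothetical resource shape, mirroring the fields used in the factory above.
interface Resource {
  id: string
  name: string
  type: string
  status: string
}

let resourceSequence = 0

// Reset the counter between tests so generated IDs are deterministic.
function resetResourceSequence(): void {
  resourceSequence = 0
}

// Build a resource with unique defaults; any field can be overridden per test.
function createMockResource(overrides: Partial<Resource> = {}): Resource {
  resourceSequence += 1
  return {
    id: `resource-${resourceSequence}`,
    name: `Test Resource ${resourceSequence}`,
    type: 'VM',
    status: 'RUNNING',
    ...overrides,
  }
}
```

Calling `resetResourceSequence()` in a `beforeEach` hook keeps IDs predictable regardless of test order.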
|
||||
|
||||
## CI/CD Integration
|
||||
|
||||
Tests run automatically on:
|
||||
|
||||
- **Pull Requests**: All test suites
|
||||
- **Main Branch**: All tests + coverage reports
|
||||
- **Releases**: Full test suite + E2E tests
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Write tests before fixing bugs** (TDD approach)
|
||||
2. **Test edge cases and error conditions**
|
||||
3. **Keep tests independent and isolated**
|
||||
4. **Use descriptive test names**
|
||||
5. **Mock external dependencies**
|
||||
6. **Clean up after tests**
|
||||
7. **Maintain test coverage**
|
||||
|
||||
## Performance Testing
|
||||
|
||||
### Load Testing
|
||||
|
||||
```bash
|
||||
# Use k6 for load testing
|
||||
k6 run tests/load/api-load-test.js
|
||||
```
|
||||
|
||||
### Stress Testing
|
||||
|
||||
```bash
|
||||
# Use Artillery for stress testing
|
||||
artillery run tests/stress/api-stress.yml
|
||||
```
|
||||
|
||||
## Security Testing
|
||||
|
||||
- **Dependency scanning**: `npm audit`
|
||||
- **SAST**: SonarQube analysis
|
||||
- **DAST**: OWASP ZAP scans
|
||||
- **Penetration testing**: Quarterly assessments
|
||||
|
||||
## Test Reports
|
||||
|
||||
Test reports are generated in:
|
||||
- `coverage/` - Coverage reports
|
||||
- `test-results/` - Test execution results
|
||||
- `playwright-report/` - E2E test reports
|
||||
|
||||
## Troubleshooting Tests
|
||||
|
||||
### Tests Timing Out
|
||||
|
||||
- Check for unclosed connections
|
||||
- Verify mocks are properly reset
|
||||
- Increase timeout values if needed
|
||||
|
||||
### Flaky Tests
|
||||
|
||||
- Ensure tests are deterministic
|
||||
- Fix race conditions
|
||||
- Use proper wait conditions
|
||||
|
||||
### Database Test Issues
|
||||
|
||||
- Ensure test database is isolated
|
||||
- Clean up test data after each test
|
||||
- Use transactions for isolation
|
||||
|
||||
314
docs/guides/TEST_EXAMPLES.md
Normal file
@@ -0,0 +1,314 @@
|
||||
# Test Examples and Patterns
|
||||
|
||||
This document provides examples and patterns for writing tests in the Sankofa Phoenix project.
|
||||
|
||||
## Unit Tests
|
||||
|
||||
### Testing Service Functions
|
||||
|
||||
```typescript
// api/src/services/auth.test.ts
import { describe, it, expect, vi, beforeEach } from 'vitest'
import { login } from './auth'
import { getDb } from '../db'
import { AppErrors } from '../lib/errors'

// Mock dependencies. Vitest hoists vi.mock calls to the top of the file,
// so they are declared at module scope rather than inside a test.
vi.mock('../db')
vi.mock('../lib/errors')
// Make bcrypt.compare resolve to true for every test in this file
vi.mock('bcryptjs', () => ({
  compare: vi.fn().mockResolvedValue(true)
}))

describe('auth service', () => {
  beforeEach(() => {
    // Clears recorded calls; mock implementations are kept
    vi.clearAllMocks()
  })

  it('should authenticate valid user', async () => {
    const mockDb = {
      query: vi.fn().mockResolvedValue({
        rows: [{
          id: '1',
          email: 'user@example.com',
          name: 'Test User',
          password_hash: '$2a$10$hashed',
          role: 'USER',
          created_at: new Date(),
          updated_at: new Date(),
        }]
      })
    }

    vi.mocked(getDb).mockReturnValue(mockDb as any)

    const result = await login('user@example.com', 'password123')

    expect(result).toHaveProperty('token')
    expect(result.user.email).toBe('user@example.com')
  })

  it('should throw error for invalid credentials', async () => {
    // No matching user row, so login should reject
    const mockDb = {
      query: vi.fn().mockResolvedValue({
        rows: []
      })
    }

    vi.mocked(getDb).mockReturnValue(mockDb as any)

    await expect(login('invalid@example.com', 'wrong')).rejects.toThrow()
  })
})
```
|
||||
|
||||
### Testing GraphQL Resolvers
|
||||
|
||||
```typescript
|
||||
// api/src/schema/resolvers.test.ts
|
||||
import { describe, it, expect, vi } from 'vitest'
|
||||
import { resolvers } from './resolvers'
|
||||
import * as resourceService from '../services/resource'
|
||||
|
||||
vi.mock('../services/resource')
|
||||
|
||||
describe('GraphQL resolvers', () => {
|
||||
it('should return resources', async () => {
|
||||
const mockContext = {
|
||||
user: { id: '1', email: 'test@example.com', role: 'USER' },
|
||||
db: {} as any,
|
||||
tenantContext: null
|
||||
}
|
||||
|
||||
const mockResources = [
|
||||
{ id: '1', name: 'Resource 1', type: 'VM', status: 'RUNNING' }
|
||||
]
|
||||
|
||||
vi.mocked(resourceService.getResources).mockResolvedValue(mockResources as any)
|
||||
|
||||
const result = await resolvers.Query.resources({}, {}, mockContext)
|
||||
|
||||
expect(result).toEqual(mockResources)
|
||||
expect(resourceService.getResources).toHaveBeenCalledWith(mockContext, undefined)
|
||||
})
|
||||
})
|
||||
```
|
||||
|
||||
### Testing Adapters
|
||||
|
||||
```typescript
|
||||
// api/src/adapters/proxmox/adapter.test.ts
|
||||
import { describe, it, expect, vi, beforeEach } from 'vitest'
|
||||
import { ProxmoxAdapter } from './adapter'
|
||||
|
||||
// Mock fetch
|
||||
global.fetch = vi.fn()
|
||||
|
||||
describe('ProxmoxAdapter', () => {
|
||||
let adapter: ProxmoxAdapter
|
||||
|
||||
beforeEach(() => {
|
||||
adapter = new ProxmoxAdapter({
|
||||
apiUrl: 'https://proxmox.example.com:8006',
|
||||
apiToken: 'test-token'
|
||||
})
|
||||
vi.clearAllMocks()
|
||||
})
|
||||
|
||||
it('should discover resources', async () => {
|
||||
vi.mocked(fetch)
|
||||
.mockResolvedValueOnce({
|
||||
ok: true,
|
||||
json: async () => ({
|
||||
data: [{ node: 'node1' }]
|
||||
})
|
||||
} as Response)
|
||||
.mockResolvedValueOnce({
|
||||
ok: true,
|
||||
json: async () => ({
|
||||
data: [
|
||||
{ vmid: 100, name: 'vm-100', status: 'running' }
|
||||
]
|
||||
})
|
||||
} as Response)
|
||||
|
||||
const resources = await adapter.discoverResources()
|
||||
|
||||
expect(resources).toHaveLength(1)
|
||||
expect(resources[0].name).toBe('vm-100')
|
||||
})
|
||||
|
||||
it('should handle API errors', async () => {
|
||||
vi.mocked(fetch).mockResolvedValueOnce({
|
||||
ok: false,
|
||||
status: 401,
|
||||
statusText: 'Unauthorized',
|
||||
text: async () => 'Authentication failed'
|
||||
} as Response)
|
||||
|
||||
await expect(adapter.discoverResources()).rejects.toThrow()
|
||||
})
|
||||
})
|
||||
```
|
||||
|
||||
## Integration Tests
|
||||
|
||||
### Testing Database Operations
|
||||
|
||||
```typescript
|
||||
// api/src/services/resource.integration.test.ts
|
||||
import { describe, it, expect, beforeAll, afterAll } from 'vitest'
|
||||
import { getDb } from '../db'
|
||||
import { createResource, getResource } from './resource'
|
||||
|
||||
describe('resource service integration', () => {
|
||||
let db: any
|
||||
let context: any
|
||||
|
||||
beforeAll(async () => {
|
||||
db = getDb()
|
||||
context = {
|
||||
user: { id: 'test-user', role: 'ADMIN' },
|
||||
db,
|
||||
tenantContext: null
|
||||
}
|
||||
})
|
||||
|
||||
afterAll(async () => {
|
||||
// Cleanup test data
|
||||
await db.query('DELETE FROM resources WHERE name LIKE $1', ['test-%'])
|
||||
await db.end()
|
||||
})
|
||||
|
||||
it('should create and retrieve resource', async () => {
|
||||
const input = {
|
||||
name: 'test-vm',
|
||||
type: 'VM',
|
||||
siteId: 'test-site'
|
||||
}
|
||||
|
||||
const created = await createResource(context, input)
|
||||
expect(created.name).toBe('test-vm')
|
||||
|
||||
const retrieved = await getResource(context, created.id)
|
||||
expect(retrieved.id).toBe(created.id)
|
||||
expect(retrieved.name).toBe('test-vm')
|
||||
})
|
||||
})
|
||||
```
|
||||
|
||||
## E2E Tests
|
||||
|
||||
### Testing API Endpoints
|
||||
|
||||
```typescript
|
||||
// e2e/api.test.ts
|
||||
import { describe, it, expect, beforeAll } from 'vitest'
|
||||
import { request } from './helpers'
|
||||
|
||||
describe('API E2E tests', () => {
|
||||
let authToken: string
|
||||
|
||||
beforeAll(async () => {
|
||||
// Login to get token
|
||||
const response = await request('/graphql', {
|
||||
method: 'POST',
|
||||
body: JSON.stringify({
|
||||
query: `
|
||||
mutation {
|
||||
login(email: "test@example.com", password: "test123") {
|
||||
token
|
||||
}
|
||||
}
|
||||
`
|
||||
})
|
||||
})
|
||||
|
||||
const data = await response.json()
|
||||
authToken = data.data.login.token
|
||||
})
|
||||
|
||||
it('should get resources', async () => {
|
||||
const response = await request('/graphql', {
|
||||
method: 'POST',
|
||||
headers: {
|
||||
'Authorization': `Bearer ${authToken}`
|
||||
},
|
||||
body: JSON.stringify({
|
||||
query: `
|
||||
query {
|
||||
resources {
|
||||
id
|
||||
name
|
||||
type
|
||||
}
|
||||
}
|
||||
`
|
||||
})
|
||||
})
|
||||
|
||||
const data = await response.json()
|
||||
expect(data.data.resources).toBeInstanceOf(Array)
|
||||
})
|
||||
})
|
||||
```
|
||||
|
||||
## React Component Tests
|
||||
|
||||
```typescript
|
||||
// portal/src/components/Dashboard.test.tsx
|
||||
import { describe, it, expect, vi } from 'vitest'
|
||||
import { render, screen, waitFor } from '@testing-library/react'
|
||||
import { Dashboard } from './Dashboard'
|
||||
|
||||
vi.mock('../lib/crossplane-client', () => ({
|
||||
createCrossplaneClient: () => ({
|
||||
getVMs: vi.fn().mockResolvedValue([
|
||||
{ id: '1', name: 'vm-1', status: 'running' }
|
||||
])
|
||||
})
|
||||
}))
|
||||
|
||||
describe('Dashboard', () => {
|
||||
it('should render VM list', async () => {
|
||||
render(<Dashboard />)
|
||||
|
||||
await waitFor(() => {
|
||||
expect(screen.getByText('vm-1')).toBeInTheDocument()
|
||||
})
|
||||
})
|
||||
})
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Use descriptive test names**: Describe what is being tested
|
||||
2. **Arrange-Act-Assert pattern**: Structure tests clearly
|
||||
3. **Mock external dependencies**: Don't rely on real external services
|
||||
4. **Test error cases**: Verify error handling
|
||||
5. **Clean up test data**: Remove data created during tests
|
||||
6. **Use fixtures**: Create reusable test data
|
||||
7. **Test edge cases**: Include boundary conditions
|
||||
8. **Keep tests isolated**: Tests should not depend on each other
|
||||
|
||||
## Running Tests
|
||||
|
||||
```bash
|
||||
# Run all tests
|
||||
pnpm test
|
||||
|
||||
# Run tests in watch mode
|
||||
pnpm test:watch
|
||||
|
||||
# Run tests with coverage
|
||||
pnpm test:coverage
|
||||
|
||||
# Run specific test file
|
||||
pnpm test path/to/test/file.test.ts
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
**Last Updated**: 2025-01-09
|
||||
|
||||
523
docs/guides/TROUBLESHOOTING_GUIDE.md
Normal file
@@ -0,0 +1,523 @@
|
||||
# Troubleshooting Guide
|
||||
|
||||
**Last Updated**: 2025-01-09
|
||||
|
||||
Common issues and solutions for Sankofa Phoenix.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [API Issues](#api-issues)
|
||||
2. [Database Issues](#database-issues)
|
||||
3. [Authentication Issues](#authentication-issues)
|
||||
4. [Resource Provisioning](#resource-provisioning)
|
||||
5. [Billing Issues](#billing-issues)
|
||||
6. [Performance Issues](#performance-issues)
|
||||
7. [Deployment Issues](#deployment-issues)
|
||||
|
||||
## API Issues
|
||||
|
||||
### API Not Responding
|
||||
|
||||
**Symptoms:**
|
||||
- 503 Service Unavailable
|
||||
- Connection timeout
|
||||
- Health check fails
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Check pod status
|
||||
kubectl get pods -n api
|
||||
|
||||
# Check logs
|
||||
kubectl logs -n api deployment/api --tail=100
|
||||
|
||||
# Check service
|
||||
kubectl get svc -n api api
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Restart API deployment:
|
||||
```bash
|
||||
kubectl rollout restart deployment/api -n api
|
||||
```
|
||||
|
||||
2. Check resource limits:
|
||||
```bash
|
||||
kubectl describe pod -n api -l app=api
|
||||
```
|
||||
|
||||
3. Verify database connection:
|
||||
```bash
|
||||
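# Run psql through a shell inside the pod so $DATABASE_URL expands there,
# not in your local shell
kubectl exec -it -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'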
|
||||
```
|
||||
|
||||
### GraphQL Query Errors
|
||||
|
||||
**Symptoms:**
|
||||
- GraphQL errors in response
|
||||
- "Internal server error"
|
||||
- Query timeouts
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Check API logs for errors
|
||||
kubectl logs -n api deployment/api | grep -i error
|
||||
|
||||
# Test GraphQL endpoint
|
||||
curl -X POST https://api.sankofa.nexus/graphql \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"query": "{ health { status } }"}'
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Check query syntax
|
||||
2. Verify authentication token
|
||||
3. Check database query performance
|
||||
4. Review resolver logs
|
||||
|
||||
### Rate Limiting
|
||||
|
||||
**Symptoms:**
|
||||
- 429 Too Many Requests
|
||||
- Rate limit headers present
|
||||
|
||||
**Solutions:**
|
||||
1. Implement request batching
|
||||
2. Use subscriptions for real-time updates
|
||||
3. Request rate limit increase (admin)
|
||||
4. Implement client-side caching
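
On the client side, a 429 response is usually handled by waiting before retrying. The following is a minimal sketch of how the wait time could be computed, assuming the API sends a standard `Retry-After` header in its delta-seconds form (the source only says that rate limit headers are present, so this is an assumption). The function name `retryDelayMs` and its default values are hypothetical.

```typescript
// Compute how long to wait before retrying a request that returned HTTP 429.
// retryAfterHeader is the raw Retry-After header value, or null if absent.
// Only the delta-seconds form is handled; an HTTP-date value falls through
// to exponential backoff.
function retryDelayMs(
  attempt: number,
  retryAfterHeader: string | null,
  baseMs: number = 500,
  maxMs: number = 30000
): number {
  // Prefer the server's hint when it is a valid number of seconds.
  if (retryAfterHeader !== null && retryAfterHeader.trim() !== '') {
    const seconds = Number(retryAfterHeader)
    if (isFinite(seconds) && seconds >= 0) {
      return Math.min(seconds * 1000, maxMs)
    }
  }
  // Otherwise back off exponentially: base, 2x base, 4x base, ... capped at maxMs.
  return Math.min(baseMs * Math.pow(2, attempt), maxMs)
}
```

The cap keeps a misbehaving header or a long retry chain from stalling the client indefinitely. Adding random jitter to the result is a common refinement when many clients retry at once.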
|
||||
|
||||
## Database Issues
|
||||
|
||||
### Connection Pool Exhausted
|
||||
|
||||
**Symptoms:**
|
||||
- "Too many connections" errors
|
||||
- Slow query responses
|
||||
- Database connection timeouts
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Check active connections
|
||||
kubectl exec -it -n api deployment/postgres -- \
|
||||
psql -U sankofa -c "SELECT count(*) FROM pg_stat_activity"
|
||||
|
||||
# Check connection pool metrics
|
||||
curl https://api.sankofa.nexus/metrics | grep db_connections
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Increase connection pool size:
|
||||
```yaml
env:
  - name: DB_POOL_SIZE
    value: "30"
```
|
||||
|
||||
2. Close idle connections:
|
||||
```sql
|
||||
SELECT pg_terminate_backend(pid)
|
||||
FROM pg_stat_activity
|
||||
WHERE state = 'idle' AND state_change < NOW() - INTERVAL '5 minutes';
|
||||
```
|
||||
|
||||
3. Restart API to reset connections
|
||||
|
||||
### Slow Queries
|
||||
|
||||
**Symptoms:**
|
||||
- High query latency
|
||||
- Timeout errors
|
||||
- Database CPU high
|
||||
|
||||
**Diagnosis:**
|
||||
```sql
|
||||
-- Find slow queries
|
||||
SELECT query, mean_exec_time, calls
|
||||
FROM pg_stat_statements
|
||||
ORDER BY mean_exec_time DESC
|
||||
LIMIT 10;
|
||||
|
||||
-- Check table sizes
|
||||
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
|
||||
FROM pg_tables
|
||||
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Add database indexes:
|
||||
```sql
|
||||
CREATE INDEX idx_resources_tenant_id ON resources(tenant_id);
|
||||
CREATE INDEX idx_resources_status ON resources(status);
|
||||
```
|
||||
|
||||
2. Analyze tables:
|
||||
```sql
|
||||
ANALYZE resources;
|
||||
```
|
||||
|
||||
3. Optimize queries
|
||||
4. Consider read replicas for heavy read workloads
|
||||
|
||||
### Database Lock Issues
|
||||
|
||||
**Symptoms:**
|
||||
- Queries hanging
|
||||
- "Lock timeout" errors
|
||||
- Deadlock errors
|
||||
|
||||
**Solutions:**
|
||||
1. Check for long-running transactions:
|
||||
```sql
|
||||
SELECT pid, state, query, now() - xact_start AS duration
|
||||
FROM pg_stat_activity
|
||||
WHERE state IN ('active', 'idle in transaction') AND xact_start IS NOT NULL
|
||||
ORDER BY duration DESC;
|
||||
```
|
||||
|
||||
2. Terminate blocking queries (if safe)
|
||||
3. Review transaction isolation levels
|
||||
4. Break up large transactions
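
Breaking up a large transaction usually means splitting the work into batches and committing each batch separately, so locks are held for a shorter time. A minimal sketch of the batching step follows; the helper name `chunk` is hypothetical, and how each batch is written to the database is left to the calling code.

```typescript
// Split a list into fixed-size batches so each batch can run in its own transaction.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new Error('size must be a positive integer')
  }
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size))
  }
  return batches
}
```

A caller would loop over `chunk(rows, 500)` and run one transaction per batch. The batch size of 500 is only an example; a suitable value depends on row size and lock contention.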
|
||||
|
||||
## Authentication Issues
|
||||
|
||||
### Token Expired
|
||||
|
||||
**Symptoms:**
|
||||
- 401 Unauthorized
|
||||
- "Token expired" error
|
||||
- Keycloak errors
|
||||
|
||||
**Solutions:**
|
||||
1. Refresh token via Keycloak
|
||||
2. Re-authenticate
|
||||
3. Check token expiration settings in Keycloak
|
||||
|
||||
### Invalid Token
|
||||
|
||||
**Symptoms:**
|
||||
- 401 Unauthorized
|
||||
- "Invalid token" error
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Verify Keycloak is accessible
|
||||
curl https://keycloak.sankofa.nexus/health
|
||||
|
||||
# Check Keycloak logs
|
||||
kubectl logs -n keycloak deployment/keycloak --tail=100
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Verify token format
|
||||
2. Check Keycloak client configuration
|
||||
3. Verify token signature
|
||||
4. Check clock synchronization
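
Two of the checks above, token format and clock synchronization, can be illustrated without any Keycloak dependency. This is a sketch only: it assumes the token is a signed JWT whose claims have already been decoded, the names `looksLikeJwt` and `timeStatus` are hypothetical, and it does not verify the signature.

```typescript
// Claims relevant to time validity; real tokens carry many more fields.
interface TokenClaims {
  exp: number // expiry, seconds since the Unix epoch
  nbf?: number // optional "not before", seconds since the Unix epoch
}

// A signed JWT has three non-empty base64url segments separated by dots.
function looksLikeJwt(token: string): boolean {
  return /^[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+$/.test(token)
}

// Classify a token's time validity, tolerating a small clock skew
// between the issuer and the machine doing the check.
function timeStatus(
  claims: TokenClaims,
  nowSeconds: number,
  skewSeconds: number = 30
): 'valid' | 'expired' | 'not-yet-valid' {
  if (claims.nbf !== undefined && nowSeconds + skewSeconds < claims.nbf) {
    return 'not-yet-valid'
  }
  if (nowSeconds - skewSeconds >= claims.exp) {
    return 'expired'
  }
  return 'valid'
}
```

A `not-yet-valid` result for a freshly issued token is a typical sign that the validating host's clock is behind the issuer's. A common mistake caught by `looksLikeJwt` is passing the whole `Bearer <token>` header value instead of the token itself.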
|
||||
|
||||
### Permission Denied
|
||||
|
||||
**Symptoms:**
|
||||
- 403 Forbidden
|
||||
- "Access denied" error
|
||||
|
||||
**Solutions:**
|
||||
1. Verify user role in Keycloak
|
||||
2. Check tenant context
|
||||
3. Review RBAC policies
|
||||
4. Verify resource ownership
|
||||
|
||||
## Resource Provisioning
|
||||
|
||||
### VM Creation Fails
|
||||
|
||||
**Symptoms:**
|
||||
- Resource stuck in PENDING
|
||||
- Proxmox errors
|
||||
- Crossplane errors
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Check Crossplane provider
|
||||
kubectl get pods -n crossplane-system | grep proxmox
|
||||
|
||||
# Check ProxmoxVM resource
|
||||
kubectl describe proxmoxvm -n default test-vm
|
||||
|
||||
# Check Proxmox connectivity
|
||||
kubectl exec -it -n crossplane-system deployment/crossplane-provider-proxmox -- \
|
||||
curl https://proxmox-endpoint:8006/api2/json/version
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Verify Proxmox credentials
|
||||
2. Check Proxmox node availability
|
||||
3. Verify resource quotas
|
||||
4. Check Crossplane provider logs
|
||||
|
||||
### Resource Update Fails
|
||||
|
||||
**Symptoms:**
|
||||
- Update mutation fails
|
||||
- Resource not updating
|
||||
- Status mismatch
|
||||
|
||||
**Solutions:**
|
||||
1. Check resource state
|
||||
2. Verify update permissions
|
||||
3. Review resource constraints
|
||||
4. Check for conflicting updates
|
||||
|
||||
## Billing Issues
|
||||
|
||||
### Incorrect Costs
|
||||
|
||||
**Symptoms:**
|
||||
- Unexpected charges
|
||||
- Missing usage records
|
||||
- Cost discrepancies
|
||||
|
||||
**Diagnosis:**
|
||||
```sql
|
||||
-- Check usage records
|
||||
SELECT * FROM usage_records
|
||||
WHERE tenant_id = 'tenant-id'
|
||||
ORDER BY timestamp DESC
|
||||
LIMIT 100;
|
||||
|
||||
-- Check billing calculations
|
||||
SELECT * FROM invoices
|
||||
WHERE tenant_id = 'tenant-id'
|
||||
ORDER BY created_at DESC;
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Review usage records
|
||||
2. Verify pricing configuration
|
||||
3. Check for duplicate records
|
||||
4. Recalculate costs if needed
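
Checking for duplicate records can also be done in application code once the usage rows are loaded. The sketch below groups records by tenant, resource, and timestamp. The `UsageRecord` shape is an assumption: only `tenant_id` and `timestamp` appear in the queries above, and `resourceId` is a hypothetical field.

```typescript
// Hypothetical shape of a usage record; adjust to the real table columns.
interface UsageRecord {
  id: string
  tenantId: string
  resourceId: string
  timestamp: string // ISO 8601
}

// Group records by (tenant, resource, timestamp) and return every group
// that contains more than one record.
function findDuplicateUsage(records: UsageRecord[]): UsageRecord[][] {
  const groups: { [key: string]: UsageRecord[] } = Object.create(null)
  for (const record of records) {
    const key = [record.tenantId, record.resourceId, record.timestamp].join('|')
    if (groups[key] === undefined) {
      groups[key] = []
    }
    groups[key].push(record)
  }
  const duplicates: UsageRecord[][] = []
  for (const key in groups) {
    if (groups[key].length > 1) {
      duplicates.push(groups[key])
    }
  }
  return duplicates
}
```

Each returned group lists the records that would be billed more than once, which makes it easier to decide which row to keep before recalculating costs.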
|
||||
|
||||
### Budget Alerts Not Triggering
|
||||
|
||||
**Symptoms:**
|
||||
- Budget exceeded but no alert
|
||||
- Alerts not sent
|
||||
|
||||
**Diagnosis:**
|
||||
```sql
|
||||
-- Check budget status
|
||||
SELECT * FROM budgets
|
||||
WHERE tenant_id = 'tenant-id';
|
||||
|
||||
-- Check alert configuration
|
||||
SELECT * FROM billing_alerts
|
||||
WHERE tenant_id = 'tenant-id' AND enabled = true;
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Verify alert configuration
|
||||
2. Check alert evaluation schedule
|
||||
3. Review notification channels
|
||||
4. Test alert manually
|
||||
|
||||
### Invoice Generation Fails
|
||||
|
||||
**Symptoms:**
|
||||
- Invoice creation error
|
||||
- Missing line items
|
||||
- PDF generation fails
|
||||
|
||||
**Solutions:**
|
||||
1. Check usage records exist
|
||||
2. Verify billing period
|
||||
3. Check PDF service
|
||||
4. Review invoice template
|
||||
|
||||
## Performance Issues
|
||||
|
||||
### High Latency
|
||||
|
||||
**Symptoms:**
|
||||
- Slow API responses
|
||||
- Timeout errors
|
||||
- High P95 latency
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Check API metrics
|
||||
curl https://api.sankofa.nexus/metrics | grep request_duration
|
||||
|
||||
# Check database performance
|
||||
kubectl exec -it -n api deployment/postgres -- \
|
||||
psql -U sankofa -c "SELECT * FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10"
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Add caching layer
|
||||
2. Optimize database queries
|
||||
3. Scale API horizontally
|
||||
4. Review N+1 query problems
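
The N+1 problem mentioned above is easiest to see by counting queries. The sketch below contrasts a per-item lookup with a batched lookup. The `Vm` and `Site` types, the in-memory `sites` array, and the `fetchSite`/`fetchSites` functions are stand-ins invented for illustration; in the real API they would be database calls.

```typescript
interface Vm { id: string; siteId: string }
interface Site { id: string; name: string }

// Counts lookups so the difference between the two strategies is visible.
let queryCount = 0
const sites: Site[] = [
  { id: 's1', name: 'Site 1' },
  { id: 's2', name: 'Site 2' },
]

// Stand-in for a single-row query: one query per call.
function fetchSite(id: string): Site | undefined {
  queryCount += 1
  return sites.filter((s) => s.id === id)[0]
}

// Stand-in for a query that matches a list of IDs: one query for many rows.
function fetchSites(ids: string[]): Site[] {
  queryCount += 1
  return sites.filter((s) => ids.indexOf(s.id) !== -1)
}

// N+1 pattern: one site lookup per VM.
function siteNamesNaive(vms: Vm[]): string[] {
  return vms.map((vm) => {
    const site = fetchSite(vm.siteId)
    return site ? site.name : 'unknown'
  })
}

// Batched pattern: collect distinct IDs, fetch once, then join in memory.
function siteNamesBatched(vms: Vm[]): string[] {
  const ids: string[] = []
  for (const vm of vms) {
    if (ids.indexOf(vm.siteId) === -1) ids.push(vm.siteId)
  }
  const byId: { [id: string]: Site } = Object.create(null)
  for (const site of fetchSites(ids)) {
    byId[site.id] = site
  }
  return vms.map((vm) => {
    const site = byId[vm.siteId]
    return site ? site.name : 'unknown'
  })
}
```

For a list of N VMs the naive version issues N lookups while the batched version issues one. In GraphQL resolvers the same idea is often applied per request with a batching loader.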
|
||||
|
||||
### High Memory Usage
|
||||
|
||||
**Symptoms:**
|
||||
- OOM kills
|
||||
- Pod restarts
|
||||
- Memory warnings
|
||||
|
||||
**Solutions:**
|
||||
1. Increase memory limits
|
||||
2. Review memory leaks
|
||||
3. Optimize data structures
|
||||
4. Implement pagination
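
Pagination limits how much data is held in memory per request. The following is a minimal sketch of cursor-based pagination over an in-memory list, assuming items are ordered by a string `id`. The `Page` type and `paginate` function are hypothetical; in a real resolver the filter and limit would be pushed into the SQL query rather than applied after loading everything.

```typescript
interface Page<T> {
  items: T[]
  nextCursor: string | null // null means there are no more pages
}

// Return up to `limit` items that come after `cursor`, ordered by id.
function paginate<T extends { id: string }>(
  all: T[],
  limit: number,
  cursor: string | null
): Page<T> {
  const size = Math.max(1, Math.floor(limit))
  // Sort a copy so the caller's array is left untouched.
  const sorted = all.slice().sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0))
  let remaining = sorted
  if (cursor !== null) {
    const after: string = cursor
    remaining = sorted.filter((item) => item.id > after)
  }
  const items = remaining.slice(0, size)
  // The cursor for the next page is the id of the last item returned.
  const nextCursor = remaining.length > size ? items[items.length - 1].id : null
  return { items, nextCursor }
}
```

A client keeps requesting pages, passing the previous `nextCursor`, until it receives `null`. Cursors stay stable when rows are inserted between requests, which offset-based pagination does not guarantee.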
|
||||
|
||||
### High CPU Usage
|
||||
|
||||
**Symptoms:**
|
||||
- Slow responses
|
||||
- CPU throttling
|
||||
- Pod evictions
|
||||
|
||||
**Solutions:**
|
||||
1. Scale horizontally
|
||||
2. Optimize algorithms
|
||||
3. Add caching
|
||||
4. Review expensive operations
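
For the "Add caching" step, a small in-memory cache with a time-to-live is often enough to take repeated expensive work off the hot path. This is a sketch under stated assumptions: the `TtlCache` class is hypothetical, it is per-process only, and a shared cache would be needed once the API is scaled horizontally.

```typescript
interface CacheEntry<V> {
  value: V
  expiresAt: number // milliseconds on the injected clock
}

// Minimal in-memory cache with a fixed time-to-live per entry.
// The clock is injected so tests can control time without real delays.
class TtlCache<V> {
  private entries: { [key: string]: CacheEntry<V> } = Object.create(null)
  private ttlMs: number
  private now: () => number

  constructor(ttlMs: number, now: () => number) {
    this.ttlMs = ttlMs
    this.now = now
  }

  get(key: string): V | undefined {
    const entry = this.entries[key]
    if (entry === undefined) return undefined
    if (this.now() >= entry.expiresAt) {
      delete this.entries[key] // drop stale entries lazily on read
      return undefined
    }
    return entry.value
  }

  set(key: string, value: V): void {
    this.entries[key] = { value, expiresAt: this.now() + this.ttlMs }
  }
}
```

In application code the clock would typically be `Date.now`, for example `new TtlCache<string[]>(30000, Date.now)`. Keeping the TTL short limits how stale a cached result can be.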
|
||||
|
||||
## Deployment Issues
|
||||
|
||||
### Pods Not Starting
|
||||
|
||||
**Symptoms:**
|
||||
- Pods in Pending/CrashLoopBackOff
|
||||
- Image pull errors
|
||||
- Init container failures
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Check pod status
|
||||
kubectl describe pod -n api <pod-name>
|
||||
|
||||
# Check events
|
||||
kubectl get events -n api --sort-by='.lastTimestamp'
|
||||
|
||||
# Check logs
|
||||
kubectl logs -n api <pod-name>
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Check image availability
|
||||
2. Verify resource requests/limits
|
||||
3. Check node resources
|
||||
4. Review init container logs
|
||||
|
||||
### Service Not Accessible
|
||||
|
||||
**Symptoms:**
|
||||
- Service unreachable
|
||||
- DNS resolution fails
|
||||
- Ingress errors
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Check service
|
||||
kubectl get svc -n api
|
||||
|
||||
# Check ingress
|
||||
kubectl describe ingress -n api api
|
||||
|
||||
# Test service directly
|
||||
kubectl port-forward -n api svc/api 8080:80
# port-forward runs in the foreground, so run curl in a second terminal
curl http://localhost:8080/health
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Verify service selector matches pods
|
||||
2. Check ingress configuration
|
||||
3. Verify DNS records
|
||||
4. Check network policies
|
||||
|
||||
### Configuration Issues
|
||||
|
||||
**Symptoms:**
|
||||
- Wrong environment variables
|
||||
- Missing secrets
|
||||
- ConfigMap errors
|
||||
|
||||
**Solutions:**
|
||||
1. Verify environment variables:
|
||||
```bash
|
||||
kubectl exec -n api deployment/api -- env | grep -E "DB_|KEYCLOAK_"
|
||||
```
|
||||
|
||||
2. Check secrets:
|
||||
```bash
|
||||
kubectl get secrets -n api
|
||||
```
|
||||
|
||||
3. Review ConfigMaps:
|
||||
```bash
|
||||
kubectl get configmaps -n api
|
||||
```
|
||||
|
||||
## Getting Help
|
||||
|
||||
### Logs
|
||||
|
||||
```bash
|
||||
# API logs
|
||||
kubectl logs -n api deployment/api --tail=100 -f
|
||||
|
||||
# Database logs
|
||||
kubectl logs -n api deployment/postgres --tail=100
|
||||
|
||||
# Keycloak logs
|
||||
kubectl logs -n keycloak deployment/keycloak --tail=100
|
||||
|
||||
# Crossplane logs
|
||||
kubectl logs -n crossplane-system deployment/crossplane-provider-proxmox --tail=100
|
||||
```
|
||||
|
||||
### Metrics
|
||||
|
||||
```bash
|
||||
# Prometheus queries
|
||||
curl 'https://prometheus.sankofa.nexus/api/v1/query?query=up'
|
||||
|
||||
# Grafana dashboards
|
||||
# Access: https://grafana.sankofa.nexus
|
||||
```
|
||||
|
||||
### Support
|
||||
|
||||
- **Documentation**: See `docs/` directory
|
||||
- **Operations Runbook**: `docs/OPERATIONS_RUNBOOK.md`
|
||||
- **API Documentation**: `docs/API_DOCUMENTATION.md`
|
||||
|
||||
## Common Error Messages
|
||||
|
||||
### "Database connection failed"
|
||||
- Check database pod status
|
||||
- Verify connection string
|
||||
- Check network policies
|
||||
|
||||
### "Authentication required"
|
||||
- Verify token in request
|
||||
- Check token expiration
|
||||
- Verify Keycloak is accessible
|
||||
|
||||
### "Quota exceeded"
|
||||
- Review tenant quotas
|
||||
- Check resource usage
|
||||
- Request quota increase
|
||||
|
||||
### "Resource not found"
|
||||
- Verify resource ID
|
||||
- Check tenant context
|
||||
- Review access permissions
|
||||
|
||||
### "Internal server error"
|
||||
- Check application logs
|
||||
- Review error details
|
||||
- Check system resources
|
||||
|
||||
153
docs/guides/enable-guest-agent-manual.md
Normal file
@@ -0,0 +1,153 @@
|
||||
# Enable Guest Agent on VMs
|
||||
|
||||
## Automated Scripts (Recommended)
|
||||
|
||||
The project includes automated scripts for managing guest agent:
|
||||
|
||||
### Enable Guest Agent
|
||||
|
||||
```bash
|
||||
./scripts/enable-guest-agent-existing-vms.sh
|
||||
```
|
||||
|
||||
This script will:
|
||||
- Automatically discover all nodes on each Proxmox site
|
||||
- Automatically discover all VMs on each node
|
||||
- Check if guest agent is already enabled
|
||||
- Enable guest agent on VMs that need it
|
||||
- Provide detailed summary statistics
|
||||
|
||||
### Verify Guest Agent Status
|
||||
|
||||
```bash
|
||||
./scripts/verify-guest-agent.sh
|
||||
```
|
||||
|
||||
This script will:
|
||||
- List all VMs with their guest agent status
|
||||
- Show which VMs have guest agent enabled/disabled
|
||||
- Provide per-node and per-site summaries
|
||||
- Display VM names and VMIDs for easy identification
|
||||
|
||||
## Manual Instructions (Alternative)
|
||||
|
||||
If the automated script doesn't work, you can use Proxmox CLI via SSH.
|
||||
|
||||
## Site 1 (ml110-01) - 192.168.11.10
|
||||
|
||||
### Step 1: Connect to Proxmox Host
|
||||
```bash
|
||||
ssh root@192.168.11.10
|
||||
```
|
||||
|
||||
### Step 2: Enable Guest Agent for All VMs
|
||||
```bash
|
||||
for vmid in 118 132 133 127 128 123 124 121; do
|
||||
echo "Enabling guest agent on VMID $vmid..."
|
||||
qm set $vmid --agent 1
|
||||
echo "✅ VMID $vmid done"
|
||||
done
|
||||
```
|
||||
|
||||
### Step 3: Verify (Optional)
|
||||
```bash
|
||||
for vmid in 118 132 133 127 128 123 124 121; do
|
||||
agent=$(qm config $vmid | grep '^agent:' | cut -d: -f2 | tr -d ' ')
|
||||
echo "VMID $vmid: agent=${agent:-not set}"
|
||||
done
|
||||
```
|
||||
|
||||
### Step 4: Exit
|
||||
```bash
|
||||
exit
|
||||
```
|
||||
|
||||
## Site 2 (r630-01) - 192.168.11.11
|
||||
|
||||
### Step 1: Connect to Proxmox Host
|
||||
```bash
|
||||
ssh root@192.168.11.11
|
||||
```
|
||||
|
||||
### Step 2: Enable Guest Agent for All VMs
|
||||
```bash
|
||||
for vmid in 119 134 135 122 129 130 125 126 131 120; do
|
||||
echo "Enabling guest agent on VMID $vmid..."
|
||||
qm set $vmid --agent 1
|
||||
echo "✅ VMID $vmid done"
|
||||
done
|
||||
```
|
||||
|
||||
### Step 3: Verify (Optional)
|
||||
```bash
|
||||
for vmid in 119 134 135 122 129 130 125 126 131 120; do
|
||||
agent=$(qm config $vmid | grep '^agent:' | cut -d: -f2 | tr -d ' ')
|
||||
echo "VMID $vmid: agent=${agent:-not set}"
|
||||
done
|
||||
```
|
||||
|
||||
### Step 4: Exit
|
||||
```bash
|
||||
exit
|
||||
```
|
||||
|
||||
## Quick One-Liners (Alternative)
|
||||
|
||||
If you have SSH key-based authentication set up, you can run these one-liners:
|
||||
|
||||
```bash
|
||||
# Site 1
|
||||
ssh root@192.168.11.10 "for vmid in 118 132 133 127 128 123 124 121; do qm set \$vmid --agent 1; done"
|
||||
|
||||
# Site 2
|
||||
ssh root@192.168.11.11 "for vmid in 119 134 135 122 129 130 125 126 131 120; do qm set \$vmid --agent 1; done"
|
||||
```
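The one-liners above do not report which VMID failed if a `qm set` call returns an error. A possible wrapper, sketched here under the assumption that key-based SSH as `root` is already set up, runs one SSH call per VMID and returns a non-zero status if any of them failed. `enable_agents` is a hypothetical function name.

```bash
# Enable the guest agent for a list of VMIDs on one Proxmox host over SSH.
# Usage: enable_agents <host> <vmid>...
# Returns non-zero if any `qm set` call failed.
# enable_agents is a hypothetical helper, not part of Proxmox.
enable_agents() {
  local host=$1
  shift
  local failed=0 vmid
  for vmid in "$@"; do
    if ssh "root@$host" "qm set $vmid --agent 1"; then
      echo "VMID $vmid on $host: agent enabled"
    else
      echo "VMID $vmid on $host: FAILED" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Example calls (hosts and VMIDs from this guide):
# enable_agents 192.168.11.10 118 132 133 127 128 123 124 121
# enable_agents 192.168.11.11 119 134 135 122 129 130 125 126 131 120
```

Opening one SSH connection per VMID is slower than a single remote loop, but it makes each failure visible on the calling machine.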
|
||||
|
||||
## VMID Reference
|
||||
|
||||
### Site 1 (ml110-01)
|
||||
- 118: nginx-proxy-vm
|
||||
- 132: smom-validator-01
|
||||
- 133: smom-validator-02
|
||||
- 127: smom-sentry-01
|
||||
- 128: smom-sentry-02
|
||||
- 123: smom-rpc-node-01
|
||||
- 124: smom-rpc-node-02
|
||||
- 121: smom-management
|
||||
|
||||
### Site 2 (r630-01)
|
||||
- 119: cloudflare-tunnel-vm
|
||||
- 134: smom-validator-03
|
||||
- 135: smom-validator-04
|
||||
- 122: smom-sentry-03
|
||||
- 129: smom-sentry-04
|
||||
- 130: smom-rpc-node-03
|
||||
- 125: smom-rpc-node-04
|
||||
- 126: smom-services
|
||||
- 131: smom-blockscout
|
||||
- 120: smom-monitoring
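
The VMID lists in this guide are hardcoded and can drift as VMs are added or removed. As a rough sketch, the VMIDs on a node can instead be read from `qm list`, whose first output line is a column header. `list_vmids` is a hypothetical helper name.

```bash
# Print the VMIDs known to this Proxmox node, one per line.
# Skips the header line of `qm list` and prints the first column.
# list_vmids is a hypothetical helper, not part of Proxmox.
list_vmids() {
  qm list | awk 'NR > 1 { print $1 }'
}

# Usage on the Proxmox host (this touches every VM on the node):
# for vmid in $(list_vmids); do
#   qm set "$vmid" --agent 1
# done
```

Because this selects every VM on the node, compare its output with the reference lists above before using it in a loop that changes configuration.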
|
||||
|
||||
## Next Steps
|
||||
|
||||
After enabling guest agent in Proxmox:
|
||||
|
||||
1. **Wait for VMs to get IP addresses** (if they don't have them yet)
|
||||
2. **Install guest agent package in each VM** (if not already installed):
|
||||
```bash
|
||||
ssh admin@<vm-ip>
|
||||
sudo apt-get update
|
||||
sudo apt-get install -y qemu-guest-agent
|
||||
sudo systemctl enable qemu-guest-agent
|
||||
sudo systemctl start qemu-guest-agent
|
||||
```
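
Once the package is installed and running, the Proxmox host can check whether the agent inside a VM actually answers. The sketch below uses `qm agent <vmid> ping` for that; `agent_responds` is a hypothetical helper name. If a VM was already running when `agent` was enabled, it may need a full stop and start (not just a reboot from inside the guest) before the agent can respond.

```bash
# Report whether the guest agent inside a VM answers a ping from the host.
# Usage: agent_responds <vmid>
# agent_responds is a hypothetical helper, not part of Proxmox.
agent_responds() {
  local vmid=$1
  if qm agent "$vmid" ping >/dev/null 2>&1; then
    echo "VMID $vmid: guest agent responding"
  else
    echo "VMID $vmid: no response from guest agent"
    return 1
  fi
}

# Usage on the Proxmox host:
# for vmid in 118 132 133 127 128 123 124 121; do
#   agent_responds "$vmid" || true
# done
```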
|
||||
|
||||
## Automatic Guest Agent Enablement
|
||||
|
||||
**New VMs** created with the updated Crossplane provider automatically have the guest agent enabled in their Proxmox configuration. The provider now sets `agent=1` whenever it creates, clones, or updates a VM.
|
||||
|
||||
The guest agent package (`qemu-guest-agent`) is also automatically installed via cloud-init userData in the VM manifests, so new VMs will have both:
|
||||
1. Guest agent enabled in Proxmox config (`agent=1`)
|
||||
2. Guest agent package installed and running in the OS
|
||||
|
||||
For existing VMs, use the automated script or follow the manual instructions above.
|
||||
|
||||