Apply Composer changes: comprehensive API updates, migrations, middleware, and infrastructure improvements

- Add comprehensive database migrations (001-024) for schema evolution
- Enhance API schema with expanded type definitions and resolvers
- Add new middleware: audit logging, rate limiting, MFA enforcement, security, tenant auth
- Implement new services: AI optimization, billing, blockchain, compliance, marketplace
- Add adapter layer for cloud integrations (Cloudflare, Kubernetes, Proxmox, storage)
- Update Crossplane provider with enhanced VM management capabilities
- Add comprehensive test suite for API endpoints and services
- Update frontend components with improved GraphQL subscriptions and real-time updates
- Enhance security configurations and headers (CSP, CORS, etc.)
- Update documentation and configuration files
- Add new CI/CD workflows and validation scripts
- Implement design system improvements and UI enhancements
This commit is contained in:
defiQUG
2025-12-12 18:01:35 -08:00
parent e01131efaf
commit 9daf1fd378
968 changed files with 160890 additions and 1092 deletions

# Data Retention Policy
## Overview
This document defines data retention policies for the Sankofa Phoenix platform to ensure compliance with regulatory requirements and optimize storage costs.
## Retention Periods
### Application Data
#### User Data
- **Active Users**: Retained indefinitely while account is active
- **Inactive Users**: Retained for 7 years after last login
- **Deleted Users**: Soft delete for 90 days, then permanent deletion
- **User Activity Logs**: 2 years
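The user-data periods above amount to a small disposition check. A minimal sketch (function and status names are illustrative, with 7 years rounded to 2555 days):

```bash
# Decide what to do with a user record given its status and the days elapsed
# since last login (or since the deletion request). Thresholds follow the
# retention periods above; names are illustrative, not platform code.
user_retention_action() {
  status="$1"        # ACTIVE | INACTIVE | DELETED
  days_elapsed="$2"
  case "$status" in
    ACTIVE)   echo "retain" ;;   # kept while account is active
    INACTIVE) [ "$days_elapsed" -gt 2555 ] && echo "purge" || echo "retain" ;;
    DELETED)  [ "$days_elapsed" -gt 90 ] && echo "purge" || echo "soft-deleted" ;;
    *)        echo "unknown" ;;
  esac
}
user_retention_action INACTIVE 3000   # -> purge
user_retention_action DELETED 30      # -> soft-deleted
```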
#### Tenant Data
- **Active Tenants**: Retained indefinitely while tenant is active
- **Suspended Tenants**: Retained for 1 year after suspension
- **Deleted Tenants**: Soft delete for 90 days, then permanent deletion
#### Resource Data
- **Active Resources**: Retained indefinitely
- **Deleted Resources**: Retained for 90 days for recovery purposes
- **Resource History**: 1 year
### Audit and Compliance Data
#### Audit Logs
- **Security Events**: 7 years (compliance requirement)
- **Authentication Logs**: 2 years
- **Authorization Logs**: 2 years
- **Data Access Logs**: 2 years
- **Administrative Actions**: 7 years
#### Compliance Data
- **STIG Compliance Reports**: 7 years
- **RMF Documentation**: 7 years
- **Incident Reports**: 7 years
- **Risk Assessments**: 7 years
### Operational Data
#### Application Logs
- **Application Logs (Loki)**: 30 days
- **Access Logs**: 90 days
- **Error Logs**: 90 days
- **Performance Logs**: 30 days
#### Metrics
- **Prometheus Metrics**: 30 days (raw)
- **Aggregated Metrics**: 1 year
- **Custom Metrics**: 90 days
#### Backups
- **Database Backups**: 7 days (daily), 4 weeks (weekly), 12 months (monthly)
- **Configuration Backups**: 90 days
- **Disaster Recovery Backups**: 7 years
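The backup schedule above is effectively a retention lookup per tier. A sketch (tier names are illustrative; 4 weeks = 28 days, 7 years rounded to 2555 days):

```bash
# Map a backup tier to its retention in days, following the periods above.
backup_retention_days() {
  case "$1" in
    daily)    echo 7 ;;
    weekly)   echo 28 ;;
    monthly)  echo 365 ;;
    config)   echo 90 ;;
    disaster) echo 2555 ;;
    *)        echo 0 ;;
  esac
}
backup_retention_days weekly   # -> 28
```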
### Blockchain Data
#### Transaction History
- **All Transactions**: Retained indefinitely (immutable)
- **Transaction Logs**: 7 years
#### Smart Contract Data
- **Contract State**: Retained indefinitely
- **Contract Events**: 7 years
## Data Deletion Procedures
### Automated Deletion
#### Scheduled Cleanup Jobs
```bash
# Run the daily cleanup job at 03:00 (cluster time).
# Note: postgres:14-alpine ships /bin/sh (not bash), and DATABASE_URL must be
# available in the job's environment for the escaped variable to expand.
kubectl create cronjob cleanup-old-data \
  --image=postgres:14-alpine \
  --schedule="0 3 * * *" \
  --restart=OnFailure \
  -- /bin/sh -c "psql \$DATABASE_URL -f /scripts/cleanup-old-data.sql"
```
#### Cleanup Scripts
- **User Data Cleanup**: Runs monthly, deletes users inactive > 7 years
- **Log Cleanup**: Runs daily, deletes logs older than retention period
- **Backup Cleanup**: Runs daily, deletes backups older than retention period
### Manual Deletion
#### User-Requested Deletion
1. User submits deletion request
2. Account marked for deletion
3. 30-day grace period for account recovery
4. Data anonymized after grace period
5. Permanent deletion after 90 days
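The 30-day grace period and 90-day purge deadline above can be derived from the request timestamp. A sketch using epoch seconds (function name is illustrative):

```bash
# Given the deletion-request time (epoch seconds), print when the recovery
# grace period ends and when permanent deletion is due.
deletion_milestones() {
  requested_at="$1"
  grace_ends=$((requested_at + 30 * 86400))   # step 3: 30-day recovery window
  purged_at=$((requested_at + 90 * 86400))    # step 5: permanent deletion
  echo "$grace_ends $purged_at"
}
deletion_milestones 1700000000
```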
#### Administrative Deletion
1. Admin initiates deletion
2. Approval required for sensitive data
3. Data exported for compliance (if required)
4. Data deleted according to retention policy
## Compliance Requirements
### GDPR (General Data Protection Regulation)
- **Right to Erasure**: Users can request data deletion
- **Data Portability**: Users can export their data
- **Retention Limitation**: Data retained only as long as necessary
### SOX (Sarbanes-Oxley Act)
- **Financial Records**: 7 years retention
- **Audit Trails**: 7 years retention
### HIPAA (Health Insurance Portability and Accountability Act)
- **PHI Data**: 6 years minimum retention
- **Access Logs**: 6 years minimum retention
### DoD/MilSpec Compliance
- **Security Logs**: 7 years retention
- **Audit Trails**: 7 years retention
- **Compliance Reports**: 7 years retention
## Implementation
### Database Retention
#### Automated Cleanup Queries
```sql
-- Delete inactive users (after 7 years)
DELETE FROM users
WHERE last_login < NOW() - INTERVAL '7 years'
  AND status = 'INACTIVE';

-- Archive, then delete, old audit logs (after 2 years).
-- Run both statements in one transaction so no row is lost between them.
BEGIN;
INSERT INTO audit_logs_archive
SELECT * FROM audit_logs
WHERE created_at < NOW() - INTERVAL '2 years';

DELETE FROM audit_logs
WHERE created_at < NOW() - INTERVAL '2 years';
COMMIT;
```
### Log Retention
#### Loki Retention Configuration
```yaml
# gitops/apps/monitoring/loki-config.yaml
retention_period: 30d
retention_stream:
  - selector: '{job="api"}'
    period: 90d
  - selector: '{job="portal"}'
    period: 90d
```
#### Prometheus Retention Configuration
```yaml
# gitops/apps/monitoring/prometheus-config.yaml
retention: 30d
retentionSize: 50GB
```
### Backup Retention
#### Backup Cleanup Script
```bash
# Delete backups older than their retention tier. This assumes daily, weekly,
# and monthly backups are written to separate directories; with a single flat
# directory, a -mtime +7 pass would delete the weekly and monthly backups too.
find /backups/postgres/daily   -name "*.sql.gz" -mtime +7   -delete
find /backups/postgres/weekly  -name "*.sql.gz" -mtime +28  -delete   # 4 weeks
find /backups/postgres/monthly -name "*.sql.gz" -mtime +365 -delete   # 12 months
```
## Data Archival
### Long-Term Storage
#### Archived Data Storage
- **Location**: S3 Glacier or equivalent
- **Format**: Compressed, encrypted archives
- **Retention**: Per compliance requirements
- **Access**: On-demand restoration
#### Archive Process
1. Data identified for archival
2. Data compressed and encrypted
3. Data uploaded to archival storage
4. Index updated with archive location
5. Original data deleted after verification
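Steps 2 and 5 above hinge on verifying the archive before deleting the original. A minimal local sketch (encryption and the upload in step 3 are elided; paths are temporary):

```bash
set -e
workdir=$(mktemp -d)
echo "audit row" > "$workdir/data.txt"

# Step 2: compress the data (encryption would follow here)
tar -czf "$workdir/archive.tar.gz" -C "$workdir" data.txt

# Step 5: restore a copy and delete the original only if it matches
mkdir "$workdir/restore"
tar -xzf "$workdir/archive.tar.gz" -C "$workdir/restore"
if cmp -s "$workdir/data.txt" "$workdir/restore/data.txt"; then
  rm "$workdir/data.txt"
  echo "archived and verified"
fi
```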
## Monitoring and Compliance
### Retention Policy Compliance
#### Automated Checks
- Daily verification of retention policies
- Alert on data older than retention period
- Report on data deletion activities
#### Compliance Reporting
- Monthly retention compliance report
- Quarterly audit of data retention
- Annual compliance review
## Exceptions and Extensions
### Legal Hold
- Data subject to legal hold cannot be deleted
- Legal hold overrides retention policy
- Legal hold must be documented
- Data released after legal hold lifted
### Business Requirements
- Extended retention for business-critical data
- Approval required for extensions
- Extensions documented and reviewed annually
## Contact
For questions about data retention:
- **Data Protection Officer**: dpo@sankofa.nexus
- **Compliance Team**: compliance@sankofa.nexus
- **Legal Team**: legal@sankofa.nexus

# Escalation Procedures
## Overview
This document defines escalation procedures for incidents, support requests, and operational issues in the Sankofa Phoenix platform.
## Escalation Levels
### Level 1: On-Call Engineer
- **Response Time**: Immediate (P0/P1) or < 1 hour (P2/P3)
- **Responsibilities**:
  - Initial incident triage
  - Basic troubleshooting
  - Service restart/recovery
  - Status updates
### Level 2: Team Lead / Senior Engineer
- **Response Time**: < 15 minutes (P0/P1) or < 2 hours (P2/P3)
- **Responsibilities**:
  - Complex troubleshooting
  - Architecture decisions
  - Code review for hotfixes
  - Customer communication
### Level 3: Engineering Manager
- **Response Time**: < 30 minutes (P0) or < 4 hours (P1)
- **Responsibilities**:
  - Resource allocation
  - Cross-team coordination
  - Business impact assessment
  - Executive communication
### Level 4: CTO / VP Engineering
- **Response Time**: < 1 hour (P0 only)
- **Responsibilities**:
  - Strategic decisions
  - Customer escalation
  - Public communication
  - Resource approval
## Escalation Triggers
### Automatic Escalation
- P0 incident not resolved in 30 minutes
- P1 incident not resolved in 2 hours
- Multiple services affected simultaneously
- Data loss or security breach detected
### Manual Escalation
- On-call engineer requests assistance
- Customer escalates to management
- Issue requires expertise not available at current level
- Business impact exceeds threshold
## Escalation Matrix
| Severity | Level 1 | Level 2 | Level 3 | Level 4 |
|----------|---------|---------|---------|---------|
| P0 | Immediate | 15 min | 30 min | 1 hour |
| P1 | 15 min | 30 min | 2 hours | 4 hours |
| P2 | 1 hour | 2 hours | 24 hours | N/A |
| P3 | 4 hours | 24 hours | 1 week | N/A |
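The matrix above can be read as a lookup from severity and level to the maximum time to engage that level. A sketch ("N/A" where no escalation applies; function name is illustrative):

```bash
escalation_sla() {
  # args: severity (P0..P3), level (1..4); echoes the engagement deadline
  case "$1:$2" in
    P0:1) echo "immediate" ;; P0:2) echo "15m" ;; P0:3) echo "30m" ;; P0:4) echo "1h" ;;
    P1:1) echo "15m" ;;       P1:2) echo "30m" ;; P1:3) echo "2h" ;;  P1:4) echo "4h" ;;
    P2:1) echo "1h" ;;        P2:2) echo "2h" ;;  P2:3) echo "24h" ;;
    P3:1) echo "4h" ;;        P3:2) echo "24h" ;; P3:3) echo "1w" ;;
    *)    echo "N/A" ;;
  esac
}
escalation_sla P0 3   # -> 30m
```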
## Escalation Process
### Step 1: Initial Assessment
1. On-call engineer receives alert/notification
2. Assess severity and impact
3. Begin investigation
4. Document findings
### Step 2: Escalation Decision
**Escalate if**:
- Issue not resolved within SLA
- Additional expertise needed
- Customer impact is severe
- Business impact is high
- Security concern
**Do NOT escalate if**:
- Issue is being actively worked on
- Resolution is in progress
- Impact is minimal
- Standard procedure can resolve
### Step 3: Escalation Execution
1. **Notify next level**:
   - Create escalation ticket
   - Update incident channel
   - Call/Slack next level contact
   - Provide context and current status
2. **Handoff information**:
   - Incident summary
   - Current status
   - Actions taken
   - Relevant logs/metrics
   - Customer impact
3. **Update tracking**:
   - Update incident system
   - Update status page
   - Document escalation reason
### Step 4: Escalation Resolution
1. Escalated engineer takes ownership
2. On-call engineer provides support
3. Regular status updates
4. Resolution and post-mortem
## Communication Channels
### Internal Communication
- **Slack/Teams**: `#incident-YYYY-MM-DD-<name>`
- **PagerDuty/Opsgenie**: Automatic escalation
- **Email**: For non-urgent escalations
- **Phone**: For P0 incidents
### External Communication
- **Status Page**: Public updates
- **Customer Notifications**: For affected customers
- **Support Tickets**: Update existing tickets
## Contact Information
### On-Call Rotation
- **Primary**: [Contact Info]
- **Secondary**: [Contact Info]
- **Schedule**: [Link to schedule]
### Escalation Contacts
- **Team Lead**: [Contact Info]
- **Engineering Manager**: [Contact Info]
- **CTO**: [Contact Info]
- **VP Engineering**: [Contact Info]
### Support Contacts
- **Support Team Lead**: [Contact Info]
- **Customer Success**: [Contact Info]
## Escalation Scenarios
### Scenario 1: P0 Service Outage
1. **Detection**: Monitoring alert
2. **Level 1**: On-call engineer investigates (5 min)
3. **Escalation**: If not resolved in 15 min → Level 2
4. **Level 2**: Team lead coordinates (15 min)
5. **Escalation**: If not resolved in 30 min → Level 3
6. **Level 3**: Engineering manager allocates resources
7. **Resolution**: Service restored
8. **Post-Mortem**: Within 24 hours
### Scenario 2: Security Breach
1. **Detection**: Security alert or anomaly
2. **Immediate**: Escalate to Level 3 (bypass Level 1/2)
3. **Level 3**: Engineering manager + Security team
4. **Escalation**: If data breach → Level 4
5. **Level 4**: CTO + Legal + PR
6. **Resolution**: Contain, investigate, remediate
7. **Post-Mortem**: Within 48 hours
### Scenario 3: Data Loss
1. **Detection**: Backup failure or data corruption
2. **Immediate**: Escalate to Level 2
3. **Level 2**: Team lead + Database team
4. **Escalation**: If cannot recover → Level 3
5. **Level 3**: Engineering manager + Customer Success
6. **Resolution**: Restore from backup or data recovery
7. **Post-Mortem**: Within 24 hours
### Scenario 4: Performance Degradation
1. **Detection**: Performance metrics exceed thresholds
2. **Level 1**: On-call engineer investigates (1 hour)
3. **Escalation**: If not resolved → Level 2
4. **Level 2**: Team lead + Performance team
5. **Resolution**: Optimize or scale resources
6. **Post-Mortem**: If P1/P0, within 48 hours
## Customer Escalation
### Customer Escalation Process
1. **Support receives** customer escalation
2. **Assess severity**:
   - Technical issue → Engineering
   - Billing issue → Finance
   - Account issue → Customer Success
3. **Notify appropriate team**
4. **Provide customer updates** every 2 hours (P0/P1)
5. **Resolve and follow up**
### Customer Escalation Contacts
- **Support Escalation**: support-escalation@sankofa.nexus
- **Technical Escalation**: tech-escalation@sankofa.nexus
- **Executive Escalation**: executive-escalation@sankofa.nexus
## Escalation Metrics
### Tracking
- **Escalation Rate**: % of incidents escalated
- **Escalation Time**: Time to escalate
- **Resolution Time**: Time to resolve after escalation
- **Customer Satisfaction**: Post-incident surveys
### Goals
- **P0 Escalation**: < 5% of P0 incidents
- **P1 Escalation**: < 10% of P1 incidents
- **Escalation Time**: < SLA threshold
- **Resolution Time**: < 2x normal resolution time
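The escalation-rate goals above are simple ratios over the incident counts. A sketch with illustrative counts (integer math scaled to whole percent):

```bash
p0_total=40        # P0 incidents in the review period (example values)
p0_escalated=1     # P0 incidents that were escalated
rate_pct=$((100 * p0_escalated / p0_total))
if [ "$rate_pct" -lt 5 ]; then
  echo "P0 escalation rate ${rate_pct}%: within the < 5% goal"
else
  echo "P0 escalation rate ${rate_pct}%: exceeds the < 5% goal"
fi
```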
## Best Practices
### Do's
- ✅ Escalate early if unsure
- ✅ Provide complete context
- ✅ Document all actions
- ✅ Communicate frequently
- ✅ Learn from escalations
### Don'ts
- ❌ Escalate without trying
- ❌ Escalate without context
- ❌ Skip levels unnecessarily
- ❌ Ignore customer escalations
- ❌ Forget to update status
## Review and Improvement
### Monthly Review
- Review escalation patterns
- Identify common causes
- Update procedures
- Train team on improvements
### Quarterly Review
- Analyze escalation metrics
- Update contact information
- Review and update SLAs
- Improve documentation

# Incident Response Runbook
## Overview
This runbook provides step-by-step procedures for responding to incidents in the Sankofa Phoenix platform.
## Incident Severity Levels
### P0 - Critical (Immediate Response)
- Complete service outage
- Data loss or corruption
- Security breach
- **Response Time**: Immediate (< 5 minutes)
- **Resolution Target**: < 1 hour
### P1 - High (Urgent Response)
- Partial service outage affecting multiple users
- Performance degradation > 50%
- Authentication failures
- **Response Time**: < 15 minutes
- **Resolution Target**: < 4 hours
### P2 - Medium (Standard Response)
- Single feature/service degraded
- Performance degradation 20-50%
- Non-critical errors
- **Response Time**: < 1 hour
- **Resolution Target**: < 24 hours
### P3 - Low (Normal Response)
- Minor issues
- Cosmetic problems
- Non-blocking errors
- **Response Time**: < 4 hours
- **Resolution Target**: < 1 week
## Incident Response Process
### 1. Detection and Triage
#### Detection Sources
- **Monitoring Alerts**: Prometheus/Alertmanager
- **Error Logs**: Loki, application logs
- **User Reports**: Support tickets, status page
- **Health Checks**: Automated health check failures
#### Initial Triage Steps
```bash
# 1. Check service health
kubectl get pods --all-namespaces | grep -v Running
# 2. Check API health
curl -f https://api.sankofa.nexus/health || echo "API DOWN"
# 3. Check portal health
curl -f https://portal.sankofa.nexus/api/health || echo "PORTAL DOWN"
# 4. Check database connectivity
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT 1" || echo "DB CONNECTION FAILED"
# 5. Check Keycloak
curl -f https://keycloak.sankofa.nexus/health || echo "KEYCLOAK DOWN"
```
### 2. Incident Declaration
#### Create Incident Channel
- Create dedicated Slack/Teams channel: `#incident-YYYY-MM-DD-<name>`
- Invite: On-call engineer, Team lead, Product owner
- Post initial status
#### Incident Template
```
INCIDENT: [Brief Description]
SEVERITY: P0/P1/P2/P3
STATUS: Investigating/Identified/Monitoring/Resolved
START TIME: [Timestamp]
AFFECTED SERVICES: [List]
IMPACT: [User impact description]
```
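The template above can be stamped out from a small helper when declaring an incident, e.g. for posting to the incident channel. A sketch (function name and example values are illustrative):

```bash
incident_declaration() {
  # args: description, severity, status, start time, affected services, impact
  cat <<EOF
INCIDENT: $1
SEVERITY: $2
STATUS: $3
START TIME: $4
AFFECTED SERVICES: $5
IMPACT: $6
EOF
}
incident_declaration "API 5xx spike" "P1" "Investigating" \
  "2025-01-01T00:00Z" "api" "Elevated error rate for API clients"
```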
### 3. Investigation
#### Common Investigation Commands
**Check Pod Status**
```bash
kubectl get pods --all-namespaces -o wide
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100
```
**Check Resource Usage**
```bash
kubectl top nodes
kubectl top pods --all-namespaces
```
**Check Database**
```bash
# Connection count
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"
# Long-running queries
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
```
**Check Logs**
```bash
# Recent errors
kubectl logs -n api deployment/api --tail=500 | grep -i error
# Authentication failures
kubectl logs -n api deployment/api | grep -i "auth.*fail"
# Rate limiting
kubectl logs -n api deployment/api | grep -i "rate limit"
```
**Check Monitoring**
```bash
# Access Grafana
open https://grafana.sankofa.nexus
# Check Prometheus alerts
kubectl get prometheusrules -n monitoring
```
### 4. Resolution
#### Common Resolution Actions
**Restart Service**
```bash
kubectl rollout restart deployment/api -n api
kubectl rollout restart deployment/portal -n portal
```
**Scale Up**
```bash
kubectl scale deployment/api --replicas=5 -n api
```
**Rollback Deployment**
```bash
# See ROLLBACK_PLAN.md for detailed procedures
kubectl rollout undo deployment/api -n api
```
**Clear Rate Limits** (if needed)
```bash
# Access Redis/rate limit store and clear keys
# Or restart rate limit service
kubectl rollout restart deployment/rate-limit -n api
```
**Database Maintenance**
```bash
# Vacuum database
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "VACUUM ANALYZE;"
# Kill long-running queries
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '10 minutes';"
```
### 5. Post-Incident
#### Incident Report Template
```markdown
# Incident Report: [Date] - [Title]
## Summary
[Brief description of incident]
## Timeline
- [Time] - Incident detected
- [Time] - Investigation started
- [Time] - Root cause identified
- [Time] - Resolution implemented
- [Time] - Service restored
## Root Cause
[Detailed root cause analysis]
## Impact
- **Users Affected**: [Number]
- **Duration**: [Time]
- **Services Affected**: [List]
## Resolution
[Steps taken to resolve]
## Prevention
- [ ] Action item 1
- [ ] Action item 2
- [ ] Action item 3
## Follow-up
- [ ] Update monitoring/alerts
- [ ] Update runbooks
- [ ] Code changes needed
- [ ] Documentation updates
```
## Common Incidents
### API High Latency
**Symptoms**: API response times > 500ms
**Investigation**:
```bash
# Check database query performance
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;"
# Check API metrics
curl https://api.sankofa.nexus/metrics | grep http_request_duration
```
**Resolution**:
- Scale API replicas
- Optimize slow queries
- Add database indexes
- Check for N+1 query problems
### Database Connection Pool Exhausted
**Symptoms**: "too many connections" errors
**Investigation**:
```bash
kubectl exec -it -n api deployment/api -- \
psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"
```
**Resolution**:
- Increase connection pool size
- Kill idle connections
- Scale database
- Check for connection leaks
### Authentication Failures
**Symptoms**: Users cannot log in
**Investigation**:
```bash
# Check Keycloak
curl https://keycloak.sankofa.nexus/health
kubectl logs -n keycloak deployment/keycloak --tail=100
# Check API auth logs
kubectl logs -n api deployment/api | grep -i "auth.*fail"
```
**Resolution**:
- Restart Keycloak if needed
- Check OIDC configuration
- Verify JWT secret
- Check network connectivity
### Portal Not Loading
**Symptoms**: Portal returns 500 or blank page
**Investigation**:
```bash
# Check portal pods
kubectl get pods -n portal
kubectl logs -n portal deployment/portal --tail=100
# Check portal health
curl https://portal.sankofa.nexus/api/health
```
**Resolution**:
- Restart portal deployment
- Check environment variables
- Verify Keycloak connectivity
- Check build errors
## Escalation
### When to Escalate
- P0 incident not resolved in 30 minutes
- P1 incident not resolved in 2 hours
- Need additional expertise
- Customer impact is severe
### Escalation Path
1. **On-call Engineer** → Team Lead
2. **Team Lead** → Engineering Manager
3. **Engineering Manager** → CTO/VP Engineering
4. **CTO** → Executive Team
### Emergency Contacts
- **On-call**: [Phone/Slack]
- **Team Lead**: [Phone/Slack]
- **Engineering Manager**: [Phone/Slack]
- **CTO**: [Phone/Slack]
## Communication
### Status Page Updates
- Update status page during incident
- Post updates every 30 minutes (P0/P1) or hourly (P2/P3)
- Include: Status, affected services, estimated resolution time
### Customer Communication
- For P0/P1: Notify affected customers immediately
- For P2/P3: Include in next status update
- Be transparent about impact and resolution timeline

# Proxmox Disaster Recovery Procedures
## Overview
This document outlines disaster recovery procedures for Proxmox infrastructure managed by the Crossplane provider.
## Recovery Scenarios
### Scenario 1: Provider Pod Failure
#### Symptoms
- Provider pod not running
- VM operations failing
- ProviderConfig not working
#### Recovery Steps
1. **Check Pod Status**:
```bash
kubectl get pods -n crossplane-system -l app=crossplane-provider-proxmox
```
2. **Restart Provider**:
```bash
kubectl delete pod -n crossplane-system -l app=crossplane-provider-proxmox
```
3. **Verify Recovery**:
```bash
kubectl get pods -n crossplane-system -l app=crossplane-provider-proxmox
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=50
```
### Scenario 2: Proxmox Node Failure
#### Symptoms
- Cannot connect to Proxmox
- VMs unreachable
- Provider connection errors
#### Recovery Steps
1. **Verify Node Status**:
   - Check Proxmox Web UI
   - Verify node is online
   - Check network connectivity
2. **Check ProviderConfig**:
```bash
kubectl get providerconfig proxmox-provider-config -o yaml
```
3. **Update Endpoint if Needed**:
   - If node IP changed, update ProviderConfig
   - If using hostname, verify DNS
4. **Test Connectivity**:
```bash
curl -k https://your-proxmox:8006/api2/json/version
```
### Scenario 3: Credential Compromise
#### Symptoms
- Authentication failures
- Security alerts
- Unauthorized access
#### Recovery Steps
1. **Revoke Compromised Credentials**:
   - Log into Proxmox Web UI
   - Revoke API tokens
   - Change passwords
2. **Create New Credentials**:
   - Create new API tokens
   - Use strong passwords
   - Set appropriate permissions
3. **Update Kubernetes Secret**:
```bash
kubectl delete secret proxmox-credentials -n crossplane-system
kubectl create secret generic proxmox-credentials \
--from-literal=credentials.json='{"username":"root@pam","token":"new-token"}' \
-n crossplane-system
```
4. **Restart Provider**:
```bash
kubectl delete pod -n crossplane-system -l app=crossplane-provider-proxmox
```
### Scenario 4: VM Data Loss
#### Symptoms
- VM not found
- Data missing
- Storage errors
#### Recovery Steps
1. **Check VM Status**:
```bash
kubectl get proxmoxvm <vm-name>
kubectl describe proxmoxvm <vm-name>
```
2. **Check Proxmox Backups**:
   - Log into Proxmox Web UI
   - Check backup storage
   - Review backup schedule
3. **Restore from Backup**:
   - Use Proxmox backup restore
   - Or recreate VM from template
4. **Recreate VM Resource**:
```bash
# Delete existing resource
kubectl delete proxmoxvm <vm-name>
# Recreate with same configuration
kubectl apply -f <vm-manifest>.yaml
```
### Scenario 5: Complete Provider Failure
#### Symptoms
- Provider not responding
- All VM operations failing
- ProviderConfig errors
#### Recovery Steps
1. **Check Provider Deployment**:
```bash
kubectl get deployment -n crossplane-system crossplane-provider-proxmox
kubectl describe deployment -n crossplane-system crossplane-provider-proxmox
```
2. **Redeploy Provider**:
```bash
kubectl delete deployment -n crossplane-system crossplane-provider-proxmox
kubectl apply -f crossplane-provider-proxmox/config/provider.yaml
```
3. **Verify ProviderConfig**:
```bash
kubectl get providerconfig
kubectl describe providerconfig proxmox-provider-config
```
4. **Test VM Operations**:
```bash
kubectl get proxmoxvm
kubectl describe proxmoxvm <test-vm>
```
## Backup Procedures
### Provider Configuration Backup
```bash
# Backup ProviderConfig
kubectl get providerconfig proxmox-provider-config -o yaml > providerconfig-backup.yaml
# Backup credentials secret (be careful with this!)
kubectl get secret proxmox-credentials -n crossplane-system -o yaml > credentials-backup.yaml
```
### VM Configuration Backup
```bash
# Backup all VM resources
kubectl get proxmoxvm -o yaml > all-vms-backup.yaml
# Backup specific VM
kubectl get proxmoxvm <vm-name> -o yaml > <vm-name>-backup.yaml
```
### Proxmox Backup
1. **Configure Backup Schedule**:
   - Log into Proxmox Web UI
   - Go to Datacenter → Backup
   - Configure backup schedule
2. **Manual Backup**:
   - Select VM in Proxmox Web UI
   - Click Backup
   - Choose backup storage
   - Start backup
## Recovery Testing
### Test Provider Recovery
1. **Simulate Failure**:
```bash
kubectl delete pod -n crossplane-system -l app=crossplane-provider-proxmox
```
2. **Verify Auto-Recovery**:
```bash
kubectl get pods -n crossplane-system -l app=crossplane-provider-proxmox
```
3. **Test VM Operations**:
```bash
kubectl get proxmoxvm
```
### Test VM Recovery
1. **Create Test VM**:
```bash
kubectl apply -f test-vm.yaml
```
2. **Delete VM**:
```bash
kubectl delete proxmoxvm test-vm
```
3. **Recreate VM**:
```bash
kubectl apply -f test-vm.yaml
```
## Prevention
1. **Regular Backups**: Schedule regular backups
2. **Monitoring**: Set up alerts for failures
3. **Documentation**: Keep procedures documented
4. **Testing**: Regularly test recovery procedures
5. **Redundancy**: Use multiple Proxmox nodes
## Related Documentation
- [VM Provisioning Runbook](./PROXMOX_VM_PROVISIONING.md)
- [Troubleshooting Guide](./PROXMOX_TROUBLESHOOTING.md)
- [Deployment Guide](../proxmox/DEPLOYMENT_GUIDE.md)

# Proxmox Troubleshooting Guide
## Common Issues and Solutions
### Provider Not Connecting
#### Symptoms
- Provider logs show connection errors
- ProviderConfig status is not Ready
- VM creation fails with connection errors
#### Solutions
1. **Verify Endpoint**:
```bash
curl -k https://your-proxmox:8006/api2/json/version
```
2. **Check Credentials**:
```bash
kubectl get secret proxmox-credentials -n crossplane-system -o yaml
```
3. **Test Authentication**:
```bash
curl -k -X POST \
-d "username=root@pam&password=your-password" \
https://your-proxmox:8006/api2/json/access/ticket
```
4. **Check Provider Logs**:
```bash
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=100
```
### VM Creation Fails
#### Symptoms
- VM resource stuck in Creating state
- Error messages in VM resource status
- No VM appears in Proxmox
#### Solutions
1. **Check VM Resource**:
```bash
kubectl describe proxmoxvm <vm-name>
```
2. **Verify Site Configuration**:
   - Site must exist in ProviderConfig
   - Endpoint must be reachable
   - Node name must match actual Proxmox node
3. **Check Proxmox Resources**:
   - Storage pool must exist
   - Network bridge must exist
   - OS template must exist
4. **Check Proxmox Logs**:
   - Log into Proxmox Web UI
   - Check System Log
   - Review task history
### VM Status Not Updating
#### Symptoms
- VM status remains unknown
- IP address not populated
- State not reflecting actual VM state
#### Solutions
1. **Check Provider Connectivity**:
```bash
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox | grep -i error
```
2. **Verify VM Exists in Proxmox**:
   - Check Proxmox Web UI
   - Verify VM ID matches
3. **Check Reconciliation**:
```bash
kubectl get proxmoxvm <vm-name> -o yaml | grep -A 5 conditions
```
### Storage Issues
#### Symptoms
- VM creation fails with storage errors
- "Storage not found" errors
- Insufficient storage errors
#### Solutions
1. **List Available Storage**:
```bash
# Via Proxmox API (API tokens use the PVEAPIToken Authorization header;
# PVEAuthCookie is a cookie used for ticket auth, not a header value)
curl -k -H "Authorization: PVEAPIToken=USER@REALM!TOKENID=UUID" \
  https://your-proxmox:8006/api2/json/storage
```
2. **Check Storage Capacity**:
   - Log into Proxmox Web UI
   - Check Storage section
   - Verify available space
3. **Update Storage Name**:
   - Verify actual storage pool name
   - Update VM manifest if needed
### Network Issues
#### Symptoms
- VM created but no network connectivity
- IP address not assigned
- Network bridge errors
#### Solutions
1. **Verify Network Bridge**:
```bash
# Via Proxmox API (API-token auth header)
curl -k -H "Authorization: PVEAPIToken=USER@REALM!TOKENID=UUID" \
  https://your-proxmox:8006/api2/json/nodes/ML110-01/network
```
2. **Check Network Configuration**:
   - Verify bridge name in VM manifest
   - Check bridge exists on node
   - Verify bridge is active
3. **Check DHCP**:
   - Verify DHCP server is running
   - Check network configuration
   - Review VM network settings
### Authentication Failures
#### Symptoms
- 401 Unauthorized errors
- Authentication failed messages
- Token/ticket errors
#### Solutions
1. **Verify Credentials**:
   - Check username format: `user@realm`
   - Verify password is correct
   - Check token format if using tokens
2. **Test Authentication**:
```bash
# Password auth
curl -k -X POST \
-d "username=root@pam&password=your-password" \
https://your-proxmox:8006/api2/json/access/ticket
# Token auth (note the PVEAPIToken header format for API tokens)
curl -k -H "Authorization: PVEAPIToken=USER@REALM!TOKENID=UUID" \
  https://your-proxmox:8006/api2/json/version
```
3. **Check Permissions**:
   - Verify user has VM creation permissions
   - Check token permissions
   - Review Proxmox user roles
### Provider Pod Issues
#### Symptoms
- Provider pod not starting
- Provider pod crashing
- Provider pod in Error state
#### Solutions
1. **Check Pod Status**:
```bash
kubectl get pods -n crossplane-system -l app=crossplane-provider-proxmox
kubectl describe pod -n crossplane-system -l app=crossplane-provider-proxmox
```
2. **Check Pod Logs**:
```bash
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=100
```
3. **Check Image**:
```bash
kubectl get deployment -n crossplane-system crossplane-provider-proxmox -o yaml | grep image
```
4. **Verify Resources**:
```bash
kubectl get deployment -n crossplane-system crossplane-provider-proxmox -o yaml | grep -A 5 resources
```
## Diagnostic Commands
### Check Provider Health
```bash
# Provider status
kubectl get deployment -n crossplane-system crossplane-provider-proxmox
# Provider logs
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=50
# Provider metrics
kubectl port-forward -n crossplane-system deployment/crossplane-provider-proxmox 8080:8080
curl http://localhost:8080/metrics
```
### Check VM Resources
```bash
# List all VMs
kubectl get proxmoxvm
# Get VM details
kubectl get proxmoxvm <vm-name> -o yaml
# Check VM events
kubectl describe proxmoxvm <vm-name>
```
### Check ProviderConfig
```bash
# List ProviderConfigs
kubectl get providerconfig
# Get ProviderConfig details
kubectl get providerconfig proxmox-provider-config -o yaml
# Check ProviderConfig status
kubectl describe providerconfig proxmox-provider-config
```
## Escalation Procedures
### Level 1: Basic Troubleshooting
1. Check provider logs
2. Verify credentials
3. Test connectivity
4. Review VM resource status
### Level 2: Advanced Troubleshooting
1. Check Proxmox Web UI
2. Review Proxmox logs
3. Verify network connectivity
4. Check resource availability
### Level 3: Infrastructure Issues
1. Contact Proxmox administrator
2. Check infrastructure status
3. Review network configuration
4. Verify DNS resolution
## Prevention
1. **Regular Monitoring**: Set up alerts for provider health
2. **Resource Verification**: Verify resources before deployment
3. **Credential Rotation**: Rotate credentials regularly
4. **Backup Configuration**: Backup ProviderConfig and secrets
5. **Documentation**: Keep documentation up to date
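For step 4, a hedged sketch of the backup (the secret name `proxmox-credentials` and its namespace are assumptions — back up whichever secret your ProviderConfig references):

```shell
#!/usr/bin/env bash
# Back up the ProviderConfig and its credentials secret, e.g. before rotation.
# Secret name/namespace are assumptions -- adjust for your cluster.
backup_provider_config() {
  local outdir="${1:-proxmox-backup-$(date +%Y%m%d)}"
  mkdir -p "$outdir"
  kubectl get providerconfig proxmox-provider-config -o yaml > "$outdir/providerconfig.yaml"
  kubectl get secret proxmox-credentials -n crossplane-system -o yaml > "$outdir/credentials.yaml"
  echo "Backed up to $outdir"
}
# Usage: backup_provider_config ./backups/2025-12-12
```

Store the output somewhere access-controlled: the credentials file contains live secrets.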
## Related Documentation
- [VM Provisioning Runbook](./PROXMOX_VM_PROVISIONING.md)
- [Deployment Guide](../proxmox/DEPLOYMENT_GUIDE.md)
- [Site Mapping](../proxmox/SITE_MAPPING.md)

# Proxmox VM Provisioning Runbook
## Overview
This runbook provides step-by-step procedures for provisioning virtual machines on Proxmox infrastructure using the Crossplane provider.
## Prerequisites
- Kubernetes cluster with Crossplane and Proxmox provider installed
- ProviderConfig configured and ready
- Appropriate permissions to create ProxmoxVM resources
- Access to Proxmox Web UI (for verification)
## Standard VM Provisioning
### Step 1: Create VM Manifest
Create a YAML manifest for the VM:
```yaml
apiVersion: proxmox.sankofa.nexus/v1alpha1
kind: ProxmoxVM
metadata:
name: my-vm
namespace: default
spec:
forProvider:
node: ML110-01
name: my-vm
cpu: 2
memory: 4Gi
disk: 50Gi
storage: local-lvm
network: vmbr0
image: ubuntu-22.04-cloud
site: us-sfvalley
userData: |
#cloud-config
users:
- name: admin
groups: sudo
shell: /bin/bash
sudo: ['ALL=(ALL) NOPASSWD:ALL']
providerConfigRef:
name: proxmox-provider-config
```
### Step 2: Apply Manifest
```bash
kubectl apply -f my-vm.yaml
```
### Step 3: Verify Creation
```bash
# Check VM resource status
kubectl get proxmoxvm my-vm
# Get detailed status
kubectl describe proxmoxvm my-vm
# Check VM in Proxmox
# Log into Proxmox Web UI and verify VM exists
```
### Step 4: Verify VM Status
Wait for VM to be created and check status:
```bash
# Watch VM status
kubectl get proxmoxvm my-vm -w
# Check VM ID
kubectl get proxmoxvm my-vm -o jsonpath='{.status.vmId}'
# Check VM state
kubectl get proxmoxvm my-vm -o jsonpath='{.status.state}'
# Check IP address (if available)
kubectl get proxmoxvm my-vm -o jsonpath='{.status.ipAddress}'
```
## Multi-Site VM Provisioning
### Provision VM on Different Site
Update the `site` field in the manifest:
```yaml
spec:
forProvider:
    site: eu-west-1   # or apac-1 or us-sfvalley
    node: R630-01     # node used at both eu-west-1 and apac-1; us-sfvalley uses ML110-01
```
## VM Lifecycle Operations
### Start VM
VMs start automatically after creation. To start a stopped VM manually, update the ProxmoxVM resource or use the Proxmox API/Web UI.
### Stop VM
Update the ProxmoxVM resource, or stop the VM from the Proxmox Web UI.
### Delete VM
```bash
kubectl delete proxmoxvm my-vm
```
## Troubleshooting
### VM Creation Fails
1. **Check ProviderConfig**:
```bash
kubectl get providerconfig proxmox-provider-config -o yaml
```
2. **Check Provider Logs**:
```bash
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=50
```
3. **Verify Site Configuration**:
- Check if site exists in ProviderConfig
- Verify endpoint is reachable
- Check node name matches actual Proxmox node
4. **Check Proxmox Resources**:
- Verify storage pool exists
- Verify network bridge exists
- Verify OS template exists
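For step 3, endpoint reachability can be probed without credentials. A sketch — the URL is an example; even an HTTP 401 proves the Proxmox API and TLS are reachable, while `000` or a timeout points to network/DNS problems:

```shell
#!/usr/bin/env bash
# Probe a Proxmox VE API endpoint. The /api2/json/version path is a standard
# Proxmox VE API route; an authenticated-only response (401) still confirms
# the endpoint is up. Endpoint URL is an example value.
check_proxmox_endpoint() {
  local endpoint="${1:?usage: check_proxmox_endpoint https://host:8006}"
  local code
  code=$(curl -ks -o /dev/null -m 10 -w '%{http_code}' "$endpoint/api2/json/version")
  echo "HTTP $code from $endpoint"
  [ "$code" != "000" ] || echo "endpoint unreachable (network/DNS/TLS failure)"
}
# Usage: check_proxmox_endpoint https://proxmox.example.com:8006
```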
### VM Stuck in Creating State
1. **Check VM Resource Events**:
```bash
kubectl describe proxmoxvm my-vm
```
2. **Check Proxmox Web UI**:
- Log into Proxmox
- Check if VM exists
- Check VM status
- Review Proxmox logs
3. **Verify Resources**:
- Check available storage
- Check available memory
- Check node status
### VM Not Getting IP Address
1. **Check Cloud-Init**:
- Verify userData is correct
- Check cloud-init logs in VM
2. **Check Network Configuration**:
- Verify network bridge is correct
- Check DHCP configuration
- Verify VM network interface
3. **Check Guest Agent**:
- Ensure QEMU guest agent is installed
- Verify guest agent is running
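The guest-side checks above can be collected into one pass. This sketch must be run **inside the VM** (e.g. via the Proxmox console), not on the Kubernetes cluster, and assumes a systemd-based guest with cloud-init and the QEMU guest agent installed:

```shell
#!/usr/bin/env bash
# Guest-side diagnosis for a VM with no IP address. Run inside the guest.
# Assumes systemd, cloud-init, and qemu-guest-agent are present.
check_guest_networking() {
  systemctl status qemu-guest-agent --no-pager   # is the guest agent running?
  cloud-init status --long                       # did cloud-init complete?
  tail -n 50 /var/log/cloud-init-output.log      # user-data / provisioning errors
  ip -brief addr                                 # did any interface get an address?
}
# Usage: check_guest_networking
```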
## Best Practices
1. **Resource Naming**: Use descriptive names for VMs
2. **Resource Limits**: Set appropriate CPU and memory limits
3. **Storage Planning**: Choose appropriate storage pools
4. **Network Configuration**: Use correct network bridges
5. **Backup Strategy**: Configure backups for important VMs
6. **Monitoring**: Set up monitoring for VM metrics
## Common Configurations
### Small VM (Development)
- CPU: 1-2 cores
- Memory: 2-4 Gi
- Disk: 20-50 Gi
### Medium VM (Staging)
- CPU: 2-4 cores
- Memory: 4-8 Gi
- Disk: 50-100 Gi
### Large VM (Production)
- CPU: 4+ cores
- Memory: 8+ Gi
- Disk: 100+ Gi
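As an example, the small development profile above maps onto a ProxmoxVM spec like this (values mirror the Standard VM Provisioning manifest earlier in this runbook; adjust node, storage, and site for your environment):

```yaml
# Small development VM -- sizes taken from the "Small VM" profile above.
apiVersion: proxmox.sankofa.nexus/v1alpha1
kind: ProxmoxVM
metadata:
  name: dev-vm
  namespace: default
spec:
  forProvider:
    node: ML110-01          # from the Standard VM Provisioning example
    name: dev-vm
    cpu: 2
    memory: 4Gi
    disk: 20Gi
    storage: local-lvm
    network: vmbr0
    image: ubuntu-22.04-cloud
    site: us-sfvalley
  providerConfigRef:
    name: proxmox-provider-config
```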
## Related Documentation
- [Deployment Guide](../proxmox/DEPLOYMENT_GUIDE.md)
- [Site Mapping](../proxmox/SITE_MAPPING.md)
- [Resource Inventory](../proxmox/RESOURCE_INVENTORY.md)

# Rollback Plan
## Overview
This document outlines procedures for rolling back deployments in the Sankofa Phoenix platform.
## Rollback Strategy
### GitOps Rollback (Recommended)
All applications are managed via ArgoCD GitOps. Rollbacks should be performed through Git by reverting to a previous commit.
### Manual Rollback
For emergency situations, manual rollbacks can be performed directly in Kubernetes.
## Pre-Rollback Checklist
- [ ] Identify the commit/tag to rollback to
- [ ] Verify the previous version is stable
- [ ] Notify team of rollback
- [ ] Document reason for rollback
- [ ] Check database migration compatibility (if applicable)
## Rollback Procedures
### 1. API Service Rollback
#### GitOps Method
```bash
# 1. Identify the commit to rollback to
git log --oneline api/
# 2. Revert the offending commit (creates a new commit; avoids force-pushing)
cd gitops/apps/api
git revert --no-edit <bad-commit-hash>
git push origin main
# 3. ArgoCD will automatically sync
# Or manually sync:
argocd app sync api
```
#### Manual Method
```bash
# 1. List deployment history
kubectl rollout history deployment/api -n api
# 2. View specific revision
kubectl rollout history deployment/api -n api --revision=<revision-number>
# 3. Rollback to previous revision
kubectl rollout undo deployment/api -n api
# 4. Or rollback to specific revision
kubectl rollout undo deployment/api -n api --to-revision=<revision-number>
# 5. Monitor rollback
kubectl rollout status deployment/api -n api
```
### 2. Portal Rollback
#### GitOps Method
```bash
cd gitops/apps/portal
git revert --no-edit <bad-commit-hash>
git push origin main
argocd app sync portal
```
#### Manual Method
```bash
kubectl rollout undo deployment/portal -n portal
kubectl rollout status deployment/portal -n portal
```
### 3. Database Migration Rollback
**⚠️ WARNING**: Database rollbacks require careful planning. Not all migrations are reversible.
#### Check Migration Status
```bash
# Connect to database (quote so $DATABASE_URL expands inside the pod, not locally)
kubectl exec -it -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL"'
# Check migration history
SELECT * FROM schema_migrations ORDER BY version DESC LIMIT 10;
```
#### Rollback Migration (if reversible)
```bash
# Run down migration
cd api
npm run db:migrate:down
# Or manually revert SQL
kubectl exec -it -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -f /path/to/rollback.sql'
```
#### For Non-Reversible Migrations
1. Create new migration to restore previous state
2. Test in staging first
3. Apply during maintenance window
4. Document data loss risks
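As an illustrative sketch of step 1 — table and column names here are hypothetical, not from this codebase — a forward-fix migration restoring a dropped column might look like:

```sql
-- Hypothetical forward-fix migration: restore a column dropped by an
-- irreversible migration. All names are illustrative.
ALTER TABLE users ADD COLUMN IF NOT EXISTS last_login_at timestamptz;

-- Backfill from an audit/history table if one exists; otherwise the
-- original values are gone and must be documented as data loss (step 4).
UPDATE users u
SET last_login_at = h.last_login_at
FROM users_history h
WHERE h.user_id = u.id
  AND u.last_login_at IS NULL;
```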
### 4. Frontend (Public Site) Rollback
#### GitOps Method
```bash
cd gitops/apps/frontend
git revert --no-edit <bad-commit-hash>
git push origin main
argocd app sync frontend
```
#### Manual Method
```bash
kubectl rollout undo deployment/frontend -n frontend
kubectl rollout status deployment/frontend -n frontend
```
### 5. Monitoring Stack Rollback
```bash
# Rollback Prometheus
kubectl rollout undo deployment/prometheus-operator -n monitoring
# Rollback Grafana
kubectl rollout undo deployment/grafana -n monitoring
# Rollback Alertmanager
kubectl rollout undo deployment/alertmanager -n monitoring
```
### 6. Keycloak Rollback
```bash
# Rollback Keycloak
kubectl rollout undo deployment/keycloak -n keycloak
# Verify Keycloak health
curl https://keycloak.sankofa.nexus/health
```
## Post-Rollback Verification
### 1. Health Checks
```bash
# API
curl -f https://api.sankofa.nexus/health
# Portal
curl -f https://portal.sankofa.nexus/api/health
# Keycloak
curl -f https://keycloak.sankofa.nexus/health
```
### 2. Functional Testing
```bash
# Run smoke tests
./scripts/smoke-tests.sh
# Test authentication
curl -X POST https://api.sankofa.nexus/graphql \
-H "Content-Type: application/json" \
-d '{"query": "mutation { login(email: \"test@example.com\", password: \"test\") { token } }"}'
```
### 3. Monitoring
- Check Grafana dashboards for errors
- Verify Prometheus metrics are normal
- Check Loki logs for errors
### 4. Database Verification
```bash
# Verify database connectivity (quote so $DATABASE_URL expands inside the pod)
kubectl exec -it -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
# Check for data integrity issues
kubectl exec -it -n api deployment/api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM users;"'
```
## Rollback Scenarios
### Scenario 1: API Breaking Change
**Symptoms**: API returns errors after deployment
**Rollback Steps**:
1. Immediately rollback API deployment
2. Verify API health
3. Check error logs
4. Investigate root cause
5. Fix and redeploy
### Scenario 2: Database Migration Failure
**Symptoms**: Database errors, application crashes
**Rollback Steps**:
1. Stop application deployments
2. Assess migration state
3. Rollback migration if possible
4. Or restore from backup
5. Redeploy previous application version
### Scenario 3: Portal Build Failure
**Symptoms**: Portal shows blank page or errors
**Rollback Steps**:
1. Rollback portal deployment
2. Verify portal loads
3. Check build logs
4. Fix build issues
5. Redeploy
### Scenario 4: Configuration Error
**Symptoms**: Services cannot connect to dependencies
**Rollback Steps**:
1. Revert configuration changes in Git
2. ArgoCD will sync automatically
3. Or manually update ConfigMaps/Secrets
4. Restart affected services
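The Scenario 4 steps can be sketched as one helper (the app and deployment names follow the API examples earlier in this document; adjust per affected service):

```shell
#!/usr/bin/env bash
# Revert a bad configuration commit and restart the affected service so it
# picks up the restored ConfigMaps/Secrets. App/namespace names follow the
# API examples in this document -- repeat per affected service.
revert_config_change() {
  local bad_commit="${1:?usage: revert_config_change <commit-sha>}"
  git revert --no-edit "$bad_commit"
  git push origin main
  argocd app sync api                             # or wait for ArgoCD auto-sync
  kubectl rollout restart deployment/api -n api   # reload reverted config
  kubectl rollout status deployment/api -n api
}
# Usage: revert_config_change abc1234
```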
## Rollback Testing
### Staging Rollback Test
```bash
# 1. Deploy new version to staging
argocd app sync api-staging
# 2. Test new version
./scripts/smoke-tests.sh --env=staging
# 3. Simulate rollback
kubectl rollout undo deployment/api -n api-staging
# 4. Verify rollback works
./scripts/smoke-tests.sh --env=staging
```
## Rollback Communication
### Internal Communication
- Notify team in #engineering channel
- Update incident tracking system
- Document in runbook
### External Communication
- Update status page if user-facing
- Notify affected customers if needed
- Post-mortem for P0/P1 incidents
## Prevention
### Pre-Deployment
- [ ] All tests passing
- [ ] Code review completed
- [ ] Staging deployment successful
- [ ] Smoke tests passing
- [ ] Database migrations tested
- [ ] Rollback plan reviewed
### Deployment
- [ ] Deploy to staging first
- [ ] Monitor staging for 24 hours
- [ ] Gradual production rollout (canary)
- [ ] Monitor metrics closely
- [ ] Have rollback plan ready
## Rollback Decision Matrix
| Issue | Severity | Rollback? |
|-------|----------|-----------|
| Complete outage | P0 | Yes, immediately |
| Data corruption | P0 | Yes, immediately |
| Security breach | P0 | Yes, immediately |
| >50% error rate | P1 | Yes, within 15 min |
| Performance >50% degraded | P1 | Yes, within 30 min |
| Single feature broken | P2 | Maybe, assess impact |
| Minor bugs | P3 | No, fix forward |
## Emergency Contacts
- **On-call Engineer**: [Contact]
- **Team Lead**: [Contact]
- **DevOps Lead**: [Contact]