Apply Composer changes: comprehensive API updates, migrations, middleware, and infrastructure improvements

- Add comprehensive database migrations (001-024) for schema evolution
- Enhance API schema with expanded type definitions and resolvers
- Add new middleware: audit logging, rate limiting, MFA enforcement, security, tenant auth
- Implement new services: AI optimization, billing, blockchain, compliance, marketplace
- Add adapter layer for cloud integrations (Cloudflare, Kubernetes, Proxmox, storage)
- Update Crossplane provider with enhanced VM management capabilities
- Add comprehensive test suite for API endpoints and services
- Update frontend components with improved GraphQL subscriptions and real-time updates
- Enhance security configurations and headers (CSP, CORS, etc.)
- Update documentation and configuration files
- Add new CI/CD workflows and validation scripts
- Implement design system improvements and UI enhancements
docs/runbooks/DATA_RETENTION_POLICY.md (new file, 224 lines)
@@ -0,0 +1,224 @@
# Data Retention Policy

## Overview

This document defines data retention policies for the Sankofa Phoenix platform, to ensure compliance with regulatory requirements and to control storage costs.

## Retention Periods

### Application Data

#### User Data
- **Active Users**: Retained indefinitely while the account is active
- **Inactive Users**: Retained for 7 years after last login
- **Deleted Users**: Soft-deleted for 90 days, then permanently deleted
- **User Activity Logs**: 2 years

#### Tenant Data
- **Active Tenants**: Retained indefinitely while the tenant is active
- **Suspended Tenants**: Retained for 1 year after suspension
- **Deleted Tenants**: Soft-deleted for 90 days, then permanently deleted

#### Resource Data
- **Active Resources**: Retained indefinitely
- **Deleted Resources**: Retained for 90 days for recovery purposes
- **Resource History**: 1 year

### Audit and Compliance Data

#### Audit Logs
- **Security Events**: 7 years (compliance requirement)
- **Authentication Logs**: 2 years
- **Authorization Logs**: 2 years
- **Data Access Logs**: 2 years
- **Administrative Actions**: 7 years

#### Compliance Data
- **STIG Compliance Reports**: 7 years
- **RMF Documentation**: 7 years
- **Incident Reports**: 7 years
- **Risk Assessments**: 7 years

### Operational Data

#### Application Logs
- **Application Logs (Loki)**: 30 days
- **Access Logs**: 90 days
- **Error Logs**: 90 days
- **Performance Logs**: 30 days

#### Metrics
- **Prometheus Metrics**: 30 days (raw)
- **Aggregated Metrics**: 1 year
- **Custom Metrics**: 90 days

#### Backups
- **Database Backups**: 7 days (daily), 4 weeks (weekly), 12 months (monthly)
- **Configuration Backups**: 90 days
- **Disaster Recovery Backups**: 7 years

### Blockchain Data

#### Transaction History
- **All Transactions**: Retained indefinitely (immutable)
- **Transaction Logs**: 7 years

#### Smart Contract Data
- **Contract State**: Retained indefinitely
- **Contract Events**: 7 years

## Data Deletion Procedures

### Automated Deletion

#### Scheduled Cleanup Jobs
```bash
# Run the cleanup job daily at 03:00
kubectl create cronjob cleanup-old-data \
  --image=postgres:14-alpine \
  --schedule="0 3 * * *" \
  --restart=OnFailure \
  -- /bin/bash -c "psql $DATABASE_URL -f /scripts/cleanup-old-data.sql"
```

#### Cleanup Scripts
- **User Data Cleanup**: Runs monthly; deletes users inactive for more than 7 years
- **Log Cleanup**: Runs daily; deletes logs older than their retention period
- **Backup Cleanup**: Runs daily; deletes backups older than their retention period

### Manual Deletion

#### User-Requested Deletion
1. User submits a deletion request
2. Account is marked for deletion
3. 30-day grace period for account recovery
4. Data is anonymized after the grace period
5. Permanent deletion after 90 days
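
The grace-period timeline above reduces to a small date calculation. A sketch only: the request date is illustrative, and GNU `date -d` relative syntax is assumed (adjust for BSD/macOS).

```bash
# Sketch: compute the recovery and purge deadlines for a deletion request
# (30-day grace period, permanent deletion at 90 days).
requested_at="2024-01-15"   # illustrative request date
grace_ends=$(date -u -d "$requested_at + 30 days" +%F)
purge_at=$(date -u -d "$requested_at + 90 days" +%F)
echo "recovery possible until: $grace_ends"   # 2024-02-14
echo "permanent deletion on:   $purge_at"     # 2024-04-14
```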

#### Administrative Deletion
1. Admin initiates the deletion
2. Approval is required for sensitive data
3. Data is exported for compliance (if required)
4. Data is deleted according to the retention policy

## Compliance Requirements

### GDPR (General Data Protection Regulation)
- **Right to Erasure**: Users can request deletion of their data
- **Data Portability**: Users can export their data
- **Retention Limitation**: Data is retained only as long as necessary

### SOX (Sarbanes-Oxley Act)
- **Financial Records**: 7 years retention
- **Audit Trails**: 7 years retention

### HIPAA (Health Insurance Portability and Accountability Act)
- **PHI Data**: 6 years minimum retention
- **Access Logs**: 6 years minimum retention

### DoD/MilSpec Compliance
- **Security Logs**: 7 years retention
- **Audit Trails**: 7 years retention
- **Compliance Reports**: 7 years retention

## Implementation

### Database Retention

#### Automated Cleanup Queries
```sql
-- Delete inactive users (7 years)
DELETE FROM users
WHERE last_login < NOW() - INTERVAL '7 years'
  AND status = 'INACTIVE';

-- Archive, then delete, audit logs older than 2 years.
-- Run both statements in one transaction so rows are never
-- deleted before the archive copy is committed.
BEGIN;

INSERT INTO audit_logs_archive
SELECT * FROM audit_logs
WHERE created_at < NOW() - INTERVAL '2 years';

DELETE FROM audit_logs
WHERE created_at < NOW() - INTERVAL '2 years';

COMMIT;
```

### Log Retention

#### Loki Retention Configuration
```yaml
# gitops/apps/monitoring/loki-config.yaml
retention_period: 30d
retention_stream:
  - selector: '{job="api"}'
    period: 90d
  - selector: '{job="portal"}'
    period: 90d
```

#### Prometheus Retention Configuration
```yaml
# gitops/apps/monitoring/prometheus-config.yaml
retention: 30d
retentionSize: 50GB
```

### Backup Retention

#### Backup Cleanup Script

Daily, weekly, and monthly backups are pruned against different thresholds, so they must live in separate directories; running all three `find` commands against a single directory would delete weekly and monthly backups after only 7 days. Assuming backups are written to per-tier subdirectories:

```bash
# Delete backups older than their retention period
find /backups/postgres/daily   -name "*.sql.gz" -mtime +7   -delete
find /backups/postgres/weekly  -name "*.sql.gz" -mtime +28  -delete  # 4 weeks
find /backups/postgres/monthly -name "*.sql.gz" -mtime +365 -delete  # 12 months
```
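
Before trusting `-delete` in a cleanup like the one above, the matching logic can be verified with a dry run against scratch files. A sketch: the scratch directory and GNU `touch -d` are assumptions for illustration.

```bash
# Sketch: dry-run the retention predicate before enabling -delete.
tmp=$(mktemp -d)
touch -d "10 days ago" "$tmp/expired-backup.sql.gz"
touch -d "2 days ago"  "$tmp/recent-backup.sql.gz"

# Same predicate as the daily rule, but without -delete:
matches=$(find "$tmp" -name "*.sql.gz" -mtime +7)
echo "would delete: $matches"
```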

## Data Archival

### Long-Term Storage

#### Archived Data Storage
- **Location**: S3 Glacier or equivalent
- **Format**: Compressed, encrypted archives
- **Retention**: Per compliance requirements
- **Access**: On-demand restoration

#### Archive Process
1. Data is identified for archival
2. Data is compressed and encrypted
3. Data is uploaded to archival storage
4. The index is updated with the archive location
5. The original data is deleted after verification
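
The ordering in steps 2 and 5 matters: the original is deleted only after the archive verifies. A minimal local sketch of that ordering, with illustrative file names; the encrypt and upload steps that would sit in between are omitted here.

```bash
# Sketch: compress -> verify -> only then delete the original.
staging=$(mktemp -d)
echo "2022 audit rows" > "$staging/audit_logs_2022.csv"

# Step 2: compress (encryption and upload would follow)
tar -czf "$staging/audit_logs_2022.tar.gz" -C "$staging" audit_logs_2022.csv
sha256sum "$staging/audit_logs_2022.tar.gz" > "$staging/audit_logs_2022.tar.gz.sha256"

# Step 5: delete the original only after the archive verifies
sha256sum -c "$staging/audit_logs_2022.tar.gz.sha256" \
  && rm "$staging/audit_logs_2022.csv"
```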

## Monitoring and Compliance

### Retention Policy Compliance

#### Automated Checks
- Daily verification of retention policies
- Alerts on data older than its retention period
- Reports on data deletion activity

#### Compliance Reporting
- Monthly retention compliance report
- Quarterly audit of data retention
- Annual compliance review

## Exceptions and Extensions

### Legal Hold
- Data subject to a legal hold cannot be deleted
- A legal hold overrides the retention policy
- Legal holds must be documented
- Data is released once the legal hold is lifted

### Business Requirements
- Extended retention for business-critical data
- Approval is required for extensions
- Extensions are documented and reviewed annually

## Contact

For questions about data retention:
- **Data Protection Officer**: dpo@sankofa.nexus
- **Compliance Team**: compliance@sankofa.nexus
- **Legal Team**: legal@sankofa.nexus
docs/runbooks/ESCALATION_PROCEDURES.md (new file, 239 lines)
@@ -0,0 +1,239 @@
# Escalation Procedures

## Overview

This document defines escalation procedures for incidents, support requests, and operational issues in the Sankofa Phoenix platform.

## Escalation Levels

### Level 1: On-Call Engineer
- **Response Time**: Immediate (P0/P1) or < 1 hour (P2/P3)
- **Responsibilities**:
  - Initial incident triage
  - Basic troubleshooting
  - Service restart/recovery
  - Status updates

### Level 2: Team Lead / Senior Engineer
- **Response Time**: < 15 minutes (P0/P1) or < 2 hours (P2/P3)
- **Responsibilities**:
  - Complex troubleshooting
  - Architecture decisions
  - Code review for hotfixes
  - Customer communication

### Level 3: Engineering Manager
- **Response Time**: < 30 minutes (P0) or < 4 hours (P1)
- **Responsibilities**:
  - Resource allocation
  - Cross-team coordination
  - Business impact assessment
  - Executive communication

### Level 4: CTO / VP Engineering
- **Response Time**: < 1 hour (P0 only)
- **Responsibilities**:
  - Strategic decisions
  - Customer escalation
  - Public communication
  - Resource approval

## Escalation Triggers

### Automatic Escalation
- P0 incident not resolved within 30 minutes
- P1 incident not resolved within 2 hours
- Multiple services affected simultaneously
- Data loss or a security breach detected
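
The two time-based triggers above reduce to a simple threshold check. A sketch only; in practice the incident age would come from the incident tracker, not a hard-coded argument.

```bash
# Sketch: time-based automatic escalation. Succeeds (exit 0) when the
# incident has exceeded its window and should escalate to the next level.
should_escalate() {  # usage: should_escalate <severity> <age-minutes>
  case "$1" in
    P0) [ "$2" -ge 30 ] ;;    # P0: escalate after 30 minutes
    P1) [ "$2" -ge 120 ] ;;   # P1: escalate after 2 hours
    *)  return 1 ;;           # other severities: manual escalation only
  esac
}

should_escalate P0 45 && echo "escalate to next level"
```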

### Manual Escalation
- The on-call engineer requests assistance
- A customer escalates to management
- The issue requires expertise not available at the current level
- Business impact exceeds the agreed threshold

## Escalation Matrix

| Severity | Level 1 | Level 2 | Level 3 | Level 4 |
|----------|---------|---------|---------|---------|
| P0 | Immediate | 15 min | 30 min | 1 hour |
| P1 | 15 min | 30 min | 2 hours | 4 hours |
| P2 | 1 hour | 2 hours | 24 hours | N/A |
| P3 | 4 hours | 24 hours | 1 week | N/A |
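
The matrix can also be encoded in tooling rather than read off the page. A sketch of a lookup for one column (Level 2), with the values taken from the table above; the function name is an assumption.

```bash
# Sketch: Level 2 engagement window per severity, per the escalation matrix.
sla_level2() {
  case "$1" in
    P0) echo "15 min" ;;
    P1) echo "30 min" ;;
    P2) echo "2 hours" ;;
    P3) echo "24 hours" ;;
    *)  echo "unknown severity" ;;
  esac
}

sla_level2 P1   # → 30 min
```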

## Escalation Process

### Step 1: Initial Assessment
1. The on-call engineer receives the alert/notification
2. Assess severity and impact
3. Begin investigation
4. Document findings

### Step 2: Escalation Decision
**Escalate if**:
- The issue is not resolved within its SLA
- Additional expertise is needed
- Customer impact is severe
- Business impact is high
- There is a security concern

**Do NOT escalate if**:
- The issue is being actively worked on
- Resolution is in progress
- Impact is minimal
- A standard procedure can resolve it

### Step 3: Escalation Execution
1. **Notify the next level**:
   - Create an escalation ticket
   - Update the incident channel
   - Call/Slack the next-level contact
   - Provide context and current status

2. **Hand off information**:
   - Incident summary
   - Current status
   - Actions taken
   - Relevant logs/metrics
   - Customer impact

3. **Update tracking**:
   - Update the incident system
   - Update the status page
   - Document the escalation reason

### Step 4: Escalation Resolution
1. The escalated engineer takes ownership
2. The on-call engineer provides support
3. Regular status updates
4. Resolution and post-mortem

## Communication Channels

### Internal Communication
- **Slack/Teams**: `#incident-YYYY-MM-DD-<name>`
- **PagerDuty/Opsgenie**: Automatic escalation
- **Email**: For non-urgent escalations
- **Phone**: For P0 incidents
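
The channel naming convention above can be generated rather than typed by hand, which avoids typos in the date. A sketch; the slug argument and function name are assumptions.

```bash
# Sketch: build the #incident-YYYY-MM-DD-<name> channel name from a slug.
incident_channel() {
  printf '#incident-%s-%s\n' "$(date -u +%F)" "$1"
}

incident_channel api-outage
```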

### External Communication
- **Status Page**: Public updates
- **Customer Notifications**: For affected customers
- **Support Tickets**: Update existing tickets

## Contact Information

### On-Call Rotation
- **Primary**: [Contact Info]
- **Secondary**: [Contact Info]
- **Schedule**: [Link to schedule]

### Escalation Contacts
- **Team Lead**: [Contact Info]
- **Engineering Manager**: [Contact Info]
- **CTO**: [Contact Info]
- **VP Engineering**: [Contact Info]

### Support Contacts
- **Support Team Lead**: [Contact Info]
- **Customer Success**: [Contact Info]

## Escalation Scenarios

### Scenario 1: P0 Service Outage
1. **Detection**: Monitoring alert
2. **Level 1**: On-call engineer investigates (5 min)
3. **Escalation**: If not resolved in 15 min → Level 2
4. **Level 2**: Team lead coordinates (15 min)
5. **Escalation**: If not resolved in 30 min → Level 3
6. **Level 3**: Engineering manager allocates resources
7. **Resolution**: Service restored
8. **Post-Mortem**: Within 24 hours

### Scenario 2: Security Breach
1. **Detection**: Security alert or anomaly
2. **Immediate**: Escalate to Level 3 (bypass Levels 1/2)
3. **Level 3**: Engineering manager + Security team
4. **Escalation**: If there is a data breach → Level 4
5. **Level 4**: CTO + Legal + PR
6. **Resolution**: Contain, investigate, remediate
7. **Post-Mortem**: Within 48 hours

### Scenario 3: Data Loss
1. **Detection**: Backup failure or data corruption
2. **Immediate**: Escalate to Level 2
3. **Level 2**: Team lead + Database team
4. **Escalation**: If data cannot be recovered → Level 3
5. **Level 3**: Engineering manager + Customer Success
6. **Resolution**: Restore from backup or data recovery
7. **Post-Mortem**: Within 24 hours

### Scenario 4: Performance Degradation
1. **Detection**: Performance metrics exceed thresholds
2. **Level 1**: On-call engineer investigates (1 hour)
3. **Escalation**: If not resolved → Level 2
4. **Level 2**: Team lead + Performance team
5. **Resolution**: Optimize or scale resources
6. **Post-Mortem**: If P1/P0, within 48 hours

## Customer Escalation

### Customer Escalation Process
1. **Support receives** the customer escalation
2. **Assess severity**:
   - Technical issue → Engineering
   - Billing issue → Finance
   - Account issue → Customer Success
3. **Notify the appropriate team**
4. **Provide customer updates** every 2 hours (P0/P1)
5. **Resolve and follow up**

### Customer Escalation Contacts
- **Support Escalation**: support-escalation@sankofa.nexus
- **Technical Escalation**: tech-escalation@sankofa.nexus
- **Executive Escalation**: executive-escalation@sankofa.nexus

## Escalation Metrics

### Tracking
- **Escalation Rate**: Percentage of incidents escalated
- **Escalation Time**: Time taken to escalate
- **Resolution Time**: Time to resolve after escalation
- **Customer Satisfaction**: Post-incident surveys

### Goals
- **P0 Escalation**: < 5% of P0 incidents
- **P1 Escalation**: < 10% of P1 incidents
- **Escalation Time**: Within the SLA threshold
- **Resolution Time**: < 2x the normal resolution time

## Best Practices

### Do's
- ✅ Escalate early if unsure
- ✅ Provide complete context
- ✅ Document all actions
- ✅ Communicate frequently
- ✅ Learn from escalations

### Don'ts
- ❌ Escalate without attempting basic troubleshooting first
- ❌ Escalate without context
- ❌ Skip levels unnecessarily
- ❌ Ignore customer escalations
- ❌ Forget to update status

## Review and Improvement

### Monthly Review
- Review escalation patterns
- Identify common causes
- Update procedures
- Train the team on improvements

### Quarterly Review
- Analyze escalation metrics
- Update contact information
- Review and update SLAs
- Improve documentation
docs/runbooks/INCIDENT_RESPONSE.md (new file, 319 lines)
@@ -0,0 +1,319 @@
# Incident Response Runbook

## Overview

This runbook provides step-by-step procedures for responding to incidents in the Sankofa Phoenix platform.

## Incident Severity Levels

### P0 - Critical (Immediate Response)
- Complete service outage
- Data loss or corruption
- Security breach
- **Response Time**: Immediate (< 5 minutes)
- **Resolution Target**: < 1 hour

### P1 - High (Urgent Response)
- Partial service outage affecting multiple users
- Performance degradation > 50%
- Authentication failures
- **Response Time**: < 15 minutes
- **Resolution Target**: < 4 hours

### P2 - Medium (Standard Response)
- A single feature/service degraded
- Performance degradation of 20-50%
- Non-critical errors
- **Response Time**: < 1 hour
- **Resolution Target**: < 24 hours

### P3 - Low (Normal Response)
- Minor issues
- Cosmetic problems
- Non-blocking errors
- **Response Time**: < 4 hours
- **Resolution Target**: < 1 week

## Incident Response Process

### 1. Detection and Triage

#### Detection Sources
- **Monitoring Alerts**: Prometheus/Alertmanager
- **Error Logs**: Loki, application logs
- **User Reports**: Support tickets, status page
- **Health Checks**: Automated health-check failures

#### Initial Triage Steps
```bash
# 1. Check service health
kubectl get pods --all-namespaces | grep -v Running

# 2. Check API health
curl -f https://api.sankofa.nexus/health || echo "API DOWN"

# 3. Check portal health
curl -f https://portal.sankofa.nexus/api/health || echo "PORTAL DOWN"

# 4. Check database connectivity
kubectl exec -it -n api deployment/api -- \
  psql "$DATABASE_URL" -c "SELECT 1" || echo "DB CONNECTION FAILED"

# 5. Check Keycloak
curl -f https://keycloak.sankofa.nexus/health || echo "KEYCLOAK DOWN"
```
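
The five checks above can be wrapped in one loop that prints a triage summary. In this sketch, `probe` is a stand-in for the real curl/kubectl checks so the reporting logic stays self-contained; in practice it would run the commands above. The simulated portal failure is purely for illustration.

```bash
# Sketch: one-pass triage summary. `probe` stands in for the real health
# checks; here it pretends the portal check fails, for illustration.
probe() { [ "$1" != "portal" ]; }

for svc in api portal database keycloak; do
  if probe "$svc"; then
    echo "$svc: OK"
  else
    echo "$svc: DOWN - investigate first"
  fi
done
```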

### 2. Incident Declaration

#### Create an Incident Channel
- Create a dedicated Slack/Teams channel: `#incident-YYYY-MM-DD-<name>`
- Invite: on-call engineer, team lead, product owner
- Post the initial status

#### Incident Template
```
INCIDENT: [Brief Description]
SEVERITY: P0/P1/P2/P3
STATUS: Investigating/Identified/Monitoring/Resolved
START TIME: [Timestamp]
AFFECTED SERVICES: [List]
IMPACT: [User impact description]
```

### 3. Investigation

#### Common Investigation Commands

**Check Pod Status**
```bash
kubectl get pods --all-namespaces -o wide
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100
```

**Check Resource Usage**
```bash
kubectl top nodes
kubectl top pods --all-namespaces
```

**Check the Database**
```bash
# Connection count
kubectl exec -it -n api deployment/api -- \
  psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity;"

# Long-running queries
kubectl exec -it -n api deployment/api -- \
  psql "$DATABASE_URL" -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 minutes';"
```

**Check Logs**
```bash
# Recent errors
kubectl logs -n api deployment/api --tail=500 | grep -i error

# Authentication failures
kubectl logs -n api deployment/api | grep -i "auth.*fail"

# Rate limiting
kubectl logs -n api deployment/api | grep -i "rate limit"
```

**Check Monitoring**
```bash
# Access Grafana
open https://grafana.sankofa.nexus

# Check Prometheus alert rules
kubectl get prometheusrules -n monitoring
```

### 4. Resolution

#### Common Resolution Actions

**Restart a Service**
```bash
kubectl rollout restart deployment/api -n api
kubectl rollout restart deployment/portal -n portal
```

**Scale Up**
```bash
kubectl scale deployment/api --replicas=5 -n api
```

**Roll Back a Deployment**
```bash
# See ROLLBACK_PLAN.md for detailed procedures
kubectl rollout undo deployment/api -n api
```

**Clear Rate Limits** (if needed)
```bash
# Access the Redis/rate-limit store and clear the relevant keys,
# or restart the rate-limit service:
kubectl rollout restart deployment/rate-limit -n api
```

**Database Maintenance**
```bash
# Vacuum the database
kubectl exec -it -n api deployment/api -- \
  psql "$DATABASE_URL" -c "VACUUM ANALYZE;"

# Kill long-running queries
kubectl exec -it -n api deployment/api -- \
  psql "$DATABASE_URL" -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '10 minutes';"
```

### 5. Post-Incident

#### Incident Report Template
```markdown
# Incident Report: [Date] - [Title]

## Summary
[Brief description of the incident]

## Timeline
- [Time] - Incident detected
- [Time] - Investigation started
- [Time] - Root cause identified
- [Time] - Resolution implemented
- [Time] - Service restored

## Root Cause
[Detailed root cause analysis]

## Impact
- **Users Affected**: [Number]
- **Duration**: [Time]
- **Services Affected**: [List]

## Resolution
[Steps taken to resolve]

## Prevention
- [ ] Action item 1
- [ ] Action item 2
- [ ] Action item 3

## Follow-up
- [ ] Update monitoring/alerts
- [ ] Update runbooks
- [ ] Code changes needed
- [ ] Documentation updates
```

## Common Incidents

### API High Latency

**Symptoms**: API response times > 500ms

**Investigation**:
```bash
# Check database query performance
kubectl exec -it -n api deployment/api -- \
  psql "$DATABASE_URL" -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC LIMIT 10;"

# Check API metrics
curl https://api.sankofa.nexus/metrics | grep http_request_duration
```

**Resolution**:
- Scale API replicas
- Optimize slow queries
- Add database indexes
- Check for N+1 query problems

### Database Connection Pool Exhausted

**Symptoms**: "too many connections" errors

**Investigation**:
```bash
kubectl exec -it -n api deployment/api -- \
  psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity;"
```

**Resolution**:
- Increase the connection pool size
- Kill idle connections
- Scale the database
- Check for connection leaks
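
Pool exhaustion is often just arithmetic: replica count times per-replica pool size must stay below the database's `max_connections`, with headroom for admin sessions. A sketch with assumed numbers; read the real values from the deployment spec and `postgresql.conf`.

```bash
# Sketch: pool-sizing sanity check. All three values are assumptions
# for illustration.
max_connections=100
pool_per_replica=10
replicas=12

needed=$((replicas * pool_per_replica))
if [ "$needed" -ge "$max_connections" ]; then
  echo "at risk: $needed connections needed, limit is $max_connections"
fi
```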

### Authentication Failures

**Symptoms**: Users cannot log in

**Investigation**:
```bash
# Check Keycloak
curl https://keycloak.sankofa.nexus/health
kubectl logs -n keycloak deployment/keycloak --tail=100

# Check API auth logs
kubectl logs -n api deployment/api | grep -i "auth.*fail"
```

**Resolution**:
- Restart Keycloak if needed
- Check the OIDC configuration
- Verify the JWT secret
- Check network connectivity

### Portal Not Loading

**Symptoms**: The portal returns a 500 error or a blank page

**Investigation**:
```bash
# Check portal pods
kubectl get pods -n portal
kubectl logs -n portal deployment/portal --tail=100

# Check portal health
curl https://portal.sankofa.nexus/api/health
```

**Resolution**:
- Restart the portal deployment
- Check environment variables
- Verify Keycloak connectivity
- Check for build errors

## Escalation

### When to Escalate
- P0 incident not resolved within 30 minutes
- P1 incident not resolved within 2 hours
- Additional expertise is needed
- Customer impact is severe

### Escalation Path
1. **On-call Engineer** → Team Lead
2. **Team Lead** → Engineering Manager
3. **Engineering Manager** → CTO/VP Engineering
4. **CTO** → Executive Team

### Emergency Contacts
- **On-call**: [Phone/Slack]
- **Team Lead**: [Phone/Slack]
- **Engineering Manager**: [Phone/Slack]
- **CTO**: [Phone/Slack]

## Communication

### Status Page Updates
- Update the status page during the incident
- Post updates every 30 minutes (P0/P1) or hourly (P2/P3)
- Include: status, affected services, estimated resolution time

### Customer Communication
- For P0/P1: Notify affected customers immediately
- For P2/P3: Include in the next status update
- Be transparent about impact and the resolution timeline
docs/runbooks/PROXMOX_DISASTER_RECOVERY.md (new file, 244 lines)
@@ -0,0 +1,244 @@
# Proxmox Disaster Recovery Procedures

## Overview

This document outlines disaster recovery procedures for Proxmox infrastructure managed by the Crossplane provider.

## Recovery Scenarios

### Scenario 1: Provider Pod Failure

#### Symptoms
- The provider pod is not running
- VM operations are failing
- The ProviderConfig is not working

#### Recovery Steps

1. **Check Pod Status**:
   ```bash
   kubectl get pods -n crossplane-system -l app=crossplane-provider-proxmox
   ```

2. **Restart the Provider**:
   ```bash
   kubectl delete pod -n crossplane-system -l app=crossplane-provider-proxmox
   ```

3. **Verify Recovery**:
   ```bash
   kubectl get pods -n crossplane-system -l app=crossplane-provider-proxmox
   kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=50
   ```

### Scenario 2: Proxmox Node Failure

#### Symptoms
- Cannot connect to Proxmox
- VMs are unreachable
- Provider connection errors

#### Recovery Steps

1. **Verify Node Status**:
   - Check the Proxmox Web UI
   - Verify the node is online
   - Check network connectivity

2. **Check the ProviderConfig**:
   ```bash
   kubectl get providerconfig proxmox-provider-config -o yaml
   ```

3. **Update the Endpoint if Needed**:
   - If the node IP changed, update the ProviderConfig
   - If using a hostname, verify DNS

4. **Test Connectivity**:
   ```bash
   curl -k https://your-proxmox:8006/api2/json/version
   ```

### Scenario 3: Credential Compromise

#### Symptoms
- Authentication failures
- Security alerts
- Unauthorized access

#### Recovery Steps

1. **Revoke the Compromised Credentials**:
   - Log into the Proxmox Web UI
   - Revoke API tokens
   - Change passwords

2. **Create New Credentials**:
   - Create new API tokens
   - Use strong passwords
   - Set appropriate permissions

3. **Update the Kubernetes Secret**:
   ```bash
   kubectl delete secret proxmox-credentials -n crossplane-system
   kubectl create secret generic proxmox-credentials \
     --from-literal=credentials.json='{"username":"root@pam","token":"new-token"}' \
     -n crossplane-system
   ```

4. **Restart the Provider**:
   ```bash
   kubectl delete pod -n crossplane-system -l app=crossplane-provider-proxmox
   ```

### Scenario 4: VM Data Loss

#### Symptoms
- VM not found
- Data missing
- Storage errors

#### Recovery Steps

1. **Check VM Status**:
   ```bash
   kubectl get proxmoxvm <vm-name>
   kubectl describe proxmoxvm <vm-name>
   ```

2. **Check Proxmox Backups**:
   - Log into the Proxmox Web UI
   - Check the backup storage
   - Review the backup schedule

3. **Restore from Backup**:
   - Use the Proxmox backup restore
   - Or recreate the VM from a template

4. **Recreate the VM Resource**:
   ```bash
   # Delete the existing resource
   kubectl delete proxmoxvm <vm-name>

   # Recreate it with the same configuration
   kubectl apply -f <vm-manifest>.yaml
   ```

### Scenario 5: Complete Provider Failure

#### Symptoms
- The provider is not responding
- All VM operations are failing
- ProviderConfig errors

#### Recovery Steps

1. **Check the Provider Deployment**:
   ```bash
   kubectl get deployment -n crossplane-system crossplane-provider-proxmox
   kubectl describe deployment -n crossplane-system crossplane-provider-proxmox
   ```

2. **Redeploy the Provider**:
   ```bash
   kubectl delete deployment -n crossplane-system crossplane-provider-proxmox
   kubectl apply -f crossplane-provider-proxmox/config/provider.yaml
   ```

3. **Verify the ProviderConfig**:
   ```bash
   kubectl get providerconfig
   kubectl describe providerconfig proxmox-provider-config
   ```

4. **Test VM Operations**:
   ```bash
   kubectl get proxmoxvm
   kubectl describe proxmoxvm <test-vm>
   ```

## Backup Procedures

### Provider Configuration Backup

```bash
# Back up the ProviderConfig
kubectl get providerconfig proxmox-provider-config -o yaml > providerconfig-backup.yaml

# Back up the credentials secret (be careful: this file contains live credentials!)
kubectl get secret proxmox-credentials -n crossplane-system -o yaml > credentials-backup.yaml
```
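
These backup files accumulate, so they need pruning of their own. A sketch of keeping only the five most recent copies, assuming timestamped names in a single directory; scratch files stand in for real backups, and `head -n -5` assumes GNU coreutils.

```bash
# Sketch: keep only the five newest providerconfig-<date>.yaml backups.
backup_dir=$(mktemp -d)
for d in 20240101 20240102 20240103 20240104 20240105 20240106 20240107; do
  : > "$backup_dir/providerconfig-$d.yaml"
done

# Sorted names are chronological, so everything except the last five goes.
ls -1 "$backup_dir"/providerconfig-*.yaml | sort | head -n -5 | xargs -r rm
ls "$backup_dir" | wc -l
```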
|
||||
|
||||
### VM Configuration Backup

```bash
# Backup all VM resources
kubectl get proxmoxvm -o yaml > all-vms-backup.yaml

# Backup specific VM
kubectl get proxmoxvm <vm-name> -o yaml > <vm-name>-backup.yaml
```
### Proxmox Backup

1. **Configure Backup Schedule**:
- Log into Proxmox Web UI
- Go to Datacenter → Backup
- Configure backup schedule

2. **Manual Backup**:
- Select VM in Proxmox Web UI
- Click Backup
- Choose backup storage
- Start backup
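The manual backup above has a scriptable equivalent on the node's shell, `vzdump`. A sketch; the VM ID, storage name, and compression choice are placeholders:

```shell
# Snapshot-mode backup of VM 100 to the named backup storage, zstd-compressed
vzdump 100 --storage <backup-storage> --mode snapshot --compress zstd
```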
## Recovery Testing

### Test Provider Recovery

1. **Simulate Failure**:
```bash
kubectl delete pod -n crossplane-system -l app=crossplane-provider-proxmox
```

2. **Verify Auto-Recovery**:
```bash
kubectl get pods -n crossplane-system -l app=crossplane-provider-proxmox
```

3. **Test VM Operations**:
```bash
kubectl get proxmoxvm
```
### Test VM Recovery

1. **Create Test VM**:
```bash
kubectl apply -f test-vm.yaml
```

2. **Delete VM**:
```bash
kubectl delete proxmoxvm test-vm
```

3. **Recreate VM**:
```bash
kubectl apply -f test-vm.yaml
```
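The recreate step can be turned into a pass/fail check by waiting for the resource to become ready. A sketch, assuming the provider sets the standard Crossplane `Ready` condition and a 5-minute provisioning budget:

```shell
kubectl apply -f test-vm.yaml

# Fails (non-zero exit) if the VM is not Ready within the timeout
kubectl wait proxmoxvm/test-vm --for=condition=Ready --timeout=300s
```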
## Prevention

1. **Regular Backups**: Schedule regular backups
2. **Monitoring**: Set up alerts for failures
3. **Documentation**: Keep procedures documented
4. **Testing**: Regularly test recovery procedures
5. **Redundancy**: Use multiple Proxmox nodes

## Related Documentation

- [VM Provisioning Runbook](./PROXMOX_VM_PROVISIONING.md)
- [Troubleshooting Guide](./PROXMOX_TROUBLESHOOTING.md)
- [Deployment Guide](../proxmox/DEPLOYMENT_GUIDE.md)
272
docs/runbooks/PROXMOX_TROUBLESHOOTING.md
Normal file
@@ -0,0 +1,272 @@
# Proxmox Troubleshooting Guide

## Common Issues and Solutions

### Provider Not Connecting

#### Symptoms
- Provider logs show connection errors
- ProviderConfig status is not Ready
- VM creation fails with connection errors

#### Solutions

1. **Verify Endpoint**:
```bash
curl -k https://your-proxmox:8006/api2/json/version
```
2. **Check Credentials**:
```bash
# Secret values are base64-encoded; decode them to verify
kubectl get secret proxmox-credentials -n crossplane-system -o yaml
```

3. **Test Authentication**:
```bash
curl -k -X POST \
  -d "username=root@pam&password=your-password" \
  https://your-proxmox:8006/api2/json/access/ticket
```

4. **Check Provider Logs**:
```bash
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=100
```
### VM Creation Fails

#### Symptoms
- VM resource stuck in Creating state
- Error messages in VM resource status
- No VM appears in Proxmox

#### Solutions

1. **Check VM Resource**:
```bash
kubectl describe proxmoxvm <vm-name>
```

2. **Verify Site Configuration**:
- Site must exist in ProviderConfig
- Endpoint must be reachable
- Node name must match actual Proxmox node
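To see which sites the ProviderConfig actually defines, dump its spec. The exact field path depends on this provider's schema, so the `jsonpath` below is an assumption; fall back to the full YAML if it differs:

```shell
# Full spec (always works)
kubectl get providerconfig proxmox-provider-config -o yaml

# Just the site names, assuming sites live under .spec.sites
kubectl get providerconfig proxmox-provider-config \
  -o jsonpath='{.spec.sites[*].name}'
```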
3. **Check Proxmox Resources**:
- Storage pool must exist
- Network bridge must exist
- OS template must exist
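Each of these can be verified from the Proxmox node's shell with `pvesh`, the CLI wrapper around the Proxmox API. A sketch; node and storage names are placeholders:

```shell
# Storage pools visible to the cluster
pvesh get /storage

# Network interfaces and bridges on a specific node
pvesh get /nodes/<node>/network

# Contents of a storage (templates, ISOs, disk images)
pvesh get /nodes/<node>/storage/<storage>/content
```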
4. **Check Proxmox Logs**:
- Log into Proxmox Web UI
- Check System Log
- Review task history
### VM Status Not Updating

#### Symptoms
- VM status remains unknown
- IP address not populated
- State not reflecting actual VM state

#### Solutions

1. **Check Provider Connectivity**:
```bash
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox | grep -i error
```

2. **Verify VM Exists in Proxmox**:
- Check Proxmox Web UI
- Verify VM ID matches

3. **Check Reconciliation**:
```bash
kubectl get proxmoxvm <vm-name> -o yaml | grep -A 5 conditions
```
### Storage Issues

#### Symptoms
- VM creation fails with storage errors
- "Storage not found" errors
- Insufficient storage errors

#### Solutions

1. **List Available Storage**:
```bash
# Via Proxmox API (the PVEAuthCookie ticket is sent as a cookie, not an Authorization header)
curl -k -b "PVEAuthCookie=<ticket>" \
  https://your-proxmox:8006/api2/json/storage
```

2. **Check Storage Capacity**:
- Log into Proxmox Web UI
- Check Storage section
- Verify available space

3. **Update Storage Name**:
- Verify actual storage pool name
- Update VM manifest if needed
### Network Issues

#### Symptoms
- VM created but no network connectivity
- IP address not assigned
- Network bridge errors

#### Solutions

1. **Verify Network Bridge**:
```bash
# Via Proxmox API
curl -k -b "PVEAuthCookie=<ticket>" \
  https://your-proxmox:8006/api2/json/nodes/ML110-01/network
```

2. **Check Network Configuration**:
- Verify bridge name in VM manifest
- Check bridge exists on node
- Verify bridge is active

3. **Check DHCP**:
- Verify DHCP server is running
- Check network configuration
- Review VM network settings
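If the QEMU guest agent is installed in the VM, the node can query the guest's actual interfaces and addresses directly, which quickly separates DHCP problems from bridge problems. The VM ID is a placeholder:

```shell
# On the Proxmox node: ask the guest agent for interface and IP state
qm agent <vmid> network-get-interfaces
```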
### Authentication Failures

#### Symptoms
- 401 Unauthorized errors
- Authentication failed messages
- Token/ticket errors

#### Solutions

1. **Verify Credentials**:
- Check username format: `user@realm`
- Verify password is correct
- Check token format if using tokens

2. **Test Authentication**:
```bash
# Password auth (returns a ticket and CSRF token)
curl -k -X POST \
  -d "username=root@pam&password=your-password" \
  https://your-proxmox:8006/api2/json/access/ticket

# Token auth uses the PVEAPIToken authorization header
curl -k -H "Authorization: PVEAPIToken=<user>@<realm>!<tokenid>=<secret>" \
  https://your-proxmox:8006/api2/json/version
```
3. **Check Permissions**:
- Verify user has VM creation permissions
- Check token permissions
- Review Proxmox user roles
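Permissions can be inspected from the node's shell with `pveum`. A sketch; the user ID is a placeholder, and `pveum user permissions` requires a reasonably recent Proxmox VE release:

```shell
# List users and their API tokens
pveum user list
pveum user token list <user>@<realm>

# Effective permissions of a user (recent PVE versions)
pveum user permissions <user>@<realm>
```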
### Provider Pod Issues

#### Symptoms
- Provider pod not starting
- Provider pod crashing
- Provider pod in Error state

#### Solutions

1. **Check Pod Status**:
```bash
kubectl get pods -n crossplane-system -l app=crossplane-provider-proxmox
kubectl describe pod -n crossplane-system -l app=crossplane-provider-proxmox
```

2. **Check Pod Logs**:
```bash
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=100
```

3. **Check Image**:
```bash
kubectl get deployment -n crossplane-system crossplane-provider-proxmox -o yaml | grep image
```

4. **Verify Resources**:
```bash
kubectl get deployment -n crossplane-system crossplane-provider-proxmox -o yaml | grep -A 5 resources
```
## Diagnostic Commands

### Check Provider Health
```bash
# Provider status
kubectl get deployment -n crossplane-system crossplane-provider-proxmox

# Provider logs
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=50

# Provider metrics
kubectl port-forward -n crossplane-system deployment/crossplane-provider-proxmox 8080:8080
curl http://localhost:8080/metrics
```

### Check VM Resources
```bash
# List all VMs
kubectl get proxmoxvm

# Get VM details
kubectl get proxmoxvm <vm-name> -o yaml

# Check VM events
kubectl describe proxmoxvm <vm-name>
```

### Check ProviderConfig
```bash
# List ProviderConfigs
kubectl get providerconfig

# Get ProviderConfig details
kubectl get providerconfig proxmox-provider-config -o yaml

# Check ProviderConfig status
kubectl describe providerconfig proxmox-provider-config
```
## Escalation Procedures

### Level 1: Basic Troubleshooting
1. Check provider logs
2. Verify credentials
3. Test connectivity
4. Review VM resource status

### Level 2: Advanced Troubleshooting
1. Check Proxmox Web UI
2. Review Proxmox logs
3. Verify network connectivity
4. Check resource availability

### Level 3: Infrastructure Issues
1. Contact Proxmox administrator
2. Check infrastructure status
3. Review network configuration
4. Verify DNS resolution

## Prevention

1. **Regular Monitoring**: Set up alerts for provider health
2. **Resource Verification**: Verify resources before deployment
3. **Credential Rotation**: Rotate credentials regularly
4. **Backup Configuration**: Backup ProviderConfig and secrets
5. **Documentation**: Keep documentation up to date

## Related Documentation

- [VM Provisioning Runbook](./PROXMOX_VM_PROVISIONING.md)
- [Deployment Guide](../proxmox/DEPLOYMENT_GUIDE.md)
- [Site Mapping](../proxmox/SITE_MAPPING.md)
207
docs/runbooks/PROXMOX_VM_PROVISIONING.md
Normal file
@@ -0,0 +1,207 @@
# Proxmox VM Provisioning Runbook

## Overview

This runbook provides step-by-step procedures for provisioning virtual machines on Proxmox infrastructure using the Crossplane provider.

## Prerequisites

- Kubernetes cluster with Crossplane and Proxmox provider installed
- ProviderConfig configured and ready
- Appropriate permissions to create ProxmoxVM resources
- Access to Proxmox Web UI (for verification)
## Standard VM Provisioning

### Step 1: Create VM Manifest

Create a YAML manifest for the VM:

```yaml
apiVersion: proxmox.sankofa.nexus/v1alpha1
kind: ProxmoxVM
metadata:
  name: my-vm
  namespace: default
spec:
  forProvider:
    node: ML110-01
    name: my-vm
    cpu: 2
    memory: 4Gi
    disk: 50Gi
    storage: local-lvm
    network: vmbr0
    image: ubuntu-22.04-cloud
    site: us-sfvalley
    userData: |
      #cloud-config
      users:
        - name: admin
          groups: sudo
          shell: /bin/bash
          sudo: ['ALL=(ALL) NOPASSWD:ALL']
  providerConfigRef:
    name: proxmox-provider-config
```
### Step 2: Apply Manifest

```bash
kubectl apply -f my-vm.yaml
```

### Step 3: Verify Creation

```bash
# Check VM resource status
kubectl get proxmoxvm my-vm

# Get detailed status
kubectl describe proxmoxvm my-vm
```

Then log into the Proxmox Web UI and verify the VM exists.

### Step 4: Verify VM Status

Wait for the VM to be created and check its status:

```bash
# Watch VM status
kubectl get proxmoxvm my-vm -w

# Check VM ID
kubectl get proxmoxvm my-vm -o jsonpath='{.status.vmId}'

# Check VM state
kubectl get proxmoxvm my-vm -o jsonpath='{.status.state}'

# Check IP address (if available)
kubectl get proxmoxvm my-vm -o jsonpath='{.status.ipAddress}'
```
## Multi-Site VM Provisioning

### Provision VM on Different Site

Update the `site` field in the manifest:

```yaml
spec:
  forProvider:
    site: eu-west-1  # or apac-1 or us-sfvalley
    node: R630-01    # for both eu-west-1 and apac-1
```
## VM Lifecycle Operations

### Start VM

The VM starts automatically after creation. To start it manually, update the VM resource, or use the Proxmox CLI on the node:

```bash
qm start <vmid>
```

### Stop VM

Update the VM resource, use the Proxmox Web UI, or use the Proxmox CLI on the node:

```bash
qm shutdown <vmid>   # graceful shutdown via ACPI/guest agent
qm stop <vmid>       # hard stop
```

### Delete VM

```bash
kubectl delete proxmoxvm my-vm
```
## Troubleshooting

### VM Creation Fails

1. **Check ProviderConfig**:
```bash
kubectl get providerconfig proxmox-provider-config -o yaml
```

2. **Check Provider Logs**:
```bash
kubectl logs -n crossplane-system -l app=crossplane-provider-proxmox --tail=50
```

3. **Verify Site Configuration**:
- Check if site exists in ProviderConfig
- Verify endpoint is reachable
- Check node name matches actual Proxmox node

4. **Check Proxmox Resources**:
- Verify storage pool exists
- Verify network bridge exists
- Verify OS template exists
### VM Stuck in Creating State

1. **Check VM Resource Events**:
```bash
kubectl describe proxmoxvm my-vm
```

2. **Check Proxmox Web UI**:
- Log into Proxmox
- Check if VM exists
- Check VM status
- Review Proxmox logs

3. **Verify Resources**:
- Check available storage
- Check available memory
- Check node status
### VM Not Getting IP Address

1. **Check Cloud-Init**:
- Verify userData is correct
- Check cloud-init logs in VM

2. **Check Network Configuration**:
- Verify network bridge is correct
- Check DHCP configuration
- Verify VM network interface

3. **Check Guest Agent**:
- Ensure QEMU guest agent is installed
- Verify guest agent is running
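The guest-agent and cloud-init checks can be scripted. A sketch; the VM ID is a placeholder, and the in-guest commands assume a systemd-based distribution:

```shell
# On the Proxmox node: succeeds only if the guest agent responds
qm agent <vmid> ping

# Inside the VM: confirm the agent service and cloud-init completion
systemctl status qemu-guest-agent
cloud-init status --long
```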
## Best Practices

1. **Resource Naming**: Use descriptive names for VMs
2. **Resource Limits**: Set appropriate CPU and memory limits
3. **Storage Planning**: Choose appropriate storage pools
4. **Network Configuration**: Use correct network bridges
5. **Backup Strategy**: Configure backups for important VMs
6. **Monitoring**: Set up monitoring for VM metrics

## Common Configurations

### Small VM (Development)
- CPU: 1-2 cores
- Memory: 2-4 Gi
- Disk: 20-50 Gi

### Medium VM (Staging)
- CPU: 2-4 cores
- Memory: 4-8 Gi
- Disk: 50-100 Gi

### Large VM (Production)
- CPU: 4+ cores
- Memory: 8+ Gi
- Disk: 100+ Gi
## Related Documentation

- [Deployment Guide](../proxmox/DEPLOYMENT_GUIDE.md)
- [Site Mapping](../proxmox/SITE_MAPPING.md)
- [Resource Inventory](../proxmox/RESOURCE_INVENTORY.md)
297
docs/runbooks/ROLLBACK_PLAN.md
Normal file
@@ -0,0 +1,297 @@
# Rollback Plan

## Overview

This document outlines procedures for rolling back deployments in the Sankofa Phoenix platform.

## Rollback Strategy

### GitOps Rollback (Recommended)

All applications are managed via ArgoCD GitOps. Rollbacks should be performed through Git by reverting to a previous commit.

### Manual Rollback

For emergency situations, manual rollbacks can be performed directly in Kubernetes.
## Pre-Rollback Checklist

- [ ] Identify the commit/tag to rollback to
- [ ] Verify the previous version is stable
- [ ] Notify team of rollback
- [ ] Document reason for rollback
- [ ] Check database migration compatibility (if applicable)
## Rollback Procedures

### 1. API Service Rollback

#### GitOps Method
```bash
# 1. Identify the commit to roll back
git log --oneline api/

# 2. Revert the bad commit (checking out an old commit and pushing would leave
#    a detached HEAD; git revert creates a new commit that undoes the change)
cd gitops/apps/api
git revert <bad-commit-hash>
git push origin main

# 3. ArgoCD will automatically sync
# Or manually sync:
argocd app sync api
```
#### Manual Method
```bash
# 1. List deployment history
kubectl rollout history deployment/api -n api

# 2. View specific revision
kubectl rollout history deployment/api -n api --revision=<revision-number>

# 3. Rollback to previous revision
kubectl rollout undo deployment/api -n api

# 4. Or rollback to specific revision
kubectl rollout undo deployment/api -n api --to-revision=<revision-number>

# 5. Monitor rollback
kubectl rollout status deployment/api -n api
```
### 2. Portal Rollback

#### GitOps Method
```bash
cd gitops/apps/portal
git revert <bad-commit-hash>
git push origin main
argocd app sync portal
```

#### Manual Method
```bash
kubectl rollout undo deployment/portal -n portal
kubectl rollout status deployment/portal -n portal
```
### 3. Database Migration Rollback

**⚠️ WARNING**: Database rollbacks require careful planning. Not all migrations are reversible.

#### Check Migration Status
```bash
# Connect to database
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL

# Then, at the psql prompt, check migration history:
SELECT * FROM schema_migrations ORDER BY version DESC LIMIT 10;
```
#### Rollback Migration (if reversible)
```bash
# Run down migration
cd api
npm run db:migrate:down

# Or manually revert SQL
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -f /path/to/rollback.sql
```

#### For Non-Reversible Migrations
1. Create new migration to restore previous state
2. Test in staging first
3. Apply during maintenance window
4. Document data loss risks
### 4. Frontend (Public Site) Rollback

#### GitOps Method
```bash
cd gitops/apps/frontend
git revert <bad-commit-hash>
git push origin main
argocd app sync frontend
```

#### Manual Method
```bash
kubectl rollout undo deployment/frontend -n frontend
kubectl rollout status deployment/frontend -n frontend
```
### 5. Monitoring Stack Rollback

```bash
# Rollback Prometheus
kubectl rollout undo deployment/prometheus-operator -n monitoring

# Rollback Grafana
kubectl rollout undo deployment/grafana -n monitoring

# Rollback Alertmanager
kubectl rollout undo deployment/alertmanager -n monitoring
```

### 6. Keycloak Rollback

```bash
# Rollback Keycloak
kubectl rollout undo deployment/keycloak -n keycloak

# Verify Keycloak health
curl https://keycloak.sankofa.nexus/health
```
## Post-Rollback Verification

### 1. Health Checks
```bash
# API
curl -f https://api.sankofa.nexus/health

# Portal
curl -f https://portal.sankofa.nexus/api/health

# Keycloak
curl -f https://keycloak.sankofa.nexus/health
```

### 2. Functional Testing
```bash
# Run smoke tests
./scripts/smoke-tests.sh

# Test authentication
curl -X POST https://api.sankofa.nexus/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { login(email: \"test@example.com\", password: \"test\") { token } }"}'
```
### 3. Monitoring
- Check Grafana dashboards for errors
- Verify Prometheus metrics are normal
- Check Loki logs for errors

### 4. Database Verification
```bash
# Verify database connectivity
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT 1"

# Check for data integrity issues
kubectl exec -it -n api deployment/api -- \
  psql $DATABASE_URL -c "SELECT COUNT(*) FROM users;"
```
## Rollback Scenarios

### Scenario 1: API Breaking Change

**Symptoms**: API returns errors after deployment

**Rollback Steps**:
1. Immediately rollback API deployment
2. Verify API health
3. Check error logs
4. Investigate root cause
5. Fix and redeploy

### Scenario 2: Database Migration Failure

**Symptoms**: Database errors, application crashes

**Rollback Steps**:
1. Stop application deployments
2. Assess migration state
3. Rollback migration if possible
4. Or restore from backup
5. Redeploy previous application version

### Scenario 3: Portal Build Failure

**Symptoms**: Portal shows blank page or errors

**Rollback Steps**:
1. Rollback portal deployment
2. Verify portal loads
3. Check build logs
4. Fix build issues
5. Redeploy

### Scenario 4: Configuration Error

**Symptoms**: Services cannot connect to dependencies

**Rollback Steps**:
1. Revert configuration changes in Git
2. ArgoCD will sync automatically
3. Or manually update ConfigMaps/Secrets
4. Restart affected services
## Rollback Testing

### Staging Rollback Test
```bash
# 1. Deploy new version to staging
argocd app sync api-staging

# 2. Test new version
./scripts/smoke-tests.sh --env=staging

# 3. Simulate rollback
kubectl rollout undo deployment/api -n api-staging

# 4. Verify rollback works
./scripts/smoke-tests.sh --env=staging
```
## Rollback Communication

### Internal Communication
- Notify team in #engineering channel
- Update incident tracking system
- Document in runbook

### External Communication
- Update status page if user-facing
- Notify affected customers if needed
- Post-mortem for P0/P1 incidents

## Prevention

### Pre-Deployment
- [ ] All tests passing
- [ ] Code review completed
- [ ] Staging deployment successful
- [ ] Smoke tests passing
- [ ] Database migrations tested
- [ ] Rollback plan reviewed

### Deployment
- [ ] Deploy to staging first
- [ ] Monitor staging for 24 hours
- [ ] Gradual production rollout (canary)
- [ ] Monitor metrics closely
- [ ] Have rollback plan ready
## Rollback Decision Matrix

| Issue | Severity | Rollback? |
|-------|----------|-----------|
| Complete outage | P0 | Yes, immediately |
| Data corruption | P0 | Yes, immediately |
| Security breach | P0 | Yes, immediately |
| >50% error rate | P1 | Yes, within 15 min |
| Performance >50% degraded | P1 | Yes, within 30 min |
| Single feature broken | P2 | Maybe, assess impact |
| Minor bugs | P3 | No, fix forward |
## Emergency Contacts

- **On-call Engineer**: [Contact]
- **Team Lead**: [Contact]
- **DevOps Lead**: [Contact]