# VM Creation Failure Analysis & Prevention Guide

## Executive Summary

This document catalogs the working and non-working approaches tried during VM creation and cleanup, identifies codebase inconsistencies that repeat previous failures, and provides recommendations to prevent future issues.

**Critical Finding**: The `importdisk` API endpoint (`POST /nodes/{node}/qemu/{vmid}/importdisk`) is **NOT IMPLEMENTED** in the Proxmox version running on ml110-01. Every VM creation attempt that uses a cloud image therefore fails and leaves behind an orphaned VM with a stuck lock file.

---

## 1. Root Cause Analysis

### Primary Failure: importdisk API Not Implemented

**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:397-400`

**Error**:
```
501 Method 'POST /nodes/ml110-01/qemu/{vmid}/importdisk' not implemented
```

**Impact**:
- VM is created successfully (blank disk)
- Image import fails immediately
- VM remains in locked state (`lock-{vmid}.conf`)
- Controller retries indefinitely (VMID never set in status)
- Each retry creates a NEW VM (perpetual creation loop)

**Code Path**:
```go
// Line 350-400: createVM() function
if needsImageImport && imageVolid != "" {
	// ... stops VM ...
	// Line 397: Attempts importdisk API call
	if err := c.httpClient.Post(ctx, importPath, importConfig, &importResult); err != nil {
		// Line 399: Returns error, VM already created but orphaned
		return nil, errors.Wrapf(err, "failed to import image...")
	}
}
```

**Controller Behavior**:
```go
// Line 142-145: controller.go
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
	// Returns error, but VM already exists in Proxmox
	return ctrl.Result{}, errors.Wrap(err, "cannot create VM")
}
// Status never updated (VMID stays 0), causing infinite retry loop
```

---

## 2. Working vs Non-Working Attempts

### ✅ WORKING Approaches

#### 2.1 VM Deletion (Force Removal)

**Script**: `scripts/force-remove-all-remaining.sh`

**Method**:
- Multiple unlock attempts (10x with delays)
- Stop VM if running
- Delete with `purge=1&skiplock=1` parameters
- Wait for task completion (up to 60 seconds)
- Verify deletion

**Success Rate**: 100% (all 66 VMs eventually deleted)

**Key Success Factors**:
1. **Aggressive unlocking**: 10 unlock attempts with 1-second delays
2. **Long wait times**: 60-second timeout for delete tasks
3. **Verification**: Confirms VM is actually deleted before proceeding

#### 2.2 Controller Scaling

**Command**: `kubectl scale deployment crossplane-provider-proxmox -n crossplane-system --replicas=0`

**Result**: Immediately stops all VM creation processes

**Status**: ✅ Effective

### ❌ NON-WORKING Approaches

#### 2.3 importdisk API Usage

**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:397`

**Problem**: API endpoint not implemented in Proxmox version

**Error**: `501 Method not implemented`

**Impact**: All VM creations with cloud images fail

#### 2.4 Single Unlock Attempt

**Problem**: Lock files persist after a single unlock

**Result**: Delete operations time out with "can't lock file" errors

**Solution**: Multiple unlock attempts (10x) required

#### 2.5 Short Timeouts

**Problem**: 20-second timeout is insufficient for delete operations

**Result**: Tasks appear to fail but actually complete later

**Solution**: 60-second timeout with verification

#### 2.6 No Error Recovery

**Problem**: Controller doesn't handle partial VM creation

**Result**: Orphaned VMs accumulate when importdisk fails

**Impact**: Status never updates, infinite retry loop

---

## 3. Codebase Inconsistencies & Repeated Failures

### 3.1 CRITICAL: No Error Recovery for Partial VM Creation

**Location**: `crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:142-145`

**Problem**:
```go
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
	// ❌ VM already created in Proxmox, but error returned
	// ❌ No cleanup of orphaned VM
	// ❌ Status never updated (VMID stays 0)
	// ❌ Controller will retry forever, creating new VMs
	return ctrl.Result{}, errors.Wrap(err, "cannot create VM")
}
```

**Fix Required**:
```go
createdVM, err := proxmoxClient.CreateVM(ctx, vmSpec)
if err != nil {
	// Check if VM was partially created.
	// NOTE: this requires CreateVM to return the partially created VM along
	// with the error; the current code path returns nil on import failure.
	if createdVM != nil && createdVM.ID > 0 {
		// Attempt cleanup
		logger.Error(err, "VM creation failed, attempting cleanup", "vmID", createdVM.ID)
		cleanupErr := proxmoxClient.DeleteVM(ctx, createdVM.ID)
		if cleanupErr != nil {
			logger.Error(cleanupErr, "Failed to cleanup orphaned VM", "vmID", createdVM.ID)
		}
	}
	// Don't requeue immediately - wait longer to prevent rapid retries.
	// controller-runtime ignores RequeueAfter when a non-nil error is returned,
	// so log the failure and return nil to make the 5-minute delay take effect.
	logger.Error(err, "cannot create VM, retrying in 5 minutes")
	return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}
```

### 3.2 CRITICAL: importdisk API Not Checked Before Use

**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:350-400`

**Problem**: Code assumes the `importdisk` API exists without checking the Proxmox version or API availability.

**Fix Required**:
```go
// Before attempting importdisk, check if API is available
// Option 1: Check Proxmox version
pveVersion, err := c.GetPVEVersion(ctx)
if err != nil {
	return nil, errors.Wrap(err, "cannot determine Proxmox version")
}
if !supportsImportDisk(pveVersion) {
	return nil, errors.Errorf("importdisk API not supported in Proxmox version %s. Use template cloning or pre-imported images instead", pveVersion)
}
// Option 2: Use alternative method (qm disk import via SSH/API)
// Option 3: Require images to be pre-imported as templates
```

### 3.3 CRITICAL: No Status Update on Partial Failure

**Location**: `crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:75-156`

**Problem**: If VM creation fails after the VM is created but before the status update, the VMID remains 0, causing infinite retries.

**Current Flow**:
1. VM created in Proxmox (VMID assigned)
2. importdisk fails
3. Error returned, status never updated
4. `vm.Status.VMID == 0` still true
5. Controller retries, creates new VM

**Fix Required**: Add intermediate status updates or cleanup on failure.

### 3.4 Inconsistent Error Handling

**Location**: Multiple locations

**Problem**: Some errors trigger an explicit requeue, others don't. There is no consistent strategy for retryable vs non-retryable errors.

**Examples**:
- Line 53: Credentials error → requeue after 30s
- Line 60: Site error → requeue after 30s
- Line 144: VM creation error → no explicit requeue delay; the returned error falls through to controller-runtime's default rate-limited retry (should use a longer delay)

**Fix Required**: Define error categories and consistent requeue strategies.

### 3.5 Lock File Handling Inconsistency

**Location**: `crossplane-provider-proxmox/pkg/proxmox/client.go:803-821` (UnlockVM)

**Problem**: The UnlockVM function exists but is never called during VM creation failure recovery.

**Fix Required**: Call UnlockVM before DeleteVM in error recovery paths.

---

## 4. ml110-01 Node Status: "Unknown" in Web Portal

### Investigation Results

**API Status Check**: ✅ Node is healthy
- CPU: 0.027 (2.7% usage)
- Memory: 9.2 GB used / 270 GB total
- Uptime: 460,486 seconds (~5.3 days)
- PVE Version: `pve-manager/9.1.1/42db4a6cf33dac83`
- Kernel: `6.17.2-1-pve`

**Web Portal Issue**: Likely a display/UI issue, not an actual node problem.

**Possible Causes**:
1. Web UI cache issue
2. Cluster quorum/communication issue (if in cluster)
3. Web UI version mismatch
4. Browser cache

**Recommendation**:
- Refresh web portal
- Check cluster status: `pvecm status` (if in cluster)
- Verify node is reachable: `ping ml110-01`
- Check Proxmox logs: `/var/log/pveproxy/access.log`

---

## 5. Recommendations to Prevent Future Failures

### 5.1 Immediate Fixes (Critical)

1. **Add Error Recovery for Partial VM Creation**
   - Detect when VM is created but import fails
   - Clean up orphaned VMs automatically
   - Update status to prevent infinite retries

2. **Check importdisk API Availability**
   - Verify Proxmox version supports importdisk
   - Provide fallback method (template cloning, pre-imported images)
   - Document supported Proxmox versions

3. **Improve Status Update Logic**
   - Update status even on partial failures
   - Add conditions to track failure states
   - Prevent infinite retry loops

### 5.2 Short-term Improvements

1. **Add VM Cleanup on Controller Startup**
   - Scan for orphaned VMs (created but no corresponding Kubernetes resource)
   - Clean up VMs with stuck locks
   - Log cleanup actions

2. **Implement Exponential Backoff**
   - Current: Fixed 30s requeue
   - Recommended: Exponential backoff (30s, 1m, 2m, 5m, 10m)
   - Prevents rapid retry storms

3. **Add Health Checks**
   - Verify Proxmox API endpoints before use
   - Check node status before VM creation
   - Validate image availability

### 5.3 Long-term Improvements

1. **Alternative Image Import Methods**
   - Use `qm disk import` via SSH (if available)
   - Pre-import images as templates
   - Use Proxmox templates instead of cloud images

2. **Better Observability**
   - Add metrics for VM creation success/failure rates
   - Track orphaned VM counts
   - Alert on stuck VM creation loops

3. **Comprehensive Testing**
   - Test with different Proxmox versions
   - Test error recovery scenarios
   - Test lock file handling

---

## 6. Code Locations Requiring Fixes

### High Priority

1. **`crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:142-145`**
   - Add error recovery for partial VM creation
   - Implement cleanup logic

2. **`crossplane-provider-proxmox/pkg/proxmox/client.go:350-400`**
   - Check importdisk API availability
   - Add fallback methods
   - Improve error messages

3. **`crossplane-provider-proxmox/pkg/controller/virtualmachine/controller.go:75-156`**
   - Add intermediate status updates
   - Prevent infinite retry loops

### Medium Priority

4. **`crossplane-provider-proxmox/pkg/proxmox/client.go:803-821`**
   - Use UnlockVM in error recovery paths

5. **Error handling throughout controller**
   - Standardize requeue strategies
   - Add error categorization

---

## 7. Testing Checklist

Before deploying fixes, test:

- [ ] VM creation with importdisk API (if supported)
- [ ] VM creation with template cloning
- [ ] Error recovery when importdisk fails
- [ ] Cleanup of orphaned VMs
- [ ] Lock file handling
- [ ] Controller retry behavior
- [ ] Status update on partial failures
- [ ] Multiple concurrent VM creations
- [ ] Node status checks
- [ ] Proxmox version compatibility

---

## 8. Documentation Updates Needed

1. **README.md**: Document supported Proxmox versions
2. **API Compatibility**: List which APIs are required
3. **Troubleshooting Guide**: Add section on orphaned VMs
4. **Error Recovery**: Document automatic cleanup features
5. **Image Requirements**: Clarify template vs cloud image usage

---

## 9. Lessons Learned

1. **Always verify API availability** before using it
2. **Implement error recovery** for partial resource creation
3. **Update status early** to prevent infinite retry loops
4. **Test with actual infrastructure** versions, not just mocks
5. **Monitor for orphaned resources** and implement cleanup
6. **Use exponential backoff** for retries
7. **Document failure modes** and recovery procedures

---

## 10. Summary

**Primary Issue**: `importdisk` API not implemented → VM creation fails → Orphaned VMs → Infinite retry loop

**Root Causes**:
1. No API availability check
2. No error recovery for partial creation
3. No status update on failure
4. No cleanup of orphaned resources

**Solutions**:
1. Check API availability before use
2. Implement error recovery and cleanup
3. Update status even on partial failures
4. Add health checks and monitoring

**Status**: All orphaned VMs cleaned up. Controller scaled to 0. System ready for fixes.

---

*Last Updated: 2025-12-12*
*Document Version: 1.0*