
# R630-02 Container Startup Failures Analysis
**Date:** January 19, 2026
**Node:** r630-02 (192.168.11.12)
**Status:** ⚠️ **CRITICAL - 33 CONTAINERS FAILED TO START**

---
## Executive Summary
A bulk container startup operation on r630-02 resulted in **33 container failures**. The failures fall into three distinct categories:
1. **Logical Volume Missing** (8 containers) - Storage volumes don't exist
2. **Startup Failures** (24 containers) - Containers fail to start for unknown reasons
3. **Lock Error** (1 container) - Container is locked in "create" state
**Total Impact:** 33 containers unable to start, affecting multiple services.

---
## Failure Breakdown
### Category 1: Missing Logical Volumes (8 containers)
**Error Pattern:** `no such logical volume pve/vm-XXXX-disk-X`
**Affected Containers:**
- CT 3000: `pve/vm-3000-disk-1`
- CT 3001: `pve/vm-3001-disk-1`
- CT 3002: `pve/vm-3002-disk-2`
- CT 3003: `pve/vm-3003-disk-1`
- CT 3500: `pve/vm-3500-disk-1`
- CT 3501: `pve/vm-3501-disk-2`
- CT 6000: `pve/vm-6000-disk-1`
- CT 6400: `pve/vm-6400-disk-1`
**Root Cause Analysis:**
- Storage volumes were likely deleted, migrated, or never created
- Containers may have been migrated to another node but configs not updated
- Storage pool may have been recreated/reset, losing volume metadata
- Containers may reference wrong storage pool (e.g., `thin1` vs `thin1-r630-02`)
**Diagnostic Steps:**
1. Check if volumes exist on other storage pools:
```bash
ssh root@192.168.11.12 "lvs | grep -E 'vm-3000|vm-3001|vm-3002|vm-3003|vm-3500|vm-3501|vm-6000|vm-6400'"
```
2. Check container storage configuration:
```bash
ssh root@192.168.11.12 "pct config 3000 | grep rootfs"
```
3. Check available storage pools:
```bash
ssh root@192.168.11.12 "pvesm status"
```
**Resolution Options:**
- **Option A:** Recreate missing volumes if data is not critical
- **Option B:** Migrate containers to existing storage pool
- **Option C:** Restore volumes from backup if available
- **Option D:** Update container configs to point to correct storage
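Option D starts with confirming which pool each container config actually references. A minimal sketch, assuming the `rootfs:` line format shown by `pct config` (the helper itself is hypothetical, not a Proxmox tool):

```shell
# Hypothetical helper: extract the storage pool name from a
# `pct config` rootfs line, e.g.
#   rootfs: thin1:vm-3000-disk-1,size=8G  ->  thin1
rootfs_pool() {
  printf '%s\n' "$1" | sed -n 's/^rootfs: *\([^:]*\):.*/\1/p'
}

# Example, using output captured from `pct config 3000 | grep rootfs`:
rootfs_pool "rootfs: thin1:vm-3000-disk-1,size=8G"   # prints: thin1
```

Comparing the extracted pool against `pvesm status` output shows immediately whether a config points at a pool that no longer exists (e.g. `thin1` vs `thin1-r630-02`).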
---
### Category 2: Startup Failures (24 containers)
**Error Pattern:** `startup for container 'XXXX' failed`
**Affected Containers:**
- CT 5200
- CT 10000, 10001, 10020, 10030, 10040, 10050, 10060
- CT 10070, 10080, 10090, 10091, 10092
- CT 10100, 10101, 10120, 10130
- CT 10150, 10151
- CT 10200, 10201, 10202, 10210, 10230
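The long `grep -E` alternation patterns used in the diagnostics below can be generated from this ID list rather than typed by hand; a small sketch:

```shell
# Build the `grep -E` alternation pattern from the failed CT IDs above.
ids="5200 10000 10001 10020 10030 10040 10050 10060 10070 10080 10090 10091 10092 10100 10101 10120 10130 10150 10151 10200 10201 10202 10210 10230"
pattern=$(echo "$ids" | tr ' ' '|')
echo "$pattern"

# Then, from a machine with SSH access to the node:
#   ssh root@192.168.11.12 "pct list | grep -E '$pattern'"
```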
**Root Cause Analysis:**
Startup failures can have multiple causes:
1. **Missing configuration files** - Container config deleted or not migrated
2. **Storage issues** - Storage accessible but corrupted or misconfigured
3. **Network issues** - Network configuration problems
4. **Resource constraints** - Insufficient memory/CPU
5. **Container corruption** - Container filesystem issues
6. **Dependencies** - Missing required services or mounts
**Diagnostic Steps:**
1. Check if config files exist:
```bash
ssh root@192.168.11.12 "ls -la /etc/pve/lxc/ | grep -E '5200|10000|10001|10020|10030|10040|10050|10060|10070|10080|10090|10091|10092|10100|10101|10120|10130|10150|10151|10200|10201|10202|10210|10230'"
```
2. Check detailed startup error:
```bash
ssh root@192.168.11.12 "pct start 5200 2>&1"
```
3. Check container status and locks:
```bash
ssh root@192.168.11.12 "pct list | grep -E '5200|10000|10001'"
```
4. Check system resources:
```bash
ssh root@192.168.11.12 "free -h; df -h"
```
5. Check container logs:
```bash
ssh root@192.168.11.12 "journalctl -u pve-container@5200 -n 50 --no-pager"
```
**Resolution Options:**
- **Option A:** Fix configuration issues (network, storage, etc.)
- **Option B:** Recreate containers if configs are missing
- **Option C:** Check and resolve resource constraints
- **Option D:** Restore from backup if corruption detected
---
### Category 3: Lock Error (1 container)
**Error Pattern:** `CT is locked (create)`
**Affected Container:**
- CT 10232
**Root Cause Analysis:**
- Container is stuck in "create" state
- Previous creation operation may have been interrupted
- Lock file exists but container creation incomplete
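Container locks are recorded as a `lock:` line in the container's config, so the lock state can be read straight from `pct config` output. A hypothetical helper (not a Proxmox tool) for scripting that check:

```shell
# Hypothetical helper: read the lock state (e.g. "create") from
# `pct config` output. Prints nothing if the container is not locked.
ct_lock_state() {
  printf '%s\n' "$1" | sed -n 's/^lock: *//p'
}

ct_lock_state "$(printf 'arch: amd64\nlock: create\n')"   # prints: create
```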
**Diagnostic Steps:**
1. Check lock status:
```bash
ssh root@192.168.11.12 "pct list | grep 10232"
```
2. Check for a lock entry in the container config (container locks are recorded there; `/var/lock/qemu-server/` is used for VMs, not containers):
```bash
ssh root@192.168.11.12 "grep '^lock:' /etc/pve/lxc/10232.conf"
```
3. Check the node's task history for an unfinished create task (GUI: node → Task History), and look for a stale per-container config lock file:
```bash
ssh root@192.168.11.12 "ls -la /run/lock/lxc/ | grep 10232"
```
**Resolution Options:**
- **Option A:** Clear the lock with the built-in command (only after confirming no create task is still running):
```bash
ssh root@192.168.11.12 "pct unlock 10232"
```
- **Option B:** Complete or cancel the creation task
- **Option C:** Delete and recreate container if creation incomplete
---
## Successfully Started Containers
The following containers started successfully:
- CT 10030, 10040, 10050, 10060, 10070, 10080, 10090, 10091, 10092, 10100, 10101, 10120, 10130, 10150, 10151, 10200, 10201, 10202, 10210, 10230, 10232
**Note:** This list overlaps with the Category 2 failure list above; some containers reported a successful start and then failed. Verify current state with `pct status <vmid>`.

---
## Recommended Actions
### Immediate Actions (Priority 1)
1. **Run Diagnostic Script:**
```bash
./scripts/diagnose-r630-02-startup-failures.sh
```
This will identify the root cause for each failure.
2. **Check Storage Status:**
```bash
ssh root@192.168.11.12 "pvesm status; lvs; vgs"
```
3. **Check System Resources:**
```bash
ssh root@192.168.11.12 "free -h; df -h; uptime"
```
### Short-term Actions (Priority 2)
1. **Fix Logical Volume Issues:**
- Identify where volumes should be or if they need recreation
- Update container configs to use correct storage pools
- Recreate volumes if data is not critical
2. **Resolve Startup Failures:**
- Check each container's detailed error message
- Fix configuration issues
- Recreate containers if configs are missing
3. **Clear Lock on CT 10232:**
- Remove lock file and retry creation or delete container
### Long-term Actions (Priority 3)
1. **Implement Monitoring:**
- Set up alerts for container startup failures
- Monitor storage pool health
- Track container status changes
2. **Documentation:**
- Document container dependencies
- Create runbooks for common failure scenarios
- Maintain container inventory with storage mappings
3. **Prevention:**
- Implement pre-startup validation
- Add storage health checks
- Create backup procedures for container configs
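The pre-startup validation mentioned above can be as simple as refusing to attempt `pct start` when the LXC config file is missing. A minimal sketch of such a check (hypothetical helper, intended to run on the node itself where the config directory is `/etc/pve/lxc`):

```shell
# Hypothetical pre-startup check: succeed only if the LXC config
# file for the given VMID exists in the given config directory.
ct_config_exists() {
  confdir="$1"
  vmid="$2"
  [ -f "$confdir/$vmid.conf" ]
}

# Usage on the node:
#   ct_config_exists /etc/pve/lxc 3000 && pct start 3000
```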
---
## Diagnostic Commands Reference
### Check Container Status
```bash
ssh root@192.168.11.12 "pct list | grep -E '3000|3001|3002|3003|3500|3501|5200|6000|6400|10000|10001|10020|10030|10040|10050|10060|10070|10080|10090|10091|10092|10100|10101|10120|10130|10150|10151|10200|10201|10202|10210|10230|10232'"
```
### Check Storage Configuration
```bash
ssh root@192.168.11.12 "pvesm status"
ssh root@192.168.11.12 "lvs | grep -E 'vm-3000|vm-3001|vm-3002|vm-3003|vm-3500|vm-3501|vm-6000|vm-6400'"
```
### Check Container Configs
```bash
ssh root@192.168.11.12 "for vmid in 3000 3001 3002 3003 3500 3501 5200 6000 6400; do echo \"=== CT \$vmid ===\"; pct config \$vmid 2>&1 | head -5; done"
```
### Check Detailed Errors
```bash
ssh root@192.168.11.12 "for vmid in 3000 5200 10000 10232; do echo \"=== CT \$vmid ===\"; pct start \$vmid 2>&1; echo; done"
```
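The error output collected above can be bucketed into the three failure categories automatically. A hypothetical classifier, assuming only the error patterns documented in this report:

```shell
# Hypothetical classifier: map a `pct start` error line to one of
# the three failure categories documented above.
categorize_error() {
  case "$1" in
    *"no such logical volume"*) echo "missing-volume"  ;;
    *"is locked"*)              echo "locked"          ;;
    *"startup for container"*)  echo "startup-failure" ;;
    *)                          echo "unknown"         ;;
  esac
}

categorize_error "no such logical volume pve/vm-3000-disk-1"   # prints: missing-volume
categorize_error "CT is locked (create)"                       # prints: locked
```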
---
## Related Documentation
- [Storage Migration Issues](../docs/09-troubleshooting/STORAGE_MIGRATION_ISSUE.md)
- [R630-02 Storage Fixes](../docs/04-configuration/R630-02_STORAGE_FIXES_APPLIED.md)
- [RPC Migration Execution Summary](../docs/04-configuration/RPC_MIGRATION_EXECUTION_SUMMARY.md)
---
## Next Steps
1. Run the diagnostic script to gather detailed information
2. Review diagnostic output and categorize failures
3. Execute fix script for automated resolution where possible
4. Manually resolve remaining issues based on diagnostic findings
5. Verify all containers can start successfully
6. Document resolution steps for future reference