# R630-02 Container Startup Failures Analysis

**Date:** January 19, 2026
**Node:** r630-02 (192.168.11.12)
**Status:** ⚠️ **CRITICAL - 33 CONTAINERS FAILED TO START**

---

## Executive Summary

A bulk container startup operation on r630-02 left **33 containers** unable to start. The failures fall into three distinct categories:

1. **Logical Volume Missing** (8 containers) - The referenced storage volumes do not exist
2. **Startup Failures** (24 containers) - Containers fail to start; the cause is not yet determined
3. **Lock Error** (1 container) - Container is locked in "create" state

**Total Impact:** 33 containers unable to start, affecting multiple services.

---

## Failure Breakdown

### Category 1: Missing Logical Volumes (8 containers)

**Error Pattern:** `no such logical volume pve/vm-XXXX-disk-X`

**Affected Containers:**
- CT 3000: `pve/vm-3000-disk-1`
- CT 3001: `pve/vm-3001-disk-1`
- CT 3002: `pve/vm-3002-disk-2`
- CT 3003: `pve/vm-3003-disk-1`
- CT 3500: `pve/vm-3500-disk-1`
- CT 3501: `pve/vm-3501-disk-2`
- CT 6000: `pve/vm-6000-disk-1`
- CT 6400: `pve/vm-6400-disk-1`

**Root Cause Analysis:**
- Storage volumes were likely deleted, migrated, or never created
- Containers may have been migrated to another node but configs not updated
- Storage pool may have been recreated/reset, losing volume metadata
- Containers may reference wrong storage pool (e.g., `thin1` vs `thin1-r630-02`)

**Diagnostic Steps:**
1. Check if volumes exist on other storage pools:
   ```bash
   ssh root@192.168.11.12 "lvs | grep -E 'vm-3000|vm-3001|vm-3002|vm-3003|vm-3500|vm-3501|vm-6000|vm-6400'"
   ```

2. Check container storage configuration:
   ```bash
   ssh root@192.168.11.12 "pct config 3000 | grep rootfs"
   ```

3. Check available storage pools:
   ```bash
   ssh root@192.168.11.12 "pvesm status"
   ```
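
The three checks above can be combined per container. The following is a minimal sketch, not a tested procedure: `parse_rootfs` is a hypothetical helper that splits the `rootfs:` line of `pct config` output into storage ID and volume name, assuming the usual `rootfs: <storage>:<volume>,size=...` layout. Comparing the printed volume name against the `lvs` output would then show whether the volume is missing entirely or only sits under a different pool than the config expects. The sample values in the example call are illustrative, not taken from the node.

```bash
#!/bin/sh
# Hypothetical helper: read `pct config <vmid>` output on stdin and print
# "<storage> <volume>" taken from the rootfs line.
parse_rootfs() {
    awk -F': ' '$1 == "rootfs" {
        split($2, parts, ",")     # drop ",size=..." and any other options
        split(parts[1], sv, ":")  # "<storage>:<volume>"
        print sv[1], sv[2]
    }'
}

# Example with a sample config (illustrative values only):
printf 'arch: amd64\nrootfs: thin1:vm-3000-disk-1,size=20G\n' | parse_rootfs

# On the node, assuming SSH access as in the commands above (untested):
# for vmid in 3000 3001 3002 3003 3500 3501 6000 6400; do
#     ssh root@192.168.11.12 "pct config $vmid" | parse_rootfs
# done
```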

**Resolution Options:**
- **Option A:** Recreate missing volumes if data is not critical
- **Option B:** Migrate containers to existing storage pool
- **Option C:** Restore volumes from backup if available
- **Option D:** Update container configs to point to correct storage

---

### Category 2: Startup Failures (24 containers)

**Error Pattern:** `startup for container 'XXXX' failed`

**Affected Containers:**
- CT 5200
- CT 10000, 10001, 10020, 10030, 10040, 10050, 10060
- CT 10070, 10080, 10090, 10091, 10092
- CT 10100, 10101, 10120, 10130
- CT 10150, 10151
- CT 10200, 10201, 10202, 10210, 10230

**Root Cause Analysis:**
Startup failures can have multiple causes:
1. **Missing configuration files** - Container config deleted or not migrated
2. **Storage issues** - Storage accessible but corrupted or misconfigured
3. **Network issues** - Network configuration problems
4. **Resource constraints** - Insufficient memory/CPU
5. **Container corruption** - Container filesystem issues
6. **Dependencies** - Missing required services or mounts

**Diagnostic Steps:**
1. Check if config files exist:
   ```bash
   ssh root@192.168.11.12 "ls -la /etc/pve/lxc/ | grep -E '5200|10000|10001|10020|10030|10040|10050|10060|10070|10080|10090|10091|10092|10100|10101|10120|10130|10150|10151|10200|10201|10202|10210|10230'"
   ```

2. Check detailed startup error:
   ```bash
   ssh root@192.168.11.12 "pct start 5200 2>&1"
   ```

3. Check container status and locks:
   ```bash
   ssh root@192.168.11.12 "pct list | grep -E '5200|10000|10001'"
   ```

4. Check system resources:
   ```bash
   ssh root@192.168.11.12 "free -h; df -h"
   ```

5. Check container logs:
   ```bash
   ssh root@192.168.11.12 "journalctl -u pve-container@5200 -n 50 --no-pager"
   ```
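
Because the bulk start produced three distinct error patterns, captured `pct start` output can be sorted into categories automatically. A minimal sketch, assuming the error strings quoted in this report; `classify_error` is a hypothetical helper, and real messages may carry extra detail that needs additional patterns.

```bash
#!/bin/sh
# Hypothetical helper: map one line of `pct start` error output to a failure
# category, using only the error patterns quoted in this report.
classify_error() {
    case "$1" in
        *"no such logical volume"*)         echo "missing-volume" ;;
        *"CT is locked"*)                   echo "locked" ;;
        *"startup for container"*"failed"*) echo "startup-failure" ;;
        *)                                  echo "unknown" ;;
    esac
}

# Examples using the patterns from this report:
classify_error "no such logical volume pve/vm-3000-disk-1"
classify_error "CT is locked (create)"
classify_error "startup for container '5200' failed"
```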

**Resolution Options:**
- **Option A:** Fix configuration issues (network, storage, etc.)
- **Option B:** Recreate containers if configs are missing
- **Option C:** Check and resolve resource constraints
- **Option D:** Restore from backup if corruption detected

---

### Category 3: Lock Error (1 container)

**Error Pattern:** `CT is locked (create)`

**Affected Container:**
- CT 10232

**Root Cause Analysis:**
- Container is stuck in "create" state
- Previous creation operation may have been interrupted
- The `lock: create` entry is still set in the container config although creation did not complete

**Diagnostic Steps:**
1. Check lock status:
   ```bash
   ssh root@192.168.11.12 "pct list | grep 10232"
   ```

2. Check the lock entry in the container config (for LXC containers the lock is stored in the config, not under `/var/lock/qemu-server/`, which belongs to QEMU VMs):
   ```bash
   ssh root@192.168.11.12 "grep -E '^lock:' /etc/pve/lxc/10232.conf"
   ```

3. Check recent Proxmox tasks for this container (`qm list` only lists QEMU VMs and does not show tasks):
   ```bash
   ssh root@192.168.11.12 "pvenode task list --vmid 10232"
   ```

**Resolution Options:**
- **Option A:** Clear the lock with `pct unlock`, after confirming that no creation task is still running:
  ```bash
  ssh root@192.168.11.12 "pct unlock 10232"
  ```
- **Option B:** Complete or cancel the creation task
- **Option C:** Delete and recreate container if creation incomplete

---

## Successfully Started Containers

The bulk operation reported the following containers as started.

**Note:** Every ID in this list also appears in the failure lists above (Category 2, or Category 3 for CT 10232). These containers may have started initially and then failed, so verify their current state before treating any of them as running.

- CT 10030, 10040, 10050, 10060, 10070, 10080, 10090, 10091, 10092, 10100, 10101, 10120, 10130, 10150, 10151, 10200, 10201, 10202, 10210, 10230, 10232
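
The overlap between the started list and the failure lists can be checked mechanically rather than by eye. A minimal sketch using the container IDs from this report; the variable and function names are illustrative.

```bash
#!/bin/sh
# IDs reported as started (from the list in this section).
started="10030 10040 10050 10060 10070 10080 10090 10091 10092 10100 10101 10120 10130 10150 10151 10200 10201 10202 10210 10230 10232"

# IDs reported as failed (Categories 1, 2 and 3).
failed="3000 3001 3002 3003 3500 3501 6000 6400 5200 10000 10001 10020 10030 10040 10050 10060 10070 10080 10090 10091 10092 10100 10101 10120 10130 10150 10151 10200 10201 10202 10210 10230 10232"

# Print every ID that appears in both lists, one per line.
overlap() {
    for id in $started; do
        case " $failed " in
            *" $id "*) echo "$id" ;;
        esac
    done
}

# Count of IDs that are in both lists:
overlap | wc -l
```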
---

## Recommended Actions

### Immediate Actions (Priority 1)

1. **Run Diagnostic Script:**
   ```bash
   ./scripts/diagnose-r630-02-startup-failures.sh
   ```
   This should help identify the root cause of each failure.

2. **Check Storage Status:**
   ```bash
   ssh root@192.168.11.12 "pvesm status; lvs; vgs"
   ```

3. **Check System Resources:**
   ```bash
   ssh root@192.168.11.12 "free -h; df -h; uptime"
   ```

### Short-term Actions (Priority 2)

1. **Fix Logical Volume Issues:**
   - Identify where volumes should be or if they need recreation
   - Update container configs to use correct storage pools
   - Recreate volumes if data is not critical

2. **Resolve Startup Failures:**
   - Check each container's detailed error message
   - Fix configuration issues
   - Recreate containers if configs are missing

3. **Clear Lock on CT 10232:**
   - Clear the lock with `pct unlock 10232`, then retry creation or delete the container

### Long-term Actions (Priority 3)

1. **Implement Monitoring:**
   - Set up alerts for container startup failures
   - Monitor storage pool health
   - Track container status changes

2. **Documentation:**
   - Document container dependencies
   - Create runbooks for common failure scenarios
   - Maintain container inventory with storage mappings

3. **Prevention:**
   - Implement pre-startup validation
   - Add storage health checks
   - Create backup procedures for container configs
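
Pre-startup validation could take roughly the following shape. This is a sketch under stated assumptions, not an existing script: `prestart_check` is a hypothetical function, it assumes the `lock:` and `rootfs:` lines appear in `pct config` output in the usual `key: value` form, and it takes the logical volume names as its second argument (for example from `lvs --noheadings -o lv_name`). It covers only the three failure categories seen in this report.

```bash
#!/bin/sh
# Hypothetical pre-start check.
#   $1: output of `pct config <vmid>` (empty if the config is missing)
#   $2: whitespace-separated logical volume names present on the node
# Prints "ok" or the first problem found; returns non-zero on a problem.
prestart_check() {
    psc_config=$1
    psc_lvs=$2

    if [ -z "$psc_config" ]; then
        echo "missing-config"; return 1
    fi

    # A "lock: <reason>" line means an operation is pending or was interrupted.
    psc_lock=$(printf '%s\n' "$psc_config" | sed -n 's/^lock: *//p')
    if [ -n "$psc_lock" ]; then
        echo "locked ($psc_lock)"; return 1
    fi

    # Volume name is the part between "<storage>:" and the first comma.
    psc_volume=$(printf '%s\n' "$psc_config" | sed -n 's/^rootfs: *[^:]*:\([^,]*\).*/\1/p')
    if [ -z "$psc_volume" ]; then
        echo "missing-rootfs"; return 1
    fi

    # Unquoted echo collapses newlines and padding in the LV list to single spaces.
    case " $(echo $psc_lvs) " in
        *" $psc_volume "*) echo "ok" ;;
        *) echo "missing-volume ($psc_volume)"; return 1 ;;
    esac
}

# On the node this might be used as follows (untested):
# prestart_check "$(pct config 3000 2>/dev/null)" "$(lvs --noheadings -o lv_name)" && pct start 3000
```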
---

## Diagnostic Commands Reference

### Check Container Status
```bash
ssh root@192.168.11.12 "pct list | grep -E '3000|3001|3002|3003|3500|3501|5200|6000|6400|10000|10001|10020|10030|10040|10050|10060|10070|10080|10090|10091|10092|10100|10101|10120|10130|10150|10151|10200|10201|10202|10210|10230|10232'"
```

### Check Storage Configuration
```bash
ssh root@192.168.11.12 "pvesm status"
ssh root@192.168.11.12 "lvs | grep -E 'vm-3000|vm-3001|vm-3002|vm-3003|vm-3500|vm-3501|vm-6000|vm-6400'"
```

### Check Container Configs
```bash
ssh root@192.168.11.12 "for vmid in 3000 3001 3002 3003 3500 3501 5200 6000 6400; do echo \"=== CT \$vmid ===\"; pct config \$vmid 2>&1 | head -5; done"
```

### Check Detailed Errors
```bash
ssh root@192.168.11.12 "for vmid in 3000 5200 10000 10232; do echo \"=== CT \$vmid ===\"; pct start \$vmid 2>&1; echo; done"
```
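
The `grep -E` alternations in this section match anywhere in a line, so a pattern such as `3000` would also match an ID like `13000` or a container name that happens to contain those digits. A hedged alternative is to filter on the first column only. `filter_vmids` below is a hypothetical helper; it assumes the VMID is the first whitespace-separated field of `pct list` output and that the first line is the column header.

```bash
#!/bin/sh
# Hypothetical helper: keep only the lines of `pct list` output (stdin) whose
# first field is one of the given VMIDs. The header line is always kept.
filter_vmids() {
    awk -v ids="$*" '
        BEGIN { n = split(ids, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
        NR == 1 || ($1 in want)
    '
}

# On the node (untested):
# ssh root@192.168.11.12 "pct list" | filter_vmids 3000 3001 10232
```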
---

## Related Documentation

- [Storage Migration Issues](../docs/09-troubleshooting/STORAGE_MIGRATION_ISSUE.md)
- [R630-02 Storage Fixes](../docs/04-configuration/R630-02_STORAGE_FIXES_APPLIED.md)
- [RPC Migration Execution Summary](../docs/04-configuration/RPC_MIGRATION_EXECUTION_SUMMARY.md)
---

## Next Steps

1. Run the diagnostic script to gather detailed information
2. Review diagnostic output and categorize failures
3. Execute fix script for automated resolution where possible
4. Manually resolve remaining issues based on diagnostic findings
5. Verify all containers can start successfully
6. Document resolution steps for future reference