
# R630-02 Container Startup Failures Analysis
**Date:** January 19, 2026
**Node:** r630-02 (192.168.11.12)
**Status:** ⚠️ **CRITICAL - 33 CONTAINERS FAILED TO START**

---
## Executive Summary
A bulk container startup operation on r630-02 resulted in **33 container failures**. The failures fall into three distinct categories:
1. **Logical Volume Missing** (8 containers) - Storage volumes don't exist
2. **Startup Failures** (24 containers) - Containers fail to start for unknown reasons
3. **Lock Error** (1 container) - Container is locked in "create" state
**Total Impact:** 33 containers unable to start, affecting multiple services.

---
## Failure Breakdown
### Category 1: Missing Logical Volumes (8 containers)
**Error Pattern:** `no such logical volume pve/vm-XXXX-disk-X`
**Affected Containers:**
- CT 3000: `pve/vm-3000-disk-1`
- CT 3001: `pve/vm-3001-disk-1`
- CT 3002: `pve/vm-3002-disk-2`
- CT 3003: `pve/vm-3003-disk-1`
- CT 3500: `pve/vm-3500-disk-1`
- CT 3501: `pve/vm-3501-disk-2`
- CT 6000: `pve/vm-6000-disk-1`
- CT 6400: `pve/vm-6400-disk-1`
**Root Cause Analysis:**
- Storage volumes were likely deleted, migrated, or never created
- Containers may have been migrated to another node but configs not updated
- Storage pool may have been recreated/reset, losing volume metadata
- Containers may reference wrong storage pool (e.g., `thin1` vs `thin1-r630-02`)
**Diagnostic Steps:**
1. Check if volumes exist on other storage pools:
```bash
ssh root@192.168.11.12 "lvs | grep -E 'vm-3000|vm-3001|vm-3002|vm-3003|vm-3500|vm-3501|vm-6000|vm-6400'"
```
2. Check container storage configuration:
```bash
ssh root@192.168.11.12 "pct config 3000 | grep rootfs"
```
3. Check available storage pools:
```bash
ssh root@192.168.11.12 "pvesm status"
```
**Resolution Options:**
- **Option A:** Recreate missing volumes if data is not critical
- **Option B:** Migrate containers to existing storage pool
- **Option C:** Restore volumes from backup if available
- **Option D:** Update container configs to point to correct storage
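Option D starts with confirming which pool each container config actually references. A minimal sketch, assuming the `rootfs:` line format shown by `pct config` (the helper itself is hypothetical, not a Proxmox tool):

```shell
# Hypothetical helper: extract the storage pool name from a
# `pct config` rootfs line, e.g.
#   rootfs: thin1:vm-3000-disk-1,size=8G  ->  thin1
rootfs_pool() {
  printf '%s\n' "$1" | sed -n 's/^rootfs: *\([^:]*\):.*/\1/p'
}

# Example, using output captured from `pct config 3000 | grep rootfs`:
rootfs_pool "rootfs: thin1:vm-3000-disk-1,size=8G"   # prints: thin1
```

Comparing the extracted pool against `pvesm status` output shows immediately whether a config points at a pool that no longer exists (e.g. `thin1` vs `thin1-r630-02`).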
---
### Category 2: Startup Failures (24 containers)
**Error Pattern:** `startup for container 'XXXX' failed`
**Affected Containers:**
- CT 5200
- CT 10000, 10001, 10020, 10030, 10040, 10050, 10060
- CT 10070, 10080, 10090, 10091, 10092
- CT 10100, 10101, 10120, 10130
- CT 10150, 10151
- CT 10200, 10201, 10202, 10210, 10230
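The long `grep -E` alternation patterns used in the diagnostics below can be generated from this ID list rather than typed by hand; a small sketch:

```shell
# Build the `grep -E` alternation pattern from the failed CT IDs above.
ids="5200 10000 10001 10020 10030 10040 10050 10060 10070 10080 10090 10091 10092 10100 10101 10120 10130 10150 10151 10200 10201 10202 10210 10230"
pattern=$(echo "$ids" | tr ' ' '|')
echo "$pattern"

# Then, from a machine with SSH access to the node:
#   ssh root@192.168.11.12 "pct list | grep -E '$pattern'"
```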
**Root Cause Analysis:**
Startup failures can have multiple causes:
1. **Missing configuration files** - Container config deleted or not migrated
2. **Storage issues** - Storage accessible but corrupted or misconfigured
3. **Network issues** - Network configuration problems
4. **Resource constraints** - Insufficient memory/CPU
5. **Container corruption** - Container filesystem issues
6. **Dependencies** - Missing required services or mounts
**Diagnostic Steps:**
1. Check if config files exist:
```bash
ssh root@192.168.11.12 "ls -la /etc/pve/lxc/ | grep -E '5200|10000|10001|10020|10030|10040|10050|10060|10070|10080|10090|10091|10092|10100|10101|10120|10130|10150|10151|10200|10201|10202|10210|10230'"
```
2. Check detailed startup error:
```bash
ssh root@192.168.11.12 "pct start 5200 2>&1"
```
3. Check container status and locks:
```bash
ssh root@192.168.11.12 "pct list | grep -E '5200|10000|10001'"
```
4. Check system resources:
```bash
ssh root@192.168.11.12 "free -h; df -h"
```
5. Check container logs:
```bash
ssh root@192.168.11.12 "journalctl -u pve-container@5200 -n 50 --no-pager"
```
**Resolution Options:**
- **Option A:** Fix configuration issues (network, storage, etc.)
- **Option B:** Recreate containers if configs are missing
- **Option C:** Check and resolve resource constraints
- **Option D:** Restore from backup if corruption detected
---
### Category 3: Lock Error (1 container)
**Error Pattern:** `CT is locked (create)`
**Affected Container:**
- CT 10232
**Root Cause Analysis:**
- Container is stuck in "create" state
- Previous creation operation may have been interrupted
- Lock file exists but container creation incomplete
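Container locks are recorded as a `lock:` line in the container's config, so the lock state can be read straight from `pct config` output. A hypothetical helper (not a Proxmox tool) for scripting that check:

```shell
# Hypothetical helper: read the lock state (e.g. "create") from
# `pct config` output. Prints nothing if the container is not locked.
ct_lock_state() {
  printf '%s\n' "$1" | sed -n 's/^lock: *//p'
}

ct_lock_state "$(printf 'arch: amd64\nlock: create\n')"   # prints: create
```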
**Diagnostic Steps:**
1. Check lock status:
```bash
ssh root@192.168.11.12 "pct list | grep 10232"
```
2. Check for a lock entry in the container config (container locks are recorded there; `/var/lock/qemu-server/` is used for VMs, not containers):
```bash
ssh root@192.168.11.12 "grep '^lock:' /etc/pve/lxc/10232.conf"
```
3. Check the node's task history for an unfinished create task (GUI: node → Task History), and look for a stale per-container config lock file:
```bash
ssh root@192.168.11.12 "ls -la /run/lock/lxc/ | grep 10232"
```
**Resolution Options:**
- **Option A:** Clear the lock with the built-in command (only after confirming no create task is still running):
```bash
ssh root@192.168.11.12 "pct unlock 10232"
```
- **Option B:** Complete or cancel the creation task
- **Option C:** Delete and recreate container if creation incomplete
---
## Successfully Started Containers
The following containers started successfully:
- CT 10030, 10040, 10050, 10060, 10070, 10080, 10090, 10091, 10092, 10100, 10101, 10120, 10130, 10150, 10151, 10200, 10201, 10202, 10210, 10230, 10232
**Note:** This list overlaps with the Category 2 failure list above; some containers reported a successful start and then failed. Verify current state with `pct status <vmid>`.

---
## Recommended Actions
### Immediate Actions (Priority 1)
1. **Run Diagnostic Script:**
```bash
./scripts/diagnose-r630-02-startup-failures.sh
```
This will identify the root cause for each failure.
2. **Check Storage Status:**
```bash
ssh root@192.168.11.12 "pvesm status; lvs; vgs"
```
3. **Check System Resources:**
```bash
ssh root@192.168.11.12 "free -h; df -h; uptime"
```
### Short-term Actions (Priority 2)
1. **Fix Logical Volume Issues:**
- Identify where volumes should be or if they need recreation
- Update container configs to use correct storage pools
- Recreate volumes if data is not critical
2. **Resolve Startup Failures:**
- Check each container's detailed error message
- Fix configuration issues
- Recreate containers if configs are missing
3. **Clear Lock on CT 10232:**
- Remove lock file and retry creation or delete container
### Long-term Actions (Priority 3)
1. **Implement Monitoring:**
- Set up alerts for container startup failures
- Monitor storage pool health
- Track container status changes
2. **Documentation:**
- Document container dependencies
- Create runbooks for common failure scenarios
- Maintain container inventory with storage mappings
3. **Prevention:**
- Implement pre-startup validation
- Add storage health checks
- Create backup procedures for container configs
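The pre-startup validation mentioned above can be as simple as refusing to attempt `pct start` when the LXC config file is missing. A minimal sketch of such a check (hypothetical helper, intended to run on the node itself where the config directory is `/etc/pve/lxc`):

```shell
# Hypothetical pre-startup check: succeed only if the LXC config
# file for the given VMID exists in the given config directory.
ct_config_exists() {
  confdir="$1"
  vmid="$2"
  [ -f "$confdir/$vmid.conf" ]
}

# Usage on the node:
#   ct_config_exists /etc/pve/lxc 3000 && pct start 3000
```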
---
## Diagnostic Commands Reference
### Check Container Status
```bash
ssh root@192.168.11.12 "pct list | grep -E '3000|3001|3002|3003|3500|3501|5200|6000|6400|10000|10001|10020|10030|10040|10050|10060|10070|10080|10090|10091|10092|10100|10101|10120|10130|10150|10151|10200|10201|10202|10210|10230|10232'"
```
### Check Storage Configuration
```bash
ssh root@192.168.11.12 "pvesm status"
ssh root@192.168.11.12 "lvs | grep -E 'vm-3000|vm-3001|vm-3002|vm-3003|vm-3500|vm-3501|vm-6000|vm-6400'"
```
### Check Container Configs
```bash
ssh root@192.168.11.12 "for vmid in 3000 3001 3002 3003 3500 3501 5200 6000 6400; do echo \"=== CT \$vmid ===\"; pct config \$vmid 2>&1 | head -5; done"
```
### Check Detailed Errors
```bash
ssh root@192.168.11.12 "for vmid in 3000 5200 10000 10232; do echo \"=== CT \$vmid ===\"; pct start \$vmid 2>&1; echo; done"
```
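The error output collected above can be bucketed into the three failure categories automatically. A hypothetical classifier, assuming only the error patterns documented in this report:

```shell
# Hypothetical classifier: map a `pct start` error line to one of
# the three failure categories documented above.
categorize_error() {
  case "$1" in
    *"no such logical volume"*) echo "missing-volume"  ;;
    *"is locked"*)              echo "locked"          ;;
    *"startup for container"*)  echo "startup-failure" ;;
    *)                          echo "unknown"         ;;
  esac
}

categorize_error "no such logical volume pve/vm-3000-disk-1"   # prints: missing-volume
categorize_error "CT is locked (create)"                       # prints: locked
```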
---
## Related Documentation
- [Storage Migration Issues](../docs/09-troubleshooting/STORAGE_MIGRATION_ISSUE.md)
- [R630-02 Storage Fixes](../docs/04-configuration/R630-02_STORAGE_FIXES_APPLIED.md)
- [RPC Migration Execution Summary](../docs/04-configuration/RPC_MIGRATION_EXECUTION_SUMMARY.md)
---
## Next Steps
1. Run the diagnostic script to gather detailed information
2. Review diagnostic output and categorize failures
3. Execute fix script for automated resolution where possible
4. Manually resolve remaining issues based on diagnostic findings
5. Verify all containers can start successfully
6. Document resolution steps for future reference