# Storage Recommendations by Fill Rate and Growth
**Last updated:** 2026-02-28
Based on current usage, history in `logs/storage-growth/history.csv`, and physical drive layout across ml110, r630-01, and r630-02.
**Completed (2026-02-28):** Storage growth cron verified; prune (VMID 5000 + r630-01 CTs) run; ml110 sdb added to VG `pve` and data thin pool extended to ~1.7 TB (ml110 data now ~11% used). **Phase 1 migration (r630-01 data → thin1):** 8 CTs migrated (10233, 10120, 10100, 10101, 10235, 10236, 7804, 8640); r630-01 data **65.8%** (was 72%), thin1 50.6%.
---
## 1. Thresholds and monitoring
| Level | Use % | Action |
|-------|--------|--------|
| **Healthy** | < 75% | Continue normal collection; review quarterly. |
| **Watch** | 75–84% | Weekly review; plan prune or migration. |
| **WARN** | 85–94% | Prune and/or migrate within 1–2 weeks; do not add new large CTs. |
| **CRIT** | ≥ 95% | Immediate action; LVM thin pools can fail or go read-only. |
Current scripts: `check-disk-all-vmids.sh` uses WARN 85%, CRIT 95% for **container root** usage. These recommendations apply to **host storage (pvesm / LVM)** as well.
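The threshold table can be expressed as a small helper for ad-hoc checks. This is a sketch only: `classify_usage` is a hypothetical function, not one of the existing scripts, and the `pvesm` percent-column position in the comment is an assumption that may vary by PVE version.

```shell
#!/bin/sh
# Hypothetical helper (not part of the existing scripts): map a pool's
# integer use% to the threshold level from the table above.
classify_usage() {
  pct=$1
  if   [ "$pct" -ge 95 ]; then echo "CRIT"
  elif [ "$pct" -ge 85 ]; then echo "WARN"
  elif [ "$pct" -ge 75 ]; then echo "Watch"
  else                         echo "Healthy"
  fi
}

# Example against pvesm output (the field number is an assumption; check
# `pvesm status` column layout on your PVE version first):
#   pvesm status | awk 'NR>1 {print $1, int($7)}' | \
#     while read -r pool pct; do echo "$pool: $(classify_usage "$pct")"; done
classify_usage 66   # prints Healthy
```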
---
## 2. Observed fill behavior (from history)
| Host | Storage | Trend (recent) | Implied rate / note |
|------|---------|----------------|----------------------|
| **ml110** | data | ~28.7% → ~25% (Feb 15 → 27) | Slight decrease (prune/dedup). Plenty of free space. |
| **r630-01** | data | 88% → 100% → 72% → **65.8%** (Phase 1 migration) | After Phase 1 (8 CTs data→thin1). Main growth host (validators, RPCs, many CTs). |
| **r630-02** | thin1-r630-02 | ~26.5% stable | Low growth. |
| **r630-02** | thin2 | ~4.8% → ~9% after 5000 migration | Now holds Blockscout (5000); monitor. |
| **r630-02** | thin5 | Was 84.6% → 0% after migration | Empty; available for future moves. |
**Conclusion:** The pool that fills fastest and needs the most attention is **r630-01 data** (65.8% after Phase 1, was 72%; many CTs, Besu/DB growth). **ml110 data** is stable and has headroom. **r630-02** is manageable if you avoid concentrating more large CTs on a single thin pool.
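The implied rates above come from `logs/storage-growth/history.csv`. A minimal sketch for pulling the first-to-last use% delta for one pool; the column order `date,host,storage,use_pct` is an assumption, so adjust the field numbers to the real file layout.

```shell
#!/bin/sh
# Sketch: print the first-to-last use% delta for one host/storage pair.
# Assumed CSV columns: date,host,storage,use_pct -- adjust if the real
# history.csv differs.
growth_delta() {  # growth_delta <host> <storage> <csvfile>
  awk -F, -v h="$1" -v s="$2" '
    $2 == h && $3 == s { if (first == "") first = $4; last = $4 }
    END { if (first != "") printf "%+.1f\n", last - first }
  ' "$3"
}

# Example with fabricated sample rows (real values live in
# logs/storage-growth/history.csv):
cat > /tmp/history-sample.csv <<'EOF'
2026-02-15T00:00,r630-01,data,88
2026-02-27T00:00,r630-01,data,65.8
EOF
growth_delta r630-01 data /tmp/history-sample.csv   # prints -22.2
```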
---
## 3. Recommendations by host and pool
### ml110
- **data / local-lvm (~11% after sdb extension)**
  - **Rate:** Low/slow.
  - **Recommendations:**
    - Keep running `collect-storage-growth-data.sh --append` (e.g. cron every 6h).
    - Prune logs in CTs periodically (e.g. with `fix-storage-r630-01-and-thin5.sh`-style logic for ml110 or a dedicated prune script).
    - No urgency; review again when approaching 70%.
- **sdb (931G), added to VG `pve` 2026-02-28**
  - **Recommendation (completed):** Use it before adding new disks elsewhere.
    - **Option A (chosen):** Add sdb to VG `pve` and extend the `data` thin pool (or create a second thin pool). Frees pressure on sda and roughly doubles effective data capacity.
    - **Option B:** Create a separate VG + thin pool on sdb for new or migrated CTs.
  - Document the chosen layout and any new Proxmox storage names in `storage.cfg` and in `PHYSICAL_DRIVES_AND_CONFIG.md`.
### r630-01
- **data / local-lvm (~66% after Phase 1 migration)**
  - **Rate:** Highest risk; this pool has the most CTs and Besu/DB growth.
  - **Recommendations:**
    1. **Short term:**
       - Run log/journal prune on all r630-01 CTs regularly (e.g. `fix-storage-r630-01-and-thin5.sh` Phase 2, or a cron job).
       - Keep storage growth collection (e.g. every 6h) and review weekly while > 70%.
    2. **Before 85%:**
       - Move one or more large CTs to **thin1** on r630-01 (thin1 ~51% used, still has space) if VMIDs allow, or plan migration to r630-02 thin pools.
       - Identify the biggest CTs: `check-disk-all-vmids.sh` and `lvs` on r630-01 (data pool).
    3. **Before 90%:**
       - Decide on expansion (e.g. add disks to RAID10 and extend md0/LVM) or permanent migration of several CTs to r630-02.
  - **Do not** let this pool sit above 85% for long; it has already hit 100% once.
- **thin1 (~51% after Phase 1 migration)**
  - **Rate:** Moderate.
  - **Recommendations:** Use as spillover for data pool migrations when possible. Monitor monthly; act if > 75%.
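Identifying the biggest consumers on the data pool might look like the following command sketch (to run on r630-01; the `lvs` field and selection names are standard LVM report fields, but verify against `lvs -o help` on the host):

```shell
# Command sketch (run on r630-01): list thin volumes backed by pve/data,
# largest first, with allocated size and actual data usage.
lvs --units g -o lv_name,lv_size,data_percent,pool_lv \
    -S 'pool_lv=data' --sort -lv_size pve

# Cross-check LV allocation against per-CT filesystem usage:
# ./check-disk-all-vmids.sh
```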
### r630-02
- **thin1-r630-02 (~26%)**
  - **Rate:** Low.
  - **Recommendation:** Monitor; no change needed unless you add many CTs here.
- **thin2 (~9% after 5000 migration)**
  - **Rate:** May grow with Blockscout (5000) and other CTs.
  - **Recommendations:**
    - Run the VMID 5000 prune periodically: `vmid5000-free-disk-and-logs.sh`.
    - If thin2 approaches 75%, consider moving one CT to thin5 (now empty) or thin6.
- **thin3, thin4, thin6 (roughly 11–22%)**
  - **Rate:** Low to moderate.
  - **Recommendation:** Include in weekly pvesm/lvs review; no special action unless one pool trends > 75%.
- **thin5 (0% after migration)**
  - **Recommendation:** Keep as reserve for migrations from thin2 or other pools when they approach WARN.
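A same-host move from thin2 to thin5 on r630-02 can be done with `pct move-volume`, without a backup/restore cycle. Sketch only: `<VMID>` is a placeholder, storage names must match `/etc/pve/storage.cfg`, and the `--delete` behavior should be verified against `pct help move-volume` on your PVE version.

```shell
# Sketch (run on r630-02): move a CT's root disk from thin2 to thin5.
# Stop the CT first; container volume moves are done offline.
pct shutdown <VMID>
pct move-volume <VMID> rootfs thin5 --delete 1   # --delete removes the source copy on success
pct start <VMID>
```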
---
## 4. Operational schedule (by fill rate)
| When | Action |
|------|--------|
| **Always** | Cron: `collect-storage-growth-data.sh --append` every 6h; weekly: `prune-storage-snapshots.sh` (e.g. Sun 08:00). |
| **Weekly** | Review `pvesm status` and `lvs` (or run `audit-proxmox-rpc-storage.sh`); check any pool > 70%. |
| **75% ≤ use < 85%** | Plan and run prune; plan migration for largest CTs on that pool; consider using ml110 sdb (if not yet in use). |
| **85% ≤ use < 95%** | Execute prune and migration within 1–2 weeks; do not add new large VMs/CTs to that pool. |
| **≥ 95%** | Immediate prune + migration; consider emergency migration to ml110 (after adding sdb) or r630-02. |
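The "Always" row might look like this as a root crontab fragment. The absolute paths are assumptions based on the script table in §5; `schedule-storage-growth-cron.sh --install` should produce something equivalent and is the preferred route.

```shell
# Sketch of the standing schedule (root crontab); adjust paths to the
# actual checkout location of the repo on each host.
# m  h    dom mon dow  command
0 */6 * * *  /root/proxmox/scripts/monitoring/collect-storage-growth-data.sh --append
0 8   * * 0  /root/proxmox/scripts/maintenance/prune-storage-snapshots.sh   # Sun 08:00
```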
---
## 5. Scripts to support these recommendations
| Script | Purpose |
|--------|--------|
| `scripts/monitoring/collect-storage-growth-data.sh --append` | Record fill over time (for rate). |
| `scripts/maintenance/schedule-storage-growth-cron.sh --install` | Install 6h collect + weekly prune. |
| `scripts/audit-proxmox-rpc-storage.sh` | Current pvesm + RPC rootfs mapping. |
| `scripts/maintenance/check-disk-all-vmids.sh` | Per-CT disk usage (find big consumers). |
| `scripts/maintenance/fix-storage-r630-01-and-thin5.sh` | Prune 5000 + r630-01 CT logs; optional migrate 5000. |
| `scripts/maintenance/migrate-ct-r630-01-data-to-thin1.sh <VMID>` | Migrate one CT from r630-01 data → thin1 (same host). |
| `scripts/maintenance/vmid5000-free-disk-and-logs.sh` | Prune Blockscout (5000) only. |
---
## 6. Adding ml110 sdb to increase capacity (suggested steps)
1. On ml110: `vgextend pve /dev/sdb` (if sdb is already a PV) or `pvcreate /dev/sdb && vgextend pve /dev/sdb`.
2. Extend the data thin pool: `lvextend -L +900G /dev/pve/data` (or use `lvextend -l +100%FREE` and adjust as needed).
3. Re-run `pvesm status` and update documentation.
4. No CT migration required; existing LVs on data can use the new space.
(If sdb is a raw disk with no PV, partition or use full disk as PV per your policy; then add to `pve` and extend the data LV as above.)
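Steps 1–3 above as one hedged sequence, with verification between steps. This was already executed on ml110 per the header; it is kept here as a runbook sketch for other hosts, and the `+900G` figure assumes a ~931G sdb.

```shell
# Runbook sketch (already executed on ml110 2026-02-28; kept for reuse).
pvs                                  # confirm sdb is not already a PV
pvcreate /dev/sdb                    # skip if sdb is already a PV
vgextend pve /dev/sdb
vgs pve                              # verify the new free extents
lvextend -L +900G /dev/pve/data      # or: lvextend -l +100%FREE /dev/pve/data
lvs pve                              # confirm the data thin pool size
pvesm status                         # Proxmox's view of the new capacity
```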
---
## 7. Summary table by risk
| Host | Pool | Current (approx) | Risk | Priority recommendation |
|------|------|-------------------|------|--------------------------|
| ml110 | data | ~11% (post-extension) | Low | **Done:** sdb added; pool ~1.7 TB. Monitor as before. |
| ml110 | sdb | In use (extended data) | — | **Done:** sdb added to pve, data thin pool extended (~1.7 TB total). |
| r630-01 | data | ~66% (post-Phase 1) | High | Prune weekly; plan migrations before 85%; consider thin1 spillover. |
| r630-01 | thin1 | ~51% (post-Phase 1) | Medium | Use for migrations from data; monitor monthly. |
| r630-02 | thin1-r630-02 | ~26% | Low | Monitor. |
| r630-02 | thin2 | ~9% | Low | Prune 5000 periodically; watch growth. |
| r630-02 | thin5 | 0% | Low | Keep as reserve for migrations. |
| r630-02 | thin3, thin4, thin6 | ~11–22% | Low | Include in weekly review. |
These recommendations are based on the fill rates observed in `logs/storage-growth/history.csv` and the current configuration; adjust the thresholds or the schedule if your growth pattern changes.