# Storage Growth and Health — Predictable Growth Table & Proactive Monitoring

**Last updated:** 2026-03-28
**Purpose:** Real-time data collection and a predictable growth table so we can stay ahead of disk space issues on hosts and VMs.

### Recent operator maintenance (2026-03-28)

- **r630-01 `pve/data` (local-lvm):** Thin pool extended (+80 GiB data, +512 MiB metadata earlier); **LVM thin auto-extend** enabled in `lvm.conf` (`thin_pool_autoextend_threshold = 80`, `thin_pool_autoextend_percent = 20`); **dmeventd** must stay active.
- **r630-01 `pve/thin1`:** Pool extended (+48 GiB data, +256 MiB metadata) to reduce pressure; metadata percent dropped accordingly.
- **r630-01 `/var/lib/vz/dump`:** Removed obsolete **2026-02-15** vzdump archives/logs (~9 GiB); newer logs from 2026-02-28 retained.
- **Fleet guest `fstrim`:** `scripts/maintenance/fstrim-all-running-ct.sh` supports **`FSTRIM_TIMEOUT_SEC`** and **`FSTRIM_HOSTS`** (e.g. `ml110`, `r630-01`, `r630-02`). Many CTs return FITRIM "not permitted" (a guest/filesystem limitation); others reclaim space on the thin pools (notably on **r630-02**).
- **r630-02 `thin1`–`thin6` VGs:** Each VG sits on a **single PV** with only **~124 MiB `vg_free`**; you **cannot** `lvextend` those thin pools until the underlying partition/disk is grown or a second PV is added. Monitor `pvesm status` and plan disk expansion before the pools tighten.
- **CT migration** off r630-01 for load balancing remains a **planned** action when maintenance windows and target storage allow (not automated here).
- **2026-03-28 (migration follow-up):** CT **3501** migrated to r630-02 **`thin5`** via `pvesh … lxc/3501/migrate --target-storage thin5`. CT **3500** had its root LV removed after a mistaken `pct set --delete unused0` (the config had `unused0: local-lvm:vm-3500-disk-0` and `rootfs: thin1:vm-3500-disk-0`); **3500** was recreated empty on r630-02 `thin5` — **reinstall Oracle Publisher** on the guest. See `MIGRATE_CT_R630_01_TO_R630_02.md`.

---

## 1. Real-time data collection

### Script: `scripts/monitoring/collect-storage-growth-data.sh`

Run from the **project root** (on the LAN, with SSH key-based access to the Proxmox hosts):

```bash
# Full snapshot to stdout + file under logs/storage-growth/
./scripts/monitoring/collect-storage-growth-data.sh

# Append one-line summary per storage to history CSV (for trending)
./scripts/monitoring/collect-storage-growth-data.sh --append

# CSV rows to stdout
./scripts/monitoring/collect-storage-growth-data.sh --csv
```

**Collected data (granularity):**

| Layer | What is collected |
|-------|-------------------|
| **Host** | `pvesm status` (each storage: type, used%, total, used, avail), `lvs` (thin pool data_percent, metadata_percent), `vgs` (VG free), `df -h /` |
| **VM/CT** | For every **running** container: `df -h /`, `df -h /data`, `df -h /var/log`; `du -sh /data/besu`, `du -sh /var/log` |

**Output:** Snapshot file `logs/storage-growth/snapshot_YYYYMMDD_HHMMSS.txt`. Use `--append` to grow `logs/storage-growth/history.csv` for trend analysis.

### Cron (proactive)

Use the scheduler script from the project root (installs a cron entry that runs every 6 hours; uses `$PROJECT_ROOT`):

```bash
./scripts/maintenance/schedule-storage-growth-cron.sh --install  # every 6h: collect + append
./scripts/maintenance/schedule-storage-growth-cron.sh --show     # print cron line
./scripts/maintenance/schedule-storage-growth-cron.sh --remove   # uninstall
```

**Retention:** Run `scripts/monitoring/prune-storage-snapshots.sh` weekly (e.g. keep the last 30 days of snapshot files). Options: `--days 14`, or `--dry-run` to preview. See **STORAGE_GROWTH_AUTOMATION_TASKS.md** for the full automation list.

---

## 2. Predictable growth table (template)

Fill and refresh from real data. **Est. monthly growth** and **Growth factor** should be updated from `history.csv` or from observed rates.

| Host / VM | Storage / path | Current used | Capacity | Growth factor | Est. monthly growth | Threshold | Action when exceeded |
|-----------|----------------|--------------|----------|---------------|---------------------|-----------|----------------------|
| **r630-01** | data (LVM thin) | _e.g. 74%_ | pool size (thin provisioned) | VMs + compaction | _—_ | **80%** warn, **95%** crit | fstrim CTs, migrate VMs, expand pool |
| **r630-01** | local-lvm | _%_ | — | — | — | 80 / 95 | Same |
| **r630-02** | thin1 / data | _%_ | — | — | — | 80 / 95 | Same |
| **ml110** | thin1 | _%_ | — | — | — | 80 / 95 | Same |
| **2101** | / (root) | _%_ | 200G | Besu DB + logs | High (RocksDB) | 85 warn, 95 crit | e2fsck, make writable, free /data |
| **2101** | /data/besu | _du_ | same as / | RocksDB + compaction | ~1–5% block growth | — | Resync or expand disk |
| **2500–2505** | /, /data/besu | _%_ | — | Besu | Same | 85 / 95 | Same as 2101 |
| **2400** | /, /data/besu | _%_ | 196G | Besu + Nginx logs | Same | 85 / 95 | Logrotate, Vert.x tuning |
| **10130, 10150, 10151** | / | _%_ | — | Logs, app data | Low–medium | 85 / 95 | Logrotate, clean caches |
| **5000** (Blockscout) | /, DB volume | _%_ | — | Postgres + indexer | Medium | 85 / 95 | VACUUM, archive old data |
| **10233, 10234** (NPMplus) | / | _%_ | — | Logs, certs | Low | 85 / 95 | Logrotate |

**Growth factor** short reference:

- **Besu (`/data/besu`):** Blockchain growth plus RocksDB compaction spikes. The largest and least predictable factor.
- **Logs (`/var/log`):** Depends on log level and rotation. Typically low if rotation is enabled.
- **Postgres/DB:** Grows with the chain indexer and app data.
- **Thin pool:** Sum of all LV allocations plus actual usage; compaction and new blocks can spike usage.

---

## 3. Factors affecting health (detailed)

Use this list to match real-time data to causes and actions.
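As a concrete example of matching live host data to the thin-pool factors, the snippet below flags pools whose data% or metadata% cross a warning threshold. It is a minimal sketch: the exact `lvs` invocation (`lvs --noheadings --separator , -o vg_name,lv_name,data_percent,metadata_percent`) and the 80% default are assumptions to be aligned with the real monitoring scripts.

```bash
# Sketch: flag thin pools over a threshold from `lvs` output.
# Assumed input lines (from lvs --noheadings --separator , \
#   -o vg_name,lv_name,data_percent,metadata_percent), e.g.:
#   pve,data,81.23,2.10
flag_thin_pools() {                  # usage: lvs … | flag_thin_pools WARN_PCT
  awk -F, -v warn="${1:-80}" '
    {
      gsub(/^[ \t]+/, "", $1)        # lvs indents the first field
      level = ($3 >= warn || $4 >= warn) ? "WARN" : "OK"
      printf "%s %s/%s data=%s%% meta=%s%%\n", level, $1, $2, $3, $4
    }
  '
}
```

On a host this would be piped straight from `lvs`; non-thin LVs should be excluded first (for example with a selection like `-S 'lv_attr=~^t'`) since they report empty percent fields.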
| Factor | Where it matters | Typical size / rate | Mitigation |
|--------|------------------|---------------------|------------|
| **LVM thin pool data%** | Host (r630-01 `data`, r630-02 `thin*`, ml110 `thin1`) | 100% = no new writes | fstrim in CTs, migrate VMs, remove unused LVs, expand pool |
| **LVM thin metadata%** | Same | High metadata% can block pool operations | Expand the metadata LV or reduce snapshots |
| **RocksDB (Besu)** | `/data/besu` in 2101, 2500–2505, 2400, 2201, etc. | Grows with the chain; compaction needs temp space | Ensure / and /data have headroom; avoid a 100% thin pool |
| **Journal / systemd logs** | `/var/log` in every CT | Can grow unbounded if not rotated | logrotate, `journalctl --vacuum-time=7d` |
| **Nginx / app logs** | `/var/log`, `/var/www` | Depends on traffic | logrotate, log level |
| **Postgres / DB** | Blockscout, DBIS, etc. | Grows with indexer and app data | VACUUM, archive, resize volume |
| **Backups (Proxmox)** | Host storage (e.g. backup target) | Per VMID, full or incremental | Retention policy, offload to NAS |
| **Root filesystem read-only** | Any CT after I/O errors or ENOSPC | — | e2fsck on the host, make writable (see 502_DEEP_DIVE) |
| **Temp/cache** | /tmp, /var/cache, Besu `java.io.tmpdir` | Spikes during compaction | Use a dedicated tmpdir (e.g. /data/besu/tmp), clear caches |

---

## 4. Thresholds and proactive playbook

| Level | Host (thin / pvesm) | VM (/, /data) | Action |
|-------|---------------------|---------------|--------|
| **OK** | < 80% | < 85% | Continue regular collection and trending |
| **Warn** | 80–95% | 85–95% | Run `collect-storage-growth-data.sh`, identify top consumers; plan migration or cleanup |
| **Critical** | > 95% | > 95% | Immediate: fstrim, stop non-essential CTs, migrate VMs, or expand storage |

**Proactive checks (recommended):**

1. **Daily or every 6 h:** Run `collect-storage-growth-data.sh --append` and inspect the latest snapshot under `logs/storage-growth/`.
2. **Weekly:** Review `logs/storage-growth/history.csv` for rising trends; update the **Predictable growth table** with current numbers and estimated monthly growth.
3. **When adding VMs or increasing chain usage:** Re-estimate growth for the affected hosts and thin pools; adjust thresholds or capacity.

---

## 5. Matching real-time data to the table

- **Host storage %:** From the script's "pvesm status" and "LVM thin pools (data%)" output sections. Map to a table row with "Host / VM" = host name and "Storage / path" = storage or LV name.
- **VM /, /data, /var/log:** From the "VM/CT on <host>" and "VMID <id>" sections of the same snapshot. Map to a table row with "Host / VM" = VMID.
- **Growth over time:** Use `history.csv` (built up by `--append` runs). Compute the delta of used% or used size between two timestamps to get a rate; extrapolate it to fill "Est. monthly growth" and to decide "Action when exceeded".

---

## 6. Related

- **Host-level alerts:** `scripts/storage-monitor.sh` (WARN 80%, CRIT 90%). Schedule: `scripts/maintenance/schedule-storage-monitor-cron.sh --install` (daily 07:00).
- **In-CT disk check:** `scripts/maintenance/check-disk-all-vmids.sh` (root /). Runs daily via `daily-weekly-checks.sh` (cron 08:00).
- **Retention:** `scripts/monitoring/prune-storage-snapshots.sh` (snapshots), `scripts/monitoring/prune-storage-history.sh` (history.csv). Both run weekly when using `schedule-storage-growth-cron.sh --install`.
- **Weekly remediation:** `daily-weekly-checks.sh weekly` runs fstrim in all running CTs and journal vacuum in key CTs; see **STORAGE_GROWTH_AUTOMATION_TASKS.md**.
- **Logrotate audit:** **LOGROTATE_AUDIT_RUNBOOK.md** (high-log VMIDs).
- **Making RPC VMIDs writable after full/read-only:** `scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh`; see **502_DEEP_DIVE_ROOT_CAUSES_AND_FIXES.md**.
- **Thin pool full / migration:** **MIGRATE_CT_R630_01_TO_R630_02.md**, **R630-02_STORAGE_REVIEW.md**.
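The rate extrapolation described in section 5 (delta between two timestamps, scaled to a month) can be sketched as below. The column layout assumed here (`epoch_seconds,host,storage,used_bytes`) is hypothetical; adjust the field positions to the actual CSV produced by `collect-storage-growth-data.sh --append`.

```bash
# Sketch: estimate monthly growth for one storage from history.csv.
# Assumed CSV layout (an assumption, not the real file format):
#   epoch_seconds,host,storage,used_bytes
estimate_monthly_growth() {          # usage: estimate_monthly_growth CSV HOST STORAGE
  awk -F, -v h="$2" -v s="$3" '
    $2 == h && $3 == s {
      if (first_ts == "") { first_ts = $1; first_used = $4 }
      last_ts = $1; last_used = $4
    }
    END {
      if (first_ts == "" || last_ts == first_ts) { print "insufficient data"; exit 1 }
      days = (last_ts - first_ts) / 86400          # seconds -> days
      rate = (last_used - first_used) / days       # bytes/day
      printf "est. monthly growth: %.1f GiB\n", rate * 30 / (1024 ^ 3)
    }
  ' "$1"
}
```

Run it against accumulated history, e.g. `estimate_monthly_growth logs/storage-growth/history.csv r630-01 data`, and copy the result into the "Est. monthly growth" column of the table in section 2.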