Storage Growth and Health — Predictable Growth Table & Proactive Monitoring
Last updated: 2026-03-28
Purpose: Real-time data collection and a predictable growth table so we can stay ahead of disk space issues on hosts and VMs.
Recent operator maintenance (2026-03-28)
- Fleet checks (same day, follow-up): Ran `collect-storage-growth-data.sh --append`, `storage-monitor.sh check`, and `proxmox-host-io-optimize-pass.sh` (swappiness/sysstat; host `fstrim` N/A on LVM root). Load: ml110 load dominated by Besu (Java) and cloudflared; r630-01 load improved after the earlier spike (still many CTs). ZFS: r630-01 / r630-02 `rpool` ONLINE; last scrub 2026-03-08, 0 errors. `/proc/mdstat` (r630-01): RAID devices present and active (no resync observed during the check).
- CT 7811 (r630-02, thin4): Root was 100% full (~44 GiB in `/var/log/syslog` plus rotated `syslog.1`). Remediation: truncated `syslog`/`syslog.1` and restarted `rsyslog`; root ~6% after the fix. Ensure logrotate for `syslog` is effective inside the guest (or lower rsyslog verbosity).
- CT 10100 (r630-01, thin1): Root WARN (~88–90% on 8 GiB); growth mostly `/var/lib/postgresql` (~5 GiB). Remediation: `pct resize 10100 rootfs +4G` plus `resize2fs`; root ~57% after. Note: Proxmox warned about thin overcommit vs the VG; monitor `pvesm`/`lvs` and avoid excessive concurrent disk expansions without pool growth.
- `storage-monitor.sh`: Fixed the `set -e` abort on unreachable optional nodes and the pipe-subshell issue so `ALERTS+=` runs in the main shell (alerts and summaries work).
- r630-01 `pve/data` (local-lvm): Thin pool extended (+80 GiB data, +512 MiB metadata earlier); LVM thin auto-extend enabled in `lvm.conf` (`thin_pool_autoextend_threshold = 80`, `thin_pool_autoextend_percent = 20`); dmeventd must stay active.
- r630-01 `pve/thin1`: Pool extended (+48 GiB data, +256 MiB metadata) to reduce pressure; metadata percent dropped accordingly.
- r630-01 `/var/lib/vz/dump`: Removed obsolete 2026-02-15 vzdump archives/logs (~9 GiB); newer logs from 2026-02-28 retained.
- Fleet guest `fstrim`: `scripts/maintenance/fstrim-all-running-ct.sh` supports `FSTRIM_TIMEOUT_SEC` and `FSTRIM_HOSTS` (e.g. `ml110,r630-01,r630-02`). Many CTs return FITRIM "not permitted" (guest/filesystem); others reclaim space on the thin pools (notably on r630-02).
- r630-02 `thin1`–`thin6` VGs: Each VG sits on a single PV with only ~124 MiB `vg_free`; those thin pools cannot be `lvextend`ed until the underlying partition/disk is grown or a second PV is added. Monitor `pvesm status` and plan disk expansion before the pools tighten.
- CT migration off r630-01 for load balancing remains a planned action when maintenance windows and target storage allow (not automated here).
- 2026-03-28 (migration follow-up): CT 3501 migrated to r630-02 `thin5` via `pvesh … lxc/3501/migrate --target-storage thin5`. CT 3500 had its root LV removed after a mistaken `pct set --delete unused0` (the config had `unused0: local-lvm:vm-3500-disk-0` and `rootfs: thin1:vm-3500-disk-0`); 3500 was recreated empty on r630-02 `thin5`, so Oracle Publisher must be reinstalled on the guest. See `MIGRATE_CT_R630_01_TO_R630_02.md`.
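The `storage-monitor.sh` pipe-subshell fix noted above can be sketched in isolation. This is a minimal standalone illustration, not the script's actual code; `check_node` is a hypothetical stand-in for the real per-node check:

```shell
#!/usr/bin/env bash
# Appending to an array inside `cmd | while read` happens in a subshell,
# so the appends are lost when the pipe ends. Process substitution keeps
# the loop in the main shell, so ALERTS+=(...) survives.
set -u

ALERTS=()

# Hypothetical per-node check emitting one alert line per problem.
check_node() {
  printf '%s\n' "node1: disk 91%" "node2: disk 96%"
}

# BROKEN (pipe subshell): ALERTS is still empty after the loop.
check_node | while read -r line; do ALERTS+=("$line"); done
echo "after pipe-while: ${#ALERTS[@]} alerts"

# FIXED (process substitution): the loop runs in the main shell.
while read -r line; do ALERTS+=("$line"); done < <(check_node)
echo "after process substitution: ${#ALERTS[@]} alerts"
```

With the fix in place, alerts gathered across nodes accumulate in the main shell, which is what lets the script's final summary see them.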
1. Real-time data collection
Script: `scripts/monitoring/collect-storage-growth-data.sh`
Run from project root (LAN, SSH key-based access to Proxmox hosts):
```shell
# Full snapshot to stdout + file under logs/storage-growth/
./scripts/monitoring/collect-storage-growth-data.sh

# Append one-line summary per storage to history CSV (for trending)
./scripts/monitoring/collect-storage-growth-data.sh --append

# CSV rows to stdout
./scripts/monitoring/collect-storage-growth-data.sh --csv
```
Collected data (granularity):
| Layer | What is collected |
|---|---|
| Host | pvesm status (each storage: type, used%, total, used, avail), lvs (thin pool data_percent, metadata_percent), vgs (VG free), df -h / |
| VM/CT | For every running container: df -h /, df -h /data, df -h /var/log; du -sh /data/besu, du -sh /var/log |
Output: Snapshot file `logs/storage-growth/snapshot_YYYYMMDD_HHMMSS.txt`. Use `--append` to grow `logs/storage-growth/history.csv` for trend analysis.
Cron (proactive)
Use the scheduler script from project root (installs cron every 6 hours; uses $PROJECT_ROOT):
```shell
./scripts/maintenance/schedule-storage-growth-cron.sh --install   # every 6h: collect + append
./scripts/maintenance/schedule-storage-growth-cron.sh --show      # print cron line
./scripts/maintenance/schedule-storage-growth-cron.sh --remove    # uninstall
```
Retention: Run `scripts/monitoring/prune-storage-snapshots.sh` weekly (e.g. keep the last 30 days of snapshot files). Options: `--days 14`, or `--dry-run` to preview. See STORAGE_GROWTH_AUTOMATION_TASKS.md for the full automation list.
2. Predictable growth table (template)
Fill and refresh from real data. "Est. monthly growth" and "Growth factor" should be updated from `history.csv` or from observed rates.
| Host / VM | Storage / path | Current used | Capacity | Growth factor | Est. monthly growth | Threshold | Action when exceeded |
|---|---|---|---|---|---|---|---|
| r630-01 | data (LVM thin) | ~72% (pvesm 2026-03-28) | ~360G pool | Thin provisioned | VMs + compaction | 80% warn, 95% crit | fstrim CTs, migrate VMs, expand pool |
| r630-01 | thin1 | ~48% | ~256G pool | CT root disks on thin1 | Same | 80 / 95 | Same; watch overcommit vs vgs |
| r630-02 | thin1–thin6 (thin1-r630-02 …) | ~1–27% per pool (2026-03-28) | ~226G each | Mixed CTs | Same | 80 / 95 | VG free ~0.12 GiB per thin VG; expand disk/PV before growing LVs |
| ml110 | data / local-lvm | ~15% | ~1.7T thin | Besu CTs | High | 80 / 95 | Same |
| 2101 | / (root) | % | 200G | Besu DB + logs | High (RocksDB) | 85 warn, 95 crit | e2fsck, make writable, free /data |
| 2101 | /data/besu | du | same as / | RocksDB + compaction | ~1–5% block growth | — | Resync or expand disk |
| 2500–2505 | /, /data/besu | % | — | Besu | Same | 85 / 95 | Same as 2101 |
| 2400 | /, /data/besu | % | 196G | Besu + Nginx logs | Same | 85 / 95 | Logrotate, Vert.x tuning |
| 10130, 10150, 10151 | / | % | — | Logs, app data | Low–medium | 85 / 95 | Logrotate, clean caches |
| 5000 (Blockscout) | /, DB volume | % | — | Postgres + indexer | Medium | 85 / 95 | VACUUM, archive old data |
| 10233, 10234 (NPMplus) | / | % | — | Logs, certs | Low | 85 / 95 | Logrotate |
| 7811 (r630-02) | /, /var/log | ~6% after cleanup | 50G | Runaway syslog | Low if rotated | 85 / 95 | Truncate/rotate syslog; fix rsyslog/logrotate |
| 10100 (r630-01) | / | ~57% after +4G | 12G | PostgreSQL under /var/lib | DB growth | 85 / 95 | VACUUM/archive; resize cautiously (thin overcommit) |
Growth factor short reference:
- Besu (/data/besu): Block chain growth + RocksDB compaction spikes. Largest and least predictable.
- Logs (/var/log): Depends on log level and rotation. Typically low if rotation is enabled.
- Postgres/DB: Grows with chain indexer and app data.
- Thin pool: Sum of all LV allocations + actual usage; compaction and new blocks can spike usage.
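The thin-pool point above (virtual allocations can exceed the pool) can be made concrete with a small overcommit calculation. The pool and LV sizes below are invented for illustration; on a real host they come from `lvs --units g` and `vgs`:

```shell
#!/usr/bin/env bash
# Thin pools let the sum of virtual LV sizes exceed the pool; track the
# ratio so compaction/new-block spikes don't hit a 100%-full pool.
# All sizes are illustrative, not taken from any host in this doc.
set -u

pool_size_g=256             # thin pool data size (GiB)
lv_sizes_g=(64 64 96 120)   # virtual sizes of the thin LVs in the pool

total=0
for s in "${lv_sizes_g[@]}"; do total=$((total + s)); done

# awk for the fractional ratio (bash integer math would truncate).
ratio=$(awk -v t="$total" -v p="$pool_size_g" 'BEGIN { printf "%.2f", t / p }')
echo "allocated ${total}G over a ${pool_size_g}G pool: ${ratio}x overcommit"
```

A ratio above 1.0 is normal for thin provisioning; the point is to know the ratio and watch actual data% so auto-extend or expansion happens before writes fail.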
3. Factors affecting health (detailed)
Use this list to match real-time data to causes and actions.
| Factor | Where it matters | Typical size / rate | Mitigation |
|---|---|---|---|
| LVM thin pool data% | Host (r630-01 data, r630-02 thin*, ml110 thin1) | 100% = no new writes | fstrim in CTs, migrate VMs, remove unused LVs, expand pool |
| LVM thin metadata% | Same | High metadata% can cause issues | Expand metadata LV or reduce snapshots |
| RocksDB (Besu) | /data/besu in 2101, 2500–2505, 2400, 2201, etc. | Grows with chain; compaction needs temp space | Ensure / and /data have headroom; avoid 100% thin pool |
| Journal / systemd logs | /var/log in every CT | Can grow if not rotated | logrotate, journalctl --vacuum-time=7d |
| Nginx / app logs | /var/log, /var/www | Depends on traffic | logrotate, log level |
| Postgres / DB | Blockscout, DBIS, etc. | Grows with indexer and app data | VACUUM, archive, resize volume |
| Backups (proxmox) | Host storage (e.g. backup target) | Per VMID, full or incremental | Retention policy, offload to NAS |
| Root filesystem read-only | Any CT when I/O or ENOSPC | — | e2fsck on host, make writable (see 502_DEEP_DIVE) |
| Temp/cache | /tmp, /var/cache, Besu java.io.tmpdir | Spikes during compaction | Use dedicated tmpdir (e.g. /data/besu/tmp), clear caches |
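For the log-rotation factor, a logrotate sketch in the spirit of the CT 7811 remediation (illustrative only; the path, size cap, and postrotate helper follow Debian conventions and should be checked against the guest's stock rsyslog config before use):

```
# Illustrative /etc/logrotate.d fragment for a runaway syslog in a CT.
/var/log/syslog {
    daily
    rotate 7
    maxsize 500M      # rotate early if the file balloons between runs
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        /usr/lib/rsyslog/rsyslog-rotate
    endscript
}
```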
4. Thresholds and proactive playbook
| Level | Host (thin / pvesm) | VM (/, /data) | Action |
|---|---|---|---|
| OK | < 80% | < 85% | Continue regular collection and trending |
| Warn | 80–95% | 85–95% | Run collect-storage-growth-data.sh, identify top consumers; plan migration or cleanup |
| Critical | > 95% | > 95% | Immediate: fstrim, stop non-essential CTs, migrate VMs, or expand storage |
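The threshold bands can be encoded as a tiny helper for scripts. This `classify` function is a sketch mirroring the table above, not part of any existing script:

```shell
#!/usr/bin/env bash
# Map a used% to the playbook level. Band edges mirror the table:
# hosts (thin / pvesm) use 80/95, VM filesystems use 85/95.
set -u

classify() {  # classify <used_percent> <warn_at> <crit_above>
  local pct=$1 warn=$2 crit=$3
  if   [ "$pct" -gt "$crit" ]; then echo "Critical"
  elif [ "$pct" -ge "$warn" ]; then echo "Warn"
  else                              echo "OK"
  fi
}

classify 72 80 95   # host thin pool at 72%
classify 88 85 95   # VM root at 88%
classify 97 80 95   # host storage at 97%
```

The same bands could then drive WARN/CRIT output in cron-driven checks instead of being re-derived ad hoc per script.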
Proactive checks (recommended):
- Daily or every 6h: Run `collect-storage-growth-data.sh --append` and inspect the latest snapshot under `logs/storage-growth/`.
- Weekly: Review `logs/storage-growth/history.csv` for rising trends; update the predictable growth table with current numbers and est. monthly growth.
- When adding VMs or chain usage: Re-estimate growth for affected hosts and thin pools; adjust thresholds or capacity.
5. Matching real-time data to the table
- Host storage %: From script output “pvesm status” and “LVM thin pools (data%)”. Map to row “Host / VM” = host name, “Storage / path” = storage or LV name.
- VM /, /data, /var/log: From “VM/CT on <host>” and “VMID <id>” in the same snapshot. Map to row “Host / VM” = VMID.
- Growth over time: Use `history.csv` (built up by `--append` runs). Compute the delta of used% or used size between two timestamps to get a rate; extrapolate to fill "Est. monthly growth" and plan "Action when exceeded".
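A minimal sketch of that delta computation over two history rows. The CSV layout here (date, host, storage, used GiB) is an assumption for illustration and must be matched to the real `history.csv` columns:

```shell
#!/usr/bin/env bash
# Monthly-growth estimate from two samples of the same storage.
# Column layout (date,host,storage,used_gib) is assumed for this sketch.
set -u

history='2026-02-26,r630-01,data,230
2026-03-28,r630-01,data,260'

# These two sample dates are 30 days apart, so the delta is already
# roughly "per month"; with other intervals, scale by 30/days.
summary=$(printf '%s\n' "$history" | awk -F, '
  { used[NR] = $4; date[NR] = $1 }
  END {
    delta = used[2] - used[1]
    printf "%s -> %s: +%d GiB (~%d GiB/month)", date[1], date[2], delta, delta
  }')
echo "$summary"
```

The resulting rate is what feeds the "Est. monthly growth" column; comparing it against remaining capacity gives the lead time before a threshold is crossed.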
6. Related
- Host-level alerts: `scripts/storage-monitor.sh` (WARN 80%, CRIT 90%). Schedule: `scripts/maintenance/schedule-storage-monitor-cron.sh --install` (daily 07:00).
- In-CT disk check: `scripts/maintenance/check-disk-all-vmids.sh` (root /). Run daily via `daily-weekly-checks.sh` (cron 08:00).
- Retention: `scripts/monitoring/prune-storage-snapshots.sh` (snapshots), `scripts/monitoring/prune-storage-history.sh` (history.csv). Both run weekly when using `schedule-storage-growth-cron.sh --install`.
- Weekly remediation: `daily-weekly-checks.sh weekly` runs fstrim in all running CTs and journal vacuum in key CTs; see STORAGE_GROWTH_AUTOMATION_TASKS.md.
- Logrotate audit: LOGROTATE_AUDIT_RUNBOOK.md (high-log VMIDs).
- Making RPC VMIDs writable after full/read-only: `scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh`; see 502_DEEP_DIVE_ROOT_CAUSES_AND_FIXES.md.
- Thin pool full / migration: MIGRATE_CT_R630_01_TO_R630_02.md, R630-02_STORAGE_REVIEW.md.