Storage Growth and Health — Predictable Growth Table & Proactive Monitoring
Last updated: 2026-03-28
Purpose: Real-time data collection and a predictable growth table so we can stay ahead of disk space issues on hosts and VMs.
Recent operator maintenance (2026-03-28)
- Fleet checks (same day, follow-up): Ran `collect-storage-growth-data.sh --append`, `storage-monitor.sh check`, and `proxmox-host-io-optimize-pass.sh` (swappiness/sysstat; host `fstrim` N/A on LVM root). Load: ml110 load dominated by Besu (Java) and cloudflared; r630-01 load improved after the earlier spike (still many CTs). ZFS: r630-01 / r630-02 `rpool` ONLINE; last scrub 2026-03-08, 0 errors. `/proc/mdstat` (r630-01): RAID devices present and active (no resync observed during the check).
- CT 7811 (r630-02, thin4): Root was 100% full (~44 GiB in `/var/log/syslog` plus rotated `syslog.1`). Remediation: truncated `syslog`/`syslog.1` and restarted `rsyslog`; root ~6% after the fix. Ensure logrotate for `syslog` is effective inside the guest (or lower rsyslog verbosity).
- CT 10100 (r630-01, thin1): Root WARN (~88–90% on 8 GiB); growth mostly `/var/lib/postgresql` (~5 GiB). Remediation: `pct resize 10100 rootfs +4G` plus `resize2fs`; root ~57% after. Note: Proxmox warned about thin overcommit vs the VG; monitor `pvesm`/`lvs` and avoid excessive concurrent disk expansions without pool growth.
- `storage-monitor.sh`: Fixed the `set -e` abort on unreachable optional nodes and the pipe-subshell issue so `ALERTS+=` runs in the main shell (alerts and summaries work).
- r630-01 `pve/data` (local-lvm): Thin pool extended (+80 GiB data, +512 MiB metadata earlier); LVM thin auto-extend enabled in `lvm.conf` (`thin_pool_autoextend_threshold = 80`, `thin_pool_autoextend_percent = 20`); dmeventd must stay active.
- r630-01 `pve/thin1`: Pool extended (+48 GiB data, +256 MiB metadata) to reduce pressure; metadata percent dropped accordingly.
- r630-01 `/var/lib/vz/dump`: Removed obsolete 2026-02-15 vzdump archives/logs (~9 GiB); newer logs from 2026-02-28 retained.
- Fleet guest `fstrim`: `scripts/maintenance/fstrim-all-running-ct.sh` supports `FSTRIM_TIMEOUT_SEC` and `FSTRIM_HOSTS` (e.g. `ml110,r630-01,r630-02`). Many CTs return FITRIM "not permitted" (guest/filesystem); others reclaim space on the thin pools (notably on r630-02).
- r630-02 `thin1`–`thin6` VGs: Each VG sits on a single PV with only ~124 MiB `vg_free`; those thin pools cannot be `lvextend`ed until the underlying partition/disk is grown or a second PV is added. Monitor `pvesm status` and plan disk expansion before the pools tighten.
- CT migration off r630-01 for load balancing remains a planned action when maintenance windows and target storage allow (not automated here).
- 2026-03-28 (migration follow-up): CT 3501 migrated to r630-02 `thin5` via `pvesh … lxc/3501/migrate --target-storage thin5`. CT 3500 had its root LV removed after a mistaken `pct set --delete unused0` (the config had `unused0: local-lvm:vm-3500-disk-0` and `rootfs: thin1:vm-3500-disk-0`); 3500 was recreated empty on r630-02 `thin5`, so Oracle Publisher must be reinstalled on the guest. See `MIGRATE_CT_R630_01_TO_R630_02.md`.
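The `storage-monitor.sh` pipe-subshell fix noted above can be sketched in isolation. This is a minimal standalone illustration, not the script's actual code; `check_node` is a hypothetical stand-in for the real per-node check:

```shell
#!/usr/bin/env bash
# Appending to an array inside `cmd | while read` happens in a subshell,
# so the appends are lost when the pipe ends. Process substitution keeps
# the loop in the main shell, so ALERTS+=(...) survives.
set -u

ALERTS=()

# Hypothetical per-node check emitting one alert line per problem.
check_node() {
  printf '%s\n' "node1: disk 91%" "node2: disk 96%"
}

# BROKEN (pipe subshell): ALERTS is still empty after the loop.
check_node | while read -r line; do ALERTS+=("$line"); done
echo "after pipe-while: ${#ALERTS[@]} alerts"

# FIXED (process substitution): the loop runs in the main shell.
while read -r line; do ALERTS+=("$line"); done < <(check_node)
echo "after process substitution: ${#ALERTS[@]} alerts"
```

With the fix in place, alerts gathered across nodes accumulate in the main shell, which is what lets the script's final summary see them.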
1. Real-time data collection
Script: `scripts/monitoring/collect-storage-growth-data.sh`
Run from project root (LAN, SSH key-based access to Proxmox hosts):
```shell
# Full snapshot to stdout + file under logs/storage-growth/
./scripts/monitoring/collect-storage-growth-data.sh

# Append one-line summary per storage to history CSV (for trending)
./scripts/monitoring/collect-storage-growth-data.sh --append

# CSV rows to stdout
./scripts/monitoring/collect-storage-growth-data.sh --csv
```
Collected data (granularity):
| Layer | What is collected |
|---|---|
| Host | pvesm status (each storage: type, used%, total, used, avail), lvs (thin pool data_percent, metadata_percent), vgs (VG free), df -h / |
| VM/CT | For every running container: df -h /, df -h /data, df -h /var/log; du -sh /data/besu, du -sh /var/log |
Output: Snapshot file `logs/storage-growth/snapshot_YYYYMMDD_HHMMSS.txt`. Use `--append` to grow `logs/storage-growth/history.csv` for trend analysis.
Cron (proactive)
Use the scheduler script from project root (installs cron every 6 hours; uses $PROJECT_ROOT):
```shell
./scripts/maintenance/schedule-storage-growth-cron.sh --install   # every 6h: collect + append
./scripts/maintenance/schedule-storage-growth-cron.sh --show      # print cron line
./scripts/maintenance/schedule-storage-growth-cron.sh --remove    # uninstall
```
Retention: Run `scripts/monitoring/prune-storage-snapshots.sh` weekly (e.g. keep the last 30 days of snapshot files). Options: `--days 14`, or `--dry-run` to preview. See STORAGE_GROWTH_AUTOMATION_TASKS.md for the full automation list.
2. Predictable growth table (template)
Fill and refresh from real data. "Est. monthly growth" and "Growth factor" should be updated from `history.csv` or from observed rates.
| Host / VM | Storage / path | Current used | Capacity | Growth factor | Est. monthly growth | Threshold | Action when exceeded |
|---|---|---|---|---|---|---|---|
| r630-01 | data (LVM thin) | ~72% (pvesm 2026-03-28) | ~360G pool | Thin provisioned | VMs + compaction | 80% warn, 95% crit | fstrim CTs, migrate VMs, expand pool |
| r630-01 | thin1 | ~48% | ~256G pool | CT root disks on thin1 | Same | 80 / 95 | Same; watch overcommit vs vgs |
| r630-02 | thin1–thin6 (thin1-r630-02 …) | ~1–27% per pool (2026-03-28) | ~226G each | Mixed CTs | Same | 80 / 95 | VG free ~0.12 GiB per thin VG; expand disk/PV before growing LVs |
| ml110 | data / local-lvm | ~15% | ~1.7T thin | Besu CTs | High | 80 / 95 | Same |
| 2101 | / (root) | % | 200G | Besu DB + logs | High (RocksDB) | 85 warn, 95 crit | e2fsck, make writable, free /data |
| 2101 | /data/besu | du | same as / | RocksDB + compaction | ~1–5% block growth | — | Resync or expand disk |
| 2500–2505 | /, /data/besu | % | — | Besu | Same | 85 / 95 | Same as 2101 |
| 2400 | /, /data/besu | % | 196G | Besu + Nginx logs | Same | 85 / 95 | Logrotate, Vert.x tuning |
| 10130, 10150, 10151 | / | % | — | Logs, app data | Low–medium | 85 / 95 | Logrotate, clean caches |
| 5000 (Blockscout) | /, DB volume | % | — | Postgres + indexer | Medium | 85 / 95 | VACUUM, archive old data |
| 10233, 10234 (NPMplus) | / | % | — | Logs, certs | Low | 85 / 95 | Logrotate |
| 7811 (r630-02) | /, /var/log | ~6% after cleanup | 50G | Runaway syslog | Low if rotated | 85 / 95 | Truncate/rotate syslog; fix rsyslog/logrotate |
| 10100 (r630-01) | / | ~57% after +4G | 12G | PostgreSQL under /var/lib | DB growth | 85 / 95 | VACUUM/archive; resize cautiously (thin overcommit) |
Growth factor short reference:
- Besu (/data/besu): Block chain growth + RocksDB compaction spikes. Largest and least predictable.
- Logs (/var/log): Depends on log level and rotation. Typically low if rotation is enabled.
- Postgres/DB: Grows with chain indexer and app data.
- Thin pool: Sum of all LV allocations + actual usage; compaction and new blocks can spike usage.
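The thin-pool point above (virtual allocations can exceed the pool) can be made concrete with a small overcommit calculation. The pool and LV sizes below are invented for illustration; on a real host they come from `lvs --units g` and `vgs`:

```shell
#!/usr/bin/env bash
# Thin pools let the sum of virtual LV sizes exceed the pool; track the
# ratio so compaction/new-block spikes don't hit a 100%-full pool.
# All sizes are illustrative, not taken from any host in this doc.
set -u

pool_size_g=256             # thin pool data size (GiB)
lv_sizes_g=(64 64 96 120)   # virtual sizes of the thin LVs in the pool

total=0
for s in "${lv_sizes_g[@]}"; do total=$((total + s)); done

# awk for the fractional ratio (bash integer math would truncate).
ratio=$(awk -v t="$total" -v p="$pool_size_g" 'BEGIN { printf "%.2f", t / p }')
echo "allocated ${total}G over a ${pool_size_g}G pool: ${ratio}x overcommit"
```

A ratio above 1.0 is normal for thin provisioning; the point is to know the ratio and watch actual data% so auto-extend or expansion happens before writes fail.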
3. Factors affecting health (detailed)
Use this list to match real-time data to causes and actions.
| Factor | Where it matters | Typical size / rate | Mitigation |
|---|---|---|---|
| LVM thin pool data% | Host (r630-01 data, r630-02 thin*, ml110 thin1) | 100% = no new writes | fstrim in CTs, migrate VMs, remove unused LVs, expand pool |
| LVM thin metadata% | Same | High metadata% can cause issues | Expand metadata LV or reduce snapshots |
| RocksDB (Besu) | /data/besu in 2101, 2500–2505, 2400, 2201, etc. | Grows with chain; compaction needs temp space | Ensure / and /data have headroom; avoid 100% thin pool |
| Journal / systemd logs | /var/log in every CT | Can grow if not rotated | logrotate, journalctl --vacuum-time=7d |
| Nginx / app logs | /var/log, /var/www | Depends on traffic | logrotate, log level |
| Postgres / DB | Blockscout, DBIS, etc. | Grows with indexer and app data | VACUUM, archive, resize volume |
| Backups (proxmox) | Host storage (e.g. backup target) | Per VMID, full or incremental | Retention policy, offload to NAS |
| Root filesystem read-only | Any CT when I/O or ENOSPC | — | e2fsck on host, make writable (see 502_DEEP_DIVE) |
| Temp/cache | /tmp, /var/cache, Besu java.io.tmpdir | Spikes during compaction | Use dedicated tmpdir (e.g. /data/besu/tmp), clear caches |
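For the log-rotation factor, a logrotate sketch in the spirit of the CT 7811 remediation (illustrative only; the path, size cap, and postrotate helper follow Debian conventions and should be checked against the guest's stock rsyslog config before use):

```
# Illustrative /etc/logrotate.d fragment for a runaway syslog in a CT.
/var/log/syslog {
    daily
    rotate 7
    maxsize 500M      # rotate early if the file balloons between runs
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        /usr/lib/rsyslog/rsyslog-rotate
    endscript
}
```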
4. Thresholds and proactive playbook
| Level | Host (thin / pvesm) | VM (/, /data) | Action |
|---|---|---|---|
| OK | < 80% | < 85% | Continue regular collection and trending |
| Warn | 80–95% | 85–95% | Run collect-storage-growth-data.sh, identify top consumers; plan migration or cleanup |
| Critical | > 95% | > 95% | Immediate: fstrim, stop non-essential CTs, migrate VMs, or expand storage |
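The threshold bands can be encoded as a tiny helper for scripts. This `classify` function is a sketch mirroring the table above, not part of any existing script:

```shell
#!/usr/bin/env bash
# Map a used% to the playbook level. Band edges mirror the table:
# hosts (thin / pvesm) use 80/95, VM filesystems use 85/95.
set -u

classify() {  # classify <used_percent> <warn_at> <crit_above>
  local pct=$1 warn=$2 crit=$3
  if   [ "$pct" -gt "$crit" ]; then echo "Critical"
  elif [ "$pct" -ge "$warn" ]; then echo "Warn"
  else                              echo "OK"
  fi
}

classify 72 80 95   # host thin pool at 72%
classify 88 85 95   # VM root at 88%
classify 97 80 95   # host storage at 97%
```

The same bands could then drive WARN/CRIT output in cron-driven checks instead of being re-derived ad hoc per script.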
Proactive checks (recommended):
- Daily or every 6h: Run `collect-storage-growth-data.sh --append` and inspect the latest snapshot under `logs/storage-growth/`.
- Weekly: Review `logs/storage-growth/history.csv` for rising trends; update the predictable growth table with current numbers and est. monthly growth.
- When adding VMs or chain usage: Re-estimate growth for affected hosts and thin pools; adjust thresholds or capacity.
5. Matching real-time data to the table
- Host storage %: From script output “pvesm status” and “LVM thin pools (data%)”. Map to row “Host / VM” = host name, “Storage / path” = storage or LV name.
- VM /, /data, /var/log: From “VM/CT on <host>” and “VMID <id>” in the same snapshot. Map to row “Host / VM” = VMID.
- Growth over time: Use `history.csv` (built up by `--append` runs). Compute the delta of used% or used size between two timestamps to get a rate; extrapolate to fill "Est. monthly growth" and plan "Action when exceeded".
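A minimal sketch of that delta computation over two history rows. The CSV layout here (date, host, storage, used GiB) is an assumption for illustration and must be matched to the real `history.csv` columns:

```shell
#!/usr/bin/env bash
# Monthly-growth estimate from two samples of the same storage.
# Column layout (date,host,storage,used_gib) is assumed for this sketch.
set -u

history='2026-02-26,r630-01,data,230
2026-03-28,r630-01,data,260'

# These two sample dates are 30 days apart, so the delta is already
# roughly "per month"; with other intervals, scale by 30/days.
summary=$(printf '%s\n' "$history" | awk -F, '
  { used[NR] = $4; date[NR] = $1 }
  END {
    delta = used[2] - used[1]
    printf "%s -> %s: +%d GiB (~%d GiB/month)", date[1], date[2], delta, delta
  }')
echo "$summary"
```

The resulting rate is what feeds the "Est. monthly growth" column; comparing it against remaining capacity gives the lead time before a threshold is crossed.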
6. Related
- Host-level alerts: `scripts/storage-monitor.sh` (WARN 80%, CRIT 90%). Schedule: `scripts/maintenance/schedule-storage-monitor-cron.sh --install` (daily 07:00).
- In-CT disk check: `scripts/maintenance/check-disk-all-vmids.sh` (root /). Run daily via `daily-weekly-checks.sh` (cron 08:00).
- Retention: `scripts/monitoring/prune-storage-snapshots.sh` (snapshots), `scripts/monitoring/prune-storage-history.sh` (history.csv). Both run weekly when using `schedule-storage-growth-cron.sh --install`.
- Weekly remediation: `daily-weekly-checks.sh weekly` runs fstrim in all running CTs and journal vacuum in key CTs; see STORAGE_GROWTH_AUTOMATION_TASKS.md.
- Logrotate audit: LOGROTATE_AUDIT_RUNBOOK.md (high-log VMIDs).
- Making RPC VMIDs writable after full/read-only: `scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh`; see 502_DEEP_DIVE_ROOT_CAUSES_AND_FIXES.md.
- Thin pool full / migration: MIGRATE_CT_R630_01_TO_R630_02.md, R630-02_STORAGE_REVIEW.md.