# Storage Growth and Health — Predictable Growth Table & Proactive Monitoring

**Last updated:** 2026-03-28
**Purpose:** Real-time data collection and a predictable growth table so we can stay ahead of disk space issues on hosts and VMs.

### Recent operator maintenance (2026-03-28)

- **r630-01 `pve/data` (local-lvm):** Thin pool extended (+80 GiB data, +512 MiB metadata earlier); **LVM thin auto-extend** enabled in `lvm.conf` (`thin_pool_autoextend_threshold = 80`, `thin_pool_autoextend_percent = 20`); **dmeventd** must stay active.
- **r630-01 `pve/thin1`:** Pool extended (+48 GiB data, +256 MiB metadata) to reduce pressure; metadata percent dropped accordingly.
- **r630-01 `/var/lib/vz/dump`:** Removed obsolete **2026-02-15** vzdump archives/logs (~9 GiB); newer logs from 2026-02-28 retained.
- **Fleet guest `fstrim`:** `scripts/maintenance/fstrim-all-running-ct.sh` supports **`FSTRIM_TIMEOUT_SEC`** and **`FSTRIM_HOSTS`** (e.g. `ml110`, `r630-01`, `r630-02`). Many CTs return FITRIM “not permitted” (guest/filesystem); others reclaim space on the thin pools (notably on **r630-02**).
- **r630-02 `thin1`–`thin6` VGs:** Each VG is on a **single PV** with only **~124 MiB `vg_free`**; you **cannot** `lvextend` those thin pools until the underlying partition/disk is grown or a second PV is added. Monitor `pvesm status` and plan disk expansion before pools tighten.
- **CT migration** off r630-01 for load balancing remains a **planned** action when maintenance windows and target storage allow (not automated here).
- **2026-03-28 (migration follow-up):** CT **3501** migrated to r630-02 **`thin5`** via `pvesh … lxc/3501/migrate --target-storage thin5`. CT **3500** had root LV removed after a mistaken `pct set --delete unused0` (config had `unused0: local-lvm:vm-3500-disk-0` and `rootfs: thin1:vm-3500-disk-0`); **3500** was recreated empty on r630-02 `thin5` — **reinstall Oracle Publisher** on the guest. See `MIGRATE_CT_R630_01_TO_R630_02.md`.
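
The auto-extend settings noted for r630-01 above live in `/etc/lvm/lvm.conf` under the `activation` section. A fragment with the values from the maintenance note (verify against the host's actual file) looks like:

```
activation {
    # Auto-extend a thin pool once it crosses 80% usage...
    thin_pool_autoextend_threshold = 80
    # ...growing it by 20% of its current size; dmeventd must be running
    thin_pool_autoextend_percent = 20
}
```

`lvs -o +seg_monitor` shows whether each thin pool is currently being monitored by dmeventd.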

---

## 1. Real-time data collection

### Script: `scripts/monitoring/collect-storage-growth-data.sh`

Run from **project root** (LAN, SSH key-based access to Proxmox hosts):

```bash
# Full snapshot to stdout + file under logs/storage-growth/
./scripts/monitoring/collect-storage-growth-data.sh

# Append one-line summary per storage to history CSV (for trending)
./scripts/monitoring/collect-storage-growth-data.sh --append

# CSV rows to stdout
./scripts/monitoring/collect-storage-growth-data.sh --csv
```

**Collected data (granularity):**

| Layer | What is collected |
|-------|-------------------|
| **Host** | `pvesm status` (each storage: type, used%, total, used, avail), `lvs` (thin pool data_percent, metadata_percent), `vgs` (VG free), `df -h /` |
| **VM/CT** | For every **running** container: `df -h /`, `df -h /data`, `df -h /var/log`; `du -sh /data/besu`, `du -sh /var/log` |
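
A minimal sketch of how such a per-guest sweep can be done from a Proxmox host (the actual script may differ; the function name is illustrative):

```shell
# Sketch: report disk usage for every running CT via `pct exec`.
collect_ct_usage() {
  # `pct list` prints a header, then one line per CT with Status in column 2
  pct list | awk 'NR > 1 && $2 == "running" { print $1 }' |
  while read -r vmid; do
    echo "== VMID $vmid =="
    for path in / /data /var/log; do
      # a guest may lack /data; skip errors instead of aborting the sweep
      pct exec "$vmid" -- df -h "$path" 2>/dev/null || true
    done
  done
}
```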

**Output:** Snapshot file `logs/storage-growth/snapshot_YYYYMMDD_HHMMSS.txt`. Use `--append` to grow `logs/storage-growth/history.csv` for trend analysis.

### Cron (proactive)

Use the scheduler script from project root (installs cron every 6 hours; uses `$PROJECT_ROOT`):

```bash
./scripts/maintenance/schedule-storage-growth-cron.sh --install   # every 6h: collect + append
./scripts/maintenance/schedule-storage-growth-cron.sh --show      # print cron line
./scripts/maintenance/schedule-storage-growth-cron.sh --remove    # uninstall
```
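
For reference, the installed entry is an ordinary six-hour crontab line of roughly this shape (the path and log redirection here are illustrative; `--show` prints the real one):

```
0 */6 * * * cd /path/to/project && ./scripts/monitoring/collect-storage-growth-data.sh --append >> logs/storage-growth/cron.log 2>&1
```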

**Retention:** Run `scripts/monitoring/prune-storage-snapshots.sh` weekly (e.g. keep the last 30 days of snapshot files). Options: `--days 14` to adjust retention, `--dry-run` to preview. See **STORAGE_GROWTH_AUTOMATION_TASKS.md** for the full automation list.

---

## 2. Predictable growth table (template)

Fill and refresh from real data. **Est. monthly growth** and **Growth factor** should be updated from `history.csv` or from observed rates.

| Host / VM | Storage / path | Current used | Capacity | Growth factor | Est. monthly growth | Threshold | Action when exceeded |
|-----------|----------------|--------------|----------|---------------|---------------------|-----------|----------------------|
| **r630-01** | data (LVM thin) | _e.g. 74%_ | pool size | Thin provisioned | VMs + compaction | **80%** warn, **95%** crit | fstrim CTs, migrate VMs, expand pool |
| **r630-01** | local-lvm | _%_ | — | — | — | 80 / 95 | Same |
| **r630-02** | thin1 / data | _%_ | — | — | — | 80 / 95 | Same |
| **ml110** | thin1 | _%_ | — | — | — | 80 / 95 | Same |
| **2101** | / (root) | _%_ | 200G | Besu DB + logs | High (RocksDB) | 85 warn, 95 crit | e2fsck, make writable, free /data |
| **2101** | /data/besu | _du_ | same as / | RocksDB + compaction | ~1–5% block growth | — | Resync or expand disk |
| **2500–2505** | /, /data/besu | _%_ | — | Besu | Same | 85 / 95 | Same as 2101 |
| **2400** | /, /data/besu | _%_ | 196G | Besu + Nginx logs | Same | 85 / 95 | Logrotate, Vert.x tuning |
| **10130, 10150, 10151** | / | _%_ | — | Logs, app data | Low–medium | 85 / 95 | Logrotate, clean caches |
| **5000** (Blockscout) | /, DB volume | _%_ | — | Postgres + indexer | Medium | 85 / 95 | VACUUM, archive old data |
| **10233, 10234** (NPMplus) | / | _%_ | — | Logs, certs | Low | 85 / 95 | Logrotate |

**Growth factor** short reference:

- **Besu (/data/besu):** Block chain growth + RocksDB compaction spikes. Largest and least predictable.
- **Logs (/var/log):** Depends on log level and rotation. Typically low if rotation is enabled.
- **Postgres/DB:** Grows with chain indexer and app data.
- **Thin pool:** Sum of all LV allocations + actual usage; compaction and new blocks can spike usage.

---

## 3. Factors affecting health (detailed)

Use this list to match real-time data to causes and actions.

| Factor | Where it matters | Typical size / rate | Mitigation |
|--------|-------------------|----------------------|------------|
| **LVM thin pool data%** | Host (r630-01 data, r630-02 thin*, ml110 thin1) | 100% = no new writes | fstrim in CTs, migrate VMs, remove unused LVs, expand pool |
| **LVM thin metadata%** | Same | Exhausted metadata forces the pool read-only | Expand metadata LV or reduce snapshots |
| **RocksDB (Besu)** | /data/besu in 2101, 2500–2505, 2400, 2201, etc. | Grows with chain; compaction needs temp space | Ensure / and /data have headroom; avoid 100% thin pool |
| **Journal / systemd logs** | /var/log in every CT | Can grow unbounded if not rotated | logrotate, `journalctl --vacuum-time=7d` |
| **Nginx / app logs** | /var/log, /var/www | Depends on traffic | logrotate, log level |
| **Postgres / DB** | Blockscout, DBIS, etc. | Grows with indexer and app data | VACUUM, archive, resize volume |
| **Backups (proxmox)** | Host storage (e.g. backup target) | Per VMID, full or incremental | Retention policy, offload to NAS |
| **Root filesystem read-only** | Any CT when I/O or ENOSPC | — | e2fsck on host, make writable (see 502_DEEP_DIVE) |
| **Temp/cache** | /tmp, /var/cache, Besu java.io.tmpdir | Spikes during compaction | Use dedicated tmpdir (e.g. /data/besu/tmp), clear caches |
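
The journal mitigation above can also be applied from the host without logging into the guest; a minimal example (the helper name is illustrative):

```shell
# Sketch: trim a CT's systemd journal to the last 7 days from the host.
vacuum_ct_journal() {
  vmid="$1"
  pct exec "$vmid" -- journalctl --vacuum-time=7d
}
```

Usage: `vacuum_ct_journal 2101`.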

---

## 4. Thresholds and proactive playbook

| Level | Host (thin / pvesm) | VM (/, /data) | Action |
|-------|----------------------|---------------|--------|
| **OK** | < 80% | < 85% | Continue regular collection and trending |
| **Warn** | 80–95% | 85–95% | Run `collect-storage-growth-data.sh`, identify top consumers; plan migration or cleanup |
| **Critical** | > 95% | > 95% | Immediate: fstrim, stop non-essential CTs, migrate VMs, or expand storage |
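
The thresholds above reduce to a small classification rule; a sketch (the function name is illustrative), where `warn_at` is 80 for host storages and 85 for guest filesystems:

```shell
# Map an integer used% onto the OK / WARN / CRIT levels from the table.
classify_usage() {
  pct="$1"; warn_at="$2"
  if [ "$pct" -gt 95 ]; then
    echo "CRIT"
  elif [ "$pct" -ge "$warn_at" ]; then
    echo "WARN"
  else
    echo "OK"
  fi
}
```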

**Proactive checks (recommended):**

1. **Daily or every 6h:** Run `collect-storage-growth-data.sh --append` and inspect latest snapshot under `logs/storage-growth/`.
2. **Weekly:** Review `logs/storage-growth/history.csv` for rising trends; update the **Predictable growth table** with current numbers and est. monthly growth.
3. **When adding VMs or chain usage:** Re-estimate growth for affected hosts and thin pools; adjust thresholds or capacity.

---

## 5. Matching real-time data to the table

- **Host storage %:** From script output “pvesm status” and “LVM thin pools (data%)”. Map to row “Host / VM” = host name, “Storage / path” = storage or LV name.
- **VM /, /data, /var/log:** From “VM/CT on <host>” and “VMID <id>” in the same snapshot. Map to row “Host / VM” = VMID.
- **Growth over time:** Use `history.csv` (built up by `--append` runs). Compute the delta of used% or used size between two timestamps to get a rate; extrapolate it to fill “Est. monthly growth”, and compare against “Threshold” to pick the “Action when exceeded”.

---

## 6. Related

- **Host-level alerts:** `scripts/storage-monitor.sh` (WARN 80%, CRIT 90%). Schedule: `scripts/maintenance/schedule-storage-monitor-cron.sh --install` (daily 07:00).
- **In-CT disk check:** `scripts/maintenance/check-disk-all-vmids.sh` (root /). Run daily via `daily-weekly-checks.sh` (cron 08:00).
- **Retention:** `scripts/monitoring/prune-storage-snapshots.sh` (snapshots), `scripts/monitoring/prune-storage-history.sh` (history.csv). Both run weekly when using `schedule-storage-growth-cron.sh --install`.
- **Weekly remediation:** `daily-weekly-checks.sh weekly` runs fstrim in all running CTs and journal vacuum in key CTs; see **STORAGE_GROWTH_AUTOMATION_TASKS.md**.
- **Logrotate audit:** **LOGROTATE_AUDIT_RUNBOOK.md** (high-log VMIDs).
- **Making RPC VMIDs writable after full/read-only:** `scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh`; see **502_DEEP_DIVE_ROOT_CAUSES_AND_FIXES.md**.
- **Thin pool full / migration:** **MIGRATE_CT_R630_01_TO_R630_02.md**, **R630-02_STORAGE_REVIEW.md**.