diff --git a/docs/04-configuration/STORAGE_GROWTH_AND_HEALTH.md b/docs/04-configuration/STORAGE_GROWTH_AND_HEALTH.md
index 7090bdc..cf3e34d 100644
--- a/docs/04-configuration/STORAGE_GROWTH_AND_HEALTH.md
+++ b/docs/04-configuration/STORAGE_GROWTH_AND_HEALTH.md
@@ -5,6 +5,10 @@
 ### Recent operator maintenance (2026-03-28)
 
+- **Fleet checks (same day, follow-up):** Ran `collect-storage-growth-data.sh --append`, `storage-monitor.sh check`, and `proxmox-host-io-optimize-pass.sh` (swappiness/sysstat; host `fstrim` N/A on LVM root). **Load:** ml110 load dominated by **Besu (Java)** and **cloudflared**; r630-01 load improved after the earlier spike (still many CTs). **ZFS:** r630-01 / r630-02 `rpool` ONLINE; last scrub **2026-03-08**, 0 errors. **`/proc/mdstat` (r630-01):** RAID devices present and active (no resync observed during check).
+- **CT 7811 (r630-02, thin4):** Root was **100%** full (**~44 GiB** in `/var/log/syslog` + rotated `syslog.1`). **Remediation:** truncated `syslog` / `syslog.1` and restarted `rsyslog`; root **~6%** after fix. Ensure **logrotate** for `syslog` is effective inside the guest (or lower rsyslog verbosity).
+- **CT 10100 (r630-01, thin1):** Root **WARN** (**~88–90%** on 8 GiB); growth mostly **`/var/lib/postgresql` (~5 GiB)**. **Remediation:** `pct resize 10100 rootfs +4G` + `resize2fs`; root **~57%** after. **Note:** Proxmox warned of thin **overcommit** vs the VG; monitor `pvesm` / `lvs` and avoid excessive concurrent disk expansions without pool growth.
+- **`storage-monitor.sh`:** Fixed the **`set -e` abort** on unreachable optional nodes and the **pipe-subshell** bug so `ALERTS+=` runs in the main shell (alerts and summaries now work).
 - **r630-01 `pve/data` (local-lvm):** Thin pool extended (+80 GiB data, +512 MiB metadata earlier); **LVM thin auto-extend** enabled in `lvm.conf` (`thin_pool_autoextend_threshold = 80`, `thin_pool_autoextend_percent = 20`); **dmeventd** must stay active.
 - **r630-01 `pve/thin1`:** Pool extended (+48 GiB data, +256 MiB metadata) to reduce pressure; metadata percent dropped accordingly.
 - **r630-01 `/var/lib/vz/dump`:** Removed obsolete **2026-02-15** vzdump archives/logs (~9 GiB); newer logs from 2026-02-28 retained.
@@ -61,10 +65,10 @@ Fill and refresh from real data. **Est. monthly growth** and **Growth factor** s
 
 | Host / VM | Storage / path | Current used | Capacity | Growth factor | Est. monthly growth | Threshold | Action when exceeded |
 |-----------|----------------|--------------|----------|---------------|---------------------|-----------|----------------------|
-| **r630-01** | data (LVM thin) | _e.g. 74%_ | pool size | Thin provisioned | VMs + compaction | **80%** warn, **95%** crit | fstrim CTs, migrate VMs, expand pool |
-| **r630-01** | local-lvm | _%_ | — | — | — | 80 / 95 | Same |
-| **r630-02** | thin1 / data | _%_ | — | — | — | 80 / 95 | Same |
-| **ml110** | thin1 | _%_ | — | — | — | 80 / 95 | Same |
+| **r630-01** | data (LVM thin) | **~72%** (pvesm 2026-03-28) | ~360G pool | Thin provisioned | VMs + compaction | **80%** warn, **95%** crit | fstrim CTs, migrate VMs, expand pool |
+| **r630-01** | thin1 | **~48%** | ~256G pool | CT root disks on thin1 | Same | 80 / 95 | Same; watch overcommit vs `vgs` |
+| **r630-02** | thin1–thin6 (`thin1-r630-02` …) | **~1–27%** per pool (2026-03-28) | ~226G each | Mixed CTs | Same | 80 / 95 | **VG free ~0.12 GiB per thin VG** — expand disk/PV before growing LVs |
+| **ml110** | data / local-lvm | **~15%** | ~1.7T thin | Besu CTs | High | 80 / 95 | Same |
 | **2101** | / (root) | _%_ | 200G | Besu DB + logs | High (RocksDB) | 85 warn, 95 crit | e2fsck, make writable, free /data |
 | **2101** | /data/besu | _du_ | same as / | RocksDB + compaction | ~1–5% block growth | — | Resync or expand disk |
 | **2500–2505** | /, /data/besu | _%_ | — | Besu | Same | 85 / 95 | Same as 2101 |
@@ -72,6 +76,8 @@ Fill and refresh from real data. **Est. monthly growth** and **Growth factor** s
 | **10130, 10150, 10151** | / | _%_ | — | Logs, app data | Low–medium | 85 / 95 | Logrotate, clean caches |
 | **5000** (Blockscout) | /, DB volume | _%_ | — | Postgres + indexer | Medium | 85 / 95 | VACUUM, archive old data |
 | **10233, 10234** (NPMplus) | / | _%_ | — | Logs, certs | Low | 85 / 95 | Logrotate |
+| **7811** (r630-02) | /, `/var/log` | **~6%** after cleanup | 50G | Runaway **syslog** | Low if rotated | 85 / 95 | Truncate/rotate syslog; fix rsyslog/logrotate |
+| **10100** (r630-01) | / | **~57%** after +4G | **12G** | **PostgreSQL** under `/var/lib` | DB growth | 85 / 95 | VACUUM/archive; resize cautiously (thin overcommit) |
 
 **Growth factor** short reference:
 
diff --git a/scripts/storage-monitor.sh b/scripts/storage-monitor.sh
index 3368761..14d3677 100755
--- a/scripts/storage-monitor.sh
+++ b/scripts/storage-monitor.sh
@@ -48,8 +48,8 @@
 NODES[r630-02]="${PROXMOX_HOST_R630_02:-192.168.11.12}:password"
 NODES[r630-03]="${IP_SERVICE_13:-${IP_SERVICE_13:-${IP_SERVICE_13:-${IP_SERVICE_13:-${IP_SERVICE_13:-${IP_SERVICE_13:-192.168.11.13}}}}}}:L@kers2010"
 NODES[r630-04]="${IP_DEVICE_14:-${IP_DEVICE_14:-${IP_DEVICE_14:-${IP_DEVICE_14:-${IP_DEVICE_14:-${IP_DEVICE_14:-192.168.11.14}}}}}}:L@kers2010"
 
-# Alert tracking
-declare -a ALERTS
+# Alert tracking (must stay in main shell — no pipe-|while subshell)
+ALERTS=()
 
 # SSH helper function
 ssh_node() {
@@ -166,22 +166,22 @@ monitor_node() {
         return 1
     fi
 
-    # Process each storage line (skip header)
-    echo "$storage_status" | tail -n +2 | while IFS= read -r line; do
+    # Process each storage line (skip header) — process substitution keeps ALERTS in this shell
+    while IFS= read -r line; do
         if [ -n "$line" ]; then
             check_storage_usage "$hostname" "$line"
         fi
-    done
+    done < <(echo "$storage_status" | tail -n +2)
 
     # Check volume groups
     local vgs_info=$(ssh_node "$hostname" 'vgs --units g --noheadings -o vg_name,vg_size,vg_free 2>/dev/null' || echo "")
     if [ -n "$vgs_info" ]; then
-        echo "$vgs_info" | while IFS= read -r line; do
+        while IFS= read -r line; do
             if [ -n "$line" ]; then
                 check_vg_free_space "$hostname" "$line"
             fi
-        done
+        done < <(echo "$vgs_info")
     fi
 
     # Log storage status
@@ -199,7 +199,7 @@ monitor_node() {
 
 # Send alerts (can be extended to email, Slack, etc.)
 send_alerts() {
-    if [ ${#ALERTS[@]} -eq 0 ]; then
+    if [[ ${#ALERTS[@]} -eq 0 ]]; then
         log_success "No storage alerts"
         return 0
     fi
@@ -244,7 +244,8 @@ generate_summary() {
     echo "=== Proxmox Storage Summary $(date) ==="
     echo ""
     echo "Nodes Monitored:"
-    for hostname in "${!NODES[@]}"; do
+    for hostname in ml110 r630-01 r630-02 r630-03 r630-04; do
+        [[ -n "${NODES[$hostname]:-}" ]] || continue
         if check_node "$hostname"; then
             echo " ✅ $hostname"
         else
@@ -280,9 +281,10 @@ main() {
     echo "Date: $(date)"
    echo ""
 
-    # Monitor all nodes
-    for hostname in "${!NODES[@]}"; do
-        monitor_node "$hostname"
+    # Monitor all nodes (fixed order for readable logs; optional nodes may be unreachable)
+    for hostname in ml110 r630-01 r630-02 r630-03 r630-04; do
+        [[ -n "${NODES[$hostname]:-}" ]] || continue
+        monitor_node "$hostname" || true
     done
 
     # Send alerts
@@ -297,7 +299,8 @@ main() {
         status)
             # Show current status
             echo "=== Current Storage Status ==="
-            for hostname in "${!NODES[@]}"; do
+            for hostname in ml110 r630-01 r630-02 r630-03 r630-04; do
+                [[ -n "${NODES[$hostname]:-}" ]] || continue
                 if check_node "$hostname"; then
                     echo ""
                     echo "--- $hostname ---"
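
For the CT 7811 follow-up ("ensure logrotate for syslog is effective inside the guest"), one possible shape of a size-capped drop-in, assuming a Debian-style guest where rsyslog writes `/var/log/syslog`; the file name, rotation counts, and `maxsize` value are illustrative, not measured requirements:

```conf
# /etc/logrotate.d/syslog-size-cap (illustrative values, adjust per guest)
/var/log/syslog
{
        rotate 4
        daily
        maxsize 500M      # rotate early if syslog grows fast between daily runs
        missingok
        notifempty
        compress
        delaycompress
        postrotate
                /usr/lib/rsyslog/rsyslog-rotate
        endscript
}
```

`maxsize` makes the daily schedule a ceiling rather than the only trigger, which is what failed in CT 7811 (~44 GiB of syslog between rotations); pairing it with lower rsyslog verbosity, as the maintenance note suggests, attacks the growth from both ends.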
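
The `ALERTS+=` fix in this patch addresses a classic bash pitfall: every stage of a pipeline runs in a subshell, so array appends inside `cmd | while …` are discarded when the pipeline ends, while `done < <(cmd)` keeps the loop in the current shell. A minimal standalone sketch of both forms (the `SUB`/`MAIN` array names are illustrative, not from the script):

```shell
#!/usr/bin/env bash
# Appends inside a pipeline happen in a subshell and vanish afterwards.
SUB=()
printf '%s\n' a b c | while IFS= read -r line; do
    SUB+=("$line")            # runs in the pipeline's subshell
done
echo "after pipe: ${#SUB[@]} elements"      # prints 0: the appends were lost

# Process substitution keeps the while loop in the current shell,
# so the appends survive. This is the pattern the patch adopts.
MAIN=()
while IFS= read -r line; do
    MAIN+=("$line")           # runs in the main shell
done < <(printf '%s\n' a b c)
echo "after < <(): ${#MAIN[@]} elements"    # prints 3
```

The same reasoning applies to both loops the patch rewrites (`pvesm`-style storage lines and `vgs` output): only the process-substitution form lets `check_storage_usage` / `check_vg_free_space` mutate `ALERTS` visibly.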
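
The `monitor_node "$hostname" || true` change is the standard guard for loops under `set -e`: without it, the first unreachable node's non-zero return aborts the entire run before the remaining nodes are checked. A toy reproduction (the `probe` function and node names are hypothetical):

```shell
#!/usr/bin/env bash
set -e
# Simulate per-node checks where one node is unreachable (non-zero return).
probe() {
    [ "$1" != "down-node" ]   # fails only for the unreachable node
}

VISITED=()
for node in node-a down-node node-b; do
    # Without `|| true`, probe's failure under `set -e` would abort the
    # script here and node-b would never be visited.
    probe "$node" || true
    VISITED+=("$node")
done
echo "checked: ${VISITED[*]}"   # checked: node-a down-node node-b
```

Note that `|| true` deliberately swallows the failure status; the script still reports the unreachable node through its own logging rather than through the exit code.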