# Storage Growth & Health — Automation Tasks, Fixes, and Migrations
**Last updated:** 2026-02-15
**Purpose:** List all tasks to automate proactive storage monitoring, plus required fixes and migrations.
---
## 1. Tasks to automate
### 1.1 Scheduled data collection
| # | Task | Description | How |
|---|------|-------------|-----|
| **A1** | **Storage snapshot + history append** | Run `collect-storage-growth-data.sh --append` on a schedule so `history.csv` grows for trend analysis. | Cron every 6 hours (or daily). Use `scripts/maintenance/schedule-storage-growth-cron.sh --install`. |
| **A2** | **Snapshot retention** | Prune old snapshot files under `logs/storage-growth/` so the directory does not grow unbounded. | **Done.** Script: `scripts/monitoring/prune-storage-snapshots.sh` (default keep 30 days; `--days N`, `--dry-run`). Schedule weekly or run manually. |
| **A3** | **History CSV retention** | Cap `history.csv` size (keep last 10k rows or ~90 days). | **Done.** Script: `scripts/monitoring/prune-storage-history.sh` (default 90 days proxy; `--max-rows N`, `--days N`, `--dry-run`). Run weekly via schedule-storage-growth-cron (prune line). |
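The A1 cron entry can be sketched as below, in the shape `schedule-storage-growth-cron.sh --show` might print. The 6-hour cadence and the `--append` flag come from this doc; the minute field, the `PROJECT_ROOT` default, and the log destination are assumptions.

```shell
#!/usr/bin/env bash
# Sketch of the A1 cron line; timings beyond "every 6 hours" are assumptions.
PROJECT_ROOT="${PROJECT_ROOT:-/opt/proxmox-ops}"   # assumption: adjust to your checkout
cron_line="0 */6 * * * cd $PROJECT_ROOT && scripts/monitoring/collect-storage-growth-data.sh --append >> logs/storage-growth/cron.log 2>&1"
echo "$cron_line"
```

The `cd $PROJECT_ROOT && ...` prefix matches the pattern F4 describes, so relative script and log paths resolve the same way under cron as in an interactive shell.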
### 1.2 Threshold checks and alerting
| # | Task | Description | How |
|---|------|-------------|-----|
| **A4** | **Thin pool / pvesm check (all hosts)** | Fail or warn when any host's thin pool or pvesm storage is ≥ 95% (critical) or ≥ 80% (warn). | **Done.** In `daily-weekly-checks.sh` weekly (F3/M2). |
| **A5** | **In-CT disk check in cron** | Run `check-disk-all-vmids.sh` on a schedule and log or alert on WARN/CRIT. | **Done.** Called from `daily-weekly-checks.sh` daily (cron 08:00). |
| **A6** | **Integrate with existing storage-monitor.sh** | `storage-monitor.sh` already has WARN 80%, CRIT 90% and optional ALERT_EMAIL / ALERT_WEBHOOK. | **Done.** `scripts/maintenance/schedule-storage-monitor-cron.sh --install` (daily 07:00). |
| **A7** | **Metric file for alerting** | Write a metric file (e.g. `logs/storage-growth/last_run.metric`) with max thin pool % and timestamp so an external monitor can alert. | **Done.** Weekly run writes `STORAGE_METRIC_FILE` (storage_max_pct, storage_metric_timestamp). |
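An external monitor can consume the A7 metric file with a few lines of shell. The field names (`storage_max_pct`, `storage_metric_timestamp`) and the 80%/90% thresholds come from this doc; the `key=value` file layout and the sample values are assumptions, stubbed with a temp file here so the logic runs standalone.

```shell
#!/usr/bin/env bash
# Sketch of an external check against logs/storage-growth/last_run.metric.
# key=value layout and sample values are assumptions; thresholds mirror
# storage-monitor.sh (WARN 80%, CRIT 90%).
set -u
metric_file=$(mktemp)
cat > "$metric_file" <<'EOF'
storage_max_pct=87
storage_metric_timestamp=2026-02-15T07:00:00Z
EOF
max_pct=$(grep '^storage_max_pct=' "$metric_file" | cut -d= -f2)
if [ "$max_pct" -ge 90 ]; then
  status="CRIT"
elif [ "$max_pct" -ge 80 ]; then
  status="WARN"
else
  status="OK"
fi
echo "$status: max thin pool usage ${max_pct}%"
rm -f "$metric_file"
```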
### 1.3 Proactive remediation (optional)
| # | Task | Description | How |
|---|------|-------------|-----|
| **A8** | **Weekly fstrim in CTs** | Run `fstrim` inside running CTs on hosts with thin pools to reclaim space. | **Done.** `scripts/maintenance/fstrim-all-running-ct.sh`; run from `daily-weekly-checks.sh` weekly. |
| **A9** | **Logrotate audit** | Ensure high-log VMIDs (10130, 10150, 10151, 5000, 10233, 10234, 2400) have logrotate or equivalent. | **Done.** Runbook: `docs/04-configuration/LOGROTATE_AUDIT_RUNBOOK.md`. |
| **A10** | **Journal vacuum** | Run `journalctl --vacuum-time=7d` in key CTs on a schedule. | **Done.** `scripts/maintenance/journal-vacuum-key-ct.sh`; run from `daily-weekly-checks.sh` weekly. |
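The core of A8 is selecting running CTs and trimming each. A minimal sketch of that selection logic follows; it is hypothetical (the real `fstrim-all-running-ct.sh` may differ), and `pct list` output is hard-coded here so the filtering can be shown without a PVE host.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the A8 loop: pick running CTs from `pct list`
# output, then trim each. Sample output stands in for a live host.
pct_list_output='VMID       Status     Name
10130      running    explorer
10150      stopped    old-indexer
5000       running    rpc'
# Skip the header row, keep VMIDs whose status column is "running".
running=$(printf '%s\n' "$pct_list_output" | awk 'NR>1 && $2=="running" {print $1}')
for vmid in $running; do
  # On a real host this would be: pct exec "$vmid" -- fstrim -v /
  echo "would run: pct exec $vmid -- fstrim -v /"
done
```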
---
## 2. Fixes required
| # | Fix | Location | Detail |
|---|-----|----------|--------|
| **F1** | **Implement or remove --json** | `scripts/monitoring/collect-storage-growth-data.sh` | **Done.** `--json` outputs a JSON object with `timestamp` and `csv_rows` (array of CSV line strings). |
| **F2** | **CSV quoting for detail column** | `scripts/monitoring/collect-storage-growth-data.sh` | **Done.** Detail field is quoted when it contains commas or quotes via `csv_quote()`. |
| **F3** | **Thin pool check on all three hosts** | `scripts/maintenance/daily-weekly-checks.sh` | **Done.** [138a] now runs thin pool/storage check on r630-02, r630-01, and ml110 (WARN ≥85%, FAIL ≥95%/100%). |
| **F4** | **PROJECT_ROOT in cron** | `schedule-daily-weekly-cron.sh` / new storage cron | Cron lines use `$PROJECT_ROOT`; because the crontab is installed by the user who runs the script, the path resolves correctly. `schedule-storage-growth-cron.sh` uses the same pattern (`cd $PROJECT_ROOT && ...`). |
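The `csv_quote()` helper F2 names can be sketched as below. The function name comes from this doc; the body is an assumption following RFC 4180 (quote a field containing commas or double quotes, doubling embedded quotes).

```shell
#!/usr/bin/env bash
# Sketch of csv_quote() per F2; body is an assumption (RFC 4180 style).
csv_quote() {
  case "$1" in
    *[,\"]*)
      # Double embedded quotes, then wrap the whole field in quotes.
      printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
      ;;
    *)
      printf '%s' "$1"
      ;;
  esac
}
csv_quote 'free 12%, used 88%'; echo   # comma -> prints "free 12%, used 88%"
csv_quote 'plain-detail'; echo         # unchanged -> prints plain-detail
```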
---
## 3. Migrations
| # | Migration | Description |
|---|------------|-------------|
| **M1** | **Add schedule-storage-growth-cron.sh** | **Done.** Script: `scripts/maintenance/schedule-storage-growth-cron.sh` (same style as schedule-daily-weekly-cron.sh): `--show`, `--install`, `--remove`. Cron runs `collect-storage-growth-data.sh --append` every 6 hours. |
| **M2** | **Extend weekly checks to all-host thin pool** | **Done.** Implemented with F3 in `daily-weekly-checks.sh`: `check_thin_pool_one_host` for r630-02, r630-01, ml110. |
| **M3** | **Doc and index updates** | **Done.** STORAGE_GROWTH_AND_HEALTH.md references schedule-storage-growth-cron.sh and prune script; MASTER_INDEX and OPERATIONAL_RUNBOOKS list storage growth cron. |
| **M4** | **Optional: CI job** | Add a GitHub Actions (or Gitea) workflow that runs `collect-storage-growth-data.sh --csv` (or a dry run that only checks script syntax / host reachability) so config changes don't break the script. Optional because the script requires LAN/SSH access to the hosts. |
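The "dry run that only checks script syntax" half of M4 can be done with `bash -n`, which parses a script without executing it, so no LAN/SSH access is needed in CI. A temp stub stands in for `collect-storage-growth-data.sh` here so the check runs standalone.

```shell
#!/usr/bin/env bash
# Sketch of the M4 syntax-only CI check: `bash -n` parses without running.
# A temp stub stands in for scripts/monitoring/collect-storage-growth-data.sh.
tmp_script=$(mktemp)
printf 'echo "collect stub"\n' > "$tmp_script"
if bash -n "$tmp_script"; then
  result="syntax OK"
else
  result="syntax error"
fi
echo "$result: $tmp_script"
rm -f "$tmp_script"
```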
---
## 4. Implementation order
1. **F2** (CSV quoting) and **F1** (`--json`) in `collect-storage-growth-data.sh`.
2. **M1** Add `schedule-storage-growth-cron.sh` and **M3** update docs.
3. **F3** and **M2** Extend daily-weekly-checks.sh to check thin pool on all three hosts.
4. **A1** Install storage growth cron (via M1).
5. **A2** Add `prune-storage-snapshots.sh` and schedule weekly (or in same cron wrapper).
6. **A4/A7** Optionally have the weekly check write a metric file; wire A5 (`check-disk-all-vmids.sh`) into the daily run if desired.
7. **A8**–**A10** as needed (fstrim, logrotate audit, journal vacuum).
---
## 5. Quick reference
| Script | Purpose |
|--------|---------|
| `scripts/monitoring/collect-storage-growth-data.sh` | Collect host + VM storage; output snapshot + optional growth table; `--append` for history.csv. |
| `scripts/maintenance/schedule-storage-growth-cron.sh` | Install/show/remove cron for storage collection (every 6h). |
| `scripts/monitoring/prune-storage-snapshots.sh` | Prune snapshot_*.txt older than N days (default 30); `--days N`, `--dry-run`. |
| `scripts/monitoring/prune-storage-history.sh` | Prune history.csv to last N rows (default ~90d); `--days N`, `--max-rows N`, `--dry-run`. |
| `scripts/maintenance/daily-weekly-checks.sh` | Daily: explorer, RPC, indexer lag, in-CT disk (A5). Weekly: config API, thin pool, fstrim (A8), journal vacuum (A10), storage metric (A7). |
| `scripts/maintenance/check-disk-all-vmids.sh` | Runs `df /` inside all running CTs; WARN 85%, CRIT 95%. |
| `scripts/maintenance/schedule-storage-monitor-cron.sh` | Install/show/remove cron for storage-monitor.sh (daily 07:00). |
| `scripts/maintenance/fstrim-all-running-ct.sh` | Runs `fstrim -v /` in all running CTs; `--dry-run`. |
| `scripts/maintenance/journal-vacuum-key-ct.sh` | Runs `journalctl --vacuum-time=7d` in key CTs; `--dry-run`. |
| `scripts/storage-monitor.sh` | Host pvesm + VG; alerts at 80%/90%; optional email/webhook. |
| `docs/04-configuration/STORAGE_GROWTH_AND_HEALTH.md` | Growth table template, factors, thresholds, how to use data. |
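The retention behaviour in the prune rows above can be sketched with `find -mtime`. The `snapshot_*.txt` glob and 30-day default come from this doc; the directory and file names are stand-ins (and `touch -d` is GNU-specific), so the sketch runs without a real `logs/storage-growth/` tree.

```shell
#!/usr/bin/env bash
# Sketch of the prune-storage-snapshots.sh retention logic: delete
# snapshot_*.txt older than N days. Temp dir and file names are stand-ins;
# `touch -d @0` (GNU) fakes an old mtime (epoch 1970).
set -u
days=30
dir=$(mktemp -d)
touch "$dir/snapshot_new.txt"
touch "$dir/snapshot_old.txt"
touch -d @0 "$dir/snapshot_old.txt"          # far older than the window
find "$dir" -name 'snapshot_*.txt' -mtime +"$days" -delete
remaining=$(ls "$dir")
echo "kept: $remaining"
rm -rf "$dir"
```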