# Storage Growth & Health — Automation Tasks, Fixes, and Migrations

**Last updated:** 2026-02-15
**Purpose:** List all tasks to automate proactive storage monitoring, plus required fixes and migrations.

---

## 1. Tasks to automate

### 1.1 Scheduled data collection

| # | Task | Description | How |
|---|------|-------------|-----|
| **A1** | **Storage snapshot + history append** | Run `collect-storage-growth-data.sh --append` on a schedule so `history.csv` grows for trend analysis. | Cron every 6 hours (or daily). Use `scripts/maintenance/schedule-storage-growth-cron.sh --install`. |
| **A2** | **Snapshot retention** | Prune old snapshot files under `logs/storage-growth/` so the directory does not grow unbounded. | **Done.** Script: `scripts/monitoring/prune-storage-snapshots.sh` (default: keep 30 days; `--days N`, `--dry-run`). Schedule weekly or run manually. |
| **A3** | **History CSV retention** | Cap `history.csv` size (keep last 10k rows or ~90 days). | **Done.** Script: `scripts/monitoring/prune-storage-history.sh` (default: approximately 90 days; `--max-rows N`, `--days N`, `--dry-run`). Run weekly via the prune line in `schedule-storage-growth-cron.sh`. |

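As a rough sketch of A1 and A3, the helpers below print a six-hourly crontab entry and cap a CSV at a fixed number of data rows. The function names `storage_growth_cron_line` and `cap_history_rows`, the `cron.log` path, and the assumption that `history.csv` starts with a header row are illustrative only and are not taken from the actual scripts.

```shell
# Hypothetical helper: print a crontab line that runs the collector every 6 hours.
# $1 is the repository checkout (PROJECT_ROOT). The cron.log path is an assumption.
storage_growth_cron_line() {
    # Minute 0 of every 6th hour; cd first so relative paths resolve.
    printf '0 */6 * * * cd %s && bash scripts/monitoring/collect-storage-growth-data.sh --append >> logs/storage-growth/cron.log 2>&1\n' "$1"
}

# Hypothetical helper: print the header (assumed to be line 1) followed by
# the last max_rows data rows of a CSV file.
cap_history_rows() {
    file=$1 max_rows=$2
    head -n 1 "$file"
    tail -n +2 "$file" | tail -n "$max_rows"
}
```

For example, `cap_history_rows logs/storage-growth/history.csv 10000 > history.tmp` would write a capped copy that can then be moved over the original once it has been checked.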
### 1.2 Threshold checks and alerting

| # | Task | Description | How |
|---|------|-------------|-----|
| **A4** | **Thin pool / pvesm check (all hosts)** | Fail or warn when any host’s thin pool or pvesm storage is ≥ 95% (critical) or ≥ 80% (warn). | **Done.** Runs in the weekly section of `daily-weekly-checks.sh` (see F3/M2). |
| **A5** | **In-CT disk check in cron** | Run `check-disk-all-vmids.sh` on a schedule and log or alert on WARN/CRIT. | **Done.** Called from the daily section of `daily-weekly-checks.sh` (cron 08:00). |
| **A6** | **Integrate with existing storage-monitor.sh** | `storage-monitor.sh` already has WARN 80%, CRIT 90% and optional `ALERT_EMAIL` / `ALERT_WEBHOOK`. | **Done.** `scripts/maintenance/schedule-storage-monitor-cron.sh --install` (daily 07:00). |
| **A7** | **Metric file for alerting** | Write a metric file (e.g. `logs/storage-growth/last_run.metric`) with max thin pool % and timestamp so an external monitor can alert. | **Done.** Weekly run writes `STORAGE_METRIC_FILE` (`storage_max_pct`, `storage_metric_timestamp`). |

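The threshold logic in A4 and the metric file in A7 can be illustrated with a minimal sketch. The function names, the `WARN_PCT` / `CRIT_PCT` override variables, and the `key=value` file layout are assumptions; only the default thresholds (80 and 95, from the A4 row) and the two key names come from this document.

```shell
# Classify a usage percentage. Defaults follow the A4 row (WARN >= 80, CRIT >= 95);
# WARN_PCT and CRIT_PCT are hypothetical overrides for checks with other limits.
classify_pct() {
    pct=${1%"%"}     # accept "87" or "87%"
    pct=${pct%.*}    # drop any fractional part; the comparison is integer only
    if [ "$pct" -ge "${CRIT_PCT:-95}" ]; then
        echo CRIT
    elif [ "$pct" -ge "${WARN_PCT:-80}" ]; then
        echo WARN
    else
        echo OK
    fi
}

# Write the two keys named in A7. The key=value layout is an assumption.
write_storage_metric() {
    metric_file=$1 max_pct=$2
    {
        printf 'storage_max_pct=%s\n' "$max_pct"
        printf 'storage_metric_timestamp=%s\n' "$(date +%s)"
    } > "$metric_file"
}
```

An external monitor could then read the file and compare `storage_max_pct` against its own limit, or treat an old `storage_metric_timestamp` as a sign that the weekly run has stopped.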
### 1.3 Proactive remediation (optional)

| # | Task | Description | How |
|---|------|-------------|-----|
| **A8** | **Weekly fstrim in CTs** | Run `fstrim` inside running CTs on hosts with thin pools to reclaim space. | **Done.** `scripts/maintenance/fstrim-all-running-ct.sh`; run from `daily-weekly-checks.sh` weekly. |
| **A9** | **Logrotate audit** | Ensure high-log VMIDs (10130, 10150, 10151, 5000, 10233, 10234, 2400) have logrotate or equivalent. | **Done.** Runbook: `docs/04-configuration/LOGROTATE_AUDIT_RUNBOOK.md`. |
| **A10** | **Journal vacuum** | Run `journalctl --vacuum-time=7d` in key CTs on a schedule. | **Done.** `scripts/maintenance/journal-vacuum-key-ct.sh`; run from `daily-weekly-checks.sh` weekly. |

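A minimal sketch of the A8 loop, assuming it runs on a Proxmox host where `pct list` prints a header line followed by one row per container with the status in the second column. The function names and the `DRY_RUN` variable are hypothetical; the real `fstrim-all-running-ct.sh` exposes `--dry-run` and may be structured differently. On a host it would be used as `pct list | fstrim_running_cts`.

```shell
# Read `pct list` output on stdin and print the VMIDs of running containers.
running_ctids() {
    awk 'NR > 1 && $2 == "running" { print $1 }'
}

# Run fstrim in every running CT listed on stdin.
# With DRY_RUN=1 (hypothetical variable) only print the commands.
fstrim_running_cts() {
    running_ctids | while read -r vmid; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "pct exec $vmid -- fstrim -v /"
        else
            # </dev/null keeps pct exec from consuming the remaining VMIDs on stdin
            pct exec "$vmid" -- fstrim -v / </dev/null \
                || echo "fstrim failed in CT $vmid" >&2
        fi
    done
}
```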

---

## 2. Fixes required

| # | Fix | Location | Detail |
|---|-----|----------|--------|
| **F1** | **Implement or remove `--json`** | `scripts/monitoring/collect-storage-growth-data.sh` | **Done.** `--json` outputs a JSON object with `timestamp` and `csv_rows` (array of CSV line strings). |
| **F2** | **CSV quoting for detail column** | `scripts/monitoring/collect-storage-growth-data.sh` | **Done.** Detail field is quoted when it contains commas or quotes via `csv_quote()`. |
| **F3** | **Thin pool check on all three hosts** | `scripts/maintenance/daily-weekly-checks.sh` | **Done.** [138a] now runs thin pool/storage check on r630-02, r630-01, and ml110 (WARN ≥85%, FAIL ≥95%/100%). |
| **F4** | **PROJECT_ROOT in cron** | `schedule-daily-weekly-cron.sh` / new storage cron | Cron lines use `$PROJECT_ROOT`. The crontab is installed by the user who runs the script, so the path resolves correctly. `schedule-storage-growth-cron.sh` should use the same pattern (`cd $PROJECT_ROOT && ...`). |

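The quoting rule described in F2 can be sketched as follows. The name `csv_quote` comes from the table above, but this body is an assumption about how such a helper might look, not the code in `collect-storage-growth-data.sh`. It follows the usual CSV convention of wrapping the field in double quotes and doubling any embedded quotes, and it does not handle embedded newlines.

```shell
# Quote a CSV field only when it contains a comma or a double quote.
csv_quote() {
    case $1 in
        *,*|*\"*)
            # Double embedded quotes, then wrap the whole field in quotes.
            printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
            ;;
        *)
            printf '%s' "$1"
            ;;
    esac
}
```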

---

## 3. Migrations

| # | Migration | Description |
|---|------------|-------------|
| **M1** | **Add schedule-storage-growth-cron.sh** | **Done.** Script: `scripts/maintenance/schedule-storage-growth-cron.sh` (same style as `schedule-daily-weekly-cron.sh`): `--show`, `--install`, `--remove`. Cron runs `collect-storage-growth-data.sh --append` every 6 hours. |
| **M2** | **Extend weekly checks to all-host thin pool** | **Done.** Implemented with F3 in `daily-weekly-checks.sh`: `check_thin_pool_one_host` for r630-02, r630-01, ml110. |
| **M3** | **Doc and index updates** | **Done.** `STORAGE_GROWTH_AND_HEALTH.md` references `schedule-storage-growth-cron.sh` and prune script; MASTER_INDEX and OPERATIONAL_RUNBOOKS list storage growth cron. |
| **M4** | **Optional: CI job** | Add a GitHub Actions (or Gitea) workflow that runs `collect-storage-growth-data.sh --csv` (or a dry run that only checks script syntax / host reachability) so config changes don’t break the script. Optional because the script requires LAN/SSH to hosts. |

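For the syntax-only variant of M4, a CI step could be as small as the sketch below, which needs no LAN or SSH access. The function name `check_script_syntax` is hypothetical; `bash -n` parses a script without executing it. A workflow step might call it as `check_script_syntax scripts/monitoring/*.sh scripts/maintenance/*.sh`.

```shell
# Parse each given script with `bash -n` (no execution).
# Returns non-zero if any script fails to parse.
check_script_syntax() {
    status=0
    for script in "$@"; do
        if bash -n "$script"; then
            echo "ok: $script"
        else
            echo "syntax error: $script" >&2
            status=1
        fi
    done
    return "$status"
}
```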

---

## 4. Implementation order

1. **F2** (CSV quoting) and **F1** (`--json`) in `collect-storage-growth-data.sh`.
2. **M1** Add `schedule-storage-growth-cron.sh` and **M3** update docs.
3. **F3** and **M2** Extend `daily-weekly-checks.sh` to check thin pool on all three hosts.
4. **A1** Install storage growth cron (via M1).
5. **A2** Add `prune-storage-snapshots.sh` and schedule weekly (or in same cron wrapper).
6. **A4/A7** Optionally have weekly check write a metric file; wire A5 (`check-disk-all-vmids.sh`) into daily if desired.
7. **A8–A10** As needed (fstrim, logrotate audit, journal vacuum).


---

## 5. Quick reference

| Script | Purpose |
|--------|---------|
| `scripts/monitoring/collect-storage-growth-data.sh` | Collect host + VM storage; output snapshot + optional growth table; `--append` for `history.csv`. |
| `scripts/maintenance/schedule-storage-growth-cron.sh` | Install/show/remove cron for storage collection (every 6h). |
| `scripts/monitoring/prune-storage-snapshots.sh` | Prune `snapshot_*.txt` older than N days (default 30); `--days N`, `--dry-run`. |
| `scripts/monitoring/prune-storage-history.sh` | Prune `history.csv` to last N rows (default ~90d); `--days N`, `--max-rows N`, `--dry-run`. |
| `scripts/maintenance/daily-weekly-checks.sh` | Daily: explorer, RPC, indexer lag, in-CT disk (A5). Weekly: config API, thin pool, fstrim (A8), journal vacuum (A10), storage metric (A7). |
| `scripts/maintenance/check-disk-all-vmids.sh` | In-CT `df /` for all running CTs; WARN 85%, CRIT 95%. |
| `scripts/maintenance/schedule-storage-monitor-cron.sh` | Install/show/remove cron for `storage-monitor.sh` (daily 07:00). |
| `scripts/maintenance/fstrim-all-running-ct.sh` | `fstrim -v /` in all running CTs; `--dry-run`. |
| `scripts/maintenance/journal-vacuum-key-ct.sh` | `journalctl --vacuum-time=7d` in key CTs; `--dry-run`. |
| `scripts/storage-monitor.sh` | Host pvesm + VG; alerts at 80%/90%; optional email/webhook. |
| `docs/04-configuration/STORAGE_GROWTH_AND_HEALTH.md` | Growth table template, factors, thresholds, how to use data. |