# Maintenance scripts review
**Date:** 2026-02-15

**Scope:** RPC/502 fix flow, writability step, runner, and related docs.

---
## 1. Flow overview
| Step | Script | Purpose |
|------|--------|---------|
| 0 | `make-rpc-vmids-writable-via-ssh.sh` | Stop 2101, 2500-2505 on r630-01; e2fsck rootfs; start; verify /tmp writable |
| 1 | `resolve-and-fix-all-via-proxmox-ssh.sh` | Dev VM IP .59, start containers, DBIS services (r630-01, ml110) |
| 2 | `fix-rpc-2101-jna-reinstall.sh` | Reinstall Besu in 2101 (JNA fix), use /tmp in CT, set java.io.tmpdir=/data/besu/tmp |
| 3 | `install-besu-permanent-on-missing-nodes.sh` | Install Besu on 1505-1508 (ml110), 2500-2505 (r630-01) where missing |
| 4 | `address-all-remaining-502s.sh` | fix-all-502s-comprehensive + NPM proxy update + RPC diagnostics |
| 5 | `verify-end-to-end-routing.sh` | E2E (optional via `--e2e`) |

**Single entry point:** `./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh [--no-npm] [--e2e] [--dry-run]`

---
## 2. What works well
- **Writability first:** Step 0 fixes read-only root (ext4 errors) so steps 2 and 3 can write to CTs. All seven RPC VMIDs (2101, 2500-2505) are handled on r630-01.
- **Clear ordering:** Make writable → resolve/start → fix 2101 → install Besu on missing → address 502s → E2E. Dependencies are respected.
- **Config-driven:** Hosts and IPs come from `config/ip-addresses.conf` (PROXMOX_HOST_R630_01, etc.).
- **Idempotent / skip logic:** resolve-and-fix skips if already correct; install-besu-permanent skips VMIDs that already have `/opt/besu/bin/besu`.
- **Docs linked:** 502_DEEP_DIVE (§ Read-only CT), CHECK_ALL_UPDATES (§9 Remaining fixes), maintenance README all reference the runner and make-writable script.
- **JNA tmpdir:** Standalone installer and 2101 fix set `-Djava.io.tmpdir=/data/besu/tmp` so Besu/JNA work when `/tmp` is restricted.
- **Apt resilience:** Standalone installer allows `apt-get update` to fail (e.g. command-not-found I/O error) and still requires `java` and `wget` before continuing.
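
The skip logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual script: `ct_exec`, `besu_installed`, and `install_besu_if_missing` are hypothetical names, and the sketch assumes it runs directly on a Proxmox host where `pct exec` is available (the real scripts reach the host over SSH first). Only the path `/opt/besu/bin/besu` comes from the review itself.

```bash
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not taken from the real scripts.

# Run a command inside a container. Assumes this runs on the Proxmox host
# itself; the real scripts go through SSH to the host first.
ct_exec() {
  local vmid="$1"; shift
  pct exec "$vmid" -- "$@"
}

# Succeeds if the Besu binary is already present and executable in the CT.
besu_installed() {
  ct_exec "$1" test -x /opt/besu/bin/besu
}

# Install only where Besu is missing; print the decision for each VMID.
install_besu_if_missing() {
  local vmid
  for vmid in "$@"; do
    if besu_installed "$vmid"; then
      echo "skip $vmid"
    else
      echo "install $vmid"
      # The real installer call would go here.
    fi
  done
}

# Example (commented out so the sketch is safe to source):
# install_besu_if_missing 2500 2501 2502 2503 2504 2505
```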
---
## 3. Gaps and risks
- **Step 2 (2101) can be slow:** Apt install inside the CT can take 5-15+ minutes; the runner has no per-step timeout, so the whole run can appear to hang at “Installing packages…”.
- **Errors hidden:** The runner uses `2>/dev/null` on each step and only prints “Done” or “Step had warnings.” Failures (e.g. 2101 install fail, 2505 install fail) are not surfaced unless you read the full output.
- **Disk space:** 2502/2504 have historically hit “No space left on device” in `/data/besu` (RocksDB). The scripts do not check or resize CT disk; that remains manual (e.g. `pct resize <vmid> rootfs +50G` or free space inside CT).
- **LV name assumption:** make-rpc-vmids-writable assumes LVs are `/dev/pve/vm-<vmid>-disk-0`. Different storage or naming would need script changes.
- **Single host for RPC:** make-rpc-vmids-writable only targets r630-01. If any RPC VMIDs are moved to ml110/r630-02, the script would need to be extended (or a second call with a different host).
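
The disk-space gap above could be narrowed with a pre-flight check along these lines. This is a minimal sketch, not something the scripts currently do: `check_df_usage` is a hypothetical helper, and it parses `df -P` (POSIX format) rather than `df -h` so that the usage column is stable. The 90% default is illustrative.

```bash
#!/usr/bin/env bash
# Sketch only: check_df_usage is a hypothetical helper name.

# Read `df -P` output on stdin and print "<mount> <use%>" for every filesystem
# whose usage exceeds the threshold. Returns 1 if any mount is over, else 0.
check_df_usage() {
  local threshold="${1:-90}"
  awk -v t="$threshold" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)            # strip the percent sign from the Capacity column
      if (use + 0 > t) { print $6, use "%"; over = 1 }
    }
    END { exit (over ? 1 : 0) }
  '
}

# Example on a Proxmox host (commented out so the sketch is safe to source):
# pct exec 2502 -- df -P / /data/besu | check_df_usage 90 \
#   || echo "WARN: low disk space in CT 2502" >&2
```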
---
## 4. Recommendations and completion
1. **Optional verbose mode:** **Done.** Runner supports `--verbose`; when set, step output is not redirected (no `2>/dev/null`), so failures are visible.
2. **Optional timeout for step 2:** **Done.** `STEP2_TIMEOUT` (default 900) applies to the 2101 fix; exit code 124 is detected and a message tells the user to re-run the fix manually. Use `STEP2_TIMEOUT=0` to disable.
3. **§9 checklist:** ✅ CHECK_ALL_UPDATES §9 includes "RPC CTs read-only → make-rpc-vmids-writable first"; operators have a single place for order of operations.
4. **Disk check (future):** Not implemented. Optionally run `pct exec <vmid> -- df -h / /data/besu` before install/fix and warn if usage > 90%.
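
The timeout behaviour in item 2 can be pictured with a sketch like the one below. It assumes the runner relies on coreutils `timeout`, which exits with 124 when it kills a command for exceeding the limit; the review does not show the runner's actual code, and `run_step_with_timeout` is a hypothetical name. Note that `timeout` only wraps external commands, not shell functions.

```bash
#!/usr/bin/env bash
# Sketch only: run_step_with_timeout is a hypothetical helper name.

# Run a command with a time limit in seconds. A limit of 0 disables the limit.
# Returns the command's exit code, or 124 if it was killed by `timeout`.
run_step_with_timeout() {
  local limit="$1"; shift
  local rc=0
  if [ "$limit" -eq 0 ]; then
    "$@" || rc=$?
  else
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
      echo "Step timed out after ${limit}s; re-run the fix manually." >&2
    fi
  fi
  return "$rc"
}

# Example (commented out): apply STEP2_TIMEOUT with the documented default of 900.
# run_step_with_timeout "${STEP2_TIMEOUT:-900}" \
#   ./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh
```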
---
## 5. File reference
| File | Role |
|------|------|
| `scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh` | Main runner (steps 0-5) |
| `scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh` | e2fsck 2101, 2500-2505 on r630-01 |
| `scripts/maintenance/address-all-remaining-502s.sh` | Backends + NPM + diagnostics |
| `scripts/maintenance/fix-rpc-2101-jna-reinstall.sh` | 2101 Besu reinstall, /tmp + JNA tmpdir |
| `scripts/install-besu-in-ct-standalone.sh` | In-CT Besu install; apt tolerant; JNA tmpdir |
| `scripts/besu/install-besu-permanent-on-missing-nodes.sh` | Besu on 1505-1508, 2500-2505; writability check |
| `docs/00-meta/502_DEEP_DIVE_ROOT_CAUSES_AND_FIXES.md` | Root causes, Read-only CT, 2101/2500-2505 fixes |
| `docs/05-network/CHECK_ALL_UPDATES_AND_CLOUDFLARE_TUNNELS.md` | Config, tunnels, verification, §9 remaining fixes |
---
## 6. Quick commands
```bash
# Full run (writable → fix → install → 502s → E2E)
./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e
# Show all step output (no 2>/dev/null)
./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e --verbose
# Step 2 (2101 fix) timeout: default 900s; disable with 0
STEP2_TIMEOUT=1200 ./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e
STEP2_TIMEOUT=0 ./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e
# Only make RPC CTs writable
./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh
# Dry-run (print steps only)
./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --dry-run
```
Reports and diagnostics: `docs/04-configuration/verification-evidence/` (RPC diagnostics, E2E reports).