
# Maintenance scripts review
**Date:** 2026-02-15
**Scope:** RPC/502 fix flow, writability step, runner, and related docs.
---
## 1. Flow overview
| Step | Script | Purpose |
|------|--------|---------|
| 0 | `make-rpc-vmids-writable-via-ssh.sh` | Stop 2101, 2500-2505 on r630-01; e2fsck rootfs; start; verify /tmp writable |
| 1 | `resolve-and-fix-all-via-proxmox-ssh.sh` | Dev VM IP .59, start containers, DBIS services (r630-01, ml110) |
| 2 | `fix-rpc-2101-jna-reinstall.sh` | Reinstall Besu in 2101 (JNA fix), use /tmp in CT, set java.io.tmpdir=/data/besu/tmp |
| 3 | `install-besu-permanent-on-missing-nodes.sh` | Install Besu on 1505-1508 (ml110), 2500-2505 (r630-01) where missing |
| 4 | `address-all-remaining-502s.sh` | fix-all-502s-comprehensive + NPM proxy update + RPC diagnostics |
| 5 | `verify-end-to-end-routing.sh` | E2E (optional via `--e2e`) |
**Single entry point:** `./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh [--no-npm] [--e2e] [--dry-run] [--verbose]`
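
The ordering above can be illustrated with a minimal sketch of a step runner. This is not the actual runner: the function names `parse_flags` and `run_step` and the variable names are hypothetical, and only the flags and the "Step had warnings." message come from this review. What `--no-npm` does downstream is not shown because the review does not describe it.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not the code of the real runner.
DRY_RUN=0; RUN_E2E=0; NO_NPM=0; VERBOSE=0

# Parse the flags documented for the runner into global variables.
parse_flags() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      --dry-run) DRY_RUN=1 ;;
      --e2e)     RUN_E2E=1 ;;
      --no-npm)  NO_NPM=1 ;;
      --verbose) VERBOSE=1 ;;
      *) echo "Unknown option: $arg" >&2; return 2 ;;
    esac
  done
}

# Print the step header, then either show the command (dry-run) or run it.
# A failing step is reported but does not abort the remaining steps.
run_step() {
  local num="$1" label="$2"; shift 2
  echo "Step ${num}: ${label}"
  if [ "$DRY_RUN" -eq 1 ]; then
    echo "  [dry-run] $*"
    return 0
  fi
  if [ "$VERBOSE" -eq 1 ]; then
    "$@" || echo "  Step had warnings."
  else
    "$@" 2>/dev/null || echo "  Step had warnings."
  fi
}
```

A caller would invoke `parse_flags "$@"` once and then call `run_step 0 "Make RPC CTs writable" ./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh`, and so on for steps 1 to 4, running step 5 only when `RUN_E2E` is 1.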
---
## 2. What works well
- **Writability first:** Step 0 fixes read-only root (ext4 errors) so steps 2 and 3 can write to CTs. All seven RPC VMIDs (2101, 2500-2505) are handled on r630-01.
- **Clear ordering:** Make writable → resolve/start → fix 2101 → install Besu on missing → address 502s → E2E. Dependencies are respected.
- **Config-driven:** Hosts and IPs come from `config/ip-addresses.conf` (PROXMOX_HOST_R630_01, etc.).
- **Idempotent / skip logic:** resolve-and-fix skips if already correct; install-besu-permanent skips VMIDs that already have `/opt/besu/bin/besu`.
- **Docs linked:** 502_DEEP_DIVE (§ Read-only CT), CHECK_ALL_UPDATES (§9 Remaining fixes), maintenance README all reference the runner and make-writable script.
- **JNA tmpdir:** Standalone installer and 2101 fix set `-Djava.io.tmpdir=/data/besu/tmp` so Besu/JNA work when `/tmp` is restricted.
- **Apt resilience:** Standalone installer allows `apt-get update` to fail (e.g. command-not-found I/O error) and still requires `java` and `wget` before continuing.
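
The skip logic and the writability precondition described above can be sketched as follows. This is an illustration under assumptions, not the code of `install-besu-permanent-on-missing-nodes.sh`: the function names are hypothetical, and the container command prefix is passed in as arguments (on a Proxmox host it would be `pct exec <vmid> --`) so the logic can be exercised without a container.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers. Every function receives the command
# prefix used to run something inside the CT, e.g.: pct exec 2503 --

# Succeeds when the Besu binary already exists in the container.
besu_installed() {
  "$@" test -x /opt/besu/bin/besu
}

# Succeeds when a temporary file can be created and removed under /tmp.
tmp_writable() {
  "$@" sh -c 'f=$(mktemp /tmp/.rw-check.XXXXXX) && rm -f "$f"'
}

# Skip VMIDs that already have Besu; refuse to continue on a read-only CT.
install_if_missing() {
  local vmid="$1"; shift
  if besu_installed "$@"; then
    echo "${vmid}: Besu present, skipping"
    return 0
  fi
  if ! tmp_writable "$@"; then
    echo "${vmid}: /tmp not writable; run make-rpc-vmids-writable-via-ssh.sh first" >&2
    return 1
  fi
  echo "${vmid}: would install Besu"
}
```

For example, `install_if_missing 2503 pct exec 2503 --` would run both checks inside CT 2503 when executed on the Proxmox host. The last branch only prints a message; the real install step is left out of the sketch.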
---
## 3. Gaps and risks
- **Step 2 (2101) can be slow:** Apt install inside the CT can take 5-15+ minutes; the runner has no per-step timeout, so the whole run can appear to hang at “Installing packages…”.
- **Errors hidden:** The runner uses `2>/dev/null` on each step and only prints “Done” or “Step had warnings.” Failures (e.g. a failed install on 2101 or 2505) are not surfaced unless you read the full output.
- **Disk space:** 2502/2504 have historically hit “No space left on device” in `/data/besu` (RocksDB). The scripts do not check or resize CT disk; that remains manual (e.g. `pct resize <vmid> rootfs +50G` or free space inside CT).
- **LV name assumption:** make-rpc-vmids-writable assumes LVs are `/dev/pve/vm-<vmid>-disk-0`. Different storage or naming would need script changes.
- **Single host for RPC:** make-rpc-vmids-writable only targets r630-01. If any RPC VMIDs are moved to ml110/r630-02, the script would need to be extended (or a second call with a different host).
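
The disk-space gap above could be narrowed with a pre-flight check. The sketch below is an assumption, not something the scripts currently do: `warn_if_disk_full` is a hypothetical helper that reads POSIX-format `df -P` output on stdin and warns when a filesystem is above a threshold (90% here, chosen for illustration).

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helper, not part of the existing scripts.
# Reads `df -P` output on stdin. Prints one warning per filesystem whose
# usage is above the threshold and returns 1 if any warning was printed.
warn_if_disk_full() {
  local threshold="${1:-90}"
  awk -v limit="$threshold" '
    NR > 1 {                 # skip the df header line
      use = $5               # Capacity column, e.g. "95%"
      sub(/%/, "", use)
      if (use + 0 > limit) {
        printf "WARN: %s at %s%% (limit %s%%)\n", $6, use, limit
        over = 1
      }
    }
    END { exit over ? 1 : 0 }
  '
}
```

On the Proxmox host this could be fed with `pct exec "$vmid" -- df -P / /data/besu | warn_if_disk_full 90` before the install or fix step. The check only warns; freeing space or resizing remains a manual decision.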
---
## 4. Recommendations and completion
1. **Optional verbose mode:** **Done.** Runner supports `--verbose`; when set, step output is not redirected (no `2>/dev/null`), so failures are visible.
2. **Optional timeout for step 2:** **Done.** `STEP2_TIMEOUT` (default 900) applies to the 2101 fix; exit code 124 is detected and a message tells the user to re-run the fix manually. Use `STEP2_TIMEOUT=0` to disable.
3. **§9 checklist:** ✅ CHECK_ALL_UPDATES §9 includes "RPC CTs read-only → make-rpc-vmids-writable first"; operators have a single place for order of operations.
4. **Disk check (future):** Not implemented. Optionally run `pct exec <vmid> -- df -h / /data/besu` before install/fix and warn if usage > 90%.
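
The timeout behaviour in item 2 can be sketched as below. This is a guess at the shape of the logic, not the runner's actual code: `run_with_timeout` is a hypothetical name, and the sketch assumes GNU coreutils `timeout`, which exits with status 124 when the time limit is reached.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical wrapper around GNU coreutils `timeout`.
# A limit of 0 runs the command with no time limit at all.
run_with_timeout() {
  local limit="$1"; shift
  local rc=0
  if [ "$limit" -gt 0 ]; then
    timeout "$limit" "$@" || rc=$?
  else
    "$@" || rc=$?
  fi
  # 124 is the status `timeout` returns when the limit was hit.
  if [ "$rc" -eq 124 ]; then
    echo "Step timed out after ${limit}s; re-run the fix manually." >&2
  fi
  return "$rc"
}
```

A runner could then call `run_with_timeout "${STEP2_TIMEOUT:-900}" ./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh`, which gives the default of 900 seconds and lets `STEP2_TIMEOUT=0` disable the limit.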
---
## 5. File reference
| File | Role |
|------|------|
| `scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh` | Main runner (steps 0-5) |
| `scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh` | e2fsck 2101, 2500-2505 on r630-01 |
| `scripts/maintenance/address-all-remaining-502s.sh` | Backends + NPM + diagnostics |
| `scripts/maintenance/fix-rpc-2101-jna-reinstall.sh` | 2101 Besu reinstall, /tmp + JNA tmpdir |
| `scripts/install-besu-in-ct-standalone.sh` | In-CT Besu install; apt tolerant; JNA tmpdir |
| `scripts/besu/install-besu-permanent-on-missing-nodes.sh` | Besu on 1505-1508, 2500-2505; writability check |
| `docs/00-meta/502_DEEP_DIVE_ROOT_CAUSES_AND_FIXES.md` | Root causes, Read-only CT, 2101/2500-2505 fixes |
| `docs/05-network/CHECK_ALL_UPDATES_AND_CLOUDFLARE_TUNNELS.md` | Config, tunnels, verification, §9 remaining fixes |
---
## 6. Quick commands
```bash
# Full run (writable → fix → install → 502s → E2E)
./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e
# Show all step output (no 2>/dev/null)
./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e --verbose
# Step 2 (2101 fix) timeout: default 900s; disable with 0
STEP2_TIMEOUT=1200 ./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e
STEP2_TIMEOUT=0 ./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e
# Only make RPC CTs writable
./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh
# Dry-run (print steps only)
./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --dry-run
```
Reports and diagnostics: `docs/04-configuration/verification-evidence/` (RPC diagnostics, E2E reports).