# Maintenance scripts review

**Date:** 2026-02-15
**Scope:** RPC/502 fix flow, writability step, runner, and related docs.

---

## 1. Flow overview

| Step | Script | Purpose |
|------|--------|---------|
| 0 | `make-rpc-vmids-writable-via-ssh.sh` | Stop 2101, 2500–2505 on r630-01; e2fsck rootfs; start; verify /tmp writable |
| 1 | `resolve-and-fix-all-via-proxmox-ssh.sh` | Dev VM IP .59, start containers, DBIS services (r630-01, ml110) |
| 2 | `fix-rpc-2101-jna-reinstall.sh` | Reinstall Besu in 2101 (JNA fix), use /tmp in CT, set java.io.tmpdir=/data/besu/tmp |
| 3 | `install-besu-permanent-on-missing-nodes.sh` | Install Besu on 1505–1508 (ml110), 2500–2505 (r630-01) where missing |
| 4 | `address-all-remaining-502s.sh` | fix-all-502s-comprehensive + NPM proxy update + RPC diagnostics |
| 5 | `verify-end-to-end-routing.sh` | E2E (optional via `--e2e`) |

**Single entry point:** `./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh [--no-npm] [--e2e] [--dry-run] [--verbose]`

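
The step order in the table can be sketched as a small planning function. This is only an illustration of the ordering and of step 5 being gated on `--e2e`; the function name `plan_steps` is hypothetical and the real runner's internals are not shown in this review.

```bash
#!/usr/bin/env bash
# Illustrative only: print the step scripts in the order the table lists them.
# plan_steps is a made-up name; the actual runner may be structured differently.
plan_steps() {
  local e2e=0 arg script i=0
  for arg in "$@"; do
    if [ "$arg" = "--e2e" ]; then
      e2e=1
    fi
  done

  # Steps 0-4 always run, in this order.
  set -- \
    make-rpc-vmids-writable-via-ssh.sh \
    resolve-and-fix-all-via-proxmox-ssh.sh \
    fix-rpc-2101-jna-reinstall.sh \
    install-besu-permanent-on-missing-nodes.sh \
    address-all-remaining-502s.sh

  # Step 5 is optional and only appended when --e2e was passed.
  if [ "$e2e" -eq 1 ]; then
    set -- "$@" verify-end-to-end-routing.sh
  fi

  for script in "$@"; do
    echo "step $i: $script"
    i=$((i + 1))
  done
}
```

Calling `plan_steps --e2e` prints six lines (steps 0 to 5); without the flag it stops at step 4, which is roughly what a dry run would be expected to show.
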

---

## 2. What works well

- **Writability first:** Step 0 fixes read-only root (ext4 errors) so steps 2 and 3 can write to CTs. All seven RPC VMIDs (2101, 2500–2505) are handled on r630-01.
- **Clear ordering:** Make writable → resolve/start → fix 2101 → install Besu on missing → address 502s → E2E. Dependencies are respected.
- **Config-driven:** Hosts and IPs come from `config/ip-addresses.conf` (PROXMOX_HOST_R630_01, etc.).
- **Idempotent / skip logic:** resolve-and-fix skips if already correct; install-besu-permanent skips VMIDs that already have `/opt/besu/bin/besu`.
- **Docs linked:** 502_DEEP_DIVE (§ Read-only CT), CHECK_ALL_UPDATES (§9 Remaining fixes), maintenance README all reference the runner and make-writable script.
- **JNA tmpdir:** Standalone installer and 2101 fix set `-Djava.io.tmpdir=/data/besu/tmp` so Besu/JNA work when `/tmp` is restricted.
- **Apt resilience:** Standalone installer allows `apt-get update` to fail (e.g. command-not-found I/O error) and still requires `java` and `wget` before continuing.

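
The skip logic described in the list above can be sketched as follows. This is a minimal illustration, not the code in `install-besu-permanent-on-missing-nodes.sh`: the function names `ct_has_besu` and `vmids_missing_besu` are hypothetical, and the sketch assumes it runs on the Proxmox host that owns the containers so that `pct exec` is available.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of "skip VMIDs that already have Besu".
BESU_BIN="/opt/besu/bin/besu"

# Succeeds if the Besu binary is executable inside the given CT.
# Kept as its own function so it can be replaced when testing off-host.
ct_has_besu() {
  pct exec "$1" -- test -x "$BESU_BIN"
}

# Print only the VMIDs that still need an install; log skips to stderr.
vmids_missing_besu() {
  local vmid
  for vmid in "$@"; do
    if ct_has_besu "$vmid"; then
      echo "skip $vmid (Besu already installed)" >&2
    else
      echo "$vmid"
    fi
  done
}
```

One caveat of this shape: a CT that is stopped or unreachable looks the same as a CT without the binary, so a real script would probably want to check the container state first.
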

---

## 3. Gaps and risks

- **Step 2 (2101) can be slow:** Apt install inside the CT can take 5–15+ minutes. Without a per-step timeout, the whole run can appear to hang at “Installing packages…” (addressed by `STEP2_TIMEOUT`, see §4).
- **Errors hidden:** By default the runner uses `2>/dev/null` on each step and only prints “Done” or “Step had warnings.” Failures (e.g. 2101 install fail, 2505 install fail) are not surfaced unless you run with `--verbose` (see §4).
- **Disk space:** 2502/2504 have historically hit “No space left on device” in `/data/besu` (RocksDB). The scripts do not check or resize CT disk; that remains manual (e.g. `pct resize <vmid> rootfs +50G` or free space inside CT).
- **LV name assumption:** make-rpc-vmids-writable assumes LVs are `/dev/pve/vm-<vmid>-disk-0`. Different storage or naming would need script changes.
- **Single host for RPC:** make-rpc-vmids-writable only targets r630-01. If any RPC VMIDs are moved to ml110/r630-02, the script would need to be extended (or a second call with a different host).

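
The first two gaps (no per-step timeout, stderr discarded) are the kind of thing a small wrapper can handle. The sketch below is an assumption about how that could look, not the runner's actual code: `run_step` and the `VERBOSE` variable are hypothetical names, and it relies on GNU coreutils `timeout`, which exits with status 124 when the time limit is hit.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run one step with an optional timeout and decide
# whether stderr is shown. Only STEP2_TIMEOUT is a name taken from the review.
STEP2_TIMEOUT="${STEP2_TIMEOUT:-900}"
VERBOSE="${VERBOSE:-0}"

# run_step LIMIT_SECONDS COMMAND [ARGS...]; LIMIT_SECONDS=0 disables the timeout.
run_step() {
  local limit="$1" rc=0
  shift
  if [ "$limit" -gt 0 ]; then
    set -- timeout "$limit" "$@"
  fi

  if [ "$VERBOSE" = "1" ]; then
    "$@" || rc=$?              # keep stderr visible
  else
    "$@" 2>/dev/null || rc=$?  # quiet mode: stderr is discarded
  fi

  if [ "$rc" -eq 124 ]; then
    echo "Step timed out after ${limit}s; re-run it manually."
  elif [ "$rc" -ne 0 ]; then
    echo "Step had warnings (exit $rc)."
  else
    echo "Done."
  fi
  return "$rc"
}
```

A call such as `run_step "$STEP2_TIMEOUT" ./scripts/maintenance/fix-rpc-2101-jna-reinstall.sh` would then either finish, report a non-zero exit, or stop after the limit with a distinct message.
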

---

## 4. Recommendations and completion

1. **Optional verbose mode:** ✅ **Done.** Runner supports `--verbose`; when set, step output is not redirected (no `2>/dev/null`), so failures are visible.
2. **Optional timeout for step 2:** ✅ **Done.** `STEP2_TIMEOUT` (default 900 seconds) applies to the 2101 fix; exit code 124 is detected and a message tells the user to re-run the fix manually. Use `STEP2_TIMEOUT=0` to disable.
3. **§9 checklist:** ✅ **Done.** CHECK_ALL_UPDATES §9 includes "RPC CTs read-only → make-rpc-vmids-writable first"; operators have a single place for order of operations.
4. **Disk check (future):** Not implemented. Optionally run `pct exec <vmid> -- df -h / /data/besu` before install/fix and warn if usage > 90%.

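
Since item 4 is not implemented, here is a minimal sketch of what such a check could look like. The function name `warn_if_disk_full` and the `THRESHOLD` variable are hypothetical. The sketch uses `df -P` rather than `df -h` because the POSIX output format is easier to parse, and it reads that output on stdin so the parsing can be tried without a Proxmox host.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `df -P` output on stdin and warn for every
# filesystem whose usage is above THRESHOLD percent (default 90).
THRESHOLD="${THRESHOLD:-90}"

warn_if_disk_full() {
  # df -P columns: Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on
  awk -v limit="$THRESHOLD" '
    NR > 1 {
      used = $5
      sub(/%/, "", used)          # "95%" -> "95"
      if (used + 0 > limit) {
        printf "WARN: %s is %s%% full\n", $6, used
        over = 1
      }
    }
    END { exit over ? 1 : 0 }     # non-zero exit if anything was over the limit
  '
}

# Possible wiring on the Proxmox host (assumption, not part of the scripts):
#   pct exec "$vmid" -- df -P / /data/besu | warn_if_disk_full
```

The non-zero exit status lets a caller decide whether to only warn or to skip the install for that VMID.
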

---

## 5. File reference

| File | Role |
|------|------|
| `scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh` | Main runner (steps 0–5) |
| `scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh` | e2fsck 2101, 2500–2505 on r630-01 |
| `scripts/maintenance/address-all-remaining-502s.sh` | Backends + NPM + diagnostics |
| `scripts/maintenance/fix-rpc-2101-jna-reinstall.sh` | 2101 Besu reinstall, /tmp + JNA tmpdir |
| `scripts/install-besu-in-ct-standalone.sh` | In-CT Besu install; apt tolerant; JNA tmpdir |
| `scripts/besu/install-besu-permanent-on-missing-nodes.sh` | Besu on 1505–1508, 2500–2505; writability check |
| `docs/00-meta/502_DEEP_DIVE_ROOT_CAUSES_AND_FIXES.md` | Root causes, Read-only CT, 2101/2500–2505 fixes |
| `docs/05-network/CHECK_ALL_UPDATES_AND_CLOUDFLARE_TUNNELS.md` | Config, tunnels, verification, §9 remaining fixes |


---

## 6. Quick commands

```bash
# Full run (writable → fix → install → 502s → E2E)
./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e

# Show all step output (no 2>/dev/null)
./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e --verbose

# Step 2 (2101 fix) timeout: default 900s; disable with 0
STEP2_TIMEOUT=1200 ./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e
STEP2_TIMEOUT=0 ./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --e2e

# Only make RPC CTs writable
./scripts/maintenance/make-rpc-vmids-writable-via-ssh.sh

# Dry-run (print steps only)
./scripts/maintenance/run-all-maintenance-via-proxmox-ssh.sh --dry-run
```

Reports and diagnostics: `docs/04-configuration/verification-evidence/` (RPC diagnostics, E2E reports).