proxmox/scripts/maintenance/README.md
defiQUG 0d29343941 chore: update .env.master.example with new deployment scripts and treasury manager parameters; enhance AGENTS.md with GRU reference primacy details
- Added new deployment script references for Aave quote-push and treasury manager in .env.master.example.
- Updated AGENTS.md to include information on GRU reference primacy versus public PMM mesh execution model.
- Minor updates to various documentation files to reflect changes in policy and operational guidelines.

Made-with: Cursor
2026-04-12 18:20:41 -07:00

# Maintenance Scripts
**health-check-rpc-2101.sh** — Health check for Besu RPC on VMID 2101: container status, besu-rpc service, port 8545, eth_chainId, eth_blockNumber. Run from project root (LAN). See docs/09-troubleshooting/RPC_NODES_BLOCK_PRODUCTION_FIX.md.
**fix-core-rpc-2101.sh** — One-command fix for Core RPC 2101: start CT if stopped, restart Besu, verify RPC. Options: `--dry-run`, `--apply` (mutations when `PROXMOX_SAFE_DEFAULTS=1`), `--restart-only`. Optional `PROXMOX_OPS_ALLOWED_VMIDS`. If Besu fails with JNA/NoClassDefFoundError, run fix-rpc-2101-jna-reinstall.sh first.
**fix-rpc-2101-jna-reinstall.sh** — Reinstall Besu in CT 2101 to fix JNA/NoClassDefFoundError; then re-run fix-core-rpc-2101.sh. Use `--dry-run` to print steps only.
**check-disk-all-vmids.sh** — Check root disk usage in all running containers on ml110, r630-01, r630-02. Use `--csv` for tab-separated output. For prevention and audits.
**run-all-maintenance-via-proxmox-ssh.sh** — Run all maintenance/fix scripts that use SSH to Proxmox VE (r630-01, ml110, r630-02). **Runs make-rpc-vmids-writable-via-ssh.sh --apply first** (so 2101, 2500-2505 are writable), then resolve-and-fix-all, fix-rpc-2101-jna-reinstall, install-besu-permanent-on-missing-nodes, address-all-remaining-502s; optional E2E with `--e2e`. Use `--no-npm` to skip NPM proxy update, `--dry-run` to print steps only, `--verbose` to show all step output (no stderr hidden). Step 2 (2101 fix) has optional timeout: `STEP2_TIMEOUT=900` (default) or `STEP2_TIMEOUT=0` to disable. Run from project root (LAN).
**make-rpc-vmids-writable-via-ssh.sh** — SSHs to r630-01 and for each VMID (default 2101; override with `BESU_WRITABLE_VMIDS`): stops the CT, runs `e2fsck -f -y` on the rootfs LV, starts the CT. Use before fix-rpc-2101 or install-besu-permanent when CTs are read-only. `--dry-run` / `--apply`; with `PROXMOX_SAFE_DEFAULTS=1`, default is dry-run unless `--apply` or `PROXMOX_OPS_APPLY=1`. Optional `PROXMOX_OPS_ALLOWED_VMIDS`. Run from project root (LAN).
**make-validator-vmids-writable-via-ssh.sh** — SSHs to r630-01 (1000, 1001, 1002) and r630-03 (1003, 1004); stops each validator CT, runs `e2fsck -f -y` on rootfs, starts the CT. Fixes "Read-only file system" / JNA crash loop on validators. Then run `fix-all-validators-and-txpool.sh`. See docs/08-monitoring/RPC_AND_VALIDATOR_TESTING_RUNBOOK.md.
**Sentries 1500-1502 (r630-01)** — If deploy-besu-node-lists or set-all-besu-max-peers-32 reports Skip/fail or "Read-only file system" for 1500-1502, they have the same read-only root issue. On the host: `pct stop 1500; e2fsck -f -y /dev/pve/vm-1500-disk-0; pct start 1500` (repeat for 1501, 1502). Then re-run deploy and max-peers/restart.
**address-all-remaining-502s.sh** — One flow to address remaining E2E 502s: runs `fix-all-502s-comprehensive.sh`, then (if `NPM_PASSWORD` set) NPMplus proxy update, then RPC diagnostics (`diagnose-rpc-502s.sh`), optionally `fix-all-besu-nodes.sh` and E2E. Use `--no-npm`, `--run-besu-fix`, `--e2e`, `--dry-run` (print steps only). Run from LAN.
**diagnose-rpc-502s.sh** — Collects for VMIDs 2101 and 2500-2505: `ss -tlnp` and `journalctl -u besu-rpc` / `besu`. Pipe to a file or use from `address-all-remaining-502s.sh`.
**fix-all-502s-comprehensive.sh** — Starts/serves backends for 10130, 10150/10151, 2101, 2500-2505, Cacti (Python stubs if needed). Use `--dry-run` to print actions without SSH. Does not update NPMplus; use `update-npmplus-proxy-hosts-api.sh` from LAN for that.
**daily-weekly-checks.sh** — Daily (explorer, indexer lag, RPC) and weekly (config API, thin pool, log reminder).
**schedule-daily-weekly-cron.sh** — Install cron: daily 08:00, weekly Sun 09:00. Run from a persistent host checkout; set `CRON_PROJECT_ROOT=/srv/proxmox` when installing on a Proxmox node.
**ensure-firefly-primary-via-ssh.sh** — SSHs to r630-02 and normalizes `/opt/firefly/docker-compose.yml` on VMID 6200, installs an idempotent helper-backed `firefly.service`, and verifies `/api/v1/status`. It is safe for the current mixed stack where `firefly-core` already exists outside compose while Postgres and IPFS remain compose-managed. Use `--dry-run` to print actions only.
**ensure-fabric-sample-network-via-ssh.sh** — SSHs to r630-02 and ensures VMID 6000 has nested-LXC features, a boot-time `fabric-sample-network.service`, and a queryable `mychannel`. Use `--dry-run` to print actions only.
**ensure-legacy-monitor-networkd-via-ssh.sh** — SSHs to r630-01 and fixes the legacy `3000`-`3003` monitor/RPC-adjacent LXCs so `systemd-networkd` is enabled host-side and started in-guest. This is the safe path for unprivileged guests where `systemctl enable` fails from inside the CT. `--dry-run` / `--apply`; same `PROXMOX_SAFE_DEFAULTS` behavior as other guarded maintenance scripts.
**check-and-fix-explorer-lag.sh** — Checks both RPC vs Blockscout head lag and recent transaction visibility lag. If the explorer head is behind, or if recent on-chain non-empty blocks are present but the explorer's latest indexed transaction trails them by more than the configured threshold, it runs `fix-explorer-indexer-lag.sh` (restart Blockscout). It does **not** restart for a genuinely quiet chain with empty recent head blocks.
**schedule-explorer-lag-cron.sh** — Install cron for lag check-and-fix: every 6 hours (0, 6, 12, 18). Log: `logs/explorer-lag-fix.log`. Use `--show` to print the line, `--install` to add to crontab, `--remove` to remove. Run from a persistent host checkout; set `CRON_PROJECT_ROOT=/srv/proxmox` when installing on a Proxmox node.
**All schedule-*.sh installers** — Refuse transient roots such as `/tmp/...`. Install from a persistent checkout only.
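
Several scripts above share a guarded dry-run/apply convention. The sketch below shows one way such a guard could look, applied to the manual sentry repair (`pct stop`, `e2fsck -f -y`, `pct start`). It is not the repository's code. The function names are hypothetical, and it assumes that `--dry-run` wins over `PROXMOX_OPS_APPLY=1`, that an unset `PROXMOX_SAFE_DEFAULTS` behaves like `1`, that `PROXMOX_OPS_ALLOWED_VMIDS` is a comma- or space-separated list, and that each rootfs LV follows the `/dev/pve/vm-<vmid>-disk-0` pattern from the 1500 example.

```bash
# Sketch only; function names are hypothetical, not taken from the repository.
set -eu

# Succeeds (returns 0) when mutations are allowed for the given flag.
# Assumptions: --dry-run always wins; unset PROXMOX_SAFE_DEFAULTS behaves like 1.
is_apply_mode() {
  if [ "${1:-}" = "--dry-run" ]; then return 1; fi
  if [ "${1:-}" = "--apply" ] || [ "${PROXMOX_OPS_APPLY:-0}" = "1" ]; then return 0; fi
  [ "${PROXMOX_SAFE_DEFAULTS:-1}" != "1" ]
}

# Succeeds when the VMID passes the optional allow-list.
# Assumption: the list is comma- or space-separated; an empty list allows everything.
vmid_allowed() {
  allowed="${PROXMOX_OPS_ALLOWED_VMIDS:-}"
  if [ -z "$allowed" ]; then return 0; fi
  for item in $(printf '%s' "$allowed" | tr ',' ' '); do
    if [ "$item" = "$1" ]; then return 0; fi
  done
  return 1
}

# Runs the command when the first argument is 1, otherwise only prints it.
run_or_print() {
  apply="$1"; shift
  if [ "$apply" = "1" ]; then "$@"; else echo "[dry-run] $*"; fi
}

# Usage: fix_readonly_ct <--dry-run|--apply|""> <vmid>...
fix_readonly_ct() {
  do_apply=0
  if is_apply_mode "${1:-}"; then do_apply=1; fi
  shift
  for vmid in "$@"; do
    if ! vmid_allowed "$vmid"; then
      echo "skip $vmid: not in PROXMOX_OPS_ALLOWED_VMIDS" >&2
      continue
    fi
    run_or_print "$do_apply" pct stop "$vmid"
    rc=0
    run_or_print "$do_apply" e2fsck -f -y "/dev/pve/vm-${vmid}-disk-0" || rc=$?
    # e2fsck exits 1 or 2 when it corrected errors; 4 and above means problems remain.
    if [ "$rc" -ge 4 ]; then
      echo "e2fsck failed on $vmid (exit $rc); leaving CT stopped" >&2
      return "$rc"
    fi
    run_or_print "$do_apply" pct start "$vmid"
  done
}
```

Calling `fix_readonly_ct --dry-run 1500 1501 1502` on the Proxmox host would print the nine commands without running them; `--apply` would execute them. Tolerating `e2fsck` exit codes below 4 matters under `set -e`, because a successful repair exits non-zero and would otherwise stop the script before `pct start`.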
## Optional: Alerting on failures
`daily-weekly-checks.sh` writes a **metric file** on each run, either to the path in `MAINTENANCE_METRIC_FILE` or, by default, to `logs/maintenance-checks.metric`:
```
maintenance_checks_failed 0
maintenance_checks_timestamp 1739123456
```
- **Use in cron:** After the check runs, send an alert if `maintenance_checks_failed` is greater than 0.
- **Example wrapper (email on failure):**
```bash
# Run the daily checks from the checkout; stop if the checkout is missing.
cd /path/to/proxmox || exit 1
bash scripts/maintenance/daily-weekly-checks.sh daily >> logs/daily-weekly-checks.log 2>&1
# Read the failure count from the metric file; mail only when it is above zero.
FAILED=$(awk '$1 == "maintenance_checks_failed" {print $2}' logs/maintenance-checks.metric 2>/dev/null)
[ -n "$FAILED" ] && [ "$FAILED" -gt 0 ] && echo "Maintenance checks failed: $FAILED" | mail -s "Explorer/maintenance alert" ops@example.com
```
- **Slack:** Use a small script that reads the metric file and posts to a webhook when `maintenance_checks_failed` > 0.
- **Prometheus/Grafana:** Scrape the metric file or run a node_exporter textfile collector on `logs/maintenance-checks.metric`.
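
For the Slack option, a small script along these lines might be enough. It is a sketch under assumptions: `SLACK_WEBHOOK_URL` is a hypothetical environment variable holding a Slack incoming-webhook URL, the function names are illustrative, and `curl` is assumed to be installed. The message contains only a fixed string and digits, so no JSON escaping is done.

```bash
# Sketch only; SLACK_WEBHOOK_URL and the function names are assumptions.
set -eu

# An explicitly empty MAINTENANCE_METRIC_FILE stays empty (metric file disabled).
METRIC_FILE="${MAINTENANCE_METRIC_FILE-logs/maintenance-checks.metric}"

# Print the value of one metric from the file; print nothing if the file is unreadable.
read_metric() {
  if [ ! -r "$2" ]; then return 0; fi
  awk -v name="$1" '$1 == name { print $2; exit }' "$2"
}

# Print an alert message when the failure count is a number above zero.
alert_text() {
  failed="$(read_metric maintenance_checks_failed "$1")"
  case "$failed" in
    ''|*[!0-9]*) return 0 ;;   # missing or non-numeric: nothing to report
  esac
  if [ "$failed" -gt 0 ]; then
    echo "Maintenance checks failed: $failed"
  fi
}

# POST the message as a JSON payload with a "text" field.
post_to_slack() {
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "{\"text\": \"$1\"}" "$SLACK_WEBHOOK_URL"
}

main() {
  text="$(alert_text "$METRIC_FILE")"
  if [ -n "$text" ] && [ -n "${SLACK_WEBHOOK_URL:-}" ]; then
    post_to_slack "$text"
  fi
}

main
```

Run it from the project root after `daily-weekly-checks.sh`, for example as the next line of the same cron wrapper. It stays silent when the metric file is missing, disabled, or reports zero failures.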
To disable the metric file, set `MAINTENANCE_METRIC_FILE=` (empty) before running the script.
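
For the Prometheus option, the metric file already uses the `name value` line format that the node_exporter textfile collector reads from `*.prom` files. A possible way to publish it is sketched below. The directory path and the function name are assumptions: node_exporter must be started with `--collector.textfile.directory` pointing at the same directory, and the path shown is only a common choice.

```bash
# Sketch only; TEXTFILE_DIR and publish_metric_file are assumptions.
set -eu

# Assumed to match node_exporter's --collector.textfile.directory setting.
TEXTFILE_DIR="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}"
# An explicitly empty MAINTENANCE_METRIC_FILE stays empty (metric file disabled).
METRIC_FILE="${MAINTENANCE_METRIC_FILE-logs/maintenance-checks.metric}"

# Copy well-formed "name value" lines into <dest_dir>/maintenance-checks.prom.
# Usage: publish_metric_file <metric-file> <dest-dir>
publish_metric_file() {
  src="$1"
  dest_dir="$2"
  if [ ! -r "$src" ]; then
    echo "metric file not readable: $src" >&2
    return 1
  fi
  # Write to a temp file in the same directory, then rename, so the
  # collector never scrapes a half-written file.
  tmp_out="$dest_dir/.maintenance-checks.prom.$$"
  awk 'NF == 2 && $1 ~ /^[a-zA-Z_:][a-zA-Z0-9_:]*$/' "$src" > "$tmp_out"
  mv "$tmp_out" "$dest_dir/maintenance-checks.prom"
}
```

A cron wrapper could call `publish_metric_file "$METRIC_FILE" "$TEXTFILE_DIR"` right after the checks. The temp file name does not end in `.prom`, so the collector ignores it until the rename completes.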