Proxmox load balancing runbook
Purpose: Reduce load on the busiest node (r630-01) by migrating selected LXC containers to r630-02. Also frees space on r630-01 when moving to another host. Note: ml110 is being repurposed to OPNsense/pfSense (WAN aggregator); migrate workloads off ml110 to r630-01/r630-02 before repurpose — see ML110_OPNSENSE_PFSENSE_WAN_AGGREGATOR.md.
Before you start: If you are considering adding a third or fourth R630 to the cluster, first see PROXMOX_ADD_THIRD_FOURTH_R630_DECISION.md — including whether you already have r630-03/r630-04 (powered off) to bring online.
Spare nodes (storage ready): r630-03 (192.168.11.13) and r630-04 (192.168.11.14) are in the cluster with data / local-lvm active (shared /etc/pve/storage.cfg lists ml110,r630-01,r630-03,r630-04). For r630-03, you can also place CT disks on thin1-r630-03 … thin6-r630-03 (~226 GiB pools, one per SSD). For r630-04, use data/local-lvm (Ceph OSD disks are separate). Scripts: scripts/proxmox/ensure-r630-spare-node-storage.sh, scripts/proxmox/provision-r630-03-six-ssd-thinpools.sh, optional scripts/proxmox/pve-spare-host-optional-tuneup.sh.
Current imbalance (typical):
| Node | IP | LXC count | Load (1/5/15) | Notes |
|---|---|---|---|---|
| r630-01 | 192.168.11.11 | 58 | 56 / 81 / 92 | Historical sample only; re-check live load before acting |
| r630-02 | 192.168.11.12 | 23 | ~4 / 4 / 4 | Light |
| ml110 | 192.168.11.10 | 18 | ~7 / 7 / 9 | Repurposing to OPNsense/pfSense — migrate workloads off to r630-01/r630-02 |
| r630-03 | 192.168.11.13 | 0 (spare) | low | Migration target — ~1 TiB data/local-lvm + thin1-r630-03…thin6-r630-03 |
| r630-04 | 192.168.11.14 | 0 (spare) | low | Migration target — ~467 GiB thin + Ceph OSDs |
Ways to balance:
- Cross-host migration (e.g. r630-01 → r630-02, r630-03, or r630-04) — Moves workload off r630-01. IP stays the same if the container uses a static IP; only the Proxmox host changes. (ml110 is no longer a migration target; migrate containers off ml110 first.)
- Same-host storage migration (r630-01 data → thin1) — Frees space on the `data` pool and can improve I/O; does not reduce CPU/load by much. See MIGRATION_PLAN_R630_01_DATA.md.
1. Check cluster (live migrate vs backup/restore)
If all nodes are in the same Proxmox cluster, you can try live migration (faster, less downtime):
```bash
ssh root@192.168.11.11 "pvecm status"
ssh root@192.168.11.12 "pvecm status"
```
- If both show the same cluster name and list each other: use `pct migrate <VMID> <target_node> --restart` from any cluster node (run on r630-01 or from a host that SSHs to r630-01).
- If nodes are not in a cluster (or migrate fails due to storage): use backup → copy → restore with the script below.
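The decision above can be sketched as a small helper that compares the cluster name and quorum flag from two nodes' `pvecm status` output (the `can_live_migrate` function name is ours; the `Name:` and `Quorate:` lines follow the standard `pvecm status` output format):

```bash
#!/bin/sh
# Sketch: decide live migrate vs backup/restore from two `pvecm status` dumps.
# can_live_migrate STATUS_A STATUS_B -> prints "live" or "backup-restore".
can_live_migrate() {
    a_name=$(printf '%s\n' "$1" | awk -F': *' '/^Name:/ {print $2; exit}')
    b_name=$(printf '%s\n' "$2" | awk -F': *' '/^Name:/ {print $2; exit}')
    a_quorate=$(printf '%s\n' "$1" | awk -F': *' '/^Quorate:/ {print $2; exit}')
    b_quorate=$(printf '%s\n' "$2" | awk -F': *' '/^Quorate:/ {print $2; exit}')
    if [ -n "$a_name" ] && [ "$a_name" = "$b_name" ] \
       && [ "$a_quorate" = "Yes" ] && [ "$b_quorate" = "Yes" ]; then
        echo live
    else
        echo backup-restore
    fi
}

# Demo on captured-style output; in practice feed it
# $(ssh root@<node> "pvecm status") from each node.
sample='Cluster information
-------------------
Name:             pve-lab
Quorate:          Yes'
can_live_migrate "$sample" "$sample"
```

If the helper prints `backup-restore`, fall through to the script-based flow in the next section.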
2. Cross-host migration (r630-01 → r630-02)
Script (backup/restore; works without shared storage):
```bash
cd /path/to/proxmox
# One container (replace VMID and target storage)
./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh <VMID> [target_storage] [--destroy-source]
# Examples
./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh 3501 thin1 --dry-run
./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh 3501 thin1 --destroy-source
```
Target storage on r630-02: Check with `ssh root@192.168.11.12 "pvesm status"`. Common: thin1, thin2, thin5, thin6.
If cluster works (live migrate):
```bash
ssh root@192.168.11.11 "pct migrate <VMID> r630-02 --storage thin1 --restart"
# Then remove source CT if desired: pct destroy <VMID> --purge 1
```
3. Good candidates to move (r630-01 → r630-02)
Containers that reduce load and are safe to move (no critical chain/consensus; IP can stay static). Prefer moving several smaller ones rather than one critical RPC.
| VMID | Name / role | Notes |
|---|---|---|
| 3500 | oracle-publisher-1 | Oracle publisher |
| 3501 | ccip-monitor-1 | CCIP monitor |
| 7804 | gov-portals-dev | Gov portals (already migrated in past; verify current host) |
| 8640 | vault-phoenix-1 | Vault (if not critical path) |
| 8642 | vault-phoenix-3 | Vault |
| 10232 | CT10232 | Small service |
| 10235 | npmplus-alltra-hybx | NPMplus instance (has its own NPM; update UDM port forward if needed) |
| 10236 | npmplus-fourth | NPMplus instance |
| 10030–10092 | order-* (identity, intake, finance, etc.) | Order stack; move as a group if desired |
| 10200–10210 | order-prometheus, grafana, opensearch, haproxy | Monitoring/HA; move with order-* or after |
Do not move (keep on r630-01 for now):
- 10233 — npmplus (main NPMplus; 76.53.10.36 → .167)
- 2101 — besu-rpc-core-1 (core RPC for deploy/admin)
- 2420/2430/2440/2460/2470/2480 — edge/private RPC lanes (critical; migrate only deliberately)
- 1000–1002, 1500–1502 — validators and sentries (consensus)
- 10130, 10150, 10151 — dbis-frontend, dbis-api (core apps; move only with a plan)
- 100, 101, 102, 104, 105 — mail, datacenter, cloudflared, gitea (infra); 103 (Omada) retired 2026-04-04
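A small guard can keep the do-not-move list enforceable when scripting the r630-02 wave. This is a sketch: the `guarded_migrate` helper and the inline VMID list (ranges expanded) are ours, copied from the table above; note that §4b later moves CT 100 to r630-03 deliberately, so this guard applies to ad-hoc r630-02 moves only.

```bash
#!/bin/sh
# Do-not-move VMIDs from the runbook table (ranges expanded inline).
PROTECTED="10233 2101 2420 2430 2440 2460 2470 2480 \
1000 1001 1002 1500 1501 1502 10130 10150 10151 100 101 102 104 105"

guarded_migrate() {  # guarded_migrate VMID TARGET_NODE
    for p in $PROTECTED; do
        if [ "$1" = "$p" ]; then
            echo "refusing: CT $1 is on the do-not-move list" >&2
            return 1
        fi
    done
    # Dry-run: print the command instead of running it over SSH.
    echo "pct migrate $1 $2 --restart"
}

guarded_migrate 3501 r630-02          # allowed candidate
guarded_migrate 2101 r630-02 || true  # blocked: core RPC
```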
4. Migrating workloads off ml110 (before OPNsense/pfSense repurpose)
ml110 (192.168.11.10) is being repurposed to OPNsense/pfSense (WAN aggregator between 6–10 cable modems and UDM Pros). All containers/VMs on ml110 must be migrated to r630-01 or r630-02 before the repurpose.
- If cluster: `ssh root@192.168.11.10 "pct migrate <VMID> r630-01 --storage <storage> --restart"` (or `... r630-02 ...`).
- If no cluster: Use backup on ml110, copy to r630-01 or r630-02, restore there (see MIGRATE_CT_R630_01_TO_R630_02.md and adapt for source=ml110, target=r630-01 or r630-02).
After all workloads are off ml110, remove ml110 from the cluster (or reinstall the node with OPNsense/pfSense). See ML110_OPNSENSE_PFSENSE_WAN_AGGREGATOR.md.
4b. Prepare r630-03 for migrations from r630-01 (mail, TsunamiSwap, 57xx AI, Studio)
Goal: Move selected LXCs off r630-01 onto r630-03 (192.168.11.13) to reduce load. Use cluster online migration with explicit target storage local-lvm (each node has its own pve VG + data thin pool; disks are copied to r630-03).
Verified batch (source r630-01, static IPs on vmbr0 / gw 192.168.11.1):
| VMID | Hostname | RAM (MiB) | Cores | rootfs (source) | Size (config) | Notes |
|---|---|---|---|---|---|---|
| 100 | proxmox-mail-gateway | 4096 | 2 | thin1 (r630-01 only) | 10G | Must use --storage local-lvm (not thin1 on target). |
| 5010 | tsunamiswap | 16384 | 8 | local-lvm | 160G | Largest disk; migrate when window allows. |
| 5702 | ai-inf-1 | 16384 | 4 | local-lvm | 30G | |
| 5705 | ai-inf-2 | 16384 | 4 | local-lvm | 30G | |
| 7805 | sankofa-studio | 8192 | 4 | local-lvm | 60G | studio.sankofa.nexus — NPM unchanged if IP stays .72. |
Rough total new allocation on r630-03: ~290G thin + ~60G RAM cap (not all resident at once). r630-03 had ~1 TiB free on data / local-lvm and ~503 GiB host RAM (check live: pvesm status, free -h on .13).
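The "~290G thin + ~60G RAM" totals can be re-derived from the table above (values copied from the table; the awk one-liner is a convenience, not part of the runbook scripts):

```bash
#!/bin/sh
# Columns: VMID, rootfs size (GiB), RAM (MiB), copied from the batch table.
batch='100 10 4096
5010 160 16384
5702 30 16384
5705 30 16384
7805 60 8192'

printf '%s\n' "$batch" | awk '
    { disk += $2; ram += $3 }
    END { printf "disk=%dG ram=%dG\n", disk, ram / 1024 }'
```

Both figures match the stated allocation (290G provisioned disk, 60G RAM cap).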
Preparation checklist
- Cluster: `pvecm status` on r630-01 and r630-03 — same cluster, Quorate: Yes. `local-lvm` in `/etc/pve/storage.cfg` must list r630-03 (and the source node).
- Network: Target node has vmbr0 on the same LAN/VLAN as r630-01 (static CT IPs unchanged).
- Backups: Take vzdump (or ZFS snapshot policy) for each VMID before migrating.
- Order (suggested): smaller disks first, 5010 last: 100 → 5702 → 5705 → 7805 → 5010. Alternatively, do 5010 first in a maintenance window if you want the biggest copy done when load is lowest.
Migrate (from any node that can run pct, typically r630-01):
```bash
# Replace NODE with r630-01 if you SSH there first.
# Always set target storage so thin1-only CT 100 lands on r630-03's pool.
ssh root@192.168.11.11 "pct migrate 100 r630-03 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 5702 r630-03 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 5705 r630-03 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 7805 r630-03 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 5010 r630-03 --storage local-lvm --restart"
```
If a migrate fails (lock, storage), stop the CT (pct stop <vmid>), retry with --restart, or use offline backup/restore per MIGRATE_CT_R630_01_TO_R630_02.md adapted for target r630-03 and storage local-lvm.
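The retry path above can be sketched with an injectable runner, so the flow is testable without a cluster (the `migrate_with_retry` helper and the `RUN` indirection are ours; on a real node you would set `RUN="ssh root@192.168.11.11"`):

```bash
#!/bin/sh
# Attempt a migrate; on failure stop the CT and retry once before
# falling back to offline backup/restore.
RUN="${RUN:-echo}"   # echo = dry-run; set to an ssh prefix for real runs

migrate_with_retry() {  # migrate_with_retry VMID TARGET STORAGE
    if $RUN pct migrate "$1" "$2" --storage "$3" --restart; then
        return 0
    fi
    echo "migrate of CT $1 failed; stopping and retrying once" >&2
    $RUN pct stop "$1"
    if $RUN pct migrate "$1" "$2" --storage "$3" --restart; then
        return 0
    fi
    echo "CT $1 still failing; fall back to offline backup/restore" >&2
    return 1
}

migrate_with_retry 5702 r630-03 local-lvm
```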
Afterward: Update docs that still say these VMIDs live on r630-01 (`pct list -a | grep <vmid>`). Optional: `bash scripts/proxmox/ensure-r630-spare-node-storage.sh --node r630-03` (dry-run) if you change storage layout.
Helper (prints the same plan): bash scripts/proxmox/print-migrate-r630-01-to-r630-03-plan.sh
4c. First-wave offload from r630-01 to r630-04 (Order / vault / portal support workloads)
Goal: Reduce r630-01 skew and free pressure on /var/lib/vz / thin1 by moving a low-risk first wave onto r630-04 (192.168.11.14), which is a spare node with active data + local-lvm and essentially 0% thin usage.
Validated target readiness (live checks):
- `r630-04` is quorate in the same five-node cluster (`pvecm status`).
- `vmbr0` is up on 192.168.11.14/24.
- `pvesm status` shows `data` and `local-lvm` both active.
- `lvs pve/data` shows ~466.7G thin capacity with ~0% Data% and ~1% Meta%.
- `bash scripts/proxmox/ensure-r630-spare-node-storage.sh --node r630-04` reports `local-lvm` active and no corrective action needed.
Recommended first wave (ordered):
| VMID | Hostname | RAM | rootfs | Why this batch |
|---|---|---|---|---|
| 10201 | order-grafana | 2G | thin1:20G | Very light support service; good first canary |
| 10210 | order-haproxy | 2G | thin1:20G | Small edge for Order surface; easy to validate |
| 7804 | gov-portals-dev | 2G | thin1:20G | Small app workload; static IP on vmbr0 |
| 10020 | order-redis | 4G | thin1:50G | Light in current sample; frees thin1 space |
| 10230 | order-vault | 2G | thin1:50G | Small support workload |
| 10092 | order-mcp-legal | 4G | thin1:50G | Small support workload |
| 8640 | vault-phoenix-1 | 4G | thin1:50G | Light current usage; frees thin1 |
| 8642 | vault-phoenix-3 | 4G | local-lvm:50G | Similar profile; keep after a few easy wins |
| 10091 | order-portal-internal | 4G | thin1:50G | Low CPU and RAM in live sample |
| 10090 | order-portal-public | 4G | thin1:50G | Low CPU and RAM in live sample |
| 10070 | order-legal | 4G | thin1:50G | Low current pressure |
| 10200 | order-prometheus | 4G | thin1:100G | Still reasonable, but leave last due to larger disk |
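Summing the provisioned rootfs sizes in the first-wave table is a useful preflight check. Note the paper total exceeds r630-04's ~467G thin capacity; LVM-thin allows overcommit (Data% is ~0 today), so the number to watch is actual `lvs pve/data` Data% as CTs fill. The sizes below are copied from the table; the loop is a convenience sketch.

```bash
#!/bin/sh
# Provisioned rootfs sizes (GiB) for the 12 first-wave CTs, in table order.
sizes='20 20 20 50 50 50 50 50 50 50 50 100'

total=0
for s in $sizes; do total=$((total + s)); done
echo "provisioned total: ${total}G"
```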
Suggested batching:
- Canary batch: 10201, 10210, 7804
- Small support batch: 10020, 10230, 10092
- Portal / vault batch: 8640, 8642, 10091, 10090, 10070
- Last in wave: 10200
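The batches can be driven with a small loop. This is a sketch: the `run_batch` helper and the `DRY_RUN` flag are ours; with `DRY_RUN=1` (the default here) the commands are only printed, and a real run would use the SSH target from the runbook.

```bash
#!/bin/sh
DRY_RUN="${DRY_RUN:-1}"

run_batch() {  # run_batch "vmid vmid ..."
    for vmid in $1; do
        cmd="pct migrate $vmid r630-04 --storage local-lvm --restart"
        if [ "$DRY_RUN" = 1 ]; then
            echo "DRY: $cmd"
        else
            ssh root@192.168.11.11 "$cmd"
        fi
    done
}

run_batch "10201 10210 7804"          # canary batch
# Validate on r630-04 (pct list) before continuing:
# run_batch "10020 10230 10092"       # small support batch
# run_batch "8640 8642 10091 10090 10070"
# run_batch "10200"                   # last in wave
```

Pausing between batches keeps the post-check section below meaningful: each batch is verified before the next starts.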
Migrate (one at a time or by the batches above):
```bash
ssh root@192.168.11.11 "pct migrate 10201 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 10210 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 7804 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 10020 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 10230 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 10092 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 8640 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 8642 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 10091 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 10090 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 10070 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 10200 r630-04 --storage local-lvm --restart"
```
Preflight before the first command
- `bash scripts/verify/poll-lxc-cluster-health.sh`
- `bash scripts/proxmox/ensure-r630-spare-node-storage.sh --node r630-04`
- `ssh root@192.168.11.14 "pvecm status; pvesm status | egrep '^(data|local-lvm|local)'"`
- Take `vzdump` or snapshot coverage for the chosen batch if you want rollback points
Post-check after each batch
- `ssh root@192.168.11.11 "pct list" | egrep '^(VMID|10201|10210|7804|10020|10230|10092|8640|8642|10091|10090|10070|10200)\b'`
- `ssh root@192.168.11.14 "pct list" | egrep '^(VMID|10201|10210|7804|10020|10230|10092|8640|8642|10091|10090|10070|10200)\b'`
- Re-run `bash scripts/verify/poll-lxc-cluster-health.sh` and confirm the r630-01 skew and `/var/lib/vz` pressure trend down
Helper (prints the same plan): bash scripts/proxmox/print-migrate-r630-01-to-r630-04-first-wave.sh
5. After migration
- IP: Containers keep the same IP if they use a static IP in the CT config; no change needed for NPM/DNS entries that point by IP.
- Docs: Update any runbooks or configs that assume “VMID X is on r630-01” (e.g. `config/ip-addresses.conf` comments, backup scripts).
- Verify: Re-run `bash scripts/check-all-proxmox-hosts.sh` and confirm load and container counts.
6. Special-case CTs
Blockscout 5000
5000 (blockscout-1) cannot use the normal `pvesh ... /migrate` flow because it has a host-local bind mount:

```
mp1: /var/lib/vz/logs-vmid5000,mp=/var/log-remote
```

A live `pvesh create /nodes/<src>/lxc/5000/migrate ...` aborts with:

```
cannot migrate local bind mount point 'mp1'
```

Use the dedicated stop-and-restore helper instead:

```bash
bash scripts/proxmox/migrate-blockscout-5000-to-r630-04.sh --dry-run
PROXMOX_OPS_APPLY=1 bash scripts/proxmox/migrate-blockscout-5000-to-r630-04.sh --apply
```
The helper does four things in order:
1. Seed and final-sync `/var/lib/vz/logs-vmid5000` to r630-04
2. Stop CT 5000
3. Create and copy a `vzdump` archive, then `pct restore` it to `local-lvm` on r630-04
4. Re-apply `mp1` on the target and start the CT

The bind-mounted log tree is intentionally kept as a host path on the target: `/var/lib/vz/logs-vmid5000`
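The four steps can be sketched as below. This is a simplified illustration, not the helper itself (the real logic lives in scripts/proxmox/migrate-blockscout-5000-to-r630-04.sh); the `run` wrapper, rsync flags, and vzdump options here are our assumptions, and `DRY_RUN=1` (the default) only prints the commands.

```bash
#!/bin/sh
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = 1 ]; then echo "DRY: $*"; else "$@"; fi; }

SRC=root@192.168.11.11
DST=root@192.168.11.14

# 1. Seed/final-sync the bind-mounted log tree, source host -> target host
run ssh "$SRC" "rsync -a --delete /var/lib/vz/logs-vmid5000/ 192.168.11.14:/var/lib/vz/logs-vmid5000/"
# 2. Stop the CT
run ssh "$SRC" "pct stop 5000"
# 3. Backup on the source; copy the dump and `pct restore 5000 <archive> --storage local-lvm` follow
run ssh "$SRC" "vzdump 5000 --mode stop --compress zstd --dumpdir /var/lib/vz/dump"
# 4. Re-apply mp1 on the target and start the CT
run ssh "$DST" "pct set 5000 -mp1 /var/lib/vz/logs-vmid5000,mp=/var/log-remote"
run ssh "$DST" "pct start 5000"
```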
7. Quick reference
| Goal | Command / doc |
|---|---|
| Check current load | bash scripts/check-all-proxmox-hosts.sh |
| Migrate one CT (r630-01 → r630-02) | ./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh <VMID> thin1 [--destroy-source] |
| Plan r630-01 → r630-03 (100, 5010, 57xx, 7805) | bash scripts/proxmox/print-migrate-r630-01-to-r630-03-plan.sh — see §4b |
| Move Blockscout 5000 (bind mount) | bash scripts/proxmox/migrate-blockscout-5000-to-r630-04.sh --dry-run |
| Same-host (data → thin1) | MIGRATION_PLAN_R630_01_DATA.md, migrate-ct-r630-01-data-to-thin1.sh |
| Full migration doc | MIGRATE_CT_R630_01_TO_R630_02.md |