# Proxmox load balancing runbook

Purpose: Reduce load on the busiest node (r630-01) by migrating selected LXC containers to r630-02. Migrating containers to another host also frees space on r630-01. Note: ml110 is being repurposed as an OPNsense/pfSense WAN aggregator; migrate workloads off ml110 to r630-01/r630-02 before the repurpose — see ML110_OPNSENSE_PFSENSE_WAN_AGGREGATOR.md.

Before you start: If you are considering adding a third or fourth R630 to the cluster, first see PROXMOX_ADD_THIRD_FOURTH_R630_DECISION.md — including whether you already have r630-03/r630-04 (powered off) to bring online.

Spare nodes (storage ready): r630-03 (192.168.11.13) and r630-04 (192.168.11.14) are in the cluster with data / local-lvm active (the shared /etc/pve/storage.cfg lists ml110, r630-01, r630-03, r630-04). For r630-03, you can also place CT disks on thin1-r630-03 through thin6-r630-03 (~226GiB pools, one per SSD). For r630-04, use data/local-lvm (Ceph OSD disks are separate). Scripts: scripts/proxmox/ensure-r630-spare-node-storage.sh, scripts/proxmox/provision-r630-03-six-ssd-thinpools.sh, and optionally scripts/proxmox/pve-spare-host-optional-tuneup.sh.

Current imbalance (typical):

| Node | IP | LXC count | Load (1/5/15) | Notes |
| --- | --- | --- | --- | --- |
| r630-01 | 192.168.11.11 | 58 | 56 / 81 / 92 | Historical sample only; re-check live load before acting |
| r630-02 | 192.168.11.12 | 23 | ~4 / 4 / 4 | Light |
| ml110 | 192.168.11.10 | 18 | ~7 / 7 / 9 | Repurposing to OPNsense/pfSense — migrate workloads off to r630-01/r630-02 |
| r630-03 | 192.168.11.13 | 0 (spare) | low | Migration target — ~1TiB data/local-lvm + thin1-r630-03 through thin6-r630-03 |
| r630-04 | 192.168.11.14 | 0 (spare) | low | Migration target — ~467GiB thin + Ceph OSDs |

Ways to balance:

  1. Cross-host migration (e.g. r630-01 → r630-02, r630-03, or r630-04) — Moves workload off r630-01. IP stays the same if the container uses a static IP; only the Proxmox host changes. (ml110 is no longer a migration target; migrate containers off ml110 first.)
  2. Same-host storage migration (r630-01 data → thin1) — Frees space on the data pool and can improve I/O; does not reduce CPU/load by much. See MIGRATION_PLAN_R630_01_DATA.md.

## 1. Check cluster (live migrate vs backup/restore)

If all nodes are in the same Proxmox cluster, you can try live migration (faster, less downtime):

```sh
ssh root@192.168.11.11 "pvecm status"
ssh root@192.168.11.12 "pvecm status"
```
  • If both show the same cluster name and list each other: use pct migrate <VMID> <target_node> --restart from any cluster node (run on r630-01 or from a host that SSHs to r630-01).
  • If nodes are not in a cluster (or migrate fails due to storage): use backup → copy → restore with the script below.
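The decision above can also be scripted: compare the cluster name that `pvecm status` reports on both hosts. A minimal sketch, assuming the standard `Name:` line in `pvecm status` output — the `cluster_name` and `same_cluster` helper names are hypothetical, not existing scripts:

```sh
# Extract the value of the "Name:" line from a captured `pvecm status` output.
cluster_name() {
  printf '%s\n' "$1" | awk '/^Name:/ {print $2; exit}'
}

# True only if both captured outputs name the same (non-empty) cluster.
same_cluster() {
  a=$(cluster_name "$1")
  b=$(cluster_name "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

Usage: `if same_cluster "$(ssh root@192.168.11.11 pvecm status)" "$(ssh root@192.168.11.12 pvecm status)"; then` use `pct migrate`; otherwise fall back to backup → copy → restore.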

## 2. Cross-host migration (r630-01 → r630-02)

Script (backup/restore; works without shared storage):

```sh
cd /path/to/proxmox

# One container (replace VMID and target storage)
./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh <VMID> [target_storage] [--destroy-source]

# Examples
./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh 3501 thin1 --dry-run
./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh 3501 thin1 --destroy-source
```

Target storage on r630-02: Check with ssh root@192.168.11.12 "pvesm status". Common: thin1, thin2, thin5, thin6.
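If you want to pick the target storage mechanically, you can parse `pvesm status` (columns: Name, Type, Status, Total, Used, Available) and take the active thin pool with the most free space. A sketch only — `best_thin_pool` is an illustrative name, and the column layout should be verified against your PVE version:

```sh
# Print the name of the active thin* storage with the largest Available
# column, reading `pvesm status` output on stdin.
best_thin_pool() {
  awk '$1 ~ /^thin/ && $3 == "active" && ($6 + 0) > max { max = $6 + 0; best = $1 }
       END { print best }'
}

# Example: ssh root@192.168.11.12 "pvesm status" | best_thin_pool
```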

If cluster works (live migrate):

```sh
ssh root@192.168.11.11 "pct migrate <VMID> r630-02 --storage thin1 --restart"
# pct migrate moves the CT and removes the source copy itself; pct destroy
# (e.g. pct destroy <VMID> --purge 1) is only needed after the backup/restore path.
```

## 3. Good candidates to move (r630-01 → r630-02)

These containers are safe to move (nothing chain/consensus-critical; static IPs are unaffected) and moving them reduces load. Prefer moving several smaller ones rather than one critical RPC.

| VMID | Name / role | Notes |
| --- | --- | --- |
| 3500 | oracle-publisher-1 | Oracle publisher |
| 3501 | ccip-monitor-1 | CCIP monitor |
| 7804 | gov-portals-dev | Gov portals (already migrated in the past; verify current host) |
| 8640 | vault-phoenix-1 | Vault (if not critical path) |
| 8642 | vault-phoenix-3 | Vault |
| 10232 | CT10232 | Small service |
| 10235 | npmplus-alltra-hybx | NPMplus instance (has its own NPM; update the UDM port forward if needed) |
| 10236 | npmplus-fourth | NPMplus instance |
| 10030–10092 | order-* (identity, intake, finance, etc.) | Order stack; move as a group if desired |
| 10200–10210 | order-prometheus, grafana, opensearch, haproxy | Monitoring/HA; move with order-* or after |
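If live migration works, the single-VMID candidates above can be turned into a reviewable dry run before anything executes. The `print_plan` helper below is illustrative (not an existing script); the order-stack ranges are left out so you can append them once you settle their order:

```sh
# Print the migrate commands for the single-VMID candidates without
# running anything; review the output, then execute on r630-01.
print_plan() {
  target_node=r630-02
  target_storage=thin1   # confirm with: pvesm status on the target
  for vmid in 3500 3501 7804 8640 8642 10232 10235 10236; do
    echo "pct migrate $vmid $target_node --storage $target_storage --restart"
  done
}
print_plan
```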

Do not move (keep on r630-01 for now):

  • 10233 — npmplus (main NPMplus; 76.53.10.36 → .167)
  • 2101 — besu-rpc-core-1 (core RPC for deploy/admin)
  • 2420/2430/2440/2460/2470/2480 — edge/private RPC lanes (critical; migrate only deliberately)
  • 1000–1002, 1500–1502 — validators and sentries (consensus)
  • 10130, 10150, 10151 — dbis-frontend, dbis-api (core apps; move only with a plan)
  • 100, 101, 102, 104, 105 — mail, datacenter, cloudflared, gitea (infra); 103 Omada retired 2026-04-04

## 4. Migrating workloads off ml110 (before OPNsense/pfSense repurpose)

ml110 (192.168.11.10) is being repurposed as an OPNsense/pfSense WAN aggregator (sitting between 6–10 cable modems and the UDM Pros). All containers/VMs on ml110 must be migrated to r630-01 or r630-02 before the repurpose.

  • If cluster: ssh root@192.168.11.10 "pct migrate <VMID> r630-01 --storage <storage> --restart" or ... r630-02 ...
  • If no cluster: Use backup on ml110, copy to r630-01 or r630-02, restore there (see MIGRATE_CT_R630_01_TO_R630_02.md and adapt for source=ml110, target=r630-01 or r630-02).
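To track drain progress, you can parse `pct list` output from ml110 and loop until nothing remains. The `remaining_cts` helper is a sketch, not an existing script:

```sh
# Extract VMIDs from `pct list` output (skip the header row); an empty
# result means the node has been fully drained.
remaining_cts() {
  awk 'NR > 1 && $1 ~ /^[0-9]+$/ { print $1 }'
}

# Example: ssh root@192.168.11.10 "pct list" | remaining_cts
```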

After all workloads are off ml110, remove ml110 from the cluster (or reinstall the node with OPNsense/pfSense). See ML110_OPNSENSE_PFSENSE_WAN_AGGREGATOR.md.


## 4b. Prepare r630-03 for migrations from r630-01 (mail, TsunamiSwap, 57xx AI, Studio)

Goal: Move selected LXCs off r630-01 onto r630-03 (192.168.11.13) to reduce load. Use cluster online migration with explicit target storage local-lvm (each node has its own pve VG + data thin pool; disks are copied to r630-03).

Verified batch (source r630-01, static IPs on vmbr0 / gw 192.168.11.1):

| VMID | Hostname | RAM (MiB) | Cores | rootfs (source) | Size (config) | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 100 | proxmox-mail-gateway | 4096 | 2 | thin1 (r630-01 only) | 10G | Must use --storage local-lvm (not thin1 on the target). |
| 5010 | tsunamiswap | 16384 | 8 | local-lvm | 160G | Largest disk; migrate when a window allows. |
| 5702 | ai-inf-1 | 16384 | 4 | local-lvm | 30G | |
| 5705 | ai-inf-2 | 16384 | 4 | local-lvm | 30G | |
| 7805 | sankofa-studio | 8192 | 4 | local-lvm | 60G | studio.sankofa.nexus — NPM unchanged if the IP stays .72. |

Rough total new allocation on r630-03: ~290G thin + ~60G RAM cap (not all resident at once). r630-03 had ~1TiB free on data / local-lvm and ~503GiB host RAM (check live: pvesm status, free -h on .13).

### Preparation checklist

  1. Cluster: pvecm status on r630-01 and r630-03 — same cluster, Quorate: Yes. local-lvm in /etc/pve/storage.cfg must list r630-03 (and source node).
  2. Network: Target node has vmbr0 on the same LAN/VLAN as r630-01 (static CT IPs unchanged).
  3. Backups: Take vzdump (or ZFS snapshot policy) for each VMID before migrating.
  4. Order (suggested): smaller disks first, 5010 last: 100 → 5702 → 5705 → 7805 → 5010. Alternatively, do 5010 first in a maintenance window if you want the biggest copy done when load is lowest.
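The suggested order falls out of sorting the table by rootfs size, smallest first. A sketch with the sizes hard-coded from the table above (`migration_order` is an illustrative name):

```sh
# Emit VMIDs ordered by disk size (GiB) ascending, so 5010 (160G) lands last.
migration_order() {
  printf '%s\n' "100 10" "5702 30" "5705 30" "7805 60" "5010 160" |
    sort -n -k2,2 | awk '{ print $1 }'
}
```

Feed the result into the migrate loop below if you want to drive it from one place.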

Migrate (from any node that can run pct, typically r630-01):

```sh
# Run from any cluster node that can run pct; here we SSH to the source, r630-01.
# Always set the target storage so thin1-only CT 100 lands on r630-03's pool.

ssh root@192.168.11.11 "pct migrate 100 r630-03 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 5702 r630-03 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 5705 r630-03 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 7805 r630-03 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 5010 r630-03 --storage local-lvm --restart"
```

If a migrate fails (lock, storage), stop the CT (pct stop <vmid>), retry with --restart, or use offline backup/restore per MIGRATE_CT_R630_01_TO_R630_02.md adapted for target r630-03 and storage local-lvm.
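The stop-and-retry fallback can be wrapped once instead of typed ad hoc. A sketch under the assumptions stated in its comments — `migrate_with_retry` is hypothetical, and the real fallback remains the offline backup/restore doc:

```sh
# Run a migrate command; on failure, stop the CT and retry once before
# giving up (at which point fall back to offline backup/restore).
migrate_with_retry() {
  vmid=$1; shift
  if "$@"; then return 0; fi
  echo "migrate of CT $vmid failed; stopping it and retrying once" >&2
  # Stop locally if pct is available; for a remote source, adapt to
  # e.g.: ssh root@192.168.11.11 "pct stop $vmid"
  if command -v pct >/dev/null 2>&1; then
    pct stop "$vmid" || true
  fi
  "$@"
}

# Example:
# migrate_with_retry 5010 ssh root@192.168.11.11 \
#   "pct migrate 5010 r630-03 --storage local-lvm --restart"
```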

Afterward: Update docs that still say these VMIDs live on r630-01 (verify with pct list on each node). Optional: bash scripts/proxmox/ensure-r630-spare-node-storage.sh --node r630-03 (dry-run) if you change the storage layout.

Helper (prints the same plan): bash scripts/proxmox/print-migrate-r630-01-to-r630-03-plan.sh


## 4c. First-wave offload from r630-01 to r630-04 (Order / vault / portal support workloads)

Goal: Reduce r630-01 skew and relieve pressure on /var/lib/vz / thin1 by moving a low-risk first wave onto r630-04 (192.168.11.14), a spare node with active data + local-lvm and essentially 0% thin usage.

Validated target readiness (live checks):

  • r630-04 is quorate in the same five-node cluster (pvecm status).
  • vmbr0 is up on 192.168.11.14/24.
  • pvesm status shows data and local-lvm both active.
  • lvs pve/data shows ~466.7G thin capacity with ~0% Data% and ~1% Meta%.
  • bash scripts/proxmox/ensure-r630-spare-node-storage.sh --node r630-04 reports local-lvm active and no corrective action needed.

Recommended first wave (ordered):

| VMID | Hostname | RAM | rootfs | Why this batch |
| --- | --- | --- | --- | --- |
| 10201 | order-grafana | 2G | thin1:20G | Very light support service; good first canary |
| 10210 | order-haproxy | 2G | thin1:20G | Small edge for the Order surface; easy to validate |
| 7804 | gov-portals-dev | 2G | thin1:20G | Small app workload; static IP on vmbr0 |
| 10020 | order-redis | 4G | thin1:50G | Light in the current sample; frees thin1 space |
| 10230 | order-vault | 2G | thin1:50G | Small support workload |
| 10092 | order-mcp-legal | 4G | thin1:50G | Small support workload |
| 8640 | vault-phoenix-1 | 4G | thin1:50G | Light current usage; frees thin1 |
| 8642 | vault-phoenix-3 | 4G | local-lvm:50G | Similar profile; keep after a few easy wins |
| 10091 | order-portal-internal | 4G | thin1:50G | Low CPU and RAM in the live sample |
| 10090 | order-portal-public | 4G | thin1:50G | Low CPU and RAM in the live sample |
| 10070 | order-legal | 4G | thin1:50G | Low current pressure |
| 10200 | order-prometheus | 4G | thin1:100G | Still reasonable, but leave it for last due to the larger disk |

Suggested batching:

  1. Canary batch: 10201, 10210, 7804
  2. Small support batch: 10020, 10230, 10092
  3. Portal / vault batch: 8640, 8642, 10091, 10090, 10070
  4. Last in wave: 10200
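The four batches can be driven from one place as a dry run that only prints the commands. `first_wave_plan` is an illustrative name; swap `echo` for the real ssh invocation to apply:

```sh
# Print the first-wave migrate commands batch by batch; nothing executes.
first_wave_plan() {
  for batch in "10201 10210 7804" \
               "10020 10230 10092" \
               "8640 8642 10091 10090 10070" \
               "10200"; do
    for vmid in $batch; do
      echo "pct migrate $vmid r630-04 --storage local-lvm --restart"
    done
  done
}
first_wave_plan
```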

Migrate (one at a time or by the batches above):

```sh
ssh root@192.168.11.11 "pct migrate 10201 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 10210 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 7804 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 10020 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 10230 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 10092 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 8640 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 8642 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 10091 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 10090 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 10070 r630-04 --storage local-lvm --restart"
ssh root@192.168.11.11 "pct migrate 10200 r630-04 --storage local-lvm --restart"
```

### Preflight before the first command

  1. bash scripts/verify/poll-lxc-cluster-health.sh
  2. bash scripts/proxmox/ensure-r630-spare-node-storage.sh --node r630-04
  3. ssh root@192.168.11.14 "pvecm status; pvesm status | egrep '^(data|local-lvm|local)'"
  4. Take vzdump or snapshot coverage for the chosen batch if you want rollback points

### Post-check after each batch

  1. ssh root@192.168.11.11 "pct list" | egrep '^(VMID|10201|10210|7804|10020|10230|10092|8640|8642|10091|10090|10070|10200)\b'
  2. ssh root@192.168.11.14 "pct list" | egrep '^(VMID|10201|10210|7804|10020|10230|10092|8640|8642|10091|10090|10070|10200)\b'
  3. Re-run bash scripts/verify/poll-lxc-cluster-health.sh and confirm r630-01 skew / vz pressure trend down
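The long egrep alternation in the post-checks can be generated from a VMID list instead of maintained by hand; a sketch (`vmid_pattern` is hypothetical):

```sh
# Build an egrep pattern like ^(VMID|10201|10210)\b from a VMID list.
vmid_pattern() {
  printf '^(VMID|%s)\\b' "$(echo "$*" | tr ' ' '|')"
}

# Example:
# ssh root@192.168.11.14 "pct list" | egrep "$(vmid_pattern 10201 10210 7804)"
```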

Helper (prints the same plan): bash scripts/proxmox/print-migrate-r630-01-to-r630-04-first-wave.sh


## 5. After migration

  • IP: Containers keep the same IP if they use static IP in the CT config; no change needed for NPM/DNS if they point by IP.
  • Docs: Update any runbooks or configs that assume “VMID X is on r630-01” (e.g. config/ip-addresses.conf comments, backup scripts).
  • Verify: Re-run bash scripts/check-all-proxmox-hosts.sh and confirm load and container counts.

## 6. Special-case CTs

### Blockscout 5000

5000 (blockscout-1) cannot use the normal pvesh ... /migrate flow because it has a host-local bind mount:

```
mp1: /var/lib/vz/logs-vmid5000,mp=/var/log-remote
```

Live pvesh create /nodes/<src>/lxc/5000/migrate ... aborts with:

```
cannot migrate local bind mount point 'mp1'
```

Use the dedicated stop-and-restore helper instead:

```sh
bash scripts/proxmox/migrate-blockscout-5000-to-r630-04.sh --dry-run
PROXMOX_OPS_APPLY=1 bash scripts/proxmox/migrate-blockscout-5000-to-r630-04.sh --apply
```

The helper does four things in order:

  1. Seed and final-sync /var/lib/vz/logs-vmid5000 to r630-04
  2. Stop CT 5000
  3. Create and copy a vzdump archive, then pct restore it to local-lvm on r630-04
  4. Re-apply mp1 on the target and start the CT

The bind-mounted log tree is intentionally kept at the same host path on the target: `/var/lib/vz/logs-vmid5000`.

## 7. Quick reference

| Goal | Command / doc |
| --- | --- |
| Check current load | `bash scripts/check-all-proxmox-hosts.sh` |
| Migrate one CT (r630-01 → r630-02) | `./scripts/maintenance/migrate-ct-r630-01-to-r630-02.sh <VMID> thin1 [--destroy-source]` |
| Plan r630-01 → r630-03 (100, 5010, 57xx, 7805) | `bash scripts/proxmox/print-migrate-r630-01-to-r630-03-plan.sh` — see §4b |
| Move Blockscout 5000 (bind mount) | `bash scripts/proxmox/migrate-blockscout-5000-to-r630-04.sh --dry-run` |
| Same-host (data → thin1) | MIGRATION_PLAN_R630_01_DATA.md, migrate-ct-r630-01-data-to-thin1.sh |
| Full migration doc | MIGRATE_CT_R630_01_TO_R630_02.md |