Add Sankofa consolidated hub operator tooling
@@ -0,0 +1,92 @@
# Sankofa Phoenix API hub — NPM cutover and post-cutover

**Purpose:** Ordered steps for moving public `phoenix.sankofa.nexus` traffic from direct Apollo (`:4000`) to Tier-1 nginx on the Phoenix stack (`:8080` by default). Complements [SANKOFA_PHOENIX_CONSOLIDATED_FRONTEND_AND_API.md](../02-architecture/SANKOFA_PHOENIX_CONSOLIDATED_FRONTEND_AND_API.md) and [SANKOFA_R630_01_CONSOLIDATION_AND_HUB_PLACEMENT_GOAL.md](./SANKOFA_R630_01_CONSOLIDATION_AND_HUB_PLACEMENT_GOAL.md).

**Not covered here:** corporate apex, portal SSO, or Keycloak realm edits (see the portal / Keycloak runbooks).

---

## 0. Preconditions

- API hub installed and healthy on the LAN: `curl -sS "http://${IP_SANKOFA_PHOENIX_API:-192.168.11.50}:8080/health"` succeeds, and a GraphQL POST to `/graphql` succeeds.
- Backup: NPM export or UI backup, plus an application-level backup if you change Phoenix/dbis systemd units or env on the CT.
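
The two LAN probes can be scripted together. A minimal sketch, assuming the documented `:8080` port and `192.168.11.50` fallback; the `api_base` helper is illustrative, not repo tooling:

```shell
#!/usr/bin/env bash
# Sketch of the precondition probes. api_base is a hypothetical helper; the
# :8080 port and the 192.168.11.50 fallback mirror the curl example above.
set -u

api_base() {
  printf 'http://%s:8080' "${IP_SANKOFA_PHOENIX_API:-192.168.11.50}"
}

# Health endpoint (network-dependent; tolerate failure when off-LAN).
curl -sS --max-time 5 "$(api_base)/health" \
  || echo "health probe failed (expected when off-LAN)"

# Trivial GraphQL POST.
curl -sS --max-time 5 -H 'Content-Type: application/json' \
  -d '{"query":"{ __typename }"}' "$(api_base)/graphql" \
  || echo "graphql probe failed (expected when off-LAN)"
```

Both probes must succeed before touching NPM; a failing health probe at this stage points at the hub install, not the cutover.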

---

## 1. `dbis_core` (rate limits and `req.ip`)

1. Set `TRUST_PROXY=1` on the `dbis_core` process (see `scripts/deployment/ensure-dbis-api-trust-proxy-on-ct.sh` for VMIDs **10150** / **10151**).
2. **`TRUST_PROXY_HOPS`** (optional; default `1` in code): Express counts **reverse proxies that terminate the TCP connection to Node** — typically **one** (either NPM **or** the API hub), even when browsers traversed Cloudflare → NPM → hub → `dbis_core`. Raise hops only if your stack adds **another** reverse proxy **in series** directly in front of the same listener (uncommon). When unsure, leave it unset and validate `req.ip` / rate-limit keys with two real client IPs.
3. Ensure **`ALLOWED_ORIGINS`** lists every browser origin that calls the API (portal, admin, studio, and marketing SPAs as applicable). Production forbids `*`.
4. Restart `dbis_core` and confirm the logs show no CORS or startup validation errors.
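
A quick pre-restart guard for step 3, catching the forbidden wildcard before `dbis_core` refuses to start. The `check_allowed_origins` function is an illustrative sketch, not part of the repo's tooling:

```shell
#!/usr/bin/env bash
# Illustrative guard: fail when ALLOWED_ORIGINS is empty or contains "*".
check_allowed_origins() {
  local origins="${1:-}"
  [ -n "$origins" ] || { echo "ALLOWED_ORIGINS is empty" >&2; return 1; }
  local -a parts
  IFS=',' read -r -a parts <<< "$origins"
  local o
  for o in "${parts[@]}"; do
    if [ "$o" = "*" ]; then
      echo "ALLOWED_ORIGINS must not contain '*' in production" >&2
      return 1
    fi
  done
}

# The origins below are placeholders for your real portal/admin hosts.
check_allowed_origins "https://portal.sankofa.nexus,https://admin.sankofa.nexus" \
  && echo "ALLOWED_ORIGINS looks sane"
```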

---

## 2. NPM fleet (`update-npmplus-proxy-hosts-api.sh`)

1. In the repo `.env` (operator workstation), set:
   - `SANKOFA_NPM_PHOENIX_PORT=8080`
   - Optionally `IP_SANKOFA_NPM_PHOENIX_API=…` if the hub listens on a different LAN IP than `IP_SANKOFA_PHOENIX_API`.
2. Run the fleet script with valid `NPM_*` credentials (same as for other NPM updates).
3. In the NPM UI, confirm `phoenix.sankofa.nexus` and `www.phoenix.sankofa.nexus` forward WebSockets (subscriptions use `/graphql-ws`).
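
Step 3 can be spot-checked by hand before running the full smoke script. A sketch of the manual upgrade probe, assuming the `graphql-transport-ws` subprotocol used by the `graphql-ws` library; the `ws_key` helper is illustrative (any base64-encoded 16-byte nonce works):

```shell
#!/usr/bin/env bash
# Manual equivalent of the WebSocket upgrade smoke: ask the edge for an upgrade
# on /graphql-ws and read the status line (a healthy path answers HTTP/1.1 101).
ws_key() { head -c 16 /dev/urandom | base64; }

status_line=$(curl -sS --http1.1 --max-time 8 -o /dev/null -D - \
  -H 'Connection: Upgrade' \
  -H 'Upgrade: websocket' \
  -H 'Sec-WebSocket-Version: 13' \
  -H "Sec-WebSocket-Key: $(ws_key)" \
  -H 'Sec-WebSocket-Protocol: graphql-transport-ws' \
  'https://phoenix.sankofa.nexus/graphql-ws' | head -n 1)
[ -n "$status_line" ] || status_line='no response (probe is network-dependent)'
echo "$status_line"
```

`--max-time` matters here: on success curl holds the open WebSocket until the timeout, which is also why the smoke script caps each probe.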

---

## 3. Verification

| Check | Command or action |
|-------|-------------------|
| Public HTTPS health | `curl -fsS "https://phoenix.sankofa.nexus/health"` (or the hub-exposed health path you standardized) |
| GraphQL | POST `https://phoenix.sankofa.nexus/graphql` with a trivial query |
| WebSocket upgrade (TLS + hub) | `bash scripts/verify/smoke-phoenix-graphql-wss-public.sh` (expects **HTTP 101** via `curl --http1.1`; optional `PHOENIX_WSS_INCLUDE_LAN=1` for hub `:8080`; optional `PHOENIX_WSS_CURL_MAXTIME`, default **8** s per probe, because curl waits on the open WS). Full handshake: `pnpm run verify:phoenix-graphql-ws-subscription` (`connection_init` → `connection_ack`). Troubleshooting notes follow the table. |
| IRU / public limits | Hit a rate-limited route from two different public IPs and confirm the keys differ (validates the forwarded client IP) |

WebSocket troubleshooting:

- **RSV1 errors** from Node clients on `/graphql-ws`: CT **7800** should not register `@fastify/websocket` alongside standalone `ws` (apply `scripts/deployment/ensure-sankofa-phoenix-graphql-ws-remove-fastify-websocket-7800.sh`).
- **Process crashes on WS disconnect:** `websocket.ts` must import `logger` — `scripts/deployment/ensure-sankofa-phoenix-websocket-ts-import-logger-7800.sh`.
- **Hub nginx headers:** `scripts/deployment/ensure-sankofa-phoenix-api-hub-graphql-ws-proxy-headers-7800.sh` (`Accept-Encoding ""`, `proxy_buffering off` in `/graphql-ws`).
- **Optional host guard:** `scripts/deployment/ensure-sankofa-phoenix-7800-nft-dport-4000-guard.sh` + `config/nftables/sankofa-phoenix-7800-guard-dport-4000.nft`.
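
The "keys differ" check can be rehearsed from the LAN before you have two real public IPs, by forging `X-Forwarded-For` against a listener that trusts its immediate hop (`TRUST_PROXY=1`). A sketch; the `/health` route, the `dbis_core` port `3000`, and `probe_as` are assumptions for illustration:

```shell
#!/usr/bin/env bash
# LAN-side rehearsal: replay a route with two forged client IPs and compare
# how the API answers. Only meaningful when the listener trusts its immediate
# proxy hop (TRUST_PROXY=1), so the rightmost X-Forwarded-For entry wins.

probe_as() {  # probe_as <fake-client-ip> -> HTTP status code ("000" on failure)
  curl -sS --max-time 5 -o /dev/null -w '%{http_code}' \
    -H "X-Forwarded-For: $1" "$DBIS/health"
}

if [ -n "${IP_DBIS_API:-}" ]; then
  DBIS="http://$IP_DBIS_API:3000"
  echo "probe A: $(probe_as 203.0.113.10), probe B: $(probe_as 198.51.100.20)"
else
  echo "set IP_DBIS_API (repo .env) to run the probes"
fi
```

If both forged IPs share one rate-limit bucket, the forwarded client IP is not reaching `dbis_core` and the cutover should pause at section 1.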

---

## 4. Post-cutover hardening (dual path)

After NPM points at `:8080` and traffic is stable:

- **Bind Apollo to loopback** (recommended when the hub upstream is `127.0.0.1:4000`):
  `PROXMOX_OPS_APPLY=1 PROXMOX_OPS_ALLOWED_VMIDS=7800 bash scripts/deployment/ensure-sankofa-phoenix-apollo-bind-loopback-7800.sh --apply --vmid 7800`
  Confirm the VLAN cannot connect to `:4000`, while hub `:8080` and public `https://phoenix.sankofa.nexus` still work. **Alternative:** host firewall on CT 7800 — see `scripts/deployment/plan-phoenix-apollo-port-4000-restrict-7800.sh --ssh`.
- **Hub `/graphql-ws` proxy headers** (idempotent; safe with existing installs):
  `PROXMOX_OPS_APPLY=1 PROXMOX_OPS_ALLOWED_VMIDS=7800 bash scripts/deployment/ensure-sankofa-phoenix-api-hub-graphql-ws-proxy-headers-7800.sh --apply --vmid 7800`
- **Hub nginx `ExecReload`** (systemd, idempotent):
  `PROXMOX_OPS_APPLY=1 PROXMOX_OPS_ALLOWED_VMIDS=7800 bash scripts/deployment/ensure-sankofa-phoenix-api-hub-systemd-exec-reload-7800.sh --apply --vmid 7800`
- **Phoenix API DB migrations** (after DB auth works):
  `PROXMOX_OPS_APPLY=1 PROXMOX_OPS_ALLOWED_VMIDS=7800 bash scripts/deployment/ensure-sankofa-phoenix-api-db-migrate-up-7800.sh --apply --vmid 7800`
- **Phoenix API `.env` LAN parity** (Keycloak + Sankofa Postgres host, dedupe passwords, `NODE_ENV` policy, `TERMINATE_TLS_AT_EDGE`):
  `source scripts/lib/load-project-env.sh`, then
  `PROXMOX_OPS_APPLY=1 PROXMOX_OPS_ALLOWED_VMIDS=7800 bash scripts/deployment/ensure-sankofa-phoenix-api-env-lan-parity-7800.sh --apply --vmid 7800`
  - The default appends **`NODE_ENV=development`** until `DB_PASSWORD` / `KEYCLOAK_CLIENT_SECRET` meet production length; use **`PHOENIX_API_NODE_ENV=production`** only after secrets and the TLS policy are ready.
  - If Postgres returns **28P01** (auth failed), align **`DB_USER`** (typically **`sankofa`**, not `postgres`) and **`DB_PASSWORD`** with the **`sankofa`** role on VMID **7803** (`ALTER USER … PASSWORD` on the Postgres CT), then run **`ensure-sankofa-phoenix-api-db-migrate-up-7800.sh`** so **`audit_logs`** exists — see [ALL_VMIDS_ENDPOINTS.md](../04-configuration/ALL_VMIDS_ENDPOINTS.md).
  - For **`PHOENIX_API_NODE_ENV=production`** without local certs: run **`ensure-sankofa-phoenix-tls-config-terminate-at-edge-7800.sh`** first and keep **`TERMINATE_TLS_AT_EDGE=1`** in `.env`.
- Inventory: [ALL_VMIDS_ENDPOINTS.md](../04-configuration/ALL_VMIDS_ENDPOINTS.md) (Phoenix row + VMID 7800 table).
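
The `NODE_ENV` gate in the LAN-parity step can be expressed as a small check. This is an illustrative sketch, not the script's actual policy: `phoenix_node_env` is hypothetical and the 16-character minimum is an assumed threshold:

```shell
#!/usr/bin/env bash
# Illustrative version of the env-parity gate: stay on development until both
# secrets look production-grade. MIN_LEN is an assumed threshold.
MIN_LEN=16

phoenix_node_env() {  # phoenix_node_env <db_password> <keycloak_client_secret>
  if [ "${#1}" -ge "$MIN_LEN" ] && [ "${#2}" -ge "$MIN_LEN" ]; then
    echo "production"   # opt in explicitly with PHOENIX_API_NODE_ENV=production
  else
    echo "development"
  fi
}

phoenix_node_env "short" "also-short"
```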

---

## 5. Rollback

1. Unset `SANKOFA_NPM_PHOENIX_PORT` or set it back to `4000` (or your direct Apollo port).
2. Re-run the NPM fleet script.
3. If `dbis_core` had `TRUST_PROXY_HOPS=2` only for the hub path, reduce the hops or disable trust proxy per your direct topology.

---

## 6. References

- Installer: `scripts/deployment/install-sankofa-api-hub-nginx-on-pve.sh`
- Hub graphql-ws headers (live CT): `scripts/deployment/ensure-sankofa-phoenix-api-hub-graphql-ws-proxy-headers-7800.sh`
- Phoenix `websocket.ts` logger import (prevents crash on disconnect): `scripts/deployment/ensure-sankofa-phoenix-websocket-ts-import-logger-7800.sh`
- Phoenix API `.env` LAN parity: `scripts/deployment/ensure-sankofa-phoenix-api-env-lan-parity-7800.sh`
- Phoenix API DB migrate up (CT 7800): `scripts/deployment/ensure-sankofa-phoenix-api-db-migrate-up-7800.sh`
- Phoenix TLS (terminate at edge, production without local certs): `scripts/deployment/ensure-sankofa-phoenix-tls-config-terminate-at-edge-7800.sh`
- Hub unit `ExecReload`: `scripts/deployment/ensure-sankofa-phoenix-api-hub-systemd-exec-reload-7800.sh`
- LAN smoke: `scripts/verify/verify-sankofa-consolidated-hub-lan.sh`
- Hub GraphQL smoke: `scripts/verify/smoke-phoenix-api-hub-lan.sh`
- Public / LAN WebSocket upgrade smoke: `scripts/verify/smoke-phoenix-graphql-wss-public.sh`
- Loopback bind for Apollo: `scripts/deployment/ensure-sankofa-phoenix-apollo-bind-loopback-7800.sh`
- Read-only plan (firewall alternative): `scripts/deployment/plan-phoenix-apollo-port-4000-restrict-7800.sh` (`--ssh` on LAN)
- Example config syntax: `scripts/verify/check-sankofa-consolidated-nginx-examples.sh`
- Gap review: `docs/02-architecture/NON_CHAIN_ECOSYSTEM_PLAN_REVIEW_AND_GAPS.md`
@@ -0,0 +1,96 @@

# Goal: relieve r630-01 via consolidation + hub placement (not nginx alone)

**Status:** Operator goal / runbook
**Last updated:** 2026-04-13

## 1. What you are optimizing for

**Primary goal:** reduce **guest count** and **steady-state CPU / pressure** on **r630-01** (`192.168.11.11`) by:

1. **Retiring CTs** that only existed to serve **small, non-chain web** surfaces (static or low-SSR), after those surfaces are merged into a **single web hub** guest (or a static export + nginx).
2. **Placing new hub LXCs** (nginx-only or low-RAM) on **less busy nodes** (typically **r630-03 / r630-04** per the health reports), instead of stacking more edge services on r630-01.
3. **Optionally migrating** existing Sankofa / Phoenix / DBIS-related CTs **off** r630-01 when they are **not** chain-critical for that node.

**Non-goal:** expecting the **API hub nginx** colocated on VMID **7800** to materially lower r630-01 load. That pattern buys **routing simplicity** and a path to **fewer public upstreams**; load relief comes from **fewer guests** and **better placement**, not from reverse-proxy CPU.

---

## 2. Current anchor facts (from inventory docs)

Treat `pct list` on each node as authoritative when planning; the table below is a **documentation snapshot** of common r630-01-adjacent workloads:

| Area | Typical on r630-01 today | Notes |
|------|--------------------------|-------|
| Sankofa Phoenix stack | **7800** API, **7801** portal, **7802** Keycloak, **7803** Postgres, **7806** public web | Tightly coupled for latency; migrations need cutover windows |
| DBIS API | **10150** (`IP_DBIS_API`) | Often co-dependent with Phoenix / portal flows |
| NPMplus | **10233** / **10234** (see `ALL_VMIDS_ENDPOINTS.md`) | Edge; may stay on r630-01 or follow your NPM HA policy |
| Chain-critical | **2101**, **2103** (Besu core lanes) | **Do not** “consolidate away” without the chain runbooks |

---

## 3. Phased execution (explicit consolidation + placement)

### Phase 0 — Measure (read-only)

1. Pull the latest cluster health JSON: `bash scripts/verify/poll-lxc-cluster-health.sh` (writes `reports/status/lxc_cluster_health_*.json`).
2. Produce a rebalance **plan only**:
   `bash scripts/verify/plan-lxc-rebalance-from-health-report.sh --source r630-01 --target r630-04 --limit 12`
   Adjust `--target` to the node with **headroom** (load, PSI, storage). Review the exclusions (chain-critical / infra patterns) in the script output.
3. Record **which VMIDs must stay** on r630-01 vs the **candidates to move** in your change ticket.

### Phase 1 — Consolidate **non-chain web** (fewer guests)

1. Architecture: [SANKOFA_PHOENIX_CONSOLIDATED_FRONTEND_AND_API.md](../02-architecture/SANKOFA_PHOENIX_CONSOLIDATED_FRONTEND_AND_API.md) (static-first vs one Node process).
2. Build static exports (or one monorepo SSR host) so **multiple FQDNs** can share **one nginx** `server_name` / `map $host` pattern (`config/nginx/sankofa-non-chain-frontends.example.conf`).
3. **Provision the web hub LXC on the target node** (not r630-01 if the goal is offload). Use a **new IP** from your IPAM; update the `.env` overrides `IP_SANKOFA_WEB_HUB` / port when ready.
4. NPM dry-run → apply: point the marketing / microsite hosts at the web hub upstream.

**Outcome:** retire the legacy one-site-one-CT guests **after** the TTL / rollback window.
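
The shared-hub idea in step 2 can be sketched as an nginx fragment. This is a placeholder sketch, not the repo's example: the hostnames, port, and roots are invented, and the real pattern lives in `config/nginx/sankofa-non-chain-frontends.example.conf`:

```nginx
# Placeholder sketch: several static FQDNs served by one nginx guest.
map $host $site_root {
    default                  /srv/www/placeholder;
    marketing.sankofa.nexus  /srv/www/marketing;
    microsite.sankofa.nexus  /srv/www/microsite;
}

server {
    listen 8080;
    server_name marketing.sankofa.nexus microsite.sankofa.nexus;

    root  $site_root;
    index index.html;

    location / {
        try_files $uri $uri/ /index.html;
    }
}
```

One `map $host` block is what lets retired one-site-one-CT guests collapse into a single hub without per-site `server` duplication.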

### Phase 2 — API hub **placement** (avoid piling onto r630-01)

**Today:** the Tier-1 API hub nginx may be colocated on **7800** (the same CT as Apollo) for a fast LAN proof — that does **not** reduce the r630-01 guest count.

**Target pattern for load relief:**

1. Create a **small** Debian LXC on **r630-03 or r630-04** (a dedicated “phoenix-api-hub” VMID) running **only** nginx + `sankofa-phoenix-api-hub.service`.
2. Upstreams in that hub: `proxy_pass` to the **LAN IPs** of **7800:4000** (GraphQL) and **10150:3000** (`dbis_core`) — cross-node proxying is fine on VLAN 11.
3. Run `install-sankofa-api-hub-nginx-on-pve.sh` with `--vmid <new-hub-vmid>` on the **target** node's PVE host (set `PROXMOX_HOST` if it is not r630-01).
4. NPM: point `phoenix.sankofa.nexus` at the **hub IP:8080** (or keep **4000** direct until validated). Before declaring success, run the **WebSocket** smoke (`graphql-ws` through NPM) and confirm that **`dbis_core` `TRUST_PROXY`** and the trusted proxy list include the hub (see [NON_CHAIN_ECOSYSTEM_PLAN_REVIEW_AND_GAPS.md](../02-architecture/NON_CHAIN_ECOSYSTEM_PLAN_REVIEW_AND_GAPS.md) §2.1–2.2).
5. **Disable / remove** the hub nginx from **7800** if you no longer want dual stacks (maintenance window; validate `systemctl stop sankofa-phoenix-api-hub` on 7800 only after NPM uses the new hub).

**Outcome:** the Phoenix CT can stay on r630-01 for DB locality, while the **edge proxy RAM/CPU** sits on a lighter node — or later migrate 7800 itself after Phase 3.
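
The upstream wiring from steps 1–2 can be sketched roughly as below. The upstream IPs and the `/dbis/` path are placeholders (take real addresses from `ALL_VMIDS_ENDPOINTS.md`); the installer and the headers script remain the source of truth for the deployed config:

```nginx
# Placeholder sketch for the dedicated phoenix-api-hub guest (nginx only).
upstream phoenix_graphql { server 192.168.11.50:4000; }   # CT 7800 (placeholder IP)
upstream dbis_core_api   { server 192.168.11.60:3000; }   # CT 10150 (placeholder IP)

server {
    listen 8080;
    server_name phoenix.sankofa.nexus;

    location /graphql {
        proxy_pass http://phoenix_graphql;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location /graphql-ws {
        proxy_pass http://phoenix_graphql;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Accept-Encoding "";
        proxy_buffering off;
    }

    location /dbis/ {                 # hypothetical path prefix for dbis_core
        proxy_pass http://dbis_core_api/;
    }
}
```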

### Phase 3 — Migrate heavy CTs (optional, highest impact)

Use **scoped** `pct migrate` runs from the planner output. Rules from project safety:

- Named VMID list, **dry-run** first, maintenance window, rollback IP/NPM plan.
- After any move: update `get_host_for_vmid` in `scripts/lib/load-project-env.sh` and [ALL_VMIDS_ENDPOINTS.md](../04-configuration/ALL_VMIDS_ENDPOINTS.md).
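
The scoped-migration rule can be rehearsed as an echo-only plan: print the commands, review them against the exclusion list, then apply by hand inside the window. The VMID list and target here are illustrative, not the planner's output:

```shell
#!/usr/bin/env bash
# Echo-only migration plan: never executes pct, only prints scoped commands.
# VMIDs and target are placeholders; take the real list from the planner.
TARGET_NODE="${TARGET_NODE:-r630-04}"
CANDIDATE_VMIDS=(7801 7806)   # example candidates; never chain-critical (2101, 2103)

plan_migrations() {
  local vmid
  for vmid in "${CANDIDATE_VMIDS[@]}"; do
    echo "pct migrate $vmid $TARGET_NODE --restart"
  done
}

plan_migrations
```

Keeping the list named and explicit (rather than looping over `pct list`) is what makes the run scoped: nothing outside `CANDIDATE_VMIDS` can move.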

### Phase 4 — Retire + verify

1. Destroy **only** CTs that are fully replaced (config backups taken; DNS and NPM rows removed).
2. Re-run the health poll + the E2E verifier profile for the public hosts you moved.

---

## 4. Decision record (fill in as you execute)

| Decision | Choice | Date |
|----------|--------|------|
| Web hub target node | r630-0? | |
| API hub target node (nginx-only LXC) | r630-0? | |
| NPM phoenix upstream | `:4000` direct / `:8080` hub | |
| VMIDs retired after consolidation | | |

---

## 5. Related references

- [NON_CHAIN_ECOSYSTEM_HYPERSCALER_STYLE_MODEL.md](../02-architecture/NON_CHAIN_ECOSYSTEM_HYPERSCALER_STYLE_MODEL.md) (cell types, edge plane vs chain plane)
- [SANKOFA_PHOENIX_CONSOLIDATED_FRONTEND_AND_API.md](../02-architecture/SANKOFA_PHOENIX_CONSOLIDATED_FRONTEND_AND_API.md)
- [PROXMOX_LOAD_BALANCING_RUNBOOK.md](../04-configuration/PROXMOX_LOAD_BALANCING_RUNBOOK.md)
- [ALL_VMIDS_ENDPOINTS.md](../04-configuration/ALL_VMIDS_ENDPOINTS.md)
- `scripts/deployment/install-sankofa-api-hub-nginx-on-pve.sh`
- `scripts/verify/verify-sankofa-consolidated-hub-lan.sh`