Add Sankofa consolidated hub operator tooling
@@ -0,0 +1,92 @@
# Sankofa Phoenix API hub — NPM cutover and post-cutover

**Purpose:** Ordered steps for moving public `phoenix.sankofa.nexus` traffic from direct Apollo (`:4000`) to Tier-1 nginx on the Phoenix stack (`:8080` by default). Complements [SANKOFA_PHOENIX_CONSOLIDATED_FRONTEND_AND_API.md](../02-architecture/SANKOFA_PHOENIX_CONSOLIDATED_FRONTEND_AND_API.md) and [SANKOFA_R630_01_CONSOLIDATION_AND_HUB_PLACEMENT_GOAL.md](./SANKOFA_R630_01_CONSOLIDATION_AND_HUB_PLACEMENT_GOAL.md).

**Not covered here:** corporate apex, portal SSO, or Keycloak realm edits (see the portal / Keycloak runbooks).

---

## 0. Preconditions

- API hub installed and healthy on the LAN: `curl -sS "http://${IP_SANKOFA_PHOENIX_API:-192.168.11.50}:8080/health"` succeeds, and a GraphQL POST to `/graphql` succeeds.
- Backup: NPM export or UI backup, plus an application-level backup if you change Phoenix/dbis systemd units or env on the CT.
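
The two LAN probes can be scripted together. A minimal sketch, assuming the documented `:8080` port and `192.168.11.50` fallback; the `api_base` helper is illustrative, not repo tooling:

```shell
#!/usr/bin/env bash
# Sketch of the precondition probes. api_base is a hypothetical helper; the
# :8080 port and the 192.168.11.50 fallback mirror the curl example above.
set -u

api_base() {
  printf 'http://%s:8080' "${IP_SANKOFA_PHOENIX_API:-192.168.11.50}"
}

# Health endpoint (network-dependent; tolerate failure when off-LAN).
curl -sS --max-time 5 "$(api_base)/health" \
  || echo "health probe failed (expected when off-LAN)"

# Trivial GraphQL POST.
curl -sS --max-time 5 -H 'Content-Type: application/json' \
  -d '{"query":"{ __typename }"}' "$(api_base)/graphql" \
  || echo "graphql probe failed (expected when off-LAN)"
```

Both probes must succeed before touching NPM; a failing health probe at this stage points at the hub install, not the cutover.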

---

## 1. `dbis_core` (rate limits and `req.ip`)

1. Set `TRUST_PROXY=1` on the `dbis_core` process (see `scripts/deployment/ensure-dbis-api-trust-proxy-on-ct.sh` for VMIDs **10150** / **10151**).
2. **`TRUST_PROXY_HOPS`** (optional; default `1` in code): Express counts **reverse proxies that terminate the TCP connection to Node** — typically **one** (either NPM **or** the API hub), even when browsers traversed Cloudflare → NPM → hub → `dbis_core`. Raise hops only if your stack adds **another** reverse proxy **in series** directly in front of the same listener (uncommon). When unsure, leave it unset and validate `req.ip` / rate-limit keys with two real client IPs.
3. Ensure **`ALLOWED_ORIGINS`** lists every browser origin that calls the API (portal, admin, studio, and marketing SPAs as applicable). Production forbids `*`.
4. Restart `dbis_core` and confirm the logs show no CORS or startup validation errors.
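
A quick pre-restart guard for step 3, catching the forbidden wildcard before `dbis_core` refuses to start. The `check_allowed_origins` function is an illustrative sketch, not part of the repo's tooling:

```shell
#!/usr/bin/env bash
# Illustrative guard: fail when ALLOWED_ORIGINS is empty or contains "*".
check_allowed_origins() {
  local origins="${1:-}"
  [ -n "$origins" ] || { echo "ALLOWED_ORIGINS is empty" >&2; return 1; }
  local -a parts
  IFS=',' read -r -a parts <<< "$origins"
  local o
  for o in "${parts[@]}"; do
    if [ "$o" = "*" ]; then
      echo "ALLOWED_ORIGINS must not contain '*' in production" >&2
      return 1
    fi
  done
}

# The origins below are placeholders for your real portal/admin hosts.
check_allowed_origins "https://portal.sankofa.nexus,https://admin.sankofa.nexus" \
  && echo "ALLOWED_ORIGINS looks sane"
```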

---

## 2. NPM fleet (`update-npmplus-proxy-hosts-api.sh`)

1. In the repo `.env` (operator workstation), set:
   - `SANKOFA_NPM_PHOENIX_PORT=8080`
   - Optionally `IP_SANKOFA_NPM_PHOENIX_API=…` if the hub listens on a different LAN IP than `IP_SANKOFA_PHOENIX_API`.
2. Run the fleet script with valid `NPM_*` credentials (same as for other NPM updates).
3. In the NPM UI, confirm `phoenix.sankofa.nexus` and `www.phoenix.sankofa.nexus` forward WebSockets (subscriptions use `/graphql-ws`).
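
Step 3 can be spot-checked by hand before running the full smoke script. A sketch of the manual upgrade probe, assuming the `graphql-transport-ws` subprotocol used by the `graphql-ws` library; the `ws_key` helper is illustrative (any base64-encoded 16-byte nonce works):

```shell
#!/usr/bin/env bash
# Manual equivalent of the WebSocket upgrade smoke: ask the edge for an upgrade
# on /graphql-ws and read the status line (a healthy path answers HTTP/1.1 101).
ws_key() { head -c 16 /dev/urandom | base64; }

status_line=$(curl -sS --http1.1 --max-time 8 -o /dev/null -D - \
  -H 'Connection: Upgrade' \
  -H 'Upgrade: websocket' \
  -H 'Sec-WebSocket-Version: 13' \
  -H "Sec-WebSocket-Key: $(ws_key)" \
  -H 'Sec-WebSocket-Protocol: graphql-transport-ws' \
  'https://phoenix.sankofa.nexus/graphql-ws' | head -n 1)
[ -n "$status_line" ] || status_line='no response (probe is network-dependent)'
echo "$status_line"
```

`--max-time` matters here: on success curl holds the open WebSocket until the timeout, which is also why the smoke script caps each probe.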

---

## 3. Verification

| Check | Command or action |
|-------|-------------------|
| Public HTTPS health | `curl -fsS "https://phoenix.sankofa.nexus/health"` (or the hub-exposed health path you standardized) |
| GraphQL | POST `https://phoenix.sankofa.nexus/graphql` with a trivial query |
| WebSocket upgrade (TLS + hub) | `bash scripts/verify/smoke-phoenix-graphql-wss-public.sh` (expects **HTTP 101** via `curl --http1.1`; optional `PHOENIX_WSS_INCLUDE_LAN=1` for hub `:8080`; optional `PHOENIX_WSS_CURL_MAXTIME`, default **8** s per probe, because curl waits on the open WS). Full handshake: `pnpm run verify:phoenix-graphql-ws-subscription` (`connection_init` → `connection_ack`). Troubleshooting notes follow the table. |
| IRU / public limits | Hit a rate-limited route from two different public IPs and confirm the keys differ (validates the forwarded client IP) |

WebSocket troubleshooting:

- **RSV1 errors** from Node clients on `/graphql-ws`: CT **7800** should not register `@fastify/websocket` alongside standalone `ws` (apply `scripts/deployment/ensure-sankofa-phoenix-graphql-ws-remove-fastify-websocket-7800.sh`).
- **Process crashes on WS disconnect:** `websocket.ts` must import `logger` — `scripts/deployment/ensure-sankofa-phoenix-websocket-ts-import-logger-7800.sh`.
- **Hub nginx headers:** `scripts/deployment/ensure-sankofa-phoenix-api-hub-graphql-ws-proxy-headers-7800.sh` (`Accept-Encoding ""`, `proxy_buffering off` in `/graphql-ws`).
- **Optional host guard:** `scripts/deployment/ensure-sankofa-phoenix-7800-nft-dport-4000-guard.sh` + `config/nftables/sankofa-phoenix-7800-guard-dport-4000.nft`.
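
The "keys differ" check can be rehearsed from the LAN before you have two real public IPs, by forging `X-Forwarded-For` against a listener that trusts its immediate hop (`TRUST_PROXY=1`). A sketch; the `/health` route, the `dbis_core` port `3000`, and `probe_as` are assumptions for illustration:

```shell
#!/usr/bin/env bash
# LAN-side rehearsal: replay a route with two forged client IPs and compare
# how the API answers. Only meaningful when the listener trusts its immediate
# proxy hop (TRUST_PROXY=1), so the rightmost X-Forwarded-For entry wins.

probe_as() {  # probe_as <fake-client-ip> -> HTTP status code ("000" on failure)
  curl -sS --max-time 5 -o /dev/null -w '%{http_code}' \
    -H "X-Forwarded-For: $1" "$DBIS/health"
}

if [ -n "${IP_DBIS_API:-}" ]; then
  DBIS="http://$IP_DBIS_API:3000"
  echo "probe A: $(probe_as 203.0.113.10), probe B: $(probe_as 198.51.100.20)"
else
  echo "set IP_DBIS_API (repo .env) to run the probes"
fi
```

If both forged IPs share one rate-limit bucket, the forwarded client IP is not reaching `dbis_core` and the cutover should pause at section 1.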

---

## 4. Post-cutover hardening (dual path)

After NPM points at `:8080` and traffic is stable:

- **Bind Apollo to loopback** (recommended when the hub upstream is `127.0.0.1:4000`):
  `PROXMOX_OPS_APPLY=1 PROXMOX_OPS_ALLOWED_VMIDS=7800 bash scripts/deployment/ensure-sankofa-phoenix-apollo-bind-loopback-7800.sh --apply --vmid 7800`
  Confirm the VLAN cannot connect to `:4000`, while hub `:8080` and public `https://phoenix.sankofa.nexus` still work. **Alternative:** host firewall on CT 7800 — see `scripts/deployment/plan-phoenix-apollo-port-4000-restrict-7800.sh --ssh`.
- **Hub `/graphql-ws` proxy headers** (idempotent; safe with existing installs):
  `PROXMOX_OPS_APPLY=1 PROXMOX_OPS_ALLOWED_VMIDS=7800 bash scripts/deployment/ensure-sankofa-phoenix-api-hub-graphql-ws-proxy-headers-7800.sh --apply --vmid 7800`
- **Hub nginx `ExecReload`** (systemd, idempotent):
  `PROXMOX_OPS_APPLY=1 PROXMOX_OPS_ALLOWED_VMIDS=7800 bash scripts/deployment/ensure-sankofa-phoenix-api-hub-systemd-exec-reload-7800.sh --apply --vmid 7800`
- **Phoenix API DB migrations** (after DB auth works):
  `PROXMOX_OPS_APPLY=1 PROXMOX_OPS_ALLOWED_VMIDS=7800 bash scripts/deployment/ensure-sankofa-phoenix-api-db-migrate-up-7800.sh --apply --vmid 7800`
- **Phoenix API `.env` LAN parity** (Keycloak + Sankofa Postgres host, dedupe passwords, `NODE_ENV` policy, `TERMINATE_TLS_AT_EDGE`):
  `source scripts/lib/load-project-env.sh`, then
  `PROXMOX_OPS_APPLY=1 PROXMOX_OPS_ALLOWED_VMIDS=7800 bash scripts/deployment/ensure-sankofa-phoenix-api-env-lan-parity-7800.sh --apply --vmid 7800`
  - The default appends **`NODE_ENV=development`** until `DB_PASSWORD` / `KEYCLOAK_CLIENT_SECRET` meet production length; use **`PHOENIX_API_NODE_ENV=production`** only after secrets and the TLS policy are ready.
  - If Postgres returns **28P01** (auth failed), align **`DB_USER`** (typically **`sankofa`**, not `postgres`) and **`DB_PASSWORD`** with the **`sankofa`** role on VMID **7803** (`ALTER USER … PASSWORD` on the Postgres CT), then run **`ensure-sankofa-phoenix-api-db-migrate-up-7800.sh`** so **`audit_logs`** exists — see [ALL_VMIDS_ENDPOINTS.md](../04-configuration/ALL_VMIDS_ENDPOINTS.md).
  - For **`PHOENIX_API_NODE_ENV=production`** without local certs: run **`ensure-sankofa-phoenix-tls-config-terminate-at-edge-7800.sh`** first and keep **`TERMINATE_TLS_AT_EDGE=1`** in `.env`.
- Inventory: [ALL_VMIDS_ENDPOINTS.md](../04-configuration/ALL_VMIDS_ENDPOINTS.md) (Phoenix row + VMID 7800 table).
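
The `NODE_ENV` gate in the LAN-parity step can be expressed as a small check. This is an illustrative sketch, not the script's actual policy: `phoenix_node_env` is hypothetical and the 16-character minimum is an assumed threshold:

```shell
#!/usr/bin/env bash
# Illustrative version of the env-parity gate: stay on development until both
# secrets look production-grade. MIN_LEN is an assumed threshold.
MIN_LEN=16

phoenix_node_env() {  # phoenix_node_env <db_password> <keycloak_client_secret>
  if [ "${#1}" -ge "$MIN_LEN" ] && [ "${#2}" -ge "$MIN_LEN" ]; then
    echo "production"   # opt in explicitly with PHOENIX_API_NODE_ENV=production
  else
    echo "development"
  fi
}

phoenix_node_env "short" "also-short"
```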

---

## 5. Rollback

1. Unset `SANKOFA_NPM_PHOENIX_PORT` or set it back to `4000` (or your direct Apollo port).
2. Re-run the NPM fleet script.
3. If `dbis_core` had `TRUST_PROXY_HOPS=2` only for the hub path, reduce the hops or disable trust proxy per your direct topology.

---

## 6. References

- Installer: `scripts/deployment/install-sankofa-api-hub-nginx-on-pve.sh`
- Hub graphql-ws headers (live CT): `scripts/deployment/ensure-sankofa-phoenix-api-hub-graphql-ws-proxy-headers-7800.sh`
- Phoenix `websocket.ts` logger import (prevents crash on disconnect): `scripts/deployment/ensure-sankofa-phoenix-websocket-ts-import-logger-7800.sh`
- Phoenix API `.env` LAN parity: `scripts/deployment/ensure-sankofa-phoenix-api-env-lan-parity-7800.sh`
- Phoenix API DB migrate up (CT 7800): `scripts/deployment/ensure-sankofa-phoenix-api-db-migrate-up-7800.sh`
- Phoenix TLS (terminate at edge, production without local certs): `scripts/deployment/ensure-sankofa-phoenix-tls-config-terminate-at-edge-7800.sh`
- Hub unit `ExecReload`: `scripts/deployment/ensure-sankofa-phoenix-api-hub-systemd-exec-reload-7800.sh`
- LAN smoke: `scripts/verify/verify-sankofa-consolidated-hub-lan.sh`
- Hub GraphQL smoke: `scripts/verify/smoke-phoenix-api-hub-lan.sh`
- Public / LAN WebSocket upgrade smoke: `scripts/verify/smoke-phoenix-graphql-wss-public.sh`
- Loopback bind for Apollo: `scripts/deployment/ensure-sankofa-phoenix-apollo-bind-loopback-7800.sh`
- Read-only plan (firewall alternative): `scripts/deployment/plan-phoenix-apollo-port-4000-restrict-7800.sh` (`--ssh` on LAN)
- Example config syntax: `scripts/verify/check-sankofa-consolidated-nginx-examples.sh`
- Gap review: `docs/02-architecture/NON_CHAIN_ECOSYSTEM_PLAN_REVIEW_AND_GAPS.md`
@@ -0,0 +1,96 @@

# Goal: relieve r630-01 via consolidation + hub placement (not nginx alone)

**Status:** Operator goal / runbook
**Last updated:** 2026-04-13

## 1. What you are optimizing for

**Primary goal:** reduce **guest count** and **steady-state CPU / pressure** on **r630-01** (`192.168.11.11`) by:

1. **Retiring CTs** that only existed to serve **small, non-chain web** surfaces (static or low-SSR), after those surfaces are merged into a **single web hub** guest (or a static export + nginx).
2. **Placing new hub LXCs** (nginx-only or low-RAM) on **less busy nodes** (typically **r630-03 / r630-04** per the health reports), instead of stacking more edge services on r630-01.
3. **Optionally migrating** existing Sankofa / Phoenix / DBIS-related CTs **off** r630-01 when they are **not** chain-critical for that node.

**Non-goal:** expecting the **API hub nginx** colocated on VMID **7800** to materially lower r630-01 load. That pattern buys **routing simplicity** and a path to **fewer public upstreams**; load relief comes from **fewer guests** and **better placement**, not from reverse-proxy CPU.

---

## 2. Current anchor facts (from inventory docs)

Treat `pct list` on each node as authoritative when planning; the table below is a **documentation snapshot** of common r630-01-adjacent workloads:

| Area | Typical on r630-01 today | Notes |
|------|--------------------------|-------|
| Sankofa Phoenix stack | **7800** API, **7801** portal, **7802** Keycloak, **7803** Postgres, **7806** public web | Tightly coupled for latency; migrations need cutover windows |
| DBIS API | **10150** (`IP_DBIS_API`) | Often co-dependent with Phoenix / portal flows |
| NPMplus | **10233** / **10234** (see `ALL_VMIDS_ENDPOINTS.md`) | Edge; may stay on r630-01 or follow your NPM HA policy |
| Chain-critical | **2101**, **2103** (Besu core lanes) | **Do not** “consolidate away” without the chain runbooks |

---

## 3. Phased execution (explicit consolidation + placement)

### Phase 0 — Measure (read-only)

1. Pull the latest cluster health JSON: `bash scripts/verify/poll-lxc-cluster-health.sh` (writes `reports/status/lxc_cluster_health_*.json`).
2. Produce a rebalance **plan only**:
   `bash scripts/verify/plan-lxc-rebalance-from-health-report.sh --source r630-01 --target r630-04 --limit 12`
   Adjust `--target` to the node with **headroom** (load, PSI, storage). Review the exclusions (chain-critical / infra patterns) in the script output.
3. Record **which VMIDs must stay** on r630-01 vs the **candidates to move** in your change ticket.

### Phase 1 — Consolidate **non-chain web** (fewer guests)

1. Architecture: [SANKOFA_PHOENIX_CONSOLIDATED_FRONTEND_AND_API.md](../02-architecture/SANKOFA_PHOENIX_CONSOLIDATED_FRONTEND_AND_API.md) (static-first vs one Node process).
2. Build static exports (or one monorepo SSR host) so **multiple FQDNs** can share **one nginx** `server_name` / `map $host` pattern (`config/nginx/sankofa-non-chain-frontends.example.conf`).
3. **Provision the web hub LXC on the target node** (not r630-01 if the goal is offload). Use a **new IP** from your IPAM; update the `.env` overrides `IP_SANKOFA_WEB_HUB` / port when ready.
4. NPM dry-run → apply: point the marketing / microsite hosts at the web hub upstream.

**Outcome:** retire the legacy one-site-one-CT guests **after** the TTL / rollback window.
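
The shared-hub idea in step 2 can be sketched as an nginx fragment. This is a placeholder sketch, not the repo's example: the hostnames, port, and roots are invented, and the real pattern lives in `config/nginx/sankofa-non-chain-frontends.example.conf`:

```nginx
# Placeholder sketch: several static FQDNs served by one nginx guest.
map $host $site_root {
    default                  /srv/www/placeholder;
    marketing.sankofa.nexus  /srv/www/marketing;
    microsite.sankofa.nexus  /srv/www/microsite;
}

server {
    listen 8080;
    server_name marketing.sankofa.nexus microsite.sankofa.nexus;

    root  $site_root;
    index index.html;

    location / {
        try_files $uri $uri/ /index.html;
    }
}
```

One `map $host` block is what lets retired one-site-one-CT guests collapse into a single hub without per-site `server` duplication.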

### Phase 2 — API hub **placement** (avoid piling onto r630-01)

**Today:** the Tier-1 API hub nginx may be colocated on **7800** (the same CT as Apollo) for a fast LAN proof — that does **not** reduce the r630-01 guest count.

**Target pattern for load relief:**

1. Create a **small** Debian LXC on **r630-03 or r630-04** (a dedicated “phoenix-api-hub” VMID) running **only** nginx + `sankofa-phoenix-api-hub.service`.
2. Upstreams in that hub: `proxy_pass` to the **LAN IPs** of **7800:4000** (GraphQL) and **10150:3000** (`dbis_core`) — cross-node proxying is fine on VLAN 11.
3. Run `install-sankofa-api-hub-nginx-on-pve.sh` with `--vmid <new-hub-vmid>` on the **target** node's PVE host (set `PROXMOX_HOST` if it is not r630-01).
4. NPM: point `phoenix.sankofa.nexus` at the **hub IP:8080** (or keep **4000** direct until validated). Before declaring success, run the **WebSocket** smoke (`graphql-ws` through NPM) and confirm that **`dbis_core` `TRUST_PROXY`** and the trusted proxy list include the hub (see [NON_CHAIN_ECOSYSTEM_PLAN_REVIEW_AND_GAPS.md](../02-architecture/NON_CHAIN_ECOSYSTEM_PLAN_REVIEW_AND_GAPS.md) §2.1–2.2).
5. **Disable / remove** the hub nginx from **7800** if you no longer want dual stacks (maintenance window; validate `systemctl stop sankofa-phoenix-api-hub` on 7800 only after NPM uses the new hub).

**Outcome:** the Phoenix CT can stay on r630-01 for DB locality, while the **edge proxy RAM/CPU** sits on a lighter node — or later migrate 7800 itself after Phase 3.
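
The upstream wiring from steps 1–2 can be sketched roughly as below. The upstream IPs and the `/dbis/` path are placeholders (take real addresses from `ALL_VMIDS_ENDPOINTS.md`); the installer and the headers script remain the source of truth for the deployed config:

```nginx
# Placeholder sketch for the dedicated phoenix-api-hub guest (nginx only).
upstream phoenix_graphql { server 192.168.11.50:4000; }   # CT 7800 (placeholder IP)
upstream dbis_core_api   { server 192.168.11.60:3000; }   # CT 10150 (placeholder IP)

server {
    listen 8080;
    server_name phoenix.sankofa.nexus;

    location /graphql {
        proxy_pass http://phoenix_graphql;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location /graphql-ws {
        proxy_pass http://phoenix_graphql;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Accept-Encoding "";
        proxy_buffering off;
    }

    location /dbis/ {                 # hypothetical path prefix for dbis_core
        proxy_pass http://dbis_core_api/;
    }
}
```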

### Phase 3 — Migrate heavy CTs (optional, highest impact)

Use **scoped** `pct migrate` runs from the planner output. Rules from project safety:

- Named VMID list, **dry-run** first, maintenance window, rollback IP/NPM plan.
- After any move: update `get_host_for_vmid` in `scripts/lib/load-project-env.sh` and [ALL_VMIDS_ENDPOINTS.md](../04-configuration/ALL_VMIDS_ENDPOINTS.md).
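
The scoped-migration rule can be rehearsed as an echo-only plan: print the commands, review them against the exclusion list, then apply by hand inside the window. The VMID list and target here are illustrative, not the planner's output:

```shell
#!/usr/bin/env bash
# Echo-only migration plan: never executes pct, only prints scoped commands.
# VMIDs and target are placeholders; take the real list from the planner.
TARGET_NODE="${TARGET_NODE:-r630-04}"
CANDIDATE_VMIDS=(7801 7806)   # example candidates; never chain-critical (2101, 2103)

plan_migrations() {
  local vmid
  for vmid in "${CANDIDATE_VMIDS[@]}"; do
    echo "pct migrate $vmid $TARGET_NODE --restart"
  done
}

plan_migrations
```

Keeping the list named and explicit (rather than looping over `pct list`) is what makes the run scoped: nothing outside `CANDIDATE_VMIDS` can move.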

### Phase 4 — Retire + verify

1. Destroy **only** CTs that are fully replaced (config backups taken; DNS and NPM rows removed).
2. Re-run the health poll + the E2E verifier profile for the public hosts you moved.

---

## 4. Decision record (fill in as you execute)

| Decision | Choice | Date |
|----------|--------|------|
| Web hub target node | r630-0? | |
| API hub target node (nginx-only LXC) | r630-0? | |
| NPM phoenix upstream | `:4000` direct / `:8080` hub | |
| VMIDs retired after consolidation | | |

---

## 5. Related references

- [NON_CHAIN_ECOSYSTEM_HYPERSCALER_STYLE_MODEL.md](../02-architecture/NON_CHAIN_ECOSYSTEM_HYPERSCALER_STYLE_MODEL.md) (cell types, edge plane vs chain plane)
- [SANKOFA_PHOENIX_CONSOLIDATED_FRONTEND_AND_API.md](../02-architecture/SANKOFA_PHOENIX_CONSOLIDATED_FRONTEND_AND_API.md)
- [PROXMOX_LOAD_BALANCING_RUNBOOK.md](../04-configuration/PROXMOX_LOAD_BALANCING_RUNBOOK.md)
- [ALL_VMIDS_ENDPOINTS.md](../04-configuration/ALL_VMIDS_ENDPOINTS.md)
- `scripts/deployment/install-sankofa-api-hub-nginx-on-pve.sh`
- `scripts/verify/verify-sankofa-consolidated-hub-lan.sh`