Non-chain ecosystem plan — detailed review, gaps, and inconsistencies
Purpose: Critical review of the consolidated Phoenix / web hub / r630-01 offload / hyperscaler-style documents and scripts as of 2026-04-13. Use this as a remediation backlog; update linked docs when items close.
Scope reviewed:
NON_CHAIN_ECOSYSTEM_HYPERSCALER_STYLE_MODEL.md,
SANKOFA_PHOENIX_CONSOLIDATED_FRONTEND_AND_API.md,
SANKOFA_R630_01_CONSOLIDATION_AND_HUB_PLACEMENT_GOAL.md,
scripts/deployment/install-sankofa-api-hub-nginx-on-pve.sh,
scripts/verify/verify-sankofa-consolidated-hub-lan.sh,
config/ip-addresses.conf hub defaults,
scripts/lib/load-project-env.sh get_host_for_vmid.
1. Cross-document consistency
| Topic | Hyperscaler model | Consolidated hub doc | r630-01 goal doc | Verdict |
|---|---|---|---|---|
| Chain vs non-chain boundary | Explicit exclusion list | Matches | Matches | Aligned |
| API hub Tier 1 | Gateway row | Tier 1 nginx | Phase 2 move hub off 7800 | Aligned; live state (hub on 7800) is interim per r630 doc |
| Web hub | Edge-static / SSR cells | Options A/B/C | Phase 1 | Aligned |
| Load relief | Fewer cells + placement | “Moving hubs” note | Non-goal: nginx CPU on same node | Aligned |
| NPM | Single edge story | Fewer upstream IPs possible | NPM repoint | Partial gap: NPM often still one row per FQDN; “fewer rows” is upstream IP convergence, not necessarily fewer proxy host records (see §4.1). |
2. Technical gaps (must fix in implementation, not only docs)
2.1 TRUST_PROXY and client IP for dbis_core (high)
Issue: Tier-1 nginx forwards X-Forwarded-For / X-Real-IP, but dbis_core IRU rate limits and abuse logic require TRUST_PROXY=1 (and correct trusted hop: NPM → hub → app). If dbis_core does not trust the hub IP, it sees only the hub’s LAN address for all users.
Remediation: Document in cutover checklist: set TRUST_PROXY=1 on dbis_core and restrict trusted proxy list to NPM and API hub subnets/IPs. Add integration test: rate limit key changes when X-Forwarded-For varies.
Doc fix: Already mentioned in consolidated §3.3; add explicit “before NPM → hub cutover” gate in SANKOFA_PHOENIX_CONSOLIDATED_FRONTEND_AND_API.md operator checklist.
Repo (2026-04-13): dbis_core supports TRUST_PROXY_HOPS (1–10) so Express trust proxy matches NPM-only vs NPM→hub→app; see dbis_core/.env.example. IP allowlisting for proxies remains an ops/network task.
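To make the hop count concrete, the sketch below mimics in shell how a numeric Express trust proxy value selects the client address: start at the socket peer, walk X-Forwarded-For from right to left, and take the first address past the trusted hops. It is an illustration of the documented Express behaviour, not dbis_core code; the function name resolve_client_ip and every IP address are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: mimics how a numeric Express "trust proxy" value selects the
# client address. resolve_client_ip and every IP below are hypothetical.

# resolve_client_ip SOCKET_PEER X_FORWARDED_FOR HOPS
resolve_client_ip() {
  local socket_peer="$1" xff="$2" hops="$3"
  local -a parts=() chain=("$socket_peer")
  local i

  # X-Forwarded-For lists addresses client-first; walk it right to left so
  # chain[0] is the socket peer, chain[1] the nearest proxy, and so on.
  IFS=',' read -ra parts <<< "${xff// /}"
  for (( i = ${#parts[@]} - 1; i >= 0; i-- )); do
    chain+=("${parts[i]}")
  done

  # Skip HOPS trusted addresses; if the chain is shorter, use its last entry.
  local idx="$hops"
  if (( idx > ${#chain[@]} - 1 )); then
    idx=$(( ${#chain[@]} - 1 ))
  fi
  printf '%s\n' "${chain[idx]}"
}

# NPM -> hub -> app: the app's socket peer is the hub (10.0.0.50 here) and
# NPM (10.0.0.10 here) is the last entry in the header.
resolve_client_ip 10.0.0.50 "203.0.113.7, 10.0.0.10" 2   # prints 203.0.113.7
resolve_client_ip 10.0.0.50 "203.0.113.7, 10.0.0.10" 1   # prints 10.0.0.10
```

With one hop too few, every user collapses onto the NPM address, which is the same failure mode as not trusting the hub at all. With one hop too many, a client could supply its own X-Forwarded-For entry and have it accepted, which is why the proxy IP allowlist still matters.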
2.2 GraphQL WebSocket through NPM + hub (high)
Issue: graphql-ws requires Upgrade end-to-end. NPM custom locations must allow WebSockets; hub nginx already sets Upgrade / Connection to Apollo. If NPM strips or times out upgrades, subscriptions break silently for some clients.
Remediation: Add explicit E2E: wscat or Apollo subscription smoke through public URL after any NPM port/path change. Document NPM “Websockets support” toggle if applicable.
Repo: scripts/verify/smoke-phoenix-graphql-wss-public.sh (curl HTTP 101 upgrade on wss://…/graphql-ws; use PHOENIX_WSS_INCLUDE_LAN=1 for hub :8080).
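As a rough illustration of what an HTTP 101 check involves, the sketch below validates a raw response header dump. It is not the repo's smoke script: ws_upgrade_ok and the PHOENIX_WS_URL variable are hypothetical names, and the Sec-WebSocket-Key value is the sample nonce from RFC 6455.

```bash
#!/usr/bin/env bash
# Sketch only: ws_upgrade_ok is a hypothetical helper, not the repo script.

# Reads raw HTTP response headers on stdin. Succeeds only if the status line
# is 101 and the response carries "Upgrade: websocket".
ws_upgrade_ok() {
  local status_line
  IFS= read -r status_line || return 1
  [[ "$status_line" =~ ^HTTP/1\.1[[:space:]]101([[:space:]]|$) ]] || return 1
  grep -qi '^upgrade:[[:space:]]*websocket'
}

# Example use. PHOENIX_WS_URL is a placeholder for the https:// form of the
# public graphql-ws endpoint. curl exits non-zero on --max-time because a
# successful upgrade keeps the connection open, hence "|| true".
#   { curl -sS -i -N --http1.1 --max-time 5 \
#       -H 'Connection: Upgrade' -H 'Upgrade: websocket' \
#       -H 'Sec-WebSocket-Version: 13' \
#       -H 'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==' \
#       -H 'Sec-WebSocket-Protocol: graphql-transport-ws' \
#       "$PHOENIX_WS_URL" || true; } | ws_upgrade_ok
```

A 101 only proves the handshake survives NPM and the hub; it does not prove that idle subscriptions outlive proxy timeouts, so a longer-lived subscription test is still worth keeping alongside it.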
2.3 CORS and browser origins (medium)
Issue: Consolidated doc says CORS allowlist “web hub FQDNs only.” Browsers calling https://phoenix.sankofa.nexus/graphql from https://portal.sankofa.nexus are cross-origin; allowlist must include portal, admin, studio, and any SPA origins that call the API—not only the web hub static hostnames.
Remediation: Replace wording with “all documented browser origins that invoke Phoenix or dbis_core from the browser.” Cross-ref SANKOFA_MARKETPLACE_SURFACES.md for IRU public routes.
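One way to verify the allowlist per origin is to send a preflight from each documented browser origin and confirm the API echoes it back. The helper below is a minimal sketch under that assumption; cors_origin_echoed is a hypothetical name, and the example uses only the portal origin and GraphQL URL named in the issue above.

```bash
#!/usr/bin/env bash
# Sketch only: cors_origin_echoed is a hypothetical helper.

# Reads HTTP response headers on stdin and succeeds only if
# Access-Control-Allow-Origin equals the given origin exactly.
cors_origin_echoed() {
  local origin="$1" allowed
  allowed="$(tr -d '\r' \
    | grep -i '^access-control-allow-origin:' \
    | head -n 1 \
    | sed -e 's/^[^:]*:[[:space:]]*//' -e 's/[[:space:]]*$//' || true)"
  [[ -n "$allowed" && "$allowed" == "$origin" ]]
}

# Example preflight from the portal origin:
#   curl -sS -i -X OPTIONS \
#     -H 'Origin: https://portal.sankofa.nexus' \
#     -H 'Access-Control-Request-Method: POST' \
#     -H 'Access-Control-Request-Headers: content-type' \
#     https://phoenix.sankofa.nexus/graphql \
#     | cors_origin_echoed 'https://portal.sankofa.nexus'
```

Running the same check in a loop over every documented origin turns the allowlist wording into something a cutover gate can assert.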
2.4 Health check path in operator checklist (low — doc error)
Issue: Cutover checklist suggested GET /api/v1/health; dbis_core exposes /health and /v1/health, not under /api/v1/.
Remediation: Checklist corrected in consolidated doc to /health via hub (/api/ prefix does not apply to root health).
2.5 Dual public paths (4000 vs 8080) during migration (medium)
Issue: While both ports are open, clients can bypass hub policies (CORS, future WAF) by targeting :4000 directly if firewalled only at NPM. Hyperscaler model prefers one ingress.
Remediation: After NPM cutover to 8080, firewall Phoenix :4000 to localhost + hub IP only on CT 7800, or bind Apollo to 127.0.0.1 only (application config change—needs Phoenix runbook).
Repo (2026-04-13): scripts/deployment/ensure-sankofa-phoenix-apollo-bind-loopback-7800.sh sets HOST=127.0.0.1 for Fastify on 7800 when hub upstream is 127.0.0.1:4000.
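After the bind change, it helps to confirm that nothing on the CT still listens on :4000 on a routable address. The sketch below parses ss -ltnH output for that purpose; listens_loopback_only is a hypothetical helper, and the pct exec invocation in the comment assumes the check runs from the PVE node that hosts CT 7800.

```bash
#!/usr/bin/env bash
# Sketch only: listens_loopback_only is a hypothetical helper.

# Reads `ss -ltnH` output on stdin. Succeeds only if PORT has at least one
# listener and every listener on it is bound to a loopback address.
listens_loopback_only() {
  local port="$1"
  awk -v port="$port" '
    {
      # Field 4 is "local-address:port"; the port is the last ":" part.
      n = split($4, a, ":")
      if (a[n] != port) next
      found = 1
      addr = substr($4, 1, length($4) - length(a[n]) - 1)
      if (addr != "127.0.0.1" && addr != "[::1]") exposed = 1
    }
    END { exit (found && !exposed) ? 0 : 1 }
  '
}

# Example use from the PVE node:
#   pct exec 7800 -- ss -ltnH | listens_loopback_only 4000
```

The check fails both when the port is exposed and when nothing listens on it, so a stopped Phoenix service is not mistaken for a successful lockdown.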
2.6 Stock nginx package disabled on 7800 (medium)
Issue: The installer runs systemctl disable nginx, which disables (but does not uninstall) the default Debian nginx.service. Operators who expect stock nginx for ad-hoc static files on that CT lose it. This is intentional today, because the CT is dedicated to sankofa-phoenix-api-hub.service.
Remediation: Document on CT 7800: only sankofa-phoenix-api-hub serves nginx; do not re-enable stock unit without conflict check.
2.7 proxy_pass URI and trailing slashes (low)
Issue: location /api/ with proxy_pass http://dbis_core_rest; (no URI part) passes the request URI to the upstream unchanged, which is correct for dbis_core mounted at /api/v1. If any route is mounted at the root on the upstream, a mismatch is possible.
Remediation: Keep; add note: new BFF routes must use distinct prefixes (/bff/) to avoid colliding with Apollo or dbis_core.
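The trailing-slash rule is easy to misremember, so the sketch below models it in shell: when proxy_pass has no URI part, nginx forwards the request URI as is; when it has one, the part of the request URI that matched the location prefix is replaced by that URI. upstream_uri is a hypothetical helper that covers only plain prefix locations, not regex locations or rewrites.

```bash
#!/usr/bin/env bash
# Sketch only: upstream_uri is a hypothetical model of the nginx proxy_pass
# URI rule for plain prefix locations.

# upstream_uri LOCATION PROXY_PASS_URI REQUEST_URI
#   PROXY_PASS_URI is the path part of proxy_pass ("" when the directive is
#   just http://upstream with nothing after the host).
upstream_uri() {
  local location="$1" proxy_uri="$2" request_uri="$3"
  if [[ -z "$proxy_uri" ]]; then
    # No URI part: the request URI is passed through unchanged.
    printf '%s\n' "$request_uri"
  else
    # URI part present: the matched location prefix is replaced by it.
    printf '%s\n' "${proxy_uri}${request_uri#"$location"}"
  fi
}

upstream_uri /api/ ""  /api/v1/health   # prints /api/v1/health
upstream_uri /api/ "/" /api/v1/health   # prints /v1/health
```

The second call shows what a stray trailing slash on proxy_pass would do: the /api/ prefix disappears and the upstream sees a path it does not mount.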
3. Inventory and automation gaps
3.1 get_host_for_vmid omits explicit Sankofa VMIDs (medium)
Issue: Sankofa stack VMIDs 7800–7806 fell through to default *) → r630-01. Behavior matched inventory but was implicit—easy to break if default changes.
Remediation: Add explicit 7800|7801|7802|7803|7806 case arm to get_host_for_vmid with comment “Sankofa Phoenix stack — verify with pct list when migrating.”
Repo (2026-04-13): Explicit 7800–7806 arm on r630-01 in scripts/lib/load-project-env.sh (includes gov portals 7804 and studio 7805).
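The shape of such an explicit arm can be sketched as follows. This is not the body of get_host_for_vmid in load-project-env.sh: the function name, the DEFAULT_PVE_HOST variable, and the assumption that the function returns a node name are all illustrative.

```bash
#!/usr/bin/env bash
# Sketch only: get_host_for_vmid_sketch and DEFAULT_PVE_HOST are hypothetical;
# the real function lives in scripts/lib/load-project-env.sh.

get_host_for_vmid_sketch() {
  local vmid="$1"
  case "$vmid" in
    # Sankofa Phoenix stack: verify with `pct list` when migrating.
    7800|7801|7802|7803|7804|7805|7806)
      echo "r630-01"
      ;;
    # Default arm. It resolves to r630-01 today, which is why the Sankofa
    # VMIDs used to work without an explicit entry.
    *)
      echo "${DEFAULT_PVE_HOST:-r630-01}"
      ;;
  esac
}

get_host_for_vmid_sketch 7804
```

With the explicit arm in place, changing the default no longer silently relocates the Sankofa stack in every script that resolves hosts by VMID.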
3.2 Fleet scripts and hub env vars (medium)
Issue: IP_SANKOFA_PHOENIX_API_HUB / SANKOFA_PHOENIX_API_HUB_PORT exist in ip-addresses.conf, but update-npmplus-proxy-hosts-api.sh (and friends) may still hardcode or use only IP_SANKOFA_PHOENIX_API + 4000.
Remediation: Grep fleet scripts; add optional branch: when SANKOFA_PHOENIX_API_HUB_PORT=8080 and flag file or env SANKOFA_NPM_USE_API_HUB=1, emit upstream :8080. Until then, document manual NPM row for hub cutover.
Repo (2026-04-13): update-npmplus-proxy-hosts-api.sh uses SANKOFA_NPM_PHOENIX_PORT (default SANKOFA_PHOENIX_API_PORT) and IP_SANKOFA_NPM_PHOENIX_API for phoenix.sankofa.nexus / www.phoenix. See SANKOFA_API_HUB_NPM_CUTOVER_AND_POST_CUTOVER_RUNBOOK.md.
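The override chain described above can be sketched with nested parameter defaults. This is an illustration, not the code in update-npmplus-proxy-hosts-api.sh: phoenix_npm_upstream is a hypothetical helper, and both the fallback of IP_SANKOFA_NPM_PHOENIX_API to IP_SANKOFA_PHOENIX_API and the final port fallback of 4000 are assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: phoenix_npm_upstream is a hypothetical helper. The IP fallback
# to IP_SANKOFA_PHOENIX_API and the last-resort port 4000 are assumptions.

# Prints "ip:port" for the NPM upstream of the Phoenix FQDNs.
phoenix_npm_upstream() {
  local ip="${IP_SANKOFA_NPM_PHOENIX_API:-${IP_SANKOFA_PHOENIX_API:-}}"
  local port="${SANKOFA_NPM_PHOENIX_PORT:-${SANKOFA_PHOENIX_API_PORT:-4000}}"
  if [[ -z "$ip" ]]; then
    echo "phoenix_npm_upstream: no Phoenix IP configured" >&2
    return 1
  fi
  printf '%s:%s\n' "$ip" "$port"
}

# Example hub cutover, overriding only the port (IP shown is a placeholder):
#   IP_SANKOFA_PHOENIX_API=10.0.0.50 SANKOFA_NPM_PHOENIX_PORT=8080 phoenix_npm_upstream
```

Keeping the NPM-facing variables separate from the Phoenix defaults lets the cutover and the rollback both be a one-variable change.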
3.3 PROXMOX_HOST for install script (low)
Issue: install-sankofa-api-hub-nginx-on-pve.sh defaults PROXMOX_HOST to r630-01. For hub on r630-04, operator must export PROXMOX_HOST—easy to miss.
Remediation: The script header already mentions this; add a one-line echo of the resolved host at the start of --apply (partially done); extend dry-run to print the get_host_for_vmid suggestion when SANKOFA_API_HUB_TARGET_NODE is set (future env).
Repo (2026-04-13): Header states PROXMOX_HOST = PVE node; dry-run prints get_host_for_vmid when load-project-env.sh is sourced.
4. Hyperscaler model — internal tensions
4.1 “Single edge” vs NPM reality
Tension: Model says NPM is the only public entry contract. Technically true for TLS, but NPM often implements one proxy host per FQDN. Hyperscalers use one ALB with many rules. Semantic alignment: treat NPM as ALB-equivalent; “single edge” means single trust and cert pipeline, not literally one row.
4.2 Static-first IRU / marketplace
Tension: SANKOFA_PHOENIX_CONSOLIDATED_FRONTEND_AND_API.md suggests static export for IRU/marketplace where compatible. Today much of partner discovery is dynamic (dbis_core + Phoenix marketplace). Over-optimistic without a “dynamic shell + CDN” alternative.
Remediation: In NON_CHAIN doc §3, clarify Edge-static is for marketing and post-login SPAs that only call APIs; IRU public catalog may remain Edge-SSR or API-driven SPA until a static export pipeline exists.
4.3 Token-aggregation and “chain plane” boundary
Tension: NON_CHAIN_ECOSYSTEM_HYPERSCALER_STYLE_MODEL.md excludes token-aggregation runtime tied to chain RPC. Many deployments colocate token-aggregation with explorer or info nginx—hybrid. Risk: teams mis-classify a service and consolidate wrong CT.
Remediation: Add one line: “Token-aggregation API that only proxies to public RPC may be treated as edge-adjacent; workers that hold keys or execute chain writes stay chain-plane.”
4.4 Postgres coupling
Tension: r630 doc says stack is tightly coupled for latency. Hyperscaler “managed DB” often implies network separation. Acceptable as single-AZ pattern; document when splitting Phoenix API from 7803 Postgres requires read replicas or connection pooler (PgBouncer) first.
5. Missing runbook sections (add over time)
| Missing item | Why it matters |
|---|---|
| Backup/restore before hub install and before pct migrate | Hub nginx does not replace backup discipline for Postgres / Keycloak. |
| Keycloak redirect URIs when origins move to web hub IP/hostnames | OIDC failures post-cutover. |
| Certificate issuance when many FQDNs share one upstream IP | NPM still requests certs per host; rate limits / ACME. |
| Rollback: restore NPM upstream + systemctl start nginx on 7800? | Dual-stack rollback path. |
| SLO / error budget | Hyperscaler practice; currently implicit. |
| CI for nginx -t on example configs | GitHub Actions: .github/workflows/validate-sankofa-nginx-examples.yml (Gitea: mirror or add equivalent workflow). |
6. Document maintenance items (quick fixes)
- Consolidated doc §5: ensure the artifact table always lists install-sankofa-api-hub-nginx-on-pve.sh and verify-sankofa-consolidated-hub-lan.sh next to other operator scripts.
- Consolidated §3.2 Tier 1: prefer LAN upstream to dbis_core as the default narrative (colocated 127.0.0.1:3000 is the special case). Clarified in repo.
- Decision log: “Web hub pattern” vs filled API tier: use TBD / interim until a web hub is chosen. Updated in repo.
- This file linked from NON_CHAIN_ECOSYSTEM_HYPERSCALER_STYLE_MODEL.md §6 and MASTER_INDEX.md.
7. Prioritized remediation backlog
| Priority | Item | Owner / status |
|---|---|---|
| P0 | Verify TRUST_PROXY + TRUST_PROXY_HOPS + production trust boundaries for dbis_core when using hub | LAN: TRUST_PROXY=1 on 10150/10151 via ensure-dbis-api-trust-proxy-on-ct.sh; validate rate-limit keys from two public IPs |
| P0 | WebSocket E2E through NPM after hub port change | Done: smoke-phoenix-graphql-wss-public.sh → HTTP 101; pnpm run verify:phoenix-graphql-ws-subscription → connection_ack (remove unused @fastify/websocket on 7800 if RSV1; see runbook). |
| P1 | CORS / allowed origins list includes all browser callers | App + API |
| P1 | Firewall or bind Apollo to localhost after NPM → 8080 | Done: ensure-sankofa-phoenix-apollo-bind-loopback-7800.sh on 7800 (or use firewall plan if HOST cannot be set) |
| P2 | Explicit get_host_for_vmid entries for 7800–7806 | Done in load-project-env.sh; re-verify on migrate |
| P2 | NPM fleet SANKOFA_NPM_PHOENIX_PORT / IP_SANKOFA_NPM_PHOENIX_API | Done in update-npmplus-proxy-hosts-api.sh |
| P3 | Backup/rollback runbook sections | SANKOFA_API_HUB_NPM_CUTOVER_AND_POST_CUTOVER_RUNBOOK.md §0 / §5 |
| P3 | Clarify static-first vs dynamic IRU in NON_CHAIN §3 | Docs |
8. Conclusion
The plan is directionally sound: chain plane separation, cell typing, phased offload from r630-01, and Tier-1 API hub are consistent. The largest gaps are operational truth items (client IP trust, WebSockets, CORS wording, dual-port exposure) and automation drift (NPM scripts vs new env vars, implicit VMID→host). Closing P0–P1 before wide NPM cutover matches how hyperscalers treat ingress migrations: prove identity and transport contracts first, then shift traffic.