Sankofa IT operations controller — architecture spec
Status: Draft for engineering and IT leadership alignment
Last updated: 2026-04-09 (ADR, collectors contract, VLAN runbook, BFF endpoints, Phase 3–4 skeletons)
Audience: IT team, platform ops, Sankofa admin product owners
1. Goals
You need a single operational program that covers:
| Capability | Intent |
|---|---|
| IP inventory | Authoritative list of every LAN/WAN/VIP assignment, owner, service, and lifecycle (no drift between spreadsheets and config/ip-addresses.conf). |
| VLAN design | Move from today’s flat VLAN 11 to the planned segmentation (validators, RPC, explorer, Sankofa services, tenants) without breaking production. |
| Port mapping | Physical: switch port ↔ patch panel ↔ host NIC ↔ logical bond/VLAN. Logical: UDM port forwards ↔ NPM host ↔ upstream CT/VM. |
| Host efficiency | Compare actual Proxmox capacity (CPU/RAM/storage/network) to workload placement; drive consolidation, spare-node use, and subscription/licensing discipline. |
| IT admin UI | HTML controller under the Sankofa admin surface so the IT team can view/control interfaces, assign licenses/entitlements, run provisioning workflows, and support billing (quotes, usage, invoices handoff). |
This document defines how that fits your existing stack (Proxmox cluster, UDM Pro, UniFi, NPMplus, Keycloak, Phoenix/dbis_core marketplace) and lays out a phased path so that each phase can ship on its own instead of requiring the whole program up front.
2. Current state (facts from this repo)
- IP truth is split across config/ip-addresses.conf, docs/04-configuration/ALL_VMIDS_ENDPOINTS.md, and docs/11-references/NETWORK_CONFIGURATION_MASTER.md. Automated snapshots: scripts/verify/poll-proxmox-cluster-hardware.sh, reports/status/hardware_and_connected_inventory_*.md.
- VLANs: Production today is VLAN 11 only for 192.168.11.0/24. Planned VLANs (110–112, 120, 160, 200–203) are documented in NETWORK_CONFIGURATION_MASTER.md but not implemented as separate broadcast domains on the wire.
- Sankofa admin: admin.sankofa.nexus is client SSO administration today (same upstream as the portal unless split). See FQDN_EXPECTED_CONTENT.md, ALL_VMIDS_ENDPOINTS.md (VMID 7801). Portal source: sibling repo Sankofa/portal (scripts/deployment/sync-sankofa-portal-7801.sh).
- Marketplace / commercial: Partner IRU flows live in dbis_core (API + React); native infra is mostly docs + Proxmox, not one database. See SANKOFA_MARKETPLACE_SURFACES.md.
Gap: There is no single product today that unifies IPAM, switch port data, Proxmox actions, UniFi, and billing under Sankofa admin. This spec is the blueprint to add it.
3. Target architecture (recommended)
3.1 UI placement
| Option | Pros | Cons |
|---|---|---|
| A — New /it (or /ops) app route inside Sankofa/portal, gated by Keycloak group sankofa-it-admin | One TLS hostname, shared session patterns, fastest path for “under admin” | Portal bundle grows; must isolate client admin vs IT super-admin |
| B — Dedicated host it.sankofa.nexus → small Next.js/Vite SPA + BFF | Strong separation, independent deploy cadence | Extra NPM row, cert, pipeline |
| C — Embed Grafana + NetBox only | Quick graphs / IPAM | Weak billing/licensing story; less “Sankofa branded” control |
Recommendation: Option A for MVP (fastest), with API on a dedicated backend so you can later move the shell to B without rewriting integrations.
3.2 Backend (“control plane API”)
Introduce a small service (for example, sankofa-it-api) that is never reachable from the public internet without authentication:
- Network: VLAN 11 only or private listener + NPM internal host; OIDC client credentials or user JWT from Keycloak.
- Responsibilities:
  - Read models: IPAM, devices, port maps, Proxmox inventory snapshot, UniFi device list (cached).
  - Write models: change requests with audit log (who/when/what); optional approval queue for destructive actions.
  - Connectors (adapters): Proxmox API, UniFi Network API (UDM), NPM API (already scripted in repo), optional NetBox later.
- Do not put Proxmox root tokens in the browser; BFF holds secrets server-side.
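
The write-model responsibility above (change requests with a who/when/what audit log and an optional approval queue) could look roughly like the following. This is a minimal sketch, not the service's actual design: the `ChangeRequest` fields, the status strings, and the `DESTRUCTIVE_ACTIONS` set are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical set of actions that must wait for a second approver.
DESTRUCTIVE_ACTIONS = {"stop_ct", "resize_disk"}


def _utc_now() -> str:
    """Timestamp in ISO 8601 with an explicit UTC offset."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class ChangeRequest:
    actor: str    # who
    action: str   # what
    target: str   # which object, e.g. a VMID
    requested_at: str = field(default_factory=_utc_now)  # when
    status: str = "pending"


def submit_change(audit_log: list, actor: str, action: str, target: str) -> ChangeRequest:
    """Record every request in the audit log; queue destructive ones for approval."""
    request = ChangeRequest(actor=actor, action=action, target=target)
    request.status = (
        "awaiting_approval" if action in DESTRUCTIVE_ACTIONS else "approved"
    )
    audit_log.append(request)
    return request


log: list = []
start = submit_change(log, "alice", "start_ct", "7801")
stop = submit_change(log, "alice", "stop_ct", "7801")
```

In a real BFF the log would be an append-only Postgres table rather than an in-memory list; the point is only that the audit entry is written before any connector is called.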
3.3 Data model (minimum viable)
| Entity | Fields (illustrative) |
|---|---|
| Subnet / VLAN | id, vlan_id, cidr, name, environment, routing notes |
| IP assignment | address, hostname, vmid?, mac?, vlan, owner_team, service, source_ref (ip-addresses.conf key), status |
| Physical port map | switch_id, switch_port, panel_ref, far_end_host, far_end_nic, vlan_membership, speed, lacp_group |
| Host / hypervisor | serial, model, cluster node, CPU/RAM/disk summary (from poll script / Proxmox) |
| License / entitlement | sku_id, seat_count, valid_from/to, bound_org_or_project, external_ref (Stripe/subscription id) |
| Provisioning job | type (create_ct, resize_disk, assign_ip), payload, status, correlation_id |
Start with Postgres (you already run many PG instances; a dedicated small CT for IT data avoids coupling to app databases).
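
To make the table concrete, here is a sketch of two of the entities as Python dataclasses, plus one consistency check the data model makes possible (an assignment must sit inside the CIDR of its VLAN). Field names follow the table; the sample address, the `EXAMPLE_PORTAL_IP` key, and the `belongs_to` helper are assumptions for illustration only.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Subnet:
    vlan_id: int
    cidr: str
    name: str


@dataclass(frozen=True)
class IpAssignment:
    address: str
    hostname: str
    vlan: int
    owner_team: str
    service: str
    source_ref: str              # key in ip-addresses.conf
    status: str = "active"
    vmid: Optional[int] = None   # optional, as in the table
    mac: Optional[str] = None


def belongs_to(assignment: IpAssignment, subnet: Subnet) -> bool:
    """True when the assignment's VLAN and address both match the subnet."""
    same_vlan = assignment.vlan == subnet.vlan_id
    in_range = ipaddress.ip_address(assignment.address) in ipaddress.ip_network(subnet.cidr)
    return same_vlan and in_range


flat = Subnet(vlan_id=11, cidr="192.168.11.0/24", name="flat-production")
portal = IpAssignment(
    address="192.168.11.51",     # hypothetical address
    hostname="portal",
    vlan=11,
    owner_team="platform",
    service="sankofa-portal",
    source_ref="EXAMPLE_PORTAL_IP",  # hypothetical key
    vmid=7801,
)
stray = IpAssignment(
    address="10.0.0.5", hostname="stray", vlan=11,
    owner_team="platform", service="unknown", source_ref="EXAMPLE_STRAY_IP",
)
```

The same check becomes useful during VLAN migration, when an address can be moved to a new subnet while its `vlan` field still says 11.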
3.4 Billing and licenses
Treat billing as integrations, not a from-scratch ERP:
- Licenses / seats: map to entitlements table + Keycloak groups or custom claims for “can open IT console / can approve provision.”
- Usage metering: Proxmox storage and CPU per VMID, NPM bandwidth (optional), public IP count — async jobs pushing aggregates nightly.
- Invoicing: export to Stripe Billing, QuickBooks, or NetSuite via CSV/API; the controller shows status and line items, and does not need to be a full double-entry ledger on day one.
Partner marketplace pricing already has patterns in dbis_core; native infra SKUs should either reuse IruOffering-style tables or link by external_sku_id to avoid two unrelated catalogs.
4. VLAN and efficiency priorities (what matters most first)
Aligned with NETWORK_CONFIGURATION_MASTER.md:
- Document + enforce IP uniqueness before VLAN migration (ARP incidents already noted around Keycloak IP in E2E docs). Automated diff: live ip neigh / Proxmox CT IPs vs IPAM.
- Segment in this order: (a) out-of-band / IPMI if any, (b) tenant-facing workloads (VLAN 200+), (c) Besu validators/RPC, (d) Sankofa app tier — so blast radius reduction matches risk.
- Use spare cluster capacity: r630-03 / r630-04 are cluster members with large local/ceph-related storage; placing new stateless or batch workloads there reduces pressure on r630-01/02 (see network master narrative).
- ML110 cutover: WAN aggregator repurpose changes .10 from Proxmox to firewall; the controller’s IPAM must flag migration status per host.
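
The IP uniqueness check in the first item can be sketched as a small grouping step over the guest list. This is an illustrative sketch: the `ip` and `vmid` field names and all sample values are assumptions, not the format the repo's collector actually emits.

```python
from collections import defaultdict


def find_duplicate_ips(guests: list[dict]) -> dict[str, list[int]]:
    """Map each IP claimed by more than one guest to the VMIDs claiming it."""
    owners = defaultdict(list)
    for guest in guests:
        owners[guest["ip"]].append(guest["vmid"])
    return {ip: sorted(vmids) for ip, vmids in owners.items() if len(vmids) > 1}


# Hypothetical sample data.
guests = [
    {"vmid": 7801, "ip": "192.168.11.51"},
    {"vmid": 7803, "ip": "192.168.11.52"},
    {"vmid": 7802, "ip": "192.168.11.52"},
]
duplicates = find_duplicate_ips(guests)
```

A non-empty result is the condition under which an export should fail or alert before any VLAN move is attempted.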
5. Port mapping deliverables
| Layer | Tool / owner | Output |
|---|---|---|
| Physical (UniFi XG + patch) | IT + DCIM template | Spreadsheet or NetBox cables + interfaces |
| UDM | UniFi export + manual | Port forward matrix (already partially in network master) |
| NPM | scripts/nginx-proxy-manager/update-npmplus-proxy-hosts-api.sh + API | Proxy host rows = logical port map to upstream |
| Proxmox | vmbr, VLAN-aware flags | Map CT net0 → bridge → VLAN |
The HTML controller should show a joined view: public hostname → NPM → LAN IP:port → VMID → node → switch port (where data exists).
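
One way to build that joined view is a left join that starts from the NPM proxy hosts and leaves later columns empty where data does not exist yet. The sketch below assumes three already-collected inputs with hypothetical shapes and sample values; none of the field names come from the real NPM, Proxmox, or UniFi payloads.

```python
def join_port_map(proxy_hosts, guests_by_ip, switch_ports_by_node):
    """Join hostname -> upstream -> VMID -> node -> switch port, tolerating gaps."""
    rows = []
    for host in proxy_hosts:
        guest = guests_by_ip.get(host["upstream_ip"], {})
        node = guest.get("node")
        rows.append({
            "hostname": host["hostname"],
            "upstream": f"{host['upstream_ip']}:{host['upstream_port']}",
            "vmid": guest.get("vmid"),                       # None if no guest matches
            "node": node,
            "switch_port": switch_ports_by_node.get(node),   # None until imported
        })
    return rows


# Hypothetical sample data.
proxy_hosts = [
    {"hostname": "admin.sankofa.nexus", "upstream_ip": "192.168.11.51", "upstream_port": 3000},
    {"hostname": "orphan.example", "upstream_ip": "192.168.11.99", "upstream_port": 80},
]
guests_by_ip = {"192.168.11.51": {"vmid": 7801, "node": "r630-01"}}
switch_ports_by_node = {}  # physical layer not imported yet

rows = join_port_map(proxy_hosts, guests_by_ip, switch_ports_by_node)
```

Rows with a `None` VMID are themselves a useful signal: a public hostname that points at an address no guest owns.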
5.1 Live data strategy (source of truth)
| Layer | Primary live source | Declared fallback | Drift handling |
|---|---|---|---|
| VMID, node, status, guest IP | Proxmox: pvesh get /cluster/resources + guest config files on shared /etc/pve | ALL_VMIDS_ENDPOINTS.md | VMID/IP mismatch; guests only in doc or only on cluster |
| Hypervisor capacity | scripts/verify/poll-proxmox-cluster-hardware.sh | PROXMOX_HOSTS_COMPLETE_HARDWARE_CONFIG.md | Refresh after hardware changes |
| LAN env keys | Parsed literals from ip-addresses.conf | Same file in git | guest_ips_not_in_ip_addresses_conf vs ip_addresses_conf_ips_not_on_guests; exclude PROXMOX_HOST_*, NETWORK_GATEWAY, UDM_PRO_*, WAN_AGGREGATOR_* from “missing guest” noise |
| Public edge | NPM API (fleet scripts) | E2E tables | Hostname → upstream drift |
| Switch/AP | UniFi Network API | NetBox / spreadsheet | Manual until imported |
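
The drift handling for LAN env keys amounts to two set differences, with the infrastructure keys filtered out of the "declared but not on any guest" side. The sketch below is not the repo's compute_ipam_drift.py; it assumes the declared literals have already been parsed into a key-to-IP dict, and the sample keys and addresses are hypothetical.

```python
from fnmatch import fnmatchcase

# Key patterns excluded from "missing guest" noise, as listed in the table.
EXCLUDED_KEY_PATTERNS = (
    "PROXMOX_HOST_*", "NETWORK_GATEWAY", "UDM_PRO_*", "WAN_AGGREGATOR_*",
)


def compute_drift(declared: dict[str, str], guest_ips: set[str]) -> dict[str, list[str]]:
    """Compare declared literals (key -> IP) against IPs seen on live guests."""
    declared_ips = set(declared.values())
    # Only keys that are expected to map to a guest count as "missing guest".
    declared_guest_ips = {
        ip for key, ip in declared.items()
        if not any(fnmatchcase(key, pattern) for pattern in EXCLUDED_KEY_PATTERNS)
    }
    return {
        "guest_ips_not_in_ip_addresses_conf": sorted(guest_ips - declared_ips),
        "ip_addresses_conf_ips_not_on_guests": sorted(declared_guest_ips - guest_ips),
    }


# Hypothetical sample data.
declared = {
    "PROXMOX_HOST_R630_01": "192.168.11.11",
    "NETWORK_GATEWAY": "192.168.11.1",
    "EXAMPLE_PORTAL_IP": "192.168.11.51",
    "EXAMPLE_RETIRED_IP": "192.168.11.99",
}
guest_ips = {"192.168.11.51", "192.168.11.60"}
drift = compute_drift(declared, guest_ips)
```

Note that the exclusion applies only to the second list: a guest using a hypervisor or gateway address should still match the declared set rather than being reported as undeclared.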
Freshness: every artifact includes ISO8601 collected_at; failed collectors must record error in drift.json and must not be presented as current in the IT UI.
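
A UI-side guard for the freshness rule might look like the sketch below. The `collected_at` and `error` fields come from the rule above; the eight-day maximum age is an assumption (weekly schedule plus slack), and the function expects timestamps that carry a UTC offset.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: weekly collection plus one day of slack.
MAX_AGE = timedelta(days=8)


def is_current(artifact: dict, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """An artifact is presentable only if it has no error and a recent collected_at."""
    if artifact.get("error"):
        return False
    collected_at = artifact.get("collected_at")
    if not collected_at:
        return False
    # Expects an offset-bearing ISO 8601 string such as 2026-04-09T06:00:00+00:00.
    age = now - datetime.fromisoformat(collected_at)
    return timedelta(0) <= age <= max_age


now = datetime(2026, 4, 9, 12, 0, tzinfo=timezone.utc)
fresh = {"collected_at": "2026-04-09T06:00:00+00:00"}
failed = {"collected_at": "2026-04-09T06:00:00+00:00", "error": "seed_unreachable"}
stale = {"collected_at": "2026-03-01T00:00:00+00:00"}
```

Treating a missing `collected_at` as not current keeps the default safe: data without freshness metadata is never shown as live.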
6. Phased roadmap
| Phase | Scope | Exit criteria |
|---|---|---|
| 0 — Inventory hardening (live-first) | Runtime truth: Proxmox pvesh /cluster/resources + per-guest config (net0 / ipconfig0) for IP, merged with config/ip-addresses.conf as declared literals; emit live_inventory.json + drift.json with collected_at; duplicate guest IPs → fail or alert. Scripts (add under scripts/it-ops/): export-live-inventory-and-drift.sh (SSH to seed node, pipe lib/collect_inventory_remote.py), compute_ipam_drift.py (merge + drift). CI: .github/workflows/live-inventory-drift.yml — workflow_dispatch + weekly schedule; on GitHub-hosted runners without LAN, collector exits 0 after writing drift.json with seed_unreachable. UI/BFF later: never show inventory without freshness metadata. | |
| 1 — Read-only IT dashboard | Keycloak group sankofa-it-admin; SPA pages: IPs, VLAN plan (current vs target), cluster nodes, hardware poll link | IT can onboard without SSH |
| 2 — Port map CRUD | DB + UI for switch/port; import from UniFi API | Export CSV/NetBox |
| 3 — Controlled provisioning | BFF + Proxmox API: start/stop scoped CT, dry-run default (align with proxmox-production-safety rules) | Audit log + allowlists |
| 4 — Entitlements + billing hooks | License assignment UI; Stripe (or chosen) webhook → entitlement | Invoice export for finance |
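
Phase 3's "dry-run default" plus allowlist can be sketched as a planner that only builds the Proxmox API call and never sends it unless the caller opts out of dry-run. The function name, return shape, and sample node are hypothetical; the path follows the Proxmox VE API's container status endpoints.

```python
def plan_ct_action(node: str, vmid: int, action: str,
                   allowlist: set[int], dry_run: bool = True) -> dict:
    """Build (but do not send) a start/stop request for an allowlisted CT."""
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action}")
    if vmid not in allowlist:
        raise PermissionError(f"VMID {vmid} is not on the allowlist")
    return {
        "method": "POST",
        "path": f"/nodes/{node}/lxc/{vmid}/status/{action}",
        "dry_run": dry_run,  # the caller sends the request only when this is False
    }


plan = plan_ct_action("r630-01", 7801, "stop", allowlist={7801})
```

Keeping planning separate from execution means the same plan object can be written to the audit log and shown to an approver before anything touches the cluster.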
7. Security and governance
- Separate IT super-admin from client admin.sankofa.nexus users (different Keycloak groups).
- MFA required for IT group; break-glass local Proxmox access documented, not exposed in UI.
- Change management: any write to network edge (UDM) or production Proxmox requires ticket id in API payload (optional field, enforced in policy later).
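
The "optional now, enforced later" ticket rule can be expressed as a single policy switch. In the sketch below the payload field names (`ticket_id`, `target`), the `GUARDED_TARGETS` values, and the return shape are all assumptions for illustration.

```python
# Hypothetical identifiers for the two guarded surfaces named above.
GUARDED_TARGETS = {"udm", "proxmox-production"}


def check_ticket_policy(payload: dict, enforce_ticket: bool = False) -> dict:
    """Warn about a missing ticket id today; reject once the policy is enforced."""
    ticket_id = (payload.get("ticket_id") or "").strip()
    guarded = payload.get("target") in GUARDED_TARGETS
    if not guarded or ticket_id:
        return {"allowed": True, "warnings": []}
    message = "ticket_id missing for write to guarded target"
    if enforce_ticket:
        return {"allowed": False, "warnings": [message]}
    return {"allowed": True, "warnings": [message]}


lenient = check_ticket_policy({"target": "udm"})
strict = check_ticket_policy({"target": "udm"}, enforce_ticket=True)
ticketed = check_ticket_policy({"target": "udm", "ticket_id": "CHG-1"}, enforce_ticket=True)
```

Logging the warning during the lenient period shows how many writes would be blocked before enforcement is switched on.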
8. Related documents
| Topic | Doc |
|---|---|
| IPs, VLAN plan, port forwards | NETWORK_CONFIGURATION_MASTER.md |
| VMID ↔ IP | ALL_VMIDS_ENDPOINTS.md |
| Cabling / 10G | 13_NODE_NETWORK_AND_CABLING_CHECKLIST.md |
| Marketplace vs portal | SANKOFA_MARKETPLACE_SURFACES.md |
| FQDN roles | EXPECTED_WEB_CONTENT.md |
| Hardware poll | scripts/verify/poll-proxmox-cluster-hardware.sh, reports/status/hardware_and_connected_inventory_*.md |
| Proxmox safety | .cursor/rules/proxmox-production-safety.mdc, scripts/lib/proxmox-production-guard.sh |
9. Next engineering actions (concrete)
Done in-repo (Phase 0+):
- scripts/it-ops/ — remote collector (lib/collect_inventory_remote.py), compute_ipam_drift.py (merges ip-addresses.conf + ALL_VMIDS_ENDPOINTS.md table rows), export-live-inventory-and-drift.sh → reports/status/live_inventory.json + drift.json.
- Read API stub — services/sankofa-it-read-api/server.py (GET /health, /v1/inventory/live, /v1/inventory/drift; POST refresh with API key). systemd example: config/systemd/sankofa-it-read-api.service.example.
- Workflow .github/workflows/live-inventory-drift.yml — workflow_dispatch + weekly; artifacts; no LAN on default runners.
- Validation — scripts/validation/validate-config-files.sh runs py_compile on IT scripts + read API.
- Docs — SANKOFA_IT_OPS_LIVE_INVENTORY_SCRIPTS.md, SANKOFA_IT_OPS_KEYCLOAK_PORTAL_NEXT_STEPS.md.
- Keycloak automation (proxmox repo) — scripts/deployment/keycloak-sankofa-ensure-it-admin-role.sh creates realm role sankofa-it-admin; operators still assign the role to users in Admin Console.
- Portal /it (Sankofa/portal repo, sibling clone) — src/app/it/page.tsx, src/app/api/it/* (server proxy + IT_READ_API_URL / IT_READ_API_KEY on CT 7801); credentials ADMIN propagated into JWT roles for bootstrap (src/lib/auth.ts).
- LAN schedule examples — config/systemd/sankofa-it-inventory-export.timer.example + .service.example for weekly export-live-inventory-and-drift.sh.
- LAN bootstrap + edge — scripts/deployment/bootstrap-sankofa-it-read-api-lan.sh (read API on PVE /opt/proxmox, portal env merge, weekly timer on PVE); scripts/nginx-proxy-manager/upsert-it-read-api-proxy-host.sh; scripts/cloudflare/add-it-api-sankofa-dns.sh.
Architecture decision (BFF home): SANKOFA_IT_API_DEPLOYMENT_DECISION.md — read API stays in proxmox repo; full BFF + Postgres on a dedicated CT; dbis_core links via external_sku_id only.
Collectors contract: IT_LIVE_COLLECTORS_CONTRACT.md + config/it-operations/live-collectors-contract.json. Port map spec: IT_PORT_MAP_LAYERS_SPEC.md.
Operator automation: VLAN ordered checklist VLAN_FLAT_11_TO_SEGMENTED_RUNBOOK.md + scripts/it-ops/vlan-segmentation-ordered-checklist.sh. Guarded Proxmox preview: scripts/it-ops/proxmox-guarded-write-adapter.sh. Optional SQLite history: set IT_BFF_SNAPSHOT_DB when running export. Gitea weekly: .gitea/workflows/live-inventory-hardware-weekly.yml.
Remaining (operators / later phases):
- OIDC validation on the read API (or replacement BFF) — set IT_BFF_OIDC_ISSUER when ready; today the portal proxies with server-side API key.
- Keycloak — assign IT users to group sankofa-it-admin or map realm role; enforce MFA per SANKOFA_IT_OPS_KEYCLOAK_PORTAL_NEXT_STEPS.md.
- TLS for it-api.sankofa.nexus — scripts/deployment/request-it-api-tls-npm.sh (or CERT_DOMAINS_FILTER='it-api\.sankofa\.nexus' + request-npmplus-certificates.sh). Duplicate guest IPs (export exit 2) — remediate on cluster.
- UniFi / NPM live collectors — Phase 2; stub: GET /v1/portmap/joined on read API.
- Billing webhook — schema in config/it-operations/entitlements-schema.sql; outline IT_OPERATIONS_BILLING_STRIPE_OUTLINE.md.
This spec does not replace change control; it gives you a single product vision so IP, VLAN, ports, hosts, licenses, and billing support evolve together instead of in silos.