Sankofa IT operations controller — architecture spec

Status: Draft for engineering and IT leadership alignment
Last updated: 2026-04-09 (ADR, collectors contract, VLAN runbook, BFF endpoints, Phase 3–4 skeletons)
Audience: IT team, platform ops, Sankofa admin product owners


1. Goals

You need a single operational program that covers:

| Capability | Intent |
| --- | --- |
| IP inventory | Authoritative list of every LAN/WAN/VIP assignment, owner, service, and lifecycle (no drift between spreadsheets and config/ip-addresses.conf). |
| VLAN design | Move from today's flat VLAN 11 to the planned segmentation (validators, RPC, explorer, Sankofa services, tenants) without breaking production. |
| Port mapping | Physical: switch port ↔ patch panel ↔ host NIC ↔ logical bond/VLAN. Logical: UDM port forwards ↔ NPM host ↔ upstream CT/VM. |
| Host efficiency | Compare actual Proxmox capacity (CPU/RAM/storage/network) to workload placement; drive consolidation, spare-node use, and subscription/licensing discipline. |
| IT admin UI | HTML controller under the Sankofa admin surface so the IT team can view/control interfaces, assign licenses/entitlements, run provisioning workflows, and support billing (quotes, usage, invoice handoff). |

This document defines how that fits your existing stack (Proxmox cluster, UDM Pro, UniFi, NPMplus, Keycloak, Phoenix/dbis_core marketplace) and a phased path so you do not boil the ocean.


2. Current state (facts from this repo)

  • IP truth is split across config/ip-addresses.conf, docs/04-configuration/ALL_VMIDS_ENDPOINTS.md, and docs/11-references/NETWORK_CONFIGURATION_MASTER.md. Automated snapshots: scripts/verify/poll-proxmox-cluster-hardware.sh, reports/status/hardware_and_connected_inventory_*.md.
  • VLANs: Production today is VLAN 11 only for 192.168.11.0/24. Planned VLANs (110–112, 120, 160, 200–203) are documented in NETWORK_CONFIGURATION_MASTER.md but not implemented as separate broadcast domains on the wire.
  • Sankofa admin: admin.sankofa.nexus is client SSO administration today (same upstream as the portal unless split). See FQDN_EXPECTED_CONTENT.md, ALL_VMIDS_ENDPOINTS.md (VMID 7801). Portal source: sibling repo Sankofa/portal (scripts/deployment/sync-sankofa-portal-7801.sh).
  • Marketplace / commercial: Partner IRU flows live in dbis_core (API + React); native infra is mostly docs + Proxmox, not one database. See SANKOFA_MARKETPLACE_SURFACES.md.

Gap: There is no single product today that unifies IPAM, switch port data, Proxmox actions, UniFi, and billing under Sankofa admin. This spec is the blueprint to add it.


3.1 UI placement

| Option | Pros | Cons |
| --- | --- | --- |
| A: New /it (or /ops) app route inside Sankofa/portal, gated by Keycloak group sankofa-it-admin | One TLS hostname, shared session patterns, fastest path for “under admin” | Portal bundle grows; must isolate client admin vs IT super-admin |
| B: Dedicated host it.sankofa.nexus → small Next.js/Vite SPA + BFF | Strong separation, independent deploy cadence | Extra NPM row, cert, pipeline |
| C: Embed Grafana + NetBox only | Quick graphs / IPAM | Weak billing/licensing story; less “Sankofa branded” control |

Recommendation: Option A for MVP (fastest), with API on a dedicated backend so you can later move the shell to B without rewriting integrations.

3.2 Backend (“control plane API”)

Introduce a small service (for example sankofa-it-api) that is never reachable from the public internet without authentication:

  • Network: VLAN 11 only or private listener + NPM internal host; OIDC client credentials or user JWT from Keycloak.
  • Responsibilities:
    • Read models: IPAM, devices, port maps, Proxmox inventory snapshot, UniFi device list (cached).
    • Write models: change requests with audit log (who/when/what); optional approval queue for destructive actions.
    • Connectors (adapters): Proxmox API, UniFi Network API (UDM), NPM API (already scripted in repo), optional NetBox later.
  • Do not put Proxmox root tokens in the browser; BFF holds secrets server-side.
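The write-model bullet above (audit log plus an optional approval queue for destructive actions) can be sketched as follows. This is a minimal illustration, not the planned implementation: the `ChangeRequest` shape, the `submit` helper, and the contents of `DESTRUCTIVE_ACTIONS` are assumptions introduced here, since the spec does not enumerate which actions count as destructive.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical: the spec does not list which actions are destructive.
DESTRUCTIVE_ACTIONS = {"stop_ct", "destroy_ct", "resize_disk"}


def _utc_now() -> str:
    """ISO 8601 timestamp in UTC for the audit trail."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class ChangeRequest:
    actor: str    # who
    action: str   # what kind of change
    target: str   # what it applies to (VMID, IP, switch port, ...)
    created_at: str = field(default_factory=_utc_now)  # when
    status: str = "pending"


def submit(request: ChangeRequest, audit_log: list) -> ChangeRequest:
    """Record the request and route destructive actions to the approval queue."""
    if request.action in DESTRUCTIVE_ACTIONS:
        request.status = "needs_approval"
    else:
        request.status = "approved"
    audit_log.append(
        {
            "who": request.actor,
            "when": request.created_at,
            "what": f"{request.action} {request.target}",
            "status": request.status,
        }
    )
    return request
```

Every request is appended to the log regardless of outcome, so the audit trail also captures requests that are still waiting for approval.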

3.3 Data model (minimum viable)

| Entity | Fields (illustrative) |
| --- | --- |
| Subnet / VLAN | id, vlan_id, cidr, name, environment, routing notes |
| IP assignment | address, hostname, vmid?, mac?, vlan, owner_team, service, source_ref (ip-addresses.conf key), status |
| Physical port map | switch_id, switch_port, panel_ref, far_end_host, far_end_nic, vlan_membership, speed, lacp_group |
| Host / hypervisor | serial, model, cluster node, CPU/RAM/disk summary (from poll script / Proxmox) |
| License / entitlement | sku_id, seat_count, valid_from/to, bound_org_or_project, external_ref (Stripe/subscription id) |
| Provisioning job | type (create_ct, resize_disk, assign_ip), payload, status, correlation_id |

Start with Postgres (you already run many PG instances; a dedicated small CT for IT data avoids coupling to app databases).
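As an illustration of the first two entities, the sketch below models Subnet / VLAN and IP assignment as Python dataclasses and adds one consistency check. The field names follow the table; the `belongs_to` helper and all sample values in the usage note are hypothetical and not taken from the repo.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Subnet:
    id: int
    vlan_id: int
    cidr: str
    name: str
    environment: str


@dataclass(frozen=True)
class IPAssignment:
    address: str
    hostname: str
    vlan: int
    owner_team: str
    service: str
    source_ref: str              # key in ip-addresses.conf
    status: str
    vmid: Optional[int] = None   # optional, as marked "vmid?" in the table
    mac: Optional[str] = None    # optional, as marked "mac?" in the table


def belongs_to(assignment: IPAssignment, subnet: Subnet) -> bool:
    """True when the VLAN ids match and the address lies inside the subnet CIDR."""
    address = ipaddress.ip_address(assignment.address)
    network = ipaddress.ip_network(subnet.cidr)
    return address in network and assignment.vlan == subnet.vlan_id
```

For example, an assignment with address 192.168.11.50 and vlan 11 belongs to a subnet declared as 192.168.11.0/24 with vlan_id 11, while an address outside that CIDR does not. A check like this could run as a database constraint or an import-time validation.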

3.4 Billing and licenses

Treat billing as integrations, not a from-scratch ERP:

  • Licenses / seats: map to entitlements table + Keycloak groups or custom claims for “can open IT console / can approve provision.”
  • Usage metering: Proxmox storage and CPU per VMID, NPM bandwidth (optional), public IP count — async jobs pushing aggregates nightly.
  • Invoicing: export to Stripe Billing, QuickBooks, or NetSuite via CSV/API; the controller shows status and line items, not necessarily a full double-entry ledger on day one.

Partner marketplace pricing already has patterns in dbis_core; native infra SKUs should either reuse IruOffering-style tables or link by external_sku_id to avoid two unrelated catalogs.
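The nightly usage-metering job described above could reduce raw samples to one aggregate per VMID before anything is pushed to billing. The sketch below assumes a hypothetical sample shape (`vmid`, `cpu_seconds`, `storage_gib`); how the samples are actually collected from Proxmox is out of scope here.

```python
from collections import defaultdict


def aggregate_usage(samples):
    """Collapse raw samples into one record per VMID.

    samples: iterable of dicts with keys vmid, cpu_seconds, storage_gib
    (a hypothetical shape chosen for this sketch).
    CPU time is summed; storage is reported as the peak observed value.
    """
    totals = defaultdict(lambda: {"cpu_seconds": 0.0, "storage_gib_peak": 0.0})
    for sample in samples:
        record = totals[sample["vmid"]]
        record["cpu_seconds"] += sample["cpu_seconds"]
        record["storage_gib_peak"] = max(
            record["storage_gib_peak"], sample["storage_gib"]
        )
    return dict(totals)
```

Summing CPU and taking peak storage is one possible billing convention; the chosen finance system may require averages or time-weighted values instead.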


4. VLAN and efficiency priorities (what matters most first)

Aligned with NETWORK_CONFIGURATION_MASTER.md:

  1. Document + enforce IP uniqueness before VLAN migration (ARP incidents already noted around Keycloak IP in E2E docs). Automated diff: live ip neigh / Proxmox CT IPs vs IPAM.
  2. Segment in this order: (a) out-of-band / IPMI if any, (b) tenant-facing workloads (VLAN 200+), (c) Besu validators/RPC, (d) Sankofa app tier — so blast radius reduction matches risk.
  3. Use spare cluster capacity: r630-03 / r630-04 are cluster members with large local/ceph-related storage; placing new stateless or batch workloads there reduces pressure on r630-01/02 (see network master narrative).
  4. ML110 cutover: WAN aggregator repurpose changes .10 from Proxmox to firewall; the controller's IPAM must flag migration status per host.
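The IP-uniqueness check in item 1 reduces to grouping guests by address and reporting any address claimed more than once. A minimal sketch, assuming guest data is already available as (vmid, ip) pairs; the function name is hypothetical and this is not the repo's collector code:

```python
from collections import defaultdict


def find_duplicate_ips(guests):
    """Return {ip: sorted list of VMIDs} for every IP claimed by more than one guest.

    guests: iterable of (vmid, ip) pairs, e.g. parsed from Proxmox guest configs.
    """
    vmids_by_ip = defaultdict(set)
    for vmid, ip in guests:
        vmids_by_ip[ip].add(vmid)
    return {
        ip: sorted(vmids)
        for ip, vmids in vmids_by_ip.items()
        if len(vmids) > 1
    }
```

An empty result means the inventory is safe to use as a baseline for VLAN migration; a non-empty result should block the migration step until the conflict is remediated on the cluster.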

5. Port mapping deliverables

| Layer | Tool / owner | Output |
| --- | --- | --- |
| Physical (UniFi XG + patch) | IT + DCIM template | Spreadsheet or NetBox cables + interfaces |
| UDM | UniFi export + manual | Port forward matrix (already partially in network master) |
| NPM | scripts/nginx-proxy-manager/update-npmplus-proxy-hosts-api.sh + API | Proxy host rows = logical port map to upstream |
| Proxmox | vmbr, VLAN-aware flags | Map CT net0 → bridge → VLAN |

The HTML controller should show a joined view: public hostname → NPM → LAN IP:port → VMID → node → switch port (where data exists).
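The joined view can be thought of as a left join that starts at the NPM proxy hosts and tolerates missing data at each later hop. The sketch below assumes three already-collected inputs with hypothetical shapes; none of these structures are defined by the repo.

```python
def joined_view(npm_hosts, guests_by_ip, switch_port_by_node):
    """Join public hostname -> upstream -> VMID -> node -> switch port.

    npm_hosts: list of dicts with hostname, upstream_ip, upstream_port
    guests_by_ip: dict mapping guest IP -> {"vmid": ..., "node": ...}
    switch_port_by_node: dict mapping node name -> switch port label
    Missing links are reported as None rather than dropped, so gaps stay visible.
    """
    rows = []
    for host in npm_hosts:
        guest = guests_by_ip.get(host["upstream_ip"])
        node = guest["node"] if guest else None
        rows.append(
            {
                "hostname": host["hostname"],
                "upstream": f'{host["upstream_ip"]}:{host["upstream_port"]}',
                "vmid": guest["vmid"] if guest else None,
                "node": node,
                "switch_port": switch_port_by_node.get(node),
            }
        )
    return rows
```

Keeping rows with None values matches the "where data exists" caveat: a proxy host whose upstream IP matches no known guest is itself a drift signal worth showing in the UI.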


5.1 Live data strategy (source of truth)

| Layer | Primary live source | Declared fallback | Drift handling |
| --- | --- | --- | --- |
| VMID, node, status, guest IP | Proxmox: pvesh get /cluster/resources + guest config files on shared /etc/pve | ALL_VMIDS_ENDPOINTS.md | VMID/IP mismatch; guests only in doc or only on cluster |
| Hypervisor capacity | scripts/verify/poll-proxmox-cluster-hardware.sh | PROXMOX_HOSTS_COMPLETE_HARDWARE_CONFIG.md | Refresh after hardware changes |
| LAN env keys | Parsed literals from ip-addresses.conf | Same file in git | guest_ips_not_in_ip_addresses_conf vs ip_addresses_conf_ips_not_on_guests; exclude PROXMOX_HOST_*, NETWORK_GATEWAY, UDM_PRO_*, WAN_AGGREGATOR_* from “missing guest” noise |
| Public edge | NPM API (fleet scripts) | E2E tables | Hostname → upstream drift |
| Switch/AP | UniFi Network API | NetBox / spreadsheet | Manual until imported |

Freshness: every artifact includes ISO8601 collected_at; failed collectors must record error in drift.json and must not be presented as current in the IT UI.
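The LAN env keys row and the freshness rule can be illustrated with a small set-difference sketch. The two output field names and the excluded key prefixes come from the table above; the function signature and the example keys in the usage note are assumptions, and this is not the code in compute_ipam_drift.py.

```python
from datetime import datetime, timezone

# Keys that describe infrastructure rather than guests (from the table above).
EXCLUDED_PREFIXES = ("PROXMOX_HOST_", "NETWORK_GATEWAY", "UDM_PRO_", "WAN_AGGREGATOR_")


def compute_drift(declared, guest_ips, collected_at=None):
    """Compare declared IPs with live guest IPs.

    declared: dict mapping ip-addresses.conf key -> IP literal
    guest_ips: iterable of IPs observed on live guests
    collected_at: ISO 8601 timestamp; defaults to the current UTC time
    """
    live = set(guest_ips)
    all_declared = set(declared.values())
    # Only guest-like keys may be reported as missing from the cluster.
    declared_guest_ips = {
        ip for key, ip in declared.items() if not key.startswith(EXCLUDED_PREFIXES)
    }
    return {
        "collected_at": collected_at or datetime.now(timezone.utc).isoformat(),
        "guest_ips_not_in_ip_addresses_conf": sorted(live - all_declared),
        "ip_addresses_conf_ips_not_on_guests": sorted(declared_guest_ips - live),
    }
```

With a declared map containing a PROXMOX_HOST_* key, the hypervisor's own address never appears under ip_addresses_conf_ips_not_on_guests, which keeps the report limited to genuine guest drift.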


6. Phased roadmap

| Phase | Scope | Exit criteria |
| --- | --- | --- |
| 0: Inventory hardening (live-first) | Runtime truth: Proxmox pvesh /cluster/resources + per-guest config (net0 / ipconfig0) for IP, merged with config/ip-addresses.conf as declared literals; emit live_inventory.json + drift.json with collected_at; duplicate guest IPs → fail or alert. | Scripts (add under scripts/it-ops/): export-live-inventory-and-drift.sh (SSH to seed node, pipe lib/collect_inventory_remote.py), compute_ipam_drift.py (merge + drift). CI: .github/workflows/live-inventory-drift.yml with workflow_dispatch + weekly schedule; on GitHub-hosted runners without LAN, collector exits 0 after writing drift.json with seed_unreachable. UI/BFF later: never show inventory without freshness metadata. |
| 1: Read-only IT dashboard | Keycloak group sankofa-it-admin; SPA pages: IPs, VLAN plan (current vs target), cluster nodes, hardware poll link | IT can onboard without SSH |
| 2: Port map CRUD | DB + UI for switch/port; import from UniFi API | Export CSV/NetBox |
| 3: Controlled provisioning | BFF + Proxmox API: start/stop scoped CT, dry-run default (align with proxmox-production-safety rules) | Audit log + allowlists |
| 4: Entitlements + billing hooks | License assignment UI; Stripe (or chosen) webhook → entitlement | Invoice export for finance |
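Phase 3's "dry-run default" plus allowlists can be sketched as a planner that only describes the Proxmox API call and never performs it unless explicitly asked. The allowlist contents and the node name in the usage note are hypothetical; the path follows the Proxmox VE API convention for LXC status actions.

```python
# Hypothetical allowlists; a real deployment would load these from policy config.
ALLOWED_ACTIONS = {"start", "stop"}
ALLOWED_VMIDS = {7801}


def plan_ct_action(node, vmid, action, execute=False):
    """Validate a container action and return the API call that would be made.

    The function itself performs no network I/O. With execute=False (the default)
    the returned plan is marked as a dry run, so a caller must opt in to changes.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action {action!r} is not allowlisted")
    if vmid not in ALLOWED_VMIDS:
        raise PermissionError(f"VMID {vmid} is not allowlisted")
    return {
        "method": "POST",
        "path": f"/nodes/{node}/lxc/{vmid}/status/{action}",
        "dry_run": not execute,
    }
```

Calling plan_ct_action("r630-01", 7801, "stop") yields a plan with dry_run set to True, which a BFF could log and display before any operator confirms execution.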

7. Security and governance

  • Separate IT super-admin from client admin.sankofa.nexus users (different Keycloak groups).
  • MFA required for IT group; break-glass local Proxmox access documented, not exposed in UI.
  • Change management: any write to the network edge (UDM) or production Proxmox should carry a ticket id in the API payload. The field is optional for now; policy enforcement comes later.

| Topic | Doc |
| --- | --- |
| IPs, VLAN plan, port forwards | NETWORK_CONFIGURATION_MASTER.md |
| VMID ↔ IP | ALL_VMIDS_ENDPOINTS.md |
| Cabling / 10G | 13_NODE_NETWORK_AND_CABLING_CHECKLIST.md |
| Marketplace vs portal | SANKOFA_MARKETPLACE_SURFACES.md |
| FQDN roles | EXPECTED_WEB_CONTENT.md |
| Hardware poll | scripts/verify/poll-proxmox-cluster-hardware.sh, reports/status/hardware_and_connected_inventory_*.md |
| Proxmox safety | .cursor/rules/proxmox-production-safety.mdc, scripts/lib/proxmox-production-guard.sh |

9. Next engineering actions (concrete)

Done in-repo (Phase 0+):

  1. scripts/it-ops/: remote collector (lib/collect_inventory_remote.py), compute_ipam_drift.py (merges ip-addresses.conf + ALL_VMIDS_ENDPOINTS.md table rows), export-live-inventory-and-drift.sh → reports/status/live_inventory.json + drift.json.
  2. Read API stub: services/sankofa-it-read-api/server.py (GET /health, /v1/inventory/live, /v1/inventory/drift; POST refresh with API key). systemd example: config/systemd/sankofa-it-read-api.service.example.
  3. Workflow: .github/workflows/live-inventory-drift.yml (workflow_dispatch + weekly; artifacts; no LAN on default runners).
  4. Validation: scripts/validation/validate-config-files.sh runs py_compile on IT scripts + read API.
  5. Docs: SANKOFA_IT_OPS_LIVE_INVENTORY_SCRIPTS.md, SANKOFA_IT_OPS_KEYCLOAK_PORTAL_NEXT_STEPS.md.
  6. Keycloak automation (proxmox repo): scripts/deployment/keycloak-sankofa-ensure-it-admin-role.sh creates realm role sankofa-it-admin; operators still assign the role to users in Admin Console.
  7. Portal /it (Sankofa/portal repo, sibling clone): src/app/it/page.tsx, src/app/api/it/* (server proxy + IT_READ_API_URL / IT_READ_API_KEY on CT 7801); credentials ADMIN propagated into JWT roles for bootstrap (src/lib/auth.ts).
  8. LAN schedule examples: config/systemd/sankofa-it-inventory-export.timer.example + .service.example for weekly export-live-inventory-and-drift.sh.
  9. LAN bootstrap + edge: scripts/deployment/bootstrap-sankofa-it-read-api-lan.sh (read API on PVE /opt/proxmox, portal env merge, weekly timer on PVE); scripts/nginx-proxy-manager/upsert-it-read-api-proxy-host.sh; scripts/cloudflare/add-it-api-sankofa-dns.sh.
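To make the read API's GET surface in item 2 concrete, the sketch below shows one way the three routes could map onto the exported JSON artifacts. It is an assumption-laden illustration, not the contents of server.py: `handle_get`, the `ROUTES` table, and the 503 behaviour for a missing artifact are all hypothetical, and authentication is omitted.

```python
import json
from pathlib import Path

# Route -> artifact filename, matching the files written by the export script.
ROUTES = {
    "/v1/inventory/live": "live_inventory.json",
    "/v1/inventory/drift": "drift.json",
}


def handle_get(path, reports_dir):
    """Return (status_code, payload) for a GET request path.

    reports_dir: directory holding the exported artifacts,
    e.g. reports/status in the repo checkout.
    """
    if path == "/health":
        return 200, {"status": "ok"}
    filename = ROUTES.get(path)
    if filename is None:
        return 404, {"error": "not found"}
    artifact = Path(reports_dir) / filename
    if not artifact.is_file():
        # No export has run yet: report unavailability instead of stale or empty data.
        return 503, {"error": f"{filename} not exported yet"}
    return 200, json.loads(artifact.read_text(encoding="utf-8"))
```

Keeping routing in a pure function makes it easy to test without sockets; an HTTP server wrapper would only serialize the returned payload and set the status code.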

Architecture decision (BFF home): SANKOFA_IT_API_DEPLOYMENT_DECISION.md — read API stays in proxmox repo; full BFF + Postgres on a dedicated CT; dbis_core links via external_sku_id only.

Collectors contract: IT_LIVE_COLLECTORS_CONTRACT.md + config/it-operations/live-collectors-contract.json. Port map spec: IT_PORT_MAP_LAYERS_SPEC.md.

Operator automation: VLAN ordered checklist VLAN_FLAT_11_TO_SEGMENTED_RUNBOOK.md + scripts/it-ops/vlan-segmentation-ordered-checklist.sh. Guarded Proxmox preview: scripts/it-ops/proxmox-guarded-write-adapter.sh. Optional SQLite history: set IT_BFF_SNAPSHOT_DB when running export. Gitea weekly: .gitea/workflows/live-inventory-hardware-weekly.yml.

Remaining (operators / later phases):

  1. OIDC validation on the read API (or replacement BFF) — set IT_BFF_OIDC_ISSUER when ready; today the portal proxies with server-side API key.
  2. Keycloak — assign IT users to group sankofa-it-admin or map realm role; enforce MFA per SANKOFA_IT_OPS_KEYCLOAK_PORTAL_NEXT_STEPS.md.
  3. TLS for it-api.sankofa.nexus: scripts/deployment/request-it-api-tls-npm.sh (or CERT_DOMAINS_FILTER='it-api\.sankofa\.nexus' + request-npmplus-certificates.sh). Duplicate guest IPs (export exit 2): remediate on cluster.
  4. UniFi / NPM live collectors — Phase 2; stub: GET /v1/portmap/joined on read API.
  5. Billing webhook — schema in config/it-operations/entitlements-schema.sql; outline IT_OPERATIONS_BILLING_STRIPE_OUTLINE.md.

This spec does not replace change control; it gives you a single product vision so IP, VLAN, ports, hosts, licenses, and billing support evolve together instead of in silos.