# Sankofa IT operations controller — architecture spec
**Status:** Draft for engineering and IT leadership alignment
**Last updated:** 2026-04-08 (Phase 0 live-first inventory section added)
**Audience:** IT team, platform ops, Sankofa admin product owners
---
## 1. Goals
You need a single operational program that covers:
| Capability | Intent |
|------------|--------|
| **IP inventory** | Authoritative list of every LAN/WAN/VIP assignment, owner, service, and lifecycle (no drift between spreadsheets and `config/ip-addresses.conf`). |
| **VLAN design** | Move from today's **flat VLAN 11** to the **planned segmentation** (validators, RPC, explorer, Sankofa services, tenants) without breaking production. |
| **Port mapping** | Physical: switch port ↔ patch panel ↔ host NIC ↔ logical bond/VLAN. Logical: UDM port forwards ↔ NPM host ↔ upstream CT/VM. |
| **Host efficiency** | Compare **actual** Proxmox capacity (CPU/RAM/storage/network) to workload placement; drive consolidation, spare-node use, and subscription/licensing discipline. |
| **IT admin UI** | **HTML controller** under the **Sankofa admin** surface so the IT team can view/control interfaces, assign **licenses/entitlements**, run **provisioning** workflows, and support **billing** (quotes, usage, invoices handoff). |
This document defines **how** that fits your existing stack (Proxmox cluster, UDM Pro, UniFi, NPMplus, Keycloak, Phoenix/dbis_core marketplace) and a **phased** path so you do not boil the ocean.
---
## 2. Current state (facts from this repo)
- **IP truth is split** across `config/ip-addresses.conf`, `docs/04-configuration/ALL_VMIDS_ENDPOINTS.md`, and `docs/11-references/NETWORK_CONFIGURATION_MASTER.md`. Automated snapshots: `scripts/verify/poll-proxmox-cluster-hardware.sh`, `reports/status/hardware_and_connected_inventory_*.md`.
- **VLANs:** Production today is **VLAN 11 only** for `192.168.11.0/24`. **Planned** VLANs (110–112, 120, 160, 200–203) are documented in [NETWORK_CONFIGURATION_MASTER.md](../11-references/NETWORK_CONFIGURATION_MASTER.md) but **not** implemented as separate broadcast domains on the wire.
- **Sankofa admin:** `admin.sankofa.nexus` is **client SSO administration** today (same upstream as the portal unless split). See [EXPECTED_WEB_CONTENT.md](EXPECTED_WEB_CONTENT.md), [ALL_VMIDS_ENDPOINTS.md](../04-configuration/ALL_VMIDS_ENDPOINTS.md) (VMID **7801**). Portal source: sibling repo **`Sankofa/portal`** (`scripts/deployment/sync-sankofa-portal-7801.sh`).
- **Marketplace / commercial:** Partner IRU flows live in **`dbis_core`** (API + React); native infra is mostly **docs + Proxmox**, not one database. See [SANKOFA_MARKETPLACE_SURFACES.md](../03-deployment/SANKOFA_MARKETPLACE_SURFACES.md).
**Gap:** There is **no** single product today that unifies IPAM, switch port data, Proxmox actions, UniFi, and billing under Sankofa admin. This spec is the blueprint to add it.
---
## 3. Target architecture (recommended)
### 3.1 UI placement
| Option | Pros | Cons |
|--------|------|------|
| **A — New `/it` (or `/ops`) app route** inside **`Sankofa/portal`**, gated by Keycloak group `sankofa-it-admin` | One TLS hostname, shared session patterns, fastest path for “under admin” | Portal bundle grows; must isolate client admin vs IT super-admin |
| **B — Dedicated host** `it.sankofa.nexus` → small **Next.js/Vite** SPA + BFF | Strong separation, independent deploy cadence | Extra NPM row, cert, pipeline |
| **C — Embed Grafana + NetBox** only | Quick graphs / IPAM | Weak billing/licensing story; less “Sankofa branded” control |
**Recommendation:** **Option A** for MVP (fastest), with **API on a dedicated backend** so you can later move the shell to **B** without rewriting integrations.
### 3.2 Backend (“control plane API”)
Introduce a **small service** (for example `sankofa-it-api`) that is **never** reachable from the public internet without authentication:
- **Network:** VLAN 11 only or **private listener** + NPM **internal** host; OIDC **client credentials** or **user JWT** from Keycloak.
- **Responsibilities:**
  - **Read models:** IPAM, devices, port maps, Proxmox inventory snapshot, UniFi device list (cached).
  - **Write models:** change requests with **audit log** (who/when/what); optional **approval** queue for destructive actions.
  - **Connectors (adapters):** Proxmox API, UniFi Network API (UDM), NPM API (already scripted in repo), optional NetBox later.
- **Do not** put Proxmox root tokens in the browser; **BFF** holds secrets server-side.
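
The role gate and audit trail described above could look roughly like the sketch below. It assumes the token has already been signature-verified upstream and that Keycloak's default access-token layout is in use, where realm roles appear under `realm_access.roles`. `ChangeRequest`, `submit_change`, and the in-memory `AUDIT_LOG` are hypothetical names used only for illustration; a real service would persist the log.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

IT_ADMIN_ROLE = "sankofa-it-admin"  # realm role named in this spec
AUDIT_LOG: list[dict] = []          # stand-in for a persistent audit table (assumption)


@dataclass
class ChangeRequest:
    action: str               # e.g. "assign_ip"
    payload: dict
    destructive: bool = False
    status: str = field(default="pending", init=False)


def has_it_admin_role(claims: dict) -> bool:
    """Check already-verified JWT claims for the IT admin realm role."""
    roles = claims.get("realm_access", {}).get("roles", [])
    return IT_ADMIN_ROLE in roles


def submit_change(claims: dict, request: ChangeRequest) -> ChangeRequest:
    """Reject callers without the role; queue destructive actions for approval."""
    if not has_it_admin_role(claims):
        raise PermissionError("caller lacks the sankofa-it-admin role")
    request.status = "awaiting_approval" if request.destructive else "accepted"
    AUDIT_LOG.append({
        "who": claims.get("preferred_username", "unknown"),
        "when": datetime.now(timezone.utc).isoformat(),
        "what": request.action,
        "status": request.status,
    })
    return request
```

Keeping the check in the backend, not in the SPA, matches the rule that secrets and authorization decisions stay server-side.
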
### 3.3 Data model (minimum viable)
| Entity | Fields (illustrative) |
|--------|------------------------|
| **Subnet / VLAN** | id, vlan_id, cidr, name, environment, routing notes |
| **IP assignment** | address, hostname, vmid?, mac?, vlan, owner_team, service, source_ref (`ip-addresses.conf` key), status |
| **Physical port map** | switch_id, switch_port, panel_ref, far_end_host, far_end_nic, vlan_membership, speed, lacp_group |
| **Host / hypervisor** | serial, model, cluster node, CPU/RAM/disk summary (from poll script / Proxmox) |
| **License / entitlement** | sku_id, seat_count, valid_from/to, bound_org_or_project, external_ref (Stripe/subscription id) |
| **Provisioning job** | type (create_ct, resize_disk, assign_ip), payload, status, correlation_id |
Start with **Postgres** (you already run many PG instances; a **dedicated small CT** for IT data avoids coupling to app databases).
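
As an illustration of the first two entities, the sketch below models them as Python dataclasses and checks that an assignment's address falls inside the subnet of its VLAN. Field names follow the table above; `validate_assignment` and the sample values in the usage note are hypothetical, and the real schema would live in Postgres.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Subnet:
    vlan_id: int
    cidr: str
    name: str


@dataclass(frozen=True)
class IpAssignment:
    address: str
    hostname: str
    vlan: int
    owner_team: str
    service: str
    source_ref: str            # key in ip-addresses.conf
    vmid: Optional[int] = None
    status: str = "active"


def validate_assignment(assignment: IpAssignment, subnets: list[Subnet]) -> None:
    """Raise ValueError unless the address sits inside the subnet of its VLAN."""
    by_vlan = {s.vlan_id: s for s in subnets}
    subnet = by_vlan.get(assignment.vlan)
    if subnet is None:
        raise ValueError(f"unknown VLAN {assignment.vlan}")
    if ipaddress.ip_address(assignment.address) not in ipaddress.ip_network(subnet.cidr):
        raise ValueError(f"{assignment.address} is outside {subnet.cidr}")
```

For example, with `Subnet(11, "192.168.11.0/24", "prod-flat")`, an assignment on VLAN 11 with address `10.0.0.5` would be rejected before it reaches the database.
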
### 3.4 Billing and licenses
Treat **billing** as **integrations**, not a from-scratch ERP:
- **Licenses / seats:** map to **entitlements** table + Keycloak **groups** or custom claims for “can open IT console / can approve provision.”
- **Usage metering:** Proxmox **storage and CPU** per VMID, NPM bandwidth (optional), public IP count — **async jobs** pushing aggregates nightly.
- **Invoicing:** export to **Stripe Billing**, **QuickBooks**, or **NetSuite** via CSV/API; the controller shows **status** and **line items**, not necessarily full double-entry ledger on day one.
Partner marketplace pricing already has patterns in **`dbis_core`**; **native** infra SKUs should either **reuse** `IruOffering`-style tables or **link** by `external_sku_id` to avoid two unrelated catalogs.
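
The nightly usage aggregation mentioned above might be sketched as follows. The sample shape (`UsageSample`) and the choice to bill CPU as summed core-hours and storage as the daily peak are assumptions for illustration, not decisions made by this spec.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageSample:
    vmid: int
    cpu_core_hours: float
    storage_gib: float      # point-in-time allocated storage


def nightly_aggregate(samples: list[UsageSample]) -> dict[int, dict[str, float]]:
    """Sum CPU core-hours and keep peak storage per VMID for one billing day."""
    totals: dict[int, dict[str, float]] = defaultdict(
        lambda: {"cpu_core_hours": 0.0, "peak_storage_gib": 0.0}
    )
    for s in samples:
        row = totals[s.vmid]
        row["cpu_core_hours"] += s.cpu_core_hours
        row["peak_storage_gib"] = max(row["peak_storage_gib"], s.storage_gib)
    return dict(totals)
```

The resulting per-VMID rows are what an export job would turn into invoice line items.
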
---
## 4. VLAN and efficiency priorities (what matters most first)
Aligned with [NETWORK_CONFIGURATION_MASTER.md](../11-references/NETWORK_CONFIGURATION_MASTER.md):
1. **Document + enforce IP uniqueness** before VLAN migration (ARP incidents already noted around Keycloak IP in E2E docs). Automated **diff**: live `ip neigh` / Proxmox CT IPs vs IPAM.
2. **Segment in this order:** (a) **out-of-band / IPMI** if any, (b) **tenant-facing** workloads (VLAN 200+), (c) **Besu validators/RPC**, (d) **Sankofa app tier** — so blast radius reduction matches risk.
3. **Use spare cluster capacity:** **r630-03** / **r630-04** are cluster members with large local/ceph-related storage; placing **new** stateless or batch workloads there reduces pressure on r630-01/02 (see network master narrative).
4. **ML110 cutover:** WAN aggregator repurpose changes **.10** from Proxmox to firewall; the controller's IPAM must flag **migration status** per host.
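
The automated diff in item 1 could start from something like the sketch below, which parses the text output of `ip neigh show` and compares it with IPAM records. The IPAM input shape (IP mapped to expected MAC) and both function names are assumptions for illustration.

```python
def parse_ip_neigh(output: str) -> dict[str, str]:
    """Map IP -> MAC from `ip neigh show` text, skipping entries without lladdr."""
    seen: dict[str, str] = {}
    for line in output.splitlines():
        tokens = line.split()
        if "lladdr" not in tokens:
            continue  # FAILED / INCOMPLETE entries carry no MAC
        mac_index = tokens.index("lladdr") + 1
        if mac_index < len(tokens):
            seen[tokens[0]] = tokens[mac_index].lower()
    return seen


def diff_against_ipam(live: dict[str, str], ipam: dict[str, str]) -> dict[str, list[str]]:
    """Compare live IP->MAC pairs with IPAM records (IP -> expected MAC)."""
    return {
        "unknown_ips": sorted(ip for ip in live if ip not in ipam),
        "mac_mismatch": sorted(
            ip for ip, mac in live.items() if ip in ipam and ipam[ip].lower() != mac
        ),
    }
```

A MAC mismatch for a recorded IP is the kind of signal that would have surfaced the ARP incidents before a VLAN migration.
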
---
## 5. Port mapping deliverables
| Layer | Tool / owner | Output |
|-------|----------------|--------|
| **Physical (UniFi XG + patch)** | IT + DCIM template | Spreadsheet or **NetBox** cables + interfaces |
| **UDM** | UniFi export + manual | Port forward matrix (already partially in network master) |
| **NPM** | `scripts/nginx-proxy-manager/update-npmplus-proxy-hosts-api.sh` + API | Proxy host rows = **logical** port map to upstream |
| **Proxmox** | `vmbr`, VLAN-aware flags | Map CT `net0` → bridge → VLAN |
The HTML controller should show a **joined view**: *public hostname → NPM → LAN IP:port → VMID → node → switch port* (where data exists).
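
A minimal sketch of that joined view, assuming each source has already been collected into plain dictionaries. The input field names (`domain`, `forward_host`, `forward_port`, `node`, `vmid`) are simplified placeholders and not the exact shapes returned by the NPM or Proxmox APIs.

```python
from typing import Optional


def joined_view(
    npm_hosts: list[dict],
    guests_by_ip: dict[str, dict],
    switch_port_by_node: dict[str, str],
) -> list[dict]:
    """Join hostname -> upstream -> guest -> node -> switch port, leaving gaps as None."""
    rows = []
    for host in npm_hosts:
        guest: Optional[dict] = guests_by_ip.get(host["forward_host"])
        node = guest["node"] if guest else None
        rows.append({
            "hostname": host["domain"],
            "upstream": f'{host["forward_host"]}:{host["forward_port"]}',
            "vmid": guest["vmid"] if guest else None,
            "node": node,
            "switch_port": switch_port_by_node.get(node) if node else None,
        })
    return rows
```

Missing links stay `None` so the UI can show partial rows, which fits the "where data exists" caveat.
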
---
## 5.1 Live data strategy (source of truth)
| Layer | Primary live source | Declared fallback | Drift handling |
|-------|---------------------|-------------------|----------------|
| **VMID, node, status, guest IP** | Proxmox: `pvesh get /cluster/resources` + guest config files on shared `/etc/pve` | [ALL_VMIDS_ENDPOINTS.md](../04-configuration/ALL_VMIDS_ENDPOINTS.md) | VMID/IP mismatch; guests only in doc or only on cluster |
| **Hypervisor capacity** | `scripts/verify/poll-proxmox-cluster-hardware.sh` | [PROXMOX_HOSTS_COMPLETE_HARDWARE_CONFIG.md](PROXMOX_HOSTS_COMPLETE_HARDWARE_CONFIG.md) | Refresh after hardware changes |
| **LAN env keys** | Parsed literals from `ip-addresses.conf` | Same file in git | `guest_ips_not_in_ip_addresses_conf` vs `ip_addresses_conf_ips_not_on_guests`; exclude `PROXMOX_HOST_*`, `NETWORK_GATEWAY`, `UDM_PRO_*`, `WAN_AGGREGATOR_*` from “missing guest” noise |
| **Public edge** | NPM API (fleet scripts) | E2E tables | Hostname → upstream drift |
| **Switch/AP** | UniFi Network API | NetBox / spreadsheet | Manual until imported |
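
The "LAN env keys" row can be illustrated with a short sketch. It assumes `ip-addresses.conf` uses shell-style `KEY=value` lines and only treats literal IPv4 values as declared addresses; the function names are hypothetical and this is not the logic of `compute_ipam_drift.py` itself.

```python
import re

# Key prefixes the spec excludes from "missing guest" noise.
EXCLUDED_PREFIXES = ("PROXMOX_HOST_", "NETWORK_GATEWAY", "UDM_PRO_", "WAN_AGGREGATOR_")
LITERAL_IPV4 = re.compile(r'^([A-Z0-9_]+)="?(\d{1,3}(?:\.\d{1,3}){3})"?\s*$')


def parse_declared_ips(conf_text: str) -> dict[str, str]:
    """Return KEY -> IP for lines that assign a literal IPv4 address."""
    declared = {}
    for line in conf_text.splitlines():
        match = LITERAL_IPV4.match(line.strip())
        if match:
            declared[match.group(1)] = match.group(2)
    return declared


def compute_drift(declared: dict[str, str], guest_ips: set[str]) -> dict[str, list[str]]:
    """Report both drift directions, ignoring infrastructure keys for missing guests."""
    declared_all = set(declared.values())
    declared_guests = {
        ip for key, ip in declared.items() if not key.startswith(EXCLUDED_PREFIXES)
    }
    return {
        "guest_ips_not_in_ip_addresses_conf": sorted(guest_ips - declared_all),
        "ip_addresses_conf_ips_not_on_guests": sorted(declared_guests - guest_ips),
    }
```

Lines whose value is a variable reference are skipped on purpose, since only literals can be compared with live guest addresses.
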
**Freshness:** every artifact includes an ISO 8601 **`collected_at`** timestamp; a failed collector must record **`error`** in `drift.json`, and its output must not be presented as current in the IT UI.
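
One way the UI or BFF could apply the freshness rule is sketched below. The eight-day `MAX_AGE` is an assumption (weekly export plus a day of slack), as is treating timestamps without an offset as UTC; `is_presentable` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE = timedelta(days=8)  # assumption: weekly export plus one day of slack


def is_presentable(artifact: dict, now: Optional[datetime] = None) -> bool:
    """True only when the artifact has no collector error and a recent collected_at."""
    if artifact.get("error"):
        return False
    stamp = artifact.get("collected_at")
    if not stamp:
        return False
    try:
        collected = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    except ValueError:
        return False
    if collected.tzinfo is None:
        collected = collected.replace(tzinfo=timezone.utc)  # treat naive stamps as UTC
    now = now or datetime.now(timezone.utc)
    return timedelta(0) <= now - collected <= MAX_AGE
```

Anything that fails this check would be shown as stale or hidden, never as current inventory.
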
---
## 6. Phased roadmap
| Phase | Scope | Exit criteria |
|-------|--------|----------------|
| **0 — Inventory hardening (live-first)** | **Runtime truth:** Proxmox `pvesh get /cluster/resources` + per-guest config (`net0` / `ipconfig0`) for IP, merged with **`config/ip-addresses.conf`** as **declared** literals; emit **`live_inventory.json`** + **`drift.json`** with **`collected_at`**; duplicate guest IPs → fail or alert. **Scripts (add under `scripts/it-ops/`):** `export-live-inventory-and-drift.sh` (SSH to seed node, pipe `lib/collect_inventory_remote.py`), `compute_ipam_drift.py` (merge + drift). **CI:** `.github/workflows/live-inventory-drift.yml` (`workflow_dispatch` + weekly schedule); on GitHub-hosted runners without LAN, collector exits 0 after writing `drift.json` with `seed_unreachable`. **UI/BFF later:** never show inventory without freshness metadata. |
| **1 — Read-only IT dashboard** | Keycloak group `sankofa-it-admin`; SPA pages: IPs, VLAN plan (current vs target), cluster nodes, hardware poll link | IT can onboard without SSH |
| **2 — Port map CRUD** | DB + UI for switch/port; import from UniFi API | Export CSV/NetBox |
| **3 — Controlled provisioning** | BFF + Proxmox API: start/stop scoped CT, **dry-run default** (align with `proxmox-production-safety` rules) | Audit log + allowlists |
| **4 — Entitlements + billing hooks** | License assignment UI; Stripe (or chosen) webhook → entitlement | Invoice export for finance |
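
The Phase 0 failure rules can be sketched as below: duplicate guest IPs map to exit status 2, and an unreachable seed node maps to exit status 0 so CI runners without LAN access do not fail. The guest record shape and the `duplicate_ips` key are hypothetical and not taken from the actual export script.

```python
from collections import defaultdict

EXIT_OK = 0
EXIT_DUPLICATE_IPS = 2  # exit status the spec associates with duplicate guest IPs


def find_duplicate_ips(guests: list[dict]) -> dict[str, list[int]]:
    """Return IP -> sorted VMIDs for every IP claimed by more than one guest."""
    owners: dict[str, list[int]] = defaultdict(list)
    for guest in guests:
        if guest.get("ip"):
            owners[guest["ip"]].append(guest["vmid"])
    return {ip: sorted(vmids) for ip, vmids in owners.items() if len(vmids) > 1}


def exit_code_for(drift: dict) -> int:
    """Unreachable seed exits 0 (CI without LAN); duplicate IPs exit 2."""
    if drift.get("error") == "seed_unreachable":
        return EXIT_OK
    return EXIT_DUPLICATE_IPS if drift.get("duplicate_ips") else EXIT_OK
```

A distinct non-zero status for duplicates lets a timer unit or workflow alert on that case specifically.
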
---
## 7. Security and governance
- **Separate** IT super-admin from **client** `admin.sankofa.nexus` users (different Keycloak groups).
- **MFA** required for IT group; **break-glass** local Proxmox access documented, not exposed in UI.
- **Change management:** any **write** to the network edge (UDM) or **production** Proxmox should carry a ticket id in the API payload (an optional field at first, with policy enforcement added later).
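
A sketch of how a provisioning request could combine the dry-run default, allowlists, and the ticket id field. The allowlist contents, `ProvisionRequest`, and `plan` are hypothetical; the function only decides what would run and does not call the Proxmox API.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_VMIDS = {7801}               # hypothetical allowlist
ALLOWED_ACTIONS = {"start", "stop"}  # Phase 3 scope: start/stop scoped CT


@dataclass(frozen=True)
class ProvisionRequest:
    action: str
    vmid: int
    ticket_id: Optional[str] = None
    dry_run: bool = True   # callers must opt in to a real change


def plan(request: ProvisionRequest, require_ticket: bool = False) -> dict:
    """Validate a request and report what would run; never calls Proxmox itself."""
    if request.action not in ALLOWED_ACTIONS:
        raise ValueError(f"action {request.action!r} is not allowlisted")
    if request.vmid not in ALLOWED_VMIDS:
        raise ValueError(f"VMID {request.vmid} is not allowlisted")
    if require_ticket and not request.ticket_id:
        raise ValueError("ticket_id is required by policy")
    return {
        "execute": not request.dry_run,
        "action": request.action,
        "vmid": request.vmid,
        "ticket_id": request.ticket_id,
    }
```

Flipping `require_ticket` to `True` is the point at which the optional field becomes enforced policy.
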
---
## 8. Related documents
| Topic | Doc |
|-------|-----|
| IPs, VLAN plan, port forwards | [NETWORK_CONFIGURATION_MASTER.md](../11-references/NETWORK_CONFIGURATION_MASTER.md) |
| VMID ↔ IP | [ALL_VMIDS_ENDPOINTS.md](../04-configuration/ALL_VMIDS_ENDPOINTS.md) |
| Cabling / 10G | [13_NODE_NETWORK_AND_CABLING_CHECKLIST.md](../11-references/13_NODE_NETWORK_AND_CABLING_CHECKLIST.md) |
| Marketplace vs portal | [SANKOFA_MARKETPLACE_SURFACES.md](../03-deployment/SANKOFA_MARKETPLACE_SURFACES.md) |
| FQDN roles | [EXPECTED_WEB_CONTENT.md](EXPECTED_WEB_CONTENT.md) |
| Hardware poll | `scripts/verify/poll-proxmox-cluster-hardware.sh`, `reports/status/hardware_and_connected_inventory_*.md` |
| Proxmox safety | `.cursor/rules/proxmox-production-safety.mdc`, `scripts/lib/proxmox-production-guard.sh` |
---
## 9. Next engineering actions (concrete)
**Done in-repo (Phase 0+):**
1. **`scripts/it-ops/`:** remote collector (`lib/collect_inventory_remote.py`), `compute_ipam_drift.py` (merges **`ip-addresses.conf`** + **`ALL_VMIDS_ENDPOINTS.md`** table rows), `export-live-inventory-and-drift.sh` → `reports/status/live_inventory.json` + `drift.json`.
2. **Read API stub:** `services/sankofa-it-read-api/server.py` (GET `/health`, `/v1/inventory/live`, `/v1/inventory/drift`; POST refresh with API key). systemd example: `config/systemd/sankofa-it-read-api.service.example`.
3. **Workflow:** `.github/workflows/live-inventory-drift.yml` (`workflow_dispatch` + weekly); artifacts; no LAN on default runners.
4. **Validation:** `scripts/validation/validate-config-files.sh` runs `py_compile` on IT scripts + read API.
5. **Docs** — [SANKOFA_IT_OPS_LIVE_INVENTORY_SCRIPTS.md](../03-deployment/SANKOFA_IT_OPS_LIVE_INVENTORY_SCRIPTS.md), [SANKOFA_IT_OPS_KEYCLOAK_PORTAL_NEXT_STEPS.md](../03-deployment/SANKOFA_IT_OPS_KEYCLOAK_PORTAL_NEXT_STEPS.md).
6. **Keycloak automation (proxmox repo):** `scripts/deployment/keycloak-sankofa-ensure-it-admin-role.sh` creates realm role **`sankofa-it-admin`**; operators still assign the role to users in Admin Console.
7. **Portal `/it` (Sankofa/portal repo, sibling clone):** `src/app/it/page.tsx`, `src/app/api/it/*` (server proxy + `IT_READ_API_URL` / `IT_READ_API_KEY` on CT 7801); credentials **`ADMIN`** propagated into JWT roles for bootstrap (`src/lib/auth.ts`).
8. **LAN schedule examples:** `config/systemd/sankofa-it-inventory-export.timer.example` + `.service.example` for weekly `export-live-inventory-and-drift.sh`.
9. **LAN bootstrap + edge:** `scripts/deployment/bootstrap-sankofa-it-read-api-lan.sh` (read API on PVE `/opt/proxmox`, portal env merge, weekly timer on PVE); `scripts/nginx-proxy-manager/upsert-it-read-api-proxy-host.sh`; `scripts/cloudflare/add-it-api-sankofa-dns.sh`.
**Remaining (other repos / product):**
1. **Full BFF** with OIDC (Keycloak) and Postgres — **`dbis_core` vs dedicated CT** — decide once.
2. **Keycloak** — assign **`sankofa-it-admin`** to real IT users (role creation is scripted; mapping is manual policy).
3. **TLS for `it-api.sankofa.nexus`:** `scripts/deployment/request-it-api-tls-npm.sh` (or `CERT_DOMAINS_FILTER='it-api\.sankofa\.nexus'` + `request-npmplus-certificates.sh`). If public HTTPS redirect-loops, align Cloudflare proxy/SSL mode with NPM. **Duplicate guest IPs** (export exit **2**): remediate on cluster.
4. **UniFi / NPM** live collectors — Phase 2 of this spec.
This spec does **not** replace change control; it gives you a **single product vision** so IP, VLAN, ports, hosts, licenses, and billing support evolve together instead of in silos.