# Sankofa IT operations controller — architecture spec

**Status:** Draft for engineering and IT leadership alignment
**Last updated:** 2026-04-08 (Phase 0 live-first inventory section added)
**Audience:** IT team, platform ops, Sankofa admin product owners

---
## 1. Goals

You need a single operational program that covers:

| Capability | Intent |
|------------|--------|
| **IP inventory** | Authoritative list of every LAN/WAN/VIP assignment, owner, service, and lifecycle (no drift between spreadsheets and `config/ip-addresses.conf`). |
| **VLAN design** | Move from today’s **flat VLAN 11** to the **planned segmentation** (validators, RPC, explorer, Sankofa services, tenants) without breaking production. |
| **Port mapping** | Physical: switch port ↔ patch panel ↔ host NIC ↔ logical bond/VLAN. Logical: UDM port forwards ↔ NPM host ↔ upstream CT/VM. |
| **Host efficiency** | Compare **actual** Proxmox capacity (CPU/RAM/storage/network) to workload placement; drive consolidation, spare-node use, and subscription/licensing discipline. |
| **IT admin UI** | **HTML controller** under the **Sankofa admin** surface so the IT team can view/control interfaces, assign **licenses/entitlements**, run **provisioning** workflows, and support **billing** (quotes, usage, invoices handoff). |

This document defines **how** that fits your existing stack (Proxmox cluster, UDM Pro, UniFi, NPMplus, Keycloak, Phoenix/dbis_core marketplace) and a **phased** path so you do not boil the ocean.

---
## 2. Current state (facts from this repo)

- **IP truth is split** across `config/ip-addresses.conf`, `docs/04-configuration/ALL_VMIDS_ENDPOINTS.md`, and `docs/11-references/NETWORK_CONFIGURATION_MASTER.md`. Automated snapshots: `scripts/verify/poll-proxmox-cluster-hardware.sh`, `reports/status/hardware_and_connected_inventory_*.md`.
- **VLANs:** Production today is **VLAN 11 only** for `192.168.11.0/24`. **Planned** VLANs (110–112, 120, 160, 200–203) are documented in [NETWORK_CONFIGURATION_MASTER.md](../11-references/NETWORK_CONFIGURATION_MASTER.md) but **not** implemented as separate broadcast domains on the wire.
- **Sankofa admin:** `admin.sankofa.nexus` is **client SSO administration** today (same upstream as the portal unless split). See [FQDN_EXPECTED_CONTENT.md](EXPECTED_WEB_CONTENT.md), [ALL_VMIDS_ENDPOINTS.md](../04-configuration/ALL_VMIDS_ENDPOINTS.md) (VMID **7801**). Portal source: sibling repo **`Sankofa/portal`** (`scripts/deployment/sync-sankofa-portal-7801.sh`).
- **Marketplace / commercial:** Partner IRU flows live in **`dbis_core`** (API + React); native infra is mostly **docs + Proxmox**, not one database. See [SANKOFA_MARKETPLACE_SURFACES.md](../03-deployment/SANKOFA_MARKETPLACE_SURFACES.md).

**Gap:** There is **no** single product today that unifies IPAM, switch port data, Proxmox actions, UniFi, and billing under Sankofa admin. This spec is the blueprint to add it.

---
## 3. Target architecture (recommended)
### 3.1 UI placement

| Option | Pros | Cons |
|--------|------|------|
| **A: New `/it` (or `/ops`) app route** inside **`Sankofa/portal`**, gated by Keycloak group `sankofa-it-admin` | One TLS hostname, shared session patterns, fastest path for “under admin” | Portal bundle grows; must isolate client admin vs IT super-admin |
| **B: Dedicated host** `it.sankofa.nexus` → small **Next.js/Vite** SPA + BFF | Strong separation, independent deploy cadence | Extra NPM row, cert, pipeline |
| **C: Embed Grafana + NetBox** only | Quick graphs / IPAM | Weak billing/licensing story; less “Sankofa branded” control |

**Recommendation:** **Option A** for MVP (fastest), with **API on a dedicated backend** so you can later move the shell to **B** without rewriting integrations.

### 3.2 Backend (“control plane API”)

Introduce a **small service** (for example `sankofa-it-api`) that is **never** reachable from the public internet without authentication:

- **Network:** VLAN 11 only or **private listener** + NPM **internal** host; OIDC **client credentials** or **user JWT** from Keycloak.
- **Responsibilities:**
  - **Read models:** IPAM, devices, port maps, Proxmox inventory snapshot, UniFi device list (cached).
  - **Write models:** change requests with **audit log** (who/when/what); optional **approval** queue for destructive actions.
  - **Connectors (adapters):** Proxmox API, UniFi Network API (UDM), NPM API (already scripted in repo), optional NetBox later.
- **Do not** put Proxmox root tokens in the browser; **BFF** holds secrets server-side.

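
As a rough illustration of the write-model bullet above, the sketch below records who/when/what for each change request and routes destructive actions to an approval queue. The `ChangeRequest` class, the `submit` helper, and the action names in `DESTRUCTIVE_ACTIONS` are hypothetical; this spec does not define an action vocabulary.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical action names, used only for illustration.
DESTRUCTIVE_ACTIONS = {"destroy_ct", "release_ip"}


def _utc_now():
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class ChangeRequest:
    actor: str    # who
    action: str   # what (verb)
    target: str   # what (object)
    requested_at: str = field(default_factory=_utc_now)  # when
    status: str = "pending"


def submit(request, audit_log):
    """Append an audit entry, then queue destructive actions for approval."""
    if request.action in DESTRUCTIVE_ACTIONS:
        request.status = "needs_approval"
    else:
        request.status = "approved"
    audit_log.append({
        "who": request.actor,
        "when": request.requested_at,
        "what": f"{request.action} {request.target}",
        "status": request.status,
    })
    return request
```

In a real service the audit entries would presumably live in Postgres rather than an in-memory list; the list only keeps the sketch self-contained.
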
### 3.3 Data model (minimum viable)

| Entity | Fields (illustrative) |
|--------|------------------------|
| **Subnet / VLAN** | id, vlan_id, cidr, name, environment, routing notes |
| **IP assignment** | address, hostname, vmid?, mac?, vlan, owner_team, service, source_ref (`ip-addresses.conf` key), status |
| **Physical port map** | switch_id, switch_port, panel_ref, far_end_host, far_end_nic, vlan_membership, speed, lacp_group |
| **Host / hypervisor** | serial, model, cluster node, CPU/RAM/disk summary (from poll script / Proxmox) |
| **License / entitlement** | sku_id, seat_count, valid_from/to, bound_org_or_project, external_ref (Stripe/subscription id) |
| **Provisioning job** | type (create_ct, resize_disk, assign_ip), payload, status, correlation_id |

Start with **Postgres** (you already run many PG instances; a **dedicated small CT** for IT data avoids coupling to app databases).

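
A minimal sketch of two entities from the table above as Python dataclasses, assuming the illustrative field names are kept as-is. `belongs_to` is a hypothetical helper that checks an assignment against its subnet; it is not part of any existing repo script.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Subnet:
    vlan_id: int
    cidr: str
    name: str


@dataclass(frozen=True)
class IpAssignment:
    address: str
    hostname: str
    vlan: int
    owner_team: str
    service: str
    source_ref: str             # key name in ip-addresses.conf
    status: str
    vmid: Optional[int] = None  # "vmid?" in the table: optional
    mac: Optional[str] = None   # "mac?" in the table: optional


def belongs_to(assignment, subnet):
    """True if the assignment's VLAN and address both match the subnet."""
    same_vlan = assignment.vlan == subnet.vlan_id
    in_cidr = (ipaddress.ip_address(assignment.address)
               in ipaddress.ip_network(subnet.cidr))
    return same_vlan and in_cidr
```

A check like this could run on every IPAM write so that an address cannot be filed under the wrong VLAN.
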
### 3.4 Billing and licenses

Treat **billing** as **integrations**, not a from-scratch ERP:

- **Licenses / seats:** map to **entitlements** table + Keycloak **groups** or custom claims for “can open IT console / can approve provision.”
- **Usage metering:** Proxmox **storage and CPU** per VMID, NPM bandwidth (optional), public IP count, with **async jobs** pushing aggregates nightly.
- **Invoicing:** export to **Stripe Billing**, **QuickBooks**, or **NetSuite** via CSV/API; the controller shows **status** and **line items**, not necessarily a full double-entry ledger on day one.

Partner marketplace pricing already has patterns in **`dbis_core`**; **native** infra SKUs should either **reuse** `IruOffering`-style tables or **link** by `external_sku_id` to avoid two unrelated catalogs.

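
The nightly usage aggregation mentioned above might look roughly like the sketch below. The sample shape (`vmid`, `cpu_seconds`, `storage_gib`) and the choice to bill on peak storage are assumptions for illustration; they are not the Proxmox API's field names or a decided pricing rule.

```python
from collections import defaultdict


def aggregate_usage(samples):
    """Sum CPU seconds and keep peak storage per VMID for one billing day."""
    totals = defaultdict(lambda: {"cpu_seconds": 0.0, "peak_storage_gib": 0.0})
    for sample in samples:
        row = totals[sample["vmid"]]
        row["cpu_seconds"] += sample["cpu_seconds"]
        row["peak_storage_gib"] = max(row["peak_storage_gib"],
                                      sample["storage_gib"])
    return dict(totals)
```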
---
## 4. VLAN and efficiency priorities (what matters most first)

Aligned with [NETWORK_CONFIGURATION_MASTER.md](../11-references/NETWORK_CONFIGURATION_MASTER.md):

1. **Document + enforce IP uniqueness** before VLAN migration (ARP incidents already noted around Keycloak IP in E2E docs). Automated **diff**: live `ip neigh` / Proxmox CT IPs vs IPAM.
2. **Segment in this order:** (a) **out-of-band / IPMI** if any, (b) **tenant-facing** workloads (VLAN 200+), (c) **Besu validators/RPC**, (d) **Sankofa app tier**, so that blast-radius reduction matches risk.
3. **Use spare cluster capacity:** **r630-03** / **r630-04** are cluster members with large local/Ceph-related storage; placing **new** stateless or batch workloads there reduces pressure on r630-01/02 (see network master narrative).
4. **ML110 cutover:** the WAN aggregator repurpose changes **.10** from a Proxmox host to a firewall; the controller’s IPAM must flag **migration status** per host.

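
The uniqueness check and automated diff in item 1 can be sketched as below. This is not the repo's `compute_ipam_drift.py`; the input shapes and the `duplicate_guest_ips` key are assumptions, while the other two key names follow the drift labels used later in this spec.

```python
from collections import Counter


def compute_drift(live_guests, declared):
    """Compare live guest IPs with declared literals.

    live_guests: {vmid: ip} as seen on the cluster (assumed shape).
    declared:    {conf_key: ip} parsed from ip-addresses.conf (assumed shape).
    """
    live_ips = list(live_guests.values())
    counts = Counter(live_ips)
    declared_ips = set(declared.values())
    return {
        # Any IP held by more than one guest blocks a safe VLAN migration.
        "duplicate_guest_ips": sorted(ip for ip, n in counts.items() if n > 1),
        "guest_ips_not_in_ip_addresses_conf": sorted(set(live_ips) - declared_ips),
        "ip_addresses_conf_ips_not_on_guests": sorted(declared_ips - set(live_ips)),
    }
```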
---
## 5. Port mapping deliverables

| Layer | Tool / owner | Output |
|-------|----------------|--------|
| **Physical (UniFi XG + patch)** | IT + DCIM template | Spreadsheet or **NetBox** cables + interfaces |
| **UDM** | UniFi export + manual | Port forward matrix (already partially in network master) |
| **NPM** | `scripts/nginx-proxy-manager/update-npmplus-proxy-hosts-api.sh` + API | Proxy host rows = **logical** port map to upstream |
| **Proxmox** | `vmbr`, VLAN-aware flags | Map CT `net0` → bridge → VLAN |

The HTML controller should show a **joined view**: *public hostname → NPM → LAN IP:port → VMID → node → switch port* (where data exists).

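
One way to picture the joined view is a lookup chain that stops at the first missing hop, which matches the “where data exists” caveat. The three input dictionaries and their shapes are assumptions; in practice they would come from the NPM, Proxmox, and port-map sources listed in the table.

```python
def joined_view(hostname, npm_hosts, guests, port_map):
    """Follow hostname -> NPM upstream -> guest -> node -> switch port.

    npm_hosts: {hostname: "ip:port"}, guests: {ip: {"vmid", "node"}},
    port_map:  {node: switch_port}. Hops without data stay None.
    """
    row = {"hostname": hostname, "upstream": None, "vmid": None,
           "node": None, "switch_port": None}
    upstream = npm_hosts.get(hostname)
    if upstream is None:
        return row
    row["upstream"] = upstream
    ip = upstream.rsplit(":", 1)[0]  # drop the port to get the LAN IP
    guest = guests.get(ip)
    if guest is None:
        return row
    row["vmid"] = guest["vmid"]
    row["node"] = guest["node"]
    row["switch_port"] = port_map.get(guest["node"])
    return row
```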
---
## 5.1 Live data strategy (source of truth)

| Layer | Primary live source | Declared fallback | Drift handling |
|-------|---------------------|-------------------|----------------|
| **VMID, node, status, guest IP** | Proxmox: `pvesh get /cluster/resources` + guest config files on shared `/etc/pve` | [ALL_VMIDS_ENDPOINTS.md](../04-configuration/ALL_VMIDS_ENDPOINTS.md) | VMID/IP mismatch; guests only in doc or only on cluster |
| **Hypervisor capacity** | `scripts/verify/poll-proxmox-cluster-hardware.sh` | [PROXMOX_HOSTS_COMPLETE_HARDWARE_CONFIG.md](PROXMOX_HOSTS_COMPLETE_HARDWARE_CONFIG.md) | Refresh after hardware changes |
| **LAN env keys** | Parsed literals from `ip-addresses.conf` | Same file in git | `guest_ips_not_in_ip_addresses_conf` vs `ip_addresses_conf_ips_not_on_guests`; exclude `PROXMOX_HOST_*`, `NETWORK_GATEWAY`, `UDM_PRO_*`, `WAN_AGGREGATOR_*` from “missing guest” noise |
| **Public edge** | NPM API (fleet scripts) | E2E tables | Hostname → upstream drift |
| **Switch/AP** | UniFi Network API | NetBox / spreadsheet | Manual until imported |

**Freshness:** every artifact includes ISO8601 **`collected_at`**; failed collectors must record **`error`** in `drift.json` and must not be presented as current in the IT UI.

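
The freshness rule could be enforced with a small gate like the one below. The `is_presentable` name and the eight-day `max_age` (a weekly schedule plus one day of slack) are assumptions, and timestamps are assumed to carry an explicit offset such as `+00:00`, which `datetime.fromisoformat` parses on all supported Python versions.

```python
from datetime import datetime, timedelta, timezone


def is_presentable(artifact, now, max_age=timedelta(days=8)):
    """True only if the artifact has no collector error and is recent enough."""
    if artifact.get("error"):
        return False  # a failed collector must never be shown as current
    stamp = artifact.get("collected_at")
    if not stamp:
        return False  # no freshness metadata, so do not show it
    collected = datetime.fromisoformat(stamp)
    if collected.tzinfo is None:
        # Assumption: naive timestamps are treated as UTC.
        collected = collected.replace(tzinfo=timezone.utc)
    return now - collected <= max_age
```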
---
## 6. Phased roadmap

| Phase | Scope | Exit criteria |
|-------|--------|----------------|
| **0: Inventory hardening (live-first)** | **Runtime truth:** Proxmox `pvesh /cluster/resources` + per-guest config (`net0` / `ipconfig0`) for IP, merged with **`config/ip-addresses.conf`** as **declared** literals. **Scripts (add under `scripts/it-ops/`):** `export-live-inventory-and-drift.sh` (SSH to seed node, pipe `lib/collect_inventory_remote.py`), `compute_ipam_drift.py` (merge + drift). **CI:** `.github/workflows/live-inventory-drift.yml`: `workflow_dispatch` + weekly schedule; on GitHub-hosted runners without LAN, collector exits 0 after writing `drift.json` with `seed_unreachable`. | Emit **`live_inventory.json`** + **`drift.json`** with **`collected_at`**; duplicate guest IPs → fail or alert. **UI/BFF later:** never show inventory without freshness metadata. |
| **1: Read-only IT dashboard** | Keycloak group `sankofa-it-admin`; SPA pages: IPs, VLAN plan (current vs target), cluster nodes, hardware poll link | IT can onboard without SSH |
| **2: Port map CRUD** | DB + UI for switch/port; import from UniFi API | Export CSV/NetBox |
| **3: Controlled provisioning** | BFF + Proxmox API: start/stop scoped CT, **dry-run default** (align with `proxmox-production-safety` rules) | Audit log + allowlists |
| **4: Entitlements + billing hooks** | License assignment UI; Stripe (or chosen) webhook → entitlement | Invoice export for finance |

---
## 7. Security and governance

- **Separate** IT super-admin from **client** `admin.sankofa.nexus` users (different Keycloak groups).
- **MFA** required for IT group; **break-glass** local Proxmox access documented, not exposed in UI.
- **Change management:** any **write** to the network edge (UDM) or **production** Proxmox should carry a ticket id in the API payload (the field is optional at first and enforced by policy later).

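
A guard combining the change-management rule above with the dry-run default from Phase 3 might look like this sketch. The target labels in `PRODUCTION_TARGETS`, the `WriteRequest` shape, and the `enforce_ticket` flag are all hypothetical; they only show how an optional field can later become mandatory through policy.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for targets that fall under change management.
PRODUCTION_TARGETS = {"udm", "proxmox-production"}


@dataclass
class WriteRequest:
    target: str
    action: str
    ticket_id: Optional[str] = None
    dry_run: bool = True  # nothing is applied unless the caller opts out


def check_write(request, enforce_ticket=False):
    """Return (allowed, reason); ticket enforcement is off until policy flips it."""
    if request.dry_run:
        return True, "dry-run: no changes applied"
    if request.target in PRODUCTION_TARGETS and not request.ticket_id:
        if enforce_ticket:
            return False, "ticket id required for production writes"
        return True, "warning: production write without ticket id"
    return True, "ok"
```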
---
## 8. Related documents

| Topic | Doc |
|-------|-----|
| IPs, VLAN plan, port forwards | [NETWORK_CONFIGURATION_MASTER.md](../11-references/NETWORK_CONFIGURATION_MASTER.md) |
| VMID ↔ IP | [ALL_VMIDS_ENDPOINTS.md](../04-configuration/ALL_VMIDS_ENDPOINTS.md) |
| Cabling / 10G | [13_NODE_NETWORK_AND_CABLING_CHECKLIST.md](../11-references/13_NODE_NETWORK_AND_CABLING_CHECKLIST.md) |
| Marketplace vs portal | [SANKOFA_MARKETPLACE_SURFACES.md](../03-deployment/SANKOFA_MARKETPLACE_SURFACES.md) |
| FQDN roles | [EXPECTED_WEB_CONTENT.md](EXPECTED_WEB_CONTENT.md) |
| Hardware poll | `scripts/verify/poll-proxmox-cluster-hardware.sh`, `reports/status/hardware_and_connected_inventory_*.md` |
| Proxmox safety | `.cursor/rules/proxmox-production-safety.mdc`, `scripts/lib/proxmox-production-guard.sh` |

---
## 9. Next engineering actions (concrete)

**Done in-repo (Phase 0+):**

1. **`scripts/it-ops/`**: remote collector (`lib/collect_inventory_remote.py`), `compute_ipam_drift.py` (merges **`ip-addresses.conf`** + **`ALL_VMIDS_ENDPOINTS.md`** table rows), `export-live-inventory-and-drift.sh` → `reports/status/live_inventory.json` + `drift.json`.
2. **Read API stub**: `services/sankofa-it-read-api/server.py` (GET `/health`, `/v1/inventory/live`, `/v1/inventory/drift`; POST refresh with API key). systemd example: `config/systemd/sankofa-it-read-api.service.example`.
3. **Workflow** `.github/workflows/live-inventory-drift.yml`: `workflow_dispatch` + weekly; artifacts; no LAN on default runners.
4. **Validation**: `scripts/validation/validate-config-files.sh` runs `py_compile` on IT scripts + read API.
5. **Docs**: [SANKOFA_IT_OPS_LIVE_INVENTORY_SCRIPTS.md](../03-deployment/SANKOFA_IT_OPS_LIVE_INVENTORY_SCRIPTS.md), [SANKOFA_IT_OPS_KEYCLOAK_PORTAL_NEXT_STEPS.md](../03-deployment/SANKOFA_IT_OPS_KEYCLOAK_PORTAL_NEXT_STEPS.md).
6. **Keycloak automation (proxmox repo)**: `scripts/deployment/keycloak-sankofa-ensure-it-admin-role.sh` creates realm role **`sankofa-it-admin`**; operators still assign the role to users in Admin Console.
7. **Portal `/it` (Sankofa/portal repo, sibling clone)**: `src/app/it/page.tsx`, `src/app/api/it/*` (server proxy + `IT_READ_API_URL` / `IT_READ_API_KEY` on CT 7801); credentials **`ADMIN`** propagated into JWT roles for bootstrap (`src/lib/auth.ts`).
8. **LAN schedule examples**: `config/systemd/sankofa-it-inventory-export.timer.example` + `.service.example` for weekly `export-live-inventory-and-drift.sh`.
9. **LAN bootstrap + edge**: `scripts/deployment/bootstrap-sankofa-it-read-api-lan.sh` (read API on PVE `/opt/proxmox`, portal env merge, weekly timer on PVE); `scripts/nginx-proxy-manager/upsert-it-read-api-proxy-host.sh`; `scripts/cloudflare/add-it-api-sankofa-dns.sh`.

**Remaining (other repos / product):**

1. **Full BFF** with OIDC (Keycloak) and Postgres: decide once between **`dbis_core`** and a **dedicated CT**.
2. **Keycloak**: assign **`sankofa-it-admin`** to real IT users (role creation is scripted; mapping is manual policy).
3. **TLS for `it-api.sankofa.nexus`**: `scripts/deployment/request-it-api-tls-npm.sh` (or `CERT_DOMAINS_FILTER='it-api\.sankofa\.nexus'` + `request-npmplus-certificates.sh`). If public HTTPS redirect-loops, align Cloudflare proxy/SSL mode with NPM.
4. **Duplicate guest IPs** (export exit **2**): remediate on cluster.
5. **UniFi / NPM** live collectors: Phase 2 of this spec.

This spec does **not** replace change control; it gives you a **single product vision** so IP, VLAN, ports, hosts, licenses, and billing support evolve together instead of in silos.