chore: sync all changes to Gitea

- Config, docs, scripts, and backup manifests
- Submodule refs unchanged (m = modified content in submodules)

Made-with: Cursor
This commit is contained in:
defiQUG
2026-03-02 11:37:34 -08:00
parent ed85135249
commit b3a8fe4496
883 changed files with 73580 additions and 4796 deletions


@@ -0,0 +1,231 @@
# AI / Agents 57xx — Full Deployment Task List
**Last Updated:** 2026-02-26
**Source:** [AI_AGENTS_57XX_DEPLOYMENT_PLAN.md](AI_AGENTS_57XX_DEPLOYMENT_PLAN.md), [VMID_ALLOCATION_FINAL.md](VMID_ALLOCATION_FINAL.md)
**VMID band:** 5700–5999
This document is the **single ordered checklist** for deploying the full 57xx stack. Copy-paste commands and paths are ready for operators. Artifacts (compose files, agent script) live in **`scripts/57xx-deploy/`** and can be copied to target VMs.
---
## Prerequisites (all 57xx VMs)
- [ ] **A.1** Ubuntu/Debian with Docker Engine + Compose plugin.
- [ ] **A.2** Create standard dirs and install Docker (once per host):
```bash
sudo apt update
sudo apt install -y ca-certificates curl gnupg ufw
curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker $USER
# Log out/in or: newgrp docker
sudo mkdir -p /opt/ai/{mcp,inference,agent,state}/{config,data,logs}
sudo chown -R $USER:$USER /opt/ai
```
- [ ] **A.3** Network: ensure 5703 → 5701:3000, 5703 → 5702:8000, and 5701/5703 → 5704:5432,6379 are allowed (replace hostnames with your VM hostnames or IPs if needed).
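On hosts using UFW (installed in A.2), the A.3 rules can be sketched roughly as below; the IPs are placeholders for your real 5701–5704 addresses, and each rule is run on the VM that receives the traffic.

```bash
# Sketch only — substitute your actual VM IPs for the placeholders.
MCP_HUB=192.0.2.1      # 5701 (placeholder)
AGENT=192.0.2.3        # 5703 (placeholder)

# On 5701: allow the agent worker to reach the MCP hub
sudo ufw allow from "$AGENT" to any port 3000 proto tcp
# On 5702: allow the agent worker to reach inference
sudo ufw allow from "$AGENT" to any port 8000 proto tcp
# On 5704: allow hub and agent to reach Postgres and Redis
for src in "$MCP_HUB" "$AGENT"; do
  sudo ufw allow from "$src" to any port 5432 proto tcp
  sudo ufw allow from "$src" to any port 6379 proto tcp
done
sudo ufw enable
```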
---
## Task 1 — Repo and submodule (once per environment)
- [ ] **1.1** Clone proxmox repo with submodules, or from existing repo root init submodules:
```bash
# Option A: fresh clone
git clone --recurse-submodules <PROXMOX_REPO_URL> /opt/proxmox
# Option B: from repo root
git submodule update --init --recursive
```
- [ ] **1.2** Confirm submodule exists:
```bash
ls -la /opt/proxmox/ai-mcp-pmm-controller/README.md
# or from your workspace: <REPO_ROOT>/ai-mcp-pmm-controller/
```
---
## Task 2 — VM 5701 (MCP Hub) — required
- [ ] **2.1** On the host that will run VMID 5701 (or the machine acting as 5701):
```bash
cd /opt/proxmox/ai-mcp-pmm-controller
# or: cd <REPO_ROOT>/ai-mcp-pmm-controller
```
- [ ] **2.2** Create logs dir:
```bash
mkdir -p logs
```
- [ ] **2.3** Create local `.env` (gitignored; do not commit secrets):
```bash
# Minimum:
RPC_URL=https://YOUR_CHAIN_RPC_URL
CHAIN=arbitrum
ALLOW_WRITE=false
EXECUTION_ARMED=false
```
- [ ] **2.4** (Optional) Edit `config/allowlist.json`: replace placeholder pool addresses and base/quote tokens before using pool tools.
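For reference, allowlist entries follow this shape (field names `chain`, `pool_address`, `base_token`, `quote_token`, `profile` per the MCP docs; the addresses are placeholders, not real contracts). This writes an example next to the real file for comparison:

```shell
# Hypothetical allowlist shape — addresses are placeholders.
cat > allowlist.example.json <<'EOF'
{
  "chain": "arbitrum",
  "pools": [
    {
      "pool_address": "0xPOOL_ADDRESS_HERE",
      "base_token": "0xBASE_TOKEN_HERE",
      "quote_token": "0xQUOTE_TOKEN_HERE",
      "profile": "dodo_pmm_v2_like"
    }
  ]
}
EOF
```

Compare against the real `config/allowlist.json` before editing; only allowlisted pools are usable by pool tools.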
- [ ] **2.5** Start the hub:
```bash
docker compose build --no-cache # first time or after Dockerfile change
docker compose --env-file .env up -d
```
- [ ] **2.6** Validate:
```bash
curl -fsS http://127.0.0.1:3000/health
# Expect: {"ok":true,"chain":"arbitrum"} (or your CHAIN value)
```
- [ ] **2.7** (Optional) Interface discovery once you have a pool address:
```bash
curl -sS http://127.0.0.1:3000/mcp/call \
-H 'content-type: application/json' \
-d '{"tool":"dodo.identify_pool_interface","params":{"pool":"0xPOOL"}}' | jq
```
Use `functions_found`, `notes`, and `detected_profile` to choose the right ABI/profile.
---
## Task 3 — VM 5704 (Memory/State) — optional
- [ ] **3.1** On VM 5704 host, create state dirs:
```bash
sudo mkdir -p /opt/ai/state/data/postgres /opt/ai/state/data/redis
sudo chown -R $USER:$USER /opt/ai/state
```
- [ ] **3.2** Copy compose and env from repo (or run `./scripts/57xx-deploy/copy-to-opt-ai.sh` from repo root):
```bash
# Option A: script (from repo root)
./scripts/57xx-deploy/copy-to-opt-ai.sh
# Option B: manual
cp /opt/proxmox/scripts/57xx-deploy/5704-state/docker-compose.yml /opt/ai/state/
cp /opt/proxmox/scripts/57xx-deploy/5704-state/.env.example /opt/ai/state/.env
# Edit .env: set POSTGRES_PASSWORD
```
- [ ] **3.3** Start state stack:
```bash
cd /opt/ai/state
docker compose up -d
docker compose ps
```
- [ ] **3.4** Validate:
```bash
pg_isready -h 127.0.0.1 -U ai -d ai
redis-cli -h 127.0.0.1 ping
```
---
## Task 4 — VM 5702 (Inference) — optional
- [ ] **4.1** On VM 5702 host, create model dir:
```bash
sudo mkdir -p /opt/ai/inference/data/models
sudo chown -R $USER:$USER /opt/ai/inference
```
- [ ] **4.2** Place a GGUF model at `/opt/ai/inference/data/models/model.gguf` (or adjust compose `command` for your filename).
- [ ] **4.3** Copy compose and start:
```bash
cp /opt/proxmox/scripts/57xx-deploy/5702-inference/docker-compose.yml /opt/ai/inference/
cd /opt/ai/inference
docker compose up -d
```
- [ ] **4.4** (Optional) Validate: `curl -sS http://127.0.0.1:8000/` (llama.cpp may not have `/health`).
---
## Task 5 — VM 5703 (Agent Worker) — optional
- [ ] **5.1** On VM 5703 host, copy agent config and compose:
```bash
cp /opt/proxmox/scripts/57xx-deploy/5703-agent/agent.py /opt/ai/agent/config/
cp /opt/proxmox/scripts/57xx-deploy/5703-agent/docker-compose.yml /opt/ai/agent/
cp /opt/proxmox/scripts/57xx-deploy/5703-agent/.env.example /opt/ai/agent/.env
```
- [ ] **5.2** Edit `/opt/ai/agent/.env`: set `MCP_URL` (e.g. `http://5701:3000/mcp/call`), `INF_URL` (e.g. `http://5702:8000`). If using 5704, set `PG_DSN` and/or `REDIS_URL`.
- [ ] **5.3** Edit `/opt/ai/agent/config/agent.py`: replace `POOL_ADDRESS_HERE` with a real allowlisted pool address when using `dodo.get_pool_state`.
- [ ] **5.4** Start agent:
```bash
cd /opt/ai/agent
docker compose up -d
docker logs -f ai-agent-prod
```
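For reference, a filled-in `/opt/ai/agent/.env` might look like the following (values illustrative; hostnames follow the VMID-as-hostname convention from 5.2, and the DSN/URL formats are assumptions — match whatever `agent.py` actually expects):

```bash
MCP_URL=http://5701:3000/mcp/call
INF_URL=http://5702:8000
# Only if using 5704:
PG_DSN=postgresql://ai:CHANGE_ME@5704:5432/ai
REDIS_URL=redis://5704:6379/0
```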
---
## Task 6 — Post-deploy validation
- [ ] **6.1** MCP (5701): `curl -fsS http://5701:3000/health` (or from 5701 host: `http://127.0.0.1:3000/health`).
- [ ] **6.2** State (5704): `pg_isready -h 5704 -U ai -d ai` and `redis-cli -h 5704 ping`.
- [ ] **6.3** Inference (5702): `curl -sS http://5702:8000/` if applicable.
- [ ] **6.4** Agent (5703): `docker logs --tail=50 ai-agent-prod` — no repeated errors.
---
## Task 7 — Hardening (before enabling write tools on 5701)
- [ ] **7.1** Pool allowlist populated and reviewed.
- [ ] **7.2** Max slippage, max notional per tx/day, cooldown, and circuit breaker (see [AI_AGENTS_57XX_DEPLOYMENT_PLAN.md](AI_AGENTS_57XX_DEPLOYMENT_PLAN.md) § Hardening checklist).
- [ ] **7.3** Only then set `ALLOW_WRITE=true` and `EXECUTION_ARMED=true` in 5701 `.env` and restart MCP.
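Once 7.1–7.2 pass, the flip in 7.3 can be done from the 5701 hub directory; a sketch (paths from Task 2, flag names from step 2.3):

```bash
cd /opt/proxmox/ai-mcp-pmm-controller   # or <REPO_ROOT>/ai-mcp-pmm-controller
sed -i 's/^ALLOW_WRITE=.*/ALLOW_WRITE=true/' .env
sed -i 's/^EXECUTION_ARMED=.*/EXECUTION_ARMED=true/' .env
docker compose --env-file .env up -d --force-recreate
curl -fsS http://127.0.0.1:3000/health   # confirm the hub restarted cleanly
```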
---
## Artifact locations (in repo)
| VMID | Artifacts |
|------|-----------|
| 5701 | `ai-mcp-pmm-controller/` (submodule): `docker-compose.yml`, `Dockerfile`, `config/`, `.env` (local, gitignored) |
| 5704 | `scripts/57xx-deploy/5704-state/`: `docker-compose.yml`, `.env.example` |
| 5702 | `scripts/57xx-deploy/5702-inference/`: `docker-compose.yml` |
| 5703 | `scripts/57xx-deploy/5703-agent/`: `agent.py`, `docker-compose.yml`, `.env.example` |
**Copy all optional artifacts in one go:** from repo root run `./scripts/57xx-deploy/copy-to-opt-ai.sh` (creates `/opt/ai/*` dirs and copies 5704/5702/5703 files; does not overwrite existing `.env`).
---
## Quick reference — ports and callers
| VMID | Service | Port | Allowed callers |
|------|---------|------|-----------------|
| 5701 | MCP Hub | 3000 | 5702, 5703 |
| 5702 | Inference | 8000 | 5703 |
| 5704 | Postgres | 5432 | 5701, 5703 |
| 5704 | Redis | 6379 | 5701, 5703 |
---
**Owner:** Architecture
**See also:** [AI_AGENTS_57XX_DEPLOYMENT_PLAN.md](AI_AGENTS_57XX_DEPLOYMENT_PLAN.md) (Appendices A–F), [ai-mcp-pmm-controller/README.md](../../ai-mcp-pmm-controller/README.md)


@@ -0,0 +1,90 @@
# Smart Contracts and Blockchains for MCP Token/Pool Addresses
**Purpose:** What smart contracts must exist on which blockchains so the 5701 MCP hub can be given pool and token addresses in its allowlist.
**MCP behavior:** The MCP does **not** deploy contracts. It reads from existing contracts. You configure `config/allowlist.json` with one `chain` (e.g. `arbitrum`) and a list of pools; each pool has `pool_address`, `base_token`, `quote_token`, and `profile`. The MCP calls RPC on that chain to read pool state (getMidPrice, getOraclePrice, reserves, etc.) and token decimals. So **every address in the allowlist must point to an already-deployed contract** on the chosen chain.
---
## 1. What the MCP needs per pool
| Field | Meaning | Must exist on chain |
|-------|---------|---------------------|
| **pool_address** | PMM pool contract (DODO-style: getMidPrice, getOraclePrice, getBaseReserve, getQuoteReserve, _K_, _LP_FEE_RATE_, etc.) | Yes — one contract per pool |
| **base_token** | Base asset (e.g. cWUSDT, cUSDT) — ERC-20 | Yes |
| **quote_token** | Quote asset (e.g. USDC, USDT) — ERC-20 | Yes |
The MCP supports one chain at a time via `CHAIN` and `RPC_URL`. To support multiple chains, run multiple MCP instances (or keep one allowlist per chain and switch configs).
---
## 2. Chain 138 (SMOM-DBIS-138)
| Item | Status | Notes |
|------|--------|--------|
| **DODOPMMIntegration** | Deployed | `0x79cdbaFBaA0FdF9F55D26F360F54cddE5c743F7D` — creates and owns PMM pools |
| **Pools** | Created via integration | Call `createPool` / `createCUSDTCUSDCPool` etc.; pool addresses from creation or `pools(base, quote)` |
| **Base tokens (cUSDT, cUSDC, …)** | Deployed (core) | e.g. cUSDT `0x93E66202A11B1772E55407B32B44e5Cd8eda7f22`, cUSDC `0xf22258f57794CC8E06237084b353Ab30fFfa640b` (see [CHAIN138_TOKEN_ADDRESSES](../11-references/CHAIN138_TOKEN_ADDRESSES.md)) |
| **Quote tokens (USDT, USDC)** | On-chain | Use addresses from Chain 138 config / token API |
**Contracts you need to have (so the MCP has addresses):**
- **Already deployed:** DODOPMMIntegration; core compliant tokens (cUSDT, cUSDC, etc.).
- **You must do:** Create pools via DODOPMMIntegration (`createCUSDTCUSDCPool`, `createPool(cUSDT, USDT, ...)`, etc.). Then put in the MCP allowlist: each pool's address, and the base/quote token addresses used for that pool.
No additional smart contracts need to be **deployed** for the MCP beyond what already exists on 138; you only need to **create pools** from the existing integration and then configure the MCP allowlist with those pool and token addresses.
---
## 3. Other blockchains (public chains with cW* design)
The **cross-chain-pmm-lps** design assumes per-chain **cW*** (bridged) tokens and **hub** stables (USDC/USDT), with **single-sided PMM pools** (cW* / hub) on each chain. `config/pool-matrix.json` and `config/deployment-status.json` list the chains and pairs. Today **deployment-status.json** has **no** addresses filled for these chains (1, 56, 137, 10, 100, 25, 42161, 42220, 1111, 43114, 8453).
So that the MCP can have token and pool addresses on a given public chain, the following must **exist** (be deployed or already there):
| What | Who deploys / source | Notes |
|------|----------------------|--------|
| **cW* tokens** (cWUSDT, cWUSDC, …) | Bridge (e.g. CCIP) or custom wrapper | Bridged representation of Chain 138 compliant tokens; address per chain. |
| **Hub stables** (USDC, USDT, …) | Usually already exist | Native Circle/Tether (or chain canonical) deployments; use canonical address per chain. |
| **PMM pool contracts** (one per pair) | You or DODO | DODO-style pool with getMidPrice, getOraclePrice, reserves, k, fee. Either: (a) deploy your own PMM factory + pools (e.g. DODO Vending Machine–compatible or custom), or (b) use existing DODO deployments on that chain if they match the MCP's `dodo_pmm_v2_like` profile. |
**Blockchains in the design (pool-matrix / deployment-status):**
- **1** — Ethereum Mainnet
- **10** — Optimism
- **25** — Cronos
- **56** — BSC (BNB Chain)
- **100** — Gnosis Chain
- **137** — Polygon
- **1111** — Wemix
- **8453** — Base
- **42161** — Arbitrum One
- **42220** — Celo
- **43114** — Avalanche C-Chain
For **each** chain where you want the MCP to work you need:
1. **Token contracts:** Addresses for the cW* tokens (and any other base tokens) and for the hub quote tokens (USDC/USDT, etc.) on that chain.
2. **Pool contracts:** At least one PMM pool per pair you want to manage (e.g. cWUSDT/USDC, cWUSDC/USDC). Each pool must expose the view functions expected by the MCP's pool profile (e.g. `dodo_pmm_v2_like`).
So: **no** new chain-specific contracts are “for the MCP” itself; the MCP only needs **addresses** of tokens and pools that already exist. On public chains those tokens and pools either must be **deployed** by you (or your bridge/PMM stack) or come from existing protocols (e.g. DODO) that match the MCP's interface.
---
## 4. Summary table — “What must be deployed so the MCP has addresses”
| Blockchain | Smart contracts / actions needed so MCP has addresses |
|------------|--------------------------------------------------------|
| **Chain 138** | DODOPMMIntegration already deployed. **Create pools** via it (cUSDT/cUSDC, cUSDT/USDT, cUSDC/USDC, etc.). Use existing cUSDT/cUSDC and chain USDT/USDC addresses. No extra contract deployment required. |
| **Ethereum (1), BSC (56), Polygon (137), Optimism (10), Gnosis (100), Cronos (25), Arbitrum (42161), Base (8453), Celo (42220), Wemix (1111), Avalanche (43114)** | (1) **cW* token** addresses on that chain (via your bridge or wrapper). (2) **Hub stable** addresses (USDC/USDT — usually exist). (3) **PMM pool** contracts per pair (deploy DODO-style or use existing DODO on that chain). Until these exist and are recorded (e.g. in deployment-status or allowlist), the MCP has nothing to point at on that chain. |
---
## 5. References
- MCP allowlist shape: `ai-mcp-pmm-controller/config/allowlist.json`
- MCP pool profile (view methods): `ai-mcp-pmm-controller/config/pool_profiles.json`
- Chain 138 tokens: `docs/11-references/CHAIN138_TOKEN_ADDRESSES.md`
- Chain 138 DODO: `smom-dbis-138/docs/integration/DODO_PMM_INTEGRATION.md`, `smom-dbis-138/docs/deployment/DEPLOYED_CONTRACTS_OVERVIEW.md`
- Per-chain pool design: `cross-chain-pmm-lps/config/pool-matrix.json`, `cross-chain-pmm-lps/config/deployment-status.json`
- DEX/pool gaps: `docs/11-references/DEX_AND_CROSS_CHAIN_CONTRACTS_NEEDED.md`


@@ -121,4 +121,4 @@ So **yes — it should be full HA** if you want automatic failover and no single
- **Current:** Cluster only; no shared storage; no Proxmox HA; manual migration and manual restart after maintenance.
- **Target:** Full HA = shared storage + HA manager + HA resources so that when you power down an R630 (e.g. for DIMM B2 reseat), critical VMs/containers are restarted on another node automatically.
See also: [PROXMOX_CLUSTER_ARCHITECTURE.md](./PROXMOX_CLUSTER_ARCHITECTURE.md) (current cluster and “Future Enhancements”), [NPMPLUS_HA_SETUP_GUIDE.md](../04-configuration/NPMPLUS_HA_SETUP_GUIDE.md) (NPMplus-level HA with Keepalived). For **13× R630 + DoD/MIL-spec** (full HA, Ceph, fencing, RAM/drives, STIG hardening), see **[R630_13_NODE_DOD_HA_MASTER_PLAN.md](./R630_13_NODE_DOD_HA_MASTER_PLAN.md)**.


@@ -0,0 +1,273 @@
# 13× R630 Proxmox Cluster — DoD/MIL-Spec HA Master Plan
**Last Updated:** 2026-03-02
**Document Version:** 1.0
**Status:** Active — Master plan for 13-node HA, RAM/storage, and DoD/MIL compliance
---
## 1. Executive Summary
This document defines the target architecture for a **13-node Dell PowerEdge R630** Proxmox cluster with:
- **Full HA and failover** (shared storage, HA manager, fencing, automatic recovery).
- **DoD/MIL-spec alignment** (STIG-style hardening, audit, encryption, change control, documentation).
- **RAM and drive specifications** for each R630 to support Ceph, VMs/containers, and growth.
**Scope:** All 13 R630s as Proxmox cluster nodes; optional separate management node (e.g. ml110) or integration of management on a subset of R630s. Design assumes **hyper-converged** (Proxmox + Ceph on same nodes) for shared storage and true HA.
---
## 2. Cluster Design — 13 Nodes
### 2.1 Node roles and quorum
| Item | Requirement |
|------|-------------|
| **Total nodes** | 13 × R630 |
| **Quorum** | Majority = 7. With 13 nodes, up to 6 can be down and the cluster still has quorum. |
| **Fencing** | Required for HA: failed node must be fenced (power off/reboot) so Ceph and HA manager can safely restart resources elsewhere. |
| **Qdevice** | Optional: add a quorum device (e.g. small VM or appliance) so quorum survives more node failures; not required with 13 nodes but improves resilience. |
### 2.2 Recommended node layout
| Role | Node count | Purpose |
|------|------------|---------|
| **Proxmox + Ceph MON/MGR/OSD** | 13 | Every R630 runs Proxmox and participates in Ceph (MON, MGR, OSD) for shared storage. |
| **Ceph OSD** | 13 | Each node contributes disk as Ceph OSD; replication (e.g. size=3, min_size=2) across nodes. |
| **Proxmox HA** | 13 | HA manager can restart VMs/containers on any node; VM disks on Ceph. |
| **Optional dedicated** | 0 | No dedicated “monitor-only” nodes required; MON/MGR run on all or a subset (e.g. 3–5 MONs). |
### 2.3 Network and addressing
- **Management:** One subnet (e.g. 192.168.11.0/24) for Proxmox API, SSH, Ceph public/cluster.
- **Ceph:** Separate VLAN or subnet for Ceph cluster network (recommended for DoD: isolate storage traffic).
- **VLANs:** Same VLAN-aware bridge (e.g. vmbr0) on all nodes so VMs/containers keep IPs when failed over.
- **IP plan for 13 R630s:** Reserve 13 consecutive IPs (e.g. 192.168.11.11–192.168.11.23 for r630-01 … r630-13). Document in `config/ip-addresses.conf` and DNS.
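The example plan can be generated mechanically for `config/ip-addresses.conf` and DNS; a sketch assuming the 192.168.11.x example subnet:

```shell
# Emit hostname/IP pairs r630-01 192.168.11.11 ... r630-13 192.168.11.23
for i in $(seq 1 13); do
  printf 'r630-%02d 192.168.11.%d\n' "$i" "$((10 + i))"
done
```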
---
## 3. RAM Specifications — R630
### 3.1 R630 memory capabilities (reference)
| Spec | Value |
|------|--------|
| **DIMM slots** | 24 (12 per socket in 2-socket) |
| **Max RAM** | Up to 1.5 TB (with compatible LRDIMMs) |
| **Typical configs** | 32 GB, 64 GB, 128 GB, 256 GB, 384 GB, 512 GB (depending on DIMM size and count) |
| **ECC** | Required for DoD/MIL; R630 supports ECC RDIMM/LRDIMM |
### 3.2 Recommended RAM per node (DoD HA + Ceph)
| Tier | RAM per node | Use case |
|------|----------------|---------|
| **Minimum** | 128 GB | Ceph OSD + a few VMs; acceptable for lab or light production. |
| **Recommended** | 256 GB | Production: Ceph (OSD + MON/MGR) + many VMs/containers; headroom for failover and recovery. |
| **High** | 384–512 GB | Heavy workloads, large Ceph OSD count per node, or when consolidating from existing 503 GB nodes. |
**Ceph guidance:** Proxmox/Ceph recommend **≥ 8 GiB per OSD** for OSD memory. With 6–8 OSDs per node (see storage), **48–64 GiB** for Ceph plus Proxmox and guest overhead → **128 GB minimum**, **256 GB recommended**.
**DoD/MIL note:** Prefer **256 GB per node** for 13-node production so that (1) multiple node failures still leave enough capacity for HA migrations and (2) Ceph recovery and rebalancing do not cause OOM or instability.
### 3.3 RAM placement (if mixing sizes)
If not all nodes have the same RAM:
- Put **largest RAM** in nodes that run the most VMs or Ceph MON/MGR.
- Ensure **at least 128 GB** on every node that runs Ceph OSDs.
- Document exact DIMM layout per node (slot, size, speed) for change control and troubleshooting.
---
## 4. Drive Specifications — R630
### 4.1 R630 drive options (reference)
- **Internal bays:** Typically 8 × 2.5" SATA/SAS (or 10-bay with optional kit); some configs support NVMe (e.g. 4 × NVMe via PCIe).
- **Boot:** 2 drives in mirror (ZFS mirror or hardware RAID1) for Proxmox OS — **redundant, DoD-compliant**.
- **Data:** Remaining drives for Ceph OSD and/or local LVM (if hybrid).
### 4.2 Recommended drive layout per R630 (full Ceph)
| Purpose | Drives | Type | Size (example) | Configuration |
|---------|--------|------|----------------|---------------|
| **Boot (OS)** | 2 | SSD | 240–480 GB each | ZFS mirror (preferred) or HW RAID1; Proxmox root only. |
| **Ceph OSD** | 4–6 | SSD (or NVMe) | 480 GB–1 TB each | One OSD per drive; no RAID (Ceph provides replication). |
**Example per node:** 2 × 480 GB boot (ZFS mirror) + 6 × 960 GB SSD = 6 Ceph OSDs per node.
**Cluster total:** 13 × 6 = 78 OSDs; with replication 3×, usable capacity ≈ (78 × 0.9 TB) / 3 ≈ **~23 TB** (before bluestore overhead; adjust for actual sizes).
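The capacity estimate above can be reproduced (and re-run for other drive counts and sizes):

```shell
# nodes × OSDs-per-node × usable-TB-per-drive / replication-size
awk 'BEGIN { printf "OSDs=%d usable≈%.1f TB\n", 13*6, 13*6*0.9/3 }'
```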
### 4.3 DoD/MIL storage requirements
- **Encryption:** At-rest encryption for sensitive data. Options: Ceph encryption (e.g. dm-crypt for OSD), or encrypted VMs (LUKS inside guest). Document which layers are encrypted and key management.
- **Integrity:** ZFS for boot (checksum, scrub). Ceph provides replication and recovery; use **bluestore** with checksums.
- **Sanitization:** Follow DoD 5220.22-M or NIST SP 800-88 for decommissioning/destruction of drives.
- **Spare:** Maintain spare drives and document replacement and wipe procedures.
### 4.4 Sizing for your workload
- **Current (from docs):** ~50+ VMIDs, mix of Besu, Blockscout, DBIS, NPMplus, etc.; growth ~20–50 GB/month.
- **Target:** Size the Ceph pool so that **used + 2 years growth** stays < 75% of usable. Example: 15–20 TB usable → ~5–7 TB used now + growth headroom.
---
## 5. Full HA and Failover Architecture
### 5.1 Components
| Component | Role |
|-----------|------|
| **Proxmox cluster** | 13 nodes; same cluster name; corosync for quorum. |
| **Ceph** | Shared storage: MON (3–5 nodes), MGR (2+), OSD on all 13. Replication size=3, min_size=2. |
| **Proxmox HA** | HA manager enabled; VMs/containers on Ceph added as HA resources; start/stop order and groups as needed. |
| **Fencing (STONITH)** | Mandatory: when a node is declared lost, fence device powers it off (or reboots) so Ceph and HA can safely reassign resources. Use Proxmox's built-in fence agents (e.g. **fence_pve** with Proxmox API or IPMI/iDRAC). |
| **Network** | Redundant links where possible; same VLAN/bridge config on all nodes so failover does not change VM IPs. |
### 5.2 Ceph design (summary)
- **Pools:** At least one pool for VM/container disks (e.g. `ceph-vm`); optionally separate pool for backups or bulk data.
- **Replication:** size=3, min_size=2; tolerate 2 node failures without data loss (with 13 nodes).
- **Network:** Separate cluster network (e.g. 10.x or dedicated VLAN) for Ceph backend traffic; public for client (Proxmox) access.
- **MON/MGR:** 3 or 5 MONs (odd); 2 MGRs minimum. Spread across nodes for availability.
### 5.3 HA resource and failover behavior
- **HA resources:** Add each critical VM/CT as HA resource; define groups (e.g. “database first, then app”) and restart order.
- **Failure:** Node down → fencing → Ceph marks OSDs out → HA manager restarts VMs on other nodes using Ceph disks.
- **Maintenance:** Put node in maintenance → migrate VMs off (or let HA relocate) → fence not triggered; perform RAM/drive work.
### 5.4 What “full HA” gives you (DoD-relevant)
- **No single point of failure:** Storage replicated; compute can run on any node.
- **Automatic failover:** No manual migration for HA-managed guests.
- **Controlled maintenance:** Node can be taken down without losing services; documented procedures for patching and hardware changes.
---
## 6. DoD/MIL-Spec Compliance Framework
### 6.1 Alignment with DISA STIG / DoD requirements
DoD/MIL typically implies (summary; you must map to your exact ATO/contract):
| Area | Requirement | Implementation |
|------|-------------|----------------|
| **Hardening** | DISA STIG or equivalent for OS and applications | Apply STIG/CIS to Debian (Proxmox host) and guests; document exceptions. |
| **Authentication** | Strong auth, no default passwords, MFA where required | SSH key-only on Proxmox; no password SSH; RBAC in Proxmox; MFA for critical UIs if required. |
| **Access control** | Least privilege, RBAC, audit | Proxmox roles and permissions; separate admin vs operator; audit logs. |
| **Encryption** | TLS in transit; encryption at rest for sensitive data | TLS 1.2+ for API and Ceph; at-rest encryption (Ceph or LUKS) as required. |
| **Audit and logging** | Centralized, tamper-resistant, retention | rsyslog/syslog-ng to central log host; retention per policy; integrity (e.g. signed/hash). |
| **Change control** | Documented changes, rollback capability | Change tickets; config in Git; backups before changes; runbooks. |
| **Backup and recovery** | Regular backups, tested restore | Proxmox backups to separate storage; Ceph snapshots; DR runbook and tests. |
| **Physical and environmental** | Physical security, power, cooling | Out of scope for this doc; document in facility plan. |
### 6.2 Hardening checklist (Proxmox + Debian)
Use this as an operational checklist; align with your STIG version.
**Proxmox hosts (Debian base):**
- [ ] **SSH:** Key-only auth; PasswordAuthentication no; PermitRootLogin prohibit-password or key-only; strong ciphers/KexAlgorithms.
- [ ] **Firewall:** Restrict Proxmox API (8006) and SSH to management VLAN/CIDR; default deny.
- [ ] **Services:** Disable unnecessary services; only Proxmox, Ceph, corosync, and required dependencies.
- [ ] **Session timeout:** User session timeout (e.g. 900 s) in shell profile and/or Proxmox UI.
- [ ] **TLS:** TLS 1.2+ only; strong ciphers for pveproxy and Ceph.
- [ ] **Updates:** Security updates applied on a defined schedule; test in non-prod first.
- [ ] **FIPS:** If required by contract, use FIPS-validated crypto (kernel/openssl); document and test.
- [ ] **File permissions:** Sensitive files (keys, tokens) mode 600/400; no world-writable.
- [ ] **Audit:** auditd or equivalent for critical files and commands; logs to central host.
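A hypothetical sshd drop-in covering the SSH and session-timeout items above (cipher/KEX choices are examples — review against your STIG version before installing to `/etc/ssh/sshd_config.d/`):

```shell
# Example only — validate against your STIG before deploying.
cat > 99-hardening.conf.example <<'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin prohibit-password
ClientAliveInterval 900
ClientAliveCountMax 1
Ciphers aes256-gcm@openssh.com,aes128-gcm@openssh.com
KexAlgorithms curve25519-sha256,diffie-hellman-group16-sha512
EOF
```

Validate with `sshd -t` after installing, and keep an out-of-band console session open while testing.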
**Ceph:**
- [ ] **Auth:** Cephx enabled; key management per DoD key management policy.
- [ ] **Network:** Cluster network isolated; no Ceph ports exposed to user VLANs.
- [ ] **Encryption:** At-rest encryption for OSD if required; key escrow and rotation documented.
**Guests (VMs/containers):**
- [ ] **Per-guest hardening:** STIG/CIS per OS (e.g. Ubuntu, RHEL); documented baseline.
- [ ] **Secrets:** No secrets in configs in Git; use Vault or Proxmox secrets where applicable.
**Existing automation (this repo):** Use `scripts/security/run-security-on-proxmox-hosts.sh` (SSH key-only + firewall 8006), `scripts/security/setup-ssh-key-auth.sh`, and `scripts/security/firewall-proxmox-8006.sh`; extend to all 13 hosts and run with `--apply` after validating with `--dry-run`. Extend host list in scripts or via env (e.g. all R630 IPs).
### 6.3 Audit and documentation
- **Configuration baseline:** All Proxmox and Ceph configs in version control; changes via PR/ticket.
- **Runbooks:** Install, upgrade, add node, remove node, replace drive, fence test, backup/restore, disaster recovery.
- **Evidence:** Run STIG/CIS scans (e.g. OpenSCAP, Nessus) and retain reports for assessors.
- **Change log:** Document every change (who, when, why, ticket); link to runbook.
---
## 7. Phased Implementation
### Phase 1 — Prepare (no downtime)
1. **IP and DNS:** Assign and document 13 IPs for R630s; update `config/ip-addresses.conf` and DNS.
2. **RAM:** Upgrade all 13 R630s to at least 128 GB (256 GB recommended); document DIMM layout.
3. **Drives:** Install boot mirror (2 × SSD) and data drives (4–6 SSDs per node) on each R630; configure ZFS mirror for boot.
4. **Proxmox install:** Install Proxmox VE on all 13; same version; join to one cluster; configure VLAN-aware bridge and management IPs.
5. **Hardening:** Apply SSH key-only, firewall, and STIG/CIS checklist to all nodes; document exceptions.
### Phase 2 — Ceph
1. **Ceph install:** Install Ceph on all 13 nodes (Proxmox Ceph integration); create MON (3 or 5), MGR (2), OSD (all nodes).
2. **Pools:** Create replication pool (size=3, min_size=2) for VM disks; add as Proxmox storage.
3. **Network:** Configure Ceph public and cluster networks; validate connectivity and latency.
4. **Tests:** Fill and drain; kill OSD/node and verify recovery; document procedures.
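Steps 1–2 map roughly onto Proxmox's `pveceph` tooling as follows (a sketch; run per node where noted, and the pool name and device path are placeholders):

```bash
pveceph install                      # on each node
pveceph mon create                   # on 3 or 5 nodes
pveceph mgr create                   # on at least 2 nodes
pveceph osd create /dev/sdX          # per data drive, per node
pveceph pool create ceph-vm --size 3 --min_size 2 --add_storages
```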
### Phase 3 — HA and fencing
1. **Fencing:** Configure fence_pve (or IPMI/IDRAC) for each node; test fence from another node.
2. **HA manager:** Enable HA in cluster; add critical VMs/containers as HA resources; set groups and order.
3. **Failover tests:** Power off one node; verify fencing and HA restart on another node; repeat for 2-node failure if desired.
4. **Runbooks:** Document failover test results and operational procedures.
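Step 2 can be scripted with `ha-manager` (group name, node list, and VMID are placeholders):

```bash
# Define an HA group and register a guest as an HA resource
ha-manager groupadd critical --nodes "r630-01,r630-02,r630-03"
ha-manager add vm:100 --group critical --state started \
  --max_restart 2 --max_relocate 2
ha-manager status        # verify resource state after failover tests
```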
### Phase 4 — Migrate workload
1. **Migrate disks:** Move VM/container disks from local storage to Ceph (live migration or backup/restore).
2. **Decommission local-only:** Once all HA resources are on Ceph, remove or repurpose local LVM for non-HA or cache.
3. **Monitoring and alerting:** Integrate with central monitoring; alerts for quorum loss, Ceph health, fence events, HA failures.
### Phase 5 — DoD/MIL continuous compliance
1. **Scans:** Schedule STIG/CIS scans; remediate and document exceptions.
2. **Backup and DR:** Automate backups; test restore quarterly; update DR runbook.
3. **Change control:** All changes via ticket + runbook; config in Git; periodic review of permissions and audit logs.
---
## 8. References and Related Docs
| Document | Purpose |
|----------|---------|
| [PROXMOX_HA_CLUSTER_ROADMAP.md](./PROXMOX_HA_CLUSTER_ROADMAP.md) | Current HA roadmap (3-node); extend to 13-node. |
| [PROXMOX_CLUSTER_ARCHITECTURE.md](./PROXMOX_CLUSTER_ARCHITECTURE.md) | Cluster and storage overview. |
| [PHYSICAL_DRIVES_AND_CONFIG.md](../04-configuration/PHYSICAL_DRIVES_AND_CONFIG.md) | Current drive layout (existing 2 R630s + ml110). |
| Proxmox Ceph documentation | [Ceph in Proxmox](https://pve.proxmox.com/pve-docs/chapter-pveceph.html). |
| Proxmox HA | [High Availability](https://pve.proxmox.com/pve-docs/chapter-ha-manager.html). |
| DISA STIG | [DISA STIGs](https://public.cyber.mil/stigs/); Debian/Ubuntu and application STIGs. |
| CIS Benchmarks | [CIS Benchmarks](https://www.cisecurity.org/cis-benchmarks); Debian, Proxmox if available. |
---
## 9. Summary Table
| Item | Specification |
|------|----------------|
| **Nodes** | 13 × Dell PowerEdge R630 |
| **Quorum** | Majority 7; up to 6 nodes can fail |
| **RAM per node** | Minimum 128 GB; **recommended 256 GB** (DoD production) |
| **Boot** | 2 × SSD (e.g. 240–480 GB) ZFS mirror per node |
| **Data (Ceph)** | 4–6 × SSD (e.g. 480 GB–1 TB) per node, one OSD per drive |
| **Shared storage** | Ceph replicated (size=3, min_size=2) |
| **HA** | Proxmox HA manager; fencing (STONITH) required |
| **Hardening** | STIG/CIS alignment; SSH key-only; firewall; TLS; audit; change control |
| **Encryption** | TLS in transit; at-rest per policy (Ceph or LUKS) |
---
**Owner:** Architecture / Infrastructure
**Review:** Quarterly or when adding nodes / changing compliance scope
**Change control:** Update version and “Last Updated” when changing this plan; link change ticket.


@@ -2,12 +2,27 @@
**Navigation:** [Home](/docs/01-getting-started/README.md) > [Architecture](/docs/01-getting-started/README.md) > VMID Allocation
**Last Updated:** 2025-01-20
**Document Version:** 1.0
**Last Updated:** 2026-02-26
**Document Version:** 1.1
**Status:** 🟢 Active Documentation
---
## VMID Quick Reference (Operational)
| Range | Purpose | Notes |
|------:|---------|-------|
| 3000–3003 | Monitor / RPC-adjacent (ml110 / ccip-monitor-1..4) | Within RPC/Gateways (2500–3499). Not CCIP DON. Not AI/Agents. |
| 5400–5599 | CCIP DON (Chainlink CCIP) | 5410–5429 Commit, 5440–5459 Execute, 5470–5476 RMN. |
| 5700–5999 | AI / Agents / Dev | Official band for model serving, MCP, agent runtimes. |
**Naming/Tags (recommended):**
- AI VMs: `ai-<role>-<env>` (e.g. `ai-mcp-prod`, `ai-inf-dev`, `ai-agent-prod`)
- Monitor/RPC-adjacent: `ccip-monitor-<n>`
- Proxmox tags: `AI`, `MCP`, `HF`, `MONITOR`, `PROD`/`DEV`
---
## Complete VMID Allocation Table
| VMID Range | Domain | Total VMIDs | Initial Usage | Available |
@@ -16,7 +31,7 @@
| 5000–5099 | Blockscout | 100 | 1 | 99 |
| 5200–5299 | Cacti | 100 | 1 | 99 |
| 5400–5599 | Chainlink CCIP | 200 | 1+ | 199 |
| 5700–5999 | (available / buffer) | 300 | 0 | 300 |
| 5700–5999 | AI / Agents / Dev (model serving, MCP, agent runtimes) | 300 | 1 | 299 |
| 6000–6099 | Fabric | 100 | 1 | 99 |
| 6200–6299 | FireFly | 100 | 1 | 99 |
| 6400–7399 | Indy | 1,000 | 1 | 999 |
@@ -41,10 +56,14 @@
- **1500-1503**: Initial sentries (4 nodes)
- **1504-2499**: Reserved for sentry expansion (996 VMIDs)
#### RPC / Gateways (2500-3499) - 1,000 VMIDs
- **2500-2502**: Initial RPC nodes (3 nodes)
- **2503-2505**: Besu RPC (HYBX; 3 nodes). **2506-2508 destroyed 2026-02-08** (no longer in use).
- **2509-3499**: Reserved for RPC/Gateway expansion
#### RPC / Gateways (Besu) — 2500–3499
- **2500–2508:** In-use RPC/Gateway nodes (2500–2502 initial; 2503–2505 HYBX; 2506–2508 destroyed 2026-02-08).
- **2509–2999:** Reserved for RPC/Gateway expansion
- **3000–3003:** **ml110 / monitor-style (RPC-adjacent)** — legacy/current usage
  - Suggested naming: **ccip-monitor-1..4**
  - **Not** the CCIP DON allocation (CCIP DON = **5400–5599**)
  - **Not** the AI/Agents allocation (AI/Agents = **5700–5999**)
- **3004–3499:** Reserved for RPC/Gateway expansion
#### Archive / Telemetry (3500-4299) - 800 VMIDs
- **3500+**: Archive / Snapshots / Mirrors / Telemetry
@@ -78,10 +97,16 @@
---
### Available / Buffer (5700-5999) - 300 VMIDs
### AI / Agents / Dev — 5700–5999
- **5700**: Dev VM (shared Cursor dev + private Gitea for four users). See [DEV_VM_GITOPS_PLAN.md](../04-configuration/DEV_VM_GITOPS_PLAN.md).
- **5701-5999**: Reserved for future use / buffer space
This is the **official VMID range** for AI workloads, agent runtimes, MCP servers, and AI/dev experimentation. **Do not** place AI workloads in 3000–3099; that range is within RPC/Gateways expansion and includes legacy monitor/RPC-adjacent nodes (3000–3003).
- **5700:** Dev VM (existing). See [DEV_VM_GITOPS_PLAN.md](../04-configuration/DEV_VM_GITOPS_PLAN.md).
- **5701–5749:** AI platform services (model serving, MCP hub, auth, observability)
- **5750–5899:** AI applications (per-project agents, DODO PMM tooling, policy guardrails)
- **5900–5999:** Experiments / temporary / buffer
**Optional suggested layout:** 5701 = MCP Hub; 5702 = Inference (HF model server); 5703 = Agent Worker (orchestration); 5704 = Memory/State (Postgres/Redis/Vector DB). See [AI_AGENTS_57XX_DEPLOYMENT_PLAN.md](AI_AGENTS_57XX_DEPLOYMENT_PLAN.md) for copy/paste deployment steps (QEMU guest agent, 57xx layout, MCP/DODO PMM, read-only vs execution).
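The suggested layout can be captured as data so provisioning stays consistent across operators. A sketch that prints placeholder `qm create` commands for review (the memory/core sizes are illustrative, not from this plan):

```shell
# Suggested 57xx layout captured as a VMID → name map (ai-<role>-<env> scheme).
declare -A AI_LAYOUT=(
  [5701]="ai-mcp-prod"     # MCP Hub
  [5702]="ai-inf-prod"     # Inference (HF model server)
  [5703]="ai-agent-prod"   # Agent Worker (orchestration)
  [5704]="ai-state-prod"   # Memory/State (Postgres/Redis/Vector DB)
)

# Print placeholder qm create commands for review; sizes are illustrative.
for vmid in 5701 5702 5703 5704; do
  echo "qm create ${vmid} --name ${AI_LAYOUT[$vmid]} --memory 8192 --cores 4"
done
```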
---
@@ -131,16 +156,18 @@ VMID_VALIDATORS_START=1000 # Besu validators: 1000-1499
VMID_SENTRIES_START=1500 # Besu sentries: 1500-2499
VMID_RPC_START=2500 # Besu RPC: 2500-3499
VMID_ARCHIVE_START=3500 # Besu archive/telemetry: 3500-4299
VMID_BESU_RESERVED_START=4300 # Besu reserved: 4300-4999
VMID_EXPLORER_START=5000 # Blockscout: 5000-5099
VMID_CACTI_START=5200 # Cacti: 5200-5299
VMID_CCIP_START=5400 # Chainlink CCIP: 5400-5599
VMID_BUFFER_START=5700 # Buffer: 5700-5999
VMID_FABRIC_START=6000 # Fabric: 6000-6099
VMID_FIREFLY_START=6200 # Firefly: 6200-6299
VMID_INDY_START=6400 # Indy: 6400-7399
VMID_SANKOFA_START=7800 # Sankofa/Phoenix/PanTel: 7800-8999
VMID_SOVEREIGN_CLOUD_START=10000 # Sovereign Cloud: 10000-13999
VMID_BESU_RESERVED_START=4300 # Besu reserved: 4300-4999
VMID_EXPLORER_START=5000 # Blockscout: 5000-5099
VMID_CACTI_START=5200 # Cacti: 5200-5299
VMID_CCIP_START=5400 # Chainlink CCIP: 5400-5599
VMID_AI_AGENTS_START=5700 # AI / Agents / Dev: 5700-5999 (model serving, MCP, agent runtimes)
# Optional alias for backward compatibility (deprecated):
# VMID_BUFFER_START=5700 # deprecated: use VMID_AI_AGENTS_START
VMID_FABRIC_START=6000 # Fabric: 6000-6099
VMID_FIREFLY_START=6200 # Firefly: 6200-6299
VMID_INDY_START=6400 # Indy: 6400-7399
VMID_SANKOFA_START=7800 # Sankofa/Phoenix/PanTel: 7800-8999
VMID_SOVEREIGN_CLOUD_START=10000 # Sovereign Cloud: 10000-13999
```
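A provisioning script can validate candidate VMIDs against the band before creating or cloning VMs. A minimal sketch built on `VMID_AI_AGENTS_START` from the block above (the `_END` variable and helper function are illustrative additions, not part of the documented variable set):

```shell
VMID_AI_AGENTS_START=5700   # AI / Agents / Dev band start
VMID_AI_AGENTS_END=5999     # band end (300 VMIDs total)

# Return success if a candidate VMID falls inside the AI/Agents band.
vmid_in_ai_band() {
  [ "$1" -ge "$VMID_AI_AGENTS_START" ] && [ "$1" -le "$VMID_AI_AGENTS_END" ]
}

vmid_in_ai_band 5703 && echo "5703 is in the AI/Agents band"
vmid_in_ai_band 3001 || echo "3001 is outside (RPC/monitor-adjacent)"
```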
---
@@ -153,7 +180,7 @@ VMID_SOVEREIGN_CLOUD_START=10000 # Sovereign Cloud: 10000-13999
| Blockscout | 5000 | 5099 | 100 | 1 | 99 | 99.0% |
| Cacti | 5200 | 5299 | 100 | 1 | 99 | 99.0% |
| Chainlink CCIP | 5400 | 5599 | 200 | 1+ | 199 | 99.5% |
| Buffer | 5700 | 5999 | 300 | 0 | 300 | 100% |
| AI/Agents/Dev | 5700 | 5999 | 300 | 1 | 299 | 99.7% |
| Fabric | 6000 | 6099 | 100 | 1 | 99 | 99.0% |
| FireFly | 6200 | 6299 | 100 | 1 | 99 | 99.0% |
| Indy | 6400 | 7399 | 1,000 | 1 | 999 | 99.9% |
@@ -170,11 +197,16 @@ VMID_SOVEREIGN_CLOUD_START=10000 # Sovereign Cloud: 10000-13999
**Future-proof** - Large buffers and reserved ranges
**Modular design** - Each service has dedicated range
**Sovereign Cloud Band** - 4,000 VMIDs for SMOM/ICCC/DBIS/Absolute Realms
**AI/Agents band (5700–5999)** — Dedicated range for model serving, MCP, agent runtimes; 3000–3003 remain RPC/monitor-adjacent
---
## Migration Notes
**New Additions (v1.1):**
- **AI/Agents/Dev (5700–5999)** defined as the official band for AI inference, MCP, agent runtimes, vector DB, and AI platform services (not 3000–3099).
- **3000–3003** explicitly documented as **RPC/monitor-adjacent** (ml110 / ccip-monitor-1..4), not CCIP DON and not AI/Agents.
**Previous Allocations**:
- Validators: 106-110, 1100-1104 → **1000-1004**
- Sentries: 111-114, 1110-1113 → **1500-1503**
@@ -187,8 +219,13 @@ VMID_SOVEREIGN_CLOUD_START=10000 # Sovereign Cloud: 10000-13999
- Indy: 8000, 263 → **6400**
**New Additions**:
- Buffer: 5700-5999 (300 VMIDs)
- AI/Agents/Dev: 5700-5999 (300 VMIDs). **Use this band for AI inference, MCP, agent runtimes, vector DB; not 3000-3099.** Sub-ranges: 5701-5749 platform, 5750-5899 apps, 5900-5999 experiments. 3000-3003 remain RPC/monitor-adjacent (ml110/ccip-monitor-1..4).
- Sankofa/Phoenix/PanTel: 7800-8999 (1,200 VMIDs)
- Sovereign Cloud Band: 10000-13999 (4,000 VMIDs)
- **NPMplus Alltra/HYBX:** VMID 10235 (192.168.11.169). See [04-configuration/NPMPLUS_ALLTRA_HYBX_MASTER_PLAN.md](../04-configuration/NPMPLUS_ALLTRA_HYBX_MASTER_PLAN.md). NPMplus range: 10233 (primary), 10234 (HA secondary), 10235 (Alltra/HYBX).
---
**Owner:** Architecture
**Review cadence:** Quarterly or upon new VMID band creation
**Change control:** PR required; update Version + Last Updated