# 2× XE9680 vs 3× R750 — GPU/AI tier decision

Last updated: 2026-03-03
Context: The 3× R750 were planned for GPU/AI workloads. This doc compares that plan to using 2× Dell PowerEdge XE9680 (8× NVIDIA A100 80GB SXM4 per node) instead.

Reference: Dell PowerEdge XE9680 — 8× NVIDIA A100 80GB SXM4


## Opinion summary

- For a dedicated GPU/AI tier: 2× XE9680 is the stronger choice over 3× GPU-equipped R750 if you need serious capacity: 16× A100 80GB, purpose-built cooling and power, NVLink, and optional 200G InfiniBand. Pricing (vendor example): $94,999 per node on sale (list $179,000) → $189,998 for 2 nodes, vs ~$30k-$80k for 3× R750 with GPUs.
- 3× R750 (GPU) still makes sense if you prefer lower cost, fewer GPUs (e.g. 6-12 total across 3 nodes), PCIe-based GPUs (e.g. A6000, L40S), and more flexibility to mix GPU and CPU workloads per node.
- Recommendation: prefer 2× XE9680 for heavy AI/ML (training, large models, 80 GB VRAM); prefer 3× R750 + GPUs for cost-sensitive or lighter/mixed GPU workloads.
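The cost math above can be made concrete. A minimal sketch, assuming the vendor-example XE9680 pricing from this doc and one *hypothetical* R750 build (4× 48 GB PCIe GPUs per node, at the high end of the ~$75k estimate) — the R750 numbers are illustrative, not a quote:

```python
# Rough cost-per-GPU / cost-per-GB-VRAM comparison for the two options.
# XE9680 figures are the vendor-example prices from this doc; the R750
# config (12 GPUs, 48 GB each, $75k total) is a hypothetical high-end build.

def per_gpu_metrics(total_cost, gpus, vram_per_gpu_gb):
    total_vram = gpus * vram_per_gpu_gb
    return {
        "cost_per_gpu": total_cost / gpus,
        "cost_per_gb_vram": total_cost / total_vram,
        "total_vram_gb": total_vram,
    }

xe9680 = per_gpu_metrics(total_cost=2 * 94_999, gpus=16, vram_per_gpu_gb=80)
r750 = per_gpu_metrics(total_cost=75_000, gpus=12, vram_per_gpu_gb=48)

print(xe9680)  # 16x A100 80GB: ~$11.9k/GPU, ~$148/GB, 1280 GB total
print(r750)    # 12x 48GB PCIe: ~$6.3k/GPU, ~$130/GB, 576 GB total
```

Note the $/GB-of-VRAM figures land in the same ballpark; what the XE9680 premium actually buys is density, NVLink, and 80 GB in a single GPU.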

## 3× R750 as GPU/AI nodes

| Aspect | R750 (2U, GPU-equipped) |
|---|---|
| Form factor | 2U per node; 6U total for 3 |
| GPU capacity | Typically 2-4 PCIe GPUs per node (e.g. NVIDIA A6000, L40S, A40) → 6-12 GPUs total; depends on TDP and slot layout |
| GPU memory | PCIe cards: often 24-48 GB per GPU; no 80 GB SXM4 in 2U |
| Interconnect | PCIe; no NVLink across GPUs; 10G/25G typical unless you add high-speed NICs |
| Use case | Lighter training, inference, mixed workloads; good for dev/staging GPU pools |
| Cost (approx) | ~$10k-$25k per node (server + GPUs) → ~$30k-$75k for 3 nodes |
| Power/cooling | Moderate; 2U-appropriate |

R750 is a general-purpose 2U that can hold GPUs; it is not an 8-GPU dense AI chassis.


## 2× XE9680 as GPU/AI nodes

| Aspect | XE9680 (6U, 8× A100 SXM4) |
|---|---|
| Form factor | 6U per node; 12U total for 2 |
| GPU capacity | 8× NVIDIA A100 80GB SXM4 per node → 16× A100 80GB total |
| GPU memory | 80 GB per A100; NVLink within node for fast multi-GPU training |
| Interconnect | 100GbE Ethernet + 200Gb/s InfiniBand (Mellanox ConnectX-6); ideal for multi-node scaling |
| Use case | Large-model training, HPC, generative AI, serious ML/DL workloads |
| Cost (approx) | $94,999/node on sale (list $179,000) → $189,998 for 2 nodes |
| Power/cooling | High; 6U chassis and power designed for 8× A100 |

XE9680 is purpose-built for dense GPU AI; no 2U server matches this density and memory per node.


## Side-by-side (GPU/AI)

| Factor | 3× R750 (GPU) | 2× XE9680 |
|---|---|---|
| Total GPUs | 6-12 (PCIe, config-dependent) | 16× A100 80GB |
| VRAM per GPU | Typically 24-48 GB | 80 GB |
| Multi-GPU | PCIe only | NVLink within node |
| Multi-node | 10G/25G typical | 200G InfiniBand (option) |
| Rack space | 6U | 12U |
| Cost | Lower (~$30k-$75k ballpark) | $189,998 for 2 (2 × $94,999 sale) |
| Best for | Lighter AI, mixed workloads, budget | Heavy training, large models, max throughput |
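One way to read the VRAM column: a common rule of thumb for mixed-precision Adam training is ~16 bytes per parameter (fp16 weights and gradients plus fp32 master weights and two optimizer states), ignoring activations. A sketch under that assumption — an order-of-magnitude ceiling, not a sizing guarantee:

```python
def max_params_billion(vram_gb: float, bytes_per_param: int = 16) -> float:
    """Rough ceiling on trainable model size (billions of params) for a VRAM pool.

    Assumes ~16 bytes/param for mixed-precision Adam training and
    ignores activation memory, so treat results as order-of-magnitude.
    """
    # GB divided by (bytes/param) comes out directly in billions of params.
    return vram_gb / bytes_per_param

print(max_params_billion(80))       # one A100 80GB: ~5B params
print(max_params_billion(48))       # one 48GB PCIe card: ~3B params
print(max_params_billion(16 * 80))  # 16x A100, fully sharded (ZeRO-3-style): ~80B params
```

This is why the 80 GB SXM4 cards matter for "large models": the pooled 1.28 TB across 2× XE9680 puts tens-of-billions-parameter training in reach, while a 6-12× 48 GB PCIe pool tops out much lower.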

## Recommendation (GPU/AI tier)

| Goal | Recommendation |
|---|---|
| Maximum GPU capacity and large-model training | 2× XE9680 — 16× A100 80GB, NVLink, InfiniBand; accept higher cost and power. |
| Lower cost, flexible GPU pool (dev/staging/inference) | 3× R750 + GPUs — 6-12 GPUs, PCIe-based; good value and spread across 3 nodes. |
| Start small, expand later | 3× R750 now; add XE9680 (or similar) later when workload justifies it. |

If you replace the planned 3× R750 (GPU) with 2× XE9680: document the GPU tier as "2× XE9680" in the inventory and assign IPs (e.g. .24-.25 for the two nodes); the R750 IP block (.24-.26) can be repurposed or left for future use.
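If the swap happens, the inventory entry could be sketched like this. Hostnames and the subnet prefix are placeholders — only the .24/.25/.26 host octets come from this doc:

```python
# Hypothetical inventory fragment for the GPU tier after the swap.
# "PREFIX" stands in for the real subnet, which is not specified here.
GPU_TIER = {
    "xe9680-1": {"ip_host_octet": 24, "gpus": "8x NVIDIA A100 80GB SXM4"},
    "xe9680-2": {"ip_host_octet": 25, "gpus": "8x NVIDIA A100 80GB SXM4"},
}
SPARE_HOST_OCTETS = [26]  # freed from the old 3x R750 plan; keep for future use

for name, node in GPU_TIER.items():
    print(f"{name}: PREFIX.{node['ip_host_octet']} -> {node['gpus']}")
```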


## Inventory and docs


## References