At the VMware Explore 2025 keynote, Chris Wolf announced DirectPath enablement for GPUs with VMware Private AI, marking a major step forward in simplifying and scaling enterprise AI infrastructure. By granting VMs exclusive, high-performance access to NVIDIA GPUs, DirectPath allows organizations to fully harness GPU capabilities without added licensing complexity. This makes it easier to experiment, prototype, and move AI projects into production with confidence. In addition, VMware Private AI brings models closer to enterprise data, delivering secure, efficient, and cost-effective deployments. Jointly engineered by Broadcom and NVIDIA, the solution empowers organizations to accelerate innovation while reducing total cost of ownership (TCO).
These advancements come at a critical time. Serving state-of-the-art large language models (LLMs) like DeepSeek-R1, Meta Llama-3.1-405B-Instruct, and Qwen3-235B-A22B-thinking at full context length often exceeds the capacity of a single 8x H100 GPU server, making distributed inference essential. Aggregating resources from multiple GPU-enabled nodes allows these models to run efficiently, though it introduces new challenges in infrastructure management, interconnect optimization, and workload scheduling.
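To make this concrete, one common open-source path (not necessarily the serving stack used in the white paper) is vLLM with tensor parallelism inside each node and pipeline parallelism across nodes. The sketch below assumes a Ray cluster already spans two 8-GPU HGX nodes and that the model name, parallelism sizes, and context length are illustrative values rather than recommendations from the paper.

```python
# Minimal sketch: serving a model too large for one 8-GPU node with vLLM.
# Assumptions (not from the white paper): a Ray cluster already spans two
# 8-GPU HGX nodes, and the model fits once sharded across 16 GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",
    tensor_parallel_size=8,        # shard each layer across the 8 GPUs of a node (NVLink/NVSwitch)
    pipeline_parallel_size=2,      # split layers across the 2 nodes (InfiniBand / RoCEv2)
    distributed_executor_backend="ray",
    max_model_len=131072,          # full 128K context
)

outputs = llm.generate(
    ["Summarize the benefits of distributed LLM inference."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```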
This is where VMware Cloud Foundation (VCF) plays a vital role. VCF is the industry's first private cloud platform to deliver public cloud scale and agility while providing on-premises security, resilience, and performance, all with lower TCO. Leveraging technologies such as NVIDIA NVLink, NVSwitch, and GPUDirect® RDMA, VCF enables high-bandwidth, low-latency communication across nodes. It also ensures that network interconnects like InfiniBand (IB) and RoCEv2 (RDMA over Converged Ethernet) are used effectively, reducing communication overhead that can limit distributed inference performance. With VCF, enterprises can deploy production-grade distributed inference, ensuring even the largest reasoning models run reliably with predictable performance characteristics.
This blog post summarizes our white paper, "Deploy Distributed LLM inference with GPUDirect RDMA over Infiniband in VMware Private AI", which provides architectural guidance, detailed deployment steps, and technical best practices for distributed LLM inference across multiple GPU nodes on VCF and NVIDIA HGX servers with GPUDirect RDMA over IB.
NVIDIA HGX servers play a central role, with their internal topology (PCIe switches, NVIDIA H100/H200 GPUs, and ConnectX-7 IB HCAs) described in detail. A 1:1 GPU-to-NIC ratio is emphasized as critical for optimal GPUDirect RDMA performance, ensuring each accelerator has a dedicated, high-bandwidth path.
NVLink and NVSwitch enable ultra-fast communication within a single HGX node (up to 8 GPUs), while InfiniBand or RoCEv2 provide the high-bandwidth, low-latency interconnects required to scale inference across multiple HGX servers.
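A quick way to confirm both the intra-node NVLink mesh and the per-GPU NIC affinity described above is `nvidia-smi topo -m`. The small wrapper below is a convenience sketch (not taken from the white paper) that simply invokes the command inside a GPU-enabled worker node or guest VM.

```python
# Sketch: inspect GPU<->GPU and GPU<->NIC connectivity inside a guest/worker node.
# "NV#" entries indicate NVLink hops between GPUs; with a 1:1 GPU-to-NIC mapping,
# each GPU should also show a close PCIe relationship (e.g. PIX/PXB) with exactly
# one ConnectX-7 NIC.
import subprocess

def show_topology() -> None:
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)

if __name__ == "__main__":
    show_topology()
```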
Enabling GPUDirect RDMA within VCF requires specific configurations, such as enabling Access Control Services (ACS) in ESXi and Address Translation Services (ATS) on ConnectX-7 NICs. ATS allows direct DMA transactions between PCIe devices, bypassing the Root Complex and restoring near bare-metal performance in virtualized environments.
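As an illustration of the NIC-side ATS check, the sketch below queries a ConnectX-7 device with NVIDIA's mlxconfig utility (part of the MFT tools). The device path, the expected output format, and the vSphere advanced settings mentioned in the comments are assumptions to adapt to your environment, not prescriptions from the white paper.

```python
# Sketch: verify that ATS is enabled on a ConnectX-7 NIC using mlxconfig (MFT tools).
# The device path below is a placeholder; list devices with `mst status` first.
# vSphere-side settings for GPUDirect RDMA (e.g. per-VM advanced parameters such as
# pciPassthru.allowP2P) should follow the white paper's guidance; this script only
# inspects the NIC firmware configuration.
import subprocess

DEVICE = "/dev/mst/mt4129_pciconf0"  # placeholder ConnectX-7 device

def ats_enabled(device: str = DEVICE) -> bool:
    out = subprocess.run(
        ["mlxconfig", "-d", device, "query"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Expect a line resembling "ATS_ENABLED    True(1)" in the query output.
    return any("ATS_ENABLED" in line and "True" in line for line in out.splitlines())

if __name__ == "__main__":
    print(f"ATS enabled on {DEVICE}: {ats_enabled()}")
    # To enable it (followed by a firmware reset / reboot):
    #   mlxconfig -d /dev/mst/mt4129_pciconf0 set ATS_ENABLED=true
```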
A practical framework is included for calculating the minimum number of HGX servers required for LLM inference. Factors such as num_attention_heads and context length are taken into account, with a reference table showing hardware requirements for popular LLMs (e.g., Llama-3.1-405B, DeepSeek-R1, Llama-4-Series, Kimi-K2, etc.). For instance, DeepSeek-R1 and Llama-3.1-405B at full context length both require at least two H100 HGX servers.
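To illustrate the kind of sizing math the framework formalizes, here is a rough back-of-the-envelope sketch (our own simplification, not the white paper's exact formula): aggregate GPU memory must hold the model weights plus the KV cache at the target context length, and the KV-cache term is where layer count, head dimension, and the number of KV heads (with GQA, a subset of num_attention_heads) enter. The Llama-3.1-405B configuration values in the example are approximate published numbers, used here only for illustration.

```python
# Rough sizing sketch (a simplification of the white paper's framework, with
# assumed model-config numbers): weights + KV cache must fit in aggregate GPU memory.
import math

def min_hgx_servers(
    params_b: float,                # model parameters, in billions
    layers: int,
    kv_heads: int,                  # KV heads (GQA), not total attention heads
    head_dim: int,
    context_len: int,
    bytes_per_param: float = 2.0,   # FP16/BF16 weights
    bytes_per_kv: float = 2.0,      # FP16 KV cache
    concurrent_seqs: int = 1,
    gpu_mem_gib: float = 80.0,      # H100 80 GB
    gpus_per_server: int = 8,
    usable_fraction: float = 0.9,   # headroom for activations and runtime overheads
) -> int:
    weights_gib = params_b * 1e9 * bytes_per_param / 2**30
    kv_gib = (2 * layers * kv_heads * head_dim * context_len      # factor 2 = K and V
              * bytes_per_kv * concurrent_seqs) / 2**30
    usable_per_server = gpus_per_server * gpu_mem_gib * usable_fraction
    return math.ceil((weights_gib + kv_gib) / usable_per_server)

# Example with approximate Llama-3.1-405B config values (assumed for illustration):
print(min_hgx_servers(params_b=405, layers=126, kv_heads=8,
                      head_dim=128, context_len=131072))   # -> 2 HGX servers
```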
The solution architecture is broken down into the VKS Cluster, the Supervisor Cluster, and critical Service VMs running NVIDIA Fabric Manager. It highlights the use of Dynamic DirectPath I/O to make GPUs and NICs directly accessible to workload VKS nodes, while NVSwitches are passed through to the Service VMs.
An 8-step deployment workflow is presented, from host and network configuration through workload rollout, illustrated with concrete configuration examples.
Steps are provided for verifying RDMA, GPUDirect RDMA, and NCCL performance across multiple nodes. Benchmarking results are included for models such as DeepSeek-R1-0528 and Llama-3.1-405B-Instruct on two HGX nodes, using the NVIDIA GenAI-Perf load-testing tool.
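For reference, the usual tools for these checks are perftest (for example, ib_write_bw with its CUDA option so the transfer exercises GPUDirect RDMA) and nccl-tests (all_reduce_perf launched over MPI). The sketch below wraps representative invocations; host names, RDMA device names, and binary paths are placeholders rather than values from the white paper.

```python
# Sketch: representative multi-node verification commands (hostnames, RDMA device
# names, and binary paths are placeholders; adapt them to your environment).
import subprocess

# 1) Raw RDMA bandwidth between two nodes: run SERVER_CMD on node1, CLIENT_CMD on
#    node2. --use_cuda selects a GPU so the transfer goes through GPUDirect RDMA
#    rather than host memory.
SERVER_CMD = ["ib_write_bw", "-d", "mlx5_0", "--report_gbits", "--use_cuda=0"]
CLIENT_CMD = ["ib_write_bw", "-d", "mlx5_0", "--report_gbits", "--use_cuda=0", "node1"]

# 2) NCCL all-reduce bandwidth across 16 GPUs on two HGX nodes via MPI.
NCCL_CMD = [
    "mpirun", "-np", "16", "-H", "node1:8,node2:8",
    "./nccl-tests/build/all_reduce_perf", "-b", "8", "-e", "8G", "-f", "2", "-g", "1",
]

def run(cmd: list[str]) -> None:
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run(NCCL_CMD)  # run the ib_write_bw pair manually on each node first
```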
For a deeper dive into the technical specifics and deployment procedures, we encourage you to read the full white paper: https://www.vmware.com/docs/vcf-distributed-infer
Ready to get started on your AI and ML journey? Check out these helpful resources: