NVLink Topology for Multi-GPU LLM Deployments

When a 70B-parameter model has to be sliced across four GPUs, every generated token triggers a cascade of all-reduces across those GPUs. The interconnect between them, not the GPUs themselves, decides whether you measure latency in tens of milliseconds or in seconds. NVLink is the interconnect that makes tensor-parallel inference economically viable on-premises. This is the practical map of NVLink generations, NVSwitch fabrics, DGX and HGX topologies, and what it all means for sovereign LLM serving in Oman and the GCC.

NVLink in 100 words, and why PCIe alone is not enough

NVLink is NVIDIA's proprietary GPU-to-GPU interconnect, introduced with the P100 in 2016. Each successive generation has doubled or tripled bandwidth: NVLink 1 at 160 GB/s aggregate per GPU, NVLink 2 at 300 GB/s, NVLink 3 at 600 GB/s on the A100, and NVLink 4 at 900 GB/s aggregate (450 GB/s per direction) on the H100. NVIDIA's NVLink product page documents the current generation as 18 NVLink connections per H100, each running at 50 GB/s bidirectional.

PCIe, by comparison, is a host bus, not a GPU fabric. PCIe Gen5 x16 delivers 128 GB/s aggregate (64 GB/s per direction). That is roughly seven times slower than NVLink 4 between two H100s, before you account for protocol overhead. Tensor-parallel inference issues an all-reduce on every transformer layer's output (multiple times per token for some kernels), and that bandwidth gap converts directly into per-token latency. For a 70B model split four ways, the difference between NVLink and PCIe at inference time is roughly 3 to 8x in throughput, depending on the framework and quantisation. For a deeper procurement view of when this matters, our pillar on AI inference hardware comparison walks through the model-to-hardware mapping.
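
To see how that gap lands in practice, here is a rough estimate of the all-reduce cost alone, assuming a Llama-70B-like shape (80 layers, hidden size 8192, fp16, two all-reduces per layer) and the nominal per-direction link bandwidths above. Real NCCL efficiency, overlap, and kernel choice will move the absolute numbers; the ratio is the point.

    # Back-of-envelope: tensor-parallel all-reduce cost per forward pass.
    # Model shape is an assumption (Llama-70B-like), not a measured figure.
    LAYERS = 80        # transformer layers
    HIDDEN = 8192      # hidden size
    BYTES = 2          # fp16 activations
    REDUCES = 2        # all-reduces per layer (attention out + MLP out)
    TP = 4             # tensor-parallel degree

    def comm_ms(tokens: int, bw_gb_s: float, hop_us: float) -> float:
        """Ring all-reduce: each GPU moves 2*(TP-1)/TP of the message."""
        msg = tokens * HIDDEN * BYTES
        per_reduce = (2 * (TP - 1) / TP) * msg / (bw_gb_s * 1e9) \
                     + 2 * (TP - 1) * hop_us * 1e-6   # bandwidth + hop latency
        return LAYERS * REDUCES * per_reduce * 1e3

    for name, bw, lat in [("NVLink 4, ~450 GB/s/dir", 450.0, 3.0),
                          ("PCIe Gen5, ~64 GB/s/dir", 64.0, 5.0)]:
        print(f"{name}: prefill 1024 tok ~{comm_ms(1024, bw, lat):.1f} ms, "
              f"decode 1 tok ~{comm_ms(1, bw, lat):.2f} ms")

On these assumptions the prefill communication alone comes to roughly 12 ms versus 68 ms, squarely inside the 3 to 8x throughput range quoted above.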

DGX, HGX, and custom 8-GPU topologies

Three packaging tiers exist for an 8-GPU NVLink system:

  • DGX H100/H200. NVIDIA's own integrated server. Fixed BoM: 8 H100 SXM5 GPUs on an HGX baseboard, dual Sapphire Rapids CPUs, 2 TB DDR5, eight ConnectX-7 NICs, two BlueField-3 DPUs, 30 TB NVMe. The DGX H100 user guide is the authoritative reference. You buy the system as one SKU and NVIDIA owns the support contract.
  • HGX baseboard from an OEM. Supermicro AS-8125GS, Dell XE9680, HPE Cray XD670, Lenovo SR685a V3, Inspur NF5688G7. All use the same HGX H100 8-GPU baseboard with the same four NVSwitch ASICs. Differences are chassis, BMC, NIC slots, drive bays, and warranty terms. NVIDIA validates the baseboard, the OEM validates the system. Per the HGX platform page, the NVLink topology is identical across vendors.
  • 4-GPU NVLink-bridged PCIe. A budget alternative for inference-only deployments. Four H100 PCIe cards in a single chassis, paired by NVLink bridges in a 2+2 configuration. Bridges deliver 600 GB/s aggregate between paired GPUs, but cross-pair traffic falls back to PCIe. Acceptable for TP=2 within each bridged pair, painful for full 4-way splits.

For most sovereign deployments serving fewer than 200 concurrent users, a single 8-GPU HGX system is the correct procurement unit. Buying DGX is the right call when the institution wants a single vendor of record and is willing to pay the integration premium. Buying HGX from an OEM saves 15 to 30 percent and gives you a local hardware support partner, which matters when the rack is air-gapped in Muscat.
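
Whichever tier lands in the rack, verify the topology the OS actually sees before planning parallelism. The sketch below shells out to nvidia-smi topo -m and flags GPU pairs whose best path is PCIe rather than NVLink; the parsing is a minimal heuristic for illustration, not a hardened tool.

    # Flag GPU pairs that fall back to PCIe. In nvidia-smi's topology matrix,
    # NV<k> means k NVLink links between the pair; PIX/PHB/NODE/SYS mean the
    # traffic crosses PCIe or the host. "X" marks a GPU's own column.
    import subprocess

    out = subprocess.run(["nvidia-smi", "topo", "-m"],
                         capture_output=True, text=True, check=True).stdout
    rows = [line.split() for line in out.splitlines() if line.startswith("GPU")]
    for row in rows:
        gpu, links = row[0], row[1:1 + len(rows)]   # GPU-to-GPU columns only
        weak = [f"GPU{i}:{l}" for i, l in enumerate(links)
                if l != "X" and not l.startswith("NV")]
        print(f"{gpu}: {'all NVLink' if not weak else 'PCIe path to ' + ', '.join(weak)}")

On an HGX H100 every off-diagonal entry reads NV18; on the bridged 4-GPU tier, expect NVLink within each pair and a PCIe path across pairs.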

NVSwitch and the bandwidth math

NVSwitch is the chip that turns 8 GPUs into a fully-connected mesh. On HGX H100, four third-generation NVSwitch ASICs sit on the baseboard, each providing 64 NVLink 4 ports. The result: every GPU has 900 GB/s aggregate NVLink bandwidth, and any GPU can talk to any other GPU at 450 GB/s in each direction simultaneously. This is the property that makes 8-way tensor parallelism tractable.

Three numbers worth memorising:

  • 900 GB/s per GPU aggregate on H100 (NVLink 4). H200 keeps the same NVLink generation at the same 900 GB/s. Blackwell B200 raises this to 1,800 GB/s per GPU on NVLink 5.
  • 450 GB/s per pairwise direction between any two GPUs in an NVSwitch domain.
  • 64 GB/s per direction on PCIe Gen5 x16. The control path, not the data path.

For all-reduce on a tensor-parallel layer, the bandwidth that matters is the bisection bandwidth of the GPU group. A fully-connected NVSwitch mesh has effectively unlimited bisection at the link layer; the ceiling is the switch fabric itself, and on HGX H100 the four third-generation NVSwitch chips (3.2 TB/s full-duplex each, 12.8 TB/s combined) comfortably exceed the 7.2 TB/s the eight GPUs can inject. The NVLink Switch System used in DGX SuperPOD extends this fabric across 32 nodes (256 GPUs) with second-level NVSwitches, but most sovereign workloads do not need that scale.
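
The headroom arithmetic, using those nominal ratings (the per-chip figure is the published full-duplex rating for third-generation NVSwitch):

    # Fabric headroom in an 8-GPU NVSwitch domain, nominal figures only.
    GPUS, PER_GPU_GB_S = 8, 900          # NVLink 4 aggregate per GPU
    SWITCHES, PER_SWITCH_TB_S = 4, 3.2   # third-gen NVSwitch chips on HGX H100

    injection = GPUS * PER_GPU_GB_S / 1000    # total GPU injection, TB/s
    switching = SWITCHES * PER_SWITCH_TB_S    # total switching capacity, TB/s
    print(f"injection {injection:.1f} TB/s vs switching {switching:.1f} TB/s")
    # -> 7.2 TB/s vs 12.8 TB/s: the links, not the fabric, set the ceiling.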

Tensor-parallel vs pipeline-parallel serving

Two parallelism strategies dominate LLM serving, and the choice between them is mostly a function of interconnect:

  • Tensor parallelism (TP) splits each weight matrix across GPUs and runs an all-reduce on every layer's output. It is communication-heavy, latency-sensitive, and requires NVLink-class bandwidth. Inside a single 8-GPU NVSwitch domain, TP=8 is the default for 70B-and-larger models. The vLLM distributed-serving guide recommends keeping TP within the NVLink domain for exactly this reason.
  • Pipeline parallelism (PP) assigns whole layers to whole GPUs and passes activations between them. It is bandwidth-light and tolerant of slower interconnects, but adds pipeline-bubble latency. PP across two HGX nodes connected by 400 Gb/s InfiniBand is a common pattern for models that exceed a single 8-GPU memory budget.
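
The asymmetry is easy to quantify. A rough sketch, again assuming a Llama-70B-like shape, of the bytes that cross the interconnect per generated token under each strategy:

    # Per-token interconnect traffic, TP vs PP. Shape assumptions as before:
    # 80 layers, hidden size 8192, fp16; KV-cache and kernel details ignored.
    LAYERS, HIDDEN, BYTES = 80, 8192, 2

    tp = 8
    # TP: ring all-reduce twice per layer; each GPU moves 2*(n-1)/n of the message.
    tp_mb = LAYERS * 2 * (2 * (tp - 1) / tp) * HIDDEN * BYTES / 1e6

    # PP=2: one activation hand-off at the single pipeline boundary.
    pp_kb = HIDDEN * BYTES / 1e3

    print(f"TP=8: ~{tp_mb:.1f} MB per token across the NVLink fabric")
    print(f"PP=2: ~{pp_kb:.1f} KB per token across the node boundary")

Roughly 4.6 MB per token for TP=8 against 16 KB for PP=2, more than two orders of magnitude, which is why PP survives on InfiniBand while TP wants NVSwitch.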

The serving recipe that maximises throughput on most sovereign deployments: TP=8 inside the NVSwitch domain, PP=2 or 4 across nodes if needed, and data parallelism (independent replicas) for horizontal scale. Network design between nodes is its own discipline, covered in our note on 25 GbE and 100 GbE for AI clusters. The throughput you can expect from each topology, with realistic prompt and output lengths, is benchmarked in LLM concurrency benchmarks per GPU.
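
As a concrete sketch of that recipe: the snippet below uses vLLM's offline LLM entry point with its tensor_parallel_size and pipeline_parallel_size arguments. The model name is illustrative, and running pipeline_parallel_size > 1 across two HGX nodes assumes a recent vLLM build with a Ray cluster already joining both nodes.

    # Hypothetical launch of the TP-inside-node, PP-across-nodes recipe.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model choice
        tensor_parallel_size=8,     # keep TP inside the 8-GPU NVSwitch domain
        pipeline_parallel_size=2,   # cross the node boundary with PP only
    )
    outputs = llm.generate(["Probe prompt"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)

Data parallelism then sits above this: each replica is an independent TP/PP group behind the load balancer.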

Hosn sizes the topology to the workload, not the brochure. If your concurrent-user target is below 50 and your model is 32B or smaller, you do not need NVSwitch. If you are serving a 70B reasoning model to a ministry of 500 users with a 200 ms first-token target, you do. Email [email protected] for a one-hour briefing on which tier matches the workload, the room, and the budget.

Frequently asked

Do I need NVLink for inference, or is PCIe enough?

If the model fits in a single GPU's memory, PCIe is fine. If you need tensor parallelism across two or more GPUs to fit a 70B-class or larger model, NVLink is effectively mandatory: PCIe Gen5 x16 caps at 64 GB/s per direction, while NVLink 4 between two H100 GPUs delivers 450 GB/s per direction. Tensor parallelism issues all-reduce calls every forward pass, and that gap shows up as token latency immediately.

What is the difference between DGX and HGX?

HGX is the NVIDIA-designed 8-GPU baseboard sold to OEMs (Supermicro, Dell, Lenovo, HPE, Quanta, Inspur). DGX is NVIDIA's own integrated system built on the same baseboard plus a validated chassis, BMC, networking, and software stack. HGX gives you procurement flexibility and component-level support; DGX gives you a single throat to choke and a reference firmware bundle. The NVLink topology inside is identical.

Can I scale tensor parallelism across multiple servers?

Cross-node tensor parallelism requires either the NVLink Switch System (the 256-GPU NVL fabric in DGX SuperPOD) or extremely fast InfiniBand (NDR 400 Gb/s or better) with GPUDirect RDMA. Without one of those, switch to pipeline parallelism between nodes and keep tensor parallelism within the 8-GPU NVSwitch domain. This is the default pattern in vLLM and TensorRT-LLM.
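
If the question is whether your fabric can sustain cross-node collectives, measure it rather than trust the datasheet. A minimal probe using PyTorch's NCCL backend; the script name and message size are arbitrary choices for illustration.

    # allreduce_probe.py (hypothetical name): NCCL all-reduce bus bandwidth.
    # Launch on each node with, for example:
    #   torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d \
    #            --rdzv_endpoint=<head-node>:29500 allreduce_probe.py
    import time
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    x = torch.ones(128 * 1024 * 1024, dtype=torch.float16, device="cuda")  # 256 MB
    for _ in range(5):                      # warm-up iterations
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters, t0 = 20, time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters

    # NCCL bus-bandwidth convention for ring all-reduce: 2*(n-1)/n * bytes.
    bus = 2 * (world - 1) / world * x.numel() * x.element_size() / dt
    if rank == 0:
        print(f"all-reduce {dt * 1e3:.2f} ms/iter, bus bandwidth ~{bus / 1e9:.0f} GB/s")
    dist.destroy_process_group()

If the measured cross-node bus bandwidth sits an order of magnitude below the ~450 GB/s seen inside the NVSwitch domain, the PP-between-nodes pattern above is the right call.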

Does Hosn require a full DGX H100 to deploy?

No. The Hosn Kernel runs on a 1-GPU L40S or RTX 6000 Ada. The Hosn Tower uses a 4-GPU NVLink-bridged H100 PCIe configuration. Only the Hosn Rack tier uses an 8-GPU HGX H100 or H200 system with full NVSwitch. We size by user count, latency target, and model footprint, not by vendor enthusiasm.