100GbE vs 25GbE for an AI Cluster Backbone
Most procurement specs for sovereign AI clusters quietly default to 100GbE on the backbone. Sometimes that is exactly right. Sometimes it doubles the switch budget for a workload that would happily run on 25GbE. The honest answer depends on whether GPUs on different physical nodes ever need to talk to each other inside the same operation, and how lossless that conversation has to be. This guide walks an Omani buyer through the four questions that actually decide the fabric, before any vendor quote is signed.
1. Where Ethernet bandwidth lives in an AI cluster
An AI cluster has three traffic planes that compete for the same NICs:
- Host-to-host collective traffic. Distributed training and multi-node inference exchange tensors via NCCL all-reduce, all-gather, and reduce-scatter operations. This traffic is bursty, latency-sensitive, and the most demanding on the fabric.
- Storage traffic. Model weights, RAG vector indexes, training shards, and checkpoint writes pass between GPU nodes and the storage tier. A 70B-parameter model checkpoint runs to hundreds of gigabytes; saving it over 25GbE takes minutes, over 100GbE tens of seconds.
- Ingress and management. User requests, API calls, log shipping, monitoring, and operator SSH. This rarely exceeds a few gigabits even at peak.
The first plane is the one that pushes 25GbE to its limit. NVIDIA's NCCL library is what most stacks use for those collectives, and on a 100GbE-class link with RoCE in a non-switched mesh it typically reaches about 60 to 80 percent of the 12.5 GB/s line rate, around 10 GB/s on two nodes and 7 to 8 GB/s on three. That is the realistic ceiling buyers should plan against, not the marketing number on the switch datasheet.
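To make those figures concrete, the short sketch below turns link rates into effective bandwidth and checkpoint drain time. The 300 GB checkpoint size and the 60 to 80 percent efficiency range are illustrative assumptions for a 70B-class model, not measurements from any specific cluster.

```python
# Back-of-envelope fabric arithmetic for the figures above.
# The checkpoint size and efficiency range are illustrative assumptions.

LINKS_GBPS = {"25GbE": 25.0, "100GbE": 100.0}
CHECKPOINT_GB = 300.0            # hypothetical 70B-class checkpoint, weights plus some optimizer state
NCCL_EFFICIENCY = (0.60, 0.80)   # fraction of line rate NCCL typically sustains over RoCE

for name, gbps in LINKS_GBPS.items():
    line_rate_gbs = gbps / 8.0                            # Gbit/s -> GB/s
    lo, hi = (line_rate_gbs * e for e in NCCL_EFFICIENCY)
    drain_s = CHECKPOINT_GB / line_rate_gbs               # ideal storage-plane drain time
    print(f"{name}: line rate {line_rate_gbs:.1f} GB/s, "
          f"effective NCCL {lo:.1f}-{hi:.1f} GB/s, "
          f"{CHECKPOINT_GB:.0f} GB checkpoint drains in ~{drain_s:.0f} s")
```

On those assumptions the same checkpoint takes roughly a minute and a half over 25GbE and under half a minute over 100GbE, which is the gap the storage-traffic bullet above is pointing at.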
2. When 25GbE is enough
For a large class of sovereign AI workloads, 25GbE is not just adequate, it is the right answer.
- Single-node serving. A Hosn Tower or Hosn Kernel running Gemma 4 or Qwen 3.6 for an institutional chat assistant keeps all GPU traffic inside the chassis over NVLink or PCIe. The Ethernet port only carries token streams, RAG queries, and admin traffic. Even at a few hundred concurrent users this rarely tops 5 to 10 Gbps of sustained throughput; a back-of-envelope estimate at the end of this section shows why.
- Modest concurrency inference farms. Two or three independent inference nodes serving distinct departments, each handling its own users, do not share tensors. They share storage and front-end load balancers. 25GbE per node with a single 100GbE uplink to the spine is plenty.
- RAG over local NFS. A bilingual Arabic-English RAG index of 50 to 200 GB sitting on local NFS, queried by a single inference node, fits inside 25GbE comfortably. Embedding lookups are small, and the vector store's hot working set lives in RAM.
- Audit and analytics workloads. Batch summarisation of correspondence archives, classified document workflows, or parliamentary research, where latency targets are seconds, not milliseconds.
For all these cases, 25GbE plus a clean lossless config keeps the fabric simple, the switch budget low, and the operator burden small.
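As a sanity check on the single-node serving case, the sketch below estimates front-end traffic for one inference appliance. Every figure in it, session count, token rate, bytes per token, RAG fetch size, is an illustrative assumption rather than a measured profile of any deployment.

```python
# Rough front-end bandwidth estimate for a single-node inference appliance.
# All parameters are illustrative assumptions, not measurements.

CONCURRENT_SESSIONS = 300     # active streaming chat sessions
TOKENS_PER_SECOND   = 40      # output tokens per second per session
BYTES_PER_TOKEN     = 8       # UTF-8 text plus SSE/JSON framing on the wire
REQUESTS_PER_SECOND = 100     # new requests hitting the node each second
RAG_FETCH_MB        = 1.0     # hypothetical documents pulled from NFS per request

token_stream_bps = CONCURRENT_SESSIONS * TOKENS_PER_SECOND * BYTES_PER_TOKEN * 8
rag_fetch_bps    = REQUESTS_PER_SECOND * RAG_FETCH_MB * 1e6 * 8

total_gbps = (token_stream_bps + rag_fetch_bps) / 1e9
print(f"token streams: {token_stream_bps / 1e6:.1f} Mbit/s")
print(f"RAG document fetches: {rag_fetch_bps / 1e9:.2f} Gbit/s")
print(f"total front-end traffic: {total_gbps:.2f} Gbit/s on a 25 Gbit/s port")
```

Even with generous assumptions the total lands around a gigabit per second. The 5 to 10 Gbps figure quoted above is already a conservative ceiling, and 25GbE still leaves several times that in headroom.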
3. When 100GbE pays off
The moment GPUs on different physical nodes have to participate in the same forward or backward pass, the calculus flips. Multi-node tensor parallelism splits a single layer of weights across machines. Pipeline parallelism splits stages of the model across machines. Both require synchronous, low-latency, high-bandwidth exchanges every step.
- Distributed fine-tuning. LoRA on a single node is fine over 25GbE. Full fine-tuning of a 70B-parameter model across four or eight nodes, with FP8 or BF16 gradients flying back and forth, will saturate 25GbE inside the first few iterations; a rough per-step traffic estimate follows this list.
- Multi-node tensor parallel serving. Running a model that does not fit in one node's GPU memory, for example a quantised 200B+ class model split across two H100 nodes, is the textbook case for 100GbE plus RoCE v2.
- Very large RAG. National-archive scale indexes that span multiple storage nodes, with cross-node sharding, benefit from 100GbE on the storage plane even when the inference itself is single-node.
- Frequent checkpointing. Training runs that checkpoint every few minutes need the storage plane to drain in seconds, not minutes, to avoid GPU idle time.
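The sketch below puts a rough number on the distributed fine-tuning bullet, using the standard ring all-reduce traffic model. It assumes plain data parallelism with BF16 gradients and ignores compute overlap, gradient accumulation, and sharded optimizers, so it is a worst-case illustration rather than a prediction for any particular training stack.

```python
# Crude per-step gradient traffic for data-parallel full fine-tuning,
# using the ring all-reduce cost model. Illustrative assumptions only.

PARAMS         = 70e9     # 70B-parameter model
BYTES_PER_GRAD = 2        # BF16 gradients
NODES          = 8        # one rank per node for simplicity
EFFICIENCY     = 0.65     # fraction of line rate the collective actually sustains

grad_bytes = PARAMS * BYTES_PER_GRAD
# Ring all-reduce: each rank sends and receives 2 * (N - 1) / N of the buffer.
wire_bytes = 2 * (NODES - 1) / NODES * grad_bytes

for link_gbps in (25, 100):
    effective_bps = link_gbps * 1e9 * EFFICIENCY
    comm_seconds = wire_bytes * 8 / effective_bps
    print(f"{link_gbps}GbE: ~{wire_bytes / 1e9:.0f} GB on the wire per step, "
          f"~{comm_seconds:.0f} s of pure communication time")
```

In practice overlap with the backward pass and gradient accumulation hide much of this, but the four-to-one gap between the two fabrics does not change, and roughly two minutes of exposed communication per step on 25GbE is what saturation looks like.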
The pillar guide on sovereign AI rack power, cooling, and air-gap covers how the fabric choice ripples into PDU sizing, cable trays, and the air-gap perimeter design. Network choice is rarely just a network choice.
4. RDMA, RoCE v2, and InfiniBand briefly
Once a buyer commits to 100GbE for collectives, the next question is what runs on top.
- Plain TCP. Works, but burns CPU and adds latency. Acceptable for storage and ingress, not for collectives.
- RoCE v2. RDMA over Converged Ethernet. The mainstream choice for Ethernet AI clusters. It needs lossless Ethernet via Priority Flow Control and ECN configured per traffic class, otherwise the RDMA queue pairs stall under load. Cisco, Arista, and Mellanox/NVIDIA switches all support it. Meta's at-scale paper documents how RoCE underpins their distributed training fabric, and the same playbook scales down to a sovereign rack; a smoke-test sketch follows this list.
- InfiniBand. A parallel non-Ethernet fabric. Slightly lower latency, slightly higher bandwidth in 2026, but it is a second physical network with its own cabling, switches, and operations skill. For Omani institutions standardising on Ethernet, the operability cost rarely justifies it outside very large training clusters.
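Once the RoCE fabric is cabled and the lossless classes are configured, it is worth a quick end-to-end check that NCCL is actually using RDMA rather than quietly falling back to TCP. The sketch below is a minimal smoke test assuming a PyTorch/NCCL stack; the HCA name, interface name, and GID index are placeholders that must match the site's actual NIC configuration, and serious measurement belongs with NVIDIA's nccl-tests suite rather than this script.

```python
# Minimal NCCL-over-RoCE smoke test, assuming a PyTorch stack.
# Launch one process per node with torchrun, for example:
#   torchrun --nnodes=2 --nproc_per_node=1 \
#            --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 roce_smoke_test.py

import os
import time

import torch
import torch.distributed as dist

# NCCL reads these when the communicator is created, so set them before
# init_process_group. The names below are placeholders for this sketch.
os.environ.setdefault("NCCL_IB_DISABLE", "0")        # allow the IB/RoCE transport
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0")       # placeholder HCA name
os.environ.setdefault("NCCL_IB_GID_INDEX", "3")      # RoCE v2 GID index, site-specific
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth2")  # placeholder 100GbE interface
os.environ.setdefault("NCCL_DEBUG", "INFO")          # logs which transport NCCL picked

def main() -> None:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    payload = torch.ones(64 * 1024 * 1024, device="cuda")  # 256 MB of fp32
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(20):
        dist.all_reduce(payload)
    torch.cuda.synchronize()
    elapsed = time.time() - start

    if dist.get_rank() == 0:
        print(f"20 all-reduces over a 256 MB buffer took {elapsed:.2f} s")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With NCCL_DEBUG set to INFO, the startup log shows whether the IB/RoCE transport was selected; if it reports falling back to sockets, the PFC and ECN configuration on the switches is the first place to look.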
For most Hosn deployments the recommended baseline is 25GbE per node with a 100GbE spine for single-rack inference, and 100GbE per node with RoCE v2 lossless for any workload that crosses node boundaries. Pricing for the cluster, switching, and lossless tuning is by quotation.
If you are sizing a sovereign AI cluster and want a second opinion before the procurement spec freezes, email [email protected] for a one-hour briefing. We will walk through your workload mix, the fabric implications, and the specific switch and NIC SKUs that fit your air-gap rules.
Sources: NVIDIA NCCL 2.27 release notes, RDMA over Ethernet for Distributed AI Training at Meta Scale (SIGCOMM 2024), RDMA over Converged Ethernet specification overview.
Frequently asked
Is 25GbE enough for a single-node AI inference appliance?
Yes. A single-node Hosn Tower running Gemma 4 or Qwen 3.6 for inference rarely saturates 25GbE on the front end. Token streaming, RAG queries against a local NFS share, and a few hundred concurrent users sit comfortably inside 5 to 10 Gbps of real traffic. 25GbE leaves plenty of headroom and keeps switch and cabling cost low.
When does an AI cluster actually need 100GbE?
Whenever GPUs on different physical nodes have to exchange tensors during the same operation. That includes multi-node tensor or pipeline parallelism, distributed fine-tuning, very large RAG indexes that span nodes, and shared parameter servers. Once the all-reduce step crosses a node boundary, 25GbE becomes the bottleneck and 100GbE plus RoCE v2 is the right floor.
Should we pick RoCE v2 or InfiniBand for a sovereign AI cluster?
RoCE v2 over Ethernet is usually the better sovereign choice. It runs on standard data-centre switches, is supported by every major vendor, and integrates cleanly with the rest of an Omani institution's network. InfiniBand is faster on paper for very large training jobs, but it adds a parallel fabric, specialised cabling, and a smaller pool of local engineers. For inference and single-rack training, RoCE v2 wins on operability.
Does 100GbE require lossless Ethernet?
For RoCE v2 to reach its potential, yes. The fabric needs Priority Flow Control and Explicit Congestion Notification configured per priority class, otherwise dropped packets cause RDMA stalls and NCCL slowdowns. A standard 100GbE link without lossless tuning still beats 25GbE for raw throughput, but it will not deliver the RDMA latency profile that distributed training expects.