LLM Concurrency Benchmarks per GPU Class

A sovereign procurement officer rarely cares whether a GPU does 989 or 1,979 FP16 TFLOPS. They care how many ministry analysts can ask the same Arabic 7B model a question at 9:00 AM Sunday morning without watching a spinner. That number, concurrent active users per GPU at a defensible first-token latency, is the only benchmark that decides whether one Hosn Tower or four are ordered. This piece pins concrete numbers to four GPU classes (H100, H200, RTX 6000 Ada, M3 Ultra) running Gemma 4 7B at Q4, and shows where concurrency walls hit. It pairs with the broader sovereign AI appliance sizing pillar.

Concurrency = users × in-flight tokens / GPU

The arithmetic that governs every LLM serving deployment, before any benchmark, is straightforward. Each active session consumes two scarce things on the GPU: a slice of memory bandwidth (to read weights and KV cache for each generated token) and a slice of HBM capacity (to hold its growing KV cache). Total throughput on the card is the bandwidth divided by the per-token weight read. The concurrent user count is that throughput divided by the per-user token rate the workload demands.

For a sovereign Arabic chatbot, the realistic per-user demand is 20 to 30 output tokens per second (faster than reading speed) at a P50 time-to-first-token under 100 ms. That last number is what a deputy minister will tolerate before complaining the system feels slow. Delivering a faster per-user rate is overkill; carrying fewer concurrent users than the card can sustain is wasted card.
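
A back-of-envelope sketch of that arithmetic, in Python. The bandwidth and weight figures are the H100 and Q4 values used later in this piece; the batching amortization factor is a loudly labeled assumption standing in for the continuous-batching gain, not a measured number:

  # Back-of-envelope concurrency sizing for one card, following the model above.
  # Every constant here is an illustrative assumption, not a measured benchmark.
  hbm_bandwidth_gb_s = 3350.0   # H100 SXM spec-sheet HBM3 bandwidth
  weights_gb = 4.5              # 7B model at Q4: bytes streamed per decode step
  per_user_tok_s = 25.0         # target per-user output rate (the 20-30 band above)

  # At batch size 1, every generated token re-reads the full weight set.
  sequential_tok_s = hbm_bandwidth_gb_s / weights_gb            # ~744 tokens/s

  # Continuous batching shares that weight read across all in-flight sequences,
  # so aggregate throughput climbs with batch size until KV cache runs out.
  # The amortization factor is an assumed stand-in for that effect.
  assumed_batch_amortization = 3.5
  aggregate_tok_s = sequential_tok_s * assumed_batch_amortization

  concurrent_users = aggregate_tok_s / per_user_tok_s           # ~100 users
  print(f"{sequential_tok_s:.0f} tok/s sequential, {aggregate_tok_s:.0f} tok/s batched, "
        f"~{concurrent_users:.0f} users at {per_user_tok_s:.0f} tok/s each")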

vLLM's continuous batching is the lever that turns raw bandwidth into concurrency. Instead of waiting for a batch of requests to all finish before starting a new one (static batching), vLLM lets new requests join the running decode loop on every step. The vLLM 0.6.0 release notes document a 2.7x throughput uplift versus prior versions purely from scheduler and KV management improvements. That headroom is what lets a single H100 serve dozens of users at the same first-token latency one user would see alone.
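
For orientation, a minimal sketch of the knobs involved, using vLLM's offline LLM engine (which runs the same continuous-batching scheduler as the server); the model path and every numeric value are illustrative assumptions, not a validated Hosn configuration:

  from vllm import LLM, SamplingParams

  # Continuous batching is on by default; these knobs bound how far it can go.
  # The model path and the numbers are placeholders, not a tuned configuration.
  llm = LLM(
      model="/models/arabic-7b-q4",   # hypothetical local model directory
      max_model_len=16384,            # the 16K per-session context budget above
      max_num_seqs=128,               # cap on sequences in flight per decode step
      gpu_memory_utilization=0.90,    # share of HBM given to weights plus KV cache
  )

  params = SamplingParams(temperature=0.2, max_tokens=512)
  # 64 requests submitted at once; the scheduler interleaves them step by step.
  outputs = llm.generate(["Summarize the attached policy memo."] * 64, params)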

Single-GPU concurrency: H100, H200, RTX 6000 Ada, M3 Ultra

The numbers below assume Gemma 4 7B at Q4_K_M quantization (about 4.5 GB of weights), 16K context budget per session, vLLM with PagedAttention, and a P50 first-token latency target of 100 ms. They are reproducible numbers from the public benchmark literature, not Hosn marketing. Independent vLLM benchmark roll-ups corroborate the H100 figures, with one production deployment reported in the AIMultiple GPU concurrency benchmark showing roughly 75 active users per H100 at 50% utilization headroom.

  • H100 80GB SXM. 80 to 120 concurrent active users at P50 TTFT 60 to 90 ms. Per-user output rate stays above 25 tokens per second. The card has roughly 60 GB of KV headroom after weights and runtime.
  • H200 141GB SXM. 160 to 220 concurrent active users at P50 TTFT 50 to 75 ms. The 141 GB HBM3e doubles the KV pool, and the 4.8 TB/s bandwidth keeps decode throughput high even at large batch sizes. Roughly 1.8x the H100 on the same model.
  • RTX 6000 Ada 48GB. 30 to 50 concurrent users at P50 TTFT 110 to 150 ms. PCIe-attached, no NVLink, decent for branch-office or research-lab Hosn Towers but not the unit of buy for a national-scale deployment.
  • Apple M3 Ultra 192GB unified. 8 to 15 concurrent users at P50 TTFT 180 to 280 ms. The unified memory architecture is generous on capacity, but the roughly 800 GB/s of bandwidth and the lack of a CUDA ecosystem leave it as a developer or air-gapped-workstation tier, not a serving tier.

The roughly 1.8x H200-to-H100 ratio is the most important takeaway for sovereign buyers. Buying H200 is not just buying a faster card. It is buying nearly twice the seats per chassis. For a Hosn Rack at 8 GPUs, that is the difference between 800 and 1,600 concurrent ministry users on a single appliance.
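
Translated into seats per chassis, using midpoints of the bands above and eight GPUs per chassis (the midpoints are a simplifying assumption):

  # Seats per 8-GPU chassis, using midpoints of the per-card bands quoted above.
  # Midpoints are a simplifying assumption; real loads land inside the bands.
  per_gpu_users = {"H100 80GB": 100, "H200 141GB": 190, "RTX 6000 Ada 48GB": 40}
  gpus_per_chassis = 8

  for card, users in per_gpu_users.items():
      print(f"{card}: ~{users * gpus_per_chassis} concurrent users per chassis")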

Multi-GPU concurrency uplift via tensor parallel

When a model exceeds a single GPU's HBM (a 70B model in FP16 is 140 GB), tensor parallelism splits the weight matrices across cards. NVIDIA's H100 Tensor Core GPU page documents NVLink at 900 GB/s, which is what makes multi-card serving practical. The uplift is sub-linear but useful; a memory-split sketch follows the list below.

  • 2x H100 on a 70B model. Roughly 60 to 90 concurrent users at the same 100 ms TTFT target. Tensor parallel cuts per-card weight load in half, so each card has bandwidth budget left for more sessions, but NVLink hops add a small per-token tax.
  • 4x H100 on a 70B model. Roughly 140 to 200 concurrent users. The scaling factor falls to about 1.7x per doubling, not 2x, because of NVLink synchronisation overhead on every decoder step.
  • 8x H200 NVL on a 70B model at 32K context. Roughly 350 to 500 concurrent users. This is the sovereign deployment unit. The vLLM and TensorRT-LLM communities consistently report serving 4,000 to 4,800 tokens per second at 100 concurrent requests in this configuration on GPT-OSS-120B class models.
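
A memory-split sketch of why tensor parallel opens up headroom: per-card weight shard and the aggregate KV pool left over as GPU count grows. The bytes-per-parameter figure is the dominant assumption, and the 6 GB runtime reserve per card is a rough guess:

  # Tensor-parallel memory split: per-card weight shard and the aggregate KV pool
  # left over. Bytes per parameter is the dominant assumption (FP16 = 2.0); the
  # 6 GB runtime reserve per card is a rough guess, not a measured figure.
  def tp_memory_split(params_b, bytes_per_param, n_gpus, hbm_gb=80.0, runtime_gb=6.0):
      weights_gb = params_b * bytes_per_param      # total weight footprint
      per_card_weights = weights_gb / n_gpus       # TP shards weights evenly
      kv_pool_gb = n_gpus * (hbm_gb - runtime_gb) - weights_gb
      return per_card_weights, kv_pool_gb

  for n in (2, 4, 8):
      shard, kv = tp_memory_split(params_b=70, bytes_per_param=2.0, n_gpus=n)
      print(f"{n}x H100, 70B FP16: {shard:.0f} GB weights per card, ~{kv:.0f} GB KV pool")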

The practical heuristic for procurement: if user count is the constraint, scale GPUs in the same Rack. If the model size is the constraint, accept that smaller models on more GPUs serve more users than a 70B model on fewer.

Where concurrency saturates: KV cache memory before compute

Almost every concurrency wall in production is a KV cache wall, not a compute wall. The KV cache on a 70B model at 128K context per session is roughly 40 GB. Eight users at full context on one H100 demand 320 GB of KV (plus weights), which is 4x the card. The Spheron KV offloading analysis documents that this exact 8-user limit is what made on-prem H100 deployments feel "small" before three-tier KV (HBM, DRAM, NVMe) became standard in vLLM. Hosn appliances ship with this stack pre-configured. Long-context Arabic workloads benefit further from the techniques covered in KV cache optimization for long-context Arabic.
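
The per-session figure falls straight out of the cache geometry. A sketch assuming a Llama-70B-style layout (80 layers, 8 grouped KV heads, head dimension 128, FP16 cache entries), which lands near the roughly 40 GB quoted above:

  # Per-session KV cache size, the quantity that sets the concurrency wall.
  # Geometry below is a Llama-70B-style assumption: 80 layers, 8 grouped KV
  # heads, head dimension 128, FP16 cache entries (2 bytes each).
  def kv_cache_gb(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_val=2):
      per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val  # K and V
      return seq_len * per_token / 1e9

  per_session = kv_cache_gb(128 * 1024)
  print(f"128K-context session: ~{per_session:.0f} GB of KV")          # ~43 GB
  print(f"eight such sessions : ~{8 * per_session:.0f} GB vs 80 GB of HBM on one H100")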

The compute side rarely saturates. Hopper-class FP8 tensor units have far more arithmetic capacity than memory bandwidth can feed. The memory bandwidth deep dive explains why decode throughput, not GEMM throughput, sets the ceiling. Plan capacity around HBM, not TFLOPS, and add 20 to 30 percent KV headroom for prefix caching deduplication to actually pay off. Mu'een, Oman's national shared-AI platform, addresses different sizing tradeoffs at the country tier; institutional appliance sizing follows the per-GPU arithmetic above.
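
A quick check of that claim: at the bandwidth-bound decode rate, the arithmetic actually performed is a sliver of the card's peak. The spec figures below are H100 SXM numbers, and the 7B-at-Q4 assumptions match the rest of this piece:

  # Why decode is memory-bound: the FLOPs used at the bandwidth-bound decode
  # rate are a sliver of the card's peak. Spec figures are H100 SXM values;
  # model assumptions match the 7B-at-Q4 setup used throughout.
  hbm_bandwidth_gb_s = 3350.0
  fp16_peak_tflops = 989.0          # dense FP16 tensor-core peak
  weights_gb = 4.5                  # Q4 7B weights streamed per decode step
  flops_per_token = 2 * 7e9         # ~2 FLOPs per parameter per generated token

  decode_tok_s = hbm_bandwidth_gb_s / weights_gb                     # ~744 tok/s
  used_tflops = decode_tok_s * flops_per_token / 1e12                # ~10 TFLOPS
  print(f"~{decode_tok_s:.0f} tok/s uses ~{used_tflops:.0f} TFLOPS, "
        f"{100 * used_tflops / fp16_peak_tflops:.1f}% of the {fp16_peak_tflops:.0f} peak")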

Email [email protected] for a one-hour briefing on sizing your Hosn appliance for the user count and Arabic context window your institution actually needs. Bring the workload shape (concurrent analysts, average context, peak hours), and we will translate it into a defensible GPU class and chassis count.

Frequently asked

How many concurrent users can a single H100 serve on a 7B Arabic model?

On Gemma 4 7B at Q4 with vLLM continuous batching and a 16K context budget per session, a single H100 80GB sustains roughly 80 to 120 concurrent active users at a P50 first-token latency under 100 ms. The ceiling is set by KV cache headroom rather than compute.

Does an H200 actually serve more users than an H100, or only faster ones?

Both. The 141 GB of HBM3e on the H200 leaves much more room for KV cache, so concurrency rises by roughly 1.7x to 2.0x for the same model and context length, while per-token latency drops by 20 to 30 percent due to the 4.8 TB/s bandwidth.

Is RTX 6000 Ada viable for institutional concurrency?

Yes for 7B to 13B models at Q4 or Q5. Expect 30 to 50 concurrent users on a single 48 GB card with first-token latency between 110 and 150 ms. Above that scale, the H100 or H200 becomes the better unit of buy.

When does concurrency saturate on a given GPU?

Almost always at the KV cache, not at compute. As batch size grows, decode throughput plateaus once aggregate KV memory consumes the available HBM. Tensor parallelism, tighter per-session context budgets, or KV offloading are the usual answers.