Sizing a Sovereign AI Appliance: Concurrent Users, Latency, and Throughput

A sovereign buyer signs off on a "two-hundred-user AI appliance" and the system falls over the day a cabinet briefing arrives, because the procurement document never separated named accounts from peak active sessions, never named a P99 latency target, and never accounted for the way a 256K context window inflates the KV cache per request. This guide turns the conversation from "how big a box do we need" into the three or four numbers a sovereign appliance must actually be sized against. It is the sizing companion to the on-premise AI pillar guide.

The three numbers every sovereign buyer should ask for

Most sizing arguments dissolve once three numbers are written down. Demand them from any vendor, and demand them from your own users before you write the RFP.

  • Peak concurrent active requests. Not user accounts, not registered seats, not "everyone in the directorate". The count of requests in flight at the busiest fifteen-minute window of the busiest day of the month. A ministry of 800 staff might peak at 60 concurrent active requests. A treasury desk of 12 might peak at 12. Size against the peak.
  • P50 first-token latency. The median time, in milliseconds, from "user hits enter" to "first visible token streams back". This sets the felt responsiveness of the system. Below 200 ms feels instant. Above 500 ms feels sluggish. Above 1.5 seconds feels broken.
  • P99 end-to-end latency. The time, in seconds, that the worst one in a hundred requests takes from start to finish. P99 is what determines whether a cabinet briefing or an executive dashboard remains usable under load. P50 lies. P99 tells the truth.

A fourth number is sometimes useful: tokens per second per user during steady streaming. A model that streams at 40 tokens per second feels fluent for chat. At 15 tokens per second, users start re-reading. At under 10, they start switching tabs. Streaming speed is a separate axis from first-token latency, and a well-tuned appliance optimises both.

If a vendor cannot produce these four numbers for your exact workload, the proposal is not a sizing study, it is a wish list.
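
If the numbers have to come from your own logs rather than from a vendor, they are a few lines of analysis. A minimal sketch, assuming each request is logged as a (start, first_token, end, tokens_out) tuple in seconds; the record layout and helper names are illustrative choices, not a standard schema.

```python
def peak_concurrent(requests):
    """Maximum number of requests in flight at any instant."""
    events = []
    for start, _, end, _ in requests:
        events += [(start, +1), (end, -1)]
    live = peak = 0
    for _, delta in sorted(events):   # ends sort before starts at equal timestamps
        live += delta
        peak = max(peak, live)
    return peak

def percentile(samples, p):
    """Nearest-rank percentile; good enough for sizing arithmetic."""
    ordered = sorted(samples)
    k = min(len(ordered) - 1, max(0, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def four_numbers(requests):
    """The four sizing numbers, extracted from a request log."""
    first_token_ms = [(ft - s) * 1000 for s, ft, _, _ in requests]
    end_to_end_s = [e - s for s, _, e, _ in requests]
    stream_tps = [n / max(e - ft, 1e-6) for _, ft, e, n in requests]
    return {
        "peak_concurrent_requests": peak_concurrent(requests),
        "p50_first_token_ms": percentile(first_token_ms, 50),
        "p99_end_to_end_s": percentile(end_to_end_s, 99),
        "median_stream_tok_per_s": percentile(stream_tps, 50),
    }
```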

The KV cache trap

The most common sizing mistake on sovereign appliances is treating GPU memory as "model weights plus a bit of overhead". In a long-context world, the KV cache, not the weights, dominates.

Every transformer layer caches the keys and values of every prior token in the sequence so that the next-token prediction does not redo all of the prior attention work. The size of that cache, in bytes, is roughly:

2 × num_layers × hidden_size × context_length × bytes_per_element × num_active_sessions

Plug in concrete numbers. A 27B-class model with 64 layers, a hidden size of 5120, a 256K context, and FP16 (2 bytes per element) needs roughly 335 GB of KV cache for a single fully loaded request, more than the memory of any GPU on the market, including the H200's 141 GB. The weights themselves are 54 GB in FP16: the cache is six times the model, for one request. Grouped-query attention, which most current models use, shrinks the per-token cache several-fold, but at long context the cache, not the weights, still sets the memory ceiling.
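
A back-of-envelope calculator makes the trap concrete. The sketch below simply evaluates the formula above; the parameter values, including the grouped-query variant, are illustrative assumptions rather than any specific model's card.

```python
def kv_cache_gb(num_layers, kv_width, context_tokens,
                bytes_per_element=2, active_sessions=1):
    """2 (K and V) x layers x per-layer KV width x tokens x bytes x sessions, in GB.
    kv_width equals the hidden size for classic multi-head attention, or
    num_kv_heads * head_dim under grouped-query attention."""
    return (2 * num_layers * kv_width * context_tokens
            * bytes_per_element * active_sessions) / 1e9

# Full multi-head attention, FP16 cache, 256K context: ~335 GB for one request.
print(kv_cache_gb(num_layers=64, kv_width=5120, context_tokens=256_000))

# Grouped-query attention (say 8 KV heads x 128 head dim) with an FP8 cache:
# the same request drops to roughly 34 GB, and ten of them still overflow an H200.
print(kv_cache_gb(num_layers=64, kv_width=1024, context_tokens=256_000,
                  bytes_per_element=1))
```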

This is the "trap" buyers walk into when they take a glossy 256K-context spec at face value. The model can technically read 256K tokens. The appliance cannot serve 100 users doing it simultaneously without either segmenting the cache, paging it, or refusing the request.

Three mitigations apply, and a sovereign appliance should use all three; a configuration sketch follows the list.

  • PagedAttention. The vLLM team's PagedAttention paper introduced an OS-style paged virtual memory model for the KV cache. Instead of one contiguous slab per session, the cache lives in fixed-size blocks shared across sessions, eliminating the fragmentation that caused earlier servers to waste memory. Real-world serving sees a two-to-four-times throughput uplift from this single technique.
  • Prefix caching. When many requests share a system prompt, retrieved passages, or a common document, the KV cache for the shared prefix is computed once and reused. On a sovereign workload where every minister sees the same retrieved memo, prefix caching can cut effective cache load by an order of magnitude.
  • Quantised cache. Storing keys and values in INT8 or FP8 instead of FP16 halves or quarters the cache footprint at small quality cost. Modern serving stacks expose this as a flag.
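
In practice the three mitigations are serving-stack configuration rather than extra silicon. A minimal sketch, assuming a vLLM-style stack; the model path is a placeholder and exact flag names can differ between releases.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/local-27b",       # placeholder path to locally stored weights
    max_model_len=32_768,            # cap context at what the workload actually needs
    kv_cache_dtype="fp8",            # quantised KV cache: roughly half the FP16 footprint
    enable_prefix_caching=True,      # reuse cache for shared system prompts and memos
    gpu_memory_utilization=0.90,     # pool managed in PagedAttention blocks
)

out = llm.generate(
    ["Summarise the attached directive in five bullet points."],
    SamplingParams(max_tokens=400, temperature=0.2),
)
print(out[0].outputs[0].text)
```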

The right buying question is not "how big is the model?" It is "what is the steady-state KV cache footprint at my target concurrency, and what fraction of GPU memory remains for everything else?"

Throughput math: tokens per second to users supported

Every GPU has a measurable peak throughput on a given model, expressed in tokens per second across the whole device. Translating that number into "users supported" is straightforward, with one caveat that bites unprepared buyers.

Start with the device throughput at your target batch size. NVIDIA's published figures for the H100 SXM and the H200 give a reference point. Independent benchmarks from the vLLM and Hugging Face TGI projects fill in real-world numbers for specific models. A 27B dense model running on a single H200 with continuous batching reaches roughly 6,000 to 9,000 tokens per second aggregate at batch 64, depending on context distribution.

Now divide. If a target user expects 30 tokens per second of streaming output and the device produces 8,000 tokens per second in total, the device can in theory serve up to 8,000 / 30, roughly 266 concurrent streaming users. In practice, three deductions apply:

  • Idle and prefill cost. Users do not stream continuously. They prefill a prompt, wait, read the answer, and prefill again. A practical "active session" multiplier of 0.4 to 0.6 against the theoretical maximum is realistic.
  • Tail latency reserve. Running a GPU at 95 percent utilisation produces unacceptable P99 latency because every burst lands in a queue. Reserving 25 to 30 percent headroom keeps tails inside budget.
  • KV cache capacity. Even if compute can serve 266 users, the KV cache may only fit 80 of them at full context. Sizing has to clear the lower of the two ceilings.

The disciplined buyer sizes by both compute throughput and cache capacity, takes the lower of the two, and applies a safety factor. The undisciplined buyer multiplies vendor peak numbers and is surprised when the system stalls.
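
The two-ceiling arithmetic fits in a few lines. Every input below is an assumption standing in for a benchmark number; the function simply takes the lower of the compute ceiling and the cache ceiling after headroom.

```python
def supported_users(device_tokens_per_s,    # aggregate decode throughput (benchmark)
                    per_user_tokens_per_s,  # streaming target per user
                    session_multiplier,     # the 0.4-0.6 deduction for idle and prefill
                    headroom,               # fraction reserved for tail latency
                    cache_gb_available,     # GPU memory left after weights
                    cache_gb_per_session):  # KV footprint at the typical context
    compute_ceiling = (device_tokens_per_s * (1 - headroom)
                       / per_user_tokens_per_s) * session_multiplier
    cache_ceiling = cache_gb_available / cache_gb_per_session
    return int(min(compute_ceiling, cache_ceiling))

# The worked example above: 8,000 tok/s device, 30 tok/s per user, 0.5 multiplier,
# 30% headroom, and an assumed 96 GB of free cache at ~1.2 GB per session.
print(supported_users(8_000, 30, 0.5, 0.30, 96, 1.2))   # 80
```

With these assumed inputs the cache, not compute, binds first, which is exactly the pattern the third deduction describes.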

Latency budget by use-case

Not every workload needs the same latency. Sovereign appliances usually serve a mix, and a healthy sizing exercise allocates a different budget to each.

  • Conversational chat. First-token P50 under 200 ms, P99 under 500 ms. Streaming speed above 30 tokens per second per user. End-to-end response on a 400-token answer under 15 seconds. Users feel real-time.
  • Document analysis. First-token under 1.5 seconds, with a few seconds of visible thinking while a 50-page input prefills; a 2,000-token answer then streams out over the following minute. Users expect the pause and tolerate it.
  • Agentic workflows. Multi-step tool use can run 30 seconds to a few minutes total. Per-step latency must still be tight, because steps multiply, but the user is not staring at the screen waiting for each token.
  • Batch. Overnight summarisation of a regulatory archive, classification of a year of correspondence, embedding a corpus for retrieval. Hours are acceptable. Schedule these out of the interactive window.

Do not size the entire appliance against the chat budget. Mixed workloads with explicit queue priorities serve more users on the same hardware than a uniform low-latency contract that nobody actually needs.
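
One way to keep the mixed contract honest is to write each lane's budget down as data and check every benchmark run against it. The lane names, thresholds, and record layout below are illustrative choices that mirror the list above, not a product schema.

```python
LANES = {
    "chat":     {"p50_first_token_ms": 200, "p99_first_token_ms": 500,
                 "min_stream_tok_per_s": 30},
    "document": {"p50_first_token_ms": 1500},
    "agentic":  {"p99_step_s": 10},   # budget per tool-call step (assumed figure)
    "batch":    {},                   # no interactive budget; schedule overnight
}

def within_budget(lane, measured):
    """True if every measured metric meets the lane's budget."""
    for key, limit in LANES[lane].items():
        value = measured.get(key)
        if value is None:
            return False              # a missing measurement fails the check
        if key.startswith("min_"):
            if value < limit:
                return False
        elif value > limit:
            return False
    return True

print(within_budget("chat", {"p50_first_token_ms": 180,
                             "p99_first_token_ms": 420,
                             "min_stream_tok_per_s": 34}))   # True
```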

Sizing recipes for 50, 200, 500, and 2,000 concurrent users

Concrete recipes, calibrated against open-weight models in the 27B to 70B range with retrieval-augmented generation against an institution's own document store. Each recipe assumes continuous batching, paged attention, and prefix caching enabled by default.

  • 50 concurrent active sessions, 27B model, 32K typical context. A single workstation-class node with 48 to 96 GB of accelerator memory handles this comfortably. The Hosn Tower configuration (single NVIDIA RTX 6000 Ada or 6000 Blackwell, 48 to 96 GB of GPU memory, high-clock host) is the natural fit. Expect P50 first-token under 250 ms and 35 tokens per second per user. See the H100, H200, RTX 6000, and Mac Studio comparison for accelerator trade-offs.
  • 200 concurrent active sessions, 27B to 70B model, 64K typical context. One H100 SXM 80 GB or one H200 141 GB with continuous batching, or two RTX 6000 Blackwell cards in tensor parallel. The Rack starts here. P50 first-token under 350 ms achievable. Memory pressure is real, so cache quantisation and prefix sharing earn their keep.
  • 500 concurrent active sessions, 70B model, 64K to 128K context. A 4U Rack with two to four H200 accelerators in tensor parallel, NVMe-backed cache spillover, and a queue-aware load balancer. Plan for 250 ms first-token P50 and 25 tokens per second per user under load. Add a hot-spare node if uptime SLA is strict.
  • 2,000 concurrent active sessions, multi-model, 128K context. An 8U Rack or a small cluster: eight H200 cards split across two physical nodes, dedicated retrieval and embedding servers, autoscaled batch lane on a separate accelerator pool, and replicated model weights. This is ministry-scale, and it requires storage, networking, power, and cooling planning that goes beyond the GPU. See the rack power, cooling, and air-gap guide for the supporting infrastructure.

These are recipes, not prescriptions. The real workload will deviate. The discipline is to write the four numbers (peak concurrent, P50, P99, tokens per second per user) on the proposal cover page and revisit them after every benchmark run.

Continuous batching changes the equation

A static batching server processes requests in fixed batches: it gathers N prompts, runs them together to completion, then accepts the next N. This works for offline jobs and almost nothing else. The moment users mix short and long requests, the batch waits for the slowest tail, the GPU sits idle on the short ones, and effective throughput collapses to a fraction of the peak.

Continuous batching, sometimes called dynamic or in-flight batching, schedules at the token level. At every step, the scheduler decides which active requests to advance and which new requests to admit, with no requirement that all requests in the batch finish together. The technique was popularised by the Anyscale continuous batching primer and is now baked into vLLM, Hugging Face's Text Generation Inference, and NVIDIA's Triton Inference Server with TensorRT-LLM, with reported throughput gains of two to twenty-three times over naive serving.
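
The difference is easy to see in a toy simulation. The sketch below is illustrative only: it counts scheduler steps for a mixed queue of short and long requests, and ignores prefill cost, KV memory, and priorities.

```python
import random

random.seed(7)
lengths = [random.choice([20, 40, 80, 600]) for _ in range(64)]   # mix of short and long outputs
BATCH = 8   # the device advances at most 8 requests by one token per step

def static_batching(lengths, batch):
    """Admit a fixed batch, run it to completion, then admit the next."""
    steps = 0
    for i in range(0, len(lengths), batch):
        steps += max(lengths[i:i + batch])   # the batch waits for its slowest member
    return steps

def continuous_batching(lengths, batch):
    """Admit a new request the moment any slot frees up."""
    queue, active, steps = list(lengths), [], 0
    while queue or active:
        while queue and len(active) < batch:       # refill freed slots immediately
            active.append(queue.pop(0))
        steps += 1
        active = [n - 1 for n in active if n > 1]  # each active request emits one token
    return steps

total = sum(lengths)
for name, fn in (("static", static_batching), ("continuous", continuous_batching)):
    steps = fn(lengths, BATCH)
    print(f"{name:>10}: {steps} steps, {total / steps:.1f} effective tokens per step")
```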

For a sovereign appliance, this changes the buying conversation in two ways. First, throughput numbers from a vendor proposal are only meaningful when continuous batching is on. Demand the benchmark. Second, the same hardware with batching tuned correctly often serves the next concurrency tier without a hardware upgrade, which means the right sizing engagement starts with software tuning before it specifies more silicon.

Burst versus steady-state workloads

Two sovereign workloads can have identical daily token volume and require different machines. A back-office analyst team of 200 produces a smooth load curve through the working day. A cabinet of 30 ministers and 200 aides clusters demand into the fifteen minutes before a session opens, then goes quiet for hours. Same total tokens, very different appliance.
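
The arithmetic behind that claim is short. The daily volume and window figures below are illustrative assumptions, chosen only to show how far peak demand can diverge at equal totals.

```python
DAILY_TOKENS = 50_000_000                       # assumed tokens generated per working day

# Steady-state analyst desk: load spread evenly over an 8-hour day.
steady_peak_tps = DAILY_TOKENS / (8 * 3600)

# Burst-heavy cabinet: assume 60% of the day's tokens land in one 15-minute window.
burst_peak_tps = (0.60 * DAILY_TOKENS) / (15 * 60)

print(f"steady peak: {steady_peak_tps:,.0f} tokens/s")           # ~1,700 tokens/s
print(f"burst peak:  {burst_peak_tps:,.0f} tokens/s")            # ~33,000 tokens/s
print(f"ratio:       {burst_peak_tps / steady_peak_tps:.0f}x")   # ~19x for the same daily volume
```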

Burst-heavy institutions need:

  • One tier of headroom above their average load, sized to the burst window, not the daily average.
  • A graceful queue with explicit user feedback ("estimated 12 seconds") rather than a hard rejection or a silent stall.
  • Priority lanes so an executive briefing does not sit behind a research analyst's batch summary job.
  • Pre-warmed caches before scheduled events. If the cabinet always reads the same brief at 9 a.m., pre-load the prefix cache at 8:55.

Steady-state institutions can run closer to peak utilisation, often at 70 to 80 percent of capacity, without breaking P99 budgets. They benefit more from raw throughput tuning than from headroom.

The sizing exercise should classify each user group as burst or steady, then size each group's lane separately, with a small shared pool to soak up cross-group spillover.

The decision matrix: Kernel, Tower, or Rack

A sovereign appliance graduates from one tier to the next at clear thresholds. The matrix below is what a Hosn scoping briefing produces in the first hour. Per-GPU concurrency benchmarks back up each row.

  • Kernel (workstation tier). Up to about 4 concurrent active sessions, single user or small cell, 27B-class model, 32K context typical. Right for a minister's chief of staff, an intelligence cell, a pilot. Apple M3 Ultra Mac Studio with 256 GB unified memory is the reference build. Streaming feels personal and instant. Not the answer when a directorate joins.
  • Tower (departmental tier). 5 to roughly 200 concurrent active sessions, 27B to 70B model, 64K context. Single RTX 6000 Ada or Blackwell, or one H100 80 GB, with continuous batching and PagedAttention. Right for a directorate, a regulatory desk, a treasury team. Most Omani sovereign workloads land here.
  • Rack (institutional tier). 200 to several thousand concurrent active sessions, multi-model, fine-tuning capacity, redundancy. Two to eight H100 or H200 accelerators, dedicated retrieval and embedding nodes, hot spares, full air-gap network. Right for a ministry, a central bank, a sovereign fund.
  • Above Rack. National platforms, the population of an entire country, language-and-policy fine-tuning at scale. This is the territory of Mu'een, Oman's national shared-AI platform, and is a different procurement problem from a single institution's appliance.

The right tier is the smallest tier that covers the peak workload with one tier of headroom. Buying the Rack for a Tower workload is a procurement mistake, not a security upgrade. Buying the Tower for a Rack workload is the tail-latency disaster every sovereign team wants to avoid.
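
The matrix reduces to a few lines of selection logic. The ceilings below are the approximate thresholds quoted in this guide, and the "one tier of headroom" rule is interpreted here as stepping up whenever the peak sits within 20 percent of a tier's ceiling; both are illustrative, not a Hosn pricing rule.

```python
TIERS = [("Kernel", 4), ("Tower", 200), ("Rack", 2000)]   # (name, approx. session ceiling)

def pick_tier(peak_concurrent_sessions):
    """Smallest tier whose ceiling covers the peak, with one tier of headroom."""
    for i, (name, ceiling) in enumerate(TIERS):
        if peak_concurrent_sessions <= ceiling:
            # Peaks close to a ceiling step up a tier rather than run it flat out.
            if peak_concurrent_sessions > 0.8 * ceiling and i + 1 < len(TIERS):
                return TIERS[i + 1][0]
            return name
    return "Above Rack: national-platform territory"

print(pick_tier(60))     # Tower
print(pick_tier(190))    # Rack, because 190 sits too close to the Tower ceiling
print(pick_tier(5_000))  # Above Rack
```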

The honest sizing conversation lasts about an hour. It starts with the four numbers, goes through the KV cache math, lands on a tier, and produces a proposal you can defend in a board meeting. Email [email protected] or message us on WhatsApp at +968 9889 9100 for a one-hour briefing. We will leave you with the four numbers in writing whether or not you buy.

Frequently asked

What is the single most important number when sizing a sovereign AI appliance?

Peak concurrent active requests, not total user accounts. Two thousand named users with twenty actively prompting at any moment is a different machine from two thousand users all hitting the system at once. Size for the peak window, not the user list.

How much GPU memory does the KV cache really need?

For a 27B-class model at 256K context, the KV cache for a single fully loaded request runs to tens of gigabytes with grouped-query attention and a quantised cache, and to hundreds of gigabytes without them. Multiply by concurrent active sessions and the cache, not the weights, becomes the dominant memory consumer. PagedAttention and prefix caching reduce the practical footprint, but the back-of-envelope formula (2 × layers × hidden size × context × bytes per element, per session) must always be checked.

What latency target is realistic for chat versus document analysis?

Chat needs first-token latency under 300 milliseconds and steady streaming above 30 tokens per second per user to feel natural. Document analysis tolerates a first-token pause of a second or more and an end-to-end time measured in tens of seconds, because users expect a thinking pause. Batch jobs (overnight summarisation, classification of a backlog) tolerate minutes to hours and should run when interactive load is low.

Does continuous batching change the answer?

Yes, dramatically. A naive static-batch server sized for 50 concurrent users may fall over at 60. The same hardware with continuous batching enabled can sustain 150 to 250 because GPU cycles wasted on padding and idle slots get reclaimed. Modern serving stacks (vLLM, TGI, TensorRT-LLM) all implement it. Buyers should require continuous batching as a baseline, not an upgrade.

How do burst events change sizing?

A cabinet briefing that drives 200 ministers and aides into the system in a fifteen-minute window has a different sizing profile from a back-office team of 200 analysts that prompts steadily through the workday. Burst workloads need headroom and queue discipline. Steady workloads can run closer to peak utilisation. We size burst-heavy institutions one tier above their average load and require a graceful queue, not a hard rejection.

When does an institution graduate from Tower to Rack?

When peak concurrency exceeds roughly 200 active sessions on a 27B to 70B-class model, when the institution needs more than one large model running side by side, when fine-tuning becomes a regular workload, or when redundancy and high availability become operational requirements. Below those thresholds, the Tower is the more disciplined choice.