Storage Tier Design for LLM Weights and KV Cache

Most sovereign AI appliance designs start at the GPU and stop at the network port. The storage tier in between, where weights live, where KV cache spills, and where retrieval corpora are read, decides whether the box restarts in 40 seconds or 12 minutes, and whether one user can stall every other concurrent session. This piece breaks the tier into three roles and gives field-tested rules for each on a Hosn-class on-premise box.

Three storage roles inside one appliance

Every on-premise LLM deployment quietly runs three distinct storage workloads, with very different access patterns. Treating them as one bulk-storage problem is the most common architectural mistake we see in sovereign RFPs.

  • Model weights. Read once at startup, possibly re-read on every model swap. Sequential, cold-cache, latency-tolerant if the operator is patient, latency-critical if the SLA promises sub-minute restarts. A Gemma 4 or Qwen 3.6 checkpoint in BF16 lands between 70 and 200 gigabytes; a quantized GGUF variant lands between 20 and 60 gigabytes (Hugging Face large model loading docs).
  • KV cache spill. Random, latency-critical, written and read in fixed-size pages by the serving engine when GPU memory fills up. vLLM and SGLang both expose CPU and disk swap targets explicitly so long-lived sessions can be paused without losing their context (vLLM PagedAttention design). A rough per-token sizing sketch follows this list.
  • Dataset and RAG corpus. Read-mostly, mixed sequential and random, often shared across nodes when the deployment grows past a single appliance. Indexed embeddings and document blobs can run from low gigabytes (one ministry's policy library) to multiple terabytes (a national archive).
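
To make the spill pressure concrete, the sketch below applies the standard per-token KV footprint formula for a grouped-query-attention transformer. The model dimensions are illustrative assumptions, not the specs of any particular checkpoint.

    # Rough KV cache footprint per token for a GQA transformer.
    # All model dimensions here are illustrative placeholders.
    num_layers   = 80       # decoder layers (assumed)
    num_kv_heads = 8        # KV heads under grouped-query attention (assumed)
    head_dim     = 128      # per-head dimension (assumed)
    dtype_bytes  = 2        # FP16/BF16 cache entries

    # Two tensors (K and V) per layer, per token.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")

    # A single long-lived session at a 128k-token context:
    context_tokens = 128_000
    session_gib = bytes_per_token * context_tokens / 2**30
    print(f"One 128k-token session: {session_gib:.1f} GiB of KV cache")

At these assumed dimensions one paused session holds roughly 39 GiB of KV cache, which is why a handful of long-context users can exhaust HBM and push the engine into spill.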

The tiering principle is simple: put weights and KV spill on local NVMe inside the GPU node, and put shared corpora on a network filesystem that any node can mount. Mixing the two on a single shared appliance is what produces the long restarts and noisy-neighbour stalls that show up in proof-of-concept reviews.
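
A provisioning-time sanity check can enforce that split before the first model load. The sketch below is a minimal version, assuming illustrative mount points rather than any fixed Hosn layout.

    # Confirm each storage role resolves to the intended class of filesystem.
    # Mount points are illustrative placeholders, not a fixed Hosn layout.
    ROLES = {
        "/srv/models":     {"xfs", "ext4"},   # weights: local NVMe
        "/srv/kv-scratch": {"xfs", "ext4"},   # KV spill: local NVMe, no RAID
        "/mnt/corpus":     {"nfs", "nfs4"},   # shared corpus: network mount
    }

    def fstype_for(path: str) -> str:
        """Filesystem type of the longest mount-point prefix covering path."""
        best, best_type = "", "unknown"
        with open("/proc/mounts") as mounts:
            for line in mounts:
                _dev, mount_point, fs_type, *_ = line.split()
                if path.startswith(mount_point) and len(mount_point) > len(best):
                    best, best_type = mount_point, fs_type
        return best_type

    for path, allowed in ROLES.items():
        actual = fstype_for(path)
        print(f"{path}: {actual} [{'ok' if actual in allowed else 'wrong tier'}]")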

Local NVMe for weights and KV spill

Weights and KV spill both want a fast, dedicated, local block device. The right answer in 2026 is U.2 or E1.S NVMe drives on PCIe Gen4 or Gen5 inside the GPU chassis. A single Gen4 enterprise drive sustains 6 to 7 gigabytes per second on sequential read; a Gen5 drive doubles that to 12 to 14 gigabytes per second (SNIA Storage Developer Conference 2023, PCIe Gen5 NVMe). That is the difference between cold-loading a 140 gigabyte BF16 checkpoint in 22 seconds and stretching the same load to four minutes on a SATA SSD.
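
The restart numbers are straight division; the snippet below reproduces them so you can substitute your own checkpoint size and measured drive throughput.

    # Cold-load time for one checkpoint at different sustained sequential read rates.
    # The rates are representative figures; substitute your own measurements.
    checkpoint_gb = 140                      # BF16 checkpoint, as in the example above
    rates_gb_per_s = {
        "SATA SSD":       0.55,
        "PCIe Gen4 NVMe": 6.5,
        "PCIe Gen5 NVMe": 13.0,
    }
    for device, rate in rates_gb_per_s.items():
        seconds = checkpoint_gb / rate
        print(f"{device:>15}: {seconds:5.0f} s  (~{seconds / 60:.1f} min)")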

For KV spill the dominant metric is random-read latency under concurrent load. Enterprise NVMe drives with end-to-end power loss protection and steady-state random read latencies under 100 microseconds at queue depth 32 are the right pick. Consumer-grade M.2 drives can match the raw throughput numbers in a benchmark but collapse under sustained mixed read-write pressure once their SLC cache fills, which is exactly the regime KV spill produces under load. We standardise on dual U.2 drives in mirror for weights, and a separate dedicated drive for KV scratch with no RAID, because the data is reproducible and latency-sensitive.
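
How spill is wired up is a property of the serving stack rather than the drive. As one hedged example, vLLM's engine takes a CPU swap reservation for preempted sequences through its swap_space argument; pushing colder KV pages onward to the dedicated NVMe scratch drive is handled by whatever cache layer the deployment adds on top, which varies by stack and version. The path and sizes below are placeholders to benchmark, not tuning guidance.

    # Hedged example: reserving swap capacity for paused sessions in vLLM.
    # The model path and sizes are placeholders; disk-level KV spill to the
    # scratch drive is handled by an additional cache layer, not shown here.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="/srv/models/current",      # weights on local NVMe (placeholder path)
        gpu_memory_utilization=0.90,      # leave HBM headroom for activations
        swap_space=32,                    # GiB of CPU swap per GPU for preempted sequences
    )

    outputs = llm.generate(
        ["Summarise the attached policy circular."],
        SamplingParams(max_tokens=256),
    )
    print(outputs[0].outputs[0].text)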

Network storage for shared corpora

Once a sovereign deployment has more than one GPU node, or once the retrieval corpus crosses a few hundred gigabytes, network-attached storage stops being optional. The two viable approaches inside an air-gapped Hosn deployment are NFS over RDMA on commodity hardware, or a dedicated parallel filesystem like Weka or VAST.

  • NFS over RDMA. Mature, well-understood, and runs on the same 100 gigabit Ethernet fabric used by the GPU nodes for tensor and pipeline parallelism. The kernel client supports RDMA transport directly, so a corpus served from a single storage head saturates a 100 gigabit link without ceremony. Right answer for ministries, regulators, and most sovereign banks.
  • Weka or VAST. Parallel filesystems that scale out across many storage nodes and present a single namespace. Worth the integration cost when concurrent training is on the table, when the corpus exceeds tens of terabytes, or when many GPU nodes hammer the same dataset and a single NFS head becomes the bottleneck. Reference deployments in the AI training literature show GPUDirect Storage paths bypassing the host CPU entirely (NVIDIA GPUDirect Storage overview).

For most sovereign customers, the right starting point is NFS over RDMA. It uses the same physical fabric as the rest of the 100/25 gigabit AI cluster network, which keeps the bill of materials, the integration burden, and the failure domain small.
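
One cheap operational check is confirming that the corpus mount actually negotiated the rdma transport rather than silently falling back to TCP; on Linux the kernel reports this per NFS mount in /proc/self/mountstats. The mount point below is a placeholder.

    # Report the transport an NFS mount is using (e.g. "rdma" or "tcp").
    # The mount point is an illustrative placeholder.
    CORPUS_MOUNT = "/mnt/corpus"

    def nfs_transport(mount_point: str) -> str:
        transport, in_target = "not found", False
        with open("/proc/self/mountstats") as stats:
            for line in stats:
                if line.startswith("device ") and " mounted on " in line:
                    in_target = f" mounted on {mount_point} " in line
                elif in_target and line.strip().startswith("xprt:"):
                    transport = line.split()[1]
        return transport

    xprt = nfs_transport(CORPUS_MOUNT)
    print(f"{CORPUS_MOUNT} transport: {xprt}")
    if xprt != "rdma":
        print("warning: corpus mount is not using NFS over RDMA")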

Air-gap implications: signed bundles, no remote object stores

The storage tier looks different inside a true air-gapped deployment. There is no S3 bucket, no Hugging Face mirror, no Artifactory reachable from the appliance. Every model file, tokenizer, fine-tune adapter, and embedding index has to arrive across the gap as a signed bundle, verified locally before it touches the working storage. The classified-tier air-gap network architecture piece covers the transfer mechanics in depth; the storage-tier consequence is that the on-box layout has to support that workflow without code changes.

In practice this means three things on every Hosn appliance. First, weights live under a versioned directory tree with manifest files that capture SHA-256 hashes and bundle signatures, not under whatever path a download script happened to choose. Second, the loader verifies signatures against pinned operator keys before mapping any tensor into VRAM. Third, the appliance never reaches out to a remote registry to "check for updates"; updates only arrive when an operator delivers and verifies a new bundle. That is the hard line that separates a real air-gapped box from a cloud-attached one wearing fortress paint.
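
A loader-side verification pass that satisfies all three rules looks roughly like the sketch below: check a detached signature over the manifest against a pinned operator key, then hash every file the manifest names before anything is promoted into the versioned weights tree. The manifest layout, file paths, and the choice of Ed25519 are illustrative assumptions; the real bundle format may differ.

    # Verify a delivered bundle before any tensor is mapped into VRAM.
    # Manifest layout, paths, and the Ed25519 choice are illustrative assumptions.
    import hashlib, json
    from pathlib import Path
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

    BUNDLE = Path("/srv/models/incoming/bundle-2026-03")     # placeholder path
    PINNED_KEY = Path("/etc/hosn/operator-key.pub")          # 32-byte raw Ed25519 key

    def verify_bundle(bundle: Path) -> bool:
        manifest_bytes = (bundle / "manifest.json").read_bytes()
        signature = (bundle / "manifest.sig").read_bytes()

        # 1. Detached signature over the manifest, checked against the pinned key.
        key = Ed25519PublicKey.from_public_bytes(PINNED_KEY.read_bytes())
        try:
            key.verify(signature, manifest_bytes)
        except InvalidSignature:
            return False

        # 2. Every file named in the manifest must match its recorded SHA-256.
        for entry in json.loads(manifest_bytes)["files"]:
            digest = hashlib.sha256()
            with open(bundle / entry["path"], "rb") as blob:
                for chunk in iter(lambda: blob.read(1 << 20), b""):
                    digest.update(chunk)
            if digest.hexdigest() != entry["sha256"]:
                return False
        return True

    if not verify_bundle(BUNDLE):
        raise SystemExit("bundle rejected: signature or hash mismatch")
    print("bundle verified; safe to promote into the versioned weights tree")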

Read the pillar piece on AI rack power, cooling, and air-gapping for how this storage tier sits inside the wider rack and facility design. To size NVMe and network storage for your specific workload, email [email protected] for a one-hour briefing.

Frequently asked

Why are local NVMe drives preferred over network storage for LLM weights?

Cold-loading a 70 to 200 gigabyte model file across a network share adds tens of seconds to a restart and creates a hard dependency on a separate storage box. Local PCIe Gen4 or Gen5 U.2 NVMe drives in the GPU node deliver 6 to 14 gigabytes per second sustained read, so weights stream into VRAM in well under a minute and the appliance stays self-contained.

When should KV cache spill to SSD instead of just dropping the session?

Spill is worthwhile when sessions are long-lived and likely to resume, for example an analyst at a regulator pausing for a meeting. Modern serving stacks like vLLM and SGLang can offload idle KV blocks to NVMe and recover them on resume, which beats recomputing 200,000 tokens of context. For one-shot question answering, eviction without spill is usually cheaper.

Do we need a Weka or VAST cluster, or is plain NFS over RDMA enough?

For a single Hosn Tower or Rack serving inference and modest fine-tuning, NFS over RDMA on standard hardware is enough for shared retrieval corpora. A parallel filesystem like Weka or VAST starts to pay off when many GPU nodes train simultaneously on the same corpus or when datasets exceed tens of terabytes with strict random-read latency targets.

How do model updates reach an air-gapped Hosn appliance?

Hosn ships every model, tokenizer, and weight file as a signed bundle carried across the gap, typically on encrypted physical media or through a hardened data-diode transfer. The appliance verifies signatures against pinned operator keys before loading. There is no reach-back to a remote object store, no cloud weight registry, and no automatic update path that crosses the air gap.