Inference Quantization vs Hardware Class: Where to Spend the Budget

Sovereign procurement always lands on the same fork in the road. The line item for inference compute is fixed, and the architect has to choose: spend on a flagship accelerator and run lightly quantized weights, or spend on cheaper silicon and lean on aggressive quantization to make the model fit. The wrong call burns capex one way or quality the other. This piece works through the tradeoff with three concrete scenarios, explains why perplexity is the wrong measuring stick, and ends with a decision matrix mapped to the cluster's pillar comparison of the H100, H200, RTX 6000 Ada, and Mac Studio.

The capex/quality tradeoff in one frame

Quantization reduces the number of bits each model weight uses. FP16 is the modern training and reference precision; FP8, INT8, Q5, and Q4 are progressively more aggressive compressions. Moving from FP16 to an 8-bit format halves weight memory, and GGUF-style Q5 and Q4 cut it to roughly a third or a quarter; because LLM inference is memory-bandwidth-bound, smaller weights also generate tokens faster. The catch is quality: every step down costs a small amount of accuracy, sometimes invisible, sometimes contractually material.
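To make that relationship concrete, the sketch below works the arithmetic for a 70B model. The effective bytes-per-weight figures for the Q5 and Q4 levels and the roughly 3.35 TB/s bandwidth number are illustrative assumptions, not vendor specifications, and the decode figure is a single-stream ceiling that ignores KV cache, activations, and batching.

    # Back-of-envelope sizing: weight footprint per precision and a naive
    # bandwidth-bound decode ceiling. Weight-only; ignores KV cache,
    # activations, and runtime overhead.
    BYTES_PER_WEIGHT = {
        "FP16": 2.0,
        "FP8": 1.0,
        "INT8": 1.0,
        "Q5": 0.69,   # ~5.5 effective bits/weight, typical GGUF Q5 variant (assumption)
        "Q4": 0.56,   # ~4.5 effective bits/weight, typical GGUF Q4 variant (assumption)
    }

    def weight_gb(params_b: float, precision: str) -> float:
        """Approximate weight footprint in GB for params_b billion parameters."""
        return params_b * BYTES_PER_WEIGHT[precision]

    def decode_ceiling_tok_s(params_b: float, precision: str, mem_bw_gb_s: float) -> float:
        """Upper bound on single-stream decode speed: each token reads every weight once."""
        return mem_bw_gb_s / weight_gb(params_b, precision)

    for prec in ("FP16", "FP8", "Q5", "Q4"):
        print(f"70B @ {prec}: ~{weight_gb(70, prec):.0f} GB weights, "
              f"~{decode_ceiling_tok_s(70, prec, 3350):.0f} tok/s ceiling at ~3.35 TB/s")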

Hardware class works in the opposite direction. A flagship GPU like the H200 has more VRAM, faster HBM3e memory, and dedicated FP8 tensor cores. It runs the same model at the same quant level faster than an RTX 6000 Ada, and it holds precisions that a Mac Studio cannot fit at all. The buyer's question is whether to spend the marginal dollar on the silicon or on the precision.

The honest framing: quantization is the cheaper knob, but it has a quality floor. Hardware is the more expensive knob, but it raises the ceiling on what is even possible. The right answer depends on which workload sits on the rack.

Three scenarios that flip the answer

The same budget produces very different optimal configurations depending on the use case. Three concrete scenarios show why a single rule of thumb fails.

  • Q4 Gemma 4 on M3 Ultra vs FP16 on H100. A Mac Studio M3 Ultra with 192 GB unified memory runs a Q4 quantized 27B Gemma 4 comfortably for a small team, drawing under 200 W, with no rack, no cooling, no datacentre overhead. The same model at FP16 needs an H100 80 GB plus the rack, PDU, and cooling to support it. For a ten-person sovereign team running chat and summarisation, the Mac Studio path delivers within ALUE noise of the H100 path at perhaps a tenth of the all-in cost. The H100 only earns its keep when concurrency, long context, or multi-tenant SLAs enter the picture.
  • FP8 70B on H100 vs FP16 70B on H200. A 70B model at FP8 fits inside a single H100 80 GB and runs faster than the same model at FP16 on an H200 141 GB, because FP8 doubles effective tensor throughput on Hopper. The quality delta is small on most reasoning tasks, as documented in NVIDIA's Transformer Engine FP8 primer. The H200 only pulls ahead clearly on context windows past 64k tokens, where every extra GB of HBM lets the KV cache breathe. For chat and short-context RAG, FP8 on H100 wins on cost per useful token.
  • CPU plus Q4 on edge vs RTX 6000 Ada. A modern Xeon or EPYC workstation with 256 GB DDR5, running a Q4 model under llama.cpp, will produce roughly 10 to 20 tokens per second on a 27B model. An RTX 6000 Ada will produce 50 to 80 on the same workload at the same or better quality. For a single-analyst, low-volume edge node where a GPU is a procurement headache, the CPU path is acceptable. For any shared service, the GPU path is the right call. The crossover is concurrency, not raw speed; a configuration sketch for both paths follows this list.
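For the third scenario, here is a minimal configuration sketch using the llama-cpp-python bindings, showing the CPU-only path and the full-GPU-offload path side by side. The model file name and parameter values are placeholders, not a recommendation.

    # CPU-only edge node versus workstation GPU, same Q4 GGUF file.
    from llama_cpp import Llama

    # CPU path: all layers stay on the host; DDR5 bandwidth sets the pace.
    cpu_llm = Llama(
        model_path="models/27b-q4_k_m.gguf",   # hypothetical file name
        n_ctx=8192,
        n_threads=32,        # match the physical cores on the Xeon/EPYC box
        n_gpu_layers=0,
    )

    # GPU path: every layer offloaded to an RTX 6000 Ada (48 GB); VRAM bandwidth sets the pace.
    gpu_llm = Llama(
        model_path="models/27b-q4_k_m.gguf",
        n_ctx=8192,
        n_gpu_layers=-1,     # -1 offloads all layers
    )

    out = gpu_llm("Summarise the attached procurement memo in three sentences.", max_tokens=128)
    print(out["choices"][0]["text"])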

Don't trust perplexity, measure ALUE delta

Most quantization research reports perplexity on WikiText, C4, or another English-heavy corpus. Perplexity collapses the entire output distribution into one number, which is convenient and misleading. A quantized Arabic LLM can hold perplexity inside half a point of FP16 and still lose three points on classical Arabic, two points on named-entity recognition, and noticeable ground on rare-diacritic generation. The averaging hides the regression because the bulk of training tokens are mundane.

The right procurement test is a benchmark delta on Arabic-specific suites, paired with an institutional eval built from the buyer's own corpus. We cover the full landscape of GGUF quantization for Arabic in a sibling article. The headline rule: any quant level that loses more than three ALUE points or more than five points on the institutional eval should not ship to production. The Arabic and English deltas often diverge by a factor of two to three, with Arabic taking the larger hit, as the broader 2024 work on quantization quality at scale documents (arXiv:2402.16775).
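In an eval harness, that shipping gate might look like the minimal sketch below. The score values are invented and the field names are placeholders; only the two thresholds come from the rule above.

    # Shipping gate: a quantized candidate must stay inside both regression budgets
    # relative to the FP16 baseline.
    ALUE_MAX_DROP = 3.0           # ALUE points, per the rule above
    INSTITUTIONAL_MAX_DROP = 5.0  # institutional-eval points, per the rule above

    def may_ship(fp16_scores: dict, quant_scores: dict) -> bool:
        """True only if the quantized model stays inside both regression budgets."""
        alue_delta = fp16_scores["alue"] - quant_scores["alue"]
        inst_delta = fp16_scores["institutional"] - quant_scores["institutional"]
        return alue_delta <= ALUE_MAX_DROP and inst_delta <= INSTITUTIONAL_MAX_DROP

    baseline = {"alue": 71.2, "institutional": 84.0}       # hypothetical FP16 scores
    q4_candidate = {"alue": 67.5, "institutional": 80.1}   # hypothetical Q4 scores
    print(may_ship(baseline, q4_candidate))                # False: the ALUE drop is 3.7 points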

Decision matrix per use case

The simplified buyer's matrix below maps workload shape to the recommended quant and hardware combination. It is the same rubric we use in scoping briefings; a machine-readable sketch of it follows the list.

  • Chat and summarisation, single team. Q4 or Q5 on Mac Studio, RTX 6000 Ada, or a single H100. Quality is within noise of FP16 in real use.
  • RAG over institutional corpus, mixed concurrency. FP8 70B on H100, or Q5 70B on RTX 6000 Ada plus a second card. Spend the marginal dollar on memory, not precision.
  • Long-context reasoning, contract-grade generation. FP16 or Q6 70B on H200 or dual H100. Quantization aggression here is false economy.
  • Air-gapped edge node, single analyst. Q4 on CPU plus llama.cpp, or a single workstation GPU. Sizing matters less than airflow and write-back to the central appliance.
  • Multi-tenant sovereign appliance, twenty plus parallel users. FP8 on H100 or H200 cluster, never CPU, never aggressive Q4. Concurrency degrades faster than single-user quality does, and the rack pays for itself in months.
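For scoping tools, the same matrix can be carried as a small data structure. The keys and hardware labels below are illustrative placeholders, meant to be adapted to the buyer's own taxonomy.

    # The decision matrix above as a machine-readable rubric (illustrative only).
    DECISION_MATRIX = {
        "chat_summarisation_single_team": {"quant": ["Q4", "Q5"],
                                           "hardware": ["Mac Studio", "RTX 6000 Ada", "1x H100"]},
        "rag_mixed_concurrency":          {"quant": ["FP8", "Q5"],
                                           "hardware": ["1x H100", "2x RTX 6000 Ada"]},
        "long_context_contract_grade":    {"quant": ["FP16", "Q6"],
                                           "hardware": ["H200", "2x H100"]},
        "air_gapped_edge_single_analyst": {"quant": ["Q4"],
                                           "hardware": ["CPU + llama.cpp", "workstation GPU"]},
        "multi_tenant_20plus_users":      {"quant": ["FP8"],
                                           "hardware": ["H100 cluster", "H200 cluster"]},
    }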

For sizing the appliance behind any of these, see our companion piece on sovereign appliance sizing by users and latency.

When to upgrade hardware vs upgrade quant

The simplest decision rule: if the bottleneck shows up on a written eval, upgrade the quant level. If it shows up on a load test, upgrade the hardware. Quantization fixes single-prompt quality. Hardware fixes throughput, latency under concurrency, and context length. Buyers who try to fix a concurrency problem by upgrading from Q4 to Q6 will move quality up by a fraction of a point and watch tokens-per-second drop, which solves the wrong problem. Buyers who try to fix a quality problem by adding a second GPU will discover that two cards running Q4 still produce Q4 outputs.

A practical sequence for a new sovereign deployment: install the model at FP16 on whatever hardware the rack will hold, run the institutional eval to get a ground truth, then quantize down step by step (FP8, Q6, Q5, Q4) and stop one level above where the eval breaks the contract. That floor is your production quant. Hardware sizing then follows from concurrency, latency, and context targets, independent of the quality question.
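As a sketch, that stepping-down procedure reads like this; run_institutional_eval stands in for the buyer's own harness, and the five-point budget mirrors the rule from the previous section.

    # Walk the quant ladder from FP16 down; stop one level above where the eval breaks the contract.
    QUANT_LADDER = ["FP16", "FP8", "Q6", "Q5", "Q4"]

    def run_institutional_eval(quant_level: str) -> float:
        """Placeholder: deploy the model at quant_level and return its institutional eval score."""
        raise NotImplementedError("wire this to your own eval harness")

    def pick_production_quant(max_drop: float = 5.0) -> str:
        """Return the most aggressive quant level that stays inside the regression budget."""
        baseline = run_institutional_eval("FP16")   # ground truth at reference precision
        production = "FP16"
        for level in QUANT_LADDER[1:]:
            if baseline - run_institutional_eval(level) > max_drop:
                break                               # this level breaks the contract
            production = level                      # still inside budget; keep stepping down
        return production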

For a sovereign buyer who wants a fitted recommendation against a real workload and budget, email [email protected] for a one-hour briefing. We bring the eval scripts, the hardware models, and the contract-grade decision rubric.

Frequently asked

Is it better to buy a cheaper GPU and quantize aggressively, or pay for a flagship and run lightly quantized?

Neither answer is universal. For chat, summarisation, and most retrieval workloads, a mid-tier accelerator running Q5 or FP8 weights is usually within ALUE noise of a flagship running FP16, at a fraction of the capex. For long-context reasoning, agentic tool use, and multilingual generation under contract, the flagship plus light quant wins because long-context accuracy degrades faster than chat accuracy does under aggressive quant.

Why is perplexity a poor metric for choosing a quantization level?

Perplexity is computed on token-level probability distributions, usually on English-dominant corpora like WikiText or C4. A quantized Arabic model can keep perplexity inside half a point of FP16 and still lose three or four ALUE points on classical Arabic, named-entity recognition, or rare-diacritic tasks. Sovereign procurement should mandate ALUE delta and an institutional eval, not perplexity.

When should a sovereign buyer upgrade hardware instead of upgrading the quant level?

Upgrade hardware when the use case is concurrency-bound, latency-bound under load, or context-window-bound. If the bottleneck is tokens per second under twenty parallel requests, more VRAM and faster memory help more than re-quantizing. Upgrade the quant level when single-user quality on a written eval is the limiting factor and throughput is already adequate.

Is FP8 inference safe for production sovereign workloads?

FP8 is now standard practice on Hopper- and Blackwell-class accelerators and behaves close to FP16 on most reasoning and chat tasks when calibrated properly. NVIDIA's own technical reports show single-digit accuracy deltas on common benchmarks. For Arabic and bilingual workloads, run an institutional eval before signing the contract; the Arabic delta is sometimes larger than the headline English delta.