LoRA, QLoRA, and RLHF on Customer Hardware: Fine-Tuning Without the Cloud
The shortest possible argument for sovereign fine-tuning is this: your training data is the moat, the cloud cannot see it, and in 2026 you no longer need a frontier cluster to use it. A single rack-mounted GPU and a few thousand carefully curated examples are enough to bend an open-weight model to your institution's voice, your classification taxonomy, your regulator's terminology, your internal abbreviations. The mechanics that make this possible (LoRA, QLoRA, DPO, parameter-efficient training) are public, peer-reviewed, and now boringly stable. This guide walks through what each one does, what hardware is required, how to prepare data without leaking it, how to know when the result is good, and how to operate the resulting adapter like any other audited artefact.
Why fine-tuning matters for sovereign AI
A general-purpose open-weight model knows the world. It does not know your institution. It has never read your internal procedure manuals, your classification levels, your regulatory templates, the way your director general phrases a decision. Off-the-shelf, it is a competent generalist. With a few hours of fine-tuning on a thousand institutional examples, it becomes specifically yours.
For a sovereign buyer, fine-tuning carries a second weight. The data used for training is, almost by definition, the most sensitive corpus the institution owns. Past correspondence, historical decisions, internal classification taxonomies, dialect-specific phrasings, regulatory nomenclature, customer files. None of this can be uploaded to a public fine-tuning API. None of it can leave the perimeter. Fine-tuning that requires sending data to a foreign cloud is a non-starter for any workload that touches Royal Decree 6/2022's national-security carve-out or any sectoral regulator's data-residency requirement.
The 2026 reality is that on-premise fine-tuning is no longer a research curiosity. It is the default. The hardware is buyable, the libraries are mature, the recipes are stable, and the resulting adapter is a small, auditable artefact you can version, sign, archive, and roll back. The rest of this article is a buyer's-eye walk through the mechanics.
The LoRA breakthrough, in 200 words
Low-Rank Adaptation, introduced by Hu et al at Microsoft Research in the 2021 LoRA paper, is the technique that broke fine-tuning out of the data-centre. The insight is simple. When you fine-tune a large model, you do not need to update every weight. The change you actually make is low-rank: a small, structured update on top of the original weights. LoRA freezes the base model entirely and inserts pairs of small matrices (rank 8, 16, 32) into selected layers. Only those small matrices train. Everything else stays exactly as the model publisher released it.
The numbers are striking. For a 7B model, classic LoRA trains roughly 0.1 to 1 percent of the parameters and reaches accuracy parity with full fine-tuning on most downstream tasks. For a 70B model, the trainable count drops further still. The resulting adapter file is tens of megabytes, not gigabytes. You can keep dozens of adapters per base model, swap them at inference time, and version them in any standard registry. The base model on disk never changes, which means audit and provenance for the heavyweight artefact are trivial.
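For concreteness, this is roughly what the recipe looks like with the Hugging Face PEFT library. A minimal sketch only: the rank, alpha, and target module names are illustrative and depend on the base model's architecture, and the model path is a placeholder.

```python
# Minimal LoRA setup with Hugging Face PEFT. Values are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("path/to/base-model")  # frozen base weights

lora = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections, typical for Llama-style layers
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()         # typically well under 1% of total parameters
```

Only the small injected matrices receive gradients; saving the result produces the tens-of-megabytes adapter file described above, while the base checkpoint on disk is untouched.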
QLoRA: how 4-bit quantization makes 70B fine-tuning fit
LoRA already shrinks the trainable parameter count. QLoRA, published by Dettmers et al in May 2023, then attacks the memory footprint of the frozen base model itself. The contribution has three parts.
First, a new 4-bit data type called NF4 (NormalFloat-4) that is information-theoretically optimal for the normally distributed weight tensors of a transformer. Second, double quantization, which quantizes the quantization constants themselves to recover another small chunk of memory. Third, paged optimizers that use NVIDIA's unified memory to spill optimizer state to host RAM during the rare gradient checkpointing spikes, preventing out-of-memory crashes on long sequences.
The combined effect is dramatic. Where 16-bit LoRA needs roughly 2 bytes per parameter for the frozen weights, NF4 needs 0.5. A 70B-parameter model that requires 140 GB of GPU memory in 16-bit fits in around 35 GB in NF4, leaving room for the LoRA adapter, optimizer states, and activations on a single 48 GB workstation card or a single H100. The original QLoRA paper demonstrated 65B-parameter fine-tuning on a single 48 GB GPU with quality matching 16-bit full fine-tuning on the Vicuna evaluation suite. That result is the moment fine-tuning crossed from a data-centre activity to a workstation activity.
QLoRA is now the default starting recipe for any model larger than 13B on a single GPU, and the standard reference for institutions sizing their first training rig. The implementation is shipped as part of the bitsandbytes library and is integrated into every major fine-tuning framework.
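In practice the recipe is a quantization config passed at model-load time, with LoRA attached on top exactly as before. A sketch against the transformers and bitsandbytes integration; the model path is a placeholder.

```python
# Loading a large base model in NF4 for QLoRA.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat-4 storage
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in 16-bit on top of 4-bit storage
)

base = AutoModelForCausalLM.from_pretrained(
    "path/to/70b-base-model",
    quantization_config=bnb,
    device_map="auto",
)
# Frozen weights now occupy roughly 0.5 bytes per parameter; the LoRA
# adapter is attached on top exactly as in the 16-bit case.
```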
RLHF and DPO: aligning the model to your institution's voice
Supervised fine-tuning teaches the model what to say. Reinforcement Learning from Human Feedback teaches it which way of saying it the institution prefers. The classical RLHF pipeline, popularised by OpenAI's 2017 work and codified in the InstructGPT paper, has three stages. Stage one is supervised fine-tuning on demonstrations. Stage two is collecting preference pairs (operator A and operator B rate which of two model outputs is better) and training a separate reward model on those pairs. Stage three is policy optimisation, typically with Proximal Policy Optimization (PPO), where the original model is updated to maximise the reward model's score while staying close to its starting distribution.
That pipeline works, but it is operationally fragile. PPO is sensitive to hyperparameters, the reward model can be gamed, and the four-model dance (policy, reference, reward, value) chews memory. In 2023, Direct Preference Optimization (DPO), proposed by Rafailov et al at Stanford, eliminated most of that complexity. DPO shows that the entire RLHF objective can be reformulated as a single supervised classification loss over preference pairs, with no explicit reward model and no reinforcement-learning loop at all. The training behaves like a normal cross-entropy fine-tune. The mathematical equivalence proof in the paper carried it from research curiosity to default within months.
For sovereign deployments, the practical guidance is simple. Start with supervised fine-tuning to teach the model the domain. Then, if the institution has a corpus of preference data (operators rating internal answers as "yes use this" versus "rewrite this"), apply DPO on top. Reserve PPO-based RLHF for the rare cases where the reward signal cannot be expressed as preference pairs, for example when it depends on the output of an external tool. The Hugging Face TRL library ships production-quality DPO, IPO, KTO, and PPO trainers; axolotl wraps them in a config-file-driven workflow that suits institutional change control well.
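As a reference point, a DPO run in TRL is barely more code than the supervised fine-tune it follows. A sketch only: exact argument names vary across TRL versions, the paths are placeholders, and the prompt/chosen/rejected pairs are the institution's own preference data.

```python
# DPO on top of an SFT checkpoint with Hugging Face TRL.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("path/to/sft-checkpoint")
tokenizer = AutoTokenizer.from_pretrained("path/to/sft-checkpoint")
pairs = load_dataset("json", data_files="preference_pairs.jsonl", split="train")

args = DPOConfig(
    output_dir="dpo-adapter",
    beta=0.1,                       # strength of the pull toward the reference model
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=pairs,            # expects prompt / chosen / rejected columns
    processing_class=tokenizer,     # named `tokenizer` in older TRL releases
)
trainer.train()
```

No reward model, no PPO loop: the run behaves like a normal supervised fine-tune and fits on the same hardware tier.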
Hardware sizing for fine-tuning
The practical question is not "what is the absolute minimum?" but "what hardware lets the team iterate fast without burning weeks on memory tuning?" In 2026, the right buying tiers are clean.
For 7B-class models (Gemma 4 4B-7B, Qwen 3.6 7B, Falcon Arabic 7B). A single NVIDIA RTX 6000 Ada (48 GB of GDDR) or RTX 6000 Blackwell-class card (96 GB) is comfortable for 16-bit LoRA fine-tuning at production batch sizes. QLoRA pushes the same card to 13B-class without strain. This is the workstation-tier training rig: one card, one developer machine, results in hours.
For 27B to 30B-class models (Gemma 4 27B MoE, Qwen 3.6 27B, Mixtral-style MoE). A single H100 80 GB or H200 141 GB handles QLoRA cleanly. A Tower-tier rig with one H100 is the right institutional starting point: it serves inference for departmental workloads during the day, runs overnight fine-tunes, and produces adapters in a single working day for typical 5,000-example corpora.
For 70B-class models. One H100 with QLoRA is feasible per the original paper, but slow at production batch sizes. The pragmatic configuration is two H100 or H200 GPUs with NVLink, allowing 16-bit LoRA, larger batches, and tractable wall-clock times. This is the institutional Rack tier doubling as a training environment.
For full-precision SFT of larger models. Multi-GPU is mandatory. Four to eight H100s with NVLink or NVSwitch, fully sharded data parallel via PyTorch FSDP or DeepSpeed ZeRO-3, and high-bandwidth NVMe. This is the configuration for the small share of workloads that genuinely need full SFT. Most institutional workloads do not, and reach for it only when LoRA and QLoRA have been measured and ruled out on a representative slice.
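The arithmetic behind these tiers is simple enough to do on the back of an envelope. A rough sizing sketch using rule-of-thumb figures only: adapters, optimizer state, activations, and the KV cache add on top of the frozen weights, so leave 20 to 40 percent headroom in practice.

```python
# Back-of-envelope VRAM for the frozen base weights alone.
BYTES_PER_PARAM = {"fp16_bf16": 2.0, "int8": 1.0, "nf4": 0.5}

def base_weights_gb(params_billion: float, dtype: str) -> float:
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

for size in (7, 27, 70):
    print(f"{size}B  fp16: {base_weights_gb(size, 'fp16_bf16'):6.1f} GB   "
          f"nf4: {base_weights_gb(size, 'nf4'):5.1f} GB")
# 70B fp16: ~140 GB (multi-GPU territory)   70B nf4: ~35 GB (one 48-80 GB card)
```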
One sizing rule trumps all the others: buy the smallest tier that completes the institution's largest realistic fine-tune in under twenty-four hours of wall-clock time, with one tier of headroom. Iteration speed is the dominant variable in fine-tuning quality. A team that runs five experiments per week beats a team that runs one, regardless of card model.
Data preparation for sovereign fine-tunes
Hardware is the easy part. The hard part is the data. A sovereign fine-tuning dataset is a sensitive artefact in its own right and deserves the same handling as any other classified document.
Start with classification. Every example carries the classification of its source. Mixing a Confidential example into a Restricted training set raises the resulting adapter to Confidential. The adapter inherits the highest classification of any example in its training corpus, period. This rule is the single most important governance decision in the pipeline, and it should be enforced by tooling, not by goodwill.
Move on to PII handling. Most institutional corpora contain personal data: names, identification numbers, customer references, addresses. For some training objectives this is fine and even necessary. For most it is not. A pre-training scrub that replaces personal identifiers with placeholders (NAME_1, NID_1, ACCOUNT_1) preserves the linguistic signal while removing the leakage risk. The placeholders are reversible only against an offline mapping table that lives outside the training environment.
Then the train and eval split. A naive random split is the wrong move when the corpus contains correlated documents. Splitting by date, by case file, or by author prevents leakage of near-duplicates between train and eval and gives a more honest measurement. Reserve at least 10 percent of the cleaned corpus for held-out evaluation and never let a single training pass touch it.
Finally, format. The fine-tuning library expects a specific schema (instruction-input-output for supervised fine-tuning, prompt-chosen-rejected for DPO). Convert once, hash the resulting JSONL, store the hash in the dataset registry. Every fine-tuning run logs which dataset hash it used. Reproducibility costs almost nothing at this stage and saves weeks of confusion six months later.
Evaluation: what "good" looks like
The evaluation question splits into two halves: automated suites and human review.
The automated half is non-negotiable. Every fine-tune produces a numeric score against a fixed evaluation suite that includes a held-out slice of the institution's own corpus, a generic capability benchmark (MMLU, GSM8K, HumanEval) to detect regression on general competence, and any sector-specific suite that matters for the institution's use case (legal QA, Arabic language, code completion). The suite is the same across fine-tunes so comparisons are honest. Tools like EleutherAI's lm-evaluation-harness are the standard for the public benchmarks; the institution writes its own runner for the private slice.
The human half catches what the automated suites miss. Pick five to ten operators who would be the actual end users of the model. Run a blind A/B between the new adapter and the previous production adapter on twenty real prompts each. Operators rate each pair on accuracy, helpfulness, and tone. The numbers are noisy, but the cumulative signal across a hundred or more operator-prompt ratings per fine-tune is reliable enough to gate a production rollout. A new adapter that loses to the incumbent on operator preference does not ship, regardless of how well it scored on the automated suite.
One final guardrail. Test the new adapter on a small set of adversarial and out-of-distribution prompts before promotion. A fine-tune can subtly degrade safety behaviour on inputs the operators never see in their normal flow. A ten-minute red-team check at promotion time prevents a category of incident that is otherwise difficult to detect after deployment.
Productionising the adapter
The fine-tune is finished. The numbers are good. Now operations begin.
Treat the adapter file as a versioned artefact. Each release gets a semantic version, a SHA-256 hash, the dataset hash it trained on, the base model version it pins to, the training framework version, and the evaluation report. All of this lives in a small registry inside the institution. The registry is the source of truth: an adapter that is not registered is not deployable.
Deploy through a staging enclave. The new adapter loads first on a non-production replica of the inference server, behind a feature flag that exposes it to a controlled user group. The group runs real workloads for one to two weeks. If telemetry, error rates, and operator feedback all hold, the flag flips to production. If anything regresses, the flag flips back. The base model never moved, so rollback is instant.
Plan for adapter sprawl. A mature institution will accumulate adapters per department, per use case, per regulator, per language. Inference servers like vLLM and TGI now support hot-loading adapters per request, which means one base model can serve a dozen specialised behaviours from the same GPU pool. The operational pattern is one canonical base model per generation, many adapters layered on top, all of them small, all of them hashed, all of them owned by the institution.
Plan for retirement. Adapters age. The base model gets a security update, a new variant ships, the regulator changes terminology, the operator preferences shift. Every adapter has a defined review cadence (six or twelve months), and the registry surfaces the ones overdue. Retiring an unused adapter is a one-line registry change. Retraining one is a half-day exercise, not a project.
If your institution is moving from off-the-shelf open-weight models to a fine-tuned, voice-aligned, classification-aware deployment and you would like a one-hour briefing on the data, hardware, and operational pattern that fits your specific situation, the next step is simple. Email [email protected] or message +968 9889 9100. We will walk through your corpus shape, classification levels, target hardware tier, and a credible training and evaluation plan against your timeline. Pricing is by quotation, sized to your specific requirement.
Frequently asked
Do I need a frontier cluster to fine-tune a useful model?
No. QLoRA, published in 2023, demonstrated 65B-parameter fine-tuning on a single 48 GB GPU and matched full 16-bit fine-tuning quality on its evaluation suites. In 2026, a single RTX 6000 Ada or RTX 6000 Blackwell handles 7B to 13B models at full precision with LoRA, and 70B-class models with QLoRA. The institutional bottleneck is data preparation and evaluation discipline, not raw compute.
Is LoRA quality really the same as full fine-tuning?
On most institutional tasks, yes. The original LoRA paper by Hu et al at Microsoft Research showed parity or near-parity with full fine-tuning across GLUE, WikiSQL, and SAMSum, while training only a small fraction of the parameters. For tasks that require absorbing entirely new capabilities or large vocabulary shifts, full SFT can edge out LoRA. The right test is to run both on a representative slice of your data and compare against your held-out evaluation set.
What is the difference between RLHF and DPO, and which should we use?
RLHF in its classic form trains a reward model from human preference pairs, then optimises the policy model against the reward model with PPO. DPO, introduced by Rafailov et al in 2023, removes the explicit reward model and the PPO loop and reformulates the objective as a direct supervised loss over preference pairs. DPO is simpler to implement, more stable, and almost always the right starting point for a sovereign deployment. Reach for full PPO-based RLHF only when DPO cannot express your reward signal.
Can fine-tuning happen fully air-gapped?
Yes. The base model, tokenizer, training framework (axolotl, TRL, Hugging Face Transformers), and quantization library (bitsandbytes) are all downloaded once over a controlled channel, hashed against publisher signatures, and pinned. Training itself is a deterministic local computation. No telemetry leaves the perimeter, and the resulting adapter file is a sovereign artefact owned by the institution.
How much labelled data do we need?
Less than most teams expect. For instruction-style fine-tunes on a domain-specific task, 1,000 to 5,000 high-quality examples often produce a measurable lift. For style and tone work, 200 to 500 carefully written examples can be enough. The dominant variable is data quality, not quantity. Spending two weeks on a clean, classification-aware dataset beats spending two weeks scraping a noisy ten-times-larger one.
How do we roll back a bad fine-tune?
Adapters make rollback trivial. Each fine-tune produces a small adapter file (often tens of megabytes) that is loaded on top of the immutable base model. Versioning lives in the institution's adapter registry. Rolling back is unloading the failing adapter and loading the previous version, no model swap required. This is one of the operational reasons LoRA-style fine-tuning is the dominant pattern in production.