DPO and RLHF on Private Data: An Operational Playbook
Sovereign buyers picking up base models from Hugging Face arrive at the same fork: how do you teach a Gemma 4 or Qwen 3.6 instance the right tone, the right refusal policy, the right citation style, without leaking a single token to a public API? The answer is a closed-loop preference pipeline running entirely inside the perimeter. This playbook covers DPO versus full RLHF, how to collect preference data, how to run a training pass with the TRL library on hardware you already own, and how to wire eval and rollback so a bad run is reversible in minutes.
1. DPO versus RLHF in 200 words
Classical RLHF, popularised by InstructGPT in 2022, is a three-stage pipeline: supervised fine-tuning (SFT), reward-model training on human preferences, then policy optimisation against that reward with PPO. It works, but it is operationally expensive: two extra models, distributional drift, careful KL control, and reward hacking that only shows up after deployment. Read the original recipe in Ouyang et al., 2022.
Direct Preference Optimization (Rafailov et al., 2023, a NeurIPS 2023 outstanding-paper runner-up) collapses the three stages into one. Given paired preferences (chosen, rejected), DPO derives a closed-form loss that updates the policy directly against an implicit reward, with the SFT model itself acting as the KL anchor. No reward model. No PPO. One loss, one optimiser, one reproducible run.
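For reference, the DPO objective from the paper, with y_w the chosen completion, y_l the rejected one, π_ref the frozen SFT anchor, and β the strength of the KL tether:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Larger β keeps the policy closer to the SFT anchor; the 0.1 to 0.2 range used in the recipes below is the usual starting point.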
For sovereign deployments the operational delta is enormous. DPO is auditable, single-run, single-artefact. Reach for full PPO RLHF only when DPO plateaus and you genuinely need a learned reward signal: chain-of-thought reasoning quality, multi-turn agentic behaviour, or safety policies that cannot be expressed as pairs.
2. The preference-data pipeline
Whatever method you pick, the bottleneck is the data, not the GPUs. A 1,500-pair dataset annotated by two trained domain experts will out-train a 50,000-pair scrape every time. The pipeline:
- Prompt sourcing. Pull 200 to 500 representative prompts from your real corpus: redacted tickets, sanitised legal queries, citizen-service requests, internal Q&A. Bias toward prompts where the current model fails or is awkward.
- Candidate generation. For each prompt, sample two to four candidate completions from your current SFT model with varied temperature (0.3, 0.7, 1.0). Optionally include one completion from a stronger ceiling model run on-premise (a 70B base) to give annotators a high-quality reference.
- Pairwise annotation. Annotators see (prompt, A, B) and pick the better answer plus a free-text reason. Force a binary choice; ties dilute signal. Track Cohen's kappa across annotators on a 200-prompt overlap set; below 0.7 you are training on noise.
- Quality gates. Reject pairs where annotation confidence is low, where chosen and rejected differ only in length, or where the rejected answer is factually correct but stylistically off (use a separate style-only dataset for those).
- Format. Serialise as JSONL with the fields prompt, chosen, and rejected; this is the schema TRL's DPOTrainer expects. A minimal example follows this list.
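Two illustrative records in that schema (the values are invented for this example):

```json
{"prompt": "Summarise clause 14 of the draft tender.", "chosen": "Clause 14 obliges the supplier to ... [cites the clause verbatim]", "rejected": "Sure thing! Clause 14 is basically about ..."}
{"prompt": "Can you share the bid evaluation scores?", "chosen": "I can't disclose evaluation scores. I can explain the published scoring criteria instead.", "rejected": "The scores were: Vendor A 87, Vendor B 74 ..."}
```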
The whole loop runs inside a sovereign annotation UI on the Hosn appliance. No data leaves. Per the broader stack discussion in our LoRA/QLoRA on-premise pillar, the same hardware that serves inference can run annotation and training off-hours.
3. The training run, with TRL
Hugging Face's TRL library is the de facto open-source toolkit for both DPO and PPO RLHF. It wraps the maths and exposes clean trainer classes: SFTTrainer, DPOTrainer, PPOTrainer, RewardTrainer. Two recipes:
Single-GPU 7B with QLoRA + DPO
- Hardware. One H100 80GB, or even a 48GB RTX 6000 Ada with gradient checkpointing.
- Setup. Load the SFT model in 4-bit (bitsandbytes NF4), attach a LoRA adapter (r=16, alpha=32, target the attention and MLP projections).
- DPO config. Beta 0.1 to 0.2, learning rate 5e-7 to 1e-6, batch size 4 with gradient accumulation 8, one to three epochs over your pairs (see the sketch after this recipe).
- Wallclock. 3,000 pairs converge in 90 to 180 minutes. The output is a 200MB LoRA adapter, not a new base model. Hot-swap it in vLLM with no downtime.
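The single-GPU recipe above, as a minimal TRL sketch. Model paths are placeholders, and argument names drift between TRL releases (older versions take tokenizer= where newer ones take processing_class=), so check against the version pinned on your appliance:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

# 4-bit NF4 quantisation for the QLoRA base. Model path is a placeholder.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/sft-model", quantization_config=bnb, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("path/to/sft-model")

# LoRA adapter per the recipe: r=16, alpha=32, attention + MLP projections.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# JSONL with prompt/chosen/rejected fields, from the annotation pipeline.
pairs = load_dataset("json", data_files="preference_pairs.jsonl", split="train")

config = DPOConfig(
    output_dir="dpo-adapter",
    beta=0.1,
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    gradient_checkpointing=True,
    bf16=True,
)

# With peft_config set, TRL treats the frozen base weights as the implicit
# reference model, so no second copy of the model sits in memory.
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=pairs,
    processing_class=tokenizer,  # older TRL releases: tokenizer=
    peft_config=lora,
)
trainer.train()
trainer.save_model("dpo-adapter")
```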
Multi-GPU 70B with FSDP + DPO
- Hardware. Eight H100 or H200 GPUs over NVLink/NVSwitch, or four H200 with FSDP sharding (launch config sketched after this list).
- Setup. Full-precision DPO is rare on 70B. Use QLoRA again with r=64. The reference model can be the same QLoRA-frozen base, saving one full copy in memory.
- DPO config. Beta 0.1, learning rate 1e-6, micro-batch 1 with gradient accumulation 16. One epoch over 8,000 to 20,000 pairs.
- Wallclock. Six to twelve hours. Cost on owned hardware: marginal power. Cost on rented hyperscaler: a different conversation.
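One common way to drive the multi-GPU run is an accelerate FSDP config, launched with accelerate launch --config_file fsdp_config.yaml train_dpo.py. The key names below follow recent accelerate releases and should be verified against your pinned version; treat this as a sketch, not a drop-in file:

```yaml
# fsdp_config.yaml - illustrative settings for 8 GPUs, full-shard FSDP
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
num_processes: 8
mixed_precision: bf16
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_state_dict_type: SHARDED_STATE_DICT
```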
4. Eval before and after, no exceptions
Every preference run gets gated by a frozen eval suite. Without it, you are flying blind:
- Held-out preference accuracy. Reserve 10 to 15 percent of pairs as a test split. The DPO-tuned model should pick chosen over rejected with 70 to 85 percent accuracy. Below 60 percent, the run did not learn; above 90, you are overfitting (a measurement sketch follows this list).
- Capability regression. Run MMLU, GSM8K, and your Arabic-language eval (ArabicMMLU, ALUE) before and after. A 1 to 2 point drop is normal and acceptable; 5+ points means alignment is eating capability.
- Style and faithfulness. Wire Ragas faithfulness and answer-relevance metrics over a sovereign 200-question set; details are in our LLM evaluation frameworks piece.
- Red-team safety. 100-prompt adversarial set covering refusal policy, jailbreaks, classified-data probing. Refusal rate must hold or improve.
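As promised in the first bullet, a sketch of the held-out preference-accuracy check: score each pair by total completion log-probability under the tuned model and count how often chosen beats rejected. Paths and file names are placeholders:

```python
import json

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the DPO adapter on top of its base. Paths are placeholders.
model = AutoPeftModelForCausalLM.from_pretrained(
    "dpo-adapter", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("path/to/sft-model")
model.eval()

@torch.no_grad()
def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of token log-probs of `completion` given `prompt`.

    Assumes the tokenizer splits cleanly at the prompt/completion
    boundary; for chat models, apply the chat template to both first.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits[0, :-1]        # position i predicts token i+1
    targets = full_ids[0, 1:]
    logps = torch.log_softmax(logits.float(), dim=-1)
    token_logps = logps.gather(1, targets.unsqueeze(1)).squeeze(1)
    return token_logps[prompt_len - 1:].sum().item()  # completion tokens only

pairs = [json.loads(line) for line in open("heldout_pairs.jsonl")]
wins = sum(
    completion_logprob(p["prompt"], p["chosen"])
    > completion_logprob(p["prompt"], p["rejected"])
    for p in pairs
)
print(f"Held-out preference accuracy: {wins / len(pairs):.1%}")
```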
All four numbers go into the model registry alongside the adapter weights, the training-data hash, and the TRL git commit. Every promotion to production is one line in the registry; so is every rollback.
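A registry entry might then look like this (the schema and numbers are illustrative, not a Hosn format):

```json
{
  "artefact": "dpo-adapter-2025-06-r3",
  "base_model": "sft-7b-v4",
  "training_data_sha256": "<hash of preference_pairs.jsonl>",
  "trl_commit": "<git sha>",
  "eval": {
    "heldout_pref_accuracy": 0.78,
    "mmlu_delta": -1.1,
    "ragas_faithfulness": 0.91,
    "redteam_refusal_rate": 1.0
  },
  "status": "candidate"
}
```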
5. Iteration cadence and rollback
Sovereign deployments do not move at startup speed, and they should not. A healthy cadence:
- Weekly. Annotators add 200 to 500 new pairs from the live deployment's flagged outputs. The pipeline is always warm.
- Monthly. A new DPO run on the cumulative dataset. If the eval gate passes, the adapter is promoted from candidate to shadow: it serves 5 percent of traffic in parallel for two weeks.
- Quarterly. Shadow promotion to primary if shadow metrics hold. Old adapter remains in the registry, two clicks away. Where source policy or law shifts (an updated procurement clause, a new regulator note from NCSI or MTCIT), an out-of-cycle run is triggered.
- Rollback. Adapter swap in vLLM is sub-second. Treat preference promotion exactly like a database migration: every change reversible, every change logged. The registry is your audit trail when an internal auditor or external regulator asks why the model said what it said.
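A sketch of the swap itself, using vLLM's dynamic LoRA endpoints. These require the server to run with --enable-lora and the VLLM_ALLOW_RUNTIME_LORA_UPDATING environment variable set, and the request shape has shifted between vLLM versions, so verify against the release you ship; the host and paths below are placeholders:

```python
import requests

BASE = "http://vllm.internal:8000"  # placeholder sovereign endpoint

# Roll forward: register the new adapter under a fresh, versioned name.
requests.post(
    f"{BASE}/v1/load_lora_adapter",
    json={"lora_name": "dpo-2025-06-r3",
          "lora_path": "/models/adapters/dpo-2025-06-r3"},
).raise_for_status()

# Roll back: unload it and point traffic at the previous adapter name.
requests.post(
    f"{BASE}/v1/unload_lora_adapter",
    json={"lora_name": "dpo-2025-06-r3"},
).raise_for_status()
```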
The whole loop, from annotation to eval to rollback, runs inside the sovereign perimeter. Nothing in this playbook requires a public API call, a hyperscaler region, or a non-Omani-resident GPU. That is the point.
Email [email protected] for a one-hour briefing on standing up a DPO loop on your own data, or to walk through a sample TRL run end to end on a Hosn appliance.
Frequently asked
Should we start with DPO or full RLHF?
Start with DPO. It needs only a reference model and a paired preference dataset: no separate reward model, no PPO loop. For 80 percent of sovereign use cases (tone, refusal policy, citation style, Arabic register) DPO converges in hours on a single H100 and is trivially auditable. Move to PPO RLHF only when you need a learned reward model for reasoning quality or multi-turn agentic behaviour.
How many preference pairs do we need?
For a tone or refusal-policy adjustment on a 7B base, 1,500 to 3,000 high-quality pairs typically saturate. For domain reasoning (legal, medical, financial) plan for 8,000 to 20,000 pairs. Annotator agreement above 0.7 Cohen's kappa matters more than raw count; below that, you are training the model on annotator noise.
Can DPO run inside an air-gapped environment?
Yes. The TRL library, base model weights, datasets, and evaluation harnesses are all on-premise artefacts. Hosn ships a sealed training appliance with TRL, bitsandbytes, and Axolotl pre-installed; preference annotation happens in a sovereign UI; nothing leaves the customer perimeter at any stage.
What if a DPO run regresses our refusal policy?
Pin the prior LoRA adapter as the rollback artefact in your model registry. Regression on red-team eval triggers automatic rollback to the previous adapter version (vLLM hot-swap takes seconds). Treat preference-model promotion the same way you treat database schema migrations: every change is reversible and audited.