Fine-Tuning Qwen 3.6 on Classified Documents: RLHF Safety Patterns

Fine-tuning a base model on classified Arabic correspondence is not the same engineering job as fine-tuning a chatbot on product FAQs. The threat model is inverted. The corpus is the secret, the prompts are the attack surface, and a single verbatim leak in a completion can cost a clearance, a contract, or a career. This piece is the safety playbook we apply when an Omani sovereign team brings Qwen 3.6 into a SCIF and asks it to learn their documents.

1. Classified fine-tunes need a different safety posture

Public-cloud fine-tunes optimise for helpfulness and refusal calibration on hypothetical harms. Classified fine-tunes optimise for two harder properties at once: the model must be useful inside the cleared workforce, and it must not exfiltrate the corpus when prompted by anyone, including a cleared user with a careless prompt or a red-team attacker who has stolen valid credentials.

  • No escape via prompts. The deployed model must refuse to repeat raw classification markings, document IDs, or annex serials, even when the user asks politely or claims to have authorisation.
  • No inadvertent revelation in completions. Summarisation, translation, and Q&A must paraphrase, not regurgitate. Verbatim spans longer than a configured threshold should be blocked at decoding time, not only at training time (see the decoding-time sketch after this list).
  • No covert channels. Watermarks, document-ID strings, and unusual Unicode runs in the source corpus should be normalised before training, otherwise the model learns to emit them as a tell.
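
A minimal sketch of the decoding-time block, assuming a Hugging Face-style LogitsProcessor hook and a pre-built set of token n-grams drawn from the training shards. The class name, the 8-token threshold, and the data structures are illustrative, not a fixed interface.

```python
# Minimal sketch: refuse to extend a verbatim training-corpus n-gram at decoding time.
# Assumes a Hugging Face-style LogitsProcessor; the 8-token threshold and the pre-built
# set of corpus n-grams (token IDs) are illustrative choices, not prescriptions.
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

import torch
from transformers import LogitsProcessor


class CorpusNgramBlocker(LogitsProcessor):
    """Mask any next token that would complete a memorised training-corpus n-gram."""

    def __init__(self, corpus_ngrams: Iterable[Tuple[int, ...]], n: int = 8):
        self.n = n
        # Index n-grams by their (n-1)-token prefix so the per-step lookup is O(1).
        self.next_by_prefix: Dict[Tuple[int, ...], Set[int]] = defaultdict(set)
        for ngram in corpus_ngrams:
            self.next_by_prefix[ngram[:-1]].add(ngram[-1])

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        prefix_len = self.n - 1
        if input_ids.shape[1] < prefix_len:
            return scores  # not enough context yet to match an (n-1)-token prefix
        for batch_idx in range(input_ids.shape[0]):
            tail = tuple(input_ids[batch_idx, -prefix_len:].tolist())
            for blocked_token in self.next_by_prefix.get(tail, ()):
                scores[batch_idx, blocked_token] = float("-inf")
        return scores
```

In a Hugging Face serving stack the blocker would be passed to model.generate via logits_processor=LogitsProcessorList([...]); other inference servers need an equivalent per-step hook.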

This shifts the design centre of gravity from the algorithm to the data and to the deployment boundary. The training recipe matters far less than what enters the recipe and where the artefact lives afterwards.

2. The data-prep firewall

Treat preprocessing as the first and most important safety control. By the time a token enters the loss function, declassification rules are already baked in, irreversible, and auditable. Our standard pipeline runs entirely inside the SCIF on the same accredited workstation as the trainer.

  1. Strip headers and markings. Remove classification banners, distribution caveats, file numbers, and routing slips before tokenisation. The model should never see the string forms of the markings, only the substantive content.
  2. Redact named entities at the right layer. Person names, phone numbers, IBANs, plate numbers, and coordinates are passed through a deterministic redactor with a salted hash (a redaction sketch follows this list). Different documents that mention the same officer collapse to the same token, so the model learns relations without learning identities.
  3. Cap rare-string memorisation. Long alphanumeric IDs are the canary for memorisation. We replace them with class tokens (CASE_ID, ANNEX_REF) and keep a sealed mapping table outside the training set.
  4. Deduplicate aggressively. Near-duplicate paragraphs amplify memorisation risk. A MinHash pass at 0.85 Jaccard pulls the corpus to a non-redundant core before training (a dedup-and-manifest sketch follows this list).
  5. Sign the prepared corpus. The output of preprocessing is hashed and signed. The trainer refuses to ingest any shard whose hash is not in the signed manifest.
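
A minimal sketch of steps 1 to 3: stripping markings, salted-hash redaction, and class-token replacement of long serials. The banner and ID regexes, the crude contact matcher, and the class-token names are placeholders; a real pipeline needs Arabic-aware NER and site-specific marking rules, and the salt and sealed mapping table live outside the training shards.

```python
# Illustrative redaction core for steps 1-3. Regexes, entity matcher, and class tokens are
# placeholders; production pipelines use Arabic-capable NER and site-specific marking rules.
import hashlib
import re
from typing import Dict, Tuple

BANNER_RE = re.compile(r"^\s*(TOP SECRET|SECRET|CONFIDENTIAL|RESTRICTED).*$", re.MULTILINE | re.IGNORECASE)
LONG_ID_RE = re.compile(r"\b[A-Z0-9][A-Z0-9/-]{7,}\b")   # long alphanumeric serials
CONTACT_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")            # crude phone/IBAN-like runs

SALT = b"per-campaign-secret-salt"  # held with the sealed mapping table, never in the corpus


def pseudonym(entity: str, kind: str) -> str:
    """Deterministic, salted pseudonym: the same officer always maps to the same token."""
    digest = hashlib.sha256(SALT + entity.encode("utf-8")).hexdigest()[:8]
    return f"[{kind}_{digest}]"


def prepare(text: str) -> Tuple[str, Dict[str, str]]:
    mapping: Dict[str, str] = {}
    text = BANNER_RE.sub("", text)                                        # 1. strip markings
    text = CONTACT_RE.sub(lambda m: pseudonym(m.group(), "CONTACT"), text)  # 2. salted-hash redaction

    def to_class_token(m: re.Match) -> str:
        mapping[m.group()] = "CASE_ID"   # sealed mapping stays outside the training set
        return "CASE_ID"

    text = LONG_ID_RE.sub(to_class_token, text)                           # 3. cap rare-string memorisation
    return text, mapping
```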
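
A companion sketch for steps 4 and 5, using the datasketch library as one possible MinHash implementation and a plain SHA-256 manifest for the ingest check. The signature over the manifest itself is applied by the security officer out of band.

```python
# Steps 4-5 sketch: MinHash dedup at 0.85 Jaccard, then a SHA-256 manifest the trainer
# consults before ingesting a shard. `datasketch` is one library choice among several.
import hashlib
import json
from pathlib import Path
from typing import List

from datasketch import MinHash, MinHashLSH


def minhash_of(paragraph: str) -> MinHash:
    m = MinHash(num_perm=128)
    for token in paragraph.split():
        m.update(token.encode("utf-8"))
    return m


def deduplicate(paragraphs: List[str], threshold: float = 0.85) -> List[str]:
    """Keep one representative per near-duplicate cluster (estimated Jaccard >= threshold)."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, para in enumerate(paragraphs):
        m = minhash_of(para)
        if lsh.query(m):            # a near-duplicate is already indexed; drop this one
            continue
        lsh.insert(f"p{i}", m)
        kept.append(para)
    return kept


def write_manifest(shard_paths: List[Path], manifest_path: Path) -> None:
    manifest = {p.name: hashlib.sha256(p.read_bytes()).hexdigest() for p in shard_paths}
    manifest_path.write_text(json.dumps(manifest, indent=2))


def verify_shard(shard_path: Path, manifest_path: Path) -> bool:
    """Trainer-side check: refuse any shard whose hash is not in the signed manifest."""
    manifest = json.loads(manifest_path.read_text())
    return manifest.get(shard_path.name) == hashlib.sha256(shard_path.read_bytes()).hexdigest()
```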

None of these steps are exotic, but skipping any one of them shows up later as a leakage finding the red team will not let you forget.

3. RLHF preference data: balancing over-refusal against leak risk

After supervised fine-tuning on the firewalled corpus, alignment is the second lever. We design preference data so that the reward signal explicitly trades a small amount of helpfulness for a large reduction in leak surface. The structure follows the spirit of Constitutional AI: human reviewers draft a short list of principles, and pairwise preferences are generated from them.

  • Refuse-to-repeat. Given a prompt that asks for verbatim quotes from a marked document, the preferred completion paraphrases and cites the document by its class token, not by its real reference (an example pair follows this list).
  • Refuse out-of-clearance. When the prompt context indicates the user lacks clearance for a topic, the preferred completion politely declines and points to the correct channel, without revealing the document's existence.
  • Prefer summary over quote. For routine summarisation, completions that paraphrase score above completions that copy a phrase longer than seven words.
  • Penalise over-refusal on cleared work. Helpful answers to clearly in-scope cleared queries are scored above blanket refusals. Without this counterweight, the model collapses to a safety-theatre assistant that nobody uses.
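
An illustrative preference record in the prompt/chosen/rejected layout most DPO tooling expects. The content and field values are placeholders showing the refuse-to-repeat behaviour; real pairs are drafted and reviewed by cleared annotators against the written principles.

```python
# Placeholder preference pair: the paraphrasing completion is preferred over the verbatim one.
preference_pair = {
    "prompt": "Quote the exact wording of the annex that covers the port schedule.",
    "chosen": (
        "I can't reproduce the annex verbatim. In summary, it sets a revised berthing "
        "schedule for the third quarter and assigns oversight to the harbour authority (ANNEX_REF)."
    ),
    "rejected": 'Certainly. The annex states, word for word: "..."',  # verbatim quote loses
}
```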

The training step itself is usually Direct Preference Optimization. DPO removes the separate reward model and the online sampling loop that classical RLHF requires, which shrinks the number of intermediate artefacts the security officer has to classify, sign, and dispose of. The same preference dataset can be replayed under PPO later if a specific behaviour needs RL polishing.
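
The objective itself fits in a few lines of PyTorch, which is part of why the artefact count stays low: preferences are optimised directly against a frozen reference copy of the model, with no reward model and no sampling loop. A minimal sketch, not the production trainer; the log-probability terms are summed over the completion tokens under the tuned policy and the frozen reference, and beta controls how far the policy may drift.

```python
# The DPO objective in a few lines. `policy_*` / `ref_*` are summed log-probabilities of the
# chosen and rejected completions under the tuned policy and a frozen reference model.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximise the margin between the implicit rewards of the chosen and rejected completions.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```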

4. Hardware in an air-gap: H100 in a SCIF, signed adapter outputs

The training rig sits inside an accredited room, never on a multi-tenant host. A two-card H100 node is enough for a 7B QLoRA fine-tune of Qwen 3.6 in roughly a working day. Larger variants need four or eight cards. The room enforces the boundary; the workflow enforces the artefact discipline.

  • Bring code in via signed media. Trainer source, tokeniser, and base weights arrive on write-once media with verified hashes. Network egress is physically absent.
  • One adapter out per run. The only artefact leaving the trainer is the LoRA adapter file, signed by the security officer. Optimiser states, intermediate checkpoints, and gradient logs stay on the rig and are wiped at the end of the campaign.
  • Adapters bind to a base hash. The runtime refuses to load an adapter whose declared base-model hash does not match the deployed base (a load-time check is sketched after this list). This stops a stale or swapped adapter from quietly loading on the wrong weights.
  • Inference stays inside. The fine-tuned stack runs on a separate inference appliance inside the same accreditation boundary. There is no path from the model to the open internet.
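
A minimal sketch of the load-time binding check, assuming the adapter directory carries a small sidecar file (here adapter_manifest.json) with the SHA-256 of the base weights it was trained against. The file and field names are illustrative, not a standard.

```python
# Load-time binding check: refuse an adapter whose declared base hash does not match the
# deployed base weights. Sidecar file name and field name are illustrative assumptions.
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def assert_adapter_matches_base(adapter_dir: Path, base_weights: Path) -> None:
    declared = json.loads((adapter_dir / "adapter_manifest.json").read_text())["base_model_sha256"]
    actual = sha256_of(base_weights)
    if declared != actual:
        # A stale or swapped adapter must never run on the wrong base weights.
        raise RuntimeError(f"adapter declares base {declared[:12]}, deployed base is {actual[:12]}")
```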

5. Eval and red-team posture

Evaluation is dual-track. The first track measures task quality on a held-out, similarly classified test set: triage accuracy, summarisation faithfulness, refusal calibration. The second track is adversarial, run by a small in-house red team plus a periodic external review under a non-disclosure regime.

  • Membership-inference probes. Sample 200 short spans from the training set and 200 paraphrases. The model should not be able to distinguish them above chance on a likelihood test (a probe sketch follows this list).
  • Verbatim extraction probes. Try prompt patterns known from public extraction research, in Arabic and English, to see if any document phrases come back word for word.
  • Clearance-context probes. Replay the same query with three different stated clearance contexts; the model must respect the lowest provided context and never escalate.
  • Drift checks. Re-run the full battery monthly. A new adapter that fails any leak probe is rolled back, not patched in production.
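
A minimal sketch of the likelihood probe: mean per-token negative log-likelihood of training spans versus paraphrases, summarised as an AUC. It assumes a Hugging Face causal LM; an AUC near 0.5 means the model cannot tell members from non-members above chance, while values well above 0.5 are a memorisation finding.

```python
# Membership-inference probe sketch: score training spans vs paraphrases by model likelihood
# and report an AUC. Assumes a Hugging Face causal LM that returns a shifted cross-entropy
# loss when labels are provided.
import torch
from sklearn.metrics import roc_auc_score


@torch.no_grad()
def mean_nll(model, tokenizer, text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    return model(ids, labels=ids).loss.item()


def membership_auc(model, tokenizer, train_spans, paraphrases) -> float:
    # Higher likelihood (lower NLL) should not systematically favour true training members.
    scores = [-mean_nll(model, tokenizer, t) for t in list(train_spans) + list(paraphrases)]
    labels = [1] * len(train_spans) + [0] * len(paraphrases)  # 1 = member of the training set
    return roc_auc_score(labels, scores)
```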

For the broader on-premise LoRA and QLoRA fine-tuning recipes that this safety overlay sits on top of, see the pillar piece. The recipes are the same; the firewall, the preference design, and the room are what make them defensible at the classified level.

Briefing

If your team is preparing a classified fine-tune of Qwen 3.6 or a peer Arabic-capable model and wants a second pair of eyes on the data-prep firewall, the preference design, or the SCIF workflow, email [email protected] for a one-hour briefing. We will walk your data scientists and security officer through the playbook in the room, leave nothing in writing without your sign-off, and answer the awkward questions on the record.

Frequently asked

Can a fine-tuned Qwen 3.6 model leak the classified text it was trained on?

Yes, if the data-prep firewall is weak. Models memorise rare strings, and a well-crafted prompt can elicit verbatim training spans. Mitigations: redact serial numbers and named entities before training, run membership-inference tests on the eval set, and clamp the adapter to a low rank so capacity for memorisation is bounded.

Is DPO safer than classical RLHF for classified workloads?

Operationally yes. DPO removes the separate reward model and the online sampling loop, which shrinks the audit surface and the number of intermediate artefacts that have to be classified. Safety still depends on the preference dataset, not the algorithm. The same dataset can be reused later for PPO if needed.

Does the fine-tuned model itself become classified?

In most sovereign regimes, yes. Weights derived from classified corpora inherit the highest classification of any input. They live inside the SCIF, are signed, and only run on accredited hardware. The base Qwen 3.6 weights stay unclassified and are kept side by side for diffing.

How big is a classified-document preference dataset in practice?

For an Arabic document-triage assistant, 3,000 to 8,000 preference pairs typically suffice when the base model already speaks the domain. Roughly 30 percent should target safety behaviours: refuse to repeat raw markings, refuse out-of-clearance asks, prefer summary over verbatim quote.