Sourcing Training Data for Sovereign Fine-Tuning Without Compromising Privacy
A sovereign fine-tune lives or dies on what enters the trainer, not on which optimiser you pick. The recipes for LoRA and QLoRA on-premise are well documented and largely commoditised, but the corpus that feeds them is where every privacy, classification, and procurement question actually concentrates. This piece is the sourcing playbook we use when an Omani institution wants a domain-adapted Arabic model without sending a single record to a public cloud.
1. The four data sources for a sovereign fine-tune
Almost every defensible fine-tune draws from four kinds of input, mixed in different proportions depending on the task. Treating them as one bucket is the first mistake; each carries a different consent regime, a different classification, and a different leakage profile.
- Institutional corpus. Internal documents the institution already owns: policies, manuals, prior correspondence, structured records. This is the highest-signal source for house style and domain vocabulary. It is also the one that needs the strictest redaction firewall and the clearest legal basis under PDPL.
- Public regulatory and government text. Decrees, ministerial circulars, gazette entries, parliamentary records, NCSI guidance, MTCIT bulletins. Public, citeable, low-risk. Excellent for teaching the model the formal register of Omani public administration.
- Public-domain references. Classical Arabic texts, open educational material, Wikipedia subsets, openly licensed Arabic books, court judgments published by the judiciary. Useful for broad MSA fluency and for offsetting the dialectal skew of social-web data.
- Synthetic data. Instruction-response pairs generated by a stronger model, paraphrases of real seed data, and adversarial examples written by the security team. Cheap to scale, dangerous to over-rely on, useful in carefully measured doses.
The split we typically recommend for a 7B Arabic-domain fine-tune is roughly 40 percent institutional, 25 percent public regulatory, 20 percent public-domain reference, and 15 percent synthetic. The exact mix is a function of how much real institutional text exists and how sensitive it is.
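To make the split concrete, the sketch below turns the mix into per-source token budgets. It is illustrative only; the 200-million-token total is an assumed figure for the example, not a recommendation.

```python
# Illustrative only: convert the recommended source mix into per-source
# token budgets. The 200M-token total is an assumption for this example.
CORPUS_MIX = {
    "institutional": 0.40,
    "public_regulatory": 0.25,
    "public_domain_reference": 0.20,
    "synthetic": 0.15,
}

def token_budget(total_tokens: int, mix: dict[str, float]) -> dict[str, int]:
    """Return how many training tokens to draw from each source."""
    assert abs(sum(mix.values()) - 1.0) < 1e-6, "mix shares must sum to 1"
    return {source: int(total_tokens * share) for source, share in mix.items()}

print(token_budget(200_000_000, CORPUS_MIX))
# {'institutional': 80000000, 'public_regulatory': 50000000,
#  'public_domain_reference': 40000000, 'synthetic': 30000000}
```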
2. Privacy-preserving prep: redact, tag, prove consent
Before any shard reaches the loss function, three things must be true: every personal identifier is removed or hashed, every shard carries a classification tag the trainer can check, and every record has a documented legal basis for inclusion. Skipping any one creates a finding the procurement reviewer will catch later.
- Detect and redact PII at the right layer. Arabic PII is harder than English: morphology, multiple romanisation conventions, and dialectal spellings of the same name. Production pipelines we have shipped chain a transformer-based Arabic NER model (often fine-tuned on top of CAMeL Tools or Farasa) with deterministic regex for IBANs, civil-IDs, plate numbers, and phone strings. Each match collapses to a salted-hash class token, so the model learns relations without learning identities; a minimal sketch of this step follows the list.
- Carry consent provenance with every shard. Every record arrives with a small JSON sidecar: source, original date, legal basis under PDPL Article 5, retention horizon, and any special category flags. The trainer rejects any shard whose sidecar fails the policy check; a sketch of that gate appears at the end of this section.
- Tag classification levels at ingest. Public, internal, restricted, secret. The trainer enforces that any run touching restricted or secret shards executes only on accredited hardware, inside a SCIF or its network-isolated equivalent, and writes its outputs to a signed-only artefact store.
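Here is a minimal sketch of the redaction step. The NER spans are assumed to come from the Arabic NER model mentioned above; the regex patterns are illustrative shapes, not validated formats for Omani identifiers, and the salt handling is deliberately simplified.

```python
import hashlib
import re

# Sketch of the redaction firewall. `ner_spans` would come from a
# transformer-based Arabic NER model; the patterns below are illustrative
# shapes, not validated formats.
SALT = b"per-corpus-secret-salt"  # rotated per corpus, archived with the manifest

DETERMINISTIC_PATTERNS = {
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "PHONE": re.compile(r"\b\+?\d{8,12}\b"),
}

def class_token(label: str, surface: str) -> str:
    """Collapse a PII match to a salted-hash class token: the model can learn
    relations between entities without learning their identities."""
    digest = hashlib.sha256(SALT + surface.encode("utf-8")).hexdigest()[:8]
    return f"<{label}_{digest}>"

def redact(text: str, ner_spans: list[tuple[int, int, str]]) -> str:
    # NER spans are (start, end, label) offsets into `text`; replace from the
    # end so earlier offsets stay valid after each substitution.
    for start, end, label in sorted(ner_spans, reverse=True):
        text = text[:start] + class_token(label, text[start:end]) + text[end:]
    # Deterministic pass for structured identifiers the NER model may miss.
    for label, pattern in DETERMINISTIC_PATTERNS.items():
        text = pattern.sub(lambda m, l=label: class_token(l, m.group()), text)
    return text
```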
The point is not to replace the security officer; it is to make their sign-off mechanical. By the time data reaches the GPU, the human-judgement steps are already done and audited.
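For the sidecar and classification checks, the policy gate can be as simple as the sketch below. The field names, the `.meta.json` sidecar convention, and the list of accepted legal bases are assumptions for illustration, not a fixed schema.

```python
import json
from pathlib import Path

# Sketch of the pre-trainer policy gate. Field names and the .meta.json
# sidecar convention are illustrative, not a standard.
ALLOWED_LEGAL_BASES = {"consent", "legal_obligation", "public_task"}
CLASSIFICATION_ORDER = ["public", "internal", "restricted", "secret"]

def shard_admissible(shard_path: Path, run_clearance: str) -> bool:
    """Reject any shard whose sidecar fails the policy check before it can
    reach the trainer."""
    sidecar_path = shard_path.with_name(shard_path.stem + ".meta.json")
    sidecar = json.loads(sidecar_path.read_text(encoding="utf-8"))
    if sidecar.get("legal_basis") not in ALLOWED_LEGAL_BASES:
        return False
    classification = sidecar.get("classification")
    if classification not in CLASSIFICATION_ORDER:
        return False
    # Restricted and secret shards may only enter runs whose hardware
    # accreditation (run_clearance) is at least as high as the shard's level.
    return CLASSIFICATION_ORDER.index(classification) <= CLASSIFICATION_ORDER.index(run_clearance)
```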
3. Synthetic data trade-offs: when it helps, when it leaks
Synthetic data is tempting because it is cheap and unconstrained. It is also riskier than most teams assume. The 2024 paper Generated Data with Fake Privacy showed that fine-tuning on email data produced by a generator LLM can raise PII-attack success rates by more than 50 percent versus the pre-trained baseline, because the generator surfaces memorised fragments from its own pre-training. Synthetic does not mean private.
- Where it helps. Expanding instruction-style coverage, generating refusal templates, scripting adversarial probes for the eval set, paraphrasing seed examples to dilute over-fit on rare phrasings. These are template-bound tasks where the synthetic content does not carry private information.
- Where it leaks. Generating "synthetic customer records" or "synthetic case files" from a public model. The output looks plausible but inherits the generator's memorisation. Use this only when the generator is a model whose training set you can audit, ideally one trained inside the same accreditation boundary.
- Mitigation. Run a membership-inference probe on the synthetic set against the generator before any of it reaches the trainer. Cap synthetic share to 15 to 20 percent. Diversify generators when possible.
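The probe in the last bullet can start as something very simple: a loss-based screen against the generator itself. The sketch below assumes the generator is available locally; the model path, truncation length, and z-score threshold are placeholders, and a production screen would use a fuller membership-inference suite.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loss-based membership-inference screen (a sketch, not a full MIA suite).
# Intuition: text the generator scores with unusually low loss is more likely
# to echo its pre-training data, so we drop it before it reaches the trainer.
GENERATOR_PATH = "path/to/local/generator"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(GENERATOR_PATH)
model = AutoModelForCausalLM.from_pretrained(GENERATOR_PATH).eval()

@torch.no_grad()
def generator_loss(text: str) -> float:
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    return model(**batch, labels=batch["input_ids"]).loss.item()

def screen(synthetic: list[str], reference: list[str], z_cut: float = -2.0) -> list[str]:
    """Keep synthetic samples whose generator loss is not anomalously low
    relative to a reference set of comparable, known-clean text."""
    ref = [generator_loss(t) for t in reference]
    mean = sum(ref) / len(ref)
    std = (sum((x - mean) ** 2 for x in ref) / len(ref)) ** 0.5 or 1e-9
    return [t for t in synthetic if (generator_loss(t) - mean) / std > z_cut]
```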
4. Provenance metadata that survives the fine-tune: a datasheet for adapters
Once the adapter is trained, the metadata that travelled with the corpus must travel with the weights. We adapt the Datasheets for Datasets template (Gebru et al., 2021) to describe the adapter itself, not just the input data. The adapter datasheet is the single artefact a procurement reviewer asks for and the single artefact the security officer signs.
- Motivation. Task, intended users, deployment boundary, in-scope and out-of-scope queries.
- Composition. Per-source row counts, classification mix, language distribution (MSA vs dialect), date range, redaction-rule version.
- Collection process. How institutional shards were exported, how public sources were crawled, how synthetic data was generated and screened.
- Preprocessing. Tokeniser version, deduplication threshold, redactor model and version, manifest hash.
- Recommended uses and out-of-scope uses. Including a refusal pattern catalogue.
- Maintenance. Re-eval cadence, drift monitor, contact owner inside the institution.
This document is small (six to ten pages) and it pays for itself the first time a different team inherits the adapter or a regulator asks how the model was trained.
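One workable shape for that document, expressed as a typed record so the trainer and the reviewer can validate it mechanically, is sketched below. The field names mirror the six headings above but are an assumed schema, not a published standard.

```python
from dataclasses import dataclass

# Illustrative adapter-datasheet schema; field names are assumptions that
# mirror the six headings above, not a standard.
@dataclass
class AdapterDatasheet:
    # Motivation
    task: str
    intended_users: list[str]
    deployment_boundary: str
    out_of_scope_queries: list[str]
    # Composition
    rows_per_source: dict[str, int]
    classification_mix: dict[str, float]
    language_distribution: dict[str, float]   # e.g. {"msa": 0.8, "dialect": 0.2}
    date_range: tuple[str, str]
    redaction_rule_version: str
    # Collection and preprocessing
    collection_notes: str
    tokenizer_version: str
    dedup_threshold: float
    corpus_manifest_sha256: str
    # Binding, evaluation, maintenance
    base_model_sha256: str
    eval_report_path: str
    red_team_report_path: str
    reeval_cadence_days: int
    owner_contact: str
```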
5. Audit trail for procurement-grade buyers
Sovereign procurement does not buy claims; it buys evidence. The audit trail for a fine-tune has five components, all of them produced as a side effect of the pipeline above and signed at the end of each run.
- Signed corpus manifest. SHA-256 of every shard, paired with its consent sidecar and classification tag. The trainer refuses to ingest anything not in the manifest.
- Redactor configuration and rule version. Hashed, signed, archived alongside the manifest.
- Training-run log. Hyperparameters, base-model hash, GPU node IDs, start and end timestamps, operator identity.
- Eval and red-team report. Task accuracy on a held-out test set plus membership-inference and verbatim-extraction probes per the pillar piece.
- Adapter datasheet. The cover document that ties the four artefacts above into one signable file.
A reviewer who knows what to ask for can verify all five in an afternoon. Vendors who cannot produce them in writing should be treated as unverified, regardless of what their slides claim.
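The first artefact, the signed corpus manifest, is mechanical to produce. A minimal sketch follows, assuming shards live as .jsonl files with .meta.json sidecars; the layout and field names are assumptions, and the signing step itself happens out-of-band with the institution's offline key.

```python
import hashlib
import json
from pathlib import Path

# Sketch of manifest generation. File layout and field names are assumptions;
# signing the resulting file happens out-of-band with an offline key.
def sha256_file(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(corpus_dir: Path) -> dict:
    entries = []
    for shard in sorted(corpus_dir.glob("*.jsonl")):
        sidecar = shard.with_name(shard.stem + ".meta.json")
        entries.append({
            "shard": shard.name,
            "sha256": sha256_file(shard),
            "sidecar_sha256": sha256_file(sidecar),
            "classification": json.loads(sidecar.read_text(encoding="utf-8"))["classification"],
        })
    return {"manifest_version": 1, "shards": entries}

manifest = build_manifest(Path("corpus"))
Path("corpus_manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
# The trainer refuses to ingest any shard whose hash is absent from this file.
```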
Briefing
If your team is scoping the corpus for an Arabic sovereign fine-tune and wants a second pair of eyes on the source mix, the redaction firewall, or the adapter datasheet template, email [email protected] for a one-hour briefing. We will walk your data scientists, security officer, and legal counsel through the playbook in your room and leave nothing in writing without your sign-off.
Frequently asked
Can a sovereign team really get to a useful fine-tune without using customer data?
Yes, for many tasks. Public regulatory text, internal staff-authored notes, and a controlled volume of paired synthetic instructions are often enough to teach domain vocabulary and house style. Customer records become necessary only when the task is identity-bound, for example name reconciliation or KYC. In that case, redact aggressively, run the trainer in an air-gapped room, and treat the resulting weights as classified.
Is synthetic data safe to use for an Arabic fine-tune?
Useful but not automatically safe. Recent work shows that fine-tuning on data generated by another LLM can amplify the original model's PII memorisation rather than dilute it. Use synthetic data to expand instruction styles and edge cases, not to substitute for redacted real data, and screen the synthetic set with a membership-inference probe before training.
What does an adapter datasheet need to contain?
Provenance for every shard that fed the trainer, the redaction rules applied, the consent or legal basis for each source, the classification level of inputs and outputs, the base-model hash the adapter binds to, the training recipe, and the eval and red-team results. The shape follows the Datasheets for Datasets template, adapted for adapter weights rather than raw datasets.
How do procurement reviewers verify these claims?
By asking for the signed manifest, the redactor configuration, the eval logs, and a sample membership-inference report. A reviewer who knows what to ask for can verify in an afternoon whether the corpus pipeline matches the claims in the bid response. Vendors who cannot produce these artefacts in writing should be treated as unverified.