Building an Arabic Instruction-Tuning Dataset for Sovereign Use

An Omani sovereign buyer who wants an LLM to draft, summarise, and answer in Omani-formal Arabic cannot start from a translated American dataset. The off-the-shelf instruction sets that fuelled the open-source LLM revolution were written, generated, or annotated through an English and Levantine-media filter. They will not produce ministerial correspondence, a regulator's circular, or a Court of Cassation summary in the register an Omani institution actually uses. The cure is a deliberately built, hand-reviewed Omani-MSA instruction-tuning dataset of five to ten thousand examples. This article is the build recipe; it complements our pillar on Qwen 3.6 Arabic NLP benchmarks.

Why off-the-shelf instruction datasets fail Arabic sovereign work

The de facto baseline, Stanford's Alpaca dataset, comprises fifty-two thousand instruction-output pairs generated by OpenAI's text-davinci-003 from one hundred and seventy-five English seeds, at a generation cost under five hundred US dollars. Three structural problems block its direct use for an Omani sovereign deployment:

  • English bias by construction. Seeds, generation model, and review were all English. Cultural references default to American holidays, US legal idioms, and Western names. Translated into Arabic, those references survive and clash with local norms.
  • Levantine register leakage. Most existing large Arabic instruction sets were either machine-translated from English or harvested from social-media corpora dominated by Levantine and Egyptian dialects. The result reads as media Arabic, not the Omani-formal MSA used in ministerial correspondence.
  • Licence and provenance opacity. Datasets generated from frontier APIs inherit the API's downstream-use restrictions. A sovereign buyer cannot defend an adapter trained on an opaque licence chain to a regulator or auditor.

This is the gap that purpose-built Arabic datasets such as CIDAR by ARBML were designed to close. CIDAR documented that an adapter tuned on a ten-thousand-example, human-reviewed, culturally aligned Arabic set achieved better cultural alignment than adapters trained on thirty times more machine-translated data. That finding is the empirical floor of every sizing conversation we have with sovereign buyers in Oman.

Building a 5,000 to 10,000 example Omani-MSA dataset

A defensible Omani-MSA instruction set draws from five sources, each contributing a distinct slice of register and task surface.

  1. Anonymised institutional correspondence. Internal letters, memoranda, and circulars from the buyer institution, anonymised and stripped of named entities. This anchors the institution's house tone.
  2. Public regulator and ministry text. Official Gazette excerpts, ministerial decisions from the Qanoon legislative portal, and circulars from the Central Bank of Oman, the Capital Market Authority, and the Tax Authority. The interpretive backbone of formal MSA in the country.
  3. Curated open Arabic instruction sets. Selectively imported examples from CIDAR and the Aya Collection, filtered by an Omani reviewer for register and cultural fit. After filtering, the usable non-Levantine, non-Egyptian residue typically runs to thirty to fifty per cent of the source material.
  4. Synthetic Q&A pairs. Generated from in-house source documents (procedures, FAQs, employee handbooks) by an open-weight model running inside the perimeter, then human-edited. The cheap mass that fills out the long tail.
  5. Adversarial seeds. Five hundred to a thousand examples deliberately written to expose bad behaviours: refusal in the wrong register, fabricated citations, English contamination, dialect leak. The hard floor of the safety surface.

The example schema we recommend is JSONL with a fixed field set: id, instruction, input (optional context), output, register (one of omani_formal, msa_neutral, technical), politeness_tier, source_type, citation_required, annotator_id, and review_status. The schema is the contract that lets a downstream trainer rebalance the mix without rebuilding the dataset.
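For concreteness, here is one hypothetical record, pretty-printed for readability (each record occupies a single line in the actual JSONL file). The field values, including the politeness_tier and source_type vocabularies, are illustrative assumptions rather than a fixed enumeration; the instruction asks for a formal letter to the under-secretary about extending a bid-submission deadline:

```json
{
  "id": "om-00421",
  "instruction": "صِغ خطاباً رسمياً إلى سعادة وكيل الوزارة بشأن تمديد مهلة تقديم العروض.",
  "input": "",
  "output": "سعادة وكيل الوزارة المحترم، تحية طيبة وبعد، ...",
  "register": "omani_formal",
  "politeness_tier": "undersecretary",
  "source_type": "synthetic",
  "citation_required": false,
  "annotator_id": "ann-03",
  "review_status": "approved"
}
```

Because source_type and register travel with every record, a downstream trainer can filter or re-weight slices with a one-line query instead of re-sourcing anything.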

Annotation guidelines for register and politeness

Two axes do most of the work. The register axis distinguishes Omani-formal MSA (the default for ministerial correspondence, formal letters, and regulator-facing documents) from MSA-neutral (fit for general assistant tasks) and from technical (acceptable for engineering or financial workflows where Latin acronyms appear). The politeness tier distinguishes the formal opening and closing protocols expected when addressing a minister, an under-secretary, a counterparty institution, or a citizen.

Concretely, the guideline document fixes the salutation library, the closing library, the use of honorifics (سعادة for under-secretaries, معالي for ministers, الفاضل/الفاضلة as the respectful Mr/Ms), and the rules around mixed Arabic-English text. It bans Levantine and Egyptian colloquial markers (the bnayn, bnaat, tafshet, sho family of words). It mandates that Latin runs be wrapped or set off in a way the trainer can preserve. Each annotator works against the same calibration set of one hundred examples before touching the production batch.
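A minimal sketch of the lexical side of that guideline, with two loud assumptions: the banned-marker list below is a tiny illustrative slice transliterated into Arabic script (the production library is larger and panel-maintained), and the wrapper for Latin runs is taken to be ⟦…⟧ delimiters, which is hypothetical:

```python
import re

# Assumption: illustrative entries only. The real ban list covers the full
# "sho" family of Levantine/Egyptian markers and is owned by the panel.
BANNED_MARKERS = ["شو", "بدي", "ليش"]

LATIN_RUN = re.compile(r"[A-Za-z][A-Za-z0-9.\-]*")
WRAPPED = re.compile(r"⟦[^⟧]*⟧")  # hypothetical guideline-approved wrapper

def lint_field(text: str) -> list[str]:
    """Return guideline violations found in one instruction or output field."""
    issues = [f"banned colloquial marker: {m}" for m in BANNED_MARKERS if m in text]
    unwrapped = WRAPPED.sub(" ", text)  # drop correctly wrapped Latin runs first
    issues += [f"unwrapped Latin run: {r}" for r in LATIN_RUN.findall(unwrapped)]
    return issues
```

Any non-empty return blocks the example from reaching review_status approved.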

Validation methodology before training

Three layers of validation run before a single token of the dataset hits a fine-tune.

  • Schema and lexical validation. JSONL parses, every required field present, no Levantine markers, no English-only outputs, no PII patterns surviving anonymisation. Automated, runs in seconds (sketched after this list).
  • Inter-annotator agreement. A 10% double-annotated stratum, measured by Cohen's kappa per axis (see the kappa sketch below). Below 0.7 we re-train the panel on the offending guideline section. Above 0.85 we are over-fitting the panel and need to broaden it.
  • Pilot adapter. A throwaway LoRA adapter on the open base model (Gemma 4 26B-A4B or Qwen 3.6) trained for one epoch on a 1,000-example subset (a minimal configuration follows). We measure register accuracy and politeness compliance on a fifty-prompt held-out set. If the pilot adapter regresses on either axis, the data is wrong before we commit the full training budget.
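The first layer is mechanical enough to sketch in full. A minimal validator over the JSONL schema above; the Arabic-script test is a crude stand-in for the "no English-only outputs" rule, and the field set mirrors the schema from the sourcing section:

```python
import json
import sys

REQUIRED = {"id", "instruction", "output", "register", "politeness_tier",
            "source_type", "citation_required", "annotator_id", "review_status"}
REGISTERS = {"omani_formal", "msa_neutral", "technical"}

def validate(path: str) -> int:
    """Count schema violations in one JSONL file, printing each as it is found."""
    errors = 0
    with open(path, encoding="utf-8") as handle:
        for n, line in enumerate(handle, start=1):
            try:
                rec = json.loads(line)
            except json.JSONDecodeError as exc:
                print(f"line {n}: not valid JSON ({exc})")
                errors += 1
                continue
            missing = REQUIRED - rec.keys()  # "input" is optional, so not required
            if missing:
                print(f"line {n}: missing fields {sorted(missing)}")
                errors += 1
            if rec.get("register") not in REGISTERS:
                print(f"line {n}: unknown register {rec.get('register')!r}")
                errors += 1
            if not any("\u0600" <= ch <= "\u06ff" for ch in rec.get("output", "")):
                print(f"line {n}: output contains no Arabic script")
                errors += 1
    return errors

if __name__ == "__main__":
    sys.exit(1 if validate(sys.argv[1]) else 0)
```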
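The second layer is one library call per axis. A sketch with scikit-learn, using hypothetical labels from the double-annotated stratum on the register axis; repeat per axis and per annotator pair:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical register labels from the 10% double-annotated stratum,
# one entry per example, one list per annotator.
annotator_a = ["omani_formal", "msa_neutral", "omani_formal", "technical", "omani_formal"]
annotator_b = ["omani_formal", "omani_formal", "omani_formal", "technical", "omani_formal"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
if kappa < 0.7:
    print(f"kappa={kappa:.2f}: re-train the panel on the offending guideline section")
elif kappa > 0.85:
    print(f"kappa={kappa:.2f}: panel is over-calibrated; broaden it")
else:
    print(f"kappa={kappa:.2f}: within the accepted band")
```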
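The third layer, the throwaway pilot, can be this small. A sketch using the peft and trl libraries; the paths, the crude prompt template, and every hyperparameter are placeholders rather than recommendations, and the exact trainer API varies across trl versions:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# First 1,000 validated examples; data/pilot.jsonl is a placeholder path.
dataset = load_dataset("json", data_files="data/pilot.jsonl", split="train")
dataset = dataset.select(range(1000))

# Crude single-field template so the trainer sees one text column.
dataset = dataset.map(
    lambda rec: {"text": f"{rec['instruction']}\n{rec.get('input', '')}\n{rec['output']}"}
)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model="models/base",  # local copy of the open base model, inside the perimeter
    train_dataset=dataset,
    peft_config=lora,
    args=SFTConfig(output_dir="out/pilot", num_train_epochs=1),
)
trainer.train()
```

The resulting adapter is then scored on the fifty-prompt held-out set for register accuracy and politeness compliance before any full-budget run.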

Open vs proprietary release decision

The release question is not binary. The defensible split for an Omani sovereign buyer is to publish the generic Omani-MSA register slice on a permissive licence, contributing back to the wider Arabic NLP commons (the same posture the Aya programme rewards), while keeping the institution-specific slice fully proprietary inside the perimeter. The published slice raises the floor for every Omani LLM effort that draws on it afterwards. The proprietary slice carries the operational signal that genuinely differentiates the buyer's adapter and never leaves the appliance.

If your institution is sizing an Omani-MSA instruction-tuning dataset for a sovereign LLM programme and would like a one-hour briefing on the schema, sourcing plan, annotation panel, and validation harness, the next step is simple. Email [email protected] or message +968 9889 9100. We will come to you, walk through the recipe, and leave a credible plan against your timeline. Pricing is by quotation, sized to your specific requirement.

Frequently asked

How many examples does a sovereign Arabic instruction-tuning dataset really need?

Five thousand to ten thousand high-quality, hand-reviewed examples are enough to teach an open-weight base model the Omani-MSA register and the institution's task surface. Below five thousand the model learns vocabulary but not behaviour. Above ten thousand, returns diminish quickly for a single jurisdiction. The CIDAR paper from ARBML showed that a 10,000-example, culturally aligned Arabic set outperformed adapters trained on thirty times more machine-translated data, which is the empirical anchor we cite when sizing buyer programmes.

Why not just translate the Stanford Alpaca dataset into Arabic?

Three reasons. First, Alpaca was generated by a US frontier model and embeds American cultural assumptions, holidays, names, and legal references that are wrong for an Omani context. Second, machine translation flattens register; the result reads as Levantine media Arabic rather than Omani-formal MSA. Third, Alpaca was generated with text-davinci-003 under OpenAI terms that restrict downstream use. A purpose-built Omani dataset avoids all three problems.

Should we open-source the dataset after we build it?

Decide per slice, not per dataset. The generic Omani-MSA register slice can usually be released openly on a CC BY 4.0 or similar licence, contributing to the wider Arabic NLP commons. The institution-specific slice (internal procedures, classified language patterns, named-entity sets) stays proprietary and never leaves the perimeter. This split-release model lets a sovereign buyer be a good citizen of the Arabic open-data movement without leaking operational detail.

Who annotates the dataset, and how do we control quality?

A core panel of four to eight Omani annotators with mixed legal, linguistic, and domain backgrounds, working from a written guideline document and a calibration set of one hundred fully-worked examples. A 10% stratum is double-annotated, with disagreements escalated to a senior reviewer. We track inter-annotator agreement (Cohen's kappa target above 0.7), per-annotator drift, and a held-out 5% audit set scored weekly. The platform we use mirrors the participatory annotation model that the Aya project ran across sixty-five languages.