Comparing Open-Source Models on Omani Arabic Dialect
Most open-weight models advertise "Arabic support" without ever having seen an Omani sentence. For institutions in Muscat the question is narrower and harder: which model writes a polite letter from a citizen, summarizes a Khaleeji procurement chat, and refuses to invent a place called Salalah Heights. We ran a small but disciplined spot-check against four open families that ship in our reference stack, and the rankings are not what the leaderboards suggest.
1. Why Omani Arabic is its own evaluation problem
Public Arabic benchmarks lean Egyptian, Levantine, and Maghrebi by sheer corpus volume. Omani-formal MSA, the register an under-secretary actually signs, sits closer to Najdi and Khaleeji norms than to Cairo or Beirut. A model that scores 78 on ArabicMMLU can still produce الحين instead of الآن, or politely refer to a male زبون the way a Cairo helpdesk would, not the way an Omani bank officer would.
Three pitfalls show up repeatedly when teams blindly trust generic Arabic scores (the first can be screened mechanically, as sketched after this list):
- Lexical bleed. Words like طب، خلاص، يلا appear in MSA outputs and break the formal register.
- Place-name drift. Models hallucinate الرياض or القاهرة as the default Arab capital, even when the user wrote about Muscat or Sohar.
- Honorific mismatch. Outputs default to Egyptian forms of address rather than the سعادة، معالي stack used in Omani correspondence.
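The lexical-bleed pitfall is the easiest to catch before a human ever reads the output. A minimal sketch of the kind of pre-screen we run, assuming a hand-maintained marker list; the words below are examples from our buckets, not an exhaustive lexicon:

```python
import re

# Hypothetical marker list: colloquialisms that should never appear
# in Omani-formal MSA output. Extend with your reviewer's own list.
DIALECT_MARKERS = ["الحين", "طب", "خلاص", "يلا"]

def lexical_bleed(text: str) -> list[str]:
    """Return the dialect markers found in a supposedly MSA output."""
    hits = []
    for marker in DIALECT_MARKERS:
        # \b is unreliable on Arabic script, so match on whitespace
        # and punctuation boundaries (including Arabic comma/qmark).
        if re.search(rf"(?:^|[\s,،.؟!]){re.escape(marker)}(?:$|[\s,،.؟!])", text):
            hits.append(marker)
    return hits

# Any hit downgrades the register score before human review.
assert lexical_bleed("الرجاء التفضل بالحضور الآن") == []
assert lexical_bleed("تعال الحين لو سمحت") == ["الحين"]
```

A hit does not fail the output outright; it flags the sentence for the native reviewer, who decides whether the colloquialism was quoted from the user or leaked from the model.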
That is why a deployment for an Omani regulator cannot be decided on the strength of a single English leaderboard. The honest evaluation runs on prompts that look like the actual workload, ideally drafted by a native Omani reviewer who can spot the difference between تفضّلوا and تفضّل at a glance, and who knows when a sentence sounds Cairene-by-mistake even though the grammar is correct.
A second subtle issue is the gap between spoken Omani Arabic, used by citizens calling a helpline, and written Omani-formal MSA, used by the same citizens when they email a ministry. A model that handles one register does not automatically handle the other. The four families we evaluated each have different blind spots across this spoken-to-written axis, and any procurement decision that ignores it will surface as user complaints in the first month of the pilot.
2. Spot-check methodology with 30 to 50 representative prompts
Our internal harness uses 40 prompts grouped into five buckets. The numbers are small on purpose: a sovereign team can review every output by hand in one afternoon, which beats a thousand-prompt run nobody trusts.
- Citizen-letter MSA (10 prompts). Draft a complaint, a request, a thank-you to a ministry. Graded on register, honorifics, and absence of dialect leakage.
- Khaleeji business chat (8 prompts). Summarize a WhatsApp procurement thread between an Omani buyer and a UAE supplier, including currency switches.
- Code-switched IT support (8 prompts). Mixed Arabic-English tickets with product names like Active Directory and VPN, graded on whether Latin product names survive untouched.
- Dictation cleanup (6 prompts). Noisy ASR output from a meeting in Muscat, restored to clean MSA without changing the speaker's intent.
- Entity extraction (8 prompts). Pull names, places, and dates from Omani news copy and resolve الباطنة, ظفار, مسندم correctly.
Each output is scored 0 to 3 on faithfulness, register, and dialect appropriateness, then averaged. The harness is modelled on the public eval methodology described in our overview of ArabicMMLU, AlGhafa, and ALUE, but trims their breadth in favour of Omani specificity. We cross-reference dialect signal with the broader dialectal Arabic literature, including the Dolphin Arabic NLG benchmark, which was among the first to publish per-dialect splits beyond the usual four.
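The rubric is simple enough to encode directly. A minimal sketch of the scoring shape, with hypothetical bucket and axis names that mirror the list above:

```python
from dataclasses import dataclass
from statistics import mean

# The five buckets and their prompt counts from the harness above.
BUCKETS = {
    "citizen_letter_msa": 10,
    "khaleeji_business_chat": 8,
    "code_switched_it_support": 8,
    "dictation_cleanup": 6,
    "entity_extraction": 8,
}

@dataclass
class Grade:
    """One reviewer's scores for one output, each axis 0..3."""
    faithfulness: int
    register: int
    dialect: int

    def score(self) -> float:
        return mean([self.faithfulness, self.register, self.dialect])

def bucket_score(grades: list[Grade]) -> float:
    """Average the per-output scores across one bucket."""
    return mean(g.score() for g in grades)

# Example: a letter that is faithful and formal but leaks one
# colloquialism might grade as Grade(3, 2, 1) -> 2.0.
```

Keeping the rubric this small is deliberate: three people in a room can argue a 0-3 scale to agreement in minutes, which is what makes the quarterly re-run cheap enough to actually happen.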
3. Where the four families land on the Omani-MSA spectrum
For a fuller treatment of the underlying Falcon Arabic LLM story, see the pillar piece on the Technology Innovation Institute model. Here we focus on relative behaviour against Omani prompts.
- Falcon Arabic. Cleanest Khaleeji register out of the box. Strongest on citizen-letter and entity-extraction buckets. Occasionally over-translates Latin product names. Documented training on a curated Arabic corpus by TII.
- Qwen 3.6. Closest non-Arabic-first competitor. Excellent on code-switching, strong on dictation cleanup. Needs a one-line prompt nudge to lock register to Omani-formal MSA (an example follows this list). See the model card maintained by Alibaba on Hugging Face.
- Llama 4. Solid generalist. Best raw English fluency in code-switched tickets, but slips into Cairo phrasing on letters without explicit guidance. Strong tool-use behaviour for agentic workflows.
- Gemma 4. Most disciplined on long-context Arabic summarization. Lowest hallucination rate on Omani place names in our set. Slightly stiffer prose, which actually fits ministerial register.
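The "one-line prompt nudge" for Qwen 3.6 mentioned above is nothing exotic. A sketch of the kind of system message that locks the register; the exact wording is ours, not from any model card:

```python
# Prepended as the system message on every Qwen 3.6 call that emits
# outbound Arabic. Roughly: "Write in the formal MSA used in Omani
# correspondence, use Gulf honorifics such as سعادة and معالي, and
# avoid Egyptian and Levantine colloquialisms."
OMANI_MSA_NUDGE = (
    "اكتب بالفصحى الرسمية المستخدمة في المراسلات العمانية، "
    "واستخدم ألقاب المخاطبة الخليجية مثل سعادة ومعالي، "
    "وتجنّب العامية المصرية والشامية."
)
```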
4. Recommendation per workload
No single model wins every bucket. We deploy in pairs and route by task; a minimal routing sketch follows the list:
- Outbound ministerial correspondence: Falcon Arabic primary, Gemma 4 fallback. Both produce the formal register without slang leakage.
- Bilingual helpdesk and IT triage: Qwen 3.6 primary, Llama 4 fallback. Their handling of Latin product names, ticket numbers, and code blocks is more reliable than the Arabic-first families.
- Long Arabic document analysis (audit reports, board minutes): Gemma 4. Its 256k context window plus low place-name hallucination is the right fit.
- Customer-facing chat in Omani Arabic: Falcon Arabic with a small QLoRA adapter trained on the institution's own back-catalogue. The prompt-only baseline is already usable, but adapter training pushes the register from 8 out of 10 to 9.5.
- Search and retrieval over mixed corpora: Qwen 3.6 for the embedding side, Gemma 4 for the answer-synthesis side. The Arabic-English tokenizer in Qwen 3.6 keeps query recall high even when users type a Latin acronym followed by an Arabic clarification.
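Routing is deliberately dumb: a static table from workload to a primary and fallback model, re-reviewed every quarter. A minimal sketch, with hypothetical model identifiers standing in for whatever your serving layer exposes:

```python
from typing import NamedTuple

class Route(NamedTuple):
    primary: str
    fallback: str

# Identifiers are placeholders; the pairs mirror the list above.
# Search and retrieval is split by stage (embedding vs. synthesis)
# rather than by health, so it lives outside this table.
ROUTES = {
    "ministerial_correspondence": Route("falcon-arabic", "gemma-4"),
    "bilingual_helpdesk": Route("qwen-3.6", "llama-4"),
    "long_document_analysis": Route("gemma-4", "gemma-4"),
    "omani_citizen_chat": Route("falcon-arabic-qlora", "falcon-arabic"),
}

def pick_model(workload: str, primary_healthy: bool = True) -> str:
    """Return the model to call, falling back if the primary is down."""
    route = ROUTES[workload]
    return route.primary if primary_healthy else route.fallback
```

A static table beats a learned router here: the procurement committee can read it, and a model swap is a one-line diff with an audit trail.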
The numbers behind these picks shift quarter to quarter. New checkpoints land, fine-tunes appear, dialect splits get repaired upstream. The constant is the methodology: a small prompt set the institution actually owns, a grading rubric three people in the room agree on, and the discipline to repeat the run after every model upgrade. That is the only honest answer to the question of which open weight serves Omani Arabic best.
If your team wants to run this 40-prompt spot-check on your own corpora before procurement, we share the harness and grading rubric under NDA. Email [email protected] for a one-hour briefing and a sample report.
Frequently asked
Which open-source model handles Omani Arabic dialect best out of the box?
Falcon Arabic remains the strongest baseline for native Gulf and Omani phrasing because the Technology Innovation Institute trained it on a curated Arabic corpus that includes Khaleeji material. Qwen 3.6 is the closest general-purpose challenger, especially when you accept light prompt scaffolding asking it to favour Omani-formal MSA over Egyptian or Levantine defaults.
How many prompts do I need to spot-check a model on Omani Arabic?
30 to 50 representative prompts, balanced across MSA citizen letters, Khaleeji business chat, code-switched IT support, dictation cleanup, and entity extraction, give a confident triage signal in a day. Larger benchmarks like ArabicMMLU or AlGhafa are useful next, but they answer different questions about academic coverage rather than dialect fidelity.
Does code-switching between Arabic and English break these models?
It exposes the weaker tokenizers first. Llama 4 and Qwen 3.6 handle Arabic-English code-switching cleanly in our prompt set. Falcon Arabic occasionally over-translates Latin product names into Arabic transliteration, which matters in banking and procurement transcripts. Gemma 4 is reliable on English fragments but sometimes ignores them in summarization.
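The grading criterion behind that finding is mechanical and easy to replicate: every Latin-script term in the ticket must appear verbatim in the answer. A minimal sketch, with a deliberately crude extractor you would replace with your own product-name list:

```python
import re

def latin_names_survive(ticket: str, answer: str) -> list[str]:
    """Return Latin-script terms from the ticket that the model lost
    or transliterated away in its answer."""
    # Crude extraction: runs of Latin words and acronyms, e.g.
    # "Active Directory", "VPN", "DNS-01".
    names = set(re.findall(r"[A-Za-z][A-Za-z0-9 .\-]*[A-Za-z0-9]", ticket))
    return [n for n in names if n not in answer]
```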
Should we fine-tune on an Omani corpus or rely on prompting?
Start with prompting and few-shot examples until you can quantify the residual error, then graduate to a small QLoRA adapter on 2,000 to 10,000 in-domain pairs. For most Omani institutions a prompt library plus retrieval over their own document base closes 80 percent of the dialect gap without the operational cost of a custom fine-tune.
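For teams that do graduate to the adapter, the training side is standard QLoRA. A minimal configuration sketch using the Hugging Face transformers and peft APIs; the model id and target_modules below are placeholders to verify against the real checkpoint (classic Falcon blocks, for instance, expose query_key_value rather than separate q/k/v/o projections):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

BASE = "tiiuae/falcon-arabic"  # placeholder id: check the real hub name

# 4-bit NF4 quantization of the frozen base: the "Q" in QLoRA.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb)

# Small adapter: ranks in the 8-16 range are typical for a few
# thousand in-domain pairs; target_modules depend on the architecture.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```

The adapter trains on the institution's own back-catalogue of letters and tickets, which is exactly the corpus the prompt-only baseline never sees.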