Arabic NLP Evaluation Suites: AraBench, ALUE, and Arabic MMLU

If a sovereign buyer in Muscat asks "is this model any good at Arabic?", the only defensible answer cites at least three independent eval suites and one private register test. Marketing leaderboards and viral chat screenshots are not procurement evidence. This guide explains the three Arabic NLP eval suites that matter most in 2026 (AraBench, ALUE, and Arabic MMLU), what each one actually measures, and how to combine them into a buyer evaluation that holds up in front of a procurement committee. For how the broader benchmark landscape plays out on a specific model, read our deep dive on Qwen 3.6 Arabic NLP benchmarks.

Why eval suite choice matters more for Arabic than English

English NLP has the luxury of a single dominant register, abundant clean training data, and decades of public benchmarks that capture most production use cases. Arabic has none of that. A model that scores well on one Arabic benchmark can fail badly on another because the language itself fragments along several dimensions that English does not.

  • Register continuum. Modern Standard Arabic (MSA) is the language of regulators, court filings, and ministerial letters. The dialect spectrum (Gulf, Levantine, Egyptian, Maghrebi, Iraqi) is the language of customer chat, social media, and most spoken content. A model can be excellent at one and weak at the other.
  • Morphology and tokenisation. A single Arabic root produces dozens of inflected forms. Tokenisers built mostly on English data fragment Arabic words inefficiently, which changes both inference cost and downstream task scores (see the tokeniser sketch after this list).
  • Diacritics and orthographic variation. Optional diacritics change meaning. Hamza placement varies. Same spelling, different word.
  • Code-switching with English. Omani institutional text routinely embeds English acronyms (PDPL, NCSI, MTCIT) inside Arabic sentences. Some models break direction or lose context at the switch.
  • OCR noise. Real institutional archives are scanned letters and faxed memoranda. The eval set should reflect that, not just clean Wikipedia prose.
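
To make the tokenisation point concrete, the minimal sketch below counts how many pieces two off-the-shelf tokenisers produce for a single inflected Arabic word. It assumes the Hugging Face transformers library is installed; the checkpoint names are illustrative examples, not recommendations.

    # Minimal sketch: how differently two tokenisers fragment one inflected Arabic word.
    # Assumes the transformers library; checkpoint names are illustrative, not endorsements.
    from transformers import AutoTokenizer

    word = "فسيكفيكهم"  # a single inflected form: "so He will suffice you against them"

    for name in ["gpt2", "aubmindlab/bert-base-arabertv2"]:
        tok = AutoTokenizer.from_pretrained(name)
        pieces = tok.tokenize(word)
        print(f"{name}: {len(pieces)} tokens -> {pieces}")

An English-centric byte-level tokeniser typically shatters the word into many fragments, while an Arabic-trained wordpiece vocabulary keeps it to a handful, which is exactly the cost and quality gap the bullet describes.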

No single suite covers all five. That is why the procurement question is never "what's the model's score" but "what's the score on which suites, and which dimensions did each suite stress".

AraBench in detail

AraBench is the Qatar Computing Research Institute (QCRI) evaluation suite for dialectal Arabic to English machine translation, published at COLING 2020 by Sajjad, Abdelali, Durrani, and Dalvi. It is the suite to reach for when the buyer wants to know whether a model can translate real Arabic text, not idealised MSA, into clean English for downstream review or analyst summarisation.

The structure is what makes AraBench useful. The suite combines existing dialectal Arabic to English resources with new test sets built specifically for the benchmark, then organises everything into 4 coarse dialect groups, 15 fine-grained dialects, and 25 city-level dialect categories, across genres including media, chat, religion, and travel. A model that scores well on AraBench Cairo is not automatically good at AraBench Muscat or AraBench Doha, and the suite forces that distinction to be visible.

For an Omani sovereign buyer the AraBench numbers that matter most are the Gulf and city-level Muscat or Doha scores, plus the chat genre (which is closest to real customer correspondence) and the media genre (which is closest to news and policy text). A model that posts a high single number averaged across all 25 city categories can hide poor performance on the few that are operationally relevant. Always demand the per-dialect breakdown. The QCRI AraBench resource page hosts the dataset and reference scripts; the QCRI team's later LAraBench work extends the methodology to large language models.
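
When the candidate model's outputs are delivered as per-subset hypothesis and reference files, reproducing that breakdown is a short script. The sketch below uses the sacrebleu library to report BLEU and chrF per dialect subset rather than one global average; the subset names and file layout are illustrative assumptions, not the AraBench release format.

    # Minimal sketch: per-dialect BLEU and chrF instead of a single global average.
    # Assumes sacrebleu is installed and that each dialect subset has been decoded into
    # a hypothesis file alongside its reference file; the names and paths are illustrative.
    import sacrebleu

    subsets = ["gulf", "muscat", "doha", "cairo"]  # illustrative subset names

    for name in subsets:
        with open(f"hyp.{name}.en", encoding="utf-8") as h, open(f"ref.{name}.en", encoding="utf-8") as r:
            hyps = [line.strip() for line in h]
            refs = [line.strip() for line in r]
        bleu = sacrebleu.corpus_bleu(hyps, [refs])
        chrf = sacrebleu.corpus_chrf(hyps, [refs])
        print(f"{name:>8}  BLEU {bleu.score:5.1f}  chrF {chrf.score:5.1f}")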

ALUE in detail

ALUE, the Arabic Language Understanding Evaluation, is the Arabic counterpart to the GLUE benchmark that shaped English NLU research. Published at the 2021 Workshop on Arabic Natural Language Processing (WANLP), ALUE bundles eight tasks spanning Arabic natural language inference, semantic similarity, sentiment and emotion classification, dialect identification, irony detection, and offensive and hate speech detection, plus a diagnostic dataset that probes specific linguistic phenomena.

Two design decisions make ALUE more credible than ad-hoc Arabic NLU eval. First, five of the eight test sets are held privately by the ALUE leaderboard team, which means a model cannot be silently fine-tuned on the test data. Second, the diagnostic dataset isolates capabilities like negation, lexical entailment, and quantifiers, so the buyer can see whether a model's overall score reflects broad competence or one strong skill smoothed over several weak ones.

For sovereign procurement, ALUE is the suite that answers "does this model understand Arabic well enough to classify an inbound complaint, judge sentiment in a public consultation, or detect hate speech in a regulator-monitored channel". It is less suited to generation-quality questions (use AraBench or human review for those) and less suited to factual knowledge (use Arabic MMLU). ALUE answers comprehension, full stop.

Arabic MMLU in detail

ArabicMMLU, published in Findings of ACL 2024 by Koto, Li, Shatnawi, and collaborators at MBZUAI and partner institutions, is the first multi-subject knowledge benchmark for Arabic sourced from real school exams. It draws on exams from North Africa, the Levant, and the Gulf, and comprises 40 subject tasks and 14,575 multiple-choice questions, all in Modern Standard Arabic and all written and validated by native speakers in the region.

This is the suite that tells a procurement committee whether the model actually knows Arabic-language curriculum content the way an educated Arabic speaker does, not just whether it can produce fluent-looking output. The original paper's headline finding was sobering: BLOOMZ, mT0, LLaMA 2, and Falcon all struggled to break 50 percent, and the best Arabic-centric model topped out at 62.3 percent. The 2026 picture is better but not solved. The strongest open Arabic models (Falcon Arabic, Qwen 3.6 in larger tiers, Jais flagships) sit in the high 60s to low 70s. Closed frontier models score higher, but the open-weights gap is the buyer's gap, because closed frontier models cannot be deployed inside a sovereign perimeter. Arabic MMLU is hosted on GitHub and integrated as a task config in the EleutherAI LM Evaluation Harness, which makes reproducing scores straightforward.
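
As a minimal sketch of that reproduction path, the snippet below calls the harness's Python API. The task name follows what recent harness releases use for ArabicMMLU (confirm against lm-eval --tasks list on the installed version), and the checkpoint string is a placeholder, not a recommendation.

    # Minimal sketch: reproduce an ArabicMMLU run with the EleutherAI LM Evaluation Harness.
    # Task name and metric keys may vary by harness version; verify with `lm-eval --tasks list`.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=your-org/your-arabic-model,dtype=bfloat16",  # placeholder checkpoint
        tasks=["arabicmmlu"],   # task group; subject subtasks are also reported individually
        num_fewshot=0,
        batch_size="auto",
    )

    # Print per-task accuracy so a strong aggregate cannot hide a weak subject area.
    for task, metrics in sorted(results["results"].items()):
        print(task, metrics.get("acc,none"))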

How to combine them for a sovereign-grade buyer evaluation

A defensible Omani procurement evaluation does not pick one of the three. It runs all three on the candidate model, plus a fourth private test set written in the institution's own Omani register, and reports the matrix.

  1. Run AraBench on the Gulf coarse group plus the Muscat and Doha city subsets, plus chat and media genres. Report per-dialect BLEU and chrF, not a global average.
  2. Run ALUE end to end. Report all eight task scores plus the diagnostic breakdown. Flag any task where the model collapses below 60 percent.
  3. Run Arabic MMLU via the LM Evaluation Harness. Report by subject group (STEM, social science, humanities, language, other) so a strong average does not hide a weak subject.
  4. Run a 200-item private test drawn from the institution's own correspondence, regulator filings, and customer chat. This is the test that catches register drift no public benchmark will.
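
The fourth step has no public harness, and it does not need one. The sketch below scores a private register test against a model served behind an OpenAI-compatible endpoint inside the perimeter (for example a vLLM server); the endpoint URL, the JSONL field names, and the choice of chrF as the metric are all illustrative assumptions.

    # Minimal sketch: score a private register test against a locally served model.
    # Assumes an OpenAI-compatible chat endpoint on the appliance (e.g. vLLM) and a JSONL
    # file with "prompt" and "reference" fields; URL, model name, and fields are illustrative.
    import json

    import requests
    import sacrebleu

    ENDPOINT = "http://localhost:8000/v1/chat/completions"  # local, inside the perimeter
    MODEL = "your-org/your-arabic-model"                     # placeholder

    hyps, refs = [], []
    with open("private_register_test.jsonl", encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            resp = requests.post(ENDPOINT, json={
                "model": MODEL,
                "messages": [{"role": "user", "content": item["prompt"]}],
                "temperature": 0.0,
            }, timeout=120)
            hyps.append(resp.json()["choices"][0]["message"]["content"].strip())
            refs.append(item["reference"])

    print("private register chrF:", sacrebleu.corpus_chrf(hyps, [refs]).score)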

Combine the four into a single one-page model report and the procurement committee has something it can defend. Single-number leaderboards and vendor decks do not survive that level of scrutiny, and they should not.

Email [email protected] for a one-hour briefing on running this four-part Arabic eval matrix against any candidate open-weights model on a sovereign appliance, with reproducible scripts and a redacted Omani register test set.

Frequently asked

Which Arabic eval suite should a sovereign buyer prioritise?

No single suite is sufficient. AraBench measures dialectal Arabic to English machine translation, ALUE covers eight Arabic NLU tasks (similar in spirit to GLUE), and Arabic MMLU measures multi-subject knowledge across 40 school-exam topics. A defensible procurement test reports all three, plus an institution-specific private test set written in Omani register.

Are AraBench, ALUE, and Arabic MMLU available in LM Evaluation Harness?

Arabic MMLU and several Arabic leaderboard tasks ship as task configs in the EleutherAI LM Evaluation Harness. AraBench is integrated less uniformly and is most commonly run via the original QCRI scripts or the LAraBench evaluation pipeline. ALUE is hosted by its own leaderboard team and uses a held-out private test set.

What score on Arabic MMLU is considered strong?

On the original 2024 release, models like BLOOMZ, mT0, LLaMA 2, and Falcon scored under 50 percent, while the best Arabic-centric model reached 62.3 percent. By mid 2026 the strongest open Arabic models (Falcon Arabic, Qwen 3.6, Jais flagships) sit in the high 60s to low 70s. Anything below 55 percent should be questioned for institutional Arabic work.

Why is Arabic eval harder than English eval?

Arabic carries a register continuum from Modern Standard Arabic to dozens of dialects, rich morphology that affects tokenisation, optional diacritics that change meaning, frequent code-switching with English in institutional text, and OCR noise in archives. A single benchmark cannot capture all of these. Sovereign buyers therefore stack a translation suite, an NLU suite, a knowledge suite, and a private register test.