Methodology for Evaluating Multilingual Models in Sovereign Settings

A sovereign procurement committee in Muscat does not buy a multilingual model on a vendor's leaderboard slide. It buys on a methodology that survives a regulator's questions, an internal audit, and a model swap six months later. This guide sets out the three-layer evaluation methodology Hosn uses with sovereign clients: public benchmarks for shape, a sovereign-tuned eval for register fit, and operator A/B for ground truth. It pairs with our pillar on Qwen 3.6 Arabic NLP benchmarks, which applies the methodology to one specific model family.

Why generic English benchmarks lie about multilingual performance

The headline benchmark for a frontier model is almost always English. MMLU, the 57-subject multitask benchmark, was authored in English and remains the default citation in vendor decks. Its multilingual translations exist but are uneven, and a high English MMLU score does not predict a high Arabic MMLU score. The 2024 ArabicMMLU paper showed that open models which scored well on English MMLU still failed to break 50 percent on a multiple-choice Arabic exam set, a gap of more than 25 points for some open frontier checkpoints.

The same pattern appears for translation. BLEU, the n-gram precision metric Papineni and colleagues introduced at ACL 2002, is fast and reproducible and has anchored two decades of MT research. It is also a poor proxy for human judgement on culturally loaded text: a polite Omani Arabic rendering that scores 38 BLEU and a clumsy machine translation that scores 41 can swap places under human review, because BLEU rewards surface n-gram overlap rather than register and pragmatic fit. FLORES-200 from Meta widened coverage to 200 languages and gave usable Gulf-Arabic baselines, but the metric story does not change: a benchmark answers a narrow question, and stacking it on top of an English-trained tokeniser will systematically distort multilingual results unless it is paired with locale-aware testing.
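
To make that narrowness concrete, here is a minimal sketch of a BLEU check using the sacrebleu Python package. The strings are placeholder English stand-ins rather than real Omani correspondence; the point is only that the metric counts n-gram overlap against a reference and has no view of register or politeness.

    import sacrebleu

    # Placeholder reference and two candidate renderings (illustrative only).
    reference = "We respectfully request your reply at your earliest convenience."
    candidates = [
        "We respectfully ask for your reply whenever convenient.",  # polite paraphrase, low overlap
        "We request your reply at your earliest convenience now.",  # awkward, near-verbatim overlap
    ]

    for cand in candidates:
        # sentence_bleu scores one hypothesis against one or more references.
        result = sacrebleu.sentence_bleu(cand, [reference])
        print(f"BLEU {result.score:5.1f}  |  {cand}")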

The sovereign buyer's takeaway is short. Treat any single benchmark as one signal among several. Generic English suites tell you whether a model is capable in principle; they do not tell you whether it is fit for a specific institution writing in a specific Arabic register, with specific code-switching patterns and specific document genres.

The three-layer eval

Hosn's reference methodology has three layers, run in this order, with each layer's failure modes covered by the next.

  1. Layer 1, public benchmarks. Run the candidate model against MMLU, ArabicMMLU, FLORES-200 (Arabic-English, both directions), AraBench, ALUE, and a code-reasoning suite if the institution uses code. Use the EleutherAI LM Evaluation Harness to keep the runs reproducible (a minimal invocation sketch follows this list). The output here is a profile, not a score: where the model is strong, where it is weak, and which subject groups collapse below 60 percent.
  2. Layer 2, sovereign-tuned eval. Run a private 40-60 prompt set drawn from the institution's actual correspondence, regulator filings, internal memoranda, and customer-facing chats. Score each output against a written rubric covering accuracy, register, formatting, code-switch handling, and refusal correctness. This layer catches register drift no public benchmark will see.
  3. Layer 3, operator A/B. The institution's actual operators (a senior officer, a translator, a complaints handler) blind-rate paired outputs from two candidate models on the same 40-60 prompts. Use a 5-point Likert scale per dimension, average across raters, and report inter-rater agreement (an aggregation sketch follows below). This is the layer the procurement committee will defend in front of an auditor.
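
For Layer 1, here is a minimal sketch of a reproducible harness run, assuming the EleutherAI lm-evaluation-harness CLI is installed and the candidate weights sit on a local path. The model path, output directory, and task names are placeholders; task identifiers differ between harness versions, so confirm them against the version you have pinned.

    import subprocess

    # Placeholder paths; point these at the candidate checkpoint and an eval output directory.
    MODEL_PATH = "/models/candidate-checkpoint"
    OUTPUT_DIR = "/evals/layer1/candidate-checkpoint"

    # Reproducible Layer 1 run via the lm-evaluation-harness CLI.
    # Task names such as "arabicmmlu" vary by harness version; check your pinned version's task list.
    subprocess.run(
        [
            "lm_eval",
            "--model", "hf",
            "--model_args", f"pretrained={MODEL_PATH}",
            "--tasks", "mmlu,arabicmmlu",
            "--num_fewshot", "5",
            "--batch_size", "8",
            "--output_path", OUTPUT_DIR,
        ],
        check=True,
    )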

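For Layer 3, here is a minimal sketch of the rating aggregation, assuming each blind-rated prompt collects one 1-to-5 Likert score per operator per dimension. The data shape and the agreement measure (fraction of rater pairs within one point) are illustrative; substitute Krippendorff's alpha or a weighted kappa if the audit requires a standard statistic.

    from itertools import combinations
    from statistics import mean

    # Hypothetical shape: scores[dimension][prompt_id] = one 1-5 Likert score per operator,
    # collected under blind "A"/"B" labels for a single candidate model.
    scores = {
        "accuracy": {"p01": [4, 5, 4], "p02": [3, 4, 3]},
        "register": {"p01": [5, 4, 5], "p02": [4, 4, 3]},
    }

    def dimension_mean(per_prompt):
        """Mean Likert score for a dimension, averaged over operators then prompts."""
        return mean(mean(raters) for raters in per_prompt.values())

    def within_one_agreement(per_prompt):
        """Fraction of operator pairs, per prompt, whose scores differ by at most one point."""
        pairs = [p for raters in per_prompt.values() for p in combinations(raters, 2)]
        return sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)

    for dim, per_prompt in scores.items():
        print(dim, round(dimension_mean(per_prompt), 2), round(within_one_agreement(per_prompt), 2))
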
LLM-as-judge belongs in layer 2, never in layer 3. The original MT-Bench / Chatbot Arena paper from Zheng and colleagues showed that strong judges agree with human raters roughly 80 percent of the time on English chat scoring. That is fine for pre-filtering candidates and stress-testing rubrics. It is not fine as the only signal in front of a sovereign procurement committee, especially in Arabic, where judge-model bias has been documented repeatedly. Always anchor the final decision in human operator review.

Building a sovereign eval set, 40 to 60 institutional prompts

The institution-specific test set is where the methodology earns its keep. Construction follows five rules.

  • Source from real work. Pull anonymised samples from inbound regulator letters, internal memoranda, citizen complaints, archive scans, and public consultation responses. Do not write synthetic prompts; they always sound cleaner than real ones.
  • Cover the register matrix. Include MSA formal correspondence, mixed MSA-Gulf chat, OCR-noisy archive text, English-Arabic code-switched memos, and at least five prompts that test refusal (asking the model to do something it should decline).
  • Write the rubric before you generate outputs. Five dimensions, 0 to 4 each: factual accuracy, register and tone, formatting and bilingual layout, code-switch and acronym handling, refusal correctness. An output needs 16 of 20 to pass (a scoring sketch follows this list).
  • Keep the test set off the model's training surface. Never paste the prompts into a hosted chat for "spot checking". The hosted vendor will see them; some will train on them. Run only on the on-premise candidate.
  • Blind the operator review. Operators see two outputs labelled A and B with no model identifier. The model identity is unmasked only after scoring is locked.
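
Here is a minimal sketch of a Layer 2 rubric record under the five dimensions above, scored 0 to 4 with a 16-of-20 pass mark. The field names and the prompt identifier are illustrative, not a fixed schema.

    from dataclasses import dataclass, asdict

    PASS_MARK = 16  # out of a maximum of 20 (five dimensions, 0-4 each)

    @dataclass
    class RubricScore:
        """One scored output from the sovereign-tuned eval set (illustrative fields)."""
        prompt_id: str
        factual_accuracy: int      # 0-4
        register_and_tone: int     # 0-4
        formatting_bilingual: int  # 0-4
        code_switch_handling: int  # 0-4
        refusal_correctness: int   # 0-4

        def total(self) -> int:
            return sum(v for k, v in asdict(self).items() if k != "prompt_id")

        def passed(self) -> bool:
            return self.total() >= PASS_MARK

    score = RubricScore("regulator-letter-07", 4, 3, 4, 3, 4)
    print(score.total(), score.passed())  # 18 True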

A set of forty prompts catches most procurement-relevant differences between two strong candidates; sixty starts to surface tail-quality issues. Past sixty, the marginal information drops sharply and the operator review burden becomes the binding constraint. Refresh roughly 20 percent of the prompts each quarter to track new institutional workflows.

Re-eval cadence as models drop monthly

Open-weights frontier checkpoints now ship roughly monthly. Gemma, Qwen, Llama, Falcon, and DeepSeek each pushed multiple minor revisions in the last two quarters. A procurement decision based on a single 2025 evaluation is structurally stale by mid-2026.

Hosn's recommended cadence is three-tier. Layer 1 public benchmarks re-run automatically on every candidate revision (cheap, scriptable, no operator burden). Layer 2 sovereign-tuned eval re-runs whenever a candidate is shortlisted or quarterly, whichever comes first. Layer 3 operator A/B runs only at the moment of a procurement decision, model swap, or major behavioural regression report from production. Tag every report with model SHA, weight version, quantisation level, and harness commit so a future audit can reproduce it exactly.
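
Here is a minimal sketch of that tag as a JSON sidecar written next to each eval report. Every value shown is a placeholder; the weight digest and harness commit would come from the institution's own registry and the pinned harness checkout.

    import json
    from datetime import date

    # Placeholder values; populate from the deployment registry and the harness checkout.
    report_tag = {
        "report_date": date.today().isoformat(),
        "model_name": "candidate-checkpoint",  # illustrative
        "model_sha": "0000000000000000",       # weight-file digest, placeholder
        "weight_version": "v1.2",              # placeholder
        "quantisation": "int8",                # placeholder
        "harness_commit": "deadbeef",          # lm-evaluation-harness commit, placeholder
        "eval_layer": 1,
    }

    with open("eval_report_tag.json", "w", encoding="utf-8") as fh:
        json.dump(report_tag, fh, ensure_ascii=False, indent=2)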

Email [email protected] for a one-hour briefing on standing up this three-layer methodology in your institution, including a templated rubric, a starter sovereign-tuned prompt set, and the harness scripts pinned to the model versions you are evaluating.

Frequently asked

Why do generic English benchmarks understate or overstate multilingual performance?

MMLU, HellaSwag, ARC, and similar suites are overwhelmingly English. Strong English scores correlate weakly with Arabic, Urdu, or Swahili performance because tokenisation, training-data mix, and instruction-tuning data differ. A model can post a 78 percent MMLU score and still misroute polite Omani correspondence. Use multilingual MMLU variants and locale-specific suites alongside the headline number.

How large should a sovereign-tuned eval set be?

For a single institution, 40 to 60 prompts is the practical sweet spot. Smaller sets are noisy across model versions; larger sets are expensive to keep current and burn operator review time. The set should cover the institution's real registers: ministerial correspondence, regulator filings, citizen complaints, internal memos, and any code-switching patterns the institution actually uses.

Is LLM-as-judge valid for sovereign evaluation?

LLM-as-judge agrees with human raters at roughly 80 percent on English chatbot-quality scoring per the original MT-Bench paper, which is acceptable for triage but not for procurement evidence. For sovereign work, use it as a pre-filter and require human operator A/B blind rating for the final score. Calibrate the judge against a small human-rated subset before trusting it at scale.

How often should the eval be re-run?

Re-run the full three-layer eval whenever a candidate model ships a new minor or major release, the institution adds a new register or use case, or quarterly at minimum. Frontier models drop monthly in 2026 and a procurement decision more than ninety days old is structurally stale.