Choosing Embedding Models for Bilingual Arabic-English RAG
Most institutional corpora in Oman are bilingual without anyone planning it. The policy file is in Arabic, the underlying technical standard is in English, the email thread switches mid-paragraph, the regulator's circular has both versions side by side. A retrieval-augmented generation system that only handles one language per index is already failing the moment a director asks an Arabic question over a corpus that holds the answer in English. The fix is not translation at query time; it is choosing an embedding model that puts both languages into one shared vector space and indexing the whole corpus once. This guide walks through the choice for a 2026 sovereign deployment and pairs naturally with the broader picture in our pillar on Qwen 3.6 Arabic NLP benchmarks.
The bilingual retrieval problem
Three patterns dominate institutional retrieval in Oman:
- A user asks an Arabic question and the answer lives in an English document (a translated standard, an external regulator's report, a vendor manual).
- A user asks an English question and the relevant artefact is an Arabic ministerial decision or internal memo.
- A single document is itself bilingual, with Arabic and English paragraphs interleaved, and every chunk needs to be findable from either language.
The naive workaround (translate the query first, then retrieve in the target language) fails on three fronts. It doubles the round-trip latency of every search. It introduces a translation error that compounds with the retrieval error. And it garbles any query that contains domain-specific Omani terminology the translator has never seen. The right answer is to skip the translation step entirely and use an embedding model that places Arabic and English documents in the same metric space, so cosine similarity already encodes cross-lingual semantic match.
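To see what that shared space buys you, here is a minimal sketch using the sentence-transformers package and one of the open multilingual checkpoints discussed below (BAAI/bge-m3); the example strings are illustrative, and any model from the short-list that follows drops in the same way.

```python
# Minimal sketch: an Arabic query and an English passage land in one shared space.
# Assumes the sentence-transformers package and the open BAAI/bge-m3 checkpoint
# (introduced below); any multilingual embedding model loads the same way.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

arabic_query = "ما هي متطلبات السلامة لتركيب الألواح الشمسية؟"
english_answer = (
    "Safety requirements for photovoltaic panel installation include "
    "earthing, DC isolation, and certified mounting hardware."
)
unrelated_text = "The quarterly budget review meeting is scheduled for March."

# normalize_embeddings=True lets cosine similarity collapse to a plain dot product
vectors = model.encode(
    [arabic_query, english_answer, unrelated_text],
    normalize_embeddings=True,
)

print(f"Arabic query vs English answer:    {vectors[0] @ vectors[1]:.3f}")
print(f"Arabic query vs unrelated passage: {vectors[0] @ vectors[2]:.3f}")
# With a good cross-lingual model the first similarity is clearly higher.
```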
Embedding-model contenders
Four open-weight or commercial models dominate the bilingual retrieval conversation in 2026. Each has a different trade-off shape.
- BGE-M3 from BAAI is the current open-weight reference. Built on XLM-RoBERTa and described in the M3-Embedding paper (Chen et al., 2024), it supports more than 100 languages, handles up to 8,192 tokens per chunk, and exposes dense, multi-vector, and sparse retrieval in a single forward pass. On the MKQA cross-lingual retrieval task it reports 75.5 percent Recall@100, ahead of the strongest baselines and OpenAI's text-embedding-ada-002. Apache 2.0 licensed, it runs comfortably on a single workstation GPU.
- multilingual-e5-large from Microsoft Research is the lighter alternative. Initialised from XLM-RoBERTa-large and described in the multilingual E5 technical report, it covers the same 100-language XLM-R surface, produces a 1,024-dimension dense vector, and posts strong nDCG@10 on Arabic across the MIRACL family. The smaller serving footprint matters when the appliance is also running a generation model and an OCR pipeline on the same GPU.
- Nomic Embed v1.5 ships a multilingual variant under a permissive Apache 2.0 licence and posts a usable Arabic score, but its public cross-lingual Arabic-English numbers are weaker, and it trails BGE-M3 in the head-to-head tests we have run on Omani corpora.
- Cohere Embed v3 multilingual remains a strong commercial option, but the API-only delivery model is a non-starter for any classified-tier deployment. Cohere is worth considering only when the institution has explicitly approved cloud retrieval and the corpus is at the lowest sensitivity tier.
For a sovereign on-prem deployment the practical short-list is BGE-M3 first, multilingual-e5-large as the lighter fallback. Everything else is downstream of those two unless a sector-specific tuned model has appeared on Hugging Face since this article was written.
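For a sense of what BGE-M3's single forward pass returns, here is a hedged sketch using BAAI's FlagEmbedding package; the method and field names follow the public model card at the time of writing, so check the current documentation before building on them.

```python
# Hedged sketch of BGE-M3's multi-mode output via BAAI's FlagEmbedding package
# (pip install FlagEmbedding); field names follow the public model card.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)  # fits a workstation-class GPU

chunks = [
    "القرار الوزاري رقم ٤٥ بشأن تصنيف البيانات ...",  # Arabic chunk (illustrative)
    "The technical standard requires TLS 1.3 on all external interfaces.",
]

output = model.encode(
    chunks,
    return_dense=True,        # 1,024-dim dense vectors for the ANN index
    return_sparse=True,       # learned per-token lexical weights (sparse retrieval)
    return_colbert_vecs=True, # multi-vector, ColBERT-style representations
)

dense_vectors = output["dense_vecs"]          # shape: (2, 1024)
lexical_weights = output["lexical_weights"]   # one {token: weight} map per chunk
colbert_vectors = output["colbert_vecs"]      # one matrix of token vectors per chunk
print(dense_vectors.shape, len(lexical_weights), len(colbert_vectors))
```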
Cross-lingual retrieval evaluation
The benchmark you cite when defending the model choice is MIRACL, the WSDM 2023 Cup challenge published as Zhang et al., TACL 2023. It covers 18 typologically diverse languages including Arabic, with human-graded relevance judgements over Wikipedia. Both multilingual-e5-large and BGE-M3 report strong Arabic nDCG@10 on the standard dev split, with BGE-M3 comparable or stronger.
Treat MIRACL as the reference, not the verdict. The verdict comes from your own evaluation set. Build 100 to 300 real queries (half Arabic, half English) against a representative slice of your own corpus, label relevance with a small panel of operators, and score nDCG@10, MRR@10, and Recall@50. The numbers that matter for production sit on a 0 to 1 axis with two thresholds: an nDCG@10 of 0.5 is "useful enough to ship behind a guardrail", and 0.7 is "genuinely good".
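A minimal scoring sketch for that in-house evaluation set follows; the helper names and data shapes are illustrative rather than taken from any particular library, and the nDCG variant assumes binary relevance judgements.

```python
# Minimal scoring sketch for the in-house evaluation set. `run` maps query id to
# the ranked chunk ids your retriever returned; `qrels` maps query id to the set
# of chunk ids the panel judged relevant. Helper names are illustrative.
import math

def ndcg_at_k(ranked, relevant, k=10):
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

def mrr_at_k(ranked, relevant, k=10):
    for i, doc in enumerate(ranked[:k]):
        if doc in relevant:
            return 1.0 / (i + 1)
    return 0.0

def recall_at_k(ranked, relevant, k=50):
    return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0

def evaluate(run, qrels):
    n = len(qrels)
    return {
        "nDCG@10": sum(ndcg_at_k(run[q], qrels[q]) for q in qrels) / n,
        "MRR@10": sum(mrr_at_k(run[q], qrels[q]) for q in qrels) / n,
        "Recall@50": sum(recall_at_k(run[q], qrels[q]) for q in qrels) / n,
    }
```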
Architecture pattern
Once the model is chosen, the architecture writes itself. Chunk every document by semantic boundary (heading, paragraph, list item) at 300 to 600 tokens with 10 to 20 percent overlap. Embed each chunk once with the chosen multilingual model and store the vector alongside the original text and metadata in a single index. At query time, embed the user's question with the same model, perform vector similarity search, and pass the top-k chunks to the generation model with the question.
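Here is a sketch of that single-index pipeline, assuming the qdrant-client package (Qdrant is one of the stores mentioned in the deployment note below) and BGE-M3 dense vectors; the collection name, payload fields, and example chunks are illustrative, and the API names follow the client version current at the time of writing.

```python
# Sketch of the single-index pattern with qdrant-client and BGE-M3 dense vectors.
# Collection name, payload fields, and example chunks are illustrative.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
client = QdrantClient(path="./corpus_index")  # embedded, on-box storage

client.create_collection(
    collection_name="bilingual_corpus",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# Ingestion: one embedding per chunk, whatever the language of the chunk
chunks = [
    {"id": 1, "text": "نص من قرار وزاري حول تصنيف البيانات ...", "lang": "ar", "doc": "MD-45"},
    {"id": 2, "text": "Vendor manual text on panel installation ...", "lang": "en", "doc": "VM-12"},
]
vectors = model.encode([c["text"] for c in chunks], normalize_embeddings=True)
client.upsert(
    collection_name="bilingual_corpus",
    points=[
        PointStruct(id=c["id"], vector=v.tolist(), payload=c)
        for c, v in zip(chunks, vectors)
    ],
)

# Query time: embed the question with the same model, in either language
query_vector = model.encode("ما هي متطلبات تركيب الألواح؟", normalize_embeddings=True)
hits = client.search(
    collection_name="bilingual_corpus",
    query_vector=query_vector.tolist(),
    limit=10,
)
for hit in hits:
    print(hit.score, hit.payload["lang"], hit.payload["doc"])
```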
Three details matter. First, normalise vectors before indexing so cosine similarity collapses to a fast inner product. Second, store the document language as metadata so a downstream re-ranker can up-weight same-language results when the operator wants that behaviour. Third, run a lightweight cross-encoder re-ranker on the top 50 hits before passing to the generation model. Cross-encoders close most of the remaining gap between bi-encoder retrieval and oracle retrieval, at a few milliseconds per query.
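A sketch of the re-rank step follows, assuming sentence-transformers' CrossEncoder wrapper and the multilingual BAAI/bge-reranker-v2-m3 checkpoint; the specific re-ranker model is an assumption here, and any multilingual cross-encoder with the same interface slots in identically.

```python
# Sketch of the re-rank step with sentence-transformers' CrossEncoder and the
# multilingual BAAI/bge-reranker-v2-m3 checkpoint (an assumption, not a mandate).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)

def rerank(query: str, hits: list[dict], top_n: int = 10) -> list[dict]:
    """Score (query, chunk) pairs jointly and keep the top_n best chunks."""
    pairs = [(query, hit["text"]) for hit in hits]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(hits, scores), key=lambda pair: pair[1], reverse=True)
    return [hit for hit, _ in ranked[:top_n]]

# Usage: take the top 50 bi-encoder hits, keep the 10 the cross-encoder prefers.
# final_hits = rerank(user_question, vector_hits_top_50, top_n=10)
```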
Note on Hosn-class on-prem deployment
For a sovereign appliance the entire stack runs locally. The embedding model loads once at boot, indexing happens on-box during ingestion, and the vector store (we ship Qdrant or pgvector depending on the institution's existing operational stance) sits on the same machine as the generation model. A Tower-tier configuration handles a single department's corpus, hundreds of thousands of chunks, with sub-100ms retrieval latency. A Rack-tier configuration scales the same pattern to ministry-wide corpora in the low millions of chunks.
The retrieval pipeline is a small, auditable artefact: one model file, one index, one query path. It is exactly the kind of component that should run inside the perimeter rather than against an external API. Pricing is by quotation, sized to corpus volume, query rate, and the rest of the stack the institution chooses to run alongside the retriever.
If your institution is sizing a bilingual retrieval system over a sensitive Omani corpus and would like a one-hour briefing on model choice, evaluation, and the deployment shape that fits your specific situation, the next step is simple. Email [email protected] or message +968 9889 9100. We will walk through your corpus mix, classification levels, and a credible evaluation plan against your timeline.
Frequently asked
Do I need separate Arabic and English indexes?
No. A modern multilingual embedding model like BGE-M3 or multilingual-e5-large maps Arabic and English text into a shared vector space, so a query in either language can retrieve relevant documents in the other. One index, one embedding pipeline, one query path. Two indexes only make sense when you also want lexical (BM25) fallback per language, in which case keep one shared dense index and add a per-language sparse index next to it.
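If you do add the per-language sparse fallback, a minimal sketch of the Arabic side with the rank_bm25 package looks like this; whitespace tokenisation is a placeholder that a real Arabic normaliser would replace, and the blending strategy is left as a design choice.

```python
# Sketch of the optional per-language lexical fallback with the rank_bm25 package.
# Whitespace tokenisation is a placeholder; a real Arabic normaliser replaces .split().
from rank_bm25 import BM25Okapi

arabic_chunks = [
    "نص القرار الوزاري حول تصنيف البيانات ...",
    "مذكرة داخلية حول أمن المعلومات ...",
]
bm25_arabic = BM25Okapi([chunk.split() for chunk in arabic_chunks])

lexical_scores = bm25_arabic.get_scores("تصنيف البيانات".split())
# Blend these with the shared dense-index scores (for example a weighted sum)
# when the operator wants exact-term matching within one language.
print(lexical_scores)
```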
Which embedding model should I start with for an Omani institutional corpus?
Start with BGE-M3. It is the strongest open-weight cross-lingual model published to date, supports 100-plus languages including Arabic, handles up to 8,192 tokens per chunk, and exposes dense, multi-vector, and sparse retrieval from a single pass. If hardware footprint or serving latency matters, multilingual-e5-large is a smaller, faster fallback with strong MIRACL Arabic numbers. Run both on a 200-query Arabic and English evaluation slice from your own corpus before committing.
How do I evaluate cross-lingual retrieval honestly?
Build a small evaluation set of 100 to 300 real queries in each language, with human-judged relevance against your own documents. Score nDCG@10, MRR@10, and Recall@50. Cross-check your numbers against published MIRACL Arabic and English splits to confirm your pipeline behaves like the literature reports. Re-run the eval whenever the embedding model, chunk size, or normalisation changes. The number that matters is on your data, not on Wikipedia.
Can a Hosn-class on-prem appliance run BGE-M3 at production load?
Yes. BGE-M3 inference fits comfortably on a single workstation-class GPU, and a Tower-tier appliance can index millions of chunks overnight while serving live retrieval for a department. Pricing for a Hosn deployment is by quotation, sized to corpus volume, query rate, and the rest of the model stack the institution wants to run alongside the retriever.