On-Premise Indexing of the Omani Legal Corpus With AI

Every serious application of legal research AI in Oman starts at the same place: a clean, current, citable index of what the law actually says. For an Omani law firm, ministry legal department, or sovereign in-house team, that index is built on top of the Sultani Decrees, the ministerial decisions that flesh them out, the cassation principles the Supreme Court issues, and the circulars the regulators publish week after week. None of those sources is a tidy database. This piece walks through the on-premise pipeline that turns the public corpus into a retrieval-grade index without ever shipping the firm's queries or the ministry's matters to a foreign cloud.

What the Omani legal corpus actually contains

Four document classes do almost all the work in Omani legal reasoning, and a defensible index treats each one with its own pipeline.

  • Sultani Decrees. The primary source of binding law. Issued by the Sultan, published in the Official Gazette (al-Jarida al-Rasmiya), and mirrored on the public portals qanoon.om and decree.om. Roughly a hundred to two hundred new decrees per year, plus the back catalogue from 1970 onward.
  • Ministerial decisions and regulations. Issued by the Council of Ministers and individual ministries to implement decrees. The Ministry of Justice and Legal Affairs publishes the canonical Gazette PDFs.
  • Cassation rulings. The Supreme Court releases selected principles (mabadi') that bind lower-court reasoning. Available through the court's official portal and a small number of indexed third-party services.
  • Regulator circulars. The Central Bank of Oman, the Capital Market Authority, the Tax Authority, the Public Authority for Mining, the Ministry of Labour, and the Personal Data Protection Authority each publish circulars and guidance notes that are functionally binding on the regulated entities.

The shape of the index follows the shape of the law. A document loader that treats a cassation ruling like a ministerial decision will lose precedential weight in retrieval. The first design rule is to preserve source class as a first-class metadata field on every chunk.
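One way to make source class a first-class field is to put it on the chunk record itself and derive the chunk ID from it. The sketch below is illustrative only: the names SourceClass, Chunk, unit_label, and filter_by_class are assumptions, not part of any existing library, and the decree number in the usage note is a placeholder.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Iterable, Optional


class SourceClass(str, Enum):
    """The four document classes described above."""
    SULTANI_DECREE = "sultani_decree"
    MINISTERIAL_DECISION = "ministerial_decision"
    CASSATION_RULING = "cassation_ruling"
    REGULATOR_CIRCULAR = "regulator_circular"


@dataclass(frozen=True)
class Chunk:
    source_class: SourceClass
    document_number: str        # number/year form, e.g. "12/2099" (placeholder)
    unit_label: Optional[str]   # article, principle, or clause label
    issued_on: date
    text: str

    @property
    def chunk_id(self) -> str:
        # Stable ID built only from fields that do not change on re-indexing.
        unit = self.unit_label or "full"
        return f"{self.source_class.value}:{self.document_number}:{unit}"


def filter_by_class(chunks: Iterable[Chunk], wanted: set) -> list:
    """Keep only chunks whose source class is in `wanted`."""
    return [c for c in chunks if c.source_class in wanted]
```

With this shape, a chunk for article 7 of a hypothetical decree 12/2099 gets the ID sultani_decree:12/2099:7, and a retriever can restrict or re-weight results by source class without parsing the text.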

The indexing pipeline, end to end

Five stages, all running on the firm's or ministry's own appliance.

  1. Acquire. Daily polling against qanoon.om, decree.om, the Ministry of Justice and Legal Affairs (MOJLA) Gazette feed, and each regulator's circular page. New documents land as PDFs in a versioned store with the source URL, fetch timestamp, and SHA-256 hash captured for later audit.
  2. Detect text layer, OCR if needed. A page-by-page check separates born-digital PDFs from scanned images. Scanned pages route through an Arabic-tuned OCR stack (Tesseract with the ara traineddata, surya-ocr, or an open-weight vision-language model of the Qwen-VL class running locally) and the output is reconciled against the issue number and date in the Gazette header.
  3. Extract decree citations. A regex and named-entity layer pulls the canonical citation form (Sultani Decree number/year, Ministerial Decision number/year, cassation appeal number, circular reference) from the body. Each citation becomes a graph edge linking the citing chunk to the cited document, so the retriever can walk one hop to load the full text of any cross-reference.
  4. Semantic chunking. Decrees are chunked at the article (madda) level, not by token count, because every Omani lawyer cites by article number. Cassation rulings are chunked by principle. Regulator circulars are chunked by clause. Every chunk carries its source class, document number, article number, issuing date, and a stable chunk ID.
  5. Embed and index. A multilingual embedding model writes vectors to a local store (FAISS, Qdrant, or Milvus, all self-hosted). A sparse BM25 index lives next to the dense one for hybrid retrieval, which a 2024 COLIEE-related arXiv study on Arabic legal RAG found to outperform either approach alone on case-law tasks.
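Stages 3 and 4 can be sketched together: split a decree body at article headings, then pull citation edges out of each article. This is a minimal sketch under stated assumptions. The heading forms ("Article (3)", "المادة (٣)") and the English citation form ("Sultani Decree 12/2099") are assumed patterns; real Gazette text also uses ordinal-word headings and Arabic citation forms that these regexes do not cover. The function names and the edge format are hypothetical.

```python
import re

# Assumed heading forms on their own line: "Article 3", "Article (3)",
# "المادة 3", "المادة (٣)". Ordinal-word headings are not handled here.
ARTICLE_HEADING = re.compile(
    r"^[ \t]*(?:Article|المادة)[ \t]*\(?[ \t]*([0-9\u0660-\u0669]+)[ \t]*\)?[ \t]*$",
    re.MULTILINE,
)

# Assumed English citation form, with an optional "No." before the number.
DECREE_CITATION = re.compile(
    r"(Sultani Decree|Ministerial Decision)\s+(?:No\.?\s*)?(\d+)\s*/\s*(\d{4})"
)


def split_into_articles(body: str) -> list[tuple[str, str]]:
    """Return (article_number, article_text) pairs, one per heading found."""
    matches = list(ARTICLE_HEADING.finditer(body))
    articles = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(body)
        # int() accepts Arabic-Indic digits, so "٣" and "3" yield the same label.
        number = str(int(m.group(1)))
        articles.append((number, body[m.end():end].strip()))
    return articles


def citation_edges(chunk_id: str, text: str) -> list[tuple[str, str]]:
    """Return (citing_chunk_id, cited_document_key) pairs found in `text`."""
    edges = []
    for kind, number, year in DECREE_CITATION.findall(text):
        target = f"{kind.lower().replace(' ', '_')}:{number}/{year}"
        edges.append((chunk_id, target))
    return edges


# Placeholder text; the decree numbers are invented for illustration.
sample = (
    "Article (1)\n"
    "Definitions apply as set out in Sultani Decree 12/2099.\n"
    "Article (2)\n"
    "This Decree enters into force on publication.\n"
)
articles = split_into_articles(sample)
edges = citation_edges("sultani_decree:99/2099:1", articles[0][1])
```

Text before the first heading (the preamble) is dropped by this sketch; a production loader would likely keep it as its own chunk.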

Bilingual embeddings and cross-lingual lookup

The Omani working environment is bilingual by default. Counsel drafts in English, the underlying decree is in Arabic, the cassation principle is in Arabic, the client memo is in English. The index has to answer an English question with the right Arabic article and vice versa. Three open multilingual embedding families handle this well on local hardware: BGE-M3 from BAAI, multilingual E5 from Microsoft Research, and the Cohere Embed Multilingual checkpoint where a self-hostable variant is available. Each places Arabic and English passages in a shared semantic space, so a query and its retrieved passages do not need to share a language. The deeper architectural pattern, including chunking strategy and reranker selection, is the topic of the companion piece on bilingual RAG embeddings for Arabic and English.

The retrieval layer should also normalise Arabic forms before embedding (alif/hamza variants, ta marbuta, ya/alif maqsura) and remove diacritics. Without normalisation, two passages that differ only in tashkil can land farther apart in vector space than they should, and the retriever silently underperforms.
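The normalisation step can be sketched with the standard library alone. The mapping below follows one common convention (all alif variants to bare alif, ta marbuta to ha, alif maqsura to ya); whether to fold ta marbuta is a judgment call, and the function name normalise_arabic is an assumption for illustration.

```python
import re
import unicodedata

# Tashkil marks (U+064B to U+0652), superscript alif (U+0670), tatweel (U+0640).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

_LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0622": "\u0627",  # alif with madda -> bare alif
    "\u0671": "\u0627",  # alif wasla -> bare alif
    "\u0629": "\u0647",  # ta marbuta -> ha
    "\u0649": "\u064A",  # alif maqsura -> ya
})


def normalise_arabic(text: str) -> str:
    """Fold common orthographic variants so equivalent spellings compare equal."""
    # NFKC also folds presentation-form ligatures back to base letters.
    text = unicodedata.normalize("NFKC", text)
    text = _DIACRITICS.sub("", text)
    return text.translate(_LETTER_MAP)
```

The same function should run over both queries and passages. Because the folding is lossy, it seems safer to embed the normalised form while storing the original text for display and citation.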

Update cadence and audit posture

The pipeline runs daily but the audit cadence is event-driven. Every new Sultani Decree fires a four-step trace: acquire, OCR (if scanned), embed, index. The knowledge-management lead receives a digest of new documents, citation edges added, and any OCR pages that fell below the confidence threshold and need human review. A weekly diff report compares the local index against the public source URLs to catch silent edits or take-downs. A monthly snapshot of the entire index, model weights, and embedding model is checkpointed to an air-gapped backup so the firm or ministry can reproduce any answer it surfaced six months ago.
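The fetch record and the weekly diff can be sketched as follows. This is an illustration under assumptions: FetchRecord, record_fetch, and diff_against_source are hypothetical names, the actual HTTP fetching is left out, and the caller is assumed to pass in the bytes it just downloaded (or None when the URL no longer resolves).

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class FetchRecord:
    source_url: str
    fetched_at: str   # ISO 8601 timestamp, UTC
    sha256: str       # hex digest of the PDF bytes as fetched


def record_fetch(source_url: str, payload: bytes) -> FetchRecord:
    """Capture the audit fields for one downloaded document."""
    return FetchRecord(
        source_url=source_url,
        fetched_at=datetime.now(timezone.utc).isoformat(),
        sha256=hashlib.sha256(payload).hexdigest(),
    )


def diff_against_source(
    stored: dict[str, FetchRecord],
    live: dict[str, Optional[bytes]],
) -> dict[str, list[str]]:
    """Compare stored records with freshly fetched bytes, keyed by URL.

    A missing key or a None value in `live` is treated as a take-down.
    """
    report: dict[str, list[str]] = {"changed": [], "removed": [], "new": []}
    for url, rec in stored.items():
        payload = live.get(url)
        if payload is None:
            report["removed"].append(url)
        elif hashlib.sha256(payload).hexdigest() != rec.sha256:
            report["changed"].append(url)
    report["new"] = [u for u, p in live.items() if u not in stored and p is not None]
    return report
```

A "changed" entry is the silent-edit case: the URL still resolves but the bytes no longer hash to the stored value, so the document goes back through the pipeline and the old version stays in the versioned store.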

The result is an Omani legal index that lives inside the perimeter, updates within hours of the Gazette, supports bilingual lookup, and produces an audit trail the firm or ministry can hand to a regulator without redaction. To plan an indexing build for your firm or ministry, email [email protected] for a one-hour briefing. We will scope the corpus, the OCR queue, and the bilingual embedding stack against your bench size and confidentiality posture.

Frequently asked

Where does the public Omani legal corpus actually live online?

The Ministry of Justice and Legal Affairs publishes the official text of Sultani Decrees in the Official Gazette (al-Jarida al-Rasmiya). Two large public mirrors index the same corpus for search: qanoon.om and decree.om. The Supreme Court publishes selected cassation rulings, while sector regulators such as the Capital Market Authority, the Central Bank of Oman, and the Tax Authority publish their own circulars on their portals. A sovereign index mirrors all of these locally, then keeps a daily diff against the public source so updates land inside the perimeter without any query or matter leaving the firm or the ministry.

Do we need OCR if the text is already published as PDF?

Often yes. A meaningful share of Sultani Decrees, particularly older volumes from the 1970s through the early 2000s, exists as scanned image PDFs rather than text-layer PDFs. A robust pipeline detects text-layer presence on ingestion, routes scanned pages through Arabic-tuned OCR such as Tesseract with the ara model, surya-ocr, or a fine-tuned vision model running locally, then reconciles the OCR output against the official issue number and date stamp. Skip this step and the scanned portion of the historical archive contributes nothing to the index.
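The text-layer check can be as simple as counting Arabic letters in whatever text a PDF library extracts from each page. The sketch below assumes extraction has already happened upstream and that the pages are Arabic-language; the function names and the threshold of 50 letters are hypothetical and would need tuning on real Gazette pages.

```python
import re

# Characters in the basic Arabic letter range (hamza through ya).
_ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")


def needs_ocr(page_text: str, min_arabic_letters: int = 50) -> bool:
    """True when the extracted text layer has too few Arabic letters to trust."""
    return len(_ARABIC_LETTER.findall(page_text or "")) < min_arabic_letters


def route_pages(page_texts: list[str], min_arabic_letters: int = 50):
    """Split page indexes into (born_digital, ocr_queue)."""
    born_digital, ocr_queue = [], []
    for index, text in enumerate(page_texts):
        if needs_ocr(text, min_arabic_letters):
            ocr_queue.append(index)
        else:
            born_digital.append(index)
    return born_digital, ocr_queue
```

Routing per page rather than per file matters because a single Gazette PDF can mix born-digital pages with scanned inserts.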

How do you handle bilingual lookup when the corpus is mostly Arabic?

Use a multilingual embedding model that places Arabic and English passages in a shared semantic space. Open-weight options such as BGE-M3 and multilingual E5 run on local GPUs without sending text outside the perimeter; Cohere Embed Multilingual is an option only where a self-hostable variant is available. The lawyer asks the question in English, the retriever ranks Arabic passages by semantic similarity, and the generator answers in either language with the original Arabic citation attached. Keyword search alone collapses on Arabic morphology, so semantic retrieval is the floor, not the ceiling.
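Since the pipeline keeps a BM25 index next to the dense one, the two ranked lists have to be merged at query time. The article does not name a fusion method; reciprocal rank fusion (RRF) is one widely used option, sketched here as an assumption. The chunk IDs are placeholders and k = 60 is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs; higher fused score ranks first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every ID it contains.
            scores[chunk_id] += 1.0 / (k + rank)
    # Ties are broken by chunk ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["a7", "a3", "a9"]    # from the embedding index
sparse_hits = ["a3", "a12", "a7"]  # from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF uses only rank positions, so it needs no calibration between BM25 scores and cosine similarities, which makes it a reasonable first choice before a learned reranker is added.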

How often does the Omani legal corpus actually change?

The Official Gazette publishes weekly. In a typical year Oman issues between 100 and 200 Sultani Decrees, several hundred ministerial decisions, dozens of cassation principles, and a steady drip of regulator circulars. A daily polling cron against qanoon.om and decree.om plus the Gazette PDF is sufficient. New documents enter the OCR queue, get chunked, embedded, and indexed within hours of publication, with a notification surfaced to the firm or ministry's knowledge-management lead.