Arabic NLP for Omani Government Archive Digitization

Every Omani ministry, regulator, court, and Royal Court secretariat sits on top of decades of paper. Decrees in Naskh handwriting from the 1970s. Carbon-copy memos from the 1980s. Faxed land filings from the 1990s. Scanned PDFs of mixed quality from the 2000s. The job of turning that wall of paper into a queryable, sovereign archive is one of the highest-leverage AI projects an Omani institution can run, and it is exactly the kind of work that must never leave the building. This piece walks through the OCR plus NLP pipeline that makes it tractable and the on-prem architecture that keeps it sovereign.

1. The Omani archive scale problem

A typical Omani ministry holds three generations of records stacked on top of each other. The oldest layer, pre-1980, is mostly handwritten in Naskh on letterhead, often with red ministerial seals, marginal annotations, and Hijri-only dates. The middle layer, 1980 to 2005, is typewriter print or early dot-matrix, frequently on faded NCR carbon paper with bleed-through and uneven ink. The newest layer is scanned PDFs from networked multifunction printers, mixed at 200 dpi greyscale and 300 dpi colour, with no OCR text layer.

Three properties make this archive painful to digitise:

  • Multi-script content. Bilingual decrees mix Arabic body text with English company names, Latin-script CR numbers, and Hijri plus Gregorian dates side by side.
  • Mixed scan quality. A single file can contain a clean colour scan of a cover letter, a faded photocopy of an annex, and a microfilm-derived TIFF of an older attachment.
  • Cursive and ligature density. Arabic script is inherently cursive, with letters changing form by position and stacking through ligatures, which breaks any OCR engine that was tuned on Latin scripts.

The institutional cost of leaving this archive opaque is concrete. Researchers spend weeks chasing a single decree. Land disputes drag on because the original 1980s ruling cannot be found. Successor officials lose institutional memory because their predecessors' correspondence is unsearchable. The Hosn position is straightforward: this is exactly what on-premise AI for sovereign institutions in Oman and the GCC is for.

2. The OCR plus NLP pipeline

A working pipeline has five stages, all running inside the institution's perimeter:

  1. Ingest and quality conditioning. Pages enter as PDF, TIFF, or JPEG. The pipeline deskews, denoises, binarises mixed-quality scans, and splits multi-page bundles on staple marks or letterhead detection.
  2. Layout analysis. A vision model identifies regions: header, body, signature block, seal, marginal note, table. This is the difference between extracting clean text and pulling a soup of stamps and footnotes into the body.
  3. OCR or HTR. Machine-printed pages route to Tesseract 5 with a tuned Arabic LSTM model. Handwritten pages route to a transformer recogniser such as HATFormer or the line-level recogniser described in Invizo.
  4. Script normalisation. Output is normalised: alef variants unified, kashida stripped, diacritics either preserved or removed depending on downstream task, and Hindi-Arabic numerals optionally converted to Western digits.
  5. NER and indexing. An Arabic NER model fine-tuned on government correspondence extracts persons, ministries, locations, dates, decree numbers, and amounts. These flow into a search index alongside the cleaned text and a vector embedding for semantic search.
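
Stage 4 can be sketched in a few lines. The following is a minimal normaliser assuming the exact conventions listed above (alef unification, kashida stripping, optional diacritic removal, Hindi-Arabic digit conversion); the function name and flags are illustrative, not a fixed API.

```python
import re

ALEF_VARIANTS = "\u0622\u0623\u0625\u0671"        # آ أ إ ٱ -> bare alef ا
KASHIDA = "\u0640"                                 # tatweel / elongation mark
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")   # tashkeel + superscript alef
# Hindi-Arabic digits ٠-٩ mapped to Western 0-9
DIGIT_MAP = {0x0660 + i: str(i) for i in range(10)}

def normalise(text: str, keep_diacritics: bool = False,
              western_digits: bool = True) -> str:
    """Normalise OCR output per the stage-4 rules above."""
    # Unify alef variants to the bare alef
    for v in ALEF_VARIANTS:
        text = text.replace(v, "\u0627")
    # Strip kashida
    text = text.replace(KASHIDA, "")
    # Optionally strip tashkeel, depending on the downstream task
    if not keep_diacritics:
        text = DIACRITICS.sub("", text)
    # Optionally convert Hindi-Arabic numerals to Western digits
    if western_digits:
        text = text.translate(DIGIT_MAP)
    return text
```

Keeping diacritic handling behind a flag matters because some downstream tasks (e.g. matching formal decree titles) want tashkeel preserved, while search indexing usually wants it stripped.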

Each stage is independently reviewable and re-runnable. When a better recogniser ships, the institution re-OCRs only the failing pages, not the whole archive.

3. Arabic-specific challenges

Arabic OCR is not a solved problem the way English OCR is. Four properties drive the difficulty:

  • Cursive ligatures. Letters connect by default and change shape based on position. A naive segmenter built for Latin scripts cuts characters mid-ligature and produces nonsense.
  • Optional diacritics. Short vowels (tashkeel) appear in the Quran and in formal correspondence but are usually absent in modern memos. The recogniser must handle both, and the downstream NLP layer must not treat their presence or absence as a change in meaning.
  • Handwritten Naskh variability. The same scribe can produce wildly different letter forms across a long document. Historical recognisers like HATFormer were trained on millions of synthetic line images precisely because real handwritten data is scarce.
  • Old print fonts. 1970s and 1980s typewriter fonts and early Naskh-derived digital fonts often fall outside the training distribution of OCR models tuned on modern web Arabic. Sovereign archives almost always need a fine-tune on a small in-house corpus to lift accuracy on these legacy fonts.

The downstream NLP layer adds its own challenges. Arabic NER models such as those built on AraBERT and the ANER transformer family handle modern news text well, but ministerial style, Hijri dates, and decree-number formats need fine-tuning on labelled archive samples. A few thousand annotated pages from the institution's own corpus typically lift F1 enough to be production-viable.
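
For format-bound entity types, a rule-based baseline is worth having alongside the trained model, both as a sanity check and as a source of silver labels. The pattern below is a hypothetical sketch assuming decree references of the common "رقم N/YY" form; the actual formats in a given archive would be confirmed against real samples, and a fine-tuned NER model handles everything the regex cannot.

```python
import re

# Matches references of the assumed form "رقم ١٢/٨٥" or "رقم 12/85",
# accepting Western or Hindi-Arabic digits on either side of the slash.
DECREE_RE = re.compile(r"رقم\s+([0-9٠-٩]{1,4}/[0-9٠-٩]{2,4})")

def extract_decree_numbers(text: str) -> list:
    """Return decree-number strings found in normalised archive text."""
    return DECREE_RE.findall(text)
```

A baseline like this also makes the fine-tuned model auditable: disagreements between regex hits and model spans are exactly the pages worth sending to a human annotator.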

4. Architecture for an on-prem deployment

A sovereign archive deployment has three planes that map cleanly onto a Hosn appliance:

  • Compute plane. Layout analysis, OCR, HTR, and NER inference run on the same on-prem GPU node that hosts the institution's general-purpose Arabic LLM. Batch jobs use idle hours; interactive lookups use spare capacity during the day.
  • Storage plane. Raw scans live in cold object storage. Cleaned text, NER spans, and embeddings live in an indexed store next to the source pages. A signed manifest preserves the chain of custody from original scan to extracted entity, which is what makes the result legally defensible.
  • Access plane. Researchers query through an internal portal that returns highlighted hits inside the original page image, never raw text alone. Role-based access enforces classification rules so a researcher with administrative clearance does not see ministerial-confidential pages.
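
The signed manifest in the storage plane can be sketched as follows. HMAC-SHA256 stands in here for whatever signing mechanism the appliance actually uses, and the record fields are illustrative assumptions:

```python
import hashlib
import hmac
import json

def manifest_entry(scan_bytes: bytes, extracted: dict,
                   signing_key: bytes) -> dict:
    """Record provenance from original scan to extracted entities:
    the hash binds the extraction to the exact source image."""
    entry = {
        "scan_sha256": hashlib.sha256(scan_bytes).hexdigest(),
        "extraction": extracted,   # e.g. {"persons": [...], "decrees": [...]}
    }
    payload = json.dumps(entry, sort_keys=True, ensure_ascii=False).encode()
    entry["signature"] = hmac.new(signing_key, payload,
                                  hashlib.sha256).hexdigest()
    return entry

def verify_entry(entry: dict, signing_key: bytes) -> bool:
    """Recompute the signature over everything except the signature itself."""
    body = {k: v for k, v in entry.items() if k != "signature"}
    payload = json.dumps(body, sort_keys=True, ensure_ascii=False).encode()
    expected = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, entry["signature"])
```

The point of the scan hash is legal defensibility: anyone can later prove that a given extracted entity came from a specific, unaltered page image.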

Two operational details matter. First, the indexing job is rerunnable: when a recogniser is retrained on the institution's own font samples, the pipeline reprocesses just the affected document classes rather than the full archive. Second, every step writes to an append-only audit log, so a regulator can later answer who saw which page when. That posture is what separates an archive project from a science experiment.
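
One way to make the audit log tamper-evident as well as append-only is hash chaining, where each entry commits to the previous one. This is a minimal sketch under that assumption; the entry fields are illustrative:

```python
import hashlib
import json
import time

class AuditLog:
    """Append-only, hash-chained access log: each entry commits to
    the previous one, so a silent edit or deletion breaks the chain."""

    def __init__(self):
        self.entries = []
        self._prev = "0" * 64  # genesis value

    def append(self, user: str, page_id: str, action: str,
               ts: float = None):
        entry = {"user": user, "page": page_id, "action": action,
                 "ts": ts if ts is not None else time.time(),
                 "prev": self._prev}
        # Hash covers the whole entry, including the previous hash
        self._prev = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        entry["hash"] = self._prev
        self.entries.append(entry)

    def verify(self) -> bool:
        prev = "0" * 64
        for e in self.entries:
            if e["prev"] != prev:
                return False
            body = {k: v for k, v in e.items() if k != "hash"}
            h = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if h != e["hash"]:
                return False
            prev = h
        return True
```

With a chain like this, the regulator's question of who saw which page when has a verifiable answer, not just a database row.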

Brief us on your archive

If you run a ministry, a regulator, a court, or a Royal Court secretariat that is still answering 1980s questions out of a paper room, the OCR plus NLP problem is solvable with current technology, and it is solvable on-prem. Email [email protected] for a one-hour briefing. Bring a representative slice of the archive, redacted as needed, and we will walk you through the pipeline against your actual material rather than a sales deck.

Frequently asked

Why not just use a cloud OCR service for the archive?

Government archives routinely contain personnel files, classified correspondence, land records, and ministerial decisions that fall under PDPL and ministerial classification rules. Sending those scans to a cloud OCR API exposes them to foreign legal regimes and to retention beyond the operator's control. On-prem OCR keeps every page inside the institution's perimeter.

How accurate is open-source Arabic OCR on real archive material?

On clean, modern machine-printed Arabic, Tesseract 5 with the LSTM engine and a tuned Arabic model reaches strong character accuracy. On older typewriter print, faded carbon copies, and handwritten Naskh, accuracy drops sharply. That gap is closed by transformer recognisers like HATFormer and Invizo plus a human verification queue for high-value documents.
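
The verification queue mentioned above amounts to a routing rule on per-page confidence. The thresholds below are illustrative assumptions; in practice they would be calibrated on a held-out validation set from the institution's own archive:

```python
def route_page(mean_confidence: float, high_value: bool = False) -> str:
    """Decide what happens to a recognised page.
    Thresholds are placeholders, not calibrated values."""
    if high_value:
        return "human_review"      # high-value documents always get a human pass
    if mean_confidence >= 0.97:
        return "auto_accept"
    if mean_confidence >= 0.85:
        return "human_review"
    return "re_recognise"          # send back through the HTR path
```

The asymmetry is deliberate: a decree or land ruling goes to a human regardless of confidence, while routine correspondence auto-accepts when the recogniser is sure.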

What does the NER layer actually extract from Omani archive text?

An Arabic NER model fine-tuned on government correspondence extracts persons, ministries and directorates, locations, dates (Hijri and Gregorian), CR numbers, decree numbers, file references, and monetary amounts. These become the index that lets a researcher answer queries like "find every decree referencing a given wilayah between 1985 and 2000" in seconds rather than weeks.
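
That example query reduces to a filter over the entity index. The sketch below assumes each indexed document carries `locations`, a Gregorian `year`, and a `decree_number` field; the schema is illustrative:

```python
def find_decrees(index, wilayah: str, year_from: int, year_to: int) -> list:
    """Return decree numbers for documents that name the given
    wilayah and fall inside the Gregorian year range."""
    return [d["decree_number"] for d in index
            if wilayah in d.get("locations", [])
            and year_from <= d.get("year", 0) <= year_to]
```

In the real deployment the same filter runs against the search index rather than an in-memory list, combined with the semantic-search embedding for queries that name no exact entity.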

Can the pipeline run fully air-gapped?

Yes. The recognition models, NER models, embeddings, and search index all run on the on-prem appliance with no outbound network. Updates ship as signed offline bundles. The same posture applies whether the archive lives in a ministry data centre, a regulator's secure room, or a Royal Court secretariat.