Arabic OCR for Ministry Archive Digitization Programmes
Every Omani ministry sits on top of decades of paper: ministerial decrees, internal correspondence, signed memoranda, project files, personnel records, contracts, and committee minutes. Some of it is clean modern Naskh on laser-printed letterhead. Most of it is mixed: faded carbon copies, handwritten margin notes, stamps that bleed through the page, and pre-2000 typewriter pages with broken letters. A real digitization programme cannot pretend that mess away. It needs an Arabic OCR pipeline designed for it, run inside the ministry, and governed for a fifty-year horizon. This piece lays out the operational shape of that pipeline.
1. The Omani ministry archive scale
The National Records and Archives Authority (NRAA) sets the legal frame for all government documents in Oman through the Documents and Archives Law and the supporting executive regulation. The NRAA's Wosool electronic records system has been signed into operation across multiple ministries, but the back-catalogue of paper that predates Wosool is what dominates any digitization plan.
A typical Omani ministry archive shows three rough strata:
- Pre-1990 layer. Typewritten Arabic on onion-skin paper, English carbon copies for foreign correspondence, and handwritten ministerial annotations. OCR-hostile: missing letter strokes, ink bleed, oxidised paper, and irregular spacing.
- 1990 to 2010 layer. Inkjet and dot-matrix prints, fax reproductions, and the first Word documents printed and re-filed. Mixed quality, with stamps and signatures over text being the main hazard.
- Post-2010 layer. Mostly digital-native PDFs exported from Microsoft Office, plus printouts scanned at higher DPI with cleaner typesetting. OCR-friendly, but full of embedded forms, tables, and signature blocks.
For a single ministry the volume is realistically in the millions of pages once decrees, HR files, project archives, and committee minutes are summed. A pure manual-retype project does not finish in a human lifetime. OCR is mandatory; the question is how to do it without sending classified Omani paper to a foreign cloud.
2. The OCR pipeline (Surya, TrOCR, Tesseract Arabic)
No single OCR engine wins across all three strata. The pragmatic 2026 stack ensembles three:
- Layout and reading order: Surya. Surya handles RTL reading order, multi-column ministerial decrees, and table detection. The KITAB-Bench 2025 evaluation reports Surya at roughly 70 percent Jaccard on Arabic layout detection, which is competitive for modern typeset pages.
- Modern typeset Arabic: Tesseract Arabic with a custom dictionary of ministry-specific terms (departmental names, administrative vocabulary, place names). Cheap, CPU-only, and good enough on the post-2010 layer.
- Handwritten and hostile pages: a fine-tuned TrOCR-Arabic model, or a local vision-language model (Qwen-VL or Gemma-VL family) running on the same air-gapped GPU appliance. KITAB-Bench shows VLMs reduce character error rate on hard Arabic pages by an average of 60 percent versus traditional OCR.
The router picks the engine per page based on layout features (handwriting detector, font count, image quality score). A page whose OCR confidence falls below a threshold goes into a human-in-the-loop review queue, where an Arabic-native reviewer corrects the text and the correction is fed back as fine-tuning data. Over roughly three months of operation, this loop adapts the model to the ministry's specific paper, fonts, and stamp patterns.
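The routing and review logic described above can be sketched in a few lines. This is a minimal illustration, not a production router: the feature names, the engine labels, and every threshold are hypothetical placeholders that a real deployment would tune on its own scans.

```python
from dataclasses import dataclass

# All thresholds below are illustrative assumptions, not benchmark values.
HANDWRITING_THRESHOLD = 0.5
MIN_IMAGE_QUALITY = 0.6
MAX_FONTS_FOR_TESSERACT = 3
REVIEW_CONFIDENCE_THRESHOLD = 0.85


@dataclass
class PageFeatures:
    handwriting_score: float  # 0..1, from a hypothetical handwriting detector
    font_count: int           # distinct fonts reported by layout analysis
    image_quality: float      # 0..1, from a hypothetical quality scorer


def pick_engine(features: PageFeatures) -> str:
    """Return a label for the engine that should read this page."""
    # Handwritten, degraded, or visually busy pages go to the heavier model.
    if features.handwriting_score >= HANDWRITING_THRESHOLD:
        return "trocr_or_vlm"
    if features.image_quality < MIN_IMAGE_QUALITY:
        return "trocr_or_vlm"
    if features.font_count > MAX_FONTS_FOR_TESSERACT:
        return "trocr_or_vlm"
    # Clean typeset pages take the cheap CPU-only path.
    return "tesseract_ara"


def needs_human_review(ocr_confidence: float) -> bool:
    """Low-confidence pages are queued for an Arabic-native reviewer."""
    return ocr_confidence < REVIEW_CONFIDENCE_THRESHOLD


if __name__ == "__main__":
    clean_page = PageFeatures(handwriting_score=0.05, font_count=1, image_quality=0.9)
    faded_page = PageFeatures(handwriting_score=0.10, font_count=1, image_quality=0.3)
    print(pick_engine(clean_page), pick_engine(faded_page), needs_human_review(0.72))
```

Keeping the router as plain rules, rather than a learned model, makes each routing decision easy to explain in an audit; whether that trade-off suits a given ministry is a design choice.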
3. Post-OCR: NER, classification, semantic chunking
Raw OCR text is not yet useful. The post-OCR stage is what makes the archive searchable, governable, and ready for AI assistants. Three steps run on the same on-premise GPU appliance:
- Named Entity Recognition. An Arabic NER model tags every page for persons, organisations, ministries, places, decree numbers, dates (Hijri and Gregorian), file numbers, and money. CAMeL Tools and AraBERT-NER are common starting points; both can be fine-tuned on a few thousand ministry-labelled examples.
- Document classification. A small fine-tuned classifier assigns each scanned bundle a document type (decree, memo, contract, HR file, minutes, correspondence) and a sensitivity tier (public, internal, restricted, classified). Classification routes the file into the right retention rule and the right access-control list.
- Semantic chunking and embeddings. The OCR text is chunked at section boundaries and embedded with a bilingual model (Arabic plus English). The vectors live in a local Qdrant or Milvus index and feed retrieval-augmented assistants for ministry staff. The same plumbing that powers defence AI Arabic triage at higher classification tiers powers ministerial document search at the unrestricted tier.
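As a rough sketch of chunking at section boundaries, the function below splits OCR text wherever a line looks like a section heading. The heading pattern (lines starting with "المادة", "البند", or a numeral followed by a dot or dash) and the size cap are assumptions for illustration; real decrees will need patterns drawn from the ministry's own templates. The embedding step is left out because it depends on the chosen model.

```python
import re
from typing import List

# Assumption: a section starts at a line beginning with "المادة" (Article),
# "البند" (Clause), or a Western/Arabic-Indic numeral followed by "." or "-".
SECTION_START = re.compile(r"^\s*(?:المادة|البند|[0-9\u0660-\u0669]+\s*[.\-])")


def chunk_by_section(text: str, max_chars: int = 1200) -> List[str]:
    """Split OCR text into chunks that begin at section headings."""
    sections: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # drop blank lines left over from OCR
        if SECTION_START.match(line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line.strip())
    if current:
        sections.append("\n".join(current))

    # Oversized sections are cut by character count. This is a crude
    # fallback that can split mid-word; a real pipeline would cut at
    # sentence boundaries instead.
    chunks: List[str] = []
    for section in sections:
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks


if __name__ == "__main__":
    sample = "مقدمة\nالمادة 1\nنص\nالمادة 2\nنص آخر"
    for chunk in chunk_by_section(sample):
        print(repr(chunk))
```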
Mu'een, Oman's national shared-AI platform, can absorb the unrestricted strata for cross-ministry workflows; the restricted and classified strata stay inside the ministry's own appliance.
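The split between the shared platform and the ministry appliance can be expressed as a small fail-closed rule. This sketch assumes that "unrestricted" means the public and internal tiers; that reading is an assumption, and the actual set is a policy decision for the ministry. The destination labels are hypothetical.

```python
from enum import Enum


class Tier(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    RESTRICTED = "restricted"
    CLASSIFIED = "classified"


# Assumption: only these tiers may be shared across ministries.
SHAREABLE_TIERS = {Tier.PUBLIC, Tier.INTERNAL}


def destination(tier_label: str) -> str:
    """Map a classifier's tier label to a hypothetical storage destination."""
    try:
        tier = Tier(tier_label.strip().lower())
    except ValueError:
        # Fail closed: an unknown or garbled label never leaves the ministry.
        return "ministry_appliance"
    return "shared_platform" if tier in SHAREABLE_TIERS else "ministry_appliance"


if __name__ == "__main__":
    for label in ("public", "Restricted", "classified", "unknown"):
        print(label, "->", destination(label))
```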
4. Long-term governance (PDF/A-3, integrity hashes)
A digitization programme is judged in 2076, not 2026. Three governance choices keep the archive alive:
- PDF/A-3 wrapping. Each digitised bundle is stored as ISO 19005-3 PDF/A-3, with the original scan as the visual layer, OCR text as a hidden searchable layer, and structured metadata (entities, classification, retention class) embedded as XMP. PDF/A-3 also allows the original raw TIFF to ride inside the file as an attachment, satisfying the "preserve the original artefact" rule that NRAA inspectors enforce.
- SHA-256 integrity hashes. Every page image, every OCR text layer, and every PDF/A-3 file is hashed and the hash is stamped into an append-only ledger. Tampering with any byte invalidates the chain. A ministry IT auditor twenty years from now can re-hash the archive and prove integrity in minutes.
- Format migration plan. Every five years the archive is verified against the latest PDF/A profile and migrated forward. The pipeline is the same one used during ingestion, so migration is rehearsed code, not heroics.
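One way to realise the integrity ledger is a hash chain, in which each entry commits to the hash of the entry before it. The sketch below assumes that design; the field names are hypothetical, and the in-memory list stands in for whatever write-once storage the ministry actually uses.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def _entry_hash(body: Dict[str, str]) -> str:
    # Canonical JSON (sorted keys) so the same body always hashes the same way.
    return sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8"))


def append_entry(ledger: List[Dict[str, str]], object_id: str, content: bytes) -> Dict[str, str]:
    """Hash one stored object and chain it onto the ledger."""
    prev_hash = ledger[-1]["entry_hash"] if ledger else GENESIS_HASH
    body = {
        "object_id": object_id,
        "content_hash": sha256_hex(content),
        "prev_hash": prev_hash,
    }
    entry = dict(body, entry_hash=_entry_hash(body))
    ledger.append(entry)
    return entry


def verify(ledger: List[Dict[str, str]], contents: Dict[str, bytes]) -> bool:
    """Re-hash every object and every link; False on any mismatch."""
    prev_hash = GENESIS_HASH
    for entry in ledger:
        body = {k: entry[k] for k in ("object_id", "content_hash", "prev_hash")}
        data = contents.get(entry["object_id"])
        if data is None or entry["prev_hash"] != prev_hash:
            return False
        if _entry_hash(body) != entry["entry_hash"]:
            return False
        if sha256_hex(data) != entry["content_hash"]:
            return False
        prev_hash = entry["entry_hash"]
    return True


if __name__ == "__main__":
    store = {"page-0001.tiff": b"scan bytes", "page-0001.txt": "نص".encode("utf-8")}
    ledger: List[Dict[str, str]] = []
    for name, data in store.items():
        append_entry(ledger, name, data)
    print(verify(ledger, store))
```

Because each entry includes the previous entry's hash, altering or removing an earlier record breaks every later link, which is what lets an auditor detect tampering by re-hashing.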
Across all of this, the rule is the same one set out in the pillar piece: the model, the embeddings, the audit log, and the original scans never leave Omani soil. The ministry runs the appliance, holds the keys, and revokes any operator account in seconds.
Talk to us
If you are scoping a ministry-level digitization programme, an NRAA-aligned records modernisation, or a single-directorate pilot, email [email protected] for a one-hour briefing. We bring sample throughput numbers from comparable Arabic OCR workloads and a sovereign deployment plan that fits the Documents and Archives Law.
Frequently asked
Why not just send ministry archives to a cloud OCR API?
Ministry archives include classified, restricted, and personal-data files that cannot legally leave Oman under PDPL and the Documents and Archives Law. Cloud OCR APIs send page images to the vendor's data centre and may retain them for model improvement. A sovereign on-premise pipeline keeps every page, every embedding, and every audit log inside the ministry network.
Which OCR engine works best for Arabic government paper?
There is no single answer. Modern stacks ensemble three engines: Surya for layout and reading order, TrOCR-Arabic or a vision-language model for handwritten Naskh, and Tesseract Arabic as a cheap baseline for clean modern typeset pages. The KITAB-Bench 2025 evaluation shows vision-language models lead on character error rate while Surya remains strong on layout.
Do we need to retype handwritten files, or can OCR cope?
Modern Arabic handwritten OCR reaches 88 to 94 percent character accuracy on clean Naskh and Ruq'ah, which is good enough to populate a searchable index but not good enough to replace the source. The right pattern is dual storage: searchable OCR text plus the original image, with a human-in-the-loop review queue for high-value documents.
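For readers who want to check accuracy on their own sample pages, character accuracy is commonly computed as one minus the character error rate, where the error rate is the edit distance between the OCR output and a human transcription divided by the transcription length. The sketch below assumes that convention; whether the figures quoted above were measured this way is not stated.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, floored at 0; the reference is the human transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    cer = edit_distance(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)


if __name__ == "__main__":
    # One wrong final letter out of five characters.
    print(character_accuracy("وزارة", "وزاره"))
```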
How is the digitised archive preserved for the long term?
PDF/A-3 is the ISO archival format of choice. Each digitised file is wrapped as PDF/A-3 with the original scan as the visual layer, the OCR text as a hidden searchable layer, and the structured metadata (entities, classifications, hashes) embedded as XMP. SHA-256 integrity hashes are stamped into a write-once ledger so any tampering is detectable years later.