Arabic OCR and Document Triage Patterns for Defence
A defence intake desk in Muscat is not a tidy queue of clean PDFs. It is a stream of phone-photographed paper, faxed annexes, scanned situation reports in mixed quality, screenshots from chat apps, and the occasional handwritten field note. Most of it is Arabic. A meaningful slice mixes Arabic body text with English place names, Latin-script callsigns, and Hindi-Arabic numerals. Cloud OCR is off the table on this material. This piece walks through the open-source OCR options that survive that reality and the triage layer that turns recognised text into routed, prioritised analyst work.
1. The Arabic OCR challenge for defence
Defence-grade Arabic intake combines four properties that each break a naive OCR pipeline, and they almost always co-occur on the same document:
- Mixed scripts. A single intercept summary can carry Arabic narrative, English unit designations, Latin equipment model numbers, and Hindi-Arabic plus Western digits in the same paragraph. Bidi handling has to be right at the recognition layer, not patched afterwards.
- Handwritten captures. Field notes, source debriefs, and margin annotations on otherwise typed documents are written by hand in cursive Naskh, sometimes with idiosyncratic letter forms that vary across a single page.
- Low-quality scans. Pages arrive as phone photos at oblique angles, fax-derived greyscale, microfilm-sourced TIFFs, and bleed-through carbon copies. Deskew, denoise, and binarisation have to do real work before recognition runs.
- Cursive Naskh fonts. Arabic letters connect by default and change shape by position. Recognition models trained on Latin scripts cut characters mid-ligature; even Arabic-tuned models trained on modern web text underperform on the older Naskh-derived fonts found in legacy ministerial documents.
Triage is what the operator actually wants from the system: not raw text, but a routed, ranked queue of documents tagged by type, entity, urgency, and analyst owner. OCR is the floor. Triage is the value.
2. Open-source OCR options for on-prem
Three open-source recognisers cover the defence intake distribution between them, and a working stack uses all three:
- Tesseract 5 with the Arabic LSTM model. The workhorse for clean machine-printed Arabic. Independent fine-tuning studies on Arabic fonts have cut character error rates by up to 61 percent compared with the stock model, and a baseline study reports an average CER around 14 percent and overall accuracy near 88 percent on printed Arabic. Tesseract is fast, CPU-friendly, and trivial to redeploy offline.
- Surya. A modern transformer toolkit covering OCR, layout analysis, reading order, and table recognition in over 90 languages including Arabic. Comparative reviews report Surya near 97 percent overall accuracy on a multilingual mix, with around 87 percent on handwritten material. It is the right second-stage recogniser when Tesseract's confidence drops on a page.
- TrOCR. An end-to-end transformer with a Vision Transformer encoder and a pretrained text decoder. TrOCR achieved state-of-the-art results on printed-text and IAM-handwritten benchmarks. With an Arabic-focused fine-tune it becomes the natural recogniser for handwritten field notes and historical handwritten annexes.
The selection logic is mechanical. Layout analysis labels each region. Machine-printed regions route to Tesseract first, then to Surya on low confidence. Handwritten regions route to TrOCR. Tables route to Surya's table recogniser. Everything stays inside the appliance.
3. The triage layer above OCR
Recognised text without triage is a pile of tokens. The triage layer turns that pile into a decision queue, and it has three components:
- Document classification. A small Arabic-tuned classifier labels each document by type: intercept summary, signed order, situation report, informant note, open-source clipping, supplier invoice, court referral. The label drives every downstream routing rule.
- Named-entity recognition. An Arabic NER model fine-tuned on defence correspondence extracts persons, units and formations, locations down to wilayah and village level, weapon and equipment types, dates (Hijri and Gregorian), file numbers, and monetary amounts. The same model handles the embedded Latin runs (English place names, Latin callsigns) without flipping the bidi.
- Entity-event extraction. A larger Arabic-capable LLM running on-prem reads the cleaned text plus the entity spans and emits structured tuples: actor, action, object, place, time, source confidence. Those tuples are what an analyst actually consumes, joined into a graph that survives across documents.
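A minimal shape for those tuples, assuming a Python pipeline; the field names mirror the prose, and `obj` stands in for `object`, which is a Python builtin:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EntityEvent:
    """One structured tuple emitted by the extraction model."""
    actor: str
    action: str
    obj: str                  # the object of the action
    place: str
    time: str                 # Hijri or Gregorian date string as recognised
    source_confidence: float  # the model's own estimate, carried for routing

    def key(self) -> tuple:
        # Join key used when merging tuples into the cross-document graph;
        # time and confidence stay out of the key so repeated reports of
        # the same event collapse onto one node.
        return (self.actor, self.action, self.obj, self.place)
```

Keeping the tuple frozen and hashable is deliberate: graph joins across documents become set operations over `key()` values.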
Each triage decision carries a confidence score. Low-confidence documents land in a human verification queue with the original page image, the OCR overlay, and the proposed labels. High-confidence documents route directly to the analyst owning the relevant subject area. This is the layer that earns the air-gap, and it is the layer that makes AI-driven Arabic triage for defence a credible programme rather than a science experiment.
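The routing rule itself reduces to a threshold check. The 0.75 cut-off and the queue names below are illustrative assumptions; in practice the threshold is tuned per document type against the eval corpus:

```python
VERIFY_THRESHOLD = 0.75  # assumed; set per document type from eval data

def route_document(doc_type: str, confidence: float, owners: dict[str, str]) -> str:
    """Return the queue a triaged document lands on.

    owners maps document-type labels to analyst queues. Anything
    low-confidence, or of a type with no registered owner, goes to
    the human verification queue alongside its page image and overlay.
    """
    if confidence < VERIFY_THRESHOLD:
        return "human-verification"
    return owners.get(doc_type, "human-verification")
```

Defaulting unknown types to verification rather than guessing an owner is the conservative choice this kind of desk wants.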
4. Architecture pattern
A working on-prem deployment has four planes that map cleanly onto a Hosn appliance:
- Ingest plane. Scanners, intake watchfolders, and operator upload endpoints feed a queue. The plane deskews, denoises, splits multi-page bundles on letterhead detection, and writes the source page image to immutable storage with a hash-bound chain of custody.
- Recognition plane. Layout analysis dispatches regions to Tesseract, Surya, or TrOCR. Output is normalised: alef variants unified, kashida stripped, diacritics preserved on classified-quote fields, Hindi-Arabic numerals optionally folded to Western digits.
- Triage plane. Classification, NER, and entity-event extraction run against the cleaned text. Confidence scores drive routing. An audit log records every model output, every routing decision, and every analyst access.
- Access plane. Analysts query through an internal portal that returns highlighted hits inside the original page image. Role-based access enforces clearance. Nothing leaves the perimeter, and nothing reaches a model the operator did not approve.
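The normalisation step on the recognition plane can be sketched with standard Unicode ranges. The policy shown (strip kashida, unify alef variants, optionally fold Hindi-Arabic digits, keep diacritics) follows the prose above, but the exact policy is the operator's call:

```python
# Alef variants folded to bare alef (U+0627).
ALEF_VARIANTS = str.maketrans({
    "\u0622": "\u0627",  # alef with madda above
    "\u0623": "\u0627",  # alef with hamza above
    "\u0625": "\u0627",  # alef with hamza below
})

# Arabic-Indic digits U+0660-U+0669 -> Western 0-9.
HINDI_TO_WESTERN = str.maketrans(
    "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669",
    "0123456789",
)

KASHIDA = "\u0640"  # tatweel: pure elongation, safe to strip

def normalise(text: str, fold_digits: bool = True) -> str:
    """Normalise recognised Arabic text for the triage layer."""
    text = text.replace(KASHIDA, "")
    text = text.translate(ALEF_VARIANTS)
    if fold_digits:
        text = text.translate(HINDI_TO_WESTERN)
    # Diacritics (U+064B-U+065F) deliberately left in place for
    # classified-quote fields, per the recognition-plane policy.
    return text
```

Digit folding stays optional because file numbers quoted verbatim in legal referrals may need to survive in their original script.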
5. Eval methodology with a declassified corpus
An OCR plus triage stack must be measured before it is trusted. The institution builds a small evaluation corpus from declassified or synthetic material that mirrors real intake: 200 to 500 pages spanning the document types the desk actually sees, hand-labelled for ground truth at the character, entity, and document-type level. The pipeline reports CER and WER on the OCR layer, F1 on the NER layer, top-1 and top-3 accuracy on the classifier, and tuple-level precision and recall on entity-event extraction. Every model swap reruns the same eval. The number that matters most is end-to-end: how often a document arriving cold lands on the correct analyst's queue with the correct entities tagged. Anything below that is a vendor benchmark.
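CER and WER both come out of the same edit-distance recurrence, run over characters and over whitespace tokens respectively; a minimal sketch of the OCR-layer metrics:

```python
def levenshtein(a, b) -> int:
    """Edit distance between two sequences, standard DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (x != y)))     # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits over reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: same recurrence over whitespace tokens."""
    ref = reference.split()
    return levenshtein(ref, hypothesis.split()) / max(len(ref), 1)
```

Running exactly this pair of functions over the frozen ground-truth corpus after every model swap is what makes the eval numbers comparable across releases.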
Brief us on your intake
If you operate a defence intake desk, an internal-security analysis cell, or a counter-terrorism research unit that is still hand-routing Arabic paper, the OCR plus triage stack is buildable on current open-source models, and it is buildable on-prem. Email [email protected] for a one-hour briefing. Bring a representative slice of the intake, redacted as needed, and we will walk you through the pipeline against your actual material rather than a sales deck.
Frequently asked
Why not just use a cloud OCR service for defence documents?
Defence intake mixes intercept summaries, signed orders, situation reports, and informant notes. Routing those through a cloud OCR API exposes content to foreign legal regimes and to retention beyond the operator's control. On-prem OCR keeps every page inside the perimeter and lets the operator log every read.
How accurate is open-source Arabic OCR on real defence material?
On clean modern machine-printed Arabic, stock Tesseract 5 LSTM models show character error rates around 14 percent and word error rates around 41 percent; fine-tuning on an institution's own fonts has cut CER by as much as 61 percent in published studies. Transformer recognisers like Surya and TrOCR push higher on typed text but still need fine-tuning on the institution's own font samples and handwriting styles for legacy forms.
What does the triage layer extract that OCR alone cannot?
OCR returns text. Triage returns decisions. The triage layer classifies each document by type (intercept, order, situation report, informant note), extracts entities (persons, units, locations, weapon systems, dates), surfaces entity-event tuples (who did what to whom, where, when), and assigns a routing label so analysts only see what matches their queue.
Can the OCR plus triage stack run fully air-gapped?
Yes. Tesseract, Surya, TrOCR, the Arabic NER models, and the triage LLM all run on the on-prem appliance with no outbound network. Model updates ship as signed offline bundles. The same posture applies whether the workload sits inside a defence ministry, an internal-security directorate, or a counter-terrorism analysis cell.