AI Patterns for Geological and Subsurface Literature Search

Upstream operators in the GCC sit on top of fifty years of subsurface paperwork. The archive spans well-completion reports from the 1970s, daily drilling reports stamped on telex paper, biostratigraphic notes scribbled by retired palaeontologists, seismic interpretation memos signed off across three different vendor packages, and decades of internal technical papers describing every play and prospect ever drilled. The geology team needs a working answer to "what did we already learn about this formation" in minutes, not in three days of corridor archaeology. This is a textbook retrieval-augmented generation problem, and it pairs naturally with our pillar article on defence-AI Arabic triage, which establishes the same on-prem pattern for a different sensitivity class.

The subsurface-literature search problem

Three artefacts dominate the corpus, and none of them search well in the systems that already exist on an operator's network.

  • Well logs and completion reports. Petrophysical interpretations, lithology columns, and casing diagrams are stored in dedicated platforms (Petrel, Techlog, the legacy OpenWorks instance), but the surrounding narrative (why a zone was abandoned, which fluid was tested, how a sidetrack was justified) lives in scanned PDFs that the existing platform indexes only by well name and date.
  • Seismic interpretation memos. Decades of structural and stratigraphic interpretations, vendor reprocessing reports, and internal QC notes. Mostly PDF, frequently with embedded figures and tables that contain the actual conclusion.
  • Geology reports and technical papers. Internal play-evaluation studies, prospect reviews, post-drill analyses, and conference papers (SPE, AAPG, IPTC) authored by staff or vendors over half a century. Some scanned, some born-digital, some stored on departmental shares no-one has indexed since 2008.

The shared failure mode is brittle keyword search. A geologist asking "has anyone described a karstified Khuff equivalent in the southern asset" has to know the right operator-specific spelling for the formation, the right vendor for the play, and the right legacy filename convention. Most of the corpus is invisible to that query even though the answer is in it.

Domain-specific embedding and retrieval

Petroleum-engineering language is recognisably its own dialect. "Pay zone", "kick", "gauge ring", "Khuff carbonate", "facies belt", and the units that follow them carry meaning that a generic web-trained embedding model collapses or mis-clusters. The published evaluations make the gap concrete. Aljarbouh et al. (2024) describe a Petroleum-Engineering-LLM that posts 5 to 12 point gains on domain retrieval and named-entity recognition over generic baselines, while earlier work on PetroBERT (Rezende et al., 2022) showed similar effects on Brazilian upstream corpora.

The practical pattern for a sovereign on-prem deployment has three steps. First, pick a strong multilingual base (BGE-M3 if cross-lingual coverage matters, multilingual-e5-large for a smaller footprint). Second, train a lightweight domain adapter on a sample of the operator's own corpus; three to five thousand labelled query-document pairs are enough for a meaningful gain. Third, run a cross-encoder re-ranker on the top 50 hits before passing to the generation model. The retriever stays small enough to live alongside the generation model on the same appliance, and re-training the adapter is a quarterly task, not a research project.
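The two-stage shape of that pattern can be sketched in a few lines. This is a minimal, illustrative sketch: the toy cosine scorer and the injected `cross_score` function stand in for the real bi-encoder (e.g. BGE-M3) and cross-encoder models, which would be loaded from an on-box inference runtime in practice.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, index, top_k=50):
    """Stage 1: cheap bi-encoder pass over the whole index.

    `index` maps document id -> embedding vector.
    """
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]

def rerank(query, candidates, cross_score, top_n=5):
    """Stage 2: expensive cross-encoder scoring on the short list only."""
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)[:top_n]
```

The design point is cost asymmetry: the bi-encoder scores the whole index cheaply, and the cross-encoder, which must read query and document together, only ever sees the top 50.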

Two pitfalls to budget for. PDFs from the 1970s and 1980s need real OCR with figure-and-table awareness; a naive OCR pass loses the captions that carry the actual interpretation. And formation-name normalisation deserves its own pre-processing layer, because the same unit appears under three or four different conventions across the corpus.
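The normalisation layer can be as simple as an alias table applied before indexing and at query time. A minimal sketch, with a hypothetical alias table (the formation names and spellings below are illustrative, not the operator's actual conventions):

```python
import re

# Hypothetical alias table: the same unit appears under several legacy
# spellings and vendor conventions across the corpus.
FORMATION_ALIASES = {
    "khuff": ["khuff", "al khuff", "khuff fm", "khuff formation"],
    "natih": ["natih", "natih fm", "natih formation"],
}

# Invert to a flat lookup: cleaned alias -> canonical formation name.
_CANONICAL = {
    alias: canon
    for canon, aliases in FORMATION_ALIASES.items()
    for alias in aliases
}

def normalise_formation(raw):
    """Map a raw formation mention to its canonical name, or None."""
    key = raw.lower()
    key = re.sub(r"[-_]", " ", key)        # hyphenated variants
    key = re.sub(r"[^a-z ]", "", key)      # drop punctuation ("Fm.")
    key = re.sub(r"\s+", " ", key).strip()
    return _CANONICAL.get(key)
```

Running the same normaliser over both the chunk metadata and the incoming query is what lets "AL-KHUFF" in a 1978 report match "Khuff Fm." in the geologist's question.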

Why on-prem for upstream operators

The corpus encodes the operator's reserves position and exploration thesis. Where a firm is looking next, which plays it has quietly de-risked, which prospects it has parked and why, all of that is implicit in the documents and in the queries staff ask against them. Three implications follow.

  • Reserves data is competitive intelligence. The SEC and home-country regulators define reserves disclosure tightly; the underlying technical evidence behind a reserves booking is not for outside consumption. An external embedding API holds vectors that can in principle be inverted, plus query logs that map to where the firm is currently focused.
  • Exploration thesis leaks via queries. Even with the documents un-shared, the pattern of questions ("what do we know about pre-Cambrian basement plays in block X") is itself a signal. Sovereign-class operators do not export that signal to a third-party endpoint.
  • National data laws frame the choice. The Omani PDPL (Royal Decree 6/2022) and the wider GCC trend toward data-residency expectations close the gap between "we prefer on-prem" and "we deploy on-prem". The corpus, the index, the model, and the query path all live inside the operator's perimeter.

Mu'een, Oman's national shared-AI platform, exists for the cohort of public-sector workloads where shared infrastructure is the right answer. Subsurface literature for an upstream operator is not that cohort.

Architecture sketch

The shape that survives a security review without compromising on usefulness:

  1. Ingestion. One-time crawl of the document shares, with OCR for scanned material and a figure-and-table extraction pass. Output is plain-text chunks plus original-figure references, all retained on-box.
  2. Chunking and metadata. Semantic chunks of 300 to 600 tokens with 10 to 20 percent overlap. Per-chunk metadata: well name, formation, basin, year, vendor, classification level, and a source-system identifier so the geologist can jump back to Petrel or Techlog from a hit.
  3. Embedding. Multilingual base model (BGE-M3 first), optional domain adapter, normalised vectors stored in a local vector database (Qdrant or pgvector, on the same appliance).
  4. Retrieval and re-ranking. Bi-encoder retrieval over the index, cross-encoder re-ranker on the top 50, optional metadata filter (asset, basin, year range) supplied by the operator UI.
  5. Generation. A long-context Arabic-and-English-capable model (Gemma 4 256k or Qwen 3.6) running on the same appliance, prompted with the top chunks plus the geologist's question, citations back to the source documents and source-system IDs.
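The chunking step above reduces to a sliding window once the text is tokenised. A minimal sketch over a pre-tokenised list; the size and overlap values match the 300-to-600-token, 10-to-20-percent guidance, but the exact tokeniser and parameters would be tuned on the operator's own corpus:

```python
def chunk_tokens(tokens, size=400, overlap=60):
    """Sliding window over a token list: fixed size, fixed overlap.

    Overlap keeps a conclusion that straddles a chunk boundary
    retrievable from either side.
    """
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```

In production the same loop would attach the per-chunk metadata (well, formation, year, source-system identifier) before the embedding step.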

The whole pipeline is a small, auditable artefact: a documented set of model files, one index, one query path, all inside the operator's perimeter. It is exactly the kind of component that should run on a sovereign appliance rather than against an external API.

If your team is sizing this for an upstream operator and would like a one-hour briefing on corpus shape, model choice, and the deployment pattern that fits a specific asset organisation, the next step is simple. Email [email protected] or message +968 9889 9100. We will walk through the corpus mix, classification levels, and a credible evaluation plan.

Frequently asked

Why not just put the geological corpus on a cloud retrieval API?

Because the corpus encodes the firm's exploration thesis and reserves position. Once a well log, a seismic interpretation memo, or a play-evaluation note has been embedded by an external API, the provider holds vectors that can be inverted, queries that can be logged, and a usage signal that maps to where the firm is looking next. For an upstream operator in the GCC that is a competitive sensitivity, not a compliance footnote. Keeping the vectors and the query path inside the operator's perimeter removes that exposure.

What does a geological retrieval index actually contain?

Mostly text and figures extracted from well-completion reports, daily drilling reports, mud logs, core descriptions, biostratigraphic notes, seismic interpretation memos, and decades of internal technical papers. Numerical log curves and seismic volumes stay in their own systems (Petrel, Techlog, OpenWorks); the AI index points back to those system identifiers and complements them with a searchable text and figure layer.
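A per-chunk record with those pointers back to the source systems might look like the following. This is an illustrative schema, not a standard; the field and system names are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkRecord:
    """One indexed chunk; field names are illustrative."""
    chunk_id: str
    text: str              # OCR'd or born-digital chunk text
    well: str
    formation: str         # canonical, post-normalisation
    year: int
    classification: str
    source_system: str     # e.g. "Petrel", "Techlog", "OpenWorks"
    source_ref: str        # identifier to jump back to that platform

def matches(record, **filters):
    """Apply the metadata filters from the operator UI to one record."""
    return all(getattr(record, k) == v for k, v in filters.items())
```

The `source_system` and `source_ref` fields are what keep the index a complement to Petrel or Techlog rather than a copy: the numerical data never leaves its platform, and every hit carries a pointer home.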

Does a domain-specific embedding model really beat a general one?

On petroleum-engineering language the gap is real but bounded. Published work on PetroBERT and Petroleum-Engineering-LLM evaluations (see arXiv 2409.02428 and the IPTC 2024 SPE-language papers) shows 5 to 12 point gains on retrieval and named-entity tasks over generic baselines. The practical pattern is to start with a strong multilingual base (BGE-M3 or multilingual-e5-large), add a lightweight domain adapter trained on your own corpus, and re-evaluate quarterly.

Can a Hosn appliance host this for an upstream operator?

Yes. A Tower-tier appliance handles a single asset team's corpus, low millions of chunks, with sub-100 ms retrieval and on-box generation. A Rack-tier configuration scales to a full E&P organisation. Pricing is by quotation, sized to corpus volume, query rate, and the rest of the model stack the operator wants to run alongside the retriever.