Falcon Arabic at the Edge: Sizing for Branch Deployments
Sovereign Omani institutions rarely operate from a single building. A bank has 60 branches across the wilayats. A regulator has field offices in Salalah, Sohar, and Sur. A defence agency has stations that can lose connectivity for a working day. Putting Arabic-fluent AI in the headquarters data centre is the easy half of the design; the harder half is putting capable, governed Arabic AI inside every branch without sending sensitive documents back to HQ. This article is the sizing playbook for Falcon Arabic at the edge, the supporting companion to our deeper piece on the Falcon Arabic LLM and TII's role in sovereign Arabic NLP.
Branch-office reality: 5 to 50 concurrent users in one rack unit
Branch sites have constraints HQ does not. A typical Omani branch fits its server, switch, and UPS in a single half-rack inside a back office, is fed from a single 16-amp single-phase circuit, and has neither raised flooring nor proper cooling for a 4U H100 chassis. The realistic concurrent-user count is small: 5 to 10 tellers, 20 to 30 case officers, 40 to 50 in a busy regional centre.
- Power envelope: 600 W to 1,200 W steady-state for the AI appliance, peaking under 1,800 W. Anything that needs a dedicated 32-amp three-phase drop is a non-starter.
- Acoustic budget: branch back rooms sit next to staff desks; loud rack-mount GPUs (H100 SXM, 8-fan 4U boxes) fail the noise test that HQ data halls never had to pass.
- Cooling envelope: a split-AC unit, not a CRAC. Nearly all of the appliance's electrical draw ends up as heat, so sustained 700 W of heat output is the realistic ceiling a single split unit can absorb.
- Maintenance reality: the branch IT lead is one person, often a generalist. The appliance must be a single SKU that boots, registers with HQ, and runs.
This is the box where Falcon Arabic has to fit, and where 70B-class models do not.
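The constraints above reduce to a screening test any candidate appliance can be run against before procurement. A minimal sketch; the 230 V nominal mains and the 2U height ceiling (leaving half-rack room for the switch and UPS) are illustrative assumptions, not figures from this article:

```python
# Screen a candidate appliance against the branch envelope described above.
# Assumptions: 230 V nominal mains, 2U height ceiling. Nearly all electrical
# draw ends up as heat, so steady-state watts double as the cooling load on
# the split-AC unit.

CIRCUIT_LIMIT_W = 16 * 230   # 16 A single-phase circuit
COOLING_LIMIT_W = 700        # sustained heat a split-AC can realistically absorb

def fits_branch(peak_w: float, steady_w: float, rack_units: int) -> bool:
    return (
        peak_w <= CIRCUIT_LIMIT_W
        and steady_w <= COOLING_LIMIT_W
        and rack_units <= 2
    )

print(fits_branch(peak_w=480, steady_w=350, rack_units=1))    # Mac Studio class -> True
print(fits_branch(peak_w=2600, steady_w=1800, rack_units=4))  # 4U H100 chassis -> False
```

Note that the second check fails on cooling and form factor before it ever fails on the circuit: a 16-amp drop can technically feed an H100 box, but the back office cannot absorb the heat.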
Falcon Arabic size variants and quantisation choices for edge
TII published Falcon Arabic as a 7B-parameter model on the Falcon 3 architecture, with a 32K-token context window and a tokenizer extended by 32,000 Arabic-specific tokens, per the official TII model announcement on Hugging Face. That 7B size is the gift to edge buyers: it lands inside the memory and power envelopes a branch can actually host.
Three quantisation tiers matter for branch sizing:
- FP16 / BF16 (unquantised): roughly 14 GB for weights. Comfortable on any modern 24 GB+ GPU, runs at full quality, the natural choice when the appliance has headroom.
- FP8 or INT8: roughly 7 GB for weights, with negligible Arabic quality drop on Falcon-3 class models. Frees KV cache budget to support more concurrent users on the same card.
- INT4 / Q4_K_M GGUF: roughly 4 GB for weights. The right tier for the smallest edge units, including Apple Silicon laptops in mobile-office scenarios. Pair with an evaluation pass on Arabic MMLU and MadinahQA so the institution sees the actual quality cost on its own corpus, not a vendor's headline number.
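The three tiers map to weight footprints you can estimate directly from the parameter count. A back-of-envelope sketch; real checkpoints add a few percent for embeddings and quantisation block overhead, so treat these as floors:

```python
# Back-of-envelope weight memory for a 7B model at each quantisation tier.
# Real checkpoints add overhead (embeddings, quant scales), so these are
# floors, not exact file sizes.

PARAMS = 7.0e9
BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,   # 16-bit weights
    "fp8/int8":  1.0,   # 8-bit weights
    "int4":      0.5,   # 4-bit weights (GGUF Q4 is close to this)
}

for tier, bpp in BYTES_PER_PARAM.items():
    gib = PARAMS * bpp / 2**30
    print(f"{tier:>9}: ~{gib:.1f} GiB weights")
```

This lands at roughly 13, 6.5, and 3.3 GiB respectively, which rounds to the 14 / 7 / 4 GB tiers above once decimal-GB marketing rounding and overhead are applied.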
For a deeper treatment of GGUF Q4 versus Q5 trade-offs on Arabic, see our piece on quantisation choices for Arabic LLMs.
Hardware sizing: M3 Ultra, RTX 6000 Ada, Strix Halo
Three platforms cover the realistic branch envelope in 2026. Pricing data below is drawn from the public hardware survey at BSWEN's RTX PRO 6000 vs Mac Studio local-LLM analysis and from Apple and NVIDIA's published specifications.
- Mac Studio M3 Ultra, 256 GB unified memory. Silent, 480 W peak, ships in a desktop chassis roughly 20 cm square. Holds Falcon Arabic 7B at FP16 with well over 200 GB of headroom for KV cache and a second model. Best for branches with 5 to 25 concurrent users and bilingual document workloads.
- RTX 6000-class GPU in a 2U workstation chassis: 48 GB VRAM on the Ada generation, 96 GB on the Blackwell-era RTX PRO 6000, around 300 W typical. Runs Falcon Arabic 7B at FP8 or INT8 with substantial headroom. The right pick for a busy branch with 30 to 50 concurrent users, especially when the institution wants vLLM-style continuous batching.
- AMD Strix Halo workstation, 128 GB unified memory. The newest entrant in the edge class. Lower acquisition cost than the M3 Ultra, slightly less memory, comparable Arabic inference latency on 7B models at INT4. Useful when an institution has standardised on AMD across the estate.
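What separates these platforms in practice is KV-cache headroom, and that can be budgeted with simple arithmetic. In the sketch below the layer count, KV-head count, and head dimension are illustrative placeholders for a grouped-query-attention 7B model, not published Falcon Arabic specs; substitute the real config before sizing a purchase:

```python
# Concurrent-session budget from KV-cache memory, the real edge bottleneck.
# LAYERS / KV_HEADS / HEAD_DIM are assumed values for a GQA 7B model, NOT
# the published Falcon Arabic configuration.

LAYERS, KV_HEADS, HEAD_DIM = 28, 4, 128
KV_BYTES_PER_ELEM = 2        # fp16 KV cache

def kv_bytes_per_token() -> int:
    # Both K and V are cached for every layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES_PER_ELEM

def max_sessions(cache_budget_gib: float, ctx_tokens: int) -> int:
    return int(cache_budget_gib * 2**30) // (kv_bytes_per_token() * ctx_tokens)

# e.g. a 48 GB card running 8-bit weights (~7 GB) leaves ~30 GiB for cache
# after runtime overheads; at 8K average live context per session:
print(max_sessions(cache_budget_gib=30, ctx_tokens=8192))
```

Under these assumptions some 60-plus 8K-context sessions fit in cache, which is why the practical ceiling of 30 to 50 users at a busy branch tends to come from prefill compute and latency targets rather than memory.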
For a fuller comparison of these three platforms, including raw tokens-per-second numbers, see our companion piece on edge GPU appliances for branch offices.
Sync architecture: signed adapter updates, no direct connectivity
The harder design question is not how to run Falcon Arabic at the branch; it is how to update it. Sovereign branches typically cannot reach the public internet, and many cannot reach HQ over a continuous network either. The Hosn pattern uses signed offline bundles:
- HQ as the model factory. Fine-tuning, LoRA adapter training, evaluation, and red-teaming all happen in the HQ environment, on the institution's full corpus, under the institution's control.
- Versioned signed bundles. Each release is packaged as a manifest plus weights and adapters, signed with an HQ private key whose public counterpart is embedded in every branch appliance at provisioning.
- One-way transport. Bundles travel by encrypted USB, by data diode for classified branches, or over a tightly-scoped private MTCIT circuit when policy permits. There is no path from the branch back to HQ for arbitrary outbound traffic.
- Stage-and-promote. The branch appliance applies the bundle to a staging slot, runs a self-test on a held-out evaluation set, and only promotes to active if the self-test passes. A failed update never takes a branch offline.
- Telemetry, opt-in. Operational metrics (latency, error rates, model version) flow back to HQ on the institution's terms, never the contents of any user prompt.
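The verify-stage-promote loop on the appliance can be sketched in a few dozen lines. To keep the example self-contained it uses an HMAC as a stand-in for the real signature; a production bundle would be signed asymmetrically (e.g. Ed25519), so the branch holds only the public key, exactly as described above. The file names (`manifest.json`, `manifest.sig`) are illustrative:

```python
import hashlib
import hmac
import json
import shutil
from pathlib import Path

# Stage-and-promote for a signed model bundle, as outlined above.
# NOTE: HMAC is a self-contained stand-in for the real asymmetric signature;
# with HMAC the branch would hold the signing key too, which is NOT the
# sovereign pattern. Swap in e.g. Ed25519 verification in production.

HQ_KEY = b"demo-signing-key"  # stand-in for HQ key material

def sign(manifest_bytes: bytes) -> str:
    return hmac.new(HQ_KEY, manifest_bytes, hashlib.sha256).hexdigest()

def verify_and_promote(bundle: Path, staging: Path, active: Path, self_test) -> bool:
    manifest_bytes = (bundle / "manifest.json").read_bytes()
    # 1. Reject the bundle outright on a bad signature.
    if not hmac.compare_digest((bundle / "manifest.sig").read_text(),
                               sign(manifest_bytes)):
        return False
    manifest = json.loads(manifest_bytes)
    # 2. Check every payload file against its manifest hash, copy into staging.
    staging.mkdir(parents=True, exist_ok=True)
    for name, digest in manifest["sha256"].items():
        data = (bundle / name).read_bytes()
        if hashlib.sha256(data).hexdigest() != digest:
            return False
        (staging / name).write_bytes(data)
    # 3. Promote to active only if the held-out self-test passes in staging.
    if not self_test(staging):
        return False
    if active.exists():
        shutil.rmtree(active)
    staging.rename(active)
    return True
```

The key property is that every exit path before the final `rename` leaves the active slot untouched, so a bad signature, a corrupt payload, or a failed self-test never takes a branch offline.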
Mu'een, Oman's national shared-AI platform, offers a complementary path for institutions that prefer a centrally-hosted bilingual model; the on-premise edge pattern above is the right answer when classification, latency, or connectivity rule that out.
Closing
Falcon Arabic 7B is the rare Arabic model that fits the constraints a branch actually has: a single rack unit, a single 16-amp circuit, a single IT generalist, and an air gap. The work is in the sizing discipline (right quantisation, right hardware tier, honest concurrent-user budget) and in the sync architecture that keeps branches current without ever opening them to the public network. Email [email protected] for a one-hour briefing on a branch-deployment plan tailored to your institution's footprint.
Frequently asked
How many concurrent users can Falcon Arabic 7B realistically serve at a branch?
On a single RTX 6000 Ada with 48 GB or 96 GB VRAM running Falcon Arabic 7B at FP8 or INT4 with vLLM batching, plan for 30 to 50 concurrent interactive sessions at sub-second time-to-first-token. On an M3 Ultra Mac Studio with 256 GB unified memory the same model serves 10 to 25 concurrent users with longer prefill but adequate generation throughput. The hard ceiling at the edge is rarely raw compute; it is the KV cache budget when several users hold long-context sessions at once.
Does the Falcon Arabic license allow on-premise deployment in Oman?
The base Falcon-3 family ships under the TII Falcon LLM License, which permits commercial on-premise use. Falcon Arabic itself is currently distributed primarily through the TII chat endpoint and partner channels; for sovereign on-premise deployment, the practical pattern in 2026 is to run the closest open Falcon-3 7B base with the published Arabic tokenizer extension, then layer institution-specific fine-tuning. Hosn confirms licensing terms with TII for every sovereign customer before deploying Falcon Arabic weights to production.
Why not just call HQ over a private link instead of putting AI in the branch?
Three reasons. First, branch sites in remote wilayats sometimes lose connectivity for hours; an air-gapped local model keeps the work moving. Second, private circuits to HQ are expensive at the bandwidth required for streaming AI sessions and bilingual document review. Third, classification rules often prohibit certain branch documents from leaving the branch perimeter at all. Edge inference solves all three at once, at the cost of running and updating a small appliance per site.
How are model and adapter updates pushed to branches without internet?
The standard Hosn pattern is signed offline bundles. HQ produces a versioned package containing the new base weights, fine-tuned LoRA adapters, prompt templates, and a manifest signed with an HQ private key. Branches verify the signature against an HQ public key embedded in the appliance, apply the bundle in a staging slot, and roll forward only after a self-test passes. The bundle moves over a one-way diode for classified branches, encrypted USB for unclassified branches, or a private MTCIT circuit when permitted. No branch ever needs to reach a public network.