BYOC: Bring-Your-Own-Compute Certification for Sovereign AI
Not every sovereign buyer wants Hosn to ship a turnkey appliance. Some already own GPU servers from a framework agreement. Some can only buy through a classified-procurement vendor list. Some signed a multi-year deal with HPE or Dell two budget cycles ago and the asset is on the floor. For all of them we run a BYOC programme: bring your own compute, we certify it, harden it, and operate the AI stack on top. This article documents how that certification works and what passes our gate.
1. Why some sovereign buyers must bring their own hardware
The reflexive picture of an on-prem AI deployment is a vendor truck arriving with a sealed appliance. In Oman and the wider GCC the reality is messier. Three patterns drive BYOC requests at Hosn:
- Existing inventory. A ministry or sovereign-bank IT division bought DGX-class or HGX-class boxes for a previous workload (HPC, rendering, model training pilots) and the boards are 60 to 80 percent idle. Re-using them is cheaper than retiring them.
- Framework agreements. Most government bodies in Oman procure compute through a multi-year framework with one or two prime vendors. Buying outside the framework needs Tender Board exemption paperwork. Re-using framework hardware avoids the exemption.
- Classified procurement. Defence and internal-security buyers can only accept hardware from a pre-cleared vendor list, often with serial-number-level provenance tracking. The Hosn appliance SKU may not be on that list. Their existing accredited hardware is.
For these buyers, the question is not "what should we buy" but "can what we already have run a sovereign AI stack at production quality". That is what BYOC certification answers. The deeper hardware-selection conversation (GPU class, memory budget, throughput targets) lives in our pillar piece on the H100, H200, RTX 6000 and Mac Studio AI hardware comparison.
2. The certification matrix
We certify on six axes. A node passes only when every axis is green; one red blocks production sign-off.
- CPU. Intel Xeon Sapphire Rapids/Emerald Rapids/Granite Rapids, or AMD EPYC Genoa/Bergamo/Turin. Minimum 32 physical cores per socket for inference, 64 for training. Older Skylake/Cascade Lake parts fail because they lack the AVX-512 BF16 instructions vLLM's tensor-parallel sharding path expects.
- GPU. NVIDIA H100, H200, B200, GH200, L40S, RTX 6000 Ada, or RTX Pro 6000 Blackwell. AMD MI300X is supported on a case-by-case basis with vLLM-ROCm. Consumer cards (4090/5090) are refused for production. The system itself must appear on the NVIDIA-Certified Systems catalogue or carry an explicit OEM thermal/power validation report.
- NIC. 25 GbE minimum for single-node inference, 100 GbE or NDR InfiniBand for multi-node. NVIDIA ConnectX-6/7/8 or BlueField-3 strongly preferred. Mellanox firmware floor: MFT 4.30 or newer.
- Storage. NVMe-only for the model cache (KV-cache spill, model weights). 4 TB minimum per node, U.2 or E1.S form factor. Object storage backplane (MinIO/Ceph) on a separate disk pool.
- BIOS / firmware. SR-IOV, IOMMU and Above 4G Decoding enabled. PCIe ACS disabled or per-port-overridden so GPUDirect P2P works. BIOS no more than 12 months behind the vendor's latest release. BMC firmware audited and removed from any internet-facing network.
- OS / driver stack. Ubuntu 24.04 LTS or RHEL 9.x, NVIDIA driver 565+ with CUDA 12.6+, NVIDIA Container Toolkit, validated against the Hugging Face Hub model-card runtime requirements for the specific Gemma, Qwen, or Falcon Arabic build the customer plans to deploy.
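The all-green rule can be sketched as a simple gate over the six axes. The `Node` shape, the helper name `certify`, and the exact field encodings below are illustrative, not Hosn's internal tooling; the thresholds (core counts, NIC speed, NVMe floor, driver version) come from the matrix above:

```python
from dataclasses import dataclass

# Hypothetical node inventory record; field names are illustrative only.
@dataclass
class Node:
    cpu_cores_per_socket: int   # physical cores per socket
    gpu_model: str
    nic_gbe: int                # NIC line rate in Gb/s
    nvme_tb: float              # NVMe capacity for the model cache, in TB
    sriov_iommu_enabled: bool   # SR-IOV + IOMMU + Above 4G Decoding
    driver_version: int         # NVIDIA driver major version

CERTIFIED_GPUS = {"H100", "H200", "B200", "GH200", "L40S",
                  "RTX 6000 Ada", "RTX Pro 6000 Blackwell"}

def certify(node: Node, workload: str = "inference",
            multi_node: bool = False) -> list[str]:
    """Return the list of red axes; an empty list means the node passes."""
    red = []
    if node.cpu_cores_per_socket < (64 if workload == "training" else 32):
        red.append("cpu")
    if node.gpu_model not in CERTIFIED_GPUS:
        red.append("gpu")
    if node.nic_gbe < (100 if multi_node else 25):
        red.append("nic")
    if node.nvme_tb < 4:
        red.append("storage")
    if not node.sriov_iommu_enabled:
        red.append("bios")
    if node.driver_version < 565:
        red.append("os_driver")
    return red
```

One red axis is enough to block production sign-off, matching the every-axis-green rule; returning the full list rather than a boolean makes the gap report in the variance documentation straightforward.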
3. Reference architectures Hosn supports
We maintain four reference designs that the customer's hardware is benchmarked against. If your inventory matches one of these, certification is fast. If it deviates, we document the variance and decide case by case.
- HPE. ProLiant DL380a Gen11/Gen12 with 4x or 8x H100/H200 SXM, ProLiant Compute XD685 for HGX-scale, Cray XD670 for training-tier deployments. iLO BMC isolated on a management VLAN.
- Dell. PowerEdge XE9680 (8x H100/H200 SXM), XE8640 (4x SXM), R760xa (4x PCIe H100/L40S). PowerStore for the object backplane in mixed-SAN sites.
- Supermicro. 421GE-TNHR2 (8x H100/H200 SXM), AS-8125GS-TNHR (8x MI300X), SYS-521GE-TNRT (4x PCIe). The most common BYOC chassis in GCC tenders because of price and lead-time advantages.
- Edge / classified-air-gap. A Hosn Kernel reference (single L40S or RTX Pro 6000, mid-tower, no iLO/iDRAC, USB-locked) for sites where a full rack is overkill or not permitted.
For a deeper view of how these families differ on price, lead-time, and warranty in GCC procurement, read Dell, HPE and Supermicro AI servers in the GCC. For the procurement-paperwork angle inside Omani government, see hardware procurement for Omani government.
4. The acceptance test checklist
Once the matrix and reference fit are confirmed on paper, we run the physical acceptance test on the box. The test is unattended after kickoff and produces a signed PDF report the customer files with their tender records.
- Boot and inventory. Cold-boot to login in under 4 minutes, dmidecode/lspci match the bill of materials, BMC reachable on the management VLAN only.
- Drivers. nvidia-smi reports all GPUs, ECC enabled, no Xid errors over a 30-minute idle hold, NCCL all-reduce passes on multi-GPU nodes.
- vLLM benchmark. Load a reference build of Gemma 4 27B Instruct or Qwen 3.6 32B, run a 1000-prompt throughput sweep at 2k/8k/32k context, measure tokens/sec and p50/p95/p99 latency. Result must land within 8 percent of our reference node for the same GPU class.
- Network throughput. iperf3 between every node pair at 90 percent or better of line rate; for InfiniBand, ib_write_bw and ib_send_lat within vendor spec; nccl-tests all-reduce bandwidth at or above 80 percent of theoretical.
- Eval-suite delta. Run our internal Arabic + English eval (a frozen subset of ArabicMMLU, AlGhafa, ALUE, plus 200 institution-specific prompts the customer supplied) on the customer's box and on our reference. Any score that drops more than 0.5 absolute points triggers a manual re-run; a sustained gap is a fail.
- Security and lifecycle. Disk encryption keys held by customer, BMC patched and isolated, supply-chain attestation (TPM-quoted boot measurement), and a rollback plan to factory firmware in writing.
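The gating arithmetic in the benchmark and eval-suite steps is simple enough to sketch. The helper names below are illustrative, not the actual test harness; the 8 percent throughput tolerance and the 0.5-point absolute eval drop are taken directly from the checklist above:

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ranked = sorted(samples)
    k = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[k]

def throughput_ok(node_tps: float, reference_tps: float,
                  tolerance: float = 0.08) -> bool:
    """Pass if the candidate node's tokens/sec lands within 8 percent
    of the reference node for the same GPU class."""
    return node_tps >= reference_tps * (1 - tolerance)

def eval_regressions(node_scores: dict[str, float],
                     ref_scores: dict[str, float],
                     max_drop: float = 0.5) -> list[str]:
    """Return eval tasks whose score drops more than 0.5 absolute points
    versus the reference node; any hit triggers a manual re-run."""
    return [task for task, ref in ref_scores.items()
            if ref - node_scores.get(task, 0.0) > max_drop]
```

A node at 92 tokens/sec against a 100 tokens/sec reference passes; 91.9 does not. Likewise an ArabicMMLU score of 60.9 against a 61.4 reference is inside the band, while 60.8 forces the re-run.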
Closing
BYOC is not a discount path; it is a procurement-shape choice. Sovereign buyers who already own modern compute or who are bound by framework agreements should not be punished for it. If you have an inventory list and a Tender Board reference and want to know whether your boxes can carry a Hosn deployment, email [email protected] for a one-hour briefing. We will tell you, in writing, what passes and what does not.
Frequently asked
What does BYOC mean for a Hosn deployment?
BYOC, bring your own compute, means the customer supplies the servers and Hosn certifies, hardens, and operates the AI stack on top. It is common for institutions with framework procurement agreements, classified-tier hardware already on the floor, or strict national-vendor preferences.
Which server families does Hosn certify?
We certify against the NVIDIA-Certified Systems and Hugging Face validated server lists, with reference configurations on HPE ProLiant DL380a Gen11/Gen12, Dell PowerEdge XE9680 and R760xa, and Supermicro 421GE-TNHR2 / AS-8125GS class chassis. Other families are reviewed on request.
What happens if a customer's existing hardware fails certification?
We document the gap (e.g. NIC firmware too old, BIOS PCIe topology unsupported, GPU thermals out of spec) and propose either a firmware or component upgrade path, or a fall-back to a Hosn-supplied reference node. We never silently downgrade the model footprint to fit broken hardware.
How long does the acceptance test take?
Three to five business days for a single-node certification, longer for multi-node clusters with InfiniBand or RoCE fabrics. The eval-suite delta against our reference baseline is the gating step and runs unattended overnight.