Strix Halo 128GB: AMD's Sovereign-AI Workstation Play
For two years, the sovereign-AI workstation conversation in Oman and the wider GCC has been binary: an Apple M3 Ultra Mac Studio with 256GB of unified memory, or an NVIDIA RTX 6000 in a tower. Both work, both have credible deployment stories, and both leave a class of buyer underserved: the Linux-first ministry, the AMD-standardised holding, the procurement officer who has been told to diversify away from a single accelerator vendor. AMD's Strix Halo platform, marketed as the Ryzen AI Max+ 395 with 128GB of LPDDR5X unified memory, is now the third credible answer at the workstation tier.
This is the pillar guide to that platform. It walks through the architecture, the real-world LLM throughput numbers institutional buyers actually need, the ROCm tooling story in 2026, a head-to-head with the M3 Ultra, the power and procurement implications, the limitations to flag honestly in any evaluation memo, and the personas for whom Strix Halo is the right pick. Hosn ships Strix Halo as one of the AMD-side reference builds inside the on-premise AI framework. The platform itself, however, is bigger than any one vendor.
Why AMD Strix Halo matters for sovereign edge AI
Three forces converged in 2025 to make Strix Halo a sovereign-grade product rather than an enthusiast curiosity.
The first is the rise of dense open-weight models in the 27B-to-70B class that meaningfully match GPT-4-class performance on enterprise workloads. Gemma 4 27B, Qwen 3.6 32B, Mistral Large 2 dense variants, and Llama 3.3 70B in 4-bit quantisation all carry weight footprints of roughly 24GB to 64GB. A single accelerator with 96GB to 120GB of accessible memory is enough to serve any of these to a department of users. Two years ago that required at least an H100; in 2026 it does not.
The second is the maturation of ROCm. AMD's compute stack spent most of the post-2020 decade chasing CUDA, with credible inference paths but rough edges in tooling, kernel stability, and library breadth. ROCm 7, released in October 2025, ships first-class support for Strix Halo (the gfx1151 target), with rocWMMA acceleration for BF16 and FP16 matrix paths and packaged llama.cpp container builds via the Lemonade SDK. For inference workloads, the gap to CUDA has effectively closed.
The third is geopolitical. Sovereign procurement officers across the GCC are under explicit instruction to avoid single-vendor lock-in for compute. A buyer who specifies "NVIDIA only" is making a decision the next minister might be asked to defend. Strix Halo gives that buyer a credible AMD alternative at the workstation and small-departmental tiers, without forcing them to give up the open-weight model catalogue or the Linux operational model.
The architecture: Zen 5, RDNA 3.5, XDNA 2, 128GB unified
The Ryzen AI Max+ 395 is technically a system-on-chip, not a discrete CPU plus GPU. Understanding this matters for how memory and bandwidth are modelled.
CPU. Sixteen Zen 5 cores, thirty-two threads, a 3.0 GHz base clock and 5.1 GHz boost, 64MB of L3 cache. That alone is a competitive workstation CPU. For comparison, the Threadripper 7960X has twenty-four cores at a similar clock; Strix Halo gives up a third of those cores in exchange for an integrated GPU and NPU on the same package.
iGPU. Forty RDNA 3.5 compute units, branded Radeon 8060S, sharing its architectural lineage with the discrete Radeon RX 7000 series. This is closer in raw shader count to a desktop RX 7700 XT than to a typical mobile iGPU. Critically, it speaks ROCm, supports rocWMMA matrix instructions, and exposes itself to llama.cpp, vLLM, and Ollama as a first-class compute device.
NPU. An XDNA 2 neural processing unit rated at 50+ TOPS, the highest in the Copilot+ PC class as of launch. For LLM inference the NPU is rarely the bottleneck (token generation is memory-bandwidth bound), but for vision pipelines, speech, and small-model agentic loops the NPU is genuinely useful and runs at a fraction of the power of either the CPU or the iGPU.
Memory. Up to 128GB of LPDDR5X-8000, soldered, on a 256-bit interface delivering 256 GB/s of theoretical bandwidth. This is the piece that changes the workstation calculus. Of the 128GB, up to 120GB can be allocated to the GPU as VRAM on a Linux host with tuned kernel parameters, leaving a comfortable 8GB for the host operating system; under Windows or default Linux, the maximum GPU allocation is 96GB. For sovereign deployments running headless Linux, the 120GB ceiling is the relevant number (a back-of-envelope fit check against that ceiling follows this rundown).
Power envelope. A configurable TDP from 45W (laptop deployments) up to 120W (desktop and mini-PC chassis). Real-world sustained LLM workloads land at roughly 120 to 140 watts at the wall.
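For sizing purposes, the fit question reduces to simple arithmetic: weight bytes at the chosen quantisation, plus KV cache for the planned context, measured against the 96GB or 120GB GPU allocation ceiling. The sketch below is a back-of-envelope check only; the bytes-per-parameter figure for Q4_K_M-class quantisation (~0.6 bytes) and the model shape used in the example are rough planning assumptions, not measured values.

```python
# Back-of-envelope check: does a quantised model fit in the GPU partition?
# Assumptions (rough planning figures, not measurements):
#   - Q4_K_M-class weights average ~0.6 bytes per parameter
#   - KV cache estimated from layer count, KV heads, head dim, and context length

GIB = 1024**3

def weight_bytes(params_billion: float, bytes_per_param: float = 0.6) -> float:
    """Approximate weight footprint for a quantised dense model."""
    return params_billion * 1e9 * bytes_per_param

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, bytes_per_value: int = 2) -> float:
    """FP16 KV cache: two tensors (K and V) per layer, per token of context."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value

def fits(params_billion, layers, kv_heads, head_dim, context, ceiling_gib):
    total = weight_bytes(params_billion) + kv_cache_bytes(layers, kv_heads, head_dim, context)
    return total / GIB, total <= ceiling_gib * GIB

# Hypothetical 70B-class dense model (Llama-style shape: 80 layers, 8 KV heads, head_dim 128)
used, ok = fits(70, layers=80, kv_heads=8, head_dim=128, context=32_768, ceiling_gib=120)
print(f"~{used:.0f} GiB needed, fits in 120 GiB Linux ceiling: {ok}")
```

On these assumptions a 70B Q4 model with a 32K context lands around 50 GiB, which also fits the 96GB default ceiling; the 120GB tuning matters most for very large MoE footprints, multiple concurrent models, or long-context KV caches.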
Real-world LLM tokens per second on Strix Halo
Synthetic benchmarks are easy to construct and easy to dismiss. The numbers below come from the public community benchmark sets at amd-strix-halo-toolboxes and the Framework Community LLM tests, plus the technical writeup at llm-tracker.info, all on a 128GB Framework Desktop with ROCm 7 and rocWMMA-enabled llama.cpp builds. They reflect single-user latency, which is the metric that matters for sovereign workstations.
Small-to-mid dense models. Llama 3.1 8B at Q4_K_M sustains 60 to 90 tokens per second. Gemma 4 4B runs above 100 tokens per second. These are headroom numbers for a small office.
Mainstream institutional models (27B to 32B dense). Gemma 4 27B and Qwen 3.6 32B at Q4_K_M land in the 25 to 45 tokens-per-second band, depending on backend and prompt length. This is comfortably interactive: a one-page summary completes in roughly fifteen seconds, well below the threshold at which staff start to abandon the tool.
Mixture-of-experts models. Qwen 3 30B-A3B (roughly 3B parameters activated per token) hits roughly 86 tokens per second. GPT-OSS 120B, which activates only around 5B parameters per token, sustains around 53 tokens per second. MoE architectures play to Strix Halo's strength: a large weight footprint that fits in 128GB, with only a small slice activated per token, which the 256 GB/s memory subsystem handles well.
Large dense models (70B class). Llama 3 70B at Q4_K_M sustains 4 to 6 tokens per second. This is the practical ceiling on a single Strix Halo node and reflects the memory-bandwidth limit honestly; the arithmetic behind that limit is sketched after this rundown. For 70B dense workloads at high concurrency, the right answer is the institutional tier (H100 or H200 rack), not Strix Halo. For 70B at single-user document-review pace, it works.
Prompt prefill. Often overlooked. Strix Halo's RDNA 3.5 iGPU posts strong prefill numbers, often 400 to 800 tokens per second for typical-length prompts on 27B-class models, which keeps long-context retrieval-augmented generation usable.
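The dense-versus-MoE split in the numbers above follows from a simple bound: at decode time every active weight is streamed from memory once per token, so tokens per second cannot exceed memory bandwidth divided by the active-weight footprint. The sketch below applies that bound with assumed figures (the theoretical 256 GB/s, ~0.6 bytes per parameter at Q4-class quantisation); it is an upper bound only, and real throughput lands below it once compute, cache behaviour, and framework overhead are counted.

```python
# Upper bound on decode throughput: bandwidth / bytes of weights touched per token.
# Assumed figures, for illustration only: theoretical 256 GB/s, ~0.6 bytes/param at Q4-class quant.

BANDWIDTH_GBPS = 256.0          # theoretical LPDDR5X-8000 on a 256-bit bus
BYTES_PER_PARAM = 0.6           # rough average for Q4_K_M-class quantisation

def decode_tok_s_upper_bound(active_params_billion: float) -> float:
    """Tokens/s ceiling if every active weight is streamed once per generated token."""
    active_gb = active_params_billion * BYTES_PER_PARAM   # GB touched per token
    return BANDWIDTH_GBPS / active_gb

# Dense 70B: all 70B parameters are active every token.
print(f"70B dense  : <= {decode_tok_s_upper_bound(70):.1f} tok/s (observed 4-6)")
# MoE 30B-A3B: only ~3B parameters are active per token.
print(f"30B-A3B MoE: <= {decode_tok_s_upper_bound(3):.0f} tok/s (observed ~86)")
```

Prefill behaves differently because it is compute bound rather than bandwidth bound, which is why the prefill figures above do not follow the same pattern.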
ROCm and llama.cpp tooling maturity in 2026
For a procurement officer, the question is not "can ROCm match CUDA in a research benchmark" but "can my engineering team deploy and operate this in production with current tooling." The 2026 answer is yes, with caveats.
What works out of the box. The llama.cpp ROCm/HIP backend is stable, Lemonade SDK ships nightly Strix Halo builds, Ollama exposes the iGPU automatically, vLLM has functional Strix Halo paths, and Open WebUI plus LiteLLM proxy work without modification. A Linux engineer who has deployed CUDA inference before will be productive on Strix Halo within a day; a minimal client-side sketch against a locally served endpoint follows this rundown.
What needs tuning. BIOS allocation of the GPU memory partition (the 96GB default versus the 120GB Linux-tuned ceiling). Kernel parameters for huge-page support. Choice of backend (Vulkan is faster on some quantisations, ROCm on others). The community setup guide documents this end to end.
What still favours NVIDIA. Cutting-edge research code that targets CUDA primitives directly. Some training frameworks where ROCm support lags by one or two major versions. Optimised attention kernels (FlashAttention 3) that arrive on CUDA first and ROCm second. For a sovereign institution that does inference plus light fine-tuning, none of these matter. For a research lab pushing model frontiers, they do.
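As a concrete illustration of the operational claim above: once llama.cpp, Ollama, or vLLM is serving on the box, applications talk to it over the standard OpenAI-compatible HTTP API, so nothing Strix Halo-specific appears in application code. The sketch below assumes a hypothetical endpoint on localhost port 8000 and a hypothetical model name; substitute whatever the serving layer actually exposes.

```python
# Minimal client against a locally served OpenAI-compatible endpoint
# (llama.cpp server, Ollama, vLLM, or a LiteLLM proxy in front of any of them).
# The base_url, api_key, and model name below are placeholders, not real values.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical local serving endpoint
    api_key="not-needed-on-a-private-host",
)

response = client.chat.completions.create(
    model="gemma-27b-q4",                  # whatever name the serving layer registers
    messages=[
        {"role": "system", "content": "You answer in three sentences or fewer."},
        {"role": "user", "content": "Summarise the attached procurement memo."},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

The same code runs unchanged against a CUDA box or a hosted endpoint, which is the point: the accelerator choice stays invisible to application teams.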
Strix Halo vs Apple M3 Ultra at the same tier
The honest comparison is against the M3 Ultra Mac Studio, not against H100 servers. Both are unified-memory workstations targeting the same sovereign-edge persona.
Memory bandwidth. M3 Ultra delivers 819 GB/s versus Strix Halo's 256 GB/s. That is a 3.2x advantage for Apple, and it shows up in tokens-per-second on the largest dense models. For a 70B dense workload, M3 Ultra is roughly twice as fast in practice. For 27B dense and MoE models, the gap narrows substantially because compute, not bandwidth, becomes the bottleneck.
Total memory. M3 Ultra configures up to 512GB, with the 256GB build the natural comparison point at this price tier; Strix Halo tops out at 128GB. For institutions running multiple medium models concurrently or a single very large model, M3 Ultra has more headroom.
Software stack. Apple's Metal Performance Shaders and MLX are mature; Apple Silicon is the development target for Ollama, LM Studio, and several open-weight model release pipelines. ROCm is mature for inference but earlier in the lifecycle. Both stacks work in production today.
Price. A 128GB Framework Desktop or comparable mini-PC sits at roughly USD 2,500 to 3,500 depending on configuration. A 256GB Mac Studio is roughly USD 5,800 to 6,500. Per gigabyte of unified memory, Strix Halo is cheaper; per gigabyte per second of bandwidth, Mac Studio is cheaper. The right metric depends on the workload.
Operating model. Strix Halo runs Linux natively, integrates into existing Ansible and SSH workflows, and sits next to x86 enterprise software (Active Directory clients, ERP integrations, SIEM agents) on the same machine. Mac Studio runs macOS, which is excellent for AI but a foreign animal in most government IT estates. For a Linux-shop sovereign client, Strix Halo wins this comparison on operational grounds before any benchmark is run.
Power, cost, and procurement story
A Strix Halo workstation lives in a regular office. No raised floor, no row-of-racks cooling plan, no upgrade to the building's UPS. The total power draw under sustained inference is roughly that of a workstation laptop on full load. This is procurement-friendly in three ways.
It moves the conversation from facilities to IT. A ministry that would need eighteen months of approvals to install a 4U H100 server can install a Strix Halo workstation under existing endpoint procurement rules. The classification, audit, and acceptance steps that gate a data-centre deployment do not apply to a desktop-class device, even one running 70B-parameter models.
It allows pilot-then-scale. Buy one Strix Halo node for a single department's evaluation. Run for ninety days. If the workload outgrows it, scale to additional nodes or upgrade to the institutional tier. The committed capital at the pilot stage is on the order of a senior official's mid-range vehicle, not a multi-year capital programme.
It diversifies the accelerator vendor mix. An institution running NVIDIA at the institutional tier and AMD at the workstation tier has, by construction, hedged against a single supply chain. For sovereign buyers under explicit diversification mandates, Strix Halo is the cleanest way to add an AMD line item without giving up model coverage or operational maturity.
Limitations to write into the evaluation matrix
An honest procurement memo flags three limitations explicitly.
Single-socket scaling. Strix Halo is a single SoC. There is no equivalent of NVLink to bond two together inside a chassis. Scaling beyond one node is done over Ethernet via vLLM tensor parallelism or llama.cpp RPC, which works but introduces network overhead; a back-of-envelope estimate of that overhead follows these three limitations. For workloads that need a single coherent accelerator larger than 120GB, Strix Halo is not the answer.
ROCm versus CUDA library gap. While inference is well covered, certain workloads (long-context fine-tuning, exotic quantisation schemes, research-grade attention variants) still ship CUDA-first. A sovereign team that wants to be at the cutting edge of model research, rather than at the deploy-and-operate end, will find this friction. The mainstream institutional use cases (RAG, summarisation, drafting, classification) do not.
Memory bandwidth ceiling. 256 GB/s is enough for everything up to and including a 70B dense model at single-user pace, but it is the ceiling. Workloads that need higher per-user throughput on large dense models will hit this wall. The MoE architectures (Qwen 3 30B-A3B, GPT-OSS 120B) sidestep this elegantly by activating only a fraction of weights per token; sovereign deployments that anticipate growth should plan around MoE-friendly model choices.
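To put a rough number on the first limitation: in a layer-split arrangement of the kind llama.cpp's RPC backend uses, the per-token traffic between nodes is just the hidden-state activation at each boundary, which is small. The figures below (hidden size, FP16 activations, gigabit Ethernet, an assumed round-trip time) are illustrative assumptions, not measurements.

```python
# Rough per-token network cost of splitting a model across two nodes, layer-split style.
# All figures are illustrative assumptions: 8192 hidden size, FP16 activations, 1 GbE.

HIDDEN_SIZE = 8192              # typical for a 70B-class dense model
BYTES_PER_ACT = 2               # FP16 activations
LINK_BYTES_PER_S = 125e6        # ~1 Gbit/s Ethernet
RTT_S = 0.0005                  # assumed 0.5 ms round trip on a quiet LAN

per_crossing_bytes = HIDDEN_SIZE * BYTES_PER_ACT             # ~16 KB per boundary
per_crossing_s = per_crossing_bytes / LINK_BYTES_PER_S + RTT_S

tok_s_single_node = 5.0         # observed 70B dense pace on one Strix Halo node
per_token_budget_s = 1.0 / tok_s_single_node                 # 200 ms per token

overhead_pct = 100 * per_crossing_s / per_token_budget_s
print(f"~{per_crossing_bytes/1024:.0f} KB and ~{per_crossing_s*1000:.2f} ms per boundary "
      f"crossing, ~{overhead_pct:.1f}% of a 200 ms token budget")
```

Tensor parallelism synchronises far more often (per layer rather than per boundary), which is where commodity Ethernet does start to hurt; either way, per-node memory bandwidth does not aggregate, which is why the guidance for workloads needing a single coherent accelerator above 120GB remains the institutional tier.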
When Strix Halo is the right pick
Three personas come up repeatedly in our buyer conversations.
The Linux-shop ministry. An IT directorate that runs RHEL or Ubuntu across its fleet, deploys via Ansible, integrates with its SIEM through agent-based logging, and would treat macOS as a foreign object. For this team, Strix Halo slots into existing operational practice with no new tooling. It is the right pick at the workstation and small-departmental tiers.
The diversification-mandated holding. A sovereign fund or holding company with an explicit instruction to avoid single-vendor accelerator dependency. Mixing H100/H200 institutional clusters with AMD MI300X data-centre GPUs at the top tier and Strix Halo at the workstation tier produces a clean, defensible vendor mix.
The AMD-preferred buyer. An institution with existing AMD EPYC servers, AMD-aligned procurement frameworks, or an AMD partnership at the group level. For these buyers, Strix Halo is simply the AMD-side answer to the workstation question, and the choice writes itself.
Mu'een, Oman's national shared-AI platform, addresses general-purpose government AI demand and continues to evolve. Sovereign workloads that require dedicated infrastructure inside an institution's own perimeter, however, are precisely what platforms like Hosn exist to serve, and Strix Halo is one of the credible building blocks for that work in 2026.
If your team is sizing a sovereign workstation deployment and wants to walk through the AMD versus Apple versus NVIDIA decision against your specific workload, email [email protected] for a one-hour briefing. We will bring the benchmark numbers, the procurement language, and the deployment plan, and leave the decision in your hands.
Frequently asked
Can Strix Halo really run a 70B-class model on a single box?
Yes, in 4-bit quantisation. With 128GB of unified memory and up to 120GB allocatable to the GPU under Linux, a Llama 3 70B or Qwen 3.6 72B Q4_K_M weight file fits comfortably and serves at roughly 4 to 6 tokens per second per user. That is interactive enough for document review and structured drafting, though not ideal for high-concurrency chat. For 27B to 32B models, expect 25 to 45 tokens per second depending on backend.
Is ROCm mature enough to deploy in production in 2026?
For inference of mainstream open-weight models through llama.cpp, vLLM, or Ollama, yes. ROCm 7 ships with stable Strix Halo (gfx1151) support, rocWMMA accelerates BF16 and FP16 paths, and Lemonade SDK provides nightly llama.cpp containers. The CUDA gap remains real for cutting-edge research code and some training frameworks, but the ninety percent of sovereign use cases that are inference plus light fine-tuning are well covered.
How does Strix Halo compare to the Apple M3 Ultra Mac Studio at the same tier?
M3 Ultra wins on raw memory bandwidth (819 GB/s versus 256 GB/s) and on the maturity of its ML stack via Metal Performance Shaders and MLX. Strix Halo wins on price (a 128GB Framework Desktop sits at roughly half the cost of a comparable 256GB Mac Studio), on Linux native support, and on the ability to run x86 enterprise software next to the model. For a Linux shop, Strix Halo is the obvious choice; for an Apple-first shop, M3 Ultra still leads.
What is the practical power draw under sustained LLM inference?
A Framework Desktop or comparable mini-PC running a 32B model under sustained load draws around 120 to 140 watts at the wall, well within the published 45 to 120W cTDP envelope of the SoC. That is roughly one tenth the draw of an H100 server at idle. For a sovereign workstation in a regular office, no special cooling or power provisioning is required.
Where does Strix Halo fit in the Hosn product line-up?
Strix Halo is the AMD-side build within the Kernel and Tower tiers. Where the Mac Studio is the right pick for an Apple-aligned office and the RTX 6000 is the right pick for a CUDA-locked team, Strix Halo is the right pick for a Linux-shop sovereign client, an institution standardising on AMD across its estate, or any procurement scenario that requires diversification away from a single accelerator vendor.
What are the main limitations to flag in a procurement memo?
Three. Single-socket design (no multi-GPU scaling inside one chassis; you scale by adding nodes). The ROCm versus CUDA library gap (some research code, optimised kernels, and exotic training paths still favour NVIDIA). Lower memory bandwidth than M3 Ultra and H100 (a 256 GB/s ceiling), which caps tokens per second on dense models at the 70B class and above. None of these are deal-breakers for sovereign inference workloads, but they should be written into the evaluation matrix honestly.