Datacenter-Grade AI Racks: Power, Cooling, UPS, and Air-Gap Network Design
A sovereign AI rack is not a server you slide into an existing comms cabinet. The cooling load of one fully populated H100 system rivals an entire SME's office, the network has to refuse outbound packets at the physics layer, and the room itself is part of the threat model. This is the field guide for the procurement officer, the facilities engineer, and the security officer who have to sign the same drawing. Power, cooling, UPS, physical security, network isolation, storage, and the realistic 12-20 week timeline that turns a green floor into a Tier-A enclave.
From VM-on-cloud to rack-in-the-fortress
The cloud-native instinct is to ask "which instance type fits the workload?" and let someone else worry about the physical layer. That instinct breaks the moment classification rules forbid the data from crossing your perimeter. The question shifts from "which instance" to "which room", and the room becomes a deliverable in the same RFP as the GPUs.
Three reframings are useful before any line-item is priced:
- The rack is the unit of design, not the GPU. Power, cooling, weight, and network feeds are reserved at the rack level. A single 8-GPU system that draws ten kilowatts has more in common with a small machine shop than with a 1U pizza-box server.
- The room is the unit of accreditation, not the rack. Inspectors certify enclosures, walls, doors, ducts, and cable paths. A perfectly hardened rack inside a leaky room fails the audit.
- The supply chain is the unit of trust, not the brand. Tamper-evident shipping, factory acceptance tests, and signed firmware bills-of-materials matter more than the logo on the bezel.
For background on why the public-cloud option is a non-starter for the sovereign tier, see our pillar on on-premise AI for sovereign institutions. This article assumes that decision is already made.
Power: per-GPU draw, total rack draw, UPS sizing
Start with the chip. NVIDIA's H100 datasheet rates the SXM5 form factor at 700 W maximum thermal design power. The PCIe sibling sits at 350 W. The newer H200 SXM keeps the same 700 W ceiling, and the Blackwell-generation B200 lifts the envelope to 1,000 W per GPU. Multiply by eight GPUs on an HGX baseboard and you have a 5.6 to 8 kW chip envelope before you even count CPUs, memory, NVSwitch fabric, NICs, drives, fans, or PSU efficiency losses.
NVIDIA publishes the system-level number for the reference platform in the DGX H100 user guide: 10.2 kW peak per chassis. A four-system rack with a leaf switch and PDU overhead lands between 41 and 45 kW under sustained training. Inference-only workloads run lower, often 60 to 70 percent of peak, but you still size the rack feed to the worst case.
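To make the arithmetic reproducible, here is a minimal sketch of the rack envelope. The per-GPU figure comes from the published datasheet; the host-overhead and PSU-efficiency constants are our own assumptions, calibrated so the per-system result lands on the 10.2 kW DGX H100 rating, and should be swapped for the numbers in your actual bill of materials.

```python
# Rack power envelope sketch. GPU_TDP_W is from the H100 SXM5 datasheet;
# HOST_OVERHEAD_W and PSU_EFFICIENCY are assumptions tuned to reproduce
# NVIDIA's 10.2 kW DGX H100 system rating - substitute your own BOM figures.
GPU_TDP_W = 700
GPUS_PER_SYSTEM = 8
HOST_OVERHEAD_W = 3_600     # CPUs, memory, NVSwitch, NICs, drives, fans (assumed)
PSU_EFFICIENCY = 0.90       # assumed worst-case efficiency at full load

def system_peak_kw() -> float:
    it_load_w = GPU_TDP_W * GPUS_PER_SYSTEM + HOST_OVERHEAD_W
    return it_load_w / PSU_EFFICIENCY / 1000

def rack_peak_kw(systems: int = 4, switch_and_pdu_kw: float = 1.5) -> float:
    return systems * system_peak_kw() + switch_and_pdu_kw

if __name__ == "__main__":
    print(f"per system: {system_peak_kw():.1f} kW")      # ~10.2 kW
    print(f"four-system rack: {rack_peak_kw():.1f} kW")  # ~42 kW, inside the 41-45 kW band
```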
Power planning checklist:
- Two diverse feeds, A and B, from separate transformers. Each feed sized for full rack load, so the surviving feed can carry everything if its twin trips.
- 415 V three-phase to the rack PDU, not 230 V single-phase. The current draw at single-phase 230 V for a 45 kW rack exceeds 195 amps and turns cabling into copper bars.
- UPS sized for the peak, not the average. A rack of 4-GPU systems drawing 22 kW peak needs a UPS rated at least 30 kVA at 0.9 power factor. A rack of 8-GPU systems at 45 kW peak needs 60 kVA. APC, Vertiv, and Eaton all publish sizing tools that account for inrush and harmonic distortion.
- Generator runtime tied to the policy. A SCIF that cannot lose state needs a diesel that runs the load for at least eight hours, with a fuel contract guaranteeing 72-hour replenishment. Anything less and you are pretending.
- UPS bypass and isolation transformers sit between the utility and the GPU rack. A direct utility-to-GPU path will trip on the first brownout and the GPU memory will not survive the cold restart cleanly.
The often-skipped detail: GPU power supplies do not run at unity power factor. Plan the UPS rating in kVA, not kW, and confirm the harmonic profile with the UPS vendor. A poorly chosen UPS will derate by 20 percent under a high-THD load, and you will find that out only at commissioning.
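A short sketch of the sizing arithmetic, with the same hedges: the 0.9 power factor and 20 percent growth headroom are assumptions chosen to match the worked figures above, and the three-phase calculation assumes 0.95 power factor at the PDU. Confirm both against the UPS vendor's own sizing tool before ordering.

```python
import math

def ups_kva(rack_kw: float, power_factor: float = 0.9, headroom: float = 1.2) -> float:
    """Nameplate kVA: convert kW to kVA at the load power factor, then add growth headroom.
    If the vendor derates the unit for high-THD loads, divide by the derated fraction too."""
    return rack_kw / power_factor * headroom

def feed_current_a(rack_kw: float, volts: float, phases: int, power_factor: float = 0.95) -> float:
    """Per-conductor current for a single-phase or three-phase feed."""
    if phases == 3:
        return rack_kw * 1000 / (math.sqrt(3) * volts * power_factor)
    return rack_kw * 1000 / (volts * power_factor)

print(f"{ups_kva(22):.0f} kVA")                      # ~29 kVA -> order a 30 kVA unit
print(f"{ups_kva(45):.0f} kVA")                      # ~60 kVA
print(f"{feed_current_a(45, 415, 3):.0f} A/phase")   # ~66 A per phase at 415 V three-phase
print(f"{feed_current_a(45, 230, 1, 1.0):.0f} A")    # ~196 A on single-phase 230 V
```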
Cooling: air, liquid, and the Muscat climate factor
ASHRAE's TC 9.9 thermal guidelines define the envelopes. Class A2, the typical enterprise room, allows 10 to 35 C inlet air. The H100 Tensor Core GPU operates within ASHRAE A2, but every degree above 27 C inlet air costs you fan power and shortens component life. The 2021 update to the ASHRAE TC 9.9 Datacom Series introduced the H1 envelope specifically for high-density AI workloads.
At rack densities below 25 kW, hot-aisle containment with chilled-water CRAH (computer room air handler) units delivers the supply air at 22 to 25 C, and the exhaust returns through the ceiling plenum. Above 30 kW, you are fighting physics. Air at 1.2 kg per cubic metre carries roughly 1.2 kW per cubic metre per second for every kelvin of delta-T, which means a 45 kW rack with a 12 K air temperature rise needs over 3 cubic metres per second of moving air, roughly 6,600 CFM through a single cabinet. That is industrial ventilation, not a server room.
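The airflow figure follows directly from the heat-capacity rate of air. A minimal sketch, assuming sea-level air density and a design temperature rise you would adjust to your own containment design:

```python
AIR_DENSITY = 1.2   # kg/m^3, roughly sea level at room temperature
AIR_CP = 1005.0     # J/(kg*K), specific heat of dry air

def airflow_m3_per_s(heat_kw: float, delta_t_k: float) -> float:
    """Volumetric airflow required to carry heat_kw away at a given air temperature rise."""
    return heat_kw * 1000 / (AIR_DENSITY * AIR_CP * delta_t_k)

for delta_t in (10, 12, 15):
    flow = airflow_m3_per_s(45, delta_t)
    print(f"45 kW rack, dT {delta_t} K: {flow:.1f} m^3/s ({flow * 2118.9:.0f} CFM)")
```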
The practical answer at 30 kW and above is direct-to-chip liquid cooling. Cold plates sit on the GPU die and the CPU IHS, water flows at 35 to 45 C supply temperature, and a coolant distribution unit (CDU) per rack handles secondary loop pressure. The Open Compute Project's Cooling Environments work documents the W32 and W45 water classifications that most enterprise CDUs target. Rear-door heat exchangers are the middle path: they sit on the rack exhaust, pull heat out before it enters the room, and let you keep the rest of the room on conventional air handling.
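The same heat-balance arithmetic sizes the secondary water loop behind a cold-plate or rear-door design. A sketch, assuming plain water with no glycol correction and a 10 K loop temperature rise; the CDU vendor's datasheet governs the real flow and pressure-drop limits:

```python
WATER_CP = 4186.0       # J/(kg*K)
WATER_DENSITY = 1000.0  # kg/m^3; glycol mixes carry slightly less heat per litre

def loop_flow_l_per_min(heat_kw: float, delta_t_k: float) -> float:
    """Secondary-loop flow a per-rack CDU must deliver to move heat_kw at a given water dT."""
    kg_per_s = heat_kw * 1000 / (WATER_CP * delta_t_k)
    return kg_per_s / WATER_DENSITY * 1000 * 60

flow = loop_flow_l_per_min(45, 10)
print(f"45 kW rack at 10 K rise: {flow:.0f} L/min (~{flow / 3.785:.0f} US gpm)")
```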
Muscat-specific adjustments matter. Summer dry-bulb temperatures of 42 C and wet-bulb temperatures pushing 30 C blow up any plan that assumed dry coolers. The combinations that actually work in this climate:
- Chilled-water plant with magnetic-bearing centrifugal chillers, sized for 38 C ambient with N+1 redundancy.
- Rear-door heat exchangers or in-row CRAH on the rack, fed from the chilled-water loop at 14 to 18 C supply.
- A free-cooling economiser is only useful in the winter months, December through February, and only at night. Plan it as a bonus, not the baseline.
- Humidity control tighter than the climate suggests. Coastal humidity plus over-aggressive cooling causes condensation on cold plates, which is a fault, not a feature.
For the dimensional analysis on a 4-GPU box specifically tuned to Muscat, see Muscat-climate cooling for a 4-GPU rack.
Physical security: SCIF-class rooms and access
The U.S. Intelligence Community Directive 705 (ICD 705) remains the most cited construction standard for sensitive compartmented information facilities globally. Even when the local accreditor is not the U.S. IC, ICD 705 is the working baseline. The relevant features:
- Construction. Slab-to-slab walls, no false ceiling crossing the boundary, RF shielding (typically 80 dB attenuation in the 1 MHz to 10 GHz band), sound attenuation to Sound Group 3 minimum, no glass on the perimeter.
- Access control. Two-factor at the door (badge plus biometric), access list reviewed monthly, two-person rule for the SCIF when classified material is exposed.
- Intrusion detection. Balanced magnetic switches on every door, motion detectors, glass-break sensors, monitored 24/7 by a central station with armed response.
- Camera coverage. Every door, every rack face, every cable path. Recordings retained per the regulator's policy, often 90 days minimum.
- Cable paths. All cables enter through a single penetration, sealed and inspected, no conduit shared with non-classified circuits.
In Oman, the National Centre for Information Safety publishes guidance on secure facility construction that aligns with international standards. Coordinate the build with the accrediting authority before pouring concrete. Retrofitting RF shielding into an existing room costs roughly three times what it costs to build correctly the first time.
For non-SCIF deployments (Secret-equivalent or below), the same hardware can sit in a hardened server room with biometric access, tamper-evident racks, and continuous camera coverage. The accreditor decides where the line sits, not the vendor.
Network: air-gap, data diodes, classified VLANs
"Air-gap" is a contract, not a marketing word. It means the classified enclave has zero live network path to anything outside it. Updates, telemetry, and any inbound bytes cross a unidirectional gateway, also called a data diode, that is enforced in optical hardware: a transmit fibre with no receive fibre, or a one-way photodiode pair. Bytes flow inward only, and the high-side cannot signal the low-side by physics, not by firewall rule.
The reference architecture for a sovereign AI enclave looks like this:
- Outer perimeter (low-side). Conventional internet-facing or intranet zone where model updates, vendor patches, and operator workstations live. Standard firewalls, IDS, SIEM.
- Cross-domain solution. The data diode sits here, plus a content disarm-and-reconstruction (CDR) appliance that strips active content from incoming files. The U.S. NSA's cross-domain solution guidance describes the pattern. Approved CDS products are listed by NCDSMO and adopted internationally.
- Inner perimeter (high-side). The classified VLAN. No DHCP from outside, no DNS resolution beyond an internal stub, no NTP from public servers. Time sync runs from a stratum-1 GPS receiver inside the enclave.
- GPU compute fabric. InfiniBand or RoCE between GPU nodes for collective operations. This fabric does not route to the classified VLAN; it is its own physically separate network.
- Storage fabric. Separate from compute fabric. Encryption at rest with hardware-backed keys.
Three classified VLANs are typical: management (jumphost, syslog, IPMI/BMC), data plane (model serving, RAG retrieval), and administrative (operator access, with two-person sign-in for sensitive operations). For deeper treatment, see our companion piece on air-gap network architecture for AI.
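To make the receive-side discipline concrete, here is a minimal sketch of the verification step on the high side of the diode: check the artifact hash and a detached Ed25519 signature against a pinned key before anything leaves quarantine. The file layout, manifest fields, and key-provisioning path are illustrative assumptions, not any specific CDS product's interface; the sketch uses the Python cryptography package.

```python
import hashlib
from pathlib import Path
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

# Placeholder key material: in practice the release-signing public key is
# provisioned offline under two-person control, never shipped across the diode.
PINNED_PUBKEY = Ed25519PublicKey.from_public_bytes(bytes.fromhex("00" * 32))

def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(artifact: Path, detached_sig: Path, expected_sha256: str) -> bool:
    """True only if the manifest hash matches and the detached signature validates.
    Anything that fails is discarded, not promoted. (Reads the whole file; stream
    the verification for multi-gigabyte model weights.)"""
    if sha256_of(artifact) != expected_sha256:
        return False
    try:
        PINNED_PUBKEY.verify(detached_sig.read_bytes(), artifact.read_bytes())
    except InvalidSignature:
        return False
    return True
```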
Storage tier for model weights and KV cache
Model weights for a 70B-parameter model in BF16 are 140 GB. Quantised at INT8, 70 GB. KV cache scales linearly with context length and batch size: a 70B model serving 32 concurrent users at 32K context can consume 200 GB of GPU HBM just for KV. The storage tier behind the GPU has to feed weights at NVMe speed and stage KV spillover to fast pooled flash.
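The KV figure depends on the model's attention layout and the cache precision, so treat the 200 GB above as a mid-range outcome rather than a constant. A sketch, assuming a Llama-70B-style architecture (80 layers, grouped-query attention with 8 KV heads, head dimension 128); none of those constants come from this article and they differ per model:

```python
# KV-cache footprint sketch for a 70B-class model with grouped-query attention.
# All architecture constants are illustrative assumptions - check your model card.
LAYERS = 80
KV_HEADS = 8          # GQA; a full multi-head cache would be 64 heads and ~8x larger
HEAD_DIM = 128
BYTES_PER_ELEM = 2    # BF16 cache; 1 for FP8

def kv_bytes_per_token() -> int:
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM   # 2 = keys + values

def kv_cache_gb(concurrent_users: int, context_tokens: int) -> float:
    return concurrent_users * context_tokens * kv_bytes_per_token() / 1e9

print(f"{kv_bytes_per_token() / 1024:.0f} KiB per token")            # ~320 KiB in BF16
print(f"{kv_cache_gb(32, 32_768):.0f} GB at 32 users x 32K context")  # ~344 GB BF16, ~172 GB FP8
```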
Storage layout for a Tier-A enclave:
- Local NVMe per node, 8 to 16 TB, for active model weights and operator scratch. PCIe Gen5 minimum for H100-class platforms.
- All-flash NAS or parallel filesystem (WekaIO, VAST, DDN, or Ceph with NVMe OSDs) for shared model registry, RAG corpus, and audit logs. Sized at 50 to 200 TB for most sovereign deployments.
- Cold tier on encrypted SATA SSD or LTO-9 tape, inside the enclave, for archival and the immutable audit log. Tape is still the cheapest and most tamper-evident medium for seven-year retention.
- Encryption. Full-disk at rest, with keys stored in a FIPS 140-3 validated hardware security module inside the enclave. Key escrow follows the regulator's policy.
Operations: change-control, signed updates, dual-control
The operating model decides whether the rack stays trustworthy through year three. The pattern that survives audit:
- Signed updates. Every binary that crosses the data diode carries a signature from a hardware-backed key. The receiving side verifies before staging. Unsigned material is discarded, not quarantined.
- Two-person integrity. Promotion of a model weight, a firmware blob, or a configuration change from staging to production requires two distinct operators with separate credentials. Implement at the workflow tool, not the door.
- Immutable audit log. Every action, who did it, when, what hash was promoted, written to append-only storage that the operator cannot alter. WORM-mode storage or a hash-chained log on a separate appliance; a minimal hash-chain sketch follows this list.
- Change windows. Production changes only inside scheduled windows announced 48 hours in advance. Emergency changes require a documented break-glass process with after-the-fact review.
- Drills. Quarterly tabletop exercises covering UPS failure, chiller failure, leak in the liquid loop, suspected insider, and detected diode anomaly. Drills surface the playbook gaps before a real incident does.
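As referenced in the audit-log item above, a hash-chained record is simple enough to sketch. This is a minimal illustration, assuming records are serialised as canonical JSON and written to WORM storage; the field names are ours, not any specific logging product's schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def append_record(prev_hash: str, actor: str, action: str, artifact_sha256: str) -> dict:
    """Build the next record in the chain; its hash covers the previous record's hash."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "artifact_sha256": artifact_sha256,
        "prev_hash": prev_hash,
    }
    body = json.dumps(record, sort_keys=True).encode()
    record["record_hash"] = hashlib.sha256(body).hexdigest()
    return record

def verify_chain(records: list[dict]) -> bool:
    """Recompute every hash and check the linkage; any retroactive edit breaks the chain."""
    prev = records[0]["prev_hash"] if records else ""
    for rec in records:
        if rec["prev_hash"] != prev:
            return False
        body = {k: v for k, v in rec.items() if k != "record_hash"}
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != rec["record_hash"]:
            return False
        prev = rec["record_hash"]
    return True
```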
The procurement and build timeline (12-20 weeks)
For a Tier-A buyer that owns the building and has the budget approved, twelve to twenty weeks is the realistic envelope. The longer end accommodates GPU lead-times that have stretched beyond 16 weeks during the post-2024 supply crunch. Indicative breakdown:
- Weeks 1 to 4: Design and procurement. Load study, single-line diagram, cooling load calculation, network topology, accreditation engagement. PO release for GPUs, UPS, chillers, racks, and CDS.
- Weeks 5 to 10: Civil and MEP works. Slab-to-slab walls, RF shielding, access doors, fire suppression (clean-agent, not sprinkler over GPUs). Electrical: transformer pad, switchgear, UPS install, generator commissioning. Mechanical: chilled-water plant, CRAH or rear-door heat exchangers, leak detection.
- Weeks 11 to 14: Network and rack install. Structured cabling, classified-VLAN switches, InfiniBand fabric, data diode and CDS commissioning. Rack delivery, GPU systems racked, PDU energised.
- Weeks 15 to 16: GPU commissioning and burn-in. Vendor acceptance test, 72-hour burn-in at full thermal load, NVLink and NVSwitch validation, storage benchmark.
- Weeks 17 to 18: Software stack. Operating system hardening (CIS or STIG benchmarks), driver install, container runtime, model registry, monitoring stack, audit pipeline.
- Weeks 19 to 20: Accreditation and handover. Penetration test, accreditor walkthrough, documentation handover, operator training, go-live.
Sizing the compute envelope before the room design starts is where most projects save or burn weeks. For the dimensional analysis tying user counts and latency targets to GPU count, see our piece on sizing a sovereign AI appliance against user load and latency.
If you are scoping the room before the GPUs arrive, or trying to retrofit an existing comms cabinet into something that survives a SCIF inspection, email [email protected] for a one-hour briefing. We will walk through your floor plan, single-line diagram, and accreditation target, and tell you straight whether the timeline is twelve weeks or twenty.
Frequently asked
How much power does a fully loaded 8-GPU H100 rack actually draw?
A single NVIDIA HGX H100 8-GPU baseboard draws roughly 5.6 kW for the GPUs alone (8 x 700 W SXM5). Add CPU, memory, NVSwitch, NICs, fans, and PSU losses and a single DGX H100 system is rated by NVIDIA at about 10.2 kW maximum. A rack containing four such systems plus a leaf switch sits between 40 and 45 kW under sustained training load. Provision the rack PDU and the upstream feed for at least 50 kW with N+1 redundancy.
Air or liquid cooling for an 8-GPU rack in Muscat?
Below 25 kW per rack, hot-aisle containment with high-density CRAH units and 27 C cold-aisle supply is workable. Above 30 kW per rack, direct-to-chip liquid cooling is the practical answer. Muscat ambient summer wet-bulb temperatures push chiller loads hard, and dry coolers alone fail in July. Most sovereign deployments end up with chilled-water plus rear-door heat exchangers or full direct liquid loops feeding a CDU per rack.
What does an air-gap actually mean if you also need to ship model updates in?
True air-gap means no live network path between the classified enclave and the outside world. Updates cross via a unidirectional gateway, also called a data diode, that allows bytes to flow inward only and is enforced in hardware, not software. The receiving side validates signatures, scans, and stages updates in a quarantine before promotion. Outbound is impossible by physics, not by firewall rule.
Do we need a SCIF for a sovereign AI rack?
Not always; it depends on the data classification. For Top Secret or compartmented material, ICD 705 SCIF construction (RF shielding, sound attenuation, intrusion detection, two-person access control) is mandatory. For Secret-equivalent, a hardened server room with biometric access, full camera coverage, and tamper-evident racks is usually sufficient. The accrediting authority sets the bar, not the vendor.
How long does it take to build a Tier-A AI rack room from a green floor?
Twelve to twenty weeks is the realistic envelope for a buyer that already owns the building. Weeks 1 to 4 are design, load study, and procurement. Weeks 5 to 10 are civil works, electrical (transformer, switchgear, UPS), and mechanical (chillers, CRAH, leak detection). Weeks 11 to 16 cover network cabling, rack install, GPU commissioning, burn-in. Weeks 17 to 20 are accreditation, penetration testing, and handover. GPU lead times can extend the upper bound.
Why dual-control and signed updates instead of just trusting the operator?
Insider risk is the dominant threat in a properly air-gapped enclave. Two-person integrity, code signing with hardware-backed keys, and immutable change logs convert the trust model from people to procedure. A single operator cannot push an unreviewed model weight into the inference plane. The audit trail is then admissible to the regulator without ambiguity.