Building your own Personal AI Server
The On-Premise Business AI Center is the third inference box we've built at the Autonomous workshop. After the 2-GPU build shipped, the DMs all said the same thing: show us the 8-GPU build. So this post gives you the build: a full parts walkthrough, the ROI math, and the chassis design by the Autonomous team.
Eight is where one box replaces an entire team's API bill. Four cards tensor-parallel will host any 70B open model. Eight gets you DeepSeek-V3, MiniMax M2, larger Qwen variants, and Kimi K2 once you've expanded storage, with one model serving production, a second serving staging, and a third in fine-tuning.
Additional Components Required
- A 30A 240V circuit, dedicated. A standard 15A 120V outlet gives 1,800W. The GPUs alone draw ~4,600W under load.
- C19 power cables, four of them. PSUs ship with C13 cables that don't fit a rack PDU.
- A torque screwdriver, 0.4–0.6 Nm range. MCIO connector retention has an actual torque spec buried in the motherboard manual.
- A second person. The rig weighs ~77 lbs.
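The circuit requirement is plain P = V × I arithmetic. Here is a minimal sketch using the ~4,600W load figure quoted in the list above. The 80% continuous-load derating is an assumption (a common rule of thumb for loads that run for hours), and `circuit_watts` is a hypothetical helper name. Check with an electrician before wiring anything.

```python
def circuit_watts(volts: float, amps: float, derate: float = 1.0) -> float:
    """Power available from a circuit: P = V * I, optionally derated."""
    return volts * amps * derate


outlet_w = circuit_watts(120, 15)       # standard wall outlet
dedicated_w = circuit_watts(240, 30)    # the dedicated circuit from the list
# Assumption: plan continuous loads at 80% of the breaker rating.
dedicated_continuous_w = circuit_watts(240, 30, derate=0.8)

load_w = 4600  # the ~4,600W load figure quoted above

for name, watts in [
    ("15A/120V outlet", outlet_w),
    ("30A/240V circuit", dedicated_w),
    ("30A/240V at 80%", dedicated_continuous_w),
]:
    verdict = "OK" if watts >= load_w else "not enough"
    print(f"{name:18s} {watts:7,.0f} W  -> {verdict}")
```

The wall outlet falls short by a wide margin, while the dedicated circuit clears 4,600W even after derating.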
Here's how to build one. This is the spec we run in multiple boxes at Autonomous Labs supporting our own development and production. Component by component, with the gotchas we hit.
The Build at a Glance
- 8× RTX 5090 (4090 works too)
- 2× AMD EPYC 9004 Genoa
- Dual-socket ASRock Rack motherboard
- 8,000W of power across 4 PSUs
- Custom chassis (no off-the-shelf case fits this)
About $80,000 in parts depending on GPU pricing and RAM config. Payback math at the end.

1. Motherboard: ASRock Rack GENOA2D24G-2L+
Dual SP5 sockets. 24 DDR5 DIMM slots. Server-class.
No traditional PCIe slots on this board. 16 MCIO connectors instead, bifurcated to deliver full PCIe Gen5 x16 to each of 8 GPUs through MCIO-to-PCIe adapter cards.
That's the whole point. Eight cards, eight full x16 connections, no compromises.
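The connector math can be sketched in a few lines. This assumes each MCIO connector carries a PCIe x8 link, so two connectors pair up to feed one x16 adapter card. Confirm the per-connector width in the motherboard manual; the constant names here are illustrative only.

```python
LANES_PER_MCIO = 8     # assumption: one MCIO connector = one PCIe x8 link
LANES_PER_GPU = 16     # full x16 per card
MCIO_CONNECTORS = 16   # connector count quoted above
GPUS = 8

connectors_per_gpu = LANES_PER_GPU // LANES_PER_MCIO
connectors_needed = GPUS * connectors_per_gpu
gpu_lanes = GPUS * LANES_PER_GPU

print(f"Connectors per GPU:  {connectors_per_gpu}")
print(f"Connectors needed:   {connectors_needed} of {MCIO_CONNECTORS}")
print(f"Lanes used by GPUs:  {gpu_lanes}")
```

Under that assumption, all 16 connectors are spoken for by the GPUs, which is why cable length and routing deserve planning before anything gets bolted down.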

2. CPUs: 2× AMD EPYC 9004 (Genoa)
SP5 socket. Dual-socket configuration is non-negotiable here.
Why two: a single Genoa chip gives you 128 PCIe Gen5 lanes. Eight GPUs at x16 each already eat all 128. You need a second socket for everything else: storage, networking, future expansion.
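Total: up to 160 usable PCIe Gen5 lanes, not 256, because each CPU gives up some of its lanes to the socket-to-socket links. 128 go to the GPUs; the rest serve storage and networking.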
Buy the chip for the lanes, not the cores. For inference workloads, even the entry-level Genoa SKU is plenty of compute. Don't overspend on a 64-core monster you won't use.
3. Memory: DDR5 ECC RDIMM
24 slots. Start at 4 sticks. Scale up to 24.
System RAM matters more on an 8-GPU box than people expect. KV cache offload, context handling, multi-model serving across the 8 cards. We typically build at 384–768GB depending on the box's job.
Underbuild RAM and you'll watch GPUs sit idle waiting on context. Build it once, build it right.
4. Storage: 1TB NVMe Boot + Multi-TB Array
Samsung 990 Pro for the boot drive. PCIe Gen4. Fine.
For models, a multi-TB NVMe array. Modern open weights are big - Kimi K2 is over 1TB, MiniMax M2 is similar. A box hosting three frontier models concurrently can eat 3–4TB of fast storage.
Plan based on how many models you want hot on disk versus loadable from cold storage.

5. GPUs: 8× RTX 5090
32GB GDDR7 each. 256GB total VRAM. 575W per card. PCIe Gen5 x16 via MCIO adapters.
4090 works if you can't get 5090s: similar form factor, but 24GB instead of 32GB per card (192GB total), a 450W power limit, and PCIe Gen4 instead of Gen5. The 4090 build is what we ran before 5090s were available. Either works.
What this box can host: open frontier models (Qwen, Kimi, MiniMax, Z.ai, Llama variants) for development and production simultaneously. A single 5090 hosts most 7B/13B models with room. Four cards tensor-parallel host 70B comfortably at 8-bit precision or lower; at 16-bit, the 140GB of weights exceeds their 128GB of VRAM. All eight tensor-parallel handle the largest open weights available, typically with quantization and some offload to system RAM.
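A rough way to sanity-check what fits: weight memory is roughly parameter count times bytes per parameter. The sketch below is a back-of-envelope estimate, not a sizing tool. The 10% VRAM reserve for KV cache and activations is an assumed placeholder (real needs depend on context length and batch size), and `weights_gb` and `fits` are hypothetical helper names.

```python
def weights_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight size in GB: 1B params at 8 bits is about 1 GB."""
    return params_billions * bits_per_param / 8


def fits(params_billions: float, bits_per_param: int, n_gpus: int,
         vram_gb_per_gpu: float = 32.0, reserve_fraction: float = 0.10) -> bool:
    """True if the weights fit in pooled VRAM after holding back a reserve.

    reserve_fraction is an assumption standing in for KV cache and
    activations; tune it for your context length and batch size.
    """
    usable_gb = n_gpus * vram_gb_per_gpu * (1 - reserve_fraction)
    return weights_gb(params_billions, bits_per_param) <= usable_gb


cases = [
    (13, 16, 1),   # 13B at 16-bit on one card
    (70, 16, 4),   # 70B at 16-bit on four cards
    (70, 8, 4),    # 70B at 8-bit on four cards
    (70, 16, 8),   # 70B at 16-bit on all eight
]
for params, bits, gpus in cases:
    size = weights_gb(params, bits)
    verdict = "fits" if fits(params, bits, gpus) else "does not fit"
    print(f"{params}B @ {bits}-bit on {gpus} GPU(s): {size:.0f} GB -> {verdict}")
```

By this estimate a 70B model fits on four cards at 8-bit precision but not at 16-bit, so check the precision of the checkpoint you plan to serve before deciding how many cards to dedicate to it.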
6. Power: 4× PSUs, 8,000W Total
Do the math.
8 cards × 575W = 4,600W steady from GPUs alone under full inference load. Add two EPYC sockets, 24 DIMMs, 12 fans, drives, and platform overhead - call it 6,500W under real load.
Four PSUs split the load cleanly with headroom. You do not want to push any single PSU near its rated ceiling on a 24/7 production box. PSUs die. Multiple PSUs let you keep running on three, at reduced load, while one is replaced.
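Here is the power budget above worked through as a quick sketch. It assumes four identical 2,000W units (8,000W / 4) sharing load evenly, which is a simplification: in a real build each PSU usually feeds specific GPUs, so check your own wiring plan. `psu_utilization` is a hypothetical helper.

```python
GPU_COUNT = 8
GPU_WATTS = 575
SYSTEM_LOAD_W = 6500             # the "real load" estimate from the text
PSU_COUNT = 4
PSU_RATED_W = 8000 / PSU_COUNT   # assumption: four equal 2,000W units


def psu_utilization(load_w: float, psus_online: int,
                    rated_w: float = PSU_RATED_W) -> float:
    """Fraction of rated capacity used, assuming load is shared evenly."""
    return load_w / (psus_online * rated_w)


gpu_load_w = GPU_COUNT * GPU_WATTS
util_all_four = psu_utilization(SYSTEM_LOAD_W, 4)
util_three = psu_utilization(SYSTEM_LOAD_W, 3)
util_three_gpu_only = psu_utilization(gpu_load_w, 3)

print(f"GPU load:                  {gpu_load_w:,} W")
print(f"4 PSUs at 6,500W:          {util_all_four:.0%} of rated")
print(f"3 PSUs at 6,500W:          {util_three:.0%} of rated")
print(f"3 PSUs at GPU-only load:   {util_three_gpu_only:.0%} of rated")
```

With all four online, each unit sits at roughly 81% of its rating. On three units, the full 6,500W estimate exceeds the combined 6,000W rating, so riding out a PSU swap implies power-capping the GPUs or shedding load first.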
7. Cooling: 2× CPU Heatsinks + 12 Case Fans + Central Fan Hub
4,600W of GPU heat needs proper airflow architecture. Most multi-GPU builds throttle hard because home-server builders underestimate this part.
The build uses front-to-back airflow with 12 case fans on a central hub for unified speed control. CPU heatsinks rated well above the EPYC TDP. M.2 heatsink for the boot drive.
If you skip the central fan hub, you'll spend the rest of the box's life tuning individual fan curves manually. Don't.
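To get a feel for how much air 4,600W of heat demands, here is a first-principles estimate from the heat-capacity relation P = ρ · c_p · Q · ΔT. It assumes sea-level air and a 20°C intake-to-exhaust temperature rise; both are placeholders, not measurements from this build. Fans also deliver less than their free-air rating once they push through a packed chassis, so treat the result as a floor.

```python
AIR_DENSITY = 1.2          # kg/m^3, assumption: sea-level air near 20°C
AIR_SPECIFIC_HEAT = 1005   # J/(kg*K)
M3S_TO_CFM = 2118.88       # 1 m^3/s expressed in cubic feet per minute


def required_cfm(heat_watts: float, delta_t_c: float) -> float:
    """Airflow needed to carry heat_watts away with a delta_t_c air temperature rise."""
    m3_per_s = heat_watts / (AIR_DENSITY * AIR_SPECIFIC_HEAT * delta_t_c)
    return m3_per_s * M3S_TO_CFM


GPU_HEAT_W = 4600
DELTA_T_C = 20   # assumption: allowed intake-to-exhaust rise
FANS = 12

total_cfm = required_cfm(GPU_HEAT_W, DELTA_T_C)
print(f"Total airflow needed: {total_cfm:.0f} CFM")
print(f"Per fan ({FANS} fans):   {total_cfm / FANS:.0f} CFM")
```

Halving the allowed temperature rise doubles the required airflow, which is why a tight chassis with poor intake paths throttles so quickly.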

8. Chassis: Custom
This is the part where off-the-shelf stops working.
No consumer case fits 8 dual-slot GPUs plus dual EPYC plus four PSUs plus a dozen fans. Server chassis options exist - Supermicro, ASUS - but the ones at this scale are ugly and built for racks, not for a workshop or office. We built our own frame in the Autonomous workshop.
The chassis matters more than people think. Cable routing, fan mount geometry, GPU support brackets, PSU placement - get any of them wrong and the box becomes unmaintainable. Three years from now when a GPU dies, you want to be able to swap it without dismantling the build.

What This Box Does
- Runs open frontier models locally: Qwen, Kimi, MiniMax, Z.ai
- Serves multiple models concurrently for development and production
- Hosts your own fine-tunes without third-party access
- Trains LoRAs and small models in-house
- No API bills, no rate limits, no data leaving the box
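Serving several models at once usually comes down to giving each server process its own slice of the GPUs. Below is a minimal sketch that assumes vLLM as the serving stack and pins processes with `CUDA_VISIBLE_DEVICES`. The model names, ports, and the 4/2/2 split are hypothetical placeholders, and `vllm_serve_command` is an illustrative helper, not part of vLLM.

```python
def vllm_serve_command(model: str, gpu_ids: list[int], port: int):
    """Build the environment and argv for one vLLM server pinned to gpu_ids."""
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpu_ids)}
    argv = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(len(gpu_ids)),
        "--port", str(port),
    ]
    return env, argv


# Hypothetical layout: production on four cards, staging on two,
# and GPUs 6-7 left free for fine-tuning jobs.
PLAN = [
    ("your-org/production-model", [0, 1, 2, 3], 8000),
    ("your-org/staging-model", [4, 5], 8001),
]

used = [g for _, gpus, _ in PLAN for g in gpus]
assert len(used) == len(set(used)), "a GPU is assigned to two servers"

for model, gpus, port in PLAN:
    env, argv = vllm_serve_command(model, gpus, port)
    prefix = " ".join(f"{k}={v}" for k, v in env.items())
    print(prefix, " ".join(argv))
```

Each printed line is a shell command you could run in its own terminal or service unit. Keeping the split in one table makes it harder to hand the same card to two processes by accident.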
ROI Math
Rough numbers for a team running 100M tokens/day across Claude/GPT-equivalent capabilities:
- API spend at frontier rates: $30K–80K/month depending on model mix
- On-prem TCO including power, cooling, depreciation: ~$3K–5K/month
- Payback period on $80K: 4–10 months in high-utilization scenarios, stretching to 6–18 months once you factor in model quality requirements, infrastructure overhead, and hybrid API usage.
The math gets better the more you use the box. The math gets worse if you're running 5M tokens/day - then you're better off on API.
You build this when your monthly API bill crosses ~$15K. Below that, you're paying for hardware you don't need. Above that, you're renting compute you should own.
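The payback logic can be sketched as hardware cost divided by monthly savings. This is a deliberately naive model: it assumes you replace API spend one-for-one from day one, and that API spend scales linearly with token volume. It ignores the ramp-up time, infrastructure overhead, and hybrid API usage mentioned above, so it lands on the optimistic side of the ranges quoted. `payback_months` is a hypothetical helper.

```python
def payback_months(capex: float, api_monthly: float, onprem_monthly: float) -> float:
    """Months until cumulative savings cover the hardware cost."""
    savings = api_monthly - onprem_monthly
    if savings <= 0:
        return float("inf")   # on-prem never pays for itself
    return capex / savings


CAPEX = 80_000

scenarios = {
    "100M tok/day, low API spend": (30_000, 5_000),
    "100M tok/day, high API spend": (80_000, 3_000),
    "$15K/month threshold": (15_000, 5_000),
    # Assumption: 5M tokens/day costs 5% of the low 100M/day figure.
    "5M tok/day": (30_000 * 5 / 100, 3_000),
}

for name, (api, onprem) in scenarios.items():
    months = payback_months(CAPEX, api, onprem)
    label = "never" if months == float("inf") else f"{months:.1f} months"
    print(f"{name:30s} -> {label}")
```

At the $15K/month threshold the naive payback is about eight months. At 5M tokens/day the API bill is smaller than the on-prem running cost, so the box never pays back.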
Who Runs It
This box needs an owner. Not full-time - but someone on the team who's comfortable with Linux, vLLM or TensorRT-LLM, and the basics of multi-GPU model serving.
If nobody on your team wants the job, the box doesn't get built. It gets bought. There are managed-on-prem vendors who'll spec, install, and operate one of these for you. The hardware cost is the same. The operational overhead transfers.
We've built versions of this for ourselves at multiple scales - 2-GPU, 4-GPU, 8-GPU, more. If your team wants help building one, the door is open: @dee_hw.
This is one of several GPU boxes we run in the Autonomous workshop for development and production. Fully custom frame, full assembly, in-house. Built by our engineers.
