AI Grid

How NVIDIA's AI Grid reference architecture distributes AI inference across edge locations — and why it changes the economics of running AI in production.

Centralised AI factories don't fit every workload
The first wave of AI infrastructure was built for training: massive GPU clusters in a handful of data centres. Training workloads need that concentration. But inference — the work of actually serving predictions to users — has fundamentally different requirements. It's latency-sensitive, geographically distributed, and bursty.
[Figure: a centralised GPU cluster (Virginia / Oregon / London) serving distant users — Tokyo ~180 ms, Mumbai ~140 ms, Berlin ~40 ms, Sydney ~220 ms round-trip. Every request travels to one location and back; latency scales with distance.]
CDNs solved this for web content 25 years ago. Video buffering from a server 3,000 miles away is unacceptable — so you cache it at the edge. The same physics applies to AI inference: a 200ms round trip to a centralised GPU cluster is fine for an internal tool, but lethal for real-time video, gaming, or a customer-facing agent.
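The distance argument is easy to check with back-of-envelope physics. Below is a minimal sketch: light in fibre travels at roughly 200,000 km/s, real routes are longer than the great-circle distance, and every request pays some fixed processing overhead. The route-inflation factor and overhead figures are illustrative assumptions, not measurements.

```python
# Rough round-trip latency model. All constants are illustrative assumptions.
FIBRE_SPEED_KM_S = 200_000   # light in glass is ~2/3 of c
ROUTE_FACTOR = 1.5           # assumed path inflation over great-circle distance
PROCESSING_MS = 20           # assumed fixed overhead (routing, TLS, queueing)

def round_trip_ms(distance_km: float) -> float:
    """Estimated latency for one request/response round trip."""
    one_way_s = (distance_km * ROUTE_FACTOR) / FIBRE_SPEED_KM_S
    return 2 * one_way_s * 1000 + PROCESSING_MS

# Sydney -> US East coast (~15,000 km) vs a hypothetical nearby edge site (~200 km)
print(f"centralised: ~{round_trip_ms(15_000):.0f} ms")  # ~245 ms
print(f"edge:        ~{round_trip_ms(200):.0f} ms")     # ~23 ms
```

Propagation delay alone puts a distant centralised cluster well past a real-time latency budget, which no amount of GPU capacity can buy back — only moving the compute closer does.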