AI Grid
How NVIDIA's AI Grid reference architecture distributes AI inference across edge locations — and why it changes the economics of running AI in production.
Centralised AI factories don't fit every workload
The first wave of AI infrastructure was built for training: massive GPU clusters in a handful of data centres. Training workloads need that concentration. But inference — the work of actually serving predictions to users — has fundamentally different requirements. It's latency-sensitive, geographically distributed, and bursty.
CDNs solved this for web content 25 years ago. Video buffering from a server 3,000 miles away is unacceptable — so you cache it at the edge. The same physics applies to AI inference: a 200ms round trip to a centralised GPU cluster is fine for an internal tool, but lethal for real-time video, gaming, or a customer-facing agent.
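The latency claim is just physics. A minimal back-of-envelope sketch (assuming light in fibre travels at roughly two-thirds of its vacuum speed; the constants and function name here are illustrative, not from any NVIDIA spec):

```python
# Back-of-envelope: why distance alone makes centralised inference slow.
# Assumption: light in fibre propagates at ~2/3 the vacuum speed of light;
# real paths add routing, queueing, and handshake overhead on top.

SPEED_OF_LIGHT_KM_S = 300_000
FIBRE_FACTOR = 2 / 3  # glass's refractive index slows light down

def propagation_rtt_ms(distance_km: float) -> float:
    """Minimum round-trip time from propagation delay alone, in ms."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBRE_FACTOR)
    return 2 * one_way_s * 1000

# ~3,000 miles ≈ 4,800 km: the physical floor is ~48 ms before any
# processing, and observed RTTs are typically 2-4x that minimum.
print(f"{propagation_rtt_ms(4800):.0f} ms")  # → 48 ms
```

Once queueing, TLS handshakes, and the model's own inference time are stacked on top of that floor, a 200ms user-visible round trip to a distant cluster is unsurprising.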
AI Grid: CDN architecture for inference
NVIDIA's AI Grid reference design distributes GPU compute across edge locations — the same locations that already serve web content, video, and security. Instead of routing every AI request to a central cluster, the grid routes it to the nearest capable node.
This is the CDN playbook applied to AI: put GPUs where the users are, and route each request to the nearest one that can handle it.
Three tiers: right model, right location, right cost
A practical AI Grid deployment uses three tiers. The edge handles latency-critical inference — thousands of points of presence running small, quantised models. Regional sites handle heavier models and batch workloads. Core sites handle training and frontier inference. An orchestrator routes each request to the cheapest tier that meets its latency and capability requirements.
This is the CDN model pushed one step further: compute follows content, which follows users. The three-tier split means you stop paying for hyperscale GPU time when all you actually need is a small model returning a sub-50ms answer to a user 3,000 miles from the core. The commercial play belongs to whoever already owns the edge footprint and can add GPUs: they had the real estate and the global network before AI made them valuable.
Where AI Grid changes the economics
AI Grid matters most for workloads where latency directly impacts user experience or business outcomes: real-time video personalisation, gaming AI, financial services, retail recommendations, and the customer service agents from the WhatsApp example.
A global customer-service agent becomes significantly more viable on AI Grid: inference at the edge means sub-50ms response times globally, not just in the region where your GPU cluster lives. The passenger in Tokyo and the passenger in London both get instant responses, not "instant in one region, 400ms elsewhere".