AI Cost Curves
When API costs cross the self-hosting line — the economics that drive infrastructure decisions.
The economics explained
Pay-per-token
API Economics
"A taxi — cheap for short trips, ruinous for a daily commute"
API pricing is elegantly simple: you pay per token in, per token out. At low volumes, this is unbeatable — no infrastructure, no GPUs, no ops team. But the cost scales linearly with volume. Double your queries, double your bill. There's no economy of scale, no volume discount that matters at 100K+ queries per day.
At 1,000 queries/day with a frontier model, you're spending roughly $3K-10K/month. Manageable. At 100,000 queries/day, that's $300K-1M/month. At that point, you're not paying for AI — you're paying rent on someone else's GPUs.
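The linear scaling can be sketched in a few lines. The token counts and per-million-token prices below are illustrative assumptions chosen to land in the article's $3K-10K/month range at 1,000 queries/day, not any vendor's actual rates.

```python
def monthly_api_cost(queries_per_day, tokens_in=4_000, tokens_out=800,
                     price_in_per_m=15.0, price_out_per_m=75.0):
    """API spend scales linearly with volume: double the queries, double the bill.

    Prices are $/million tokens; all figures are assumed, not quoted rates.
    """
    per_query = (tokens_in * price_in_per_m + tokens_out * price_out_per_m) / 1e6
    return queries_per_day * 30 * per_query

print(f"1K/day:   ${monthly_api_cost(1_000):,.0f}/mo")    # ~$3,600/mo
print(f"100K/day: ${monthly_api_cost(100_000):,.0f}/mo")  # ~$360,000/mo
```

Note there is no term in the formula that shrinks with volume: that is the whole problem with pure API at scale.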
Own the hardware
Self-Hosted Economics
"Buying a car — expensive upfront, but the commute is nearly free"
Self-hosting means GPU servers: A100s, H100s, or their cloud equivalents. The fixed cost is significant — $10K-30K/month for a capable cluster. But once you're paying that fixed cost, the marginal cost per query is nearly zero. Your 100,000th query costs the same as your first.
The break-even depends on your model size and query volume. A quantised 7B model on a single A10G can handle 50K+ queries/day for under $2K/month. The same workload via API would cost 10-50x more.
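The amortisation arithmetic is worth making explicit. A minimal sketch, using the section's own figures (a quantised 7B on one A10G at roughly $2K/month handling 50K queries/day):

```python
def cost_per_query(fixed_monthly, queries_per_day):
    """Amortise a flat GPU bill over volume: the marginal query is ~free."""
    return fixed_monthly / (queries_per_day * 30)

# Quantised 7B on a single A10G, ~$2K/mo (figures from the text):
print(f"${cost_per_query(2_000, 50_000):.5f}/query")  # ~$0.0013/query
```

At a tenth of a cent per query, the comparison with a multi-cent API call is where the 10-50x gap comes from.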
The inflection point
The Crossover Point
"The moment buying a car becomes cheaper than taxis"
There's a specific volume where the lines cross: API cost rising linearly meets self-hosted cost sitting flat. Below the crossover, API wins on simplicity. Above it, self-hosting wins on economics. For most workloads, this crossover sits between 10,000 and 50,000 queries per day — but it depends heavily on model size, query complexity, and your ops maturity.
Don't guess — benchmark. Run your actual workload on a self-hosted model for a week. Compare cost, latency, and quality. The crossover point is different for every use case. And remember: self-hosting also means you own your data pipeline end-to-end.
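The crossover itself is a one-line solve: the volume where the flat self-hosted bill equals the linear API bill. The $15K/month cluster and $0.03/query API price below are assumed inputs for illustration; plug in your own benchmark numbers.

```python
def crossover_queries_per_day(self_host_fixed_monthly, api_cost_per_query):
    """Volume where the linear API cost line crosses the flat self-hosted line.

    Solves: queries/day * 30 * api_cost_per_query == self_host_fixed_monthly.
    """
    return self_host_fixed_monthly / (api_cost_per_query * 30)

# Assumed figures: $15K/mo cluster vs $0.03/query API.
print(f"{crossover_queries_per_day(15_000, 0.03):,.0f} queries/day")  # ~16,667
```

With these inputs the crossover lands around 17K queries/day, inside the 10K-50K range the text cites; cheaper clusters or pricier API calls pull it lower.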
Best of both
Hybrid Architecture
"Own a car for the commute, take a taxi to the airport"
The pragmatic answer is rarely pure API or pure self-hosted. Route by complexity: simple, high-volume queries go to a cheap self-hosted SLM. Complex, nuanced queries that need frontier reasoning go to the API. You get 90% of the cost savings with 100% of the quality ceiling.
A typical split: 80-90% of queries handled by a 7B-14B self-hosted model at near-zero marginal cost. The remaining 10-20% routed to a frontier API for hard cases. Total cost: a fraction of pure-API, with no quality compromise where it matters.
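The blended cost of that split is easy to model: a fixed SLM bill plus API spend on only the routed fraction. All numbers below (50K queries/day, 15% routed out, $0.03/query API, $2K/month SLM) are assumptions for illustration.

```python
def hybrid_monthly_cost(queries_per_day, api_fraction,
                        api_cost_per_query, slm_fixed_monthly):
    """Fixed self-hosted SLM cost plus API spend on the hard-case fraction only."""
    api_spend = queries_per_day * api_fraction * api_cost_per_query * 30
    return slm_fixed_monthly + api_spend

# 50K queries/day, 15% routed to a frontier API (assumed numbers):
hybrid = hybrid_monthly_cost(50_000, 0.15, 0.03, 2_000)
pure_api = 50_000 * 0.03 * 30
print(f"hybrid ${hybrid:,.0f}/mo vs pure API ${pure_api:,.0f}/mo")  # $8,750 vs $45,000
```

Under these assumptions the hybrid bill is roughly a fifth of pure API, consistent with the "fraction of pure-API" claim above.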
The cost ladder
As volume grows, the optimal infrastructure shifts — from API to hybrid to self-hosted.
Infrastructure maturity
Each rung represents a shift in where inference runs and what it costs
- $3-10K/mo
- $10-100K/mo
- $5-20K/mo
- $2-10K/mo
- <$1K/mo
The crossover point depends on your query volume, latency requirements, and team capability.
Decision framework
Under 1K queries/day?
API. The infrastructure overhead of self-hosting costs more than the tokens.
Between 1K-50K queries/day?
Benchmark. Run a 7B model on a single GPU for a week. Compare cost and quality.
Over 50K queries/day?
Self-host the bulk. Use the API for the small fraction of queries that need frontier reasoning.
Latency-sensitive (edge inference)?
Self-host with quantised models. API round-trips add 200-500ms you can't optimise away.
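The framework above can be codified as a routing rule of thumb. A sketch only: the thresholds are the article's heuristics, not hard laws, and every deployment should re-derive them from its own benchmark.

```python
def recommend(queries_per_day, latency_sensitive=False):
    """Map the decision framework's heuristics to a recommendation.

    Thresholds (1K, 50K queries/day) come from the text; treat them as
    starting points, not fixed breakpoints.
    """
    if latency_sensitive:
        return "self-host quantised models at the edge"
    if queries_per_day < 1_000:
        return "pure API"
    if queries_per_day <= 50_000:
        return "benchmark a self-hosted 7B against your API bill"
    return "self-host the bulk, route hard cases to a frontier API"

print(recommend(500))
print(recommend(200_000))
print(recommend(5_000, latency_sensitive=True))
```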