AI Cost Curves: When API Pricing Becomes Your Biggest Infrastructure Decision
The economics of API vs self-hosted AI inference, where the crossover point sits, and why the hybrid approach wins for most production workloads.
There’s a moment in every AI deployment where someone opens the monthly invoice and has a very uncomfortable meeting. The PoC cost $3,000 a month. The production rollout costs $300,000. Nothing changed except volume.
I’ve watched this happen at multiple organisations. The economics of AI inference are not intuitive, and the cost curves don’t behave like anything else in enterprise IT. Understanding where the lines cross is one of the most consequential infrastructure decisions a technology leader makes today.
The API trap: elegant simplicity, linear cost
API pricing for AI models is beautifully simple. You pay per token in and per token out. At low volumes, nothing beats it. No infrastructure. No GPUs. No ops team. You're renting someone else's compute by the millisecond.
But the cost scales linearly. Double your queries, double your bill. There’s no economy of scale, no volume discount that matters when you’re processing 100,000+ queries per day.
At 1,000 queries per day with a frontier model, you’re spending roughly $3K-10K per month. Manageable. At 100,000 queries per day, that’s $300K-1M per month. At that point, you’re not paying for AI. You’re paying rent on someone else’s GPUs.
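To make the linear curve concrete, here's a back-of-envelope sketch. The token counts and per-million-token prices are illustrative assumptions, not any provider's actual rates; substitute your own figures.

```python
# Back-of-envelope API cost model. All defaults are assumed placeholders,
# roughly frontier-tier, not quotes from any provider's price sheet.

def api_monthly_cost(
    queries_per_day: int,
    input_tokens: int = 5_000,       # assumed avg prompt size
    output_tokens: int = 1_000,      # assumed avg completion size
    price_in_per_m: float = 15.00,   # assumed $/1M input tokens
    price_out_per_m: float = 60.00,  # assumed $/1M output tokens
) -> float:
    """Monthly API spend: cost scales linearly with query volume."""
    per_query = (input_tokens * price_in_per_m
                 + output_tokens * price_out_per_m) / 1_000_000
    return queries_per_day * 30 * per_query

print(f"${api_monthly_cost(1_000):,.0f}/month at 1K queries/day")     # $4,050
print(f"${api_monthly_cost(100_000):,.0f}/month at 100K queries/day") # $405,000
```

Double the volume and the output doubles with it. There is no term in that formula that rewards scale.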
The analogy I use: API pricing is like taking a taxi. Cheap for short trips. Ruinous for a daily commute.
Self-hosting: the car purchase
Self-hosting means GPU servers. A100s, H100s, or their cloud equivalents. The fixed cost is real: $10K-30K per month for a capable cluster. But once you’re paying that fixed cost, the marginal cost per query approaches zero. Your 100,000th query costs the same as your first.
The economics are a step function. You pay for capacity in chunks (one GPU, two GPUs, three GPUs), and within each chunk, additional queries are effectively free.
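The same step function as a minimal sketch, using placeholder figures in line with the A10G example below:

```python
import math

# Step-function cost model for self-hosting. Capacity and GPU price are
# assumed placeholders; substitute the figures from your own cluster.

def self_hosted_monthly_cost(
    queries_per_day: int,
    queries_per_gpu_per_day: int = 50_000,  # assumed throughput per GPU
    gpu_monthly_cost: float = 2_000.0,      # assumed $/GPU/month
) -> float:
    """Monthly self-hosted spend: flat within each capacity step."""
    gpus_needed = max(1, math.ceil(queries_per_day / queries_per_gpu_per_day))
    return gpus_needed * gpu_monthly_cost

# Within a step, more queries cost nothing extra:
print(self_hosted_monthly_cost(10_000))  # 2000.0 (one GPU)
print(self_hosted_monthly_cost(50_000))  # 2000.0 (still one GPU)
print(self_hosted_monthly_cost(50_001))  # 4000.0 (steps up to two GPUs)
```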
A quantised 7B parameter model on a single A10G GPU can handle 50,000+ queries per day for under $2,000 per month. The same workload via API would cost 10-50x more.
But self-hosting brings operational complexity. You need MLOps expertise, model management, monitoring, scaling, and on-call support. The question isn’t whether it’s cheaper at scale. It’s whether the operational overhead is worth the savings.
The crossover point
There’s a specific volume where the lines cross. API cost rising linearly meets self-hosted cost sitting flat. Below the crossover, API wins on simplicity. Above it, self-hosting wins on economics.
For most workloads I’ve evaluated, this crossover sits between 10,000 and 50,000 queries per day. But it varies significantly based on model size, query complexity, and your team’s ops maturity.
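A small solver makes the shape visible. Every constant here is an assumed placeholder, chosen to land inside the band above; the point is how the crossover moves when you change them, not the specific figures.

```python
import math

# Solve for the volume where the linear API curve overtakes the
# step-function self-hosted curve. All constants are assumptions.

API_COST_PER_QUERY = 0.03           # assumed blended $/query via API
CLUSTER_MONTHLY_COST = 15_000.0     # assumed fixed $/month per capacity step
QUERIES_PER_STEP_PER_DAY = 100_000  # assumed capacity per step

def api_cost(qpd: int) -> float:
    """Linear: every query has the same marginal price."""
    return qpd * 30 * API_COST_PER_QUERY

def self_hosted_cost(qpd: int) -> float:
    """Step function: flat until you need the next capacity chunk."""
    steps = max(1, math.ceil(qpd / QUERIES_PER_STEP_PER_DAY))
    return steps * CLUSTER_MONTHLY_COST

crossover = next(q for q in range(1_000, 200_001, 1_000)
                 if api_cost(q) > self_hosted_cost(q))
print(f"Crossover at roughly {crossover:,} queries/day")  # ~17,000 here
```

Halve the per-query API price or double the cluster cost and the crossover shifts dramatically, which is exactly why the variables below matter.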
The variables that shift the crossover:
- Model size: A 7B parameter model is cheap to self-host. A 70B parameter model requires serious GPU investment.
- Query complexity: Simple classification tasks cost less per query than long-form generation.
- Latency requirements: Edge inference (close to users) shifts the equation toward self-hosting because API round-trips add 200-500ms you can’t optimise away.
- Data sensitivity: If your data can’t leave your infrastructure, self-hosting isn’t a cost decision. It’s a compliance requirement.
My advice: don’t guess. Run your actual production workload on a self-hosted model for a week. Compare cost, latency, and output quality side by side. The crossover point is different for every use case.
The hybrid answer
The pragmatic answer is rarely pure API or pure self-hosted. The approach I recommend and have seen work best in production:
Route by complexity. Simple, high-volume queries go to a cheap self-hosted small language model (SLM). Complex, nuanced queries that need frontier reasoning go to the API.
A typical production split: 80-90% of queries handled by a 7B-14B self-hosted model at near-zero marginal cost. The remaining 10-20% routed to a frontier API for the hard cases.
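The arithmetic behind that split, with assumed placeholder figures:

```python
# Blended monthly cost of the hybrid split. All figures are assumed:
# 85% of traffic on the self-hosted SLM, 15% escalated to the API.

QUERIES_PER_DAY = 100_000
SLM_SHARE = 0.85              # assumed share handled by the SLM
SLM_MONTHLY_FIXED = 4_000.0   # assumed self-hosted fixed cost
API_COST_PER_QUERY = 0.03     # assumed blended $/query via API

api_queries_per_day = QUERIES_PER_DAY * (1 - SLM_SHARE)
hybrid = SLM_MONTHLY_FIXED + api_queries_per_day * 30 * API_COST_PER_QUERY
pure_api = QUERIES_PER_DAY * 30 * API_COST_PER_QUERY

print(f"Pure API: ${pure_api:,.0f}/month")  # $90,000/month
print(f"Hybrid:   ${hybrid:,.0f}/month")    # $17,500/month
```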
The routing logic doesn’t need to be sophisticated. Task classification works well: simple lookups, summaries, and classification go to the SLM. Multi-step reasoning, creative generation, and ambiguous queries go to the API. You get 90% of the cost savings with 100% of the quality ceiling.
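A minimal sketch of that router. The task names and client functions (call_self_hosted_slm, call_frontier_api) are hypothetical stand-ins for your own endpoints, not a real library's API:

```python
# A deliberately simple router: task classification decides which model
# serves the query.

def call_self_hosted_slm(prompt: str) -> str:
    """Hypothetical client for a quantised 7B-14B model you host."""
    raise NotImplementedError  # wire up your internal inference endpoint

def call_frontier_api(prompt: str) -> str:
    """Hypothetical client for a hosted frontier model."""
    raise NotImplementedError  # wire up your API provider's client

# Simple, high-volume task types stay on the cheap self-hosted model.
SLM_TASKS = {"lookup", "summary", "classification"}

def route(task_type: str, prompt: str) -> str:
    if task_type in SLM_TASKS:
        return call_self_hosted_slm(prompt)  # near-zero marginal cost
    # Multi-step reasoning, creative generation, ambiguous queries:
    return call_frontier_api(prompt)         # frontier quality, per-token cost
```

A rules-based classifier like this is usually enough to start; you can swap in a learned router later without changing the architecture.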
What this means for technology leaders
The AI cost curve decision isn’t just an infrastructure choice. It’s a strategic one.
If you’re running at low volume (under 1,000 queries per day), use APIs. The infrastructure overhead of self-hosting costs more than the tokens.
If you’re between 1,000 and 50,000 queries per day, you’re in the benchmark zone. Run the numbers. The crossover might be closer than you think.
If you’re above 50,000 queries per day and still on pure API, you’re leaving significant money on the table. Self-host the bulk workload. Keep the API for frontier-quality edge cases.
And if latency matters, if you’re processing at the edge, if your users expect sub-100ms responses, self-hosting with quantised models isn’t optional. It’s the only architecture that works.
The organisations getting this right are treating AI inference as a portfolio, not a binary choice. Different models for different jobs, at different price points, all orchestrated through a single routing layer.
Explore the interactive AI Cost Curves framework for visual diagrams showing the crossover point, cost curves, and the hybrid routing architecture.