AI Cost Curves: When API Pricing Becomes Your Biggest Infrastructure Decision
The economics of API vs self-hosted AI inference, where the crossover point sits, and why the hybrid approach wins for most production workloads.
There’s a moment in every AI deployment where someone opens the monthly invoice and has a very uncomfortable meeting. The PoC cost $3,000 a month. The production rollout costs $300,000. Nothing changed except volume.
I’ve watched this happen at multiple organisations. The economics of AI inference are not intuitive, and the cost curves don’t behave like anything else in enterprise IT. Understanding where the lines cross is one of the most consequential infrastructure decisions a technology leader makes today.
The API trap: elegant simplicity, linear cost
API pricing for AI models is beautifully simple. You pay per token in and per token out. At low volumes, nothing beats it. No infrastructure. No GPUs. No ops team. You're renting someone else's compute by the millisecond.
But the cost scales linearly. Double your queries, double your bill. There’s no economy of scale, no volume discount that matters when you’re processing 100,000+ queries per day.
At 1,000 queries per day with a frontier model, you’re spending roughly $3K-10K per month. Manageable. At 100,000 queries per day, that’s $300K-1M per month. At that point, you’re not paying for AI. You’re paying rent on someone else’s GPUs.
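To make the linear curve concrete, here's a back-of-envelope sketch. The token counts and per-million-token prices are illustrative assumptions, not any provider's actual rates; substitute your own figures.

```python
# Back-of-envelope API cost model. All defaults are assumed placeholders,
# roughly frontier-tier, not quotes from any provider's price sheet.

def api_monthly_cost(
    queries_per_day: int,
    input_tokens: int = 5_000,       # assumed avg prompt size
    output_tokens: int = 1_000,      # assumed avg completion size
    price_in_per_m: float = 15.00,   # assumed $/1M input tokens
    price_out_per_m: float = 60.00,  # assumed $/1M output tokens
) -> float:
    """Monthly API spend: cost scales linearly with query volume."""
    per_query = (input_tokens * price_in_per_m
                 + output_tokens * price_out_per_m) / 1_000_000
    return queries_per_day * 30 * per_query

print(f"${api_monthly_cost(1_000):,.0f}/month at 1K queries/day")     # $4,050
print(f"${api_monthly_cost(100_000):,.0f}/month at 100K queries/day") # $405,000
```

Double the volume and the output doubles with it. There is no term in that formula that rewards scale.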
The analogy I use: API pricing is like taking a taxi. Cheap for short trips. Ruinous for a daily commute.
Self-hosting: the car purchase
Self-hosting means GPU servers. A100s, H100s, or their cloud equivalents. The fixed cost is real: $10K-30K per month for a capable cluster. But once you’re paying that fixed cost, the marginal cost per query approaches zero. Your 100,000th query costs the same as your first.
The economics are a step function. You pay for capacity in chunks (one GPU, two GPUs, three GPUs), and within each chunk, additional queries are effectively free.
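The same step function as a minimal sketch, using placeholder figures in line with the A10G example below:

```python
import math

# Step-function cost model for self-hosting. Capacity and GPU price are
# assumed placeholders; substitute the figures from your own cluster.

def self_hosted_monthly_cost(
    queries_per_day: int,
    queries_per_gpu_per_day: int = 50_000,  # assumed throughput per GPU
    gpu_monthly_cost: float = 2_000.0,      # assumed $/GPU/month
) -> float:
    """Monthly self-hosted spend: flat within each capacity step."""
    gpus_needed = max(1, math.ceil(queries_per_day / queries_per_gpu_per_day))
    return gpus_needed * gpu_monthly_cost

# Within a step, more queries cost nothing extra:
print(self_hosted_monthly_cost(10_000))  # 2000.0 (one GPU)
print(self_hosted_monthly_cost(50_000))  # 2000.0 (still one GPU)
print(self_hosted_monthly_cost(50_001))  # 4000.0 (steps up to two GPUs)
```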
A quantised 7B parameter model on a single A10G GPU can handle 50,000+ queries per day for under $2,000 per month. The same workload via API would cost 10-50x more.
But self-hosting brings operational complexity. You need MLOps expertise, model management, monitoring, scaling, and on-call support. The question isn’t whether it’s cheaper at scale. It’s whether the operational overhead is worth the savings.
The crossover point
There’s a specific volume where the lines cross. API cost rising linearly meets self-hosted cost sitting flat. Below the crossover, API wins on simplicity. Above it, self-hosting wins on economics.
For most workloads I’ve evaluated, this crossover sits between 10,000 and 50,000 queries per day. But it varies significantly based on model size, query complexity, and your team’s ops maturity.
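A small solver makes the shape visible. Every constant here is an assumed placeholder, chosen to land inside the band above; the point is how the crossover moves when you change them, not the specific figures.

```python
import math

# Solve for the volume where the linear API curve overtakes the
# step-function self-hosted curve. All constants are assumptions.

API_COST_PER_QUERY = 0.03           # assumed blended $/query via API
CLUSTER_MONTHLY_COST = 15_000.0     # assumed fixed $/month per capacity step
QUERIES_PER_STEP_PER_DAY = 100_000  # assumed capacity per step

def api_cost(qpd: int) -> float:
    """Linear: every query has the same marginal price."""
    return qpd * 30 * API_COST_PER_QUERY

def self_hosted_cost(qpd: int) -> float:
    """Step function: flat until you need the next capacity chunk."""
    steps = max(1, math.ceil(qpd / QUERIES_PER_STEP_PER_DAY))
    return steps * CLUSTER_MONTHLY_COST

crossover = next(q for q in range(1_000, 200_001, 1_000)
                 if api_cost(q) > self_hosted_cost(q))
print(f"Crossover at roughly {crossover:,} queries/day")  # ~17,000 here
```

Halve the per-query API price or double the cluster cost and the crossover shifts dramatically, which is exactly why the variables below matter.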
The variables that shift the crossover:
- Model size: A 7B parameter model is cheap to self-host. A 70B parameter model requires serious GPU investment.
- Query complexity: Simple classification tasks cost less per query than long-form generation.
- Latency requirements: Edge inference (close to users) shifts the equation toward self-hosting because API round-trips add 200-500ms you can’t optimise away.
- Data sensitivity: If your data can’t leave your infrastructure, self-hosting isn’t a cost decision. It’s a compliance requirement.
My advice: don’t guess. Run your actual production workload on a self-hosted model for a week. Compare cost, latency, and output quality side by side. The crossover point is different for every use case.
The hybrid answer
The pragmatic answer is rarely pure API or pure self-hosted. The approach I recommend and have seen work best in production:
Route by complexity. Simple, high-volume queries go to a cheap self-hosted small language model (SLM). Complex, nuanced queries that need frontier reasoning go to the API.
A typical production split: 80-90% of queries handled by a 7B-14B self-hosted model at near-zero marginal cost. The remaining 10-20% routed to a frontier API for the hard cases.
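The arithmetic behind that split, with assumed placeholder figures:

```python
# Blended monthly cost of the hybrid split. All figures are assumed:
# 85% of traffic on the self-hosted SLM, 15% escalated to the API.

QUERIES_PER_DAY = 100_000
SLM_SHARE = 0.85              # assumed share handled by the SLM
SLM_MONTHLY_FIXED = 4_000.0   # assumed self-hosted fixed cost
API_COST_PER_QUERY = 0.03     # assumed blended $/query via API

api_queries_per_day = QUERIES_PER_DAY * (1 - SLM_SHARE)
hybrid = SLM_MONTHLY_FIXED + api_queries_per_day * 30 * API_COST_PER_QUERY
pure_api = QUERIES_PER_DAY * 30 * API_COST_PER_QUERY

print(f"Pure API: ${pure_api:,.0f}/month")  # $90,000/month
print(f"Hybrid:   ${hybrid:,.0f}/month")    # $17,500/month
```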
The routing logic doesn’t need to be sophisticated. Task classification works well: simple lookups, summaries, and classification go to the SLM. Multi-step reasoning, creative generation, and ambiguous queries go to the API. You get 90% of the cost savings with 100% of the quality ceiling.
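A minimal sketch of that router. The task names and client functions (call_self_hosted_slm, call_frontier_api) are hypothetical stand-ins for your own endpoints, not a real library's API:

```python
# A deliberately simple router: task classification decides which model
# serves the query.

def call_self_hosted_slm(prompt: str) -> str:
    """Hypothetical client for a quantised 7B-14B model you host."""
    raise NotImplementedError  # wire up your internal inference endpoint

def call_frontier_api(prompt: str) -> str:
    """Hypothetical client for a hosted frontier model."""
    raise NotImplementedError  # wire up your API provider's client

# Simple, high-volume task types stay on the cheap self-hosted model.
SLM_TASKS = {"lookup", "summary", "classification"}

def route(task_type: str, prompt: str) -> str:
    if task_type in SLM_TASKS:
        return call_self_hosted_slm(prompt)  # near-zero marginal cost
    # Multi-step reasoning, creative generation, ambiguous queries:
    return call_frontier_api(prompt)         # frontier quality, per-token cost
```

A rules-based classifier like this is usually enough to start; you can swap in a learned router later without changing the architecture.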
What this means for technology leaders
The AI cost curve decision isn’t just an infrastructure choice. It’s a strategic one.
If you’re running at low volume (under 1,000 queries per day), use APIs. The infrastructure overhead of self-hosting costs more than the tokens.
If you’re between 1,000 and 50,000 queries per day, you’re in the benchmark zone. Run the numbers. The crossover might be closer than you think.
If you’re above 50,000 queries per day and still on pure API, you’re leaving significant money on the table. Self-host the bulk workload. Keep the API for frontier-quality edge cases.
And if latency matters, if you’re processing at the edge, if your users expect sub-100ms responses, self-hosting with quantised models isn’t optional. It’s the only architecture that works.
The organisations getting this right are treating AI inference as a portfolio, not a binary choice. Different models for different jobs, at different price points, all orchestrated through a single routing layer.
Explore the interactive AI Cost Curves framework for visual diagrams showing the crossover point, cost curves, and the hybrid routing architecture.