At the end of 2022, using GPT-4-level AI cost $20 per million tokens. Now it's $0.40. A 50x collapse in 2 years. This isn't a simple discount — it's a structural shift that's changing how startups use AI entirely.
What Is This?
a16z's Guido Appenzeller coined a name for this phenomenon — "LLMflation". At equivalent performance levels, LLM inference costs are dropping 10x every year. When GPT-3 launched in November 2021, it was $60 per million tokens. Now you can get the same performance level from Llama 3.2 3B for $0.06. A 1,000x drop in 3 years.
Epoch AI's analysis is even more dramatic. Price decline speeds vary by benchmark, with a median of 50x per year. Looking at data from January 2024 onward, prices are falling at 200x per year. The cost of achieving GPT-4-level performance on PhD-level science problems (GPQA) is dropping 40x annually.
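These headline rates are easy to sanity-check. A minimal sketch, using only the dollar figures quoted above (GPT-3 at $60/M tokens in November 2021, equivalent performance at $0.06/M today), shows that a 1,000x drop over 3 years implies a constant ~10x-per-year decline:

```python
def annual_decline_factor(start_price: float, end_price: float, years: float) -> float:
    """Implied constant per-year price-drop multiple for a given total decline."""
    return (start_price / end_price) ** (1 / years)

# $60/M tokens (GPT-3, Nov 2021) -> $0.06/M tokens (equivalent level, 3 years later)
factor = annual_decline_factor(60, 0.06, 3)
print(f"{factor:.0f}x per year")  # prints "10x per year"
```

The same arithmetic applied to Epoch AI's 50x-per-year median compounds to 2,500x over just two years, which is why the benchmark chosen matters so much.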
Why so fast? Six factors are working simultaneously. GPU performance improvements, model quantization (16-bit→4-bit), software optimization, smaller and more efficient models, instruction tuning advances, and pricing pressure from open-source models. It's much faster than semiconductors during the Moore's Law era.
The decisive trigger was DeepSeek. When DeepSeek R1 appeared in January 2025, the industry was turned upside down. Costs were 90–95% lower than OpenAI and Anthropic while performance was comparable. Nvidia's stock suffered the largest single-day market-value loss in US stock market history. The key was that DeepSeek achieved this using export-compliant H800 chips instead of the top-end H100s, which were off-limits under US export controls.
What Makes It Different?
The numbers make it clear. In August 2025, when OpenAI launched GPT-5, they priced it lower than GPT-4o. TechCrunch reported this as "the start of a price war." Google dropped Gemini Flash-Lite to $0.10 per million tokens, and Anthropic responded with batch processing options.
| | Early 2023 (GPT-4 Era) | March 2026 (Now) |
|---|---|---|
| Premium model cost | $30–60/1M output tokens | $8–25/1M output tokens (60–80% down) |
| Lightweight model cost | $1–2/1M tokens | $0.04–0.10/1M tokens |
| Startup monthly API budget | $50,000 | $3,000–5,000 (same workload) |
| Prompt caching | None | Up to 90% input cost savings |
| Off-peak discounts | None | Up to 75% additional discount (DeepSeek) |
Even among frontier models, the price competition is fierce. Here's a comparison of current major model pricing:
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Key Feature |
|---|---|---|---|
| DeepSeek V3 | $0.28 | $1.10 | Best value, 75% off-peak discount |
| Gemini 2.5 Flash | $0.30 | $2.50 | Google infrastructure, fast speed |
| GPT-5 (base) | $1.25 | $10.00 | Cheaper than GPT-4o with better performance |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Coding & analysis specialist |
| Claude Opus 4.6 | $5.00 | $25.00 | Peak performance premium |
The price gap between the cheapest model (DeepSeek V3) and the most expensive (Claude Opus) is over 20x. Include ultra-lightweight models like Mistral Nemo and the gap between lowest and highest exceeds 1,000x. In the past, "good AI = expensive AI." Now, depending on the use case, $0.04 is plenty.
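The gap becomes concrete when you price a real workload against the table above. A minimal estimator, using exactly the per-million-token prices listed there (the 500M-input/100M-output example workload is an assumption for illustration):

```python
# Prices in $ per 1M tokens, taken from the comparison table above.
PRICES = {
    "DeepSeek V3":       {"in": 0.28, "out": 1.10},
    "Gemini 2.5 Flash":  {"in": 0.30, "out": 2.50},
    "GPT-5 (base)":      {"in": 1.25, "out": 10.00},
    "Claude Sonnet 4.6": {"in": 3.00, "out": 15.00},
    "Claude Opus 4.6":   {"in": 5.00, "out": 25.00},
}

def monthly_cost(model: str, input_tokens_m: float, output_tokens_m: float) -> float:
    """Dollar cost for a workload measured in millions of tokens."""
    p = PRICES[model]
    return input_tokens_m * p["in"] + output_tokens_m * p["out"]

# Hypothetical workload: 500M input + 100M output tokens per month.
for model in PRICES:
    print(f"{model:18s} ${monthly_cost(model, 500, 100):>8,.0f}")
```

For that workload, DeepSeek V3 comes out around $250/month and Claude Opus 4.6 around $5,000/month, the 20x spread described above.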
Déjà vu from the AWS cloud revolution
In the 2010s, AWS kept lowering cloud costs, birthing an explosive generation of startups that couldn't afford their own infrastructure. The AI API price war is playing exactly the same role right now. Developers in Lagos, São Paulo, Jakarta, and Bangalore can now access frontier AI.
The Essentials: How to Optimize AI API Costs
- Route models by workload: You don't need GPT-5 for everything. Route simple classification to lightweight models ($0.04/M), summarization to mid-tier ($0.30/M), and only complex reasoning to premium ($3–15/M).
- Use prompt caching: Anthropic offers up to 90% cost savings on cached inputs. If you have repetitive system prompts, apply this immediately.
- Implement batch processing: For tasks that don't need real-time responses (report generation, data classification, etc.), batch APIs can get you a 50% discount.
- Consider API aggregators: Multi-provider platforms like OpenRouter and LemonData let you switch between 400+ models with a single API key. Markup is 0–10%.
- Consider open-source self-hosting: DeepSeek V3 and Llama 3.3 70B deliver 90–95% of GPT-4 performance. If you have high traffic, self-hosting can save 90%+.
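The first tip, routing by workload, can be sketched in a few lines. This is a minimal illustration, not a production router: the task-type keys, tier prices, and model identifiers are assumptions loosely based on the pricing discussed in this article.

```python
# Map task types to (model, $-per-1M-output-tokens) tiers.
# Task names and model IDs are hypothetical examples.
ROUTES = {
    "classify":  ("mistral-nemo",      0.04),  # ultra-light tier
    "summarize": ("gemini-2.5-flash",  0.30),  # mid tier
    "reason":    ("claude-sonnet-4.6", 3.00),  # premium tier
}

def pick_model(task_type: str) -> str:
    """Return the cheapest model tier that fits the task type.

    Unknown task types fall back to the premium tier, trading cost
    for safety rather than risking a too-weak model on a hard task.
    """
    model, _price = ROUTES.get(task_type, ROUTES["reason"])
    return model

print(pick_model("classify"))   # cheap model for simple labels
print(pick_model("summarize"))  # mid-tier model
print(pick_model("unknown"))    # falls back to the premium tier
```

In practice you would route on a classifier's judgment of query difficulty rather than an explicit task label, but the cost logic is the same: send each request to the cheapest tier that can handle it.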
Cheap doesn't always mean good
DeepSeek maintains some API prices through subsidies — a market share strategy burning hedge fund capital. Data privacy, regulatory compliance, and geopolitical risks need consideration too. And beyond direct model costs, when you add infrastructure, monitoring, and compliance, actual costs can be 5–10x higher.