Let's be honest. Up until 2025, AI API costs were basically free money. Token prices were so cheap that we'd throw GPT-5 at a simple classification task, use Opus for summaries, and talk ourselves into it: "It's the best model, so obviously we should use it." Then 2026 hit — HBM memory price spikes, new energy levies, and rising compliance requirements all landed at once, and API prices climbed sharply. One HN developer admitted to "spending two weeks wrestling with costs," and a consensus is spreading across the industry that "the subsidy era is over."

TL;DR

Why prices are rising: HBM memory costs, new energy taxes, and tightening compliance requirements are all hitting at once, driving AI API prices up.

The key to cutting costs: Model tiering alone — routing simple tasks to cheaper models and complex tasks to expensive ones — can cut costs 60–80%.

The strategy: Combine prompt minimalism, batch APIs, caching, and local compute to dramatically reduce costs without sacrificing performance.

What Is It?

It's got a fancy name — Lean Engineering — but the idea is simple. Don't use expensive AI models everywhere. Pick the right-sized model for each job.

Independent developer David Vartanian put it plainly on HN: "I started this with my own savings, no VC money, so I figured I was pretty lean. Turns out I wasn't. Using the most expensive model every single time had become a habit." He's not alone. As of 2026, frontier models (GPT-5, Claude 4.5 Opus, etc.) charge $15–$75 per million output tokens, while smaller models that can handle the same tasks run $0.05–$1.

The real issue wasn't technical — it was habit. "Setting one powerful model as the default and never questioning it" compounds costs exponentially over time.

Cost Reality Check: Run 1,000 chatbot conversations a day, averaging 2K tokens each (about 60M tokens a month). With GPT-5, that's $1,050/month. With Gemini 3 Flash, it's $12/month. That's an 87x difference.
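The arithmetic behind that callout is worth making explicit. A minimal sketch, assuming the blended per-million-token rates implied by the article's figures (roughly $17.50/M for the frontier model, $0.20/M for the economy one); plug in your own prices:

```python
def monthly_cost(convs_per_day: int, tokens_per_conv: int,
                 price_per_m_tokens: float, days: int = 30) -> float:
    """Monthly spend given a blended price per million tokens."""
    tokens_per_month = convs_per_day * tokens_per_conv * days
    return tokens_per_month / 1_000_000 * price_per_m_tokens

# 1,000 conversations/day x 2K tokens = ~60M tokens/month
frontier = monthly_cost(1000, 2000, 17.50)  # rate implied by the $1,050 figure
economy = monthly_cost(1000, 2000, 0.20)    # rate implied by the $12 figure

print(f"frontier: ${frontier:,.0f}/mo, economy: ${economy:,.0f}/mo, "
      f"{frontier / economy:.0f}x gap")
```

Sixty million tokens a month turns even a small per-token price gap into a four-figure monthly difference, which is the whole argument for tiering.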

What Changes?

Through 2025, the default equation was "bigger model = better results." But 2026 data tells a different story.

|  | Old Approach (All-In Frontier) | Lean Engineering |
| --- | --- | --- |
| Model selection | GPT-5/Opus for everything | 3-tier routing by task complexity |
| Monthly cost (1K chats/day) | $1,050/month | $12–$132/month |
| Latency | 800ms+ (large-model overhead) | 50–100ms (small models) |
| Throughput | ~15 tok/s (GPT-5) | 200–544 tok/s |
| Prompt management | Dump everything into context | Strip filler, design for minimum tokens |
| Infrastructure | 100% cloud API dependent | Local/hybrid mix |

Real-world results back this up. In Microsoft's distillation experiments, shrinking a 405B parameter model down to 8B actually improved NLI task accuracy by 21%. "Sketch-of-Thought" research proved you can cut reasoning token usage by over 70% while maintaining accuracy. And a Clarifai solutions architect put it this way: "Enterprise customers handle 80% of API calls with smaller models, reserving large models only for complex reasoning — and they're cutting compute costs by 70%."

60–80%: cost reduction with model tiering
10–30x: inference cost gap between small and large models
70%+: reasoning tokens you can cut (with compact reasoning)

Getting Started

A step-by-step guide you can start using tomorrow.

  1. Map your current cost structure
    First, measure which models are being used for which tasks and how many tokens you're burning. FinOps tools like Finout let you track costs by project. The data backs this up: 80% of companies miss their AI infrastructure cost estimates by more than 25%.
  2. Sort your tasks into three tiers
    Simple (classification, extraction, basic Q&A) → economy models like Gemini 3 Flash or Claude Haiku. Mid-tier (summarization, general reasoning) → Claude 4.5 Sonnet, o4-mini. Complex (multi-step analysis, creative work) → GPT-5, Claude Opus.
  3. Put your prompts on a diet
    Aggressively cut unnecessary context and filler. Cache the static parts of a 4K system prompt and you'll cut input costs by 40% alone. Remember David's line: "Every unnecessary token is a direct drain on capital."
  4. Use batch APIs for non-real-time work
    Both OpenAI and Anthropic offer 50% discounts on batch API calls. Document analysis, content generation — anything that doesn't need an instant response — cuts your bill in half.
  5. Consider local compute
    For repetitive, predictable tasks, running on local GPUs is cheaper long-term. Deploy an open-source model like Mixtral 8x7B locally and per-token charges disappear entirely — plus you get data privacy as a bonus.
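The three-tier split in step 2 can be sketched as a simple router. This is a minimal sketch using the article's model lineup; `classify_task` is a hypothetical keyword heuristic standing in for what would, in practice, be a cheap classifier model or explicit task labels from the calling code:

```python
# Tier -> model mapping, following the article's recommendations.
TIERS = {
    "simple": "gemini-3-flash",      # classification, extraction, basic Q&A
    "mid": "claude-4.5-sonnet",      # summarization, general reasoning
    "complex": "gpt-5",              # multi-step analysis, creative work
}

def classify_task(prompt: str) -> str:
    """Hypothetical heuristic: real systems use a small classifier model
    or task labels passed in by the application."""
    text = prompt.lower()
    if any(k in text for k in ("analyze", "plan", "design")):
        return "complex"
    if any(k in text for k in ("summarize", "explain")):
        return "mid"
    return "simple"

def route(prompt: str) -> str:
    """Pick the cheapest model that can handle the task."""
    return TIERS[classify_task(prompt)]

print(route("Classify this ticket as bug or feature"))  # gemini-3-flash
print(route("Summarize this quarterly report"))         # claude-4.5-sonnet
print(route("Analyze our churn data and plan fixes"))   # gpt-5
```

The payoff comes from the default: unrecognized tasks fall through to the cheapest tier, so you only pay frontier prices when something explicitly qualifies.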
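Step 3's caching advice can be sketched as a request payload. This follows the shape of Anthropic's Messages API prompt caching (a `cache_control` block marking the static system prompt as cacheable); the model name is illustrative and no request is actually sent here:

```python
# Static 4K-token system prompt: the part worth caching across calls.
STATIC_SYSTEM = "You are a support assistant. <4K tokens of policies...>"

def build_request(user_message: str) -> dict:
    """Build a Messages API payload that caches the static system prompt,
    so repeated calls pay full input price only for the changing user turn."""
    return {
        "model": "claude-haiku",  # illustrative economy-tier model name
        "max_tokens": 512,
        "system": [{
            "type": "text",
            "text": STATIC_SYSTEM,
            "cache_control": {"type": "ephemeral"},  # mark as cacheable
        }],
        "messages": [{"role": "user", "content": user_message}],
    }

print(build_request("Where is my order?")["system"][0]["cache_control"])
```

The design point: keep everything static (policies, tool definitions, few-shot examples) in the cached system block and everything per-request in `messages`, or the cache never gets a hit.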
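For step 4, the batch workflow starts with a JSONL file of requests. A sketch of preparing input for OpenAI's Batch API (one request object per line, each with a `custom_id`); the model name and documents are illustrative, and the upload/submit steps are omitted:

```python
import json

# Non-real-time work to process at the 50% batch discount.
docs = ["Quarterly report text...", "Support transcript text..."]

with open("batch_input.jsonl", "w") as f:
    for i, doc in enumerate(docs):
        request = {
            "custom_id": f"doc-{i}",  # lets you match results to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",  # illustrative; use your mid-tier pick
                "messages": [
                    {"role": "user", "content": f"Summarize:\n{doc}"},
                ],
            },
        }
        f.write(json.dumps(request) + "\n")
```

You then upload the file and create a batch job; results arrive within 24 hours, which is exactly why this only fits work that doesn't need an instant response.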

Deep Dive Resources

Real-World LLM Cost Comparison

Want a side-by-side breakdown of major LLM API pricing as of 2026? Zen van Riel's LLM API Cost Comparison 2026 guide covers everything from frontier to economy models with real workload-based cost calculations.

The Science Behind Small Model Performance

MIT's "Meek Models" research takes an academic look at why budget models can match the performance of much larger ones. Required reading if you want to understand the mechanics of distillation, quantization, and efficient inference.

The Full Breakdown: 2026 AI Cost Drivers

Finout's Top 6 AI Cost Drivers report systematically breaks down six cost factors, including compute, LLM pricing, customization, labor, and security.