When AI feels slow, the obvious fix seems to be buying more GPUs. But doubling your GPU count does not double token generation speed. Over the past decade, AI chip compute capability grew 80×, while memory bandwidth grew only 17×. Today's AI bottleneck is not the brain — it is the bloodstream.

3-second summary
More GPUs ≠ faster AI
Real bottleneck = memory bandwidth
In-memory computing emerges
Fractile $220M + XCENA $135M
2027: AI cost structure shifts

Everyone believes this — more GPUs means faster AI

An NVIDIA H100 runs about $30,000. The B200 costs roughly twice that. AI companies pour billions into GPUs because they believe the formula: more GPUs = more compute = faster AI.

But look at memory bandwidth and the story changes. The NVIDIA H100 has 3.35 TB/s of memory bandwidth, meaning it can move at most 3.35 TB between memory and its compute cores each second. The H200 raised that to 4.8 TB/s, a 43% improvement. The problem is that GPU compute performance improved far faster over the same period, so plenty of compute power sits idle, waiting for data to arrive from memory.

This is what engineers call the "Memory Wall." Every time an LLM generates a token, it has to read its model weights from memory, which for the largest models means hundreds of gigabytes. This read is the bottleneck: no matter how many compute cores you add, slow memory means waiting. The gap between 80× compute growth and 17× bandwidth growth over a decade is the essence of today's bottleneck.
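
To make the Memory Wall concrete, here is a rough back-of-envelope sketch. It assumes the simplest possible case: one request at a time, with every weight read from memory once per generated token. The bandwidth figures are the ones quoted above; the 70B-parameter, 16-bit model and the function name are hypothetical, chosen only for illustration. Real systems use batching, caching, and multi-GPU sharding, so treat the result as a ceiling, not a benchmark.

```python
# Back-of-envelope ceiling on single-stream decode speed when every
# generated token requires streaming all model weights from memory once.

def max_tokens_per_second(bandwidth_tb_s: float,
                          params_billions: float,
                          bytes_per_param: float) -> float:
    """Bandwidth-bound ceiling: bytes movable per second / bytes read per token."""
    bandwidth_bytes = bandwidth_tb_s * 1e12
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_bytes / weight_bytes

# Bandwidth figures come from the article. The model size and precision
# below are assumptions for illustration, not figures from the article.
H100_TB_S = 3.35
H200_TB_S = 4.8
PARAMS_B = 70          # hypothetical 70B-parameter model
BYTES_PER_PARAM = 2    # 16-bit weights

h100 = max_tokens_per_second(H100_TB_S, PARAMS_B, BYTES_PER_PARAM)
h200 = max_tokens_per_second(H200_TB_S, PARAMS_B, BYTES_PER_PARAM)

print(f"H100 ceiling: {h100:.1f} tokens/s")
print(f"H200 ceiling: {h200:.1f} tokens/s")
print(f"Speedup from bandwidth alone: {h200 / h100:.2f}x")
```

Under these assumptions the ceiling is about 24 tokens per second on the H100 and about 34 on the H200. Notice that compute throughput never appears in the formula: in this regime, only bandwidth and model size set the limit.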

80×: AI chip compute growth (10 years)
17×: Memory bandwidth growth (same period)
~1 month: Time to process 100M tokens today

The real problem is how far data has to travel

Here is how current AI chip architecture works in simple terms: data leaves memory, gets preprocessed by the CPU, travels to the GPU for computation, then returns to memory. This round trip repeats for every single token generated. That journey itself consumes time and energy.

What Fractile has been building since 2022 is a way to eliminate that journey: an "In-Memory Compute" architecture that places compute logic alongside the SRAM cells, so calculations happen where the data is stored. Matrix multiplications never leave memory. They are processed inside it, and only the results come out.

"Faster speed is not just about going from 10 seconds to 100 milliseconds. It is about going from weeks, months — down to something much, much shorter."

— Walter Goodwin, Fractile CEO

By the numbers: today's advanced AI systems can generate up to 100 million tokens while solving a complex problem, but at ~40 tokens per second on current hardware, that takes about a month. Fractile's target is 1,200 tokens per second, which would bring the same task down to roughly a day. The company claims its design could be 25× faster at one-tenth the cost compared to current GPU setups.
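
The arithmetic behind these figures is easy to check. The sketch below assumes a single uninterrupted generation stream at a constant rate, which is a simplification; the token count and the two rates are the ones quoted above, and the function name is just for illustration.

```python
# Wall-clock time to generate a fixed number of tokens at a constant rate.

SECONDS_PER_DAY = 24 * 60 * 60

def days_to_generate(total_tokens: int, tokens_per_second: float) -> float:
    """Days needed to emit total_tokens at a steady tokens_per_second."""
    return total_tokens / tokens_per_second / SECONDS_PER_DAY

TOTAL_TOKENS = 100_000_000   # 100M-token task from the article
CURRENT_RATE = 40            # tokens/s on current hardware (article figure)
TARGET_RATE = 1_200          # tokens/s, Fractile's stated target

current_days = days_to_generate(TOTAL_TOKENS, CURRENT_RATE)
target_days = days_to_generate(TOTAL_TOKENS, TARGET_RATE)

print(f"At {CURRENT_RATE} tokens/s: {current_days:.1f} days")
print(f"At {TARGET_RATE} tokens/s: {target_days:.2f} days")
print(f"Ratio between the two rates: {TARGET_RATE / CURRENT_RATE:.0f}x")
```

That works out to roughly 29 days at 40 tokens per second and just under one day at 1,200 tokens per second.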

|                 | Current GPU approach                 | In-memory computing               |
| Data flow       | Memory → CPU → GPU → Memory (repeat) | Compute completes inside memory   |
| Bottleneck      | Memory bandwidth ceiling (3–8 TB/s)  | Minimizes data movement           |
| 100M token task | ~1 month (40 tokens/sec)             | ~1 day target (1,200 tokens/sec)  |
| Cost target     | Baseline                             | 1/10th cost (Fractile claim)      |

$355M landed on the same bet in two months

Fractile's $220M round in May 2026 got attention. At the end of the same month, Korean chip startup XCENA also raised $135M at a $570M valuation. Their approaches differ: Fractile computes inside SRAM, while XCENA's MX1 chip uses CXL to place processing power right next to DRAM. But the diagnosis is identical.

In XCENA's own words: "Inference is not just a compute problem; it is increasingly a memory scaling problem." Teams in Seoul and London independently reached the same conclusion.

The investor list is telling too. Fractile has Founders Fund (Peter Thiel) and former Intel CEO Pat Gelsinger behind it. Anthropic is reportedly in early discussions to purchase Fractile chips once they ship. Anthropic currently sources compute from three suppliers: NVIDIA, Google TPUs, and Amazon Trainium. Fractile could become the fourth. The AI inference market is projected to grow from ~$103B in 2025 to ~$255B by 2030.

NVIDIA knows this too

Blackwell boosted memory bandwidth significantly, and the H200 delivers 43% more bandwidth than the H100. But what Fractile and XCENA are targeting is not "better memory bandwidth inside a GPU" but "unifying memory and compute." NVIDIA will likely dominate in the short term, but the long-term architecture bet is being placed right now.

What to do before 2027

Fractile's chip will not arrive until 2027. XCENA targets production by end of 2026. You can start preparing for this shift today.

  1. Factor the cost decline curve into your AI planning
    Per-token pricing from GPT, Claude, and Gemini follows infrastructure costs down. If AI ROI does not pencil out today, recalculate based on 2027–2028 pricing. Things not economically viable now may become possible then.
  2. Design long-context workflows in advance
    The workloads Fractile targets are "100M+ token deep reasoning" tasks. Claude's 200K and Gemini's 1M context windows are available now but expensive. Expect dramatically cheaper access post-2027 — map out processes that would benefit from long context today.
  3. Revisit your speed-vs-cost tradeoffs
    "Cost-optimized" mode slows AI responses. That tradeoff narrows post-2027. List the use cases you abandoned for speed reasons, and have them ready when infrastructure costs drop.
  4. Watch for vendor lock-in
    Anthropic eyeing Fractile as a fourth compute supplier signals the AI infrastructure diversification era has begun. More supplier diversity means more price competition. Be careful about contracts that lock you into a single vendor today.
  5. Set an H2 2027 AI workflow checkpoint
    Both Fractile and XCENA target 2026–2027 production. Mark that as your team's AI infrastructure and cost review point. Use cases that do not deliver ROI today may be ready to go at that checkpoint.
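
One way to act on the first and last items is to keep a small break-even calculation per use case and rerun it at each checkpoint. The sketch below is a minimal illustration, not a forecasting tool. Every number in it is hypothetical: the use case, its token volume, its business value, and today's per-million-token price. The "one-tenth" scenario borrows Fractile's stated cost target and assumes, optimistically, that hardware savings pass through fully to API prices.

```python
# Break-even check for an AI use case under different price scenarios.
from dataclasses import dataclass

@dataclass
class UseCase:
    name: str
    tokens_per_month: float   # expected monthly token volume
    value_per_month: float    # estimated monthly business value, USD

def monthly_cost(use_case: UseCase, price_per_million: float) -> float:
    """Monthly spend at a given USD price per one million tokens."""
    return use_case.tokens_per_month / 1_000_000 * price_per_million

def is_viable(use_case: UseCase, price_per_million: float) -> bool:
    """True when the estimated value exceeds the token spend."""
    return use_case.value_per_month > monthly_cost(use_case, price_per_million)

def break_even_price(use_case: UseCase) -> float:
    """Price per million tokens at which value equals cost."""
    return use_case.value_per_month / (use_case.tokens_per_month / 1_000_000)

# All figures below are hypothetical placeholders; substitute your own.
PRICE_TODAY = 10.0  # USD per million tokens
SCENARIOS = {"today": 1.0, "half price": 0.5, "one-tenth price": 0.1}

contract_review = UseCase("long-context contract review",
                          tokens_per_month=2_000_000_000,
                          value_per_month=8_000.0)

for label, multiplier in SCENARIOS.items():
    price = PRICE_TODAY * multiplier
    cost = monthly_cost(contract_review, price)
    verdict = "viable" if is_viable(contract_review, price) else "not viable"
    print(f"{label}: ${cost:,.0f}/month -> {verdict}")

print(f"Break-even: ${break_even_price(contract_review):.2f} per million tokens")
```

With these placeholder numbers the use case loses money today and at half price, and becomes viable only in the one-tenth scenario. The break-even price is the figure worth tracking between checkpoints.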