Tired of cloud API bills that keep creeping up? Or maybe you're not comfortable sending sensitive data to external servers? Google's Gemma 4, released April 2nd, might be the answer to both problems. Thanks to its MoE architecture — only 3.8B of the total 26B parameters are active at any time — it runs at 20–30 tok/s on a single Mac Mini.
What Is It?
Gemma 4 is an open-weight LLM family released by Google DeepMind on April 2nd, 2026. It's Apache 2.0 licensed, so you're free to use it commercially.
The standout here is the 26B A4B model. "A4B" stands for Active 4 Billion: out of the full 26 billion parameters, only 3.8 billion are actually activated during inference. That's the MoE (Mixture of Experts) architecture at work: a router selects only a few relevant expert networks (out of 128 total) for each token.
Why does MoE matter?
It means a 26B model generates tokens at roughly the speed of a 4B model. You still need enough memory to hold all 26B parameters, but the compute cost per token is closer to that of a 4B model, which makes it practical on a Mac's unified memory. Quality holds up too: it scored 88.3% on the AIME 2026 math benchmark and 82.6% on MMLU Pro.
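To make the routing idea concrete, here's a toy sketch of top-k expert selection in plain Python. The 128 experts come from the description above; the choice of 8 active experts per token, the random router scores, and the `route_token` helper are illustrative assumptions, not Gemma 4's actual implementation.

```python
import math
import random

NUM_EXPERTS = 128   # total experts, as stated in the article
TOP_K = 8           # assumption for illustration only


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and softmax-normalize their weights."""
    # Indices of the highest-scoring experts, best first.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax over the chosen experts only (subtract the max for stability).
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # fake router output
    picks = route_token(logits)
    print(f"{len(picks)} of {NUM_EXPERTS} experts run for this token")
    for expert_id, weight in picks:
        print(f"expert {expert_id:3d} weight {weight:.3f}")
```

Only the selected experts do any work for that token, which is why compute tracks the active parameter count while memory still has to hold every expert.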
Ollama is an open-source tool for running LLMs locally. Think of it like Docker for models — you manage and run them with ollama pull and ollama run commands. Once installed, it automatically spins up an OpenAI-compatible API server at localhost:11434, so you can drop it into your existing OpenAI-based apps just by changing the base URL.
The setup guide that earned 322 points on Hacker News got popular for one simple reason: it's a genuinely practical local AI setup that takes under 10 minutes from install to auto-start.
What Changes?
"Why bother running locally when you can just use an API?" — Let's answer that with actual numbers.
| Comparison | Cloud API (GPT-4o, etc.) | Local Ollama + Gemma 4 26B |
|---|---|---|
| Upfront cost | $0 (pay-as-you-go) | $0 (model is free, use your existing Mac) |
| Monthly cost (at 100 req/day) | $30–150+ (depends on model and token count) | Just electricity ($3–5) |
| Data privacy | Data leaves your device | Stays on your Mac — zero external transfer |
| Internet required | Always | Works offline after initial download |
| Response speed | 0.5–2s including network latency | No network delay (20–30 tok/s) |
| Context window | 128K (GPT-4o) | 256K (Gemma 4 26B) |
| Model capability | Frontier models (Claude, GPT) win here | #6 on the Arena AI text leaderboard |
| Rate limits | Per-minute and daily limits apply | Unlimited |
Sure, local models still can't match the raw capability of frontier models like Claude 4 or GPT-5. But as the HN discussion made clear, local models have a real edge for privacy-sensitive work, repetitive automation, and prototyping where API costs add up fast.
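If you want to sanity-check the monthly cost row against your own usage, a rough back-of-the-envelope calculator might look like this. Every input in the example (tokens per request, per-million-token prices, average power draw, electricity rate) is a hypothetical placeholder, so swap in your provider's real pricing and your own measurements.

```python
def monthly_api_cost(requests_per_day, input_tokens, output_tokens,
                     usd_per_m_input, usd_per_m_output, days=30):
    """Estimated monthly spend in USD for a pay-per-token API."""
    per_request = (input_tokens * usd_per_m_input
                   + output_tokens * usd_per_m_output) / 1_000_000
    return requests_per_day * days * per_request


def monthly_electricity_cost(avg_watts, usd_per_kwh, hours_per_day=24, days=30):
    """Estimated monthly electricity cost in USD for a machine left running."""
    kwh = avg_watts / 1000 * hours_per_day * days
    return kwh * usd_per_kwh


if __name__ == "__main__":
    # All inputs below are hypothetical; plug in your own numbers.
    api = monthly_api_cost(100, input_tokens=2000, output_tokens=500,
                           usd_per_m_input=2.50, usd_per_m_output=10.00)
    power = monthly_electricity_cost(avg_watts=30, usd_per_kwh=0.15)
    print(f"API: ${api:.2f}/month, electricity: ${power:.2f}/month")
```

With these made-up inputs the estimate lands at about $30 a month for the API and a little over $3 for electricity, which matches the low end of the table. Longer prompts or pricier models push the API side up quickly.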
Heads Up: Hardware Requirements
The 26B model (Q4_K_M quantization) uses around 15–18GB of memory. We recommend at least 32GB unified memory. On 16GB Macs the system will struggle, and 24GB users have reported freezing under concurrent requests. If you've got a 16GB Mac, gemma4:e4b (4.5B params, ~9.6GB) is the practical choice.
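Where does the 15–18GB figure come from? Here's a rough sketch of the arithmetic. The 4.8 bits per weight is an assumed average for Q4_K_M rather than a measured value, and `recommend_model` is a hypothetical helper that just encodes the guidance above (it conservatively sends 24GB Macs to the smaller model).

```python
def estimate_weight_memory_gb(total_params, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


def recommend_model(unified_memory_gb):
    """Map the article's guidance to an Ollama tag: 32GB or more gets the 26B model."""
    return "gemma4:26b" if unified_memory_gb >= 32 else "gemma4:e4b"


if __name__ == "__main__":
    # ~4.8 bits per weight is an assumption for Q4_K_M, not a measurement.
    weights_gb = estimate_weight_memory_gb(26e9, bits_per_weight=4.8)
    print(f"26B weights at ~4.8 bits each: about {weights_gb:.1f} GB")
    for ram in (16, 24, 32, 64):
        print(f"{ram} GB Mac -> {recommend_model(ram)}")
```

Under that assumption the weights alone come to roughly 15.6GB. The context cache and runtime overhead sit on top of that, which is consistent with the 15–18GB range and explains why 16GB and 24GB machines run out of headroom.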
Getting Started
- Install Ollama
brew install --cask ollama-app
After installing, run `open -a Ollama` to launch it; an icon will appear in your menu bar. The CLI tool gets installed at `/opt/homebrew/bin/ollama`.
- Download the Gemma 4 Model
ollama pull gemma4:26b
This downloads about 18GB. If your Mac has less than 32GB of RAM, try `ollama pull gemma4` (default 8B) or `ollama pull gemma4:e4b` instead.
- Run a Quick Test
ollama run gemma4:26b "Hey, what model are you?"
If you get a response, you're good to go. Use `ollama ps` to check which models are loaded and how much memory they're using.
- Set GPU Optimization Environment Variables
launchctl setenv OLLAMA_NUM_GPU 99
This maximizes how many layers get loaded into Apple Silicon's unified memory, pushing speed as high as possible. Without this, you'll fall back to CPU and speeds can drop by more than half.
- Keep the Model Loaded (Prevent Unloading)
launchctl setenv OLLAMA_KEEP_ALIVE "-1"
By default, Ollama unloads models after 5 minutes of inactivity. Reloading a 26B model takes 15–30 seconds, which is not ideal. Setting this to "-1" keeps it loaded indefinitely. To persist after reboot, add `export OLLAMA_KEEP_ALIVE="-1"` to your `~/.zshrc`.
- Set Up Auto-Start (Optional)
Add Ollama to macOS Login Items and use a LaunchAgent to preload your model automatically, so AI is ready the moment your Mac boots up. The specific plist configuration is in the original guide linked in the resources below.
- Connect Your Apps (OpenAI-Compatible API)
Ollama exposes an OpenAI-compatible API at localhost:11434. Just swap the base URL in your existing code and you're done.
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"gemma4:26b","messages":[{"role":"user","content":"Hello"}]}'
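For apps written in Python, the same request as the curl call can be sketched with nothing but the standard library. The helper names `build_chat_request` and `chat` are made up for this example, and it assumes Ollama is running locally with `gemma4:26b` already pulled.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint


def build_chat_request(prompt, model="gemma4:26b", base_url=BASE_URL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # OpenAI-style responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Hello")` with Ollama running should return the model's reply as a string. Apps built on an OpenAI SDK typically only need their base URL pointed at the same address.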
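The quick test and the keep-alive setting can also be driven through Ollama's native `/api/generate` endpoint instead of the CLI, which makes it easy to check whether you're really getting 20–30 tok/s. This is a minimal sketch: the function names are hypothetical, and it relies on the `keep_alive`, `eval_count`, and `eval_duration` fields as described in Ollama's API docs.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"  # Ollama's native endpoint


def build_generate_payload(prompt, model="gemma4:26b", keep_alive=-1):
    """Non-streaming request body; keep_alive=-1 asks Ollama to keep the model loaded."""
    return {"model": model, "prompt": prompt,
            "stream": False, "keep_alive": keep_alive}


def tokens_per_second(result):
    """Generation speed from Ollama's timing fields (eval_duration is in nanoseconds)."""
    return result["eval_count"] / result["eval_duration"] * 1e9


def generate(prompt, **kwargs):
    """Send one prompt and return (reply text, tokens per second)."""
    data = json.dumps(build_generate_payload(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(GENERATE_URL, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    return result["response"], tokens_per_second(result)
```

With Ollama running, `generate("Hey, what model are you?")` returns the reply along with the measured speed. Passing `keep_alive` per request affects only the model loaded for that call, so it works alongside the environment variable rather than replacing it.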




