
Run Local AI on a Mac Mini — Ollama + Gemma 4 Setup Guide

A practical guide to running Google Gemma 4 26B locally on Mac Mini with Ollama.

Gemma 4: Byte for byte, the most capable open models — Google Blog

Gemma 4 — Google DeepMind

google/gemma-4-26B-A4B — Hugging Face

Tired of cloud API bills that keep creeping up? Or maybe you're not comfortable sending sensitive data to external servers? Google's Gemma 4, released April 2nd, might be the answer to both problems. Thanks to its MoE architecture — only 3.8B of the total 26B parameters are active at any time — it runs at 20–30 tok/s on a single Mac Mini.

TL;DR

Install Ollama (1 min) → Download Gemma 4 model (5 min) → Set environment variables → Auto-start + persistent model loading → Connect apps via OpenAI-compatible API

What Is It?

Gemma 4 is an open-weight LLM family released by Google DeepMind on April 2nd, 2026. It's Apache 2.0 licensed, so you're free to use it commercially.

The standout here is the 26B A4B model. "A4B" stands for Active 4 Billion: out of the full 26 billion parameters, only 3.8 billion are actually activated during inference. That's the MoE (Mixture of Experts) architecture at work: a router selects only the relevant expert networks (out of 128 total) for each token instead of running the whole model.

Why does MoE matter?

It means a 26B model runs at roughly the speed of a 4B model. You still need enough memory to hold all 26B parameters, but the per-token compute cost is closer to that of a 4B model, which makes it practical on a Mac's unified memory. On benchmarks, it scored 88.3% on AIME 2026 (math) and 82.6% on MMLU Pro.

Ollama is an open-source tool for running LLMs locally. Think of it like Docker for models — you manage and run them with ollama pull and ollama run commands. Once installed, it automatically spins up an OpenAI-compatible API server at localhost:11434, so you can drop it into your existing OpenAI-based apps just by changing the base URL.

The setup guide that earned 322 points on Hacker News got popular for one simple reason: it's a genuinely practical local AI setup that takes under 10 minutes from install to auto-start.

What Changes?

"Why bother running locally when you can just use an API?" — Let's answer that with actual numbers.

Comparison	Cloud API (GPT-4o, etc.)	Local Ollama + Gemma 4 26B
Upfront cost	$0 (pay-as-you-go)	$0 (model is free, use your existing Mac)
Monthly cost (at 100 req/day)	$30–150+ (depends on model and token count)	Just electricity ($3–5)
Data privacy	Data leaves your device	Stays on your Mac — zero external transfer
Internet required	Always	Works offline after initial download
Response speed	0.5–2s including network latency	No network delay (20–30 tok/s)
Context window	128K (GPT-4o)	256K (Gemma 4 26B)
Model capability	Frontier models (Claude, GPT) win here	#6 on the Arena AI text leaderboard
Rate limits	Per-minute and daily limits apply	Unlimited
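To sanity-check the electricity figure for your own setup, here's a rough sketch of the arithmetic. The 30 W average draw and $0.17/kWh rate are placeholder assumptions, not measurements from this guide, so plug in your own numbers.

```shell
# Monthly electricity cost for a machine that stays on 24/7.
# usage: monthly_power_cost <avg_watts> <usd_per_kwh>
monthly_power_cost() {
  awk -v w="$1" -v p="$2" 'BEGIN { printf "%.2f\n", w * 24 * 30 / 1000 * p }'
}

# Hypothetical inputs: 30 W average draw at $0.17 per kWh.
monthly_power_cost 30 0.17   # prints 3.67
```

With those assumed inputs the result lands inside the $3–5 range quoted in the table; a Mac that idles most of the day would come in lower.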

Sure, local models still can't match the raw capability of frontier models like Claude 4 or GPT-5. But as the HN discussion made clear, local models have a real edge for privacy-sensitive work, repetitive automation, and prototyping where API costs add up fast.

Heads Up: Hardware Requirements

The 26B model (Q4_K_M quantization) uses around 15–18GB of memory. We recommend at least 32GB unified memory. On 16GB Macs the system will struggle, and 24GB users have reported freezing under concurrent requests. If you've got a 16GB Mac, gemma4:e4b (4.5B params, ~9.6GB) is the practical choice.
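If you want to see where a number like 15–18GB comes from, a back-of-the-envelope estimate is parameters × bits per weight ÷ 8. The sketch below assumes roughly 4.85 bits per weight for Q4_K_M, which is a typical ballpark rather than a figure from this guide, and it ignores the KV cache and runtime overhead that come on top.

```shell
# Rough size of the weights in GB: params (in billions) x bits per weight / 8.
# usage: est_weights_gb <params_billions> <bits_per_weight>
est_weights_gb() {
  awk -v b="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", b * bits / 8 }'
}

est_weights_gb 26 4.85   # 26B at an assumed ~4.85 bits/weight: prints 15.8
est_weights_gb 26 16     # the same 26B at FP16: prints 52.0
```

Note that the full 26B goes into the estimate, not the 3.8B active parameters: MoE cuts compute per token, not the memory needed to hold the weights.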

Getting Started

Install Ollama
brew install --cask ollama-app
After installing, run open -a Ollama to launch it — you'll see an icon appear in your menu bar. The CLI tool gets installed at /opt/homebrew/bin/ollama.
Download the Gemma 4 Model
ollama pull gemma4:26b
This downloads about 18GB. If your Mac has less than 32GB of RAM, try ollama pull gemma4 (default 8B) or ollama pull gemma4:e4b instead.
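If you're scripting the setup across several Macs, the rule of thumb above can be sketched as a small helper. The function name pick_gemma_tag is made up for illustration, and the 32GB cutoff simply mirrors this guide's recommendation.

```shell
# Map unified memory (in whole GB) to the tag this guide suggests.
pick_gemma_tag() {
  if [ "$1" -ge 32 ]; then
    echo "gemma4:26b"
  else
    echo "gemma4:e4b"
  fi
}

# macOS reports physical memory in bytes via sysctl; fall back to 0 elsewhere.
mem_bytes=$(sysctl -n hw.memsize 2>/dev/null || echo 0)
mem_bytes=${mem_bytes:-0}
mem_gb=$((mem_bytes / 1024 / 1024 / 1024))
echo "Detected ${mem_gb}GB, suggested tag: $(pick_gemma_tag "$mem_gb")"
```

You could then feed the result straight into the download step with ollama pull "$(pick_gemma_tag "$mem_gb")".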
Run a Quick Test
ollama run gemma4:26b "Hey, what model are you?"
If you get a response, you're good to go. Use ollama ps to check which models are loaded and how much memory they're using.
Set GPU Optimization Environment Variables
launchctl setenv OLLAMA_NUM_GPU 99
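This is meant to maximize how many model layers are offloaded to the GPU in Apple Silicon's unified memory, pushing speed as high as possible. Ollama normally uses the GPU (via Metal) on Apple Silicon by default, so confirm the result with ollama ps rather than assuming: if inference does fall back to CPU, speeds can drop by more than half. Restart Ollama after setting the variable so it picks up the change.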
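One way to check where the model actually ended up is to look at the PROCESSOR column of ollama ps. The sketch below assumes a fully offloaded model is reported as "100% GPU"; the helper name fully_on_gpu is invented for this example.

```shell
# Succeeds if stdin (the output of `ollama ps`) lists the given model
# with "100% GPU" on its row.
fully_on_gpu() {
  awk -v m="$1" '$1 == m && /100% GPU/ { ok = 1 } END { exit !ok }'
}

# Only query Ollama if the CLI is installed on this machine.
if command -v ollama >/dev/null 2>&1; then
  if ollama ps | fully_on_gpu gemma4:26b; then
    echo "gemma4:26b is fully on GPU"
  else
    echo "gemma4:26b is not loaded, or is partly on CPU"
  fi
fi
```

A split such as CPU/GPU percentages in that column usually means the model doesn't fit comfortably in memory, in which case a smaller tag is the safer fix.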
Keep the Model Loaded (Prevent Unloading)
launchctl setenv OLLAMA_KEEP_ALIVE "-1"
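By default, Ollama unloads models after 5 minutes of inactivity. Reloading a 26B model takes 15–30 seconds, which is not ideal. Setting this to "-1" keeps it loaded indefinitely. Restart Ollama so it picks up the new value. Keep in mind that launchctl setenv does not survive a reboot, so you'll need to re-run it at login (for example from a LaunchAgent). Adding export OLLAMA_KEEP_ALIVE="-1" to your ~/.zshrc only helps when you start the server from a terminal with ollama serve; the menu-bar app does not read shell startup files.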
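Keep-alive can also be set per request instead of globally. As a minimal sketch, an /api/generate call with no prompt and keep_alive set to -1 should load the model and leave it resident; the helper name preload_body is made up here.

```shell
# Build the JSON body for a prompt-less /api/generate call that loads a
# model and keeps it resident (keep_alive -1 means no idle timeout).
preload_body() {
  printf '{"model":"%s","keep_alive":-1}\n' "$1"
}

# Send it to the local server; just report if nothing is listening.
curl -s --max-time 120 http://localhost:11434/api/generate \
  -d "$(preload_body gemma4:26b)" || echo "Ollama server not reachable"
```

This is handy when only one of several models should stay pinned in memory while the rest follow the default timeout.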
Set Up Auto-Start (Optional)
Add Ollama to macOS Login Items and use a LaunchAgent to preload your model automatically — so AI is ready the moment your Mac boots up. The specific plist configuration is in the original guide linked in the resources below.
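For orientation, here's roughly what a preload LaunchAgent could look like. This is a sketch, not the plist from the original guide: the label com.example.ollama-preload is a placeholder, and it relies on ollama run with an empty prompt loading the model without generating anything. It may also fire before the Ollama server is ready at login, so treat it as a starting point.

```shell
# Print a LaunchAgent plist that runs `ollama run <model> ""` once at login.
# The label com.example.ollama-preload is a placeholder, not from the guide.
preload_plist() {
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.example.ollama-preload</string>
  <key>ProgramArguments</key>
  <array>
    <string>/opt/homebrew/bin/ollama</string>
    <string>run</string>
    <string>$1</string>
    <string></string>
  </array>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
EOF
}

# Review the output first, then install it by hand, for example:
#   preload_plist gemma4:26b > ~/Library/LaunchAgents/com.example.ollama-preload.plist
#   launchctl load ~/Library/LaunchAgents/com.example.ollama-preload.plist
preload_plist gemma4:26b
```

The function only prints the plist, so nothing is installed until you redirect it into ~/Library/LaunchAgents yourself.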
Connect Your Apps (OpenAI-Compatible API)
Ollama exposes an OpenAI-compatible API at localhost:11434. Just swap the base URL in your existing code and you're done.
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"gemma4:26b","messages":[{"role":"user","content":"Hello"}]}'
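For tools built on the official OpenAI SDKs, you often don't need to touch the code at all: recent SDK versions read the base URL and key from environment variables. A minimal sketch, assuming your client honors OPENAI_BASE_URL (some need the base URL set in code instead) and using "ollama" as a dummy key:

```shell
# Point OpenAI-SDK-based tools at the local Ollama server.
export OPENAI_BASE_URL="http://localhost:11434/v1"
# Ollama does not check the key, but most clients refuse to start without one.
export OPENAI_API_KEY="ollama"

# List the models the local endpoint exposes, if the server is up.
curl -s --max-time 5 "$OPENAI_BASE_URL/models" || echo "Ollama server not reachable"
```

Remember to pass gemma4:26b as the model name in your app, since names like gpt-4o won't exist on the local server.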

At a glance: 26B total params (3.8B active), 256K context window, ~18GB model download, 20–30 tok/s (on M4 Pro).

Deep Dive Resources

Ollama + Gemma 4 Mac Mini Setup Guide — GitHub Gist

The original guide that earned 322 points on Hacker News. Covers installation, auto-start, LaunchAgent plist configuration, and persistent model loading all in one place.

Gemma 4 26B on Mac Mini — DEV Community

A deep-dive covering memory requirements by quantization level (Q4_K_M to FP16), GPU offloading optimization, and custom context window configuration.

Gemma 4 — Ollama Official Model Page

Full model lineup from E2B to 31B, sizes by tag, supported features, and usage examples.

Gemma 4 Official Page — Google DeepMind

Official specs covering benchmark performance, architecture details, and agentic workflow support.

Gemma 4 Hardware Guide — Compute Market

VRAM requirements for each model size from 2B to 31B, with performance comparisons by quantization option. Find out which model fits your Mac.

FAQ

Can I run Gemma 4 on a 16GB Mac?

The 26B model is a tough ask. On 16GB Macs, go with gemma4:e4b (4.5B params, ~9.6GB). It's less capable than the 26B, but at 69.4% on MMLU Pro it's more than enough for practical automation work.

Are there alternatives to Ollama?

LM Studio is GUI-based and friendlier for beginners; llama.cpp gives you fine-grained performance control. All three can run on the same underlying inference engine (llama.cpp), so speed differences are usually small. If you need an API server, go with Ollama. If you just want to explore models, LM Studio is the better fit.

What does Gemma 4 offer over other open models like Llama?

The biggest advantage is the MoE architecture: 26B model performance at close to 4B speeds. On top of that, you get a 256K context window, multimodal support (images and audio), and support for 140 languages. It currently sits at #6 on the Arena AI text leaderboard, putting it near the top of open models.

Can it fully replace frontier models like Claude or GPT?

Not quite yet. The HN community consensus is that frontier models still win on specialized coding and complex reasoning. That said, for privacy-sensitive work, repetitive automation, and prototyping, local models have a clear cost and speed advantage.

Written by Rush

Tracking where business meets AI.
