---
title: "Local models"
description: "Run your HolaClaw assistant entirely on your Mac — no account, no API key, no cloud. Total privacy, works offline, free to run."
updated: 2026-06-11
canonical: https://holaclaw.ai/docs/ai-providers/local-models
---

Local models run the AI right on your Mac. Nothing you say ever leaves the machine, it works with no internet connection, and it's completely free to run. The honest trade-off: local models are smaller than the frontier models the cloud providers offer, so expect a little less polish and slower responses, especially on Macs with less memory. For private journaling, offline use, or zero-cost tinkering, that trade is often well worth it.

```provider-card
name: Local
company: On-device
tagline: Runs entirely on your Mac — no API key, no cloud calls.
hue: 145
auth: None — runs on your Mac
cost: Free
local: Yes
recommended: Gemma 4 E4B
```

## How it works

HolaClaw bundles its own **inference server** — the program that actually runs the model and turns your messages into replies. It's built on llama.cpp, a fast on-device engine, and there's nothing extra for you to install. When you pick a local model, HolaClaw downloads its **weights** (the big file of numbers that *is* the trained model) once from Hugging Face, checks that the file arrived intact, and stores it on disk. If a download gets interrupted, it picks up where it left off the next time you're online. Every local model can also see images you send.
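If you're curious what "picks up where it left off" and "checks that the file arrived intact" can look like in practice, here's a minimal Python sketch. It is not HolaClaw's actual code. It assumes the download resumes with a standard HTTP `Range` request and that integrity is checked against a SHA-256 checksum, and the function names `resume_headers` and `is_intact` are made up for illustration.

```python
import hashlib
from pathlib import Path


def resume_headers(partial_file: Path) -> dict[str, str]:
    """Build the HTTP header that asks the server for only the missing bytes.

    Assumption: the server supports standard HTTP Range requests.
    """
    already_have = partial_file.stat().st_size if partial_file.exists() else 0
    # Nothing on disk yet: no Range header, so the whole file is requested.
    return {"Range": f"bytes={already_have}-"} if already_have else {}


def is_intact(weights_file: Path, expected_sha256: str) -> bool:
    """Compare the file's SHA-256 hash with the published checksum.

    Assumption: a SHA-256 checksum is available for the weights file.
    """
    digest = hashlib.sha256()
    with weights_file.open("rb") as f:
        # Weights are several gigabytes, so hash them 1 MiB at a time.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.lower()
```

The idea is that an interrupted download leaves a partial file behind, the next attempt requests only the bytes after it, and the finished file is only trusted once its hash matches.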

## What your Mac needs

HolaClaw runs only on Apple-silicon (M-series) Macs, and the limiting factor for local models is **memory** — specifically your Mac's **unified memory**, the pool of RAM that both the processor and graphics chip share. A model has to fit in that pool to run, so more memory means access to bigger, smarter models.

We recommend 16 GB or more of unified memory for a good experience. Here's the plain-language version:

- **8 GB Macs** — stick to the smallest model, Gemma 4 E2B. It'll run, but expect modest quality.
- **16 GB Macs** — the comfortable middle. Gemma 4 E4B, 12B, GLM-4.6V Flash, and Ministral 3 14B all work well.
- **32 GB or more** — you can run the big three: Qwen3.6 27B, Gemma 4 26B MoE, and Gemma 4 31B.

> **Heads-up — 8 GB Macs are tight.** On a base 8 GB Mac, only Gemma 4 E2B fits comfortably. It's genuinely useful for quick chats and notes, but if you want frontier-quality conversation on that machine, a cloud provider will serve you better.

## Pick a model

Here's the full local catalog, with the size of the one-time download and the unified memory we recommend for each:

| Model | Download | Recommended RAM | Notes |
| --- | --- | --- | --- |
| **Gemma 4 E2B** | 4.1 GB | 8 GB | The smallest — snappy on any M-series Mac |
| **Gemma 4 E4B** | 6.0 GB | 16 GB | **Recommended.** Best balance of speed and quality |
| **Gemma 4 12B** | 7.2 GB | 16 GB | A quality step up from E4B |
| **GLM-4.6V Flash** | 8.0 GB | 16 GB | Fast, and strong with images and screenshots |
| **Ministral 3 14B** | 9.1 GB | 16 GB | Mistral's solid all-rounder |
| **Qwen3.6 27B** | 18.0 GB | 32 GB | Big and capable |
| **Gemma 4 26B MoE** | 18.1 GB | 32 GB | A mixture-of-experts model that punches above its weight |
| **Gemma 4 31B** | 19.5 GB | 32 GB | The strongest local model HolaClaw ships |

**Gemma 4 E4B** is the recommended default and the right starting point for most people on a 16 GB Mac. If your machine has the headroom, step up to 12B for a little more quality, or one of the big three if you've got 32 GB.

> **Tip — start in the middle and adjust.** If E4B feels slow on your Mac, drop to E2B; if it feels too basic and you have the memory, move up. Each model is its own download, so trying another just means picking it for a new assistant.
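As a rough illustration of how the table maps to your Mac, here's a small Python sketch that lists the models whose recommended RAM fits a given amount of unified memory. The numbers are copied from the table above. The function name `models_for` and the simple "recommended RAM must not exceed your memory" rule are assumptions for the example, not how HolaClaw itself decides what to offer.

```python
# (model, download size in GB, recommended unified memory in GB)
CATALOG = [
    ("Gemma 4 E2B", 4.1, 8),
    ("Gemma 4 E4B", 6.0, 16),
    ("Gemma 4 12B", 7.2, 16),
    ("GLM-4.6V Flash", 8.0, 16),
    ("Ministral 3 14B", 9.1, 16),
    ("Qwen3.6 27B", 18.0, 32),
    ("Gemma 4 26B MoE", 18.1, 32),
    ("Gemma 4 31B", 19.5, 32),
]


def models_for(ram_gb: int) -> list[str]:
    """Return the models whose recommended RAM is at most ram_gb."""
    return [name for name, _download_gb, needed_gb in CATALOG if needed_gb <= ram_gb]


print(models_for(8))         # ['Gemma 4 E2B']
print(len(models_for(16)))   # 5
print(len(models_for(32)))   # 8
```

A Mac with an in-between amount of memory, such as 24 GB, gets the same five models as a 16 GB Mac under this rule, because the big three are recommended for 32 GB.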

## Set it up

You choose your model while creating an assistant, in **Step 2 · Model provider** — the screen that says "Connect the brain behind your assistant."

1. In HolaClaw, start the **Create Assistant** flow and go to Step 2.
2. Choose **Local** in the provider sidebar. The tagline reads "Runs entirely on your Mac — no API key, no cloud calls."
3. Pick a model from the list. Each one shows its download size and recommended RAM, plus a status badge: **Downloaded** if the weights are already on your Mac, or **Will download on first run** if they'll be fetched the first time you chat.
4. Continue with the rest of the setup. There's no API key step at all.

If you picked a model that isn't downloaded yet, HolaClaw fetches it the first time your assistant needs it. These are multi-gigabyte files, so the download can take several minutes on a fast connection and considerably longer on a slow one. If it gets cut off, it resumes automatically once you're back online.
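To get a feel for how long a first-run download might take, you can do the arithmetic yourself. This Python sketch is a back-of-the-envelope estimate only. It assumes decimal gigabytes and a connection that holds its advertised speed the whole time, which real connections rarely do, and `download_minutes` is a name invented for the example.

```python
def download_minutes(size_gb: float, speed_mbps: float) -> float:
    """Estimate download time in minutes for a file of size_gb gigabytes.

    Assumes 1 GB = 8,000 megabits and a steady connection speed.
    """
    megabits = size_gb * 8 * 1000
    seconds = megabits / speed_mbps
    return seconds / 60


print(round(download_minutes(6.0, 100)))   # Gemma 4 E4B at 100 Mbps -> 8
print(round(download_minutes(6.0, 10)))    # Gemma 4 E4B at 10 Mbps  -> 80
print(round(download_minutes(19.5, 100)))  # Gemma 4 31B at 100 Mbps -> 26
```

Treat these as best-case figures: the actual time depends on your network and on how busy the download server is.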

## Managing disk space

Model files live on your Mac, and they're not small — anywhere from about 4 GB to nearly 20 GB each. They stay on disk after download so they're ready instantly next time.

If you've tried a few models and want the space back, you can remove the weights for any model you're no longer using. The weights are deleted from your disk, but the model stays in the list, so you can always re-download it later.
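If you like to check before you download, here's a small Python sketch that compares a model's download size with the free space on your disk. It's an illustration, not part of HolaClaw. The function name `fits_on_disk` and the 5 GB of headroom are assumptions, and it measures the volume that holds your home folder because this guide doesn't say where the weights are stored.

```python
import shutil
from pathlib import Path

# Download sizes in GB, copied from the catalog table in this guide.
DOWNLOAD_GB = [4.1, 6.0, 7.2, 8.0, 9.1, 18.0, 18.1, 19.5]


def fits_on_disk(download_gb, free_bytes=None, headroom_gb=5.0):
    """Return True if the download plus some headroom fits in free space.

    free_bytes can be passed in directly; otherwise it is read from the
    volume that contains the home folder (an assumption for this sketch).
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(Path.home()).free
    needed_bytes = (download_gb + headroom_gb) * 1000**3
    return free_bytes >= needed_bytes


print(fits_on_disk(19.5))             # depends on your Mac
print(round(sum(DOWNLOAD_GB), 1))     # 90.0 GB if you kept every model
```

The last line shows why it's worth cleaning up after experimenting: keeping the whole catalog on disk would take about 90 GB.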

## The Inference Server panel

Assistants powered by a local model get one extra section in their **Settings** that cloud assistants don't: **Inference Server**. It shows the server's status and which model it's running, with controls to start, stop, or restart it.

You normally never need to touch this — HolaClaw starts and stops the server for you. It's there for the rare case where you want to check on it or give it a nudge, and we mention it here so the section isn't a surprise when you spot it.

## When local is the right call

Reach for a local model when:

- **Privacy matters most** — journaling, sensitive notes, anything you'd rather never left your Mac.
- **You're offline** — on a plane, off-grid, or just without a connection.
- **You want zero cost** — tinker as much as you like with no bill at the end of the month.

Reach for a cloud provider instead when you want the highest-quality conversation you can get, or you're on a lower-memory Mac where the local models feel limited. The cloud guides cover those options — [Claude](/docs/ai-providers/anthropic-claude), [OpenAI](/docs/ai-providers/openai), [Google Gemini](/docs/ai-providers/google-gemini), [Grok](/docs/ai-providers/grok), [DeepSeek](/docs/ai-providers/deepseek), and [OpenRouter](/docs/ai-providers/openrouter).

Remember that the model choice is made per assistant, when you create it. If you want to compare local and cloud side by side, just create a second assistant that uses the other kind of provider.

## Troubleshooting

Local models fail differently from cloud ones — there's no account or billing to go wrong, so the issues are about memory and downloads:

- **The first response after creating or restarting is slow.** That's the model loading into memory. It's normal, and it happens once each time the inference server starts. Replies speed up after that.
- **Responses are very slow, or your Mac is struggling.** The model is likely too big for your memory. Create a new assistant with a smaller model — drop from a big one to E4B, or from E4B to E2B.
- **A download seems stuck.** Check your connection. Downloads resume automatically when you're back online, so you won't lose progress.

## Next steps

Still weighing local against the cloud? Compare every option in [Choosing an AI provider](/docs/ai-providers/choosing-a-provider). And if you get stuck or just want a hand, the [HolaClaw Discord](https://discord.gg/FbxAbS5sGQ) is a friendly place to ask.

