...

Best Local LLMs for OpenClaw (2026 guide)

OpenClaw mascot, a Mac Mini with the Apple logo running a local LLM as a glowing teal neural core, and the Ollama llama logo, showing agent, model, and runtime working together on one Mac

Everyone’s running OpenClaw with a local LLM now. Or trying to. Most guides make it sound easy: install Ollama, pull a model, done. They skip the part where half these models can’t actually do what OpenClaw needs them to do.

One model that looked great on benchmarks looped endlessly on a basic tool-calling chain. Another, half its size, ran an agent for two weeks without a single failure. More on both later.

In the OpenClaw vs Claude article, I said combining OpenClaw + Claude via OAuth was the best setup. That’s still true for complex reasoning. But for the routine 80%, email triage, scheduling, file management, messaging? The right local model on the right hardware handles it without touching the cloud.

OpenClaw itself barely touches your RAM. The model you feed it? That’s where the gigabytes go. And the minimum bar for “actually works as an agent” is higher than most people think.

The short answer

For most OpenClaw users: run Qwen3.6-27B at Q4 (about 14GB) on a 32GB Mac, and keep a Claude OAuth fallback for the hard 20%. Pick by the RAM you have:

  • Under 32GB: run OpenClaw with a cloud model via OAuth, or use split architecture.
  • 32GB ($1,199): Qwen3.6-27B, Devstral-24B, or Qwen3-Coder:32B locally. The sweet spot.
  • 64GB ($1,999): dual model, zero cloud.
  • 192-512GB ($7,999+): full 262K context and many agents, fully local.

Where the RAM actually goes

OpenClaw’s daemon uses 300-500 MB of RAM. Add about 100 MB per messaging channel and maybe 1 GB for sandbox containers. Total overhead in production: 1-2 GB. That’s it.

But here’s what nobody tells you upfront. OpenClaw’s system prompt alone is 17,000 tokens. Add sub-agent context, tool definitions, and conversation history, and you need 32K context minimum. 65K+ for production with sub-agents.

That context eats RAM through the KV cache, on top of the model weights.

Any machine can run OpenClaw. The real question: can it run a 24-32B model with 65K context alongside it?

Small models fail this test. 7-8B models hallucinate tool calls and produce format errors. 14B is marginal. And MoE (mixture-of-experts) models at higher parameter counts can look great on paper but loop on tool-calling chains in practice.

Models that actually work

Everything below comes from real-world production setups, not benchmark scores. Two people in particular helped shape this section: Ian Paterson, who runs OpenClaw on a 32GB Mac Studio, and Alican Kiraz, who runs a 48-agent OpenClaw cluster.

The recommended local models for OpenClaw, in order, are Qwen3.6-27B as the current headline pick, then Devstral-24B or Qwen3-Coder:32B as proven fallbacks.

ModelParamsRAM at Q4Best forVerdict
Qwen3.6-27B27B dense~14GBheadline pick, fits 32GBBest pick
Devstral-Small-2-24B24B~14GBproven 2-week production runProven
Qwen3-Coder:32B32B~20GBvery stable tool calls, needs 32GB+Stable
GLM-4.7 Flashflashvariesdual-model fallbackFallback
Minimax M2.5 / Qwen3-Coder-Nextmixedvaries48-agent clusters at scaleAt scale
GPT-OSS 20B/120B20B / 120B~16GB+agent-tuned extra optionSolid
Qwen 3 14B / DeepSeek-R1-Distill-14B14Bfits 24GBlight tasks onlyMarginal
7-8B (Qwen 3 8B, Mistral 7B, Llama 3.2 8B)7-8B~6GBone-off Q&A, not agentsSkip
Local LLMs for OpenClaw agents: memory and verdict at a glance.

Top picks for 32GB+

Qwen3.6-27B. The current headline pick as of May 2026. A dense 27B that scores 77.2% on SWE-bench Verified, beating the older 397B Qwen3.5 MoE at 76.2%, plus 59.3% on Terminal-Bench 2.0. That second number tracks function-calling reliability, the exact thing OpenClaw’s tool loop depends on.

It needs about 14GB at 4-bit, so it fits a 32GB machine with room for context, with a native 262,144-token window. Released April 22, 2026, so most older guides predate it. It’s the strongest proof yet of the dense-beats-MoE point this section makes. For the full walkthrough, see how to run Qwen locally.

Devstral-Small-2-24B. Ian Paterson’s production model on a 32GB Mac Studio M1 Max. About 14GB on disk at Q4_K_M. Stable tool calling at 13.2 tok/s. Slow enough that you notice. Reliable enough that you stop caring. Two weeks in production without a single failure.

Qwen3-Coder:32B. Community consensus pick. Extremely stable tool calling. Needs about 20GB at Q4_K_M plus 4-6GB for KV cache at 65K context. Requires 32GB+ hardware.

GLM-4.7 Flash. Best fallback model. Pairs well with either of the above for dual-model rotation, OpenClaw switches between them automatically. Recommended combo: Devstral-24B or Qwen3-Coder:32B as primary, GLM-4.7 Flash as fallback.

Minimax M2.5 and Qwen3-Coder-Next. What Alican Kiraz runs across his 48-agent cluster. Proven at scale. Qwen3-Coder-Next has native tool calling support.

GPT-OSS 20B/120B. An agent-tuned open option that keeps showing up in current OpenClaw model rankings. A reasonable extra pick, though I’d still start with Qwen3.6-27B or the Devstral/Qwen3-Coder pair.

The MoE trap

Mixture-of-experts models like Qwen3-Coder-30B look great on paper. 49 tok/s on Apple Silicon, almost 4x faster than Devstral. Ian Paterson tried it first. It looped endlessly on tool-calling chains. The slower dense model won on reliability.

Speed means nothing if your agent gets stuck in a loop. Stick to dense models for agent work.

14B: marginal

Qwen 3 14B and DeepSeek-R1-Distill-14B both fit on 24GB hardware. Qwen is more coherent. DeepSeek has chain-of-thought reasoning from the full R1 model. Neither is reliable enough for production agent tasks that need consistent tool calling.

If you’re on this tier, run the local model for simple tasks and lean on Claude OAuth for anything with more than two steps.

8B and below: skip for agents

7-8B models (Qwen 3 8B, Mistral 7B, Llama 3.2 8B) produce tool call format errors constantly. Fine for answering one-off questions. Useless for agent work.

1-2B models are routing only. PicoClaw on a Raspberry Pi handles basic triage and smart home triggers. One user got a Telegram agent running on a Pi 400 with ZeroClaw, but hit context and timeout walls doing anything more complex.

Comparison table of eight local LLMs by RAM and OpenClaw agent verdict, Qwen3.6 27B best

Runtime choice: it matters more than you think

The fastest way onto a working setup is the one-command launcher. As of Ollama 0.17, ollama launch openclaw --model

--yes installs OpenClaw via npm, pulls the model, and configures the gateway in one go. The production decisions below still apply once you’re past that first run.

Both real-world power users I talked to chose LM Studio or vLLM over Ollama for production. There’s a reason.

The Ollama streaming gotcha. OpenClaw sends stream: true by default. Ollama’s streaming doesn’t emit tool_calls delta chunks properly.

The model decides to call a tool, but the response comes back empty. Silent failure.

Your agent just stops mid-task and you have no idea why.

Fixes: set stream: false in model config (kills streaming performance), use Ollama’s native /api/chat endpoint instead of the OpenAI-compatible one, or skip Ollama entirely.

LM Studio is what Ian Paterson uses for production. GUI for testing models, API at localhost:1234 for production. For OpenClaw, LM Studio is the production-grade backend: it handles tool calls better than Ollama in streaming mode.

vLLM is what Alican Kiraz runs on his NVIDIA DGX Spark nodes via Docker. Best option for dedicated GPU inference servers. Once you settle on a runtime, the secure OpenClaw walkthrough covers locking down access.

If you’re hitting weird silent failures where the agent stops mid-task, try switching the runtime before blaming the model.

How much RAM you actually need

RAMMac hardwarePriceWhat it runsVerdict
Under 16GBany Mac or PivariesOpenClaw + a cloud model via OAuthCloud only
16-24GBbase M-seriesvariesagent node only, in a split setupToo small local
32GBMac Mini M4$1,199Qwen3.6-27B, Devstral-24B, Qwen3-Coder:32BEntry point
48-64GBMac Mini M4 Pro$1,799-1,999a 32B with headroom, or dual modelsComfortable
192-512GBMac Studio Ultra$7,999+full 262K context, many agents at onceFrontier
How much RAM you need, the Mac that fits it, and what each tier runs.

Under 16GB: cloud only

OpenClaw runs fine. No room for a useful model. Best play: OpenClaw + Claude or GPT via OAuth. Still a 24/7 agent, the LLM just lives in the cloud.

A Raspberry Pi with PicoClaw fits here for basic triage only.

16-24GB: too small for agent-grade models

Could fit a 7-8B or 14B model, but both are marginal for real agent tasks. Alican Kiraz runs 16GB Mac Minis in his cluster, but they only run OpenClaw agents. Inference happens on separate hardware. That’s the split architecture approach, which I’ll cover below.

Single machine at this tier? Run OpenClaw with OAuth.

32GB: the entry point

This is where local-first actually starts working.

OpenClaw (~1GB) + Devstral-24B at Q4_K_M (~14GB) + KV cache at 65K (~4-6GB) + OS (~3GB) = about 24GB total. Fits with headroom. Or: Qwen3-Coder:32B (~20GB) + KV cache + OS = about 30GB. Tight but it works. Qwen3.6-27B fits here too at roughly 14GB for the weights.

Ian Paterson’s production config: 32GB Mac Studio M1 Max + Devstral-24B + LM Studio. Two weeks stable.

Pro tip from his setup: sudo sysctl iogpu.wired_limit_mb=24576 raises macOS GPU memory cap. Critical on 32GB machines. KV cache Q8_0 quantization nearly doubles usable context on Apple Silicon.

32GB Mac Mini M4 at $1,199. This is the real starting line.

48-64GB: comfortable

32B model with real headroom. Can run dual models simultaneously. 70B models start fitting at 64GB but give diminishing returns for most agent tasks.

48GB Mac Mini M4 Pro at $1,799. 64GB Mac Mini M4 Pro at $1,999+. This is where “local-first” stops feeling like a fight.

192-512GB: context length and many agents

This tier used to be the only way to run a frontier-class model locally. That framing has shifted. A dense Qwen3.6-27B now matches or beats the 397B Qwen3.5 MoE on coding benchmarks at Q4 on a 32GB machine, so the big-iron tier is no longer about top coding quality.

What it still buys you is context length (the 262K full window on the largest MoE models) and headroom to run many agents at once.

For that workload, Clément Pillette benchmarked the 397B MoE at full 262K context on a 512GB Mac Studio M3 Ultra against a custom workstation with 4x RTX PRO 6000 (384GB VRAM). The Mac Studio hit 35 tok/s. The workstation hit 46.9 tok/s.

The workstation cost ~$47,000 and draws 1,100W at 51 dBA. The Mac Studio cost ~$15,000, draws 120W, and you can barely hear it at ~15 dBA. That’s 6.7x more energy-efficient per token. Over three years, the TCO gap is nearly $42K.

His conclusion: the Mac Studio is the most attractive hardware for running local AI agents. And he says he’s never been a Mac guy.

Most OpenClaw users won’t need this tier. A 32B model handles the routine agent work fine. But if you’re running complex multi-agent workflows or want a frontier-class context window fully local, the 192GB Mac Studio M4 Ultra starts at $7,999 and the 512GB version exists for people who refuse to touch a cloud API.

Split architecture: any budget

Alican Kiraz’s approach separates agent nodes from inference nodes. Cheap hardware (16GB Mac Mini, Raspberry Pi) runs OpenClaw. Expensive hardware (512GB Mac Studio, NVIDIA DGX Spark) runs the models. Connected via local network.

Agent nodes don’t need GPU memory. They just need network access to the inference server. Scales to 48+ agents on a single inference node.

Overkill for most users. But if you’re running multiple OpenClaw instances, this is how the serious setups work.

Five Mac RAM tiers from 16GB to 512GB for local LLMs on OpenClaw, 32GB as entry point

Speed and latency: what local actually feels like

Set expectations. People compare local inference to ChatGPT speed and get disappointed.

SetupSpeedWhat it feels like
Devstral-24B on a 32GB M1 Mac Studio13.2 tok/s“hello” takes ~3 min, compound tasks 8-10 min
397B on a 512GB M3 Ultra20-30 tok/s~5 min for a 22K-token answer
M4 vs M1/M3, same RAM~2x the tok/sfaster, but still minutes per task
Local inference speed: think background worker, not chat.

Ian Paterson’s honest take: saying “hello” takes about 3 minutes. Compound tasks, 8-10 minutes. That’s Devstral-24B at 13.2 tok/s on a 2022 Mac Studio.

Another user on X reported Opus 4.6 burning 22K tokens for a simple question. Even on an M3 Ultra with 512GB at 20-30 tok/s, that’s roughly 5 minutes before you get an answer.

The mental model is “delegating to a junior team member.” You hand off a task and check back later. Come back in 10 minutes.

M4 hardware is faster than M1 or M3. Expect roughly 2x the tok/s on equivalent RAM. But even at double speed, you’re still waiting minutes per task. This is a background worker, something you fire off before making coffee.

For a 24/7 worker doing email triage, scheduling, and file management? Speed doesn’t matter much. It runs while you sleep.

OpenClaw config for local models

  • Context length. 65K minimum, a hard requirement. OpenClaw’s system prompt alone is 17K tokens, which adds 4-8GB via the KV cache. 65K is still the working floor, but the current OpenClaw docs example sets contextWindow around 196,608 and maxTokens 8,192, so give it more headroom if hardware allows.
  • Reasoning mode. For Qwen3.x thinking models, set reasoning: false in OpenClaw (and disable thinking on the Ollama side via the Modelfile), or tool calls fail silently. This is the current top tool-call-failure fix, tracked in OpenClaw issue #71088. If your calls vanish and the model otherwise looks fine, check this first.
  • KV cache quantization. Q8_0 on K and V nearly doubles usable context on Apple Silicon. Huge for 32GB machines where every gigabyte counts. Ian Paterson’s finding.
  • Model quantization. Q4_K_M is the standard. Q5_K_M if you have headroom. Don’t go lower than Q4 unless you’re on a Pi.
  • Temperature. 0 to 0.2 for agent tasks. Higher values mean more hallucinated tool calls. Correctness matters more than creativity here.
  • Streaming. Use stream: false if tool calls silently fail, or switch to LM Studio or vLLM. The current docs recommend the LM Studio + Responses API stack (api: "openai-responses") as the best local backend.
  • Auth profiles. Local model first, Claude OAuth fallback for complex tasks. OpenClaw handles the switching automatically.
Single machine vs split agent and inference node architecture for OpenClaw with a cloud OAuth fallback

When local isn’t enough

  • Complex reasoning. Opus 4.6 vs Qwen3-Coder:32B on SWE-bench isn’t close. Cloud still wins the hard problems.
  • Context length. Local models max out around 128K tokens. Cloud models handle 1M+. If you’re analyzing a full codebase or a 200-page contract, you need a cloud model.

The hybrid setup from the Claude article is still the smartest play. Local for 80% of tasks, route the hard stuff to Claude via OAuth. Your OpenClaw auth profiles handle the switching automatically.

The pick: four hardware paths

  • Under 32GB. Don’t fight it. Run OpenClaw with Claude or GPT via OAuth, or go split architecture: run OpenClaw on cheap hardware and point it at a remote inference server.
  • 32GB Mac Mini ($1,199). Qwen3.6-27B, Devstral-24B, or Qwen3-Coder:32B via LM Studio, with a Claude OAuth fallback for complex reasoning. The sweet spot.
  • 64GB Mac Mini ($1,999). Dual model setup with Qwen3-Coder:32B + GLM-4.7 Flash. Local for everything, cloud for nothing. The “zero cloud” config.
  • 192-512GB Mac Studio ($7,999+). Full 262K context and many agents at once, running fully local at 35 tok/s. For multi-agent enterprise setups or people who want cloud-grade context without the cloud bill.

Want to try OpenClaw before buying hardware? rentamac.io rents 16GB M4 Mac Minis with full admin access. Won’t fit a local model, but you can set up OpenClaw with Claude or GPT via OAuth and have a 24/7 agent running in an hour.

How much RAM decision flowchart for choosing a local LLM setup for OpenClaw, 32GB recommended

A 32GB Mac Mini running Qwen3-Coder:32B costs $0/month in inference and about $1.50/month in electricity. The whole machine draws around 15W idle, 40W under load. Compare that to $200/month for Claude Max or $20/month for Pro with rate limits.

The hardware pays for itself in 6 months. A GPU workstation doing equivalent inference draws 1,100W and sounds like a jet engine.

Cloud models still win on the hard problems. But for an always-on agent handling the routine 80%? A Mac Mini in your closet running Qwen3-Coder:32B changes the math completely.

Rent a Mac in the Cloud

Get instant access to a high-performance Mac Mini in the cloud. Perfect for development, testing, and remote work. No hardware needed.

Mac mini M4