How Local LLMs Work — and What Today's Mac Hardware Can Run
A plain-English guide to how local LLMs actually work — tokens, weights, and quantization — and exactly what today's Mac hardware, from a mini to a Studio, can run. Why memory decides what fits and bandwidth decides how fast. From a NY/NJ MSP that builds this in production.
You can run a real large language model on a Mac sitting on your desk. Not a toy. A capable model that drafts, summarizes, classifies, and answers questions about your own documents, with nothing leaving the machine. The hardware is here, the software takes ten minutes to set up, and for a lot of business work it is all you need.
This post is the how and the what: how a local LLM actually works under the hood, and what today’s Mac hardware can run, from a $600 mini to a maxed-out Studio. It is the companion to our local LLMs vs. frontier AI piece, which covers when to run local versus paying for a frontier model like Opus 4.8. This one is the mechanics and the buying guide.
What an LLM actually is
Strip away the mystique and a large language model is a very large file of numbers, called weights, plus a small program that runs them. The weights are the model’s “knowledge,” learned once during training and then frozen. Running the model (inference) is just arithmetic: you feed in text, the weights turn it into a prediction, and out comes the next chunk.
Two terms you will see everywhere:
- Parameters. The count of weights, in billions. An “8B” model has 8 billion; a “70B” has 70 billion. More parameters usually means a smarter model and a bigger file.
- Tokens. Models do not read whole words. They read tokens, which are word-pieces (roughly ¾ of a word each). “Unbreakable” might be three tokens. The model reads your text as tokens and writes its answer one token at a time, each one predicted from everything before it.
That is the whole trick. The model predicts the next token, appends it, and predicts again. Do that fast enough and it reads back like fluent writing. There is no database lookup and no internet call. It is the frozen weights doing math on your input, which is exactly why it can run entirely offline on your own hardware.
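The predict, append, repeat loop can be sketched in a few lines. This is a toy, not how a real model is built: the `TOY_WEIGHTS` table, the function names, and the tiny vocabulary are all made up for illustration, and this version only looks at the last token, where a real model uses everything before it.

```python
import random

# Toy "weights": for each token, which tokens may follow and how likely each is.
# A real model computes these probabilities from billions of weights.
TOY_WEIGHTS = {
    "the": {"model": 0.6, "mac": 0.4},
    "model": {"predicts": 1.0},
    "predicts": {"the": 0.7, "tokens": 0.3},
    "mac": {"runs": 1.0},
    "runs": {"the": 1.0},
    "tokens": {"<end>": 1.0},
}

def predict_next(tokens, rng):
    """Pick the next token from the distribution for the last token seen."""
    options = TOY_WEIGHTS.get(tokens[-1], {"<end>": 1.0})
    names = list(options)
    return rng.choices(names, weights=[options[n] for n in names], k=1)[0]

def generate(prompt_tokens, max_new_tokens=10, seed=0):
    """Predict a token, append it, predict again, until <end> or the limit."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = predict_next(tokens, rng)
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return tokens

print(" ".join(generate(["the"])))
```

Nothing in the loop touches a network or a database. The table stands in for the frozen weights, and the output is whatever the arithmetic produces.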
Why memory is the whole game
To run a model, its weights have to sit in fast memory the processor can reach. This is the single fact that decides what your hardware can do.
On a normal PC, that means VRAM on a graphics card, and even top-end consumer GPUs top out at 24 to 32 GB, which is not much. Apple Silicon changed the math with unified memory: the CPU and GPU share one big pool of fast RAM. A Mac with 64 GB of unified memory can hand most of that to a model. That is why a Mac punches so far above a gaming PC for this specific job, and why “how much RAM” is the first question we ask.
But capacity is only half the story. The other half is memory bandwidth — how fast the chip can move those weights. Generating each token means reading through the model’s weights once, so speed is roughly bandwidth divided by model size. This is why two Macs with the same RAM can run the same model at very different speeds:
- Memory (GB) decides what fits.
- Bandwidth (GB/s) decides how fast it runs.
Keep those two separate in your head and the rest of this makes sense.
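As a rough sketch of the bandwidth rule, here is the arithmetic for two chips that both have room for the same model. The bandwidth figures (273 and 546 GB/s) and the ~42 GB size for a 70B model at 4-bit come from later in this post. The function name is ours, and the result is a ceiling, not a benchmark.

```python
def tokens_per_second_ceiling(bandwidth_gb_s, model_size_gb):
    """Rough upper bound: each generated token needs one pass over the weights."""
    return bandwidth_gb_s / model_size_gb

# Two chips that can both hold a ~42 GB model, at different bandwidths.
for chip, bandwidth in [("M4 Pro", 273), ("M4 Max", 546)]:
    ceiling = tokens_per_second_ceiling(bandwidth, 42)
    print(f"{chip}: about {ceiling:.1f} tokens/s at most")
```

By this rule the M4 Pro tops out near 6.5 tokens a second and the M4 Max near 13. Real throughput lands somewhat below the ceiling, because the chip never sustains its full rated bandwidth.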
Quantization: the trick that makes it fit
Here is the problem. A 70B model at its native 16-bit precision is about 140 GB (70 billion weights at 2 bytes each). That does not fit on most Macs, and you would not want to pay for the RAM if it did.
The fix is quantization — storing each weight in fewer bits with almost no loss in quality. Models train in 16-bit but run fine compressed to 4-bit, which is the sweet spot most people use. The rule of thumb after 4-bit quantization is simple:
A little over half a gigabyte of memory per billion parameters.
So:
- An 8B model: about 5 GB. Runs on nearly any current Mac.
- A 70B model: about 40–43 GB. Needs ~48 GB or more.
- The largest open models (100B+): an Ultra-class Mac with 192–256 GB.
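The sizes above fall out of simple arithmetic: parameters times bits per weight, divided by 8 to get bytes. The sketch below counts weights only. The 4.8 bits-per-weight figure is our assumption for a typical "4-bit" file, which also stores scaling factors and keeps a few layers at higher precision, and that is why real files come out larger than a pure 4-bit count.

```python
def model_size_gb(params_billions, bits_per_weight):
    """Weights only: parameters x bits per weight, converted to gigabytes."""
    return params_billions * bits_per_weight / 8

for params in (8, 70):
    full = model_size_gb(params, 16)      # native 16-bit
    pure4 = model_size_gb(params, 4)      # exactly 4 bits per weight
    typical = model_size_gb(params, 4.8)  # assumed effective bits in practice
    print(f"{params}B: {full:.0f} GB at 16-bit, {pure4:.0f} GB at pure 4-bit, "
          f"about {typical:.0f} GB in practice")
```

Context (the running conversation) needs memory on top of the weights, which is part of why a ~42 GB model wants a machine with 48 GB or more.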
Quantization is why local LLMs went from “data-center only” to “runs on the Mac in the corner.” You give up a sliver of quality and get a model that fits in a fraction of the memory.
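To see where that sliver of quality goes, here is a minimal sketch of 4-bit quantization on one small block of weights. It is a simplified scheme with a single shared scale per block. The formats used by real tools are more elaborate, and the function names and sample values here are made up for illustration.

```python
def quantize_4bit(weights):
    """Map floats to integers in [-8, 7] using one shared scale for the block."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # fall back to 1.0 if all zero
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]

block = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02]
codes, scale = quantize_4bit(block)
restored = dequantize(codes, scale)
worst = max(abs(a - b) for a, b in zip(block, restored))

print(codes)
print(f"largest rounding error: {worst:.3f} (scale {scale:.3f})")
```

Each weight now takes 4 bits instead of 16, and the rounding error on any one weight is at most half the scale. Spread across billions of weights, that small error is the quality you trade for a file a fraction of the size.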
The software that runs it
You do not write any of this yourself. Three tools do the work, and all of them install in minutes:
- Ollama: the easiest start. One command (`ollama run llama3`) downloads a model and drops you into a chat. It also exposes a local API, so your own apps and automations can call the model exactly like they would call a cloud one. This is what most of our business deployments are built on.
- LM Studio: a friendly desktop app with a model browser and a chat window. Best if you want to click instead of type, or test a few models before committing.
- MLX: Apple’s own machine-learning framework, tuned for Apple Silicon. Ollama and LM Studio sit on top of `llama.cpp` with Apple’s Metal backend; MLX is the lower-level option that squeezes out another 20–30% of speed. When throughput matters, MLX is the one to reach for.
Start with Ollama. If you outgrow it, the others are there.
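For a sense of what calling the local API looks like, here is a minimal sketch using only Python's standard library. It assumes Ollama is running on its default port (11434) and that the `llama3` model has already been downloaded. The helper names `build_request` and `ask` are ours, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(prompt, model="llama3"):
    """Build the HTTP request for a single, non-streamed completion."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, model="llama3", timeout=120):
    """Send the prompt to the local Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# Example (needs Ollama running and the model already pulled):
# print(ask("Summarize in one sentence: our office closes at 5 pm on Fridays."))
```

The request goes to `localhost`, so the prompt and the answer never leave the machine.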
What today’s Mac hardware runs
Here is the part people actually want: which Mac runs what. Memory sets what fits; bandwidth sets the speed. The table reflects both (speeds are for a 70B model at 4-bit, the hardest common workload — smaller models run several times faster).
| Mac | Unified memory | Bandwidth | Comfortably runs | 70B speed | Sweet spot |
|---|---|---|---|---|---|
| Mac mini (M4) | 16–32 GB | ~120 GB/s | up to ~14B | — | A snappy 8B assistant, classification, single-desk RAG |
| Mac mini (M4 Pro) | up to 64 GB | ~273 GB/s | up to 70B | ~6 tok/s at best | Fast on 8–32B; a 70B fits and works, just slower |
| MacBook Pro (M4 Max) | up to 128 GB | ~546 GB/s | 70B + long context | ~13 tok/s at best | The 70B workhorse, portable |
| Mac Studio (M4 Max) | 128 GB | ~546 GB/s | 70B fast, big context | ~13 tok/s at best | Always-on office model for a team |
| Mac Studio (M3 Ultra) | 256 GB | ~819 GB/s | The largest open models, or several at once | fastest | Heavy or multi-model workloads |
A few things worth pulling out of that table:
- A ~$600 Mac mini is a genuinely useful AI box. An 8B model on a base mini is fast and handles classification, tagging, extraction, and a private chat assistant without breaking a sweat.
- Capacity and speed are different upgrades. A 64 GB M4 Pro mini fits a 70B model but runs it at single-digit speeds, because its bandwidth is half an M4 Max’s. If you want a 70B at usable speed, you are buying bandwidth (an M4 Max), not just RAM.
- The M3 Ultra is the only one that touches the very largest open models, and it is overkill for most businesses. Most teams are well served by an M4 Max with 128 GB.
How fast is “fast enough”
Speed is measured in tokens per second. For reference, comfortable human reading is about 5 tokens a second, so anything above that already feels live.
- An 8B model on most current Macs: dozens of tokens a second. Instant.
- A 70B model on an M4 Max: roughly 13 tokens a second at best, even with MLX, since 546 GB/s spread over ~42 GB of weights caps it there. Faster than you read, fine for interactive use.
- The same 70B on a lower-bandwidth chip: single digits. Usable for batch jobs, sluggish for live chat.
For bulk work — reading a pile of documents overnight, tagging records, summarizing an archive — even single-digit speeds are fine, because nobody is waiting at a keyboard. For an interactive assistant, you want an M4 Max. Match the chip to how you will actually use it.
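A quick way to check whether a bulk job finishes overnight is to divide the tokens you need generated by the speed you expect. The sketch below does that. The job size and both speeds are hypothetical, and it ignores the time spent reading each prompt, so treat the result as a floor.

```python
def batch_hours(num_items, output_tokens_each, tokens_per_second):
    """Hours to generate the outputs for a batch, ignoring prompt-reading time."""
    total_tokens = num_items * output_tokens_each
    return total_tokens / tokens_per_second / 3600

# Hypothetical job: 5,000 records, about 40 generated tokens per record.
for label, speed in [("lower-bandwidth chip", 6), ("higher-bandwidth chip", 12)]:
    hours = batch_hours(5000, 40, speed)
    print(f"{label} at {speed} tok/s: {hours:.1f} hours")
```

At 6 tokens a second this job takes a little over nine hours, which is a comfortable overnight run. Doubling the speed halves the wait, but nobody is watching either way.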
What you can actually do with it
A local model on a Mac is not a science project. It does real work:
- Private RAG — point it at your own folders, intranet, or knowledge base and ask questions grounded in your documents, with nothing sent to any API. (We break down how that pipeline is built in the companion post.)
- Bulk classification and extraction — tag tickets, route emails, pull fields from thousands of documents at fixed cost.
- Drafting — first-pass replies, summaries, and internal docs, kept entirely in-house.
- A private assistant — a ChatGPT-style helper that never sends a word of company data outside the building.
For a healthcare practice, a law firm, or a finance shop, that last point is the whole reason to run local: real client data, real AI, and no third-party API in the chain.
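To make the private RAG idea concrete, here is a deliberately crude sketch of the retrieval step: find the notes most relevant to a question and put them in front of the model. It ranks by shared words, which is our stand-in for the embeddings and vector index a production pipeline would use. The sample notes and function names are invented for illustration.

```python
import re

def words(text):
    """Lowercase word set, a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=2):
    """Rank documents by how many words they share with the question."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents):
    """Put the retrieved passages in front of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents))
    return f"Answer using only these notes:\n{context}\n\nQuestion: {question}"

docs = [
    "The VPN client must be updated every quarter.",
    "Office hours are 9 to 5, Monday through Friday.",
    "Backups run nightly at 2 am to the on-site NAS.",
]
print(build_prompt("When do backups run?", docs))
```

The resulting prompt is what you would hand to the local model. Every step, from the documents to the answer, stays on the machine.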
What to buy
A short, honest decision guide:
- Just exploring, or want a private assistant + light classification? A Mac mini (M4, 24–32 GB). Cheapest real entry point, runs 8–14B models fast.
- A small team that wants a shared 70B at good speed? A Mac Studio (M4 Max, 128 GB). The best value for serious local AI, and the one we deploy most.
- Need it portable? A MacBook Pro (M4 Max, up to 128 GB) does everything the Studio does, on the road.
- Running the largest open models, or several at once? A Mac Studio (M3 Ultra, 256 GB). Only if you know you need it.
Buy for the model size and speed you will actually use, not the spec sheet. Most businesses overbuy on parameters and underbuy on bandwidth.
The point
A local LLM is a frozen file of weights doing math on your input. Memory decides what fits, bandwidth decides how fast, quantization makes it small enough to run, and a tool like Ollama makes it a ten-minute setup. On today’s Macs that adds up to real, private AI on hardware you own — an 8B assistant on a mini, a 70B workhorse on an M4 Max Studio.
If you want help figuring out what to run, what to buy, and how to wire it into your actual workflow, that is the conversation we have on a free 30-minute assessment. We build local AI, private RAG, and custom workflows for businesses across NY and NJ, and we will tell you honestly whether local is the right call for your data or whether the cloud earns its keep.
Written by Jason, founder of Sage Solutions. 20+ years in NY/NJ IT and low-voltage, Certified Ethical Hacker (CEH), and ex-FDNY.