How to Run vLLM on Apple M4 Mac Mini
vLLM is a powerful inference engine for LLMs, designed to serve models faster and more efficiently. But can it run locally on a Mac?
That’s exactly what I asked last week during the vLLM meetup at Red Hat.
The answer I got was: well, vLLM is designed with scalability in mind, for large-scale LLM inference and serving. It’s not really built for local use.
But my question was:
If it can’t even run on my machine, how will it work in production?
I got curious and started experimenting on my M4 Mac Mini, which has a 10-core GPU and 21 GB of unified memory.
Unfortunately, vLLM doesn’t support Apple Silicon GPUs at the moment.
But fortunately, there’s experimental support for Apple Silicon CPUs.
Right now, you need to build vLLM from source to run it natively on macOS.
So I did exactly that, and here’s what I found: even running on just the CPU, vLLM is a super fast and memory-efficient way to serve large language models like LLaMA, OPT, etc. And the good news: vLLM now has experimental support for macOS with Apple silicon (like the M4 chip)!
In this guide, I’ll walk you through how to install and run vLLM natively on your M4 Mac Mini.
No GPU required: your CPU is enough!
Let’s get started.
What is vLLM?
vLLM is a powerful inference engine for LLMs (Large Language Models). It’s designed to serve models faster and more efficiently.
Benefits of vLLM:
High throughput with continuous batching
Efficient memory usage for long conversations (via PagedAttention)
Compatible with the OpenAI API, so you can plug it into your existing tools
Supports distributed inference for large-scale deployments (Linux only)
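To make this concrete, here’s a minimal sketch of vLLM’s offline Python API (LLM and SamplingParams). You can run it once the installation below is done; facebook/opt-125m is just a small model that’s convenient for a CPU test.
# Minimal offline-inference sketch using vLLM's Python API.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

# facebook/opt-125m is tiny, so it loads quickly even on CPU.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)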
Prerequisites
On macOS, vLLM will automatically run on CPU (Apple silicon support is still experimental).
vLLM currently supports FP32 and FP16 data types on CPU.
Make sure you're using a Python version between 3.9 and 3.12.
Your Mac might ship with Python 3.13, but vLLM won’t work with it yet.
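If you’re not sure which version your current interpreter is, here’s a quick check (just an illustrative snippet, not part of the vLLM setup):
import sys

# vLLM on CPU currently needs Python 3.9-3.12, so 3.13 will fail.
print(sys.version_info[:3])
assert (3, 9) <= sys.version_info[:2] <= (3, 12), "Unsupported Python version for vLLM"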
Step-by-Step Installation
1. Install Anaconda
We’ll use Anaconda to manage Python environments easily.
brew install --cask anaconda
Initialize Conda for your terminal:
/opt/homebrew/anaconda3/bin/conda init zsh
Restart your terminal.
2. Create a Python 3.12 Environment
Your Mac might have Python 3.13 installed, but vLLM won’t work with it, so create a dedicated Python 3.12 environment:
conda create -n vllm-env python=3.12
conda activate vllm-env
3. Clone the vLLM Repo
There are no pre-built wheels or images for macOS, so you must build vLLM from source.
git clone https://github.com/vllm-project/vllm.git
cd vllm
4. Install CPU Requirements
pip install -r requirements/cpu.txt
pip install -e .
Check vLLM version:
vllm --version
You should see something like:
Automatically detected platform: cpu
0.10.1.dev118+g18cc33dd6
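You can also confirm the install from Python itself; this optional sanity check should print the same version string:
import vllm

# Should match the version reported by `vllm --version`.
print(vllm.__version__)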
Setting Up a Chat Template
Starting from transformers version 4.44, if the model you're using doesn’t include a built-in chat template, you need to supply one.
Here’s a simple Jinja2 template for OPT models. Save it as opt_chat_template.jinja.
{{- bos_token if bos_token is defined }}
{%- for message in messages %}
{{- '\n' if not loop.first }}
{%- if message['role'] == 'user' -%}
User: {{ message['content'] }}
{%- elif message['role'] == 'assistant' -%}
Assistant: {{ message['content'] }}
{%- endif %}
{%- endfor %}
{{- eos_token if eos_token is defined }}
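If you want to sanity-check the template before serving, you can render it offline with the transformers tokenizer (a quick sketch; it assumes transformers is available in the environment, which it is as a vLLM dependency):
# Render the template offline to see exactly what the model will receive.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
with open("opt_chat_template.jinja") as f:
    template = f.read()

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi, how can I help?"},
]

# Pass the template explicitly, since OPT ships without a built-in one.
print(tokenizer.apply_chat_template(messages, chat_template=template, tokenize=False))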
Serve Your Model
Before launching, set these environment variables:
export VLLM_CPU_KVCACHE_SPACE=10
export VLLM_CPU_OMP_THREADS_BIND=auto
VLLM_CPU_KVCACHE_SPACE: Reserves memory for the intermediate states (KV cache) during inference. The value is in GiB, so 10 means 10 GiB; a larger value gives more room for longer contexts or concurrent requests.
VLLM_CPU_OMP_THREADS_BIND: Controls how threads are assigned to CPU cores. Setting it to auto lets vLLM/OpenMP choose the best way to distribute threads across cores.
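If you later drive vLLM from a Python script instead of the CLI, the same knobs can be set in the process environment before the engine is created (a small sketch, not required for the vllm serve command below):
import os

# Same settings as the exports above; set them before constructing the LLM/engine.
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "10"       # KV cache size in GiB
os.environ["VLLM_CPU_OMP_THREADS_BIND"] = "auto"  # let vLLM/OpenMP place threads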
Now start the server:
vllm serve facebook/opt-125m --dtype=bfloat16 --chat-template ./opt_chat_template.jinja
The server will start and listen on:
http://localhost:8000/v1
You can connect to it using any OpenAI-compatible tool (like OpenWebUI or curl).
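For example, here’s how you could query it with the official openai Python client (a minimal sketch; the api_key value is just a placeholder, since vLLM doesn’t check it by default):
from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(response.choices[0].message.content)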
Performance on M4 Mac Mini
Max Performance Core (P-CPU) usage: ~82%
Prompt Throughput: ~140 tokens/sec
Generation Throughput: ~83 tokens/sec
KV Cache Usage: ~2.9%
Not bad for a compact Mac Mini running on CPU only!
Final Thoughts
Even though vLLM on macOS is still in the early stages, it’s already fast and reliable enough for local testing, prototyping, and personal use. If you're into LLMs and want something OpenAI-compatible that runs locally without a GPU, this is a great option.
Source: LinkedIn Post