TurboQuant on Mac: More context with 5x LESS MEMORY (Tutorial)

Опубликовано: 16 Май 2026
на канале: Globalcobots AI
1,385
21

TurboQuant compresses the KV cache, not the model weights. In long contexts, this can save a significant amount of memory without impacting performance.

In this video, I'll show you how to install and use it on a Mac, with real-world tests on Apple Silicon. I tested it on a 16GB Mac Mini M4 and a 48GB M3 Max, seeing up to 5x compression of the KV cache.

If you're interested in local AI on Mac, llama.cpp, and how to truly free up context memory, here's how to install, use, and see real-world results.

Resources
Original paper: arxiv.org/abs/2504.19874
Fork llama.cpp with TurboQuant: github.com/TheTom/llama-cpp-turboquant
Qwen3-14B Q4_K_M: huggingface.co/Qwen/Qwen3-14B-GGUF
Qwen3.5 35B A3B UD-Q6_K_XL: huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF

Mac Installation (Apple Silicon)
git clone https://github.com/TheTom/llama-cpp-t...
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)

Launch with TurboQuant
./build/bin/llama-server \
-m ./models/your-model.gguf \
-ctk q8_0 -ctv turbo3 \
-c 131072 -fa on -ngl 99 \
--port 8080

Linux (CUDA) Installation
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

Customized AI Training
https://www.globalcobots.com

Chapters:
00:00 Introduction and Impact of TurboQuant on the Market
00:23 Key Concepts: What is KV Cache (Key and Value)
01:12 Hardware Configuration: Mac M3 Max and 48GB RAM
01:39 Installation Tutorial: CMAKE and Dependencies
02:08 Repository Cloning and Metal Backend Compilation
03:00 Selecting and Downloading Models in Hugging Face
04:08 Running the Model with Turbo Parameters (-ctk and -ctv)
05:16 Benchmark 1: Performance with TurboQuant (32k context)
06:43 Benchmark 2: Comparison without TurboQuant (RAM Consumption)
07:59 Extreme Test: 128k Context Window (Massive Savings)
10:12 Final Conclusions and Real-World Usefulness of the Technique

#TurboQuant #Mac #AppleSilicon #llamacpp #LLM #IALocal #KVCache