GOOGLE MADE GEMMA 4 THREE TIMES FASTER — THEN HID THE BEST PART
On May 5, 2026, Google released Multi-Token Prediction (MTP) drafters for Gemma 4: open weights, Apache 2.0, up to 3x speedup. But developers quickly noticed that the native MTP heads had been quietly stripped from the public model. Here is what that means for the next phase of the AI inference economy.
In this episode:
How multi-token prediction and speculative decoding actually work
Why Google trained Gemma 4 with MTP heads, then removed them before release
The benchmarks that hold up — and the ones that quietly don't (looking at you, 26B MoE at batch 1)
The $255B inference market that's eating frontier-model hype alive
What Gartner, Deloitte, and IDC are saying about the cost collapse
The bear case nobody on launch day mentioned
TIMESTAMPS:
0:00 — Intro
0:25 — The Setup: What Google Actually Released
1:30 — Speculative Decoding Explained
2:50 — The Real Benchmark Numbers
3:40 — But Here's The Part Nobody Tells You
5:10 — The Open Source Two-Step
6:00 — Three Numbers That Reshape Everything
7:20 — Edge Wins, Cloud Margin Compresses
8:30 — The Bear Case
9:15 — What To Watch
SOURCES:
Google Blog announcement: https://blog.google/innovation-and-ai...
Google AI for Developers MTP docs: https://ai.google.dev/gemma/docs/mtp/...
The Decoder coverage: https://the-decoder.com/google-speeds...
Gartner inference cost forecast (Mar 2026): https://www.gartner.com/en/newsroom/p...
Deloitte 2026 compute predictions: https://www.deloitte.com/us/en/insigh...
Apple Mirror Speculative Decoding: https://machinelearning.apple.com/res...
FlowHunt on missing MTP heads: https://www.flowhunt.io/blog/gemma-4-...
---
The Grift Podcast — Forbidden Knowledge Unlocked
New episodes every week.
#Gemma4 #GoogleAI #LLM #SpeculativeDecoding #AIInference #MachineLearning #OpenSource #EdgeAI #TheGriftPodcast