Explains DeepSeek-OCR, a model built on the idea that a page of text rendered as an image can be represented with far fewer vision tokens than the same content takes as text tokens.
Key concepts covered:
Roughly 10x token compression (text tokens per vision token) at about 97% OCR decoding accuracy
Unified VLM architecture: DeepEncoder + DeepSeek-3B-MoE decoder
Staged Encoder: SAM for local details, CLIP for global layout
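Memory solution: 16x token downsampling between the two encoder stages, so global attention runs on a much shorter sequence
MoE decoder: only a subset of experts is active per token, giving large-model capacity at small-model inference cost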
Gundam Mode: Dynamic tiling for ultra-high-resolution images
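
The token arithmetic behind the compression, downsampling, and tiling points above can be sketched in a few lines. This is a back-of-the-envelope illustration, not the model's code: the 16-pixel patch size, the 640-pixel tiles, the 1024-pixel global view, and the function names are all assumptions made for the example. Only the 16x downsampling factor comes from the summary itself.

```python
def vision_tokens(side_px: int, patch_px: int = 16, downsample: int = 16) -> int:
    """Tokens left for a square image after patching and token downsampling.

    patch_px=16 is an assumed patch size; downsample=16 is the 16x factor
    named in the summary.
    """
    patches = (side_px // patch_px) ** 2   # tokens entering the local stage
    return patches // downsample           # tokens reaching global attention


def compression_ratio(text_tokens: int, n_vision_tokens: int) -> float:
    """How many text tokens each vision token stands in for."""
    return text_tokens / n_vision_tokens


def tiled_tokens(n_tiles: int, tile_px: int = 640, global_px: int = 1024) -> int:
    """Token budget for dynamic tiling: n local tiles plus one global view.

    Tile and global-view sizes are assumptions chosen for illustration.
    """
    return n_tiles * vision_tokens(tile_px) + vision_tokens(global_px)


if __name__ == "__main__":
    base = vision_tokens(1024)
    print(base)                            # 4096 patches // 16 = 256
    print(compression_ratio(2560, base))   # 10.0
    print(tiled_tokens(4))                 # 4 * 100 + 256 = 656
```

Under these assumptions, a 1024x1024 page drops from 4096 patch tokens to 256 before global attention, and a page holding 2560 text tokens would sit right at the 10x ratio mentioned above. Tiling scales the budget linearly with the number of tiles rather than quadratically with image side length.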