26M Parameter 'Needle' Model Distills Gemini Tool Calling
Original: Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model
Why This Matters
Demonstrates successful knowledge distillation from large models to efficient edge-deployable versions
Cactus Compute has released Needle, a 26-million-parameter model that distills the tool-calling capabilities of Google's Gemini into a compact form. On the Cactus platform, the model reaches a prefill speed of 6,000 tokens/second and a decode speed of 1,200 tokens/second.
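The post does not show Needle's actual input/output schema, but tool-calling distillation generally means training the small model to reproduce the structured function calls a large teacher emits. The sketch below is a generic illustration of what one such training pair could look like; every field name (tools, user, target, tool_call) is an assumption for illustration, not Cactus's format.

# Hypothetical distillation example: the teacher (Gemini) maps a user request
# plus tool definitions to a structured call; the 26M student learns to emit it.
example = {
    "tools": [{
        "name": "get_weather",                      # illustrative tool, not from the post
        "description": "Get current weather for a city.",
        "parameters": {"city": {"type": "string"}},
    }],
    "user": "Is it raining in Lagos right now?",
    # Teacher-generated target the student is trained to reproduce verbatim:
    "target": {"tool_call": {"name": "get_weather", "arguments": {"city": "Lagos"}}},
}
print(example["target"])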
The open-source Needle model uses a Simple Attention Network architecture with a 512-dimensional hidden state, 8 attention heads with 4 key-value heads, and an 8,192-token BPE vocabulary. It has 8 decoder layers, each built from ZCRMSNorm, masked self-attention with RoPE, and gated residuals, plus 12 encoder layers (see the architecture sketch below). The model can be fine-tuned locally on Mac or PC hardware.

All weights and dataset-generation code are publicly available in the Cactus-Compute GitHub repository under the MIT license. The project demonstrates how capabilities of large language models, such as tool calling, can be compressed into far smaller, more efficient models while preserving functionality.
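To make the listed specs concrete, here is a minimal PyTorch sketch of one decoder block matching the numbers above: 512 dimensions, 8 query heads with 4 key-value heads (read here as grouped-query attention), RoPE, masked self-attention, and gated residuals. ZCRMSNorm is not described in the post; it is approximated as a zero-centered RMSNorm, which is an assumption, and the exact gating form and names like NeedleBlock are likewise illustrative rather than Cactus's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, N_HEADS, N_KV_HEADS = 512, 8, 4
HEAD_DIM = DIM // N_HEADS  # 64

class ZCRMSNorm(nn.Module):
    """Assumed zero-centered RMSNorm: subtract the mean, then RMS-normalize."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        x = x - x.mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

def rope(x):
    """Rotary position embedding over (batch, heads, seq, head_dim)."""
    seq, hd = x.shape[-2], x.shape[-1]
    pos = torch.arange(seq, device=x.device).float()
    freqs = 1.0 / (10000 ** (torch.arange(0, hd, 2, device=x.device).float() / hd))
    ang = pos[:, None] * freqs[None, :]          # (seq, hd/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class NeedleBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm1 = ZCRMSNorm(DIM)
        self.wq = nn.Linear(DIM, N_HEADS * HEAD_DIM, bias=False)
        self.wk = nn.Linear(DIM, N_KV_HEADS * HEAD_DIM, bias=False)
        self.wv = nn.Linear(DIM, N_KV_HEADS * HEAD_DIM, bias=False)
        self.wo = nn.Linear(DIM, DIM, bias=False)
        self.gate_attn = nn.Linear(DIM, DIM, bias=False)  # gated residual (assumed form)
        self.norm2 = ZCRMSNorm(DIM)
        self.mlp = nn.Sequential(nn.Linear(DIM, 4 * DIM), nn.SiLU(), nn.Linear(4 * DIM, DIM))
        self.gate_mlp = nn.Linear(DIM, DIM, bias=False)

    def forward(self, x):
        b, s, _ = x.shape
        h = self.norm1(x)
        q = self.wq(h).view(b, s, N_HEADS, HEAD_DIM).transpose(1, 2)
        k = self.wk(h).view(b, s, N_KV_HEADS, HEAD_DIM).transpose(1, 2)
        v = self.wv(h).view(b, s, N_KV_HEADS, HEAD_DIM).transpose(1, 2)
        q, k = rope(q), rope(k)
        # Expand 4 KV heads to match 8 query heads (grouped-query attention).
        k = k.repeat_interleave(N_HEADS // N_KV_HEADS, dim=1)
        v = v.repeat_interleave(N_HEADS // N_KV_HEADS, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # masked self-attention
        attn = self.wo(attn.transpose(1, 2).reshape(b, s, DIM))
        x = x + torch.sigmoid(self.gate_attn(x)) * attn   # gated residual (assumed)
        h = self.norm2(x)
        x = x + torch.sigmoid(self.gate_mlp(x)) * self.mlp(h)
        return x

x = torch.randn(1, 16, DIM)
print(NeedleBlock()(x).shape)  # torch.Size([1, 16, 512])

At 8 such decoder layers over a 512-dimensional state and an 8,192-token vocabulary, a budget in the tens of millions of parameters is plausible, consistent with the 26M figure; the authoritative architecture is in the Cactus-Compute GitHub repository.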