26M Parameter 'Needle' Model Distills Gemini Tool Calling

Original: Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model

Why This Matters

Demonstrates that tool-calling knowledge from large models can be successfully distilled into efficient, edge-deployable versions

Cactus Compute has released Needle, a 26-million-parameter model that distills Google's Gemini tool-calling capability into a compact form. On the Cactus platform, the model reaches a prefill speed of 6,000 tokens/second and a decode speed of 1,200 tokens/second.
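
For a sense of scale, here is a back-of-the-envelope latency calculation from those throughput figures. The prompt and output lengths are hypothetical, chosen purely for illustration:

```python
# Rough end-to-end latency from the stated throughput figures:
# 6,000 tok/s prefill and 1,200 tok/s decode on the Cactus platform.
PREFILL_TPS = 6000   # tokens/second while processing the prompt
DECODE_TPS = 1200    # tokens/second while generating output

# Hypothetical request: a tool schema + query, answered with a short tool call.
prompt_tokens, output_tokens = 600, 60
latency = prompt_tokens / PREFILL_TPS + output_tokens / DECODE_TPS
print(f"{latency:.3f} s")  # 0.100 s prefill + 0.050 s decode = 0.150 s
```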

The open-source Needle model uses a Simple Attention Network architecture with a 512-dimensional hidden state, 8 attention heads sharing 4 key-value heads (grouped-query attention), and an 8,192-token BPE vocabulary. It pairs 12 encoder layers with 8 decoder layers that use ZCRMSNorm, masked self-attention with RoPE, and gated residuals; a sketch of the attention block follows below. The model can be fine-tuned locally on a Mac or PC. All weights and the dataset-generation code are publicly available in the Cactus-Compute GitHub repository under the MIT license. The project demonstrates how large-language-model capabilities such as tool calling can be compressed into significantly smaller, more efficient models while retaining functionality.
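
As a rough illustration of those attention hyperparameters, here is a minimal PyTorch sketch of a causal grouped-query attention block with RoPE at the stated sizes (512 dimensions, 8 query heads, 4 key-value heads). The module and function names are assumptions, not Needle's actual implementation, and ZCRMSNorm and the gated residuals are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, N_HEADS, N_KV_HEADS = 512, 8, 4  # hyperparameters from the summary
HEAD_DIM = DIM // N_HEADS             # 64

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to (batch, heads, seq, head_dim)."""
    _, _, t, d = x.shape
    pos = torch.arange(t, dtype=torch.float32)
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = torch.outer(pos, freqs)          # (t, d/2) rotation angles
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]       # rotate even/odd channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class GroupedQueryAttention(nn.Module):
    """8 query heads share 4 key-value heads; causal masking as in a decoder."""
    def __init__(self):
        super().__init__()
        self.q = nn.Linear(DIM, N_HEADS * HEAD_DIM, bias=False)
        self.k = nn.Linear(DIM, N_KV_HEADS * HEAD_DIM, bias=False)
        self.v = nn.Linear(DIM, N_KV_HEADS * HEAD_DIM, bias=False)
        self.o = nn.Linear(N_HEADS * HEAD_DIM, DIM, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q(x).view(b, t, N_HEADS, HEAD_DIM).transpose(1, 2)
        k = self.k(x).view(b, t, N_KV_HEADS, HEAD_DIM).transpose(1, 2)
        v = self.v(x).view(b, t, N_KV_HEADS, HEAD_DIM).transpose(1, 2)
        q, k = rope(q), rope(k)
        # Duplicate each KV head so every group of 2 query heads shares one.
        k = k.repeat_interleave(N_HEADS // N_KV_HEADS, dim=1)
        v = v.repeat_interleave(N_HEADS // N_KV_HEADS, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o(out.transpose(1, 2).reshape(b, t, DIM))

x = torch.randn(1, 16, DIM)
print(GroupedQueryAttention()(x).shape)  # torch.Size([1, 16, 512])
```

Sharing 4 key-value heads across 8 query heads halves the KV-cache footprint relative to full multi-head attention, which matters at this model's target of fast on-device decoding.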

Source

github.com