Study finds transformers can reduce memory usage with shared QKV projections
Original: Do transformers need three projections? Systematic study of QKV variants
Why This Matters
The study points to a practical memory optimization for deploying large language models on resource-constrained devices.
The research examines whether transformer models need separate query, key, and value projections. The study tests three sharing variants across vision and language tasks and finds that Q-K=V sharing (one projection used for both keys and values) achieves a 50% KV cache reduction with only a 3.1% perplexity degradation in language modeling.
Researchers systematically evaluated projection sharing in transformer attention mechanisms, testing three variants: Q-K=V (shared key-value), Q=K-V (shared query-key), and Q=K=V (single projection). Experiments spanned synthetic tasks, computer vision datasets, and language modeling with 300M- and 1.2B-parameter models on 10B tokens. In language modeling, the Q-K=V variant achieved a 50% KV cache reduction with only a 3.1% perplexity degradation. Combined with grouped query attention (GQA-4), the cache reduction reached 87.5%; combined with multi-query attention (MQA), it reached 96.9%. The study found that keys and values can occupy similar representational spaces, which allows projection sharing to preserve model quality while cutting memory for inference on edge devices.
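The cache reduction figures above follow from simple counting, and the short Python sketch below reproduces them. It is an illustration, not code from the study. It assumes a baseline of standard multi-head attention that caches separate K and V tensors for every head, and it assumes 16 attention heads, a value inferred here because it makes the 87.5% and 96.9% figures come out exactly; the summary does not state the head count. The function names are hypothetical.

```python
def kv_cache_fraction(n_heads: int, n_kv_heads: int, share_kv: bool) -> float:
    """Fraction of the baseline KV cache that remains.

    Baseline: multi-head attention caching separate K and V for all n_heads.
    n_kv_heads: number of key/value heads actually cached (GQA/MQA use fewer).
    share_kv:   True for the Q-K=V variant, which caches one tensor instead of two.
    """
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be divisible by n_kv_heads")
    tensors_cached = 1 if share_kv else 2
    return (n_kv_heads * tensors_cached) / (n_heads * 2)


def reduction_percent(n_heads: int, n_kv_heads: int, share_kv: bool) -> float:
    """KV cache reduction relative to the baseline, in percent."""
    return 100.0 * (1.0 - kv_cache_fraction(n_heads, n_kv_heads, share_kv))


# Assumption: 16 attention heads (not stated in the summary).
N_HEADS = 16

configs = [
    ("Q-K=V alone", 16),         # every head keeps its own shared K=V tensor
    ("Q-K=V + GQA-4", 4),        # 4 KV heads, each serving 4 query heads
    ("Q-K=V + MQA", 1),          # a single KV head serves all query heads
]

for label, n_kv_heads in configs:
    pct = reduction_percent(N_HEADS, n_kv_heads, share_kv=True)
    print(f"{label}: {pct:.1f}% KV cache reduction")
```

Under these assumptions, sharing K and V halves the cache on its own, and the savings multiply with the head reduction from GQA or MQA: 1/2 × 1/4 leaves 12.5% of the baseline cache, and 1/2 × 1/16 leaves about 3.1%.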