Orthrus-Qwen3 generates up to 7.8x tokens per forward pass with lossless output

Original: Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution

Why This Matters

An LLM inference optimization that speeds up generation without changing the model's output distribution.

The Orthrus project on GitHub reports up to 7.8x tokens per forward pass for Qwen3 language models while maintaining an identical output distribution. Its dual-view diffusion decoding approach combines autoregressive accuracy with parallel generation speed across 1.7B, 4B, and 8B parameter models.

Orthrus is a dual-architecture framework that combines the exact generation fidelity of autoregressive large language models with the high-speed parallel token generation of diffusion models. It ships three variants built on Qwen3 backbones, with reported speedups of 4.25x for Orthrus-Qwen3-1.7B, 5.20x for Orthrus-Qwen3-4B, and 5.36x for Orthrus-Qwen3-8B. All three are described as strictly lossless: parallel token generation is memory-efficient while preserving the exact output distribution of the original models. The models are available on HuggingFace, and installation requires flash-attention and other specific dependencies.
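To illustrate why proposing several tokens at once need not change the output, here is a toy sketch of a generic draft-and-verify loop under greedy decoding. It is not taken from the Orthrus repository and does not model its dual-view diffusion decoding. The names `toy_next`, `toy_propose`, and `draft_and_verify` are hypothetical stand-ins for a reference autoregressive model, a parallel drafter, and a verification loop.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextToken = Callable[[Sequence[Token]], Token]        # greedy next token of the reference model
Propose = Callable[[Sequence[Token], int], List[Token]]  # k draft tokens guessed at once


def greedy_decode(next_token: NextToken, prompt: Sequence[Token], n_new: int) -> List[Token]:
    """Plain autoregressive decoding: one reference call per new token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out


def draft_and_verify(next_token: NextToken, propose: Propose,
                     prompt: Sequence[Token], n_new: int, k: int) -> Tuple[List[Token], int]:
    """Emit only reference tokens; the draft decides how many are emitted per pass."""
    out = list(prompt)
    target_len = len(prompt) + n_new
    passes = 0
    while len(out) < target_len:
        draft = propose(out, k)
        passes += 1  # assumption: a real model would score all k positions in one forward pass
        for tok in draft:
            if len(out) >= target_len:
                break
            ref = next_token(out)  # emulates reading the reference prediction at this position
            out.append(ref)        # the emitted token is always the reference token
            if ref != tok:
                break              # first mismatch: discard the rest of the draft
    return out, passes


def toy_next(prefix: Sequence[Token]) -> Token:
    """Hypothetical deterministic 'model' over a 5-token vocabulary."""
    return (prefix[-1] + len(prefix)) % 5


def toy_propose(prefix: Sequence[Token], k: int) -> List[Token]:
    """Hypothetical drafter: matches toy_next except at every fourth position."""
    cur, draft = list(prefix), []
    for _ in range(k):
        tok = toy_next(cur)
        if len(cur) % 4 == 0:
            tok = (tok + 1) % 5  # deliberate drafting error
        draft.append(tok)
        cur.append(tok)
    return draft


reference = greedy_decode(toy_next, [1], 12)
parallel, passes = draft_and_verify(toy_next, toy_propose, [1], 12, k=4)
print("identical output:", parallel == reference)
print("tokens per verification pass:", 12 / passes)
```

Because every emitted token is the reference model's own greedy choice, the output matches plain autoregressive decoding by construction. The drafter only affects how many tokens each verification pass yields, which is the quantity a tokens-per-forward figure measures.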

Source

github.com