Moebius: Compact 0.2B inpainting model matches 10B-level performance
Original: Moebius: 0.2B image inpainting model with 10B-level performance
Why This Matters
Demonstrates that efficient, task-specific model design can match large foundation models, which could make high-quality image inpainting practical on consumer and edge devices.
Researchers at Huazhong University of Science and Technology developed Moebius, a 0.2B-parameter image inpainting model that matches the quality of 10B-level models such as FLUX.1-Fill-Dev while using less than 2% of the parameters and running more than 15× faster at inference.
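The "less than 2%" figure can be checked from the parameter counts reported for the two models (0.22 billion for Moebius, 11.9 billion for FLUX.1-Fill-Dev). A minimal arithmetic sketch; the variable names are illustrative only:

```python
# Parameter counts as reported in the summary (in raw parameter units).
moebius_params = 0.22e9      # Moebius: 0.22 billion
flux_fill_params = 11.9e9    # FLUX.1-Fill-Dev: 11.9 billion

# Fraction of the larger model's parameters that Moebius uses.
share = moebius_params / flux_fill_params

# How many times smaller Moebius is.
reduction = flux_fill_params / moebius_params

print(f"parameter share: {share:.2%}")
print(f"size reduction:  {reduction:.1f}x")
```

The share comes out a little under 2%, consistent with the claim above.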
Moebius is a lightweight image inpainting framework designed to address the computational inefficiency of large foundation models. The model uses 0.22 billion parameters compared to FLUX.1-Fill-Dev's 11.9 billion, achieving extreme parametric efficiency while maintaining generation quality.
The key innovation is the Local-λ Mix Interaction (LλMI) block, which restructures the diffusion U-Net backbone by condensing spatial context and global semantic information into fixed-size linear matrices. This preserves complex latent interactions while drastically reducing model size. The framework also incorporates an adaptive multi-granularity distillation strategy that operates in latent space to avoid expensive pixel-space decoding and dynamically balances multiple gradient-based losses for high-fidelity alignment.
Moebius achieves an inference latency of 26.01 milliseconds per step on a single GPU, delivering over 15× total runtime acceleration. Across six comprehensive benchmarks spanning natural scenes (Places2) and portrait scenes (CelebA-HQ, FFHQ), Moebius performs on par with or surpasses state-of-the-art generalist models including FLUX.1-Fill-Dev and Stable Diffusion 3.5 Large-Inpainting, with particular strength in complex textures and facial plausibility.
The research is currently in submission.
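To illustrate what "condensing context into fixed-size linear matrices" can mean in general, here is a minimal NumPy sketch in the style of lambda layers and linear attention. It is not the LλMI block itself, whose internals are not described here; the function name `content_lambda`, the projection matrices, and all dimensions are assumptions chosen for illustration. The point it shows is that the context summary has shape (d_k, d_v) regardless of how many latent positions are fed in.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def content_lambda(x, w_q, w_k, w_v):
    """Hypothetical lambda-style interaction (illustrative, not the paper's block).

    x: (n, d_model) flattened latent positions.
    Returns the per-position output (n, d_v) and the context matrix (d_k, d_v).
    """
    q = x @ w_q                      # queries, (n, d_k)
    k = softmax(x @ w_k, axis=0)     # keys, normalized over positions, (n, d_k)
    v = x @ w_v                      # values, (n, d_v)
    lam = k.T @ v                    # context summary, (d_k, d_v): independent of n
    return q @ lam, lam              # each position applies the same linear map

rng = np.random.default_rng(0)
d_model, d_k, d_v = 16, 8, 16        # assumed sizes, for illustration only
w_q = rng.normal(size=(d_model, d_k))
w_k = rng.normal(size=(d_model, d_k))
w_v = rng.normal(size=(d_model, d_v))

shapes = {}
for n in (64, 256):                  # two different numbers of latent positions
    x = rng.normal(size=(n, d_model))
    out, lam = content_lambda(x, w_q, w_k, w_v)
    shapes[n] = (out.shape, lam.shape)
    print(n, out.shape, lam.shape)
```

Because the context matrix never grows with the number of positions, the cost of applying it scales linearly in n rather than quadratically, as pairwise attention does. This is the general property such designs rely on; how Moebius combines local and global terms is left to the paper.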