Mixture-of-Experts (MoE) has become a popular technique for scaling large language models (LLMs) without a proportional increase in computational cost. Instead of using the entire model's capacity for every input, an MoE layer activates only a small subset of expert subnetworks for each token, chosen by a learned gating (router) network, so total parameter count can grow while the compute per token stays roughly constant.
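
To make the idea concrete, here is a minimal sketch of a top-k routed MoE layer in PyTorch. It is illustrative only: the class names, hyperparameters, and the simple per-expert dispatch loop are assumptions for clarity, not the implementation of any particular model, and common training details such as load-balancing losses and capacity limits are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A small feed-forward expert network (hypothetical sizes)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ff(x)


class TopKMoE(nn.Module):
    """Routes each token to its top-k experts and mixes their outputs,
    weighted by the router's softmax scores over the selected experts."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList(
            [Expert(d_model, d_hidden) for _ in range(num_experts)]
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten tokens for routing
        batch, seq_len, d_model = x.shape
        tokens = x.reshape(-1, d_model)

        # Router scores and top-k expert selection per token
        logits = self.router(tokens)                       # (num_tokens, num_experts)
        topk_scores, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)           # renormalize over chosen experts

        out = torch.zeros_like(tokens)
        # Dispatch: each expert processes only the tokens routed to it
        for e, expert in enumerate(self.experts):
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # no tokens chose this expert in this batch
            out[token_ids] += (
                weights[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
            )

        return out.reshape(batch, seq_len, d_model)


# Usage sketch: only k experts run per token, regardless of num_experts.
layer = TopKMoE(d_model=512, d_hidden=2048, num_experts=8, k=2)
y = layer(torch.randn(2, 16, 512))  # -> (2, 16, 512)
```

The key point the sketch shows is the decoupling: adding more experts grows the parameter count, but each token still passes through only `k` of them, so the per-token FLOPs are governed by `k` rather than by the total number of experts.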