Mixture-of-Experts - first diggin'
At Bosch, I have had the chance to explore Machine Learning, specifically LLMs and MoE. In this post I share the content of my first presentation about Mixture-of-Experts (MoE). I take the structure and content mainly from a 2025 survey (Mu & Lin, 2025), with some additional information from a 2024 survey (Cai et al., 2024).
Disclaimer: I am still very much a noob in this field, so I cannot guarantee that everything in this post is accurate. If I find any problems, I will update it.
Why MoE?
AI applications are developing fast; just think of popular names like ChatGPT, Gemini, or DeepSeek. But developing them also comes with problems, two major ones being:
- The computational cost of training and deployment
- Integrating conflicting or heterogeneous knowledge within a single model
This is where MoE comes in: it was proposed to tackle these two problems. You can imagine MoE as a “divide-and-conquer” approach.
MoE components
An MoE structure has two main parts: Experts and a Router.
Router
The Router works as a distributor that routes data to the most suitable experts.
The Gating Function is the mathematical implementation of the Router. A good Gating Function meets 2 criteria:
- Accurately discern characteristics of both input data and experts
- Distribute the load as evenly as possible among the predefined experts
We can categorise Gating Functions into 3 types (a small code sketch of the linear case follows the list):
- Linear Gating: using a `softmax` function
- Non-linear Gating: using cosine similarity to assign experts
- Soft MoE: combining tokens to avoid the token-dropping issue
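To make the linear case concrete, here is a minimal PyTorch sketch of a softmax gating function with Top-K selection. The class name `LinearGate` and all the sizes are my own assumptions for illustration, not something taken from the surveys.

```python
import torch
import torch.nn as nn

class LinearGate(nn.Module):
    """Minimal linear (softmax) gating sketch: score each token,
    keep the Top-K experts, and renormalise their weights."""

    def __init__(self, hidden_dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_experts)  # one score per expert
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_dim)
        logits = self.proj(x)                   # (num_tokens, num_experts)
        probs = torch.softmax(logits, dim=-1)   # routing distribution
        topk_probs, topk_idx = torch.topk(probs, self.k, dim=-1)
        # Renormalise so the kept weights sum to 1 per token
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
        return topk_probs, topk_idx             # weights and expert ids per token

# Example: route 4 tokens of width 16 across 8 experts, keeping the Top-2
gate = LinearGate(hidden_dim=16, num_experts=8, k=2)
weights, indices = gate(torch.randn(4, 16))
```

Renormalising the selected weights so they sum to 1 is a common convention when only the Top-K experts are kept.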
Experts
Experts are small sub-models that each specialise in a particular subset of the data. The expert networks are based on the Transformer (Vaswani et al., 2017) architecture.
There are 3 popular expert network designs (a sketch of the first option follows the list):
- Replace the FFN layer in the Transformer with an MoE layer
  - Suitable for incorporating sparse activation mechanisms
  - The ideal choice for introducing the MoE mechanism
- Apply MoE to the attention module in the Transformer
  - MoA (Wang et al., 2024), Mixture-of-Attention: a gating network dynamically selects the most relevant attention
  - MoH (Jin et al., 2024), Mixture-of-Head attention: considered to have great potential
- Apply MoE to CNN layers
  - Fully leverages CNN’s strengths in local feature extraction
  - Applied mainly in the Computer Vision field
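As an illustration of the first option (replacing the FFN with an MoE layer), here is a hedged PyTorch sketch that reuses the `LinearGate` from the earlier snippet: each expert is an ordinary two-layer FFN, and each token only passes through its Top-K experts. Names and dimensions are illustrative assumptions, not the design of any particular model.

```python
# Assumes the imports and the LinearGate class from the previous snippet.

class ExpertFFN(nn.Module):
    """One expert: the usual two-layer feed-forward block of a Transformer."""

    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, ffn_dim),
            nn.GELU(),
            nn.Linear(ffn_dim, hidden_dim),
        )

    def forward(self, x):
        return self.net(x)


class MoELayer(nn.Module):
    """Sparse MoE layer that stands in for a dense FFN: each token is sent
    to its Top-K experts and the outputs are combined with the gate weights."""

    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = LinearGate(hidden_dim, num_experts, k)
        self.experts = nn.ModuleList(
            [ExpertFFN(hidden_dim, ffn_dim) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_dim)
        weights, indices = self.gate(x)          # both (num_tokens, k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue                         # this expert receives no tokens
            expert_out = expert(x[token_idx])
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert_out
        return out

# Example: 4 tokens through an MoE layer with 8 experts, Top-2 routing
moe = MoELayer(hidden_dim=16, ffn_dim=64, num_experts=8, k=2)
y = moe(torch.randn(4, 16))
```

The point of the sparse design is that each token only activates `k` of the experts, so the parameter count grows with the number of experts while the per-token compute stays roughly that of a single FFN.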
MoE Paths
Routing Strategy
The Routing Strategy can be based on the following levels:
- Token-Level
- Modality-Level
- Task-Level
- Context-Level
- Attribute-Level
Training Strategy
The Training Strategy covers 3 aspects (a sketch of the auxiliary loss follows the list):
- Auxiliary Loss Function Design: balance expert usage and distribute the load
- Expert Selection: choose which experts handle each input; popular methods include Top-K, Top-1, and Top-P
- Pipeline Design: optimise resource allocation and distribute data among the experts
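To illustrate the auxiliary loss, here is a sketch of a Switch-Transformer-style load-balancing term: for each expert it multiplies the fraction of tokens dispatched to it by the mean routing probability it receives, which is minimised when both are uniform. This is only one well-known formulation; the surveys discuss several variants, and the variable names here are my own.

```python
import torch

def load_balancing_loss(probs: torch.Tensor, indices: torch.Tensor, num_experts: int):
    """Switch-Transformer-style auxiliary loss (one common formulation).

    probs:   (num_tokens, num_experts) softmax routing probabilities
    indices: (num_tokens, k) expert ids selected for each token
    """
    # f_i: fraction of (token, slot) assignments dispatched to each expert
    dispatch = torch.zeros(num_experts)
    ids, counts = indices.unique(return_counts=True)
    dispatch[ids] = counts.float() / indices.numel()
    # P_i: mean routing probability the router assigns to each expert
    mean_prob = probs.mean(dim=0)
    # Minimised when both dispatch and mean_prob are uniform (1 / num_experts)
    return num_experts * torch.sum(dispatch * mean_prob)

# Example: 4 tokens, 8 experts, Top-2 routing (random values for illustration)
probs = torch.softmax(torch.randn(4, 8), dim=-1)
_, idx = torch.topk(probs, k=2, dim=-1)
aux_loss = load_balancing_loss(probs, idx, num_experts=8)
```

In training, this term would typically be added to the main language-modelling loss with a small scaling coefficient so that it nudges the router toward even load without dominating the objective.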
My current work
I am trying to use the Megatron-SWIFT framework to train Qwen2.5-7B-Instruct (Team, 2024). It is a real struggle, even from the first step of setting up the environment. When I have some results, I will write a post sharing them. Hopefully I can write a proper tutorial by the next time we meet.
References
- Mu & Lin. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications. arXiv preprint arXiv:2503.07137, 2025.
- Cai et al. A survey on mixture of experts. arXiv preprint arXiv:2407.06204, 2024.
- Vaswani et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
- Wang et al. MoA: Mixture-of-Attention for subject-context disentanglement in personalized image generation. In SIGGRAPH Asia 2024 Conference Papers, 2024.
- Jin et al. MoH: Multi-head attention as mixture-of-head attention. arXiv preprint arXiv:2410.11842, 2024.
- Qwen Team. Qwen2.5: A Party of Foundation Models. Sep 2024.