Mixture-of-Experts - first diggin'

At Bosch, I have the chance to learn about Machine Learning, specifically LLMs and MoE. In this post I will share the content of my first presentation about Mixture-of-Experts (MoE). I take the structure and content mainly from a survey from 2025 (Mu & Lin, 2025), with some information from a survey from 2024 (Cai et al., 2024).

Disclaimer: I am quite a noob in this field, so I am not sure everything I wrote in this post is true. But if I find out any problems, I will update it.

Why MoE?

AI applications are developing fast; some popular names are ChatGPT, Gemini, DeepSeek, and so on. But developing them also runs into some problems, and two major ones are:

  • Computational cost of training and deploying
  • Integrating conflicting or heterogeneous knowledge within a single model

So here we are: MoE is proposed to tackle these two problems. You can imagine MoE as a “divide-and-conquer” approach.

MoE components

In the MoE structure, we have two main parts: the Experts and the Router.

Router

The Router works as a distributor, routing data to the most suitable experts.

The Gating Function is the mathematical implementation of the Router. A good Gating Function meets 2 criteria:

  • Accurately discern characteristics of both input data and experts
  • Distribute the load as evenly as possible among the predefined experts

We can categorise Gating Functions into 3 types (a small sketch of the linear variant follows the list):

  • Linear Gating: Using softmax function
  • Non-linear Gating: Using cosine similarity in assigning experts
  • Soft MoE: Combining tokens to avoid dropping tokens issues
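To make the linear gating idea concrete, here is a minimal sketch in PyTorch of a softmax gate with top-k selection. The class name `LinearTopKGate`, the dimensions, and the renormalisation step are my own illustrative assumptions, not code from the surveys; real routers usually add noise, capacity limits, and load-balancing terms.

```python
# Minimal sketch (my own assumption, not from the surveys): a linear gating
# function that scores each token with a learned matrix, applies softmax,
# and keeps the top-k experts per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearTopKGate(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.w_gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_dim)
        logits = self.w_gate(x)                        # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)              # routing probabilities
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)
        # Renormalise so the selected experts' weights sum to 1 per token.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
        return topk_probs, topk_idx

gate = LinearTopKGate(hidden_dim=16, num_experts=4, k=2)
weights, experts = gate(torch.randn(8, 16))
print(weights.shape, experts.shape)  # torch.Size([8, 2]) torch.Size([8, 2])
```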

Experts

Experts are smaller sub-models that each specialise in handling a particular subset of the data. The expert network is based on the Transformer (Vaswani et al., 2017) structure.

There are 3 popular expert network methods (a rough sketch of the first one follows the list):

  • Replace FFN layer in Transformer with an MoE layer
    • Suitable to incorporate sparse activation mechanisms
    • Ideal choice for introducing the MoE mechanism
  • Apply MoE to the attention module in Transformer
    • MoA, Mixture-of-Attention (Wang et al., 2024): a gating network dynamically selects the most relevant attention
    • MoH, Mixture-of-Head attention (Jin et al., 2024): treats attention heads as experts and is said to have great potential
  • Apply MoE to CNN layer
    • Fully leverage CNN’s strengths in local feature extraction
    • Apply mainly in Computer Vision field
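Below is a rough sketch of the first option: replacing the Transformer FFN sub-layer with a sparse MoE layer. The expert count, sizes, and the simple loop over experts are my own illustrative assumptions; production systems batch tokens per expert and handle capacity limits instead of looping like this.

```python
# Rough sketch (assumptions: tiny dimensions, naive per-expert loop) of a
# sparse MoE layer that stands in for the Transformer FFN sub-layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertFFN(nn.Module):
    """One expert = a small position-wise feed-forward network."""
    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, hidden_dim)
        )

    def forward(self, x):
        return self.net(x)

class SparseMoELayer(nn.Module):
    def __init__(self, hidden_dim=16, ffn_dim=64, num_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList(ExpertFFN(hidden_dim, ffn_dim) for _ in range(num_experts))
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.k = k

    def forward(self, x):
        # x: (num_tokens, hidden_dim)
        probs = F.softmax(self.gate(x), dim=-1)
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # Each token's output is a weighted sum over its k selected experts.
        for i, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = topk_idx[:, slot] == i          # tokens routed to expert i
                if mask.any():
                    out[mask] += topk_probs[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = SparseMoELayer()
print(layer(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```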

MoE Paths

Routing Strategy

Routing strategies can be based on:

  • Token-Level
  • Modality-Level
  • Task-Level
  • Context-Level
  • Attribute-Level

Training Strategy

The training strategy involves 3 aspects (a small sketch of an auxiliary load-balancing loss follows the list):

  • Auxiliary Loss Function Design: balance expert usage and distribute the load
  • Expert Selection: choose experts for the input data. Popular methods include Top-K, Top-1, Top-P, …
  • Pipeline Design: optimize resource allocation and distribute data among experts
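As an illustration of the auxiliary loss idea, here is a hedged sketch in the spirit of the Switch Transformer load-balancing loss: per expert, it multiplies the fraction of tokens routed to that expert by its mean routing probability. The exact formulation and the loss coefficient vary between papers; this is only one common variant, not the method from the surveys.

```python
# Sketch (one common variant, assumptions noted above) of an auxiliary
# load-balancing loss that encourages tokens to spread evenly across experts.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top1_idx: torch.Tensor) -> torch.Tensor:
    # router_logits: (num_tokens, num_experts), top1_idx: (num_tokens,)
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)
    # f_i: fraction of tokens whose top-1 expert is expert i.
    token_fraction = F.one_hot(top1_idx, num_experts).float().mean(dim=0)
    # P_i: average routing probability assigned to expert i.
    prob_fraction = probs.mean(dim=0)
    return num_experts * torch.sum(token_fraction * prob_fraction)

logits = torch.randn(32, 4)
loss = load_balancing_loss(logits, logits.argmax(dim=-1))
print(loss)  # close to 1.0 when routing is perfectly balanced
```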

My current work

I am trying to use the Megatron-SWIFT framework to train Qwen2.5-7B-Instruct (Team, 2024). It is a real struggle, even from the first step of setting up the environment. When I have some results, I will write a post sharing them. Hopefully I can write a proper tutorial by the next time we meet.

References

  1. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications
    Siyuan Mu and Sen Lin
    arXiv preprint arXiv:2503.07137, 2025
  2. A survey on mixture of experts
    Weilin Cai, Juyong Jiang, Fan Wang, and 3 more authors
    arXiv preprint arXiv:2407.06204, 2024
  3. Attention is all you need
    Ashish Vaswani, Noam Shazeer, Niki Parmar, and 5 more authors
    Advances in neural information processing systems, 2017
  4. MoA: Mixture-of-attention for subject-context disentanglement in personalized image generation
    Kuan-Chieh Wang, Daniil Ostashev, Yuwei Fang, and 2 more authors
    In SIGGRAPH Asia 2024 Conference Papers, 2024
  5. MoH: Multi-head attention as mixture-of-head attention
    Peng Jin, Bo Zhu, Li Yuan, and 1 more author
    arXiv preprint arXiv:2410.11842, 2024
  6. Qwen2.5: A Party of Foundation Models
    Qwen Team
    Sep 2024


