Graphs-of-Heads - The First Literature Review
At work I am assigned to learn about Mixture-of-Experts (MoE) but my mentor wants another specific, tailor-made approach to our problem.
I name it Graphs-of-Heads (GoH).
I have a vague idea in my mind, but I think I need a literature review to make the idea as realistic as possible.
However, I won’t follow the ordinary literature review process of academic research, since I am working in industry. I will adapt along the way and build the code as I go.
This is a series of posts about this project, and this first post covers my first two literature reviews.
Plan
In my first plan, I want to use the MoH structure (Jin et al., 2024) as the base for development. Then I will apply the expert network based on (Du et al., 2024).
The Python project structure I apply is navdeep-G/samplemod.
The training framework will be Megatron-LM (Shoeybi et al., 2019) with continual, meta, and multi-task learning. Federated Learning will be developed once the application is deployed to many users.

My imagined structure.
Attention is all you need!!
We will start from the Transformer structure (Vaswani et al., 2017).
I am pretty bad at Python, so I will learn from and reference a lot of repositories, both for how they structure their files and for their coding methodology. I reference SCCSMARTCODE/attention-is-all-you-need-from-scratch and jadore801120/attention-is-all-you-need-pytorch to re-implement the Transformer structure.
The Transformer is an architecture that relies entirely on an attention mechanism to draw global dependencies between input and output. This allows significantly more parallelization.
I believe the structure I want is far from this work, but a journey of a thousand miles begins with a single step.
Encoder & Decoder

Transformer architecture. Source: (Vaswani et al., 2017).
From the illustration, you can see the Transformer has 2 main modules: the Encoder and the Decoder. Another thing worth noticing is the Positional Encoding.
The Encoder and the Decoder each consist of a stack of $N = 6$ layers. Each Encoder layer has 2 sub-layers:
- Multi-Head Attention
- Feed Forward
while each Decoder layer has 3 sub-layers:
- Masked Multi-Head Attention
- Multi-Head Attention (over the Encoder output)
- Feed Forward
Each sub-layer is wrapped in a residual connection, followed by layer normalization. The model dimension is $d_{model} = 512$. A minimal sketch of this sub-layer pattern is shown below.
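To make the residual-plus-layer-norm pattern concrete, here is a minimal PyTorch sketch. The class name and signature are my own, not from the paper; `sublayer` stands for either the attention or the feed-forward module.

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection around a sub-layer, followed by layer normalization
    (post-norm, as in the original Transformer)."""

    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # LayerNorm(x + Sublayer(x))
        return self.norm(x + self.dropout(sublayer(x)))
```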
Attention
To me, this is the heart of this work.
An attention function maps a query and a set of key-value pairs to an output vector:
- the output is a weighted sum of the values,
- the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
Scaled Dot-Product Attention
- Input:
  - Q: queries
  - K: keys of dimension $d_k$
  - V: values of dimension $d_v$
- $\frac{1}{\sqrt{d_k}}$: scaling factor. Why? To avoid pushing the softmax into regions where it has extremely small gradients.
- Dot-product attention is faster and more space-efficient in practice than additive attention.

\[Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
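Here is a minimal PyTorch sketch of scaled dot-product attention; the function name and the `mask == 0` convention are my own choices, not from the paper.

```python
import math
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)          # (..., len_q, len_k)
    if mask is not None:
        # block "illegal" connections by pushing their scores to -inf before softmax
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights
```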
Multi-Head Attention
They perform the attention function in parallel, yielding $d_v$-dimensional outputs per head. These are concatenated and projected again to produce the final values.
\(MultiHead(Q,K,V) = Concat(head_1, ..., head_h)W^O\) \(head_i = Attention(QW^Q_i, KW^K_i, VW^V_i)\)
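Putting the pieces together, here is a hedged PyTorch sketch of multi-head attention; the packed projection layout and all names are my own choices.

```python
import math
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Project Q, K, V into n_heads subspaces, attend in parallel,
    then concatenate the heads and apply the output projection W^O."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.w_q = nn.Linear(d_model, d_model)  # packs all W^Q_i
        self.w_k = nn.Linear(d_model, d_model)  # packs all W^K_i
        self.w_v = nn.Linear(d_model, d_model)  # packs all W^V_i
        self.w_o = nn.Linear(d_model, d_model)  # W^O

    def forward(self, q, k, v, mask=None):
        batch = q.size(0)

        # (batch, len, d_model) -> (batch, n_heads, len, d_k)
        def split(x, w):
            return w(x).view(batch, -1, self.n_heads, self.d_k).transpose(1, 2)

        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        heads = F.softmax(scores, dim=-1) @ v                    # head_i = Attention(QW^Q_i, KW^K_i, VW^V_i)
        concat = heads.transpose(1, 2).contiguous().view(batch, -1, self.n_heads * self.d_k)
        return self.w_o(concat)                                  # Concat(head_1, ..., head_h) W^O
```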
Attention in Transformer
Let’s recall the inputs of attention: Q (queries), K (keys), and V (values). The Transformer uses attention in 3 ways:
- “encoder-decoder attention” layers: Q comes from the previous decoder layer; K and V come from the output of the encoder.
- “encoder self-attention” layers: Q, K, and V all come from the previous encoder layer.
- “decoder self-attention” layers: similar to the encoder one, except that all values in the input of the softmax that correspond to illegal connections are masked out (set to $-\infty$) in scaled dot-product attention; see the mask sketch below.
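In practice, the decoder-side masking can be a simple lower-triangular matrix, matching the `mask == 0` convention used in the attention sketch above; this helper is my own, not from the paper.

```python
import torch

def causal_mask(seq_len):
    """1 (True) where attention is allowed, 0 (False) for 'illegal' future positions."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.int32)
```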
Position-wise Feed-Forward Networks
The FFN consists of 2 linear transformations with a ReLU activation in between: $FFN(x) = max(0, xW_1 + b_1)W_2 + b_2$, with inner dimension $d_{ff} = 2048$.
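A minimal PyTorch sketch of the position-wise FFN; the defaults follow the paper’s $d_{model} = 512$ and $d_{ff} = 2048$, while the class name is my own.

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently."""

    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # W1, b1
            nn.ReLU(),                  # max(0, .)
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # W2, b2
        )

    def forward(self, x):
        return self.net(x)
```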
Positional Encoding
This is a method to inject information about the relative or absolute position of the tokens in the sequence. The Positional Encoding has the same dimension $d_{model}$ as the embeddings. In this work, they use sine and cosine functions:
\(PE_{(pos, 2i)} = sin(pos/10000^{2i/d_{model}})\) \(PE_{(pos, 2i+1)} = cos(pos/10000^{2i/d_{model}})\)
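The formulas translate almost directly into code; here is a sketch. Computing $1/10000^{2i/d_{model}}$ in log space is a common implementation trick rather than something stated in the paper.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)      # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))                # 1 / 10000^(2i/d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # added to the token embeddings before the first layer
```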
MoH (Mixture-of-Heads)
In my understanding, MoH (Jin et al., 2024) is a mix of Mixture-of-Experts (MoE) and the Transformer (Vaswani et al., 2017).
They make 2 important changes: first, a Top-K router activates a subset of heads for each token; second, they replace the standard summation in multi-head attention with a weighted sum.
They believe these changes bring 2 significant advantages:
- First, each token selects the most relevant attention heads, which improves efficiency without sacrificing accuracy or increasing the number of parameters.
- Second, with the weighted sum, MoH enhances the flexibility of the attention mechanism.
Design
The core of the work is MoH, which treats attention heads as experts.
\[MoH(X, X') = \sum^h_{i=1} g_i H^i W^i_O\]
- $X, X'$: input tokens
- $g_i$: routing score of head $i$
- $H^i$: the $i$-th head
- $W^i_O$: the slice of the output projection matrix belonging to head $i$

A simplified sketch of this gating follows below.
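The following PyTorch sketch is only my simplified reading of the formula, not the authors’ implementation: the split between shared and routed heads, all names, and the use of a plain softmax plus Top-K for the routed scores are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoHGate(nn.Module):
    """Per-token head gating: shared heads are always active, routed heads are
    gated by a Top-K softmax score (simplified; the paper's two-stage routing
    also weights the shared heads dynamically)."""

    def __init__(self, d_model, n_shared, n_routed, top_k):
        super().__init__()
        self.n_shared, self.top_k = n_shared, top_k
        self.router = nn.Linear(d_model, n_routed)   # scores for routed heads only

    def forward(self, x):
        # x: (batch, seq, d_model) -> gate g: (batch, seq, n_shared + n_routed)
        scores = F.softmax(self.router(x), dim=-1)
        topk_val, topk_idx = scores.topk(self.top_k, dim=-1)
        routed_gate = torch.zeros_like(scores).scatter(-1, topk_idx, topk_val)
        shared_gate = torch.ones(*x.shape[:-1], self.n_shared, device=x.device)
        return torch.cat([shared_gate, routed_gate], dim=-1)

def moh_combine(gate, head_out, w_o):
    """MoH(X, X') = sum_i g_i H^i W^i_O as a weighted (instead of plain) head sum."""
    # head_out: (batch, seq, n_heads, d_v); gate: (batch, seq, n_heads); w_o: (n_heads * d_v, d_model)
    weighted = gate.unsqueeze(-1) * head_out
    return weighted.flatten(-2) @ w_o
```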
Inspired by DeepSeek (Dai et al., 2024), MoH designates a subset of heads as shared heads that remain always activated. This consolidates common knowledge within the shared heads.
Two-Stage Routing dynamically balances the weights between shared and routed heads: the routing scores are determined by both the score of each individual head and the score associated with the head type. To avoid an unbalanced load, MoH applies a Load Balance Loss; a generic sketch of such a loss is shown below.
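Since the paper’s exact formula is not spelled out here, this is a generic MoE-style load-balance auxiliary loss (in the spirit of Switch Transformer / DeepSeekMoE); it is an assumption and may differ from the loss used in the MoH paper.

```python
import torch

def load_balance_loss(scores, topk_idx, n_routed):
    """Encourage tokens to be spread evenly over routed heads (generic MoE-style
    auxiliary loss; the exact MoH formulation may differ)."""
    # scores:   (batch, seq, n_routed) softmax routing scores
    # topk_idx: (batch, seq, top_k)    indices of the activated routed heads
    mean_score = scores.mean(dim=(0, 1))                               # average score per head
    chosen = torch.zeros_like(scores).scatter(-1, topk_idx, 1.0)
    frac_tokens = chosen.mean(dim=(0, 1))                              # fraction of tokens routed to each head
    return n_routed * torch.sum(frac_tokens * mean_score)
```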
Training
Training LLMs from scratch, they use Megatron-LM (Shoeybi et al., 2019) with public datasets.
With Continual Learning, they tuned LLaMA3-8B. There are 3 challenges when doing this:
- Determine the shared attention heads
- Add head routers
- Weight attention heads
That’s all for the day. In the next post, I will discuss GaCLLM and how I imagine the system will work.
References
- Jin et al. MoH: Multi-head attention as mixture-of-head attention. arXiv preprint arXiv:2410.11842, 2024.
- Du et al. Large language model with graph convolution for recommendation. arXiv preprint arXiv:2402.08859, 2024.
- Shoeybi et al. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
- Vaswani et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
- Dai et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024.