Chengshuo Dai

Scaling Efficiently: Understanding Mixture of Experts (MoE)

Model Architecture, Scaling Laws

As models have grown in size, the computational cost of training and inference has skyrocketed. For a long time, I wondered how organizations could afford to run trillion-parameter models in production. The secret sauce, it turns out, is often a Mixture of Experts (MoE) architecture.

MoE is a fascinating paradigm shift. A dense model activates every parameter for every token. An MoE model activates only a small subset of its parameters for each token. This means you can have a massive model with a huge capacity for knowledge, while the compute per token scales with the number of active parameters rather than the total.
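
To make the "massive model, low cost per token" claim concrete, here is a rough back-of-the-envelope sketch in Python. The layer sizes (d_model=4096, d_ff=16384, 8 experts, top-2 routing) are hypothetical numbers picked for illustration, not taken from any particular model, and the count ignores biases, attention, and embeddings.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in a two-matrix FFN (d_model -> d_ff -> d_model), biases ignored."""
    return 2 * d_model * d_ff


def moe_layer_params(d_model: int, d_ff: int, num_experts: int, top_k: int):
    """Return (total, active-per-token) weight counts for one MoE layer."""
    expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts        # one linear layer producing a score per expert
    total = num_experts * expert + router
    active = top_k * expert + router      # only the top_k chosen experts run for a token
    return total, active


# Hypothetical sizes, chosen only for illustration.
total, active = moe_layer_params(d_model=4096, d_ff=16384, num_experts=8, top_k=2)
print(f"total: {total:,}  active per token: {active:,}  ratio: {active / total:.2%}")
```

With these assumed sizes, each token touches roughly a quarter of the layer's weights. Note that all of the experts still have to sit in memory, so the saving is in compute per token, not in storage.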

How MoE Works

The core components of an MoE layer are:

  1. The Experts: These are typically standard Feed-Forward Networks (FFNs). A single MoE layer might have 8, 16, or even 64 of these experts.
  2. The Router (Gating Network): This is the brain of the operation. For every incoming token, the router decides which expert(s) should process it. Typically, it selects the top-K experts (e.g., top-2 out of 8).

When a token passes through an MoE layer, it is routed only to the selected experts. The outputs from these experts are then combined, usually via a weighted sum in which each expert's output is scaled by the score the router assigned to it.
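
The routing steps above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not how any production model implements it: ToyExpert, moe_forward, and every size below are hypothetical names and values, the weights are random, and renormalizing the top-K scores so they sum to 1 is one common choice among several.

```python
import math
import random


def softmax(xs):
    m = max(xs)                                   # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class ToyExpert:
    """Tiny FFN: d_model -> d_ff (ReLU) -> d_model, with random weights."""

    def __init__(self, d_model, d_ff, rng):
        self.w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
        self.w_out = [[rng.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.w_in, x)]
        return matvec(self.w_out, hidden)


def moe_forward(x, router_w, experts, top_k=2):
    """Route one token vector x to its top_k experts and mix their outputs."""
    probs = softmax(matvec(router_w, x))          # one score per expert
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)          # assumption: renormalize over the chosen experts
    out = [0.0] * len(x)
    for i in chosen:                              # experts that were not chosen never run
        weight = probs[i] / norm
        out = [o + weight * y for o, y in zip(out, experts[i](x))]
    return out, chosen


rng = random.Random(0)
d_model, d_ff, num_experts = 4, 8, 8
experts = [ToyExpert(d_model, d_ff, rng) for _ in range(num_experts)]
router_w = [[rng.gauss(0, 1.0) for _ in range(d_model)] for _ in range(num_experts)]

token = [0.5, -1.0, 0.25, 2.0]
output, chosen = moe_forward(token, router_w, experts, top_k=2)
print("experts used:", chosen)
```

The key point is in the loop: six of the eight experts are never called for this token, which is where the compute saving comes from.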

Personal Reflection

What I find most intriguing about MoE is the concept of specialization. In theory, different experts should learn to handle different types of tokens: perhaps one becomes an expert in punctuation, another in coding syntax, and another in French grammar. In practice, however, analyzing expert routing often reveals less interpretable specialization than we might hope for.

Working with MoE models also introduced me to the headaches of load balancing. If the router sends all tokens to just one expert, that expert becomes a massive bottleneck, and the rest of the network sits idle. Balancing the load across experts while maintaining high performance is a delicate dance that requires careful loss function design (like adding an auxiliary load-balancing loss). It's a perfect example of how solving one problem (compute efficiency) often introduces new, complex engineering challenges.
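
As a sketch of what such an auxiliary loss can look like, here is one widely used formulation, the form popularized by the Switch Transformer paper: the number of experts times the sum, over experts, of (fraction of tokens sent to that expert) times (mean router probability for that expert). The function name and the top-1 assignment format are my own assumptions for illustration, and in real training this value would be multiplied by a small coefficient and added to the main loss.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary loss sketch: num_experts * sum_i(frac_tokens_i * mean_prob_i).

    router_probs: one list of per-expert probabilities for each token.
    assignments:  the expert index each token was actually sent to (top-1).
    """
    num_tokens = len(router_probs)
    frac_tokens = [0.0] * num_experts
    for expert in assignments:
        frac_tokens[expert] += 1.0 / num_tokens
    mean_prob = [
        sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))


# Perfectly balanced routing: 4 tokens spread evenly over 4 experts.
balanced = load_balancing_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4)

# Collapsed routing: every token goes to expert 0 with full confidence.
collapsed = load_balancing_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0], num_experts=4)

print(balanced, collapsed)
```

The balanced case evaluates to 1.0 and the collapsed case to 4.0 (the number of experts), so minimizing this term pushes the router toward spreading tokens evenly.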

