Mixture of Experts (MoE)
A mixture of experts is a model split into many specialized sub-networks, where a router sends each token to just a few of them. You get the capacity of a huge model while only running a fraction of it for any one token.
Also known as: MoE, sparse mixture of experts
Model Architecture, AI Infrastructure
In a dense model, every parameter runs on every input. A mixture of experts breaks the model into many “expert” sub-networks and adds a router that, for each token, scores the experts and picks just a few to actually run. The model can hold an enormous number of parameters, and the knowledge that comes with them, while only activating a small slice per token.
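The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the names `route_top_k`, `moe_layer`, `make_scaling_expert`, and `ROUTER_WEIGHTS` are hypothetical, the experts are toy functions that just scale their input, and normalizing the router scores with a softmax over only the chosen experts is one common choice assumed here (real systems differ on this detail).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and return (index, weight) pairs.

    Assumption: weights are a softmax over the selected scores only.
    """
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    top = ranked[:k]
    weights = softmax([router_scores[i] for i in top])
    return list(zip(top, weights))


def moe_layer(token, experts, router, k=2):
    """Run only the chosen experts and mix their outputs by router weight."""
    chosen = route_top_k(router(token), k)
    output = [0.0] * len(token)
    for index, weight in chosen:
        expert_output = experts[index](token)  # the other experts never run
        for j, value in enumerate(expert_output):
            output[j] += weight * value
    return output, [index for index, _ in chosen]


def make_scaling_expert(factor):
    """Toy stand-in for an expert sub-network: multiplies the token by a constant."""
    return lambda token: [factor * x for x in token]


# Four toy experts and a fixed linear router (hypothetical values).
experts = [make_scaling_expert(f) for f in (1.0, 2.0, 3.0, 4.0)]
ROUTER_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]


def router(token):
    """One score per expert: dot product of the token with that expert's row."""
    return [sum(w * x for w, x in zip(row, token)) for row in ROUTER_WEIGHTS]


output, used = moe_layer([2.0, 1.0], experts, router, k=2)
print(used)  # indices of the two experts that actually ran
```

With four experts and `k=2`, half of the expert parameters are skipped for this token. In a trained model the router weights are learned rather than fixed.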
The payoff is efficiency: you get much of the quality of a giant model at the compute cost of a far smaller one, because most of the network sits idle for any given token. The cost is complexity. Routing has to be trained well, and serving an MoE has its own quirks: every expert still has to be held in memory even though only a few run per token, and traffic has to be spread evenly enough that no single expert becomes a bottleneck. It’s one of the main architectural reasons recent large models can be both more capable and cheaper to run than a dense model with the same total parameter count would be.
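The trade-off above comes down to simple arithmetic, sketched here under loud assumptions: every number below is hypothetical and chosen only for illustration, and the helpers `moe_param_counts` and `expert_load` are made-up names. The sketch treats the model as a single pool of experts plus some shared parameters, which ignores the per-layer structure of real models.

```python
from collections import Counter


def moe_param_counts(shared_params, params_per_expert,
                     num_experts, experts_per_token):
    """Return (total, active) parameter counts for a simplified MoE.

    total  -> what must be held in memory to serve the model
    active -> what actually runs for one token
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active


# Hypothetical sizes, not taken from any real model.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    params_per_expert=1_000_000_000,
    num_experts=64,
    experts_per_token=2,
)
active_fraction = active / total
print(f"total={total:,} active={active:,} fraction={active_fraction:.3f}")


def expert_load(assignments, num_experts):
    """Count how many tokens were routed to each expert."""
    counts = Counter(i for token_experts in assignments for i in token_experts)
    return [counts.get(i, 0) for i in range(num_experts)]


# Hypothetical routing decisions for four tokens, two experts each.
assignments = [[0, 1], [0, 2], [0, 1], [3, 0]]
load = expert_load(assignments, num_experts=4)

# Ratio of the busiest expert to the average: 1.0 would be perfectly balanced.
imbalance = max(load) / (sum(load) / len(load))
print(load, imbalance)
```

The first half shows why compute and memory diverge: memory scales with the total count, while per-token compute scales with the active count. The second half shows the kind of skew that load balancing tries to prevent, where one expert receives far more tokens than the rest.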