shared
Top-k routed MoE with a sigmoid-gated shared expert (Qwen 3.5 MoE).
Differences vs. :class:`theseus.model.moe.base.MoE`:
* Experts are gated SiLU MLPs (QwenMLP) sized by
architecture/moe_intermediate_size instead of the layer-wide
architecture/intermediate_size.
* Routing is HF-style: softmax over all experts → top-k → renormalize
(matches Qwen3_5MoeTopKRouter).
* A shared expert path is added: sigmoid(gate(x)) * shared_mlp(x).
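
The three points above can be illustrated with a minimal pure-Python sketch for a single token vector. It is not the theseus or HF implementation. The helper names (`route_top_k`, `gated_silu_mlp`, `moe_with_shared_expert`), the weight layout (lists of rows), the gate/up/down projection structure of the gated SiLU MLP, and the scalar shared gate are all assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def silu(z):
    return z * sigmoid(z)


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Softmax over all experts, keep the top-k, renormalize to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def gated_silu_mlp(x, w_gate, w_up, w_down):
    """Assumed layout: down(silu(gate(x)) * up(x)); each weight is a list of rows."""
    hidden = [silu(dot(g, x)) * dot(u, x) for g, u in zip(w_gate, w_up)]
    return [dot(row, hidden) for row in w_down]


def moe_with_shared_expert(x, router_w, experts, shared_expert, shared_gate_w, k):
    """Routed top-k output plus sigmoid(gate(x)) * shared_mlp(x)."""
    logits = [dot(row, x) for row in router_w]
    out = [0.0] * len(x)
    for idx, weight in route_top_k(logits, k):
        y = gated_silu_mlp(x, *experts[idx])
        out = [o + weight * v for o, v in zip(out, y)]
    # Assumption: the shared gate projects the token to a single scalar.
    g = sigmoid(dot(shared_gate_w, x))
    shared = gated_silu_mlp(x, *shared_expert)
    return [o + g * s for o, s in zip(out, shared)]
```

Note that the routing weights are renormalized after the top-k selection, so they sum to 1 over the chosen experts rather than over all experts.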
With the default capacity_factor of 1.0 and capacity clamped to
num_tokens, no tokens are dropped, which is required for parity with HF.
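
The no-drop claim can be sanity-checked with a small sketch. The exact capacity formula is not given here, so the code only illustrates the underlying bound: top-k picks distinct experts per token, so no expert can receive more than num_tokens assignments, and a capacity of num_tokens cannot overflow. The helpers `expert_loads` and `dropped_assignments` are hypothetical names, not part of theseus.

```python
from collections import Counter


def expert_loads(assignments, num_experts):
    """Count how many tokens were routed to each expert.

    `assignments` holds, per token, the list of distinct expert ids it chose.
    """
    counts = Counter(e for chosen in assignments for e in chosen)
    return [counts.get(e, 0) for e in range(num_experts)]


def dropped_assignments(assignments, num_experts, capacity):
    """Number of token-to-expert assignments that exceed the per-expert capacity."""
    loads = expert_loads(assignments, num_experts)
    return sum(max(0, load - capacity) for load in loads)


# Worst case for expert 0: every token selects it (top-2 over 4 experts).
assignments = [[0, 1], [0, 2], [0, 3], [0, 1]]
num_tokens = len(assignments)

# Capacity equal to num_tokens: nothing overflows.
no_drop = dropped_assignments(assignments, num_experts=4, capacity=num_tokens)

# A tighter, perfectly balanced capacity (num_tokens * k / num_experts = 2) would overflow.
with_drop = dropped_assignments(assignments, num_experts=4, capacity=2)
```

In this skewed example a balanced capacity of 2 loses two assignments on expert 0, while a capacity of num_tokens loses none.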