shared

Top-k routed MoE with a sigmoid-gated shared expert (Qwen 3.5 MoE).

Differences vs. :class:`theseus.model.moe.base.MoE`:

* Experts are gated SiLU MLPs (QwenMLP) sized by architecture/moe_intermediate_size instead of the layer-wide architecture/intermediate_size.
* Routing is HF-style: softmax over all experts → top-k → renormalize (matches Qwen3_5MoeTopKRouter).
* A shared expert path is added: sigmoid(gate(x)) * shared_mlp(x).
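The routing order and the shared-expert gate described above can be sketched in plain Python for a single token. This is an illustrative sketch, not the module's code: `route_top_k` and `combine` are hypothetical names, the gate is assumed to produce one scalar logit per token, and the gated shared output is assumed to be summed with the routed output.

```python
import math


def route_top_k(logits, k):
    """Softmax over all experts, keep the top-k, renormalize the kept weights."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k largest probabilities, highest first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return top, [probs[i] / kept for i in top]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def combine(routed_out, shared_out, gate_logit):
    """Assumed combination: routed + sigmoid(gate) * shared, per hidden unit."""
    g = sigmoid(gate_logit)
    return [r + g * s for r, s in zip(routed_out, shared_out)]


indices, weights = route_top_k([2.0, 1.0, 0.0, 3.0], k=2)
print(indices, weights)
```

Because the softmax runs over all experts before selection, the renormalized weights depend only on the relative logits of the chosen experts; applying the softmax after top-k would give the same weights here, but the order matters for matching any intermediate quantities that are computed from the full distribution.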

With the default capacity_factor of 1.0 and capacity clamped to num_tokens, no tokens are dropped, which is required for parity with HF.
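One way to see why a per-expert capacity equal to num_tokens cannot drop anything: each token selects a given expert at most once, so no expert can receive more than num_tokens assignments. The sketch below illustrates that argument only. The capacity formula used by the module is not shown in this page, so `count_dropped` is a hypothetical helper that takes the capacity as an input rather than computing it.

```python
from collections import Counter


def count_dropped(assignments, capacity):
    """Count (token, expert) assignments that exceed a per-expert capacity.

    assignments: one list of distinct expert indices per token.
    capacity: maximum number of tokens each expert may process (assumed given).
    """
    load = Counter(e for experts in assignments for e in experts)
    return sum(max(0, n - capacity) for n in load.values())


# Worst case for load balance: all 4 tokens pick the same two experts.
worst_case = [[0, 1]] * 4
num_tokens = len(worst_case)

print(count_dropped(worst_case, capacity=num_tokens))  # capacity == num_tokens
print(count_dropped(worst_case, capacity=2))           # a tighter capacity
```

With capacity equal to num_tokens the first call reports no overflow even under fully skewed routing, while the tighter capacity in the second call overflows on both experts.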