qwen_3_5_moe

Text-only Qwen 3.5 MoE (35B-A3B).

Subclasses Qwen3_5: same hybrid attention layout, but each layer's MLP is a routed top-k MoE with a sigmoid-gated shared expert.
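
To make the MLP description concrete, here is a minimal NumPy sketch of a routed top-k MoE with a sigmoid-gated shared expert, for a single token. It is an illustration under stated assumptions, not this module's implementation: the names (`moe_layer`, `swiglu`, `router_w`, `shared_gate_w`), the SwiGLU expert form, the softmax router, and the renormalisation of the top-k weights are all hypothetical choices made for the example.

```python
import numpy as np


def silu(x):
    # SiLU / swish activation: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))


def swiglu(x, gate_w, up_w, down_w):
    # x: (hidden,), gate_w / up_w: (hidden, inter), down_w: (inter, hidden).
    return (silu(x @ gate_w) * (x @ up_w)) @ down_w


def moe_layer(x, router_w, experts, shared, shared_gate_w, top_k):
    """Hypothetical single-token MoE forward pass.

    experts: dict with stacked 'gate', 'up', 'down' weights, one slice per expert.
    shared: (gate_w, up_w, down_w) tuple for the always-on shared expert.
    """
    # Router: softmax over experts, then keep the top_k most probable ones.
    logits = x @ router_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top = np.argsort(probs)[-top_k:]
    # Assumption: the selected weights are renormalised to sum to 1.
    weights = probs[top] / probs[top].sum()

    routed = sum(
        w * swiglu(x, experts["gate"][e], experts["up"][e], experts["down"][e])
        for w, e in zip(weights, top)
    )
    # Shared expert output is scaled by a scalar sigmoid gate computed from x.
    shared_scale = 1.0 / (1.0 + np.exp(-(x @ shared_gate_w)))
    return routed + shared_scale * swiglu(x, *shared)
```

In a JAX implementation the per-expert loop would typically be replaced by a vmap over the stacked expert weights; the loop here only keeps the sketch easy to read.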

HF stores experts as fused 3D tensors gate_up_proj / down_proj. The loader splits gate_up_proj along the intermediate axis into our separate gate / up vmapped MLPs.
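
The split can be sketched as follows in NumPy (the same slicing carries over to `jax.numpy`). This is a minimal illustration, not the loader's actual code: the function name `split_gate_up` is hypothetical, and both the position of the intermediate axis (last, here) and the gate-first ordering of the two halves are assumptions that should be checked against the checkpoint being loaded.

```python
import numpy as np


def split_gate_up(gate_up_proj, axis=-1):
    """Split a fused (experts, ..., 2 * inter) tensor into gate and up halves.

    Assumption: the first half along `axis` holds the gate projection and the
    second half holds the up projection.
    """
    size = gate_up_proj.shape[axis]
    if size % 2 != 0:
        raise ValueError(f"fused axis has odd size {size}; cannot split in two")
    gate, up = np.split(gate_up_proj, 2, axis=axis)
    return gate, up


# Toy fused tensor: 2 experts, hidden size 3, intermediate size 4 (fused to 8).
fused = np.arange(2 * 3 * 8, dtype=np.float32).reshape(2, 3, 8)
gate, up = split_gate_up(fused)
```

Each half keeps the leading expert axis, so the resulting `gate` and `up` arrays can be handed to a per-expert MLP that is vmapped over that axis.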