layers

RotaryPosEncoding(d_model: int, base: int = 10000, seq_dim: int = 1, partial_rotary_factor: float = 1.0)

Unified RoPE supporting standard, Qwen, and NeoX variants.

Differences between variants are controlled by constructor args:

- base: 10000 (standard/NeoX) or 1e6 (Qwen)
- partial_rotary_factor: 1.0 (standard/Qwen) or <1.0 (NeoX)
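To make the effect of base and partial_rotary_factor concrete, here is a minimal framework-free sketch. It is not the library's implementation: the helper names rope_inv_freq and apply_rope are hypothetical, the rotate-half pairing (dimension i with i + half) is an assumption, and so is the way the rotary width is rounded down to an even number.

```python
import math

def rope_inv_freq(d_model, base=10000, partial_rotary_factor=1.0):
    """Inverse frequencies for the rotated part of the vector (hypothetical helper)."""
    # Assumption: only the first rot_dim dimensions are rotated.
    rot_dim = int(d_model * partial_rotary_factor)
    rot_dim -= rot_dim % 2  # rotation acts on pairs, so keep the width even
    return [base ** (-(2 * i) / rot_dim) for i in range(rot_dim // 2)]

def apply_rope(x, pos, base=10000, partial_rotary_factor=1.0):
    """Rotate one vector x (a list of floats) for integer position pos."""
    inv_freq = rope_inv_freq(len(x), base, partial_rotary_factor)
    half = len(inv_freq)
    out = list(x)  # dimensions beyond the rotary width pass through unchanged
    for i, freq in enumerate(inv_freq):
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]  # assumed rotate-half pairing
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

x = [1.0] * 8
standard = apply_rope(x, pos=3)                              # all 8 dims rotated
qwen_like = apply_rope(x, pos=3, base=1000000)               # slower frequencies
neox_like = apply_rope(x, pos=3, partial_rotary_factor=0.5)  # last 4 dims untouched
```

A larger base lowers every frequency after the first, and a partial_rotary_factor below 1.0 leaves the trailing dimensions as they were.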

QwenRotaryPosEncoding(d_model: int, base: int = 1000000, seq_dim: int = 1)

Bases: RotaryPosEncoding

Qwen RoPE (just RotaryPosEncoding with base=1e6).

NeoXRotaryPosEncoding(d_model: int, base: int = 10000, seq_dim: int = 1, partial_rotary_factor: float = 1.0)

Bases: RotaryPosEncoding

GPT-NeoX style RoPE with partial rotary dimensions.

MRotaryPosEncoding(d_model: int, base: int = 10000, seq_dim: int = 1, partial_rotary_factor: float = 1.0, mrope_section: Optional[Sequence[int]] = None)

Bases: RotaryPosEncoding

RoPE with optional 3-channel (T, H, W) position ids.

Pass positions with shape (B, T) for text-only input, or (3, B, T) for multimodal input. In the multimodal case the three channels are interleaved according to mrope_section.
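The sketch below shows one plausible reading of how mrope_section could assign the three position channels to rotary frequencies. It assumes a chunked layout (the first mrope_section[0] frequencies use T, the next block uses H, the last uses W); the layout this class actually uses may differ. The helper names are hypothetical.

```python
def mrope_channel_per_freq(mrope_section):
    """Map each rotary frequency index to a channel: 0 = T, 1 = H, 2 = W."""
    channels = []
    for channel, width in enumerate(mrope_section):
        channels.extend([channel] * width)
    return channels

def mrope_angles(pos_thw, inv_freq, mrope_section):
    """Rotation angles for one token whose position is the triple (t, h, w)."""
    channels = mrope_channel_per_freq(mrope_section)
    if len(channels) != len(inv_freq):
        raise ValueError("mrope_section must sum to the number of frequencies")
    # Each frequency is multiplied by the position id of its assigned channel.
    return [pos_thw[ch] * freq for ch, freq in zip(channels, inv_freq)]

inv_freq = [1.0, 0.5, 0.25, 0.125]
image_token = mrope_angles((5, 2, 3), inv_freq, mrope_section=[2, 1, 1])
text_token = mrope_angles((4, 4, 4), inv_freq, mrope_section=[2, 1, 1])
```

When all three channels carry the same position id, as for a text token, the angles reduce to those of ordinary single-channel RoPE.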

RMSNorm

Bases: Module

Root-mean-square layer norm.

centered=True switches to the Qwen 3.5 / OLMo-2 convention: the multiplier is 1 + weight, with weight initialized to 0. The effective scale is numerically still centered at 1, but this parameterization lets HF checkpoints round-trip.
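The two parameterizations can be compared with a small framework-free sketch. The function name rms_norm, the eps default, and the placement of eps inside the square root are assumptions for illustration, not this class's actual code.

```python
import math

def rms_norm(x, weight, eps=1e-6, centered=False):
    """RMS-normalize a list of floats, then apply a per-dimension scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    # centered=True: the stored weight is an offset from 1.
    scale = [1.0 + w for w in weight] if centered else list(weight)
    return [v / rms * s for v, s in zip(x, scale)]

x = [1.0, 2.0, 3.0, 4.0]
plain = rms_norm(x, weight=[1.0] * 4)                    # weight init at 1
offset = rms_norm(x, weight=[0.0] * 4, centered=True)    # weight init at 0
```

At initialization both calls produce the same output; only the value stored in weight differs, which is what matters when loading a checkpoint saved under one convention or the other.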

RMSNormGated

Bases: Module

RMSNorm with a multiplicative silu(gate) after the weight.

Used inside the gated-delta-net token mixer (Qwen 3.5 linear-attention layers). The weight is initialized to ones, matching HF Qwen3_5RMSNormGated.
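As a rough illustration of the order of operations described above (normalize, multiply by the weight, then multiply by silu(gate)), here is a framework-free sketch. The function names, the eps default, and the per-element list representation are assumptions, not the class's actual code.

```python
import math

def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def rms_norm_gated(x, gate, weight, eps=1e-6):
    """RMS-normalize x, scale by weight, then gate elementwise with silu(gate)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w * silu(g) for v, w, g in zip(x, weight, gate)]

x = [1.0, 2.0, 3.0, 4.0]
ones = [1.0] * 4                                   # weight initialized to ones
closed = rms_norm_gated(x, gate=[0.0] * 4, weight=ones)
opened = rms_norm_gated(x, gate=[1.5] * 4, weight=ones)
```

A gate of zero suppresses the output entirely, since silu(0) is 0, while larger gate values let more of the normalized signal through.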