Skip to content

lact

lact

LaCT (Large-Chunk Test-Time Training) block.

Three residual sublayers in sequence:

  1. Sliding-window attention (GroupedSelfAttention with sliding=True) — the local mixer that re-establishes intra-chunk order/causality the TTT layer deliberately ignores.
  2. The TTT layer — Q/K/V projections + per-token eta predictor + a SwiGLU "fast-weight" network f_W whose weights W are updated once per chunk of b tokens via gradient descent on L(W) = -Σ eta · <f_W(k), v>, then queried as f_W(q). The initial fast weights W*_0 are slow params (learned by outer SGD); at inference, a mutable "fast_weights" collection persists the post-update W across forward calls.
  3. A standard feed-forward MLP.

The TTT layer treats the chunk as an unordered set — that is the paper's whole point. Causality across chunks comes from the apply_then_update execution order (query a chunk before writing its keys/values into W), no masking machinery required.

LaCTBlock

Bases: Module