lact
lact
¶
LaCT (Large-Chunk Test-Time Training) block.
Three residual sublayers in sequence:
- Sliding-window attention (
GroupedSelfAttentionwithsliding=True) — the local mixer that re-establishes intra-chunk order/causality the TTT layer deliberately ignores. - The TTT layer — Q/K/V projections + per-token eta predictor + a SwiGLU
"fast-weight" network
f_Wwhose weightsWare updated once per chunk ofbtokens via gradient descent onL(W) = -Σ eta · <f_W(k), v>, then queried asf_W(q). The initial fast weightsW*_0are slow params (learned by outer SGD); at inference, a mutable"fast_weights"collection persists the post-updateWacross forward calls. - A standard feed-forward MLP.
The TTT layer treats the chunk as an unordered set — that is the paper's whole
point. Causality across chunks comes from the apply_then_update execution
order (query a chunk before writing its keys/values into W), no masking
machinery required.