hybrid
Hybrid Transformer + Mamba model.
Interleaves standard transformer (attention) blocks with Mamba-2 SSM blocks, following the Jamba architecture pattern (Lieber et al., 2024).
Inherits GPT's loss, unembed, and __call__. Only setup, components, sharding, and _parse_mamba_layers are new; embed and decode are minor overrides.
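
The interleaving described above can be illustrated with a small, self-contained sketch. It is not the module's code: the real _parse_mamba_layers signature and spec format are not shown here, so parse_layer_indices, block_pattern, and the "0-2,4-6" spec syntax are all hypothetical names and conventions chosen for illustration. The sketch only shows one plausible way to turn a layer spec into a per-layer choice between attention and Mamba blocks.

```python
from typing import List


def parse_layer_indices(spec: str, n_layers: int) -> List[int]:
    """Parse a spec such as "0-2,4-6" into sorted layer indices.

    Hypothetical helper: the spec syntax is an assumption made for this
    sketch, not the format used by the module's _parse_mamba_layers.
    """
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Inclusive range, e.g. "4-6" -> 4, 5, 6.
            lo, hi = (int(x) for x in part.split("-", 1))
            indices.update(range(lo, hi + 1))
        else:
            indices.add(int(part))
    out_of_range = sorted(i for i in indices if not 0 <= i < n_layers)
    if out_of_range:
        raise ValueError(f"layer indices out of range: {out_of_range}")
    return sorted(indices)


def block_pattern(n_layers: int, mamba_layers: List[int]) -> List[str]:
    """Return the block type for each layer: "mamba" or "attention"."""
    mamba = set(mamba_layers)
    return ["mamba" if i in mamba else "attention" for i in range(n_layers)]


# Eight layers with an attention block after every three Mamba blocks.
mamba_layers = parse_layer_indices("0-2,4-6", n_layers=8)
pattern = block_pattern(8, mamba_layers)
```

Here pattern has "attention" at layers 3 and 7 and "mamba" everywhere else. Keeping the parsing step separate from the pattern construction means the spec can be validated once at setup time, before any blocks are built.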