> Unlike Transformer models, Mamba models offer the advantage of linear time inference and the theoretical ability to model sequences of infinite length.
> We have tested Codestral Mamba on in-context retrieval capabilities up to 256k tokens.
Why only 256k tokens? Gemini's context window is 1 million or more and it's (probably) not even using Mamba.
Gemini is probably using ring attention. But scaling to that size takes considerably more engineering effort on the interconnect between devices, which goes beyond the scope of this release from Mistral.
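
For anyone wondering what "linear time inference" buys you in practice, here is a rough sketch. It is a toy scalar recurrence, not Mamba's actual selective scan, and the names `ssm_scan` and `kv_cache_entries` as well as the parameters `a`, `b`, `c` are made up for illustration. The point is only that a recurrent model carries a fixed-size state per step, while an attention cache grows with every token.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Toy state-space recurrence (hypothetical, scalar state).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    Each step does constant work and keeps one number of state,
    so processing n tokens costs O(n) time and O(1) memory.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # update the fixed-size state
        ys.append(c * h)   # read the output from the state
    return ys


def kv_cache_entries(seq_len, n_layers, n_kv_heads, head_dim):
    """Number of cached key/value scalars for a plain attention decoder.

    The factor 2 covers keys and values. The count grows linearly with
    seq_len, unlike the recurrent state above.
    """
    return 2 * seq_len * n_layers * n_kv_heads * head_dim


if __name__ == "__main__":
    print(ssm_scan([1.0, 0.0, 0.0], a=0.5))
    # Layer/head sizes below are arbitrary placeholders, not any real model's.
    print(kv_cache_entries(256_000, n_layers=32, n_kv_heads=8, head_dim=128))
```

The tested context length is a separate question from this: a fixed-size state can in principle be rolled forward forever, but whether it still retrieves the right thing after N tokens has to be measured, which is presumably why the announcement only claims what was tested.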