dictlearn
Dictionary learning dataset.
A synthetic dataset for studying whether models can learn compositions of
random lookup-table functions. Each sample is a space-separated integer
sequence designed for use with TrivialTokenizer.
Sequence format::
f1 f2 ... fn <START> v1 <SEP> fn(fn-1(...f1(v1)...))
where:
- f1 ... fn are randomly chosen function tokens (1-indexed references into a
  bank of N_FUNCTIONS random permutation-style lookup tables).
- v1 is a randomly chosen value token.
- The final token is the result of composing the functions left-to-right on v1.
- <START> and <SEP> are delimiter tokens.
The model must internalize each function's mapping to predict the output.
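The generation scheme described above can be sketched as follows. This is an illustrative reconstruction, not the dataset's actual code: the helper `make_sample` and the rendering of delimiters as literal `<START>`/`<SEP>` strings are assumptions, and the real implementation may emit token ids instead.

```python
import random

N_FUNCTIONS = 32   # number of distinct lookup-table functions (from the constants below)
n_values = 64      # default value-space size
rng = random.Random(7)  # FIXED_SEED

# One random permutation of the value space per function.
tables = [rng.sample(range(n_values), n_values) for _ in range(N_FUNCTIONS)]

def make_sample(n_funcs: int) -> str:
    """Build one sample: f1 ... fn <START> v1 <SEP> fn(...f1(v1)...)."""
    funcs = [rng.randrange(N_FUNCTIONS) for _ in range(n_funcs)]
    v1 = rng.randrange(n_values)
    out = v1
    for f in funcs:              # apply f1 first, then f2, ... (left-to-right)
        out = tables[f][out]
    # Function tokens are 1-indexed in the sequence format.
    return " ".join(
        [str(f + 1) for f in funcs] + ["<START>", str(v1), "<SEP>", str(out)]
    )

print(make_sample(4))
```

Because the tables are permutations, every function is invertible, so the final token is uniquely determined by the function tokens and v1.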
Constants (hardcoded for reproducibility):
- N_FUNCTIONS = 32: number of distinct functions.
- FIXED_SEED = 7: seed for deterministic generation.
- TRAIN_SEQUENCES = 100000: number of training sequences.
- VAL_SEQUENCES = 500: number of validation sequences.
Token layout (for a given n_values)::
Tokens 1..32 → function tokens
Tokens 33..32+n_values → value tokens
Token 33+n_values → START delimiter
Token 34+n_values → SEP delimiter
Token 0 → EOT (end-of-text)
VOCAB_SIZE = 35 + n_values
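The token layout above can be captured in a small classifier. The function name `token_kind` is illustrative and not part of the dataset API; the ranges are taken directly from the layout table.

```python
def token_kind(tok: int, n_values: int) -> str:
    """Map a token id to its role under the documented layout."""
    if tok == 0:
        return "EOT"                      # end-of-text
    if 1 <= tok <= 32:
        return "function"                 # tokens 1..32
    if 33 <= tok <= 32 + n_values:
        return "value"                    # tokens 33..32+n_values
    if tok == 33 + n_values:
        return "START"                    # delimiter
    if tok == 34 + n_values:
        return "SEP"                      # delimiter
    raise ValueError(f"token {tok} out of range for n_values={n_values}")
```

For the default 64-value variant this gives ids 1..32 for functions, 33..96 for values, 97 for START, 98 for SEP, and a vocab size of 99.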
Registered variants (seq_length x n_values):
- dictlearn_16: length 16, 64 values (default)
- dictlearn_16_v{N}: length 16, N values
- dictlearn_512: length 512, 64 values (default)
- dictlearn_512_v{N}: length 512, N values
where N ∈ {32, 64, 128, 256, 512, 1024}.
vocab_size(n_values: int) -> int
Return the vocab size for a given n_values.
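Given the documented relation VOCAB_SIZE = 35 + n_values, the function presumably reduces to the arithmetic below; this is a sketch consistent with the docstring, not the verified source.

```python
def vocab_size(n_values: int) -> int:
    # 32 function tokens + n_values value tokens
    # + START + SEP + EOT  =  35 + n_values
    return 35 + n_values
```

For example, the default 64-value variants use a vocab size of 99, and the largest registered variant (1024 values) uses 1059.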