dictlearn

Dictionary learning dataset.

A synthetic dataset for studying whether models can learn compositions of random lookup-table functions. Each sample is a space-separated integer sequence designed for use with TrivialTokenizer.

Sequence format::

f1 f2 ... fn <START> v1 <SEP> fn(fn-1(...f1(v1)...))

where:

  • f1 ... fn are randomly chosen function tokens (1-indexed into a table of N_FUNCTIONS random permutation-style lookup tables).
  • v1 is a randomly chosen value token.
  • The final token is the result of composing the functions left-to-right on v1.
  • <START> and <SEP> are delimiter tokens.

The model must internalize each function's mapping to predict the output.
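The generation procedure described above can be sketched as follows. This is an illustrative reconstruction, not the dataset's actual implementation: the helper names (`make_tables`, `make_sample`) and the exact string encoding are assumptions; only the constants, the permutation tables, and the left-to-right composition come from the documentation.

```python
import random

N_FUNCTIONS = 32  # documented constant
FIXED_SEED = 7    # documented constant

def make_tables(n_values: int, rng: random.Random) -> list[list[int]]:
    """One random permutation of the value range per function (assumed scheme)."""
    tables = []
    for _ in range(N_FUNCTIONS):
        perm = list(range(n_values))
        rng.shuffle(perm)
        tables.append(perm)
    return tables

def make_sample(tables: list[list[int]], n_funcs: int, n_values: int,
                rng: random.Random) -> list[str]:
    funcs = [rng.randrange(N_FUNCTIONS) for _ in range(n_funcs)]
    v = rng.randrange(n_values)
    out = v
    for f in funcs:              # apply f1 first, fn last (left-to-right)
        out = tables[f][out]
    # function tokens are 1-indexed; value tokens start at 33 (= N_FUNCTIONS + 1)
    toks = [str(f + 1) for f in funcs]
    toks += ["<START>", str(v + 1 + N_FUNCTIONS), "<SEP>", str(out + 1 + N_FUNCTIONS)]
    return toks

rng = random.Random(FIXED_SEED)
tables = make_tables(64, rng)    # lookup tables are fixed once, then reused
sample = make_sample(tables, 3, 64, rng)
```

Note that the tables are built once and shared across all samples; the model can only succeed by memorizing each function's mapping, since the final token is not predictable from the sequence alone.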

Constants (hardcoded for reproducibility):

  • N_FUNCTIONS = 32 — number of distinct functions.
  • FIXED_SEED = 7 — seed for deterministic generation.
  • TRAIN_SEQUENCES = 100000 — number of training sequences.
  • VAL_SEQUENCES = 500 — number of validation sequences.

Token layout (for a given n_values)::

Tokens 1..32                    → function tokens
Tokens 33..32+n_values          → value tokens
Token  33+n_values              → START delimiter
Token  34+n_values              → SEP delimiter
Token  0                        → EOT (end-of-text)
VOCAB_SIZE = 35 + n_values
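The layout above can be expressed as a small classifier over token ids. This helper is a sketch for illustration only; the function name and error handling are assumptions, but the ranges mirror the table exactly.

```python
def token_role(tok: int, n_values: int) -> str:
    """Map a token id to its role under the documented layout (hypothetical helper)."""
    if tok == 0:
        return "EOT"
    if 1 <= tok <= 32:
        return "function"
    if 33 <= tok <= 32 + n_values:
        return "value"
    if tok == 33 + n_values:
        return "START"
    if tok == 34 + n_values:
        return "SEP"
    raise ValueError(f"token {tok} out of range for n_values={n_values}")
```

For `n_values = 64`, for example, token 96 is the last value token, 97 is START, and 98 is SEP, giving a vocabulary of 99 tokens.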

Registered variants (seq_length x n_values):

  • dictlearn_16 — length 16, 64 values (default)
  • dictlearn_16_v{N} — length 16, N values
  • dictlearn_512 — length 512, 64 values (default)
  • dictlearn_512_v{N} — length 512, N values

where N ∈ {32, 64, 128, 256, 512, 1024}.
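The full set of registered names can be enumerated from the pattern above. The loop below is a sketch of the naming scheme only; the actual registration code may differ.

```python
VALUE_COUNTS = (32, 64, 128, 256, 512, 1024)  # the documented N values

names = []
for seq_len in (16, 512):
    names.append(f"dictlearn_{seq_len}")        # default variant: 64 values
    for n in VALUE_COUNTS:
        names.append(f"dictlearn_{seq_len}_v{n}")
```

This yields 14 names in total: two defaults plus six explicit value counts per sequence length.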

vocab_size(n_values: int) -> int

Return the vocabulary size for a given n_values (see VOCAB_SIZE in the token layout above).
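Given the token layout, the function reduces to one line. A minimal sketch, assuming the documented layout (32 function tokens + n_values value tokens + START + SEP + EOT):

```python
def vocab_size(n_values: int) -> int:
    # 32 function tokens + n_values value tokens + START + SEP + EOT = 35 + n_values
    return 35 + n_values
```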