pile_injected
Pile with deterministic injected sequences for memorization evaluation.
Following Huang et al. (2024), this module injects verifiably unknown token sequences into the Pile at fixed positions, so that their retrieval can be measured after training.
PileInjected(config: str | None = None, split: str = 'train')
Bases: StreamingPretrainingDataset
The Pile with 100 deterministic injected sequences.
Streams text from the Pile, inserting gibberish sequences at
predetermined document indices. The injected texts are available
as INJECTED_TEXTS for evaluation.
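The following is a minimal sketch of how deterministic injection into a document stream could work. It is not the actual PileInjected implementation: the helper names (make_injected_texts, make_injection_indices, inject), the seed, the sequence length, and the stride are all assumptions made for illustration. Only the count of 100 sequences comes from the description above.

```python
from __future__ import annotations

import random
import string
from typing import Iterable, Iterator

NUM_INJECTED = 100  # count stated in the docstring; everything else is assumed


def make_injected_texts(n: int = NUM_INJECTED, length: int = 32, seed: int = 0) -> list[str]:
    """Build n gibberish strings from a seeded RNG (hypothetical helper).

    The same seed always yields the same strings, so the evaluation side
    can regenerate them without storing anything.
    """
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + string.digits
    return ["".join(rng.choice(alphabet) for _ in range(length)) for _ in range(n)]


def make_injection_indices(n: int = NUM_INJECTED, stride: int = 1000) -> list[int]:
    """Choose fixed document indices: stride, 2*stride, ... (assumed scheme)."""
    return [(i + 1) * stride for i in range(n)]


def inject(docs: Iterable[str], texts: list[str], indices: list[int]) -> Iterator[str]:
    """Yield docs unchanged, emitting an injected text just before each chosen index."""
    position_to_text = dict(zip(indices, texts))
    for i, doc in enumerate(docs):
        if i in position_to_text:
            yield position_to_text[i]
        yield doc
```

Because both the texts and the positions are derived from fixed parameters, two passes over the same stream see identical injections, which is what makes a later retrieval check meaningful. For example, `list(inject(["a", "b", "c", "d"], texts[:2], [1, 3]))` places the first injected text before "b" and the second before "d".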