pile_injected


Pile with deterministic injected sequences for memorization evaluation.

Following Huang et al. (2024), this module injects verifiably unknown token sequences into the Pile at fixed positions, so that retrieval of those sequences can be measured after training.
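As a rough illustration of what "measuring retrieval" could look like, the sketch below prompts a model with the start of each injected sequence and checks whether it reproduces the rest. This is an assumption about the evaluation, not the protocol from Huang et al. (2024): the function `retrieval_rate`, the `complete` callable, and the `prefix_chars` parameter are all hypothetical names introduced here.

```python
from typing import Callable, Sequence


def retrieval_rate(
    injected_texts: Sequence[str],
    complete: Callable[[str], str],
    prefix_chars: int = 16,
) -> float:
    """Fraction of injected texts whose suffix is reproduced from its prefix.

    `complete` is a hypothetical stand-in for a trained model: it takes a
    prompt string and returns the generated continuation.
    """
    if not injected_texts:
        return 0.0
    hits = 0
    for text in injected_texts:
        # Split each injected sequence into a prompt and the expected rest.
        prefix, target = text[:prefix_chars], text[prefix_chars:]
        # Count a hit only on an exact reproduction of the remaining text.
        if complete(prefix).startswith(target):
            hits += 1
    return hits / len(injected_texts)


# Toy "model" that has memorized one of two sequences.
memory = {"abcd": "efgh"}
rate = retrieval_rate(["abcdefgh", "ijklmnop"], lambda p: memory.get(p, ""), prefix_chars=4)
print(rate)
```

Because the injected sequences are gibberish that cannot appear elsewhere in the corpus, a nonzero rate in a check like this can be attributed to the injected copies rather than to prior knowledge.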

PileInjected(config: str | None = None, split: str = 'train')

Bases: StreamingPretrainingDataset

The Pile with 100 deterministic injected sequences.

Streams text from the Pile, inserting gibberish sequences at predetermined document indices. The injected texts are available as INJECTED_TEXTS for evaluation.
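The injection step described above can be sketched as a wrapper around a document stream. This is a minimal illustration, not the actual `PileInjected` implementation: `make_gibberish`, `inject_stream`, and the choice to emit each injected text as its own document right after the chosen index are assumptions made for the example.

```python
import random
import string
from typing import Dict, Iterable, Iterator


def make_gibberish(seed: int, length: int = 32) -> str:
    """Build a reproducible random lowercase string from a fixed seed."""
    rng = random.Random(seed)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def inject_stream(docs: Iterable[str], injections: Dict[int, str]) -> Iterator[str]:
    """Yield documents in order, emitting an injected text after chosen indices.

    `injections` maps a document index to the text inserted after it
    (an assumed placement; the real dataset may splice texts differently).
    """
    for idx, doc in enumerate(docs):
        yield doc
        if idx in injections:
            yield injections[idx]


# Toy stream of six documents with two injected sequences.
docs = [f"doc {i}" for i in range(6)]
injections = {1: make_gibberish(seed=0), 4: make_gibberish(seed=1)}
for item in inject_stream(docs, injections):
    print(item)
```

Seeding the generator and keying injections by document index keeps the stream identical across runs, which is what lets the same texts be reused later for evaluation.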