pile
Pile(config: str | None = None, split: str = 'train')
Bases: StreamingPretrainingDataset
The Pile (EleutherAI) for general pretraining.
Streams text from EleutherAI/pile, an 825 GiB diverse
open-source language modelling dataset. Reads the auto-converted Parquet
files to bypass the deprecated custom loading scripts.
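
As a rough illustration of the streaming approach described above, the sketch below shows one way such a loader might be written with the Hugging Face `datasets` library. It is not the `Pile` class's actual implementation. The helper names `pile_stream_kwargs` and `iter_pile_text` are hypothetical, and the use of the `refs/convert/parquet` revision is an assumption about how the Parquet auto-conversion is reached; whether that branch is available for `EleutherAI/pile` should be checked before relying on it.

```python
from typing import Any, Dict, Iterator, Optional


def pile_stream_kwargs(config: Optional[str] = None, split: str = "train") -> Dict[str, Any]:
    """Build keyword arguments for datasets.load_dataset (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "path": "EleutherAI/pile",
        "split": split,
        # Stream records lazily instead of downloading the full 825 GiB.
        "streaming": True,
        # Assumption: the Hub's auto-converted Parquet branch, which avoids
        # running the repository's custom loading script.
        "revision": "refs/convert/parquet",
    }
    if config is not None:
        kwargs["name"] = config
    return kwargs


def iter_pile_text(config: Optional[str] = None, split: str = "train") -> Iterator[str]:
    """Yield the text of each record (hypothetical helper; needs network access)."""
    # Imported lazily so the kwargs helper works without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(**pile_stream_kwargs(config, split))
    for record in dataset:
        # Assumption: each record stores its document under a "text" field.
        yield record["text"]
```

Keeping the argument construction separate from the network call makes the configuration easy to inspect or test offline; `iter_pile_text` only contacts the Hub once iteration begins.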