pile

Pile(config: str | None = None, split: str = 'train')

Bases: StreamingPretrainingDataset

The Pile (EleutherAI) for general pretraining.

Streams text from EleutherAI/pile, an 825 GiB diverse, open-source language modelling dataset. Reads the auto-converted Parquet files rather than the dataset's deprecated custom loading script.
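
The streaming behaviour described above can be sketched roughly as follows. This is not the class's actual implementation: `pile_load_kwargs` and `stream_pile_text` are hypothetical helper names, and the sketch assumes the Hugging Face `datasets` library, the `refs/convert/parquet` revision as the source of the auto-converted Parquet files, and a `text` field on each record. It also assumes the repository and that revision are reachable from your environment.

```python
from __future__ import annotations

from typing import Any, Iterator


def pile_load_kwargs(config: str | None = None, split: str = "train") -> dict[str, Any]:
    """Build keyword arguments for datasets.load_dataset (hypothetical helper)."""
    kwargs: dict[str, Any] = {
        "path": "EleutherAI/pile",
        "split": split,
        # Stream records lazily instead of downloading the full dataset.
        "streaming": True,
        # Assumption: read the auto-converted Parquet branch so that the
        # deprecated custom loading script is never executed.
        "revision": "refs/convert/parquet",
    }
    if config is not None:
        # The dataset configuration (subset) name, if one was requested.
        kwargs["name"] = config
    return kwargs


def stream_pile_text(config: str | None = None, split: str = "train") -> Iterator[str]:
    """Yield raw text documents one at a time (hypothetical helper)."""
    # Imported lazily so the kwargs helper works without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(**pile_load_kwargs(config, split))
    for record in dataset:
        # Assumption: each record stores its document under the "text" key.
        yield record["text"]
```

Because `stream_pile_text` is a generator, nothing is fetched until it is iterated; something like `next(stream_pile_text())` would request only the first document.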