pg19

PG19(split: str = 'train', config: str | None = None)

Bases: StreamingPretrainingDataset

Project Gutenberg books (sedthh/gutenberg_english).

48k+ English books from Project Gutenberg with metadata removed, suitable for long-context pretraining.
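As a rough illustration of what streaming this corpus can look like, the sketch below reads records from `sedthh/gutenberg_english` through the Hugging Face `datasets` library. It is not the `PG19` implementation: `iter_books`, `limit`, and `loader` are hypothetical names introduced here, and the use of `load_dataset(..., streaming=True)` is an assumption about how such a class might fetch its data.

```python
from itertools import islice
from typing import Any, Callable, Iterator, Optional

DATASET_ID = "sedthh/gutenberg_english"  # dataset named in the description above


def iter_books(
    split: str = "train",
    limit: Optional[int] = None,
    loader: Optional[Callable[..., Any]] = None,
) -> Iterator[dict]:
    """Yield raw records from the dataset without downloading it in full.

    Hypothetical helper for illustration; not part of the PG19 class.
    """
    if loader is None:
        # Imported lazily so this module loads even if `datasets` is missing.
        from datasets import load_dataset

        loader = load_dataset
    # streaming=True returns an iterable dataset that fetches records on demand.
    stream = loader(DATASET_ID, split=split, streaming=True)
    # islice with limit=None passes every record through unchanged.
    yield from islice(stream, limit)
```

With network access and `datasets` installed, `list(iter_books(limit=2))` would fetch only the first two records. The `loader` argument exists so the function can be exercised offline with a stand-in callable.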