pg19
PG19(split: str = 'train', config: str | None = None)
Bases: StreamingPretrainingDataset
Project Gutenberg books (Hugging Face dataset sedthh/gutenberg_english).
48k+ English books from Project Gutenberg with metadata removed, suitable for long-context pretraining.
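
As a rough illustration of the underlying data source, the sketch below streams the corpus directly rather than through this class. It is not how PG19 itself is implemented. It assumes that sedthh/gutenberg_english is the Hugging Face Hub dataset of that name, that the `datasets` package is installed, and that each record keeps its book text in a `TEXT` field. The field name and both helper functions (`iter_book_texts`, `stream_gutenberg`) are hypothetical and chosen for illustration only.

```python
from typing import Iterable, Iterator, Mapping


def iter_book_texts(
    records: Iterable[Mapping[str, object]],
    text_field: str = "TEXT",  # assumed column name; adjust if the schema differs
) -> Iterator[str]:
    """Yield the text of each record, skipping missing or blank entries."""
    for record in records:
        text = record.get(text_field)
        if isinstance(text, str) and text.strip():
            yield text


def stream_gutenberg(split: str = "train"):
    """Open the Hub dataset lazily, without downloading it in full.

    Needs the `datasets` package and network access. The import is local
    so that iter_book_texts stays usable without that dependency.
    """
    from datasets import load_dataset

    return load_dataset("sedthh/gutenberg_english", split=split, streaming=True)
```

Under those assumptions, `for text in iter_book_texts(stream_gutenberg()): ...` would visit one full book per iteration, which is the property that makes the corpus useful for long-context pretraining.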