dataset

PretrainingDataset

Bases: StringDataset

Pretraining dataset. Identical to Dataset[str] except in how it is tokenized: tokenization is performed irrespective of item boundaries, so a tokenized sequence may span more than one item.
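
To illustrate what tokenizing "irrespective of item boundaries" can look like, here is a minimal sketch. It is not this library's implementation. The names `toy_tokenize`, `pack_tokens`, `seq_len`, and `sep_id` are hypothetical, and the tokenizer is a stand-in that maps each word to its length. The sketch concatenates the tokens of all items into one stream and cuts that stream into fixed-length sequences.

```python
from typing import Callable, Iterable, List


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one id per word (the word's length)."""
    return [len(word) for word in text.split()]


def pack_tokens(
    items: Iterable[str],
    tokenize: Callable[[str], List[int]],
    seq_len: int,
    sep_id: int = 0,
) -> List[List[int]]:
    """Concatenate all items into one token stream, then cut fixed-size chunks.

    Assumption: a separator id is appended after each item, and a trailing
    partial chunk is dropped. A real implementation may differ on both points.
    """
    stream: List[int] = []
    for item in items:
        stream.extend(tokenize(item))
        stream.append(sep_id)  # mark where the item ended
    # Chunk boundaries depend only on position in the stream, not on items.
    return [
        stream[i : i + seq_len]
        for i in range(0, len(stream) - seq_len + 1, seq_len)
    ]


items = ["a bb ccc", "dddd eeeee"]
chunks = pack_tokens(items, toy_tokenize, seq_len=3)
print(chunks)  # [[1, 2, 3], [0, 4, 5]]
```

The second chunk starts with the separator that ends the first item and continues into the second item, which is the behavior that per-item tokenization would not produce.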

StreamingPretrainingDataset

Bases: StreamingStringDataset

Streaming pretraining dataset. Identical to Dataset[str] except in how it is tokenized: tokenization is performed irrespective of item boundaries, so a tokenized sequence may span more than one item.
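
For a streaming source, the same idea can be sketched with a generator that keeps a small token buffer instead of materializing the whole stream. This is an illustration only, under the same assumptions as any toy example: `stream_packed_chunks`, `lazy_items`, `seq_len`, and `sep_id` are hypothetical names, and the word-length "tokenizer" is a placeholder rather than anything this library provides.

```python
from typing import Iterable, Iterator, List


def stream_packed_chunks(
    items: Iterable[str], seq_len: int, sep_id: int = 0
) -> Iterator[List[int]]:
    """Lazily yield fixed-length token chunks that may cross item boundaries."""
    buffer: List[int] = []
    for item in items:
        # Placeholder tokenization: one id per word (the word's length).
        buffer.extend(len(word) for word in item.split())
        buffer.append(sep_id)  # hypothetical end-of-item separator
        # Emit every full chunk currently available; keep the remainder.
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Any leftover partial chunk is discarded in this sketch.


def lazy_items() -> Iterator[str]:
    """Stand-in for a streamed source that yields items one at a time."""
    yield "a bb ccc"
    yield "dddd eeeee"


streamed = list(stream_packed_chunks(lazy_items(), seq_len=3))
print(streamed)  # [[1, 2, 3], [0, 4, 5]]
```

Only the buffered remainder is held in memory between items, so the source never needs to be fully loaded.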