dataset
PretrainingDataset
Bases: StringDataset
A pretraining dataset. It is identical to Dataset[str] except in how it is tokenized: tokenization is applied irrespective of item boundaries.
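To illustrate what tokenizing "irrespective of item boundaries" can mean, here is a minimal sketch. It assumes the common pretraining approach of concatenating the tokens of all items into one stream and cutting that stream into fixed-length sequences. This is not the library's implementation: `toy_tokenize`, `tokenize_per_item`, `tokenize_across_items`, and `seq_len` are hypothetical names used only for this example.

```python
from typing import List


def toy_tokenize(text: str) -> List[str]:
    # Hypothetical stand-in for a real tokenizer: split on whitespace.
    return text.split()


def tokenize_per_item(items: List[str], seq_len: int) -> List[List[str]]:
    # Item-aware tokenization: each item becomes its own sequence,
    # truncated to at most seq_len tokens.
    return [toy_tokenize(item)[:seq_len] for item in items]


def tokenize_across_items(items: List[str], seq_len: int) -> List[List[str]]:
    # Boundary-agnostic tokenization (assumed behavior): all tokens are
    # joined into one stream, then cut into chunks of seq_len tokens,
    # so a chunk may span two or more items.
    stream: List[str] = []
    for item in items:
        stream.extend(toy_tokenize(item))
    return [stream[i:i + seq_len] for i in range(0, len(stream), seq_len)]


items = ["a b c", "d e", "f g h i"]
per_item = tokenize_per_item(items, seq_len=4)
packed = tokenize_across_items(items, seq_len=4)
print(per_item)
print(packed)
```

In the second result, sequences cross item boundaries: the first chunk ends with a token from the second item. This is the property the description above refers to.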
StreamingPretrainingDataset
Bases: StreamingStringDataset
A pretraining dataset. It is identical to Dataset[str] except in how it is tokenized: tokenization is applied irrespective of item boundaries.
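For a streaming source, the same idea can be sketched with a generator that never holds the full dataset in memory. This assumes that "streaming" means items arrive lazily from an iterator, which the description does not state. All names here (`toy_tokenize`, `stream_packed_sequences`, `lazy_items`, `seq_len`) are hypothetical and not part of the documented API.

```python
from typing import Iterable, Iterator, List


def toy_tokenize(text: str) -> List[str]:
    # Hypothetical stand-in for a real tokenizer: split on whitespace.
    return text.split()


def stream_packed_sequences(items: Iterable[str], seq_len: int) -> Iterator[List[str]]:
    # Keep a small token buffer; emit a sequence whenever it holds
    # at least seq_len tokens, regardless of where items begin or end.
    buffer: List[str] = []
    for item in items:
        buffer.extend(toy_tokenize(item))
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    if buffer:
        # Emit the final, possibly shorter, remainder.
        yield buffer


def lazy_items() -> Iterator[str]:
    # Simulates a source that produces items one at a time.
    yield "a b c"
    yield "d e"
    yield "f g h i"


chunks = list(stream_packed_sequences(lazy_items(), seq_len=4))
print(chunks)
```

Whether a real implementation keeps, pads, or drops the final short remainder is a design choice. This sketch keeps it.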