tokenize

TokenizeDatasetConfigBase(name: str = field('data/dataset'), config: str = field('data/config', default=''), suffix: str = field('data/suffix', default=''), val_pct: float = field('data/val_pct', default=0.05), seed: int = field('data/seed', default=2357)) dataclass

Base config for dataset tokenization.

__post_init__() -> None

Validate the dataset name.
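The validation hook can be sketched as a plain dataclass; the class name and error message below are illustrative stand-ins, not the library's actual code:

```python
from dataclasses import dataclass


@dataclass
class DatasetConfig:
    """Illustrative stand-in for TokenizeDatasetConfigBase's name check."""
    name: str

    def __post_init__(self) -> None:
        # Fail fast on an empty or whitespace-only dataset name,
        # before any tokenization work is scheduled.
        if not self.name.strip():
            raise ValueError("dataset name must be non-empty")
```

Validating in __post_init__ means a bad config raises at construction time rather than partway through a long tokenization run.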

TokenizeDatasetConfig(name: str = field('data/dataset'), config: str = field('data/config', default=''), suffix: str = field('data/suffix', default=''), val_pct: float = field('data/val_pct', default=0.05), seed: int = field('data/seed', default=2357), split: str = field('data/split', default='train'), block_size: int = field('architecture/block_size', default=512), pad_token: int = field('data/pad_token', default=0), num_proc: int = field('system/num_proc', default=8), system_prompt: str = field('data/system_prompt', default=''), assistant_only: bool = field('data/assistant_only', default=False)) dataclass

Bases: TokenizeDatasetConfigBase

Config for tokenizing non-pretraining datasets with a fixed block size.
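Fixed-block preparation presumably fits each tokenized example to block_size using pad_token and records a mask separating real tokens from padding; a minimal sketch of that shape (the function name is hypothetical):

```python
def to_fixed_block(ids: list[int], block_size: int, pad_token: int = 0):
    """Truncate or right-pad a token sequence to exactly block_size.

    Returns (tokens, mask), where mask is 1 for real tokens and 0 for padding.
    """
    ids = ids[:block_size]           # truncate overlong sequences
    n_pad = block_size - len(ids)    # padding needed to reach block_size
    return ids + [pad_token] * n_pad, [1] * len(ids) + [0] * n_pad
```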

TokenizePretrainingDatasetConfig(name: str = field('data/dataset'), config: str = field('data/config', default=''), suffix: str = field('data/suffix', default=''), val_pct: float = field('data/val_pct', default=0.05), seed: int = field('data/seed', default=2357), max_samples: int = field('data/max_samples', default=(-1))) dataclass

Bases: TokenizeDatasetConfigBase

Config for tokenizing pretraining datasets with streaming.
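With streaming, max_samples (default -1) plausibly caps how much of the stream is consumed; a sketch of that cap, assuming a negative value means "no limit":

```python
from itertools import islice
from typing import Iterable, Iterator


def take_samples(stream: Iterable[dict], max_samples: int) -> Iterator[dict]:
    """Yield at most max_samples records from a streamed dataset.

    A negative max_samples (the default, -1) consumes the whole stream.
    """
    if max_samples < 0:
        yield from stream
    else:
        yield from islice(stream, max_samples)
```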

TokenizeContrastiveDatasetConfig(name: str = field('data/dataset'), config: str = field('data/config', default=''), suffix: str = field('data/suffix', default=''), val_pct: float = field('data/val_pct', default=0.05), seed: int = field('data/seed', default=2357), split: str = field('data/split', default='train'), block_size: int = field('architecture/block_size', default=512), pad_token: int = field('data/pad_token', default=0), num_proc: int = field('system/num_proc', default=8), system_prompt: str = field('data/system_prompt', default=''), assistant_only: bool = field('data/assistant_only', default=False)) dataclass

Bases: TokenizeDatasetConfigBase

Config for contrastive dataset tokenization with fixed block size.
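Since the contrastive job writes separate positive and negative files, each example presumably carries a (positive, negative) pair that is fit to the same block size independently; a sketch (the function name is illustrative):

```python
def tokenize_pair(pos_ids: list[int], neg_ids: list[int],
                  block_size: int, pad_token: int = 0):
    """Fit a (positive, negative) token pair to a shared fixed block size.

    Each side is truncated or right-padded independently; the masks mark
    which positions hold real tokens.
    """
    def fit(ids: list[int]):
        ids = ids[:block_size]
        n_pad = block_size - len(ids)
        return ids + [pad_token] * n_pad, [1] * len(ids) + [0] * n_pad

    return fit(pos_ids), fit(neg_ids)
```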

TokenizeBlockwiseDatasetJob(spec: ExecutionSpec)

Bases: BasicJob[TokenizeDatasetConfig]

Prepare non-pretraining datasets with a fixed block size. Creates .bin and .bin.mask files for the train/val splits.

done: bool property

Check if dataset preparation is complete with the same config.
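The .bin/.bin.mask pair can be written as flat binary arrays; a stdlib-only sketch, where the helper name and the dtype choices (uint16 token ids, uint8 mask bits) are assumptions about the on-disk format:

```python
from array import array
from pathlib import Path


def write_split(prefix: str, tokens: list[int], mask: list[int]) -> None:
    """Write token ids to <prefix>.bin and the mask to <prefix>.bin.mask.

    Token ids go out as unsigned 16-bit ints and mask bits as unsigned
    bytes; both dtypes are assumptions, not the library's documented format.
    """
    Path(prefix + ".bin").write_bytes(array("H", tokens).tobytes())
    Path(prefix + ".bin.mask").write_bytes(array("B", mask).tobytes())
```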

TokenizeContrastiveDatasetJob(spec: ExecutionSpec)

Bases: BasicJob[TokenizeContrastiveDatasetConfig]

Prepare contrastive datasets with a fixed block size. Creates .pos.bin/.pos.bin.mask and .neg.bin/.neg.bin.mask files for the train/val splits.

TokenizeVariableDatasetJob(spec: ExecutionSpec)

Bases: BasicJob[TokenizePretrainingDatasetConfig]

Prepare pretraining datasets with streaming support. Creates train.bin and val.bin files with variable-length sequences.

done: bool property

Check if dataset preparation is complete with the same config. For streaming datasets, completion can only be verified when max_samples is set.
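That caveat can be sketched directly: with an unbounded stream the expected output size is unknown up front, so the check can only confirm completion when max_samples is set. The class, field names, and output paths below are illustrative:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class StreamingDoneCheck:
    """Illustrative completion check for a streaming tokenization job."""
    out_dir: Path
    max_samples: int

    @property
    def done(self) -> bool:
        # With an unbounded stream (max_samples == -1) there is no way to
        # know after the fact whether tokenization ran to completion.
        if self.max_samples < 0:
            return False
        # With a cap, the presence of both output files signals completion.
        return all((self.out_dir / f).exists()
                   for f in ("train.bin", "val.bin"))
```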