tokenize
TokenizeDatasetConfigBase(name: str = field('data/dataset'), config: str = field('data/config', default=''), suffix: str = field('data/suffix', default=''), val_pct: float = field('data/val_pct', default=0.05), seed: int = field('data/seed', default=2357))
dataclass
Base config for dataset tokenization
__post_init__() -> None
Validate dataset name
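The exact checks performed by `__post_init__` are not documented; a minimal sketch of plausible validation for the documented fields (the class body and error messages here are assumptions, not the library's implementation):

```python
from dataclasses import dataclass


# Hypothetical re-creation of the base config for illustration only;
# field names and defaults mirror the documented signature.
@dataclass
class TokenizeDatasetConfigBase:
    name: str
    config: str = ""
    suffix: str = ""
    val_pct: float = 0.05
    seed: int = 2357

    def __post_init__(self) -> None:
        # A dataset name is required.
        if not self.name:
            raise ValueError("dataset name must be non-empty")
        # The validation holdout must be a proper fraction.
        if not 0.0 <= self.val_pct < 1.0:
            raise ValueError(f"val_pct must be in [0, 1), got {self.val_pct}")
```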
TokenizeDatasetConfig(name: str = field('data/dataset'), config: str = field('data/config', default=''), suffix: str = field('data/suffix', default=''), val_pct: float = field('data/val_pct', default=0.05), seed: int = field('data/seed', default=2357), split: str = field('data/split', default='train'), block_size: int = field('architecture/block_size', default=512), pad_token: int = field('data/pad_token', default=0), num_proc: int = field('system/num_proc', default=8), system_prompt: str = field('data/system_prompt', default=''), assistant_only: bool = field('data/assistant_only', default=False))
dataclass
Bases: TokenizeDatasetConfigBase
Config for tokenizing non-pretraining datasets with fixed block size
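To illustrate what a fixed `block_size` with `pad_token` implies downstream, here is a standalone sketch (not the library's code) of packing a token sequence into equal-size blocks with an accompanying mask of real (1) vs. padded (0) positions, the kind of data a `.bin`/`.bin.mask` pair would hold:

```python
# Illustrative helper, assumed for this sketch: split tokens into
# fixed-size blocks, padding the final partial block with pad_token
# and recording which positions hold real tokens.
def to_blocks(tokens: list[int], block_size: int, pad_token: int = 0):
    blocks, masks = [], []
    for start in range(0, len(tokens), block_size):
        chunk = tokens[start:start + block_size]
        mask = [1] * len(chunk)
        if len(chunk) < block_size:  # pad the last partial block
            pad = block_size - len(chunk)
            chunk = chunk + [pad_token] * pad
            mask = mask + [0] * pad
        blocks.append(chunk)
        masks.append(mask)
    return blocks, masks
```

In this layout the blocks would be written to the `.bin` file and the masks to the matching `.bin.mask` file.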
TokenizePretrainingDatasetConfig(name: str = field('data/dataset'), config: str = field('data/config', default=''), suffix: str = field('data/suffix', default=''), val_pct: float = field('data/val_pct', default=0.05), seed: int = field('data/seed', default=2357), max_samples: int = field('data/max_samples', default=(-1)))
dataclass
Config for tokenizing pretraining datasets with streaming support and variable-length sequences
TokenizeContrastiveDatasetConfig(name: str = field('data/dataset'), config: str = field('data/config', default=''), suffix: str = field('data/suffix', default=''), val_pct: float = field('data/val_pct', default=0.05), seed: int = field('data/seed', default=2357), split: str = field('data/split', default='train'), block_size: int = field('architecture/block_size', default=512), pad_token: int = field('data/pad_token', default=0), num_proc: int = field('system/num_proc', default=8), system_prompt: str = field('data/system_prompt', default=''), assistant_only: bool = field('data/assistant_only', default=False))
dataclass
Config for tokenizing contrastive datasets with fixed block size
TokenizeBlockwiseDatasetJob(spec: ExecutionSpec)
Bases: BasicJob[TokenizeDatasetConfig]
Prepare non-pretraining datasets with fixed block size. Creates .bin and .bin.mask files for train/val splits.
done: bool
property
Check whether dataset preparation has already completed with the same config
TokenizeContrastiveDatasetJob(spec: ExecutionSpec)
Bases: BasicJob[TokenizeContrastiveDatasetConfig]
Prepare contrastive datasets with fixed block size. Creates .pos.bin/.pos.bin.mask and .neg.bin/.neg.bin.mask files for train/val.
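The file layout per split follows from the docstring above; a small hypothetical helper (the function name and prefix argument are assumptions) makes the four output names explicit:

```python
# Illustrative only: derive the pos/neg data and mask filenames the
# contrastive job writes for a given split prefix (e.g. "train", "val").
def contrastive_paths(prefix: str) -> dict[str, str]:
    return {
        "pos": f"{prefix}.pos.bin",
        "pos_mask": f"{prefix}.pos.bin.mask",
        "neg": f"{prefix}.neg.bin",
        "neg_mask": f"{prefix}.neg.bin.mask",
    }
```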
TokenizeVariableDatasetJob(spec: ExecutionSpec)
Bases: BasicJob[TokenizePretrainingDatasetConfig]
Prepare pretraining datasets with streaming support. Creates train.bin and val.bin files with variable-length sequences.
done: bool
property
Check whether dataset preparation has already completed with the same config. For streaming datasets, completion can only be verified when max_samples is set.
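A sketch of why `max_samples` matters for streaming: a stream has no known length, so the only finite, reproducible consumption is "take the first N samples, splitting deterministically by seed." The helper below is an assumption for illustration, not the job's implementation:

```python
import random


# Illustrative: consume a (possibly unbounded) sample stream, stopping
# after max_samples when it is set (>= 0), and route each sample to the
# train or val split with probability val_pct, reproducibly via seed.
def split_stream(stream, max_samples: int = -1,
                 val_pct: float = 0.05, seed: int = 2357):
    rng = random.Random(seed)
    train, val = [], []
    for i, sample in enumerate(stream):
        if max_samples >= 0 and i >= max_samples:
            break  # only a finite max_samples lets "done" be verified
        (val if rng.random() < val_pct else train).append(sample)
    return train, val
```

With `max_samples=-1` the loop runs until the stream is exhausted, so a completed run cannot be distinguished from a partial one by sample count alone.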