pile_detoxify

PileDetoxify(config: str | None = None, split: str = 'train')

Bases: StreamingPretrainingDataset

Filtered Pile with toxicity scores (Korbak et al.).

Streams text from tomekkorbak/pile-detoxify, which annotates Pile documents with per-sentence toxicity scores from Detoxify. Each yielded string is the full document text (sentences joined).
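
The joining step described above can be sketched roughly as follows. This is an illustration only, not the class's actual implementation: the helper names `iter_documents` and `stream_pile_detoxify`, the `"texts"` field name, and the empty-string separator are all assumptions made for the example. The offline sample uses hand-made records so the sketch runs without network access.

```python
from typing import Any, Iterable, Iterator


def iter_documents(records: Iterable[dict[str, Any]], sep: str = "") -> Iterator[str]:
    """Yield one string per record by joining its sentences (hypothetical helper)."""
    for record in records:
        # ASSUMPTION: each record stores its sentences as a list under "texts".
        yield sep.join(record["texts"])


def stream_pile_detoxify(split: str = "train") -> Iterator[str]:
    """Stream joined documents from the Hub (hypothetical helper, not called below)."""
    # Imported lazily: needs the Hugging Face `datasets` package and network access.
    from datasets import load_dataset

    dataset = load_dataset("tomekkorbak/pile-detoxify", split=split, streaming=True)
    return iter_documents(dataset)


# Offline demonstration with hand-made records in the assumed layout.
sample = [
    {"texts": ["First sentence. ", "Second sentence."], "scores": [0.01, 0.02]},
    {"texts": ["A single-sentence document."], "scores": [0.03]},
]
docs = list(iter_documents(sample))
```

Making the separator a parameter keeps the sketch usable whether or not the stored sentences retain their trailing whitespace, which is not stated in the description above.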