ccaligned

CCAligned(config: str | None = 'fr_XX', split: str = 'train')

Bases: StreamingPretrainingDataset

Multilingual text from CCAligned.

Streams target-language sentences directly from statmt.org, decompressing on the fly rather than downloading the whole file first. The config parameter selects the target language code (e.g. "fr_XX", "de_DE", "zh_CN"). Each yielded string is a single target-language sentence, suitable for monolingual pretraining in that language.
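As a rough illustration of what on-the-fly decompression can look like, the sketch below reads an xz-compressed TSV from any binary file-like object and yields one column at a time. It is not the class's actual implementation: the helper name iter_column is hypothetical, and which column holds the target-language sentence is an assumption left to the caller.

```python
import io
import lzma
from typing import BinaryIO, Iterator


def iter_column(compressed: BinaryIO, column: int) -> Iterator[str]:
    """Yield one TSV column from an xz-compressed binary stream.

    Hypothetical helper: `column` is supplied by the caller because the
    actual column layout of the CCAligned files is not stated here.
    """
    # lzma.open accepts an already-open binary file object and decompresses
    # incrementally as lines are requested, so the full file is never held
    # in memory at once.
    with lzma.open(compressed, mode="rt", encoding="utf-8", newline="\n") as text:
        for line in text:
            fields = line.rstrip("\r\n").split("\t")
            # Skip rows that are too short or have an empty cell.
            if column < len(fields) and fields[column]:
                yield fields[column]


# Small in-memory demonstration with made-up rows.
sample = "hello\tbonjour\nworld\tmonde\n".encode("utf-8")
sentences = list(iter_column(io.BytesIO(lzma.compress(sample)), column=1))
```

Any binary object with a read() method should work in place of the BytesIO buffer, which is what makes this pattern a plausible fit for reading from an open HTTP response.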

Some language pairs are stored as en_XX-{lang}.tsv.xz and others as {lang}-en_XX.tsv.xz; both orderings are tried automatically.
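The two-ordering fallback can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: candidate_filenames and first_available are hypothetical names, and the exists callback stands in for whatever availability check (for example an HTTP request) the real class performs.

```python
from typing import Callable, Iterable, Optional

PIVOT = "en_XX"  # the English side of every pair, as in the file names above


def candidate_filenames(lang: str) -> list[str]:
    """Return both possible file names for a target language code."""
    return [f"{PIVOT}-{lang}.tsv.xz", f"{lang}-{PIVOT}.tsv.xz"]


def first_available(
    names: Iterable[str], exists: Callable[[str], bool]
) -> Optional[str]:
    """Return the first name for which `exists` is true, else None."""
    for name in names:
        if exists(name):
            return name
    return None


# Demonstration against a pretend listing; the set below is made up.
pretend_listing = {"de_DE-en_XX.tsv.xz"}
chosen = first_available(candidate_filenames("de_DE"), pretend_listing.__contains__)
```

Keeping the availability check as a callback lets the ordering logic be exercised without network access.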