ccaligned
CCAligned(config: str | None = 'fr_XX', split: str = 'train')
Bases: StreamingPretrainingDataset
Multilingual text from CCAligned.
Streams target-language sentences directly from statmt.org,
decompressing on the fly without downloading the whole file first.
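As a rough illustration of what on-the-fly decompression can look like, the sketch below reads an xz-compressed byte stream in chunks and yields text lines as soon as they are complete. It is not the class's actual implementation: `iter_xz_lines` and `chunk_size` are hypothetical names, and the code assumes a single xz stream containing UTF-8 text.

```python
import io
import lzma
from typing import BinaryIO, Iterator


def iter_xz_lines(raw: BinaryIO, chunk_size: int = 1 << 16) -> Iterator[str]:
    """Yield text lines from an xz-compressed byte stream, one chunk at a time.

    Hypothetical helper: assumes a single xz stream of UTF-8 text.
    """
    decompressor = lzma.LZMADecompressor()
    pending = b""  # bytes of the last, still incomplete line
    while not decompressor.eof:
        chunk = raw.read(chunk_size)
        if not chunk:
            break
        pending += decompressor.decompress(chunk)
        # Everything before the final newline is complete; keep the remainder.
        *complete, pending = pending.split(b"\n")
        for line in complete:
            yield line.decode("utf-8")
    if pending:
        # Final line without a trailing newline.
        yield pending.decode("utf-8")


# Demonstration on in-memory data instead of a network response.
sample = lzma.compress("bonjour\nle monde\n".encode("utf-8"))
for sentence in iter_xz_lines(io.BytesIO(sample), chunk_size=8):
    print(sentence)
```

Any object with a `read(n)` method can be passed as `raw`, for example the response returned by `urllib.request.urlopen`, so only one chunk of compressed data needs to be held in memory at a time.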
The config parameter selects the target language code
(e.g. "fr_XX", "de_DE", "zh_CN"). Each yielded string
is a single target-language sentence, suitable for monolingual
pretraining in that language.
Some language pairs are stored as en_XX-{lang}.tsv.xz and others
as {lang}-en_XX.tsv.xz; both orderings are tried automatically.
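The fallback between the two file-name orderings could be sketched as follows. This is an assumption about the general approach, not the library's code: `candidate_filenames` and `first_available` are hypothetical helpers, and the `exists` callback stands in for whatever availability check (such as an HTTP request) the real class performs.

```python
from typing import Callable, Optional


def candidate_filenames(lang: str, pivot: str = "en_XX") -> list[str]:
    """Return both possible file names for a language pair, pivot-first order first."""
    return [f"{pivot}-{lang}.tsv.xz", f"{lang}-{pivot}.tsv.xz"]


def first_available(lang: str, exists: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate name for which `exists` is true, else None.

    `exists` is a caller-supplied check (hypothetical), e.g. a HEAD request.
    """
    for name in candidate_filenames(lang):
        if exists(name):
            return name
    return None


# Example with a fake availability check: only the second ordering "exists".
available = {"fr_XX-en_XX.tsv.xz"}
print(first_available("fr_XX", lambda name: name in available))
```

Keeping the availability check as a parameter makes the ordering logic testable without network access.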