Perplexity evaluations for datasets that provide validation splits. These complement the RolloutEvaluation counterparts by measuring how well the model predicts validation-set tokens. Scores are reported as 1/perplexity, so higher is better, which is especially useful for tracking forgetting in continual learning experiments. A usage sketch follows the class listing below.
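As a sketch of the reported metric (not necessarily this library's exact implementation), the score is the reciprocal of perplexity, i.e. the exponential of the negated mean token negative log-likelihood:

```python
import math

def inverse_perplexity(token_nlls: list[float]) -> float:
    """Reciprocal of perplexity over a set of per-token negative
    log-likelihoods (natural log). Returns a value in (0, 1];
    1.0 would mean every token was predicted with probability 1."""
    mean_nll = sum(token_nlls) / len(token_nlls)
    return math.exp(-mean_nll)  # == 1 / exp(mean_nll) == 1 / perplexity

# Example: three tokens with NLLs of 1.0, 2.0, and 3.0 nats.
# Mean NLL = 2.0, perplexity = e^2 ~= 7.39, score ~= 0.135.
print(inverse_perplexity([1.0, 2.0, 3.0]))
```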
MNLIPerplexityEval(num_samples: int = 500)
Bases: PerplexityEvaluation
Perplexity on the MNLI validation_matched split.
QQPPerplexityEval(num_samples: int = 500)
Bases: PerplexityEvaluation
Perplexity on the QQP validation split.
SST2PerplexityEval(num_samples: int = 500)
Bases: PerplexityEvaluation
Perplexity on the SST-2 validation split.
SIQAPerplexityEval(num_samples: int = 500)
Bases: PerplexityEvaluation
Perplexity on the Social IQa validation split.
WinograndePerplexityEval(num_samples: int = 500)
Bases: PerplexityEvaluation
Perplexity on the Winogrande validation split.
FineWebPerplexityEval(num_samples: int = 500)
Bases: PerplexityEvaluation
Perplexity on a sample from FineWeb.
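A minimal usage sketch: only the constructors and their num_samples parameter are documented above, so the import path here is a placeholder and how the evaluations are executed will depend on the surrounding harness.

```python
# Hypothetical import path; adjust to wherever these classes live.
from evals.perplexity import (
    SST2PerplexityEval,
    WinograndePerplexityEval,
    FineWebPerplexityEval,
)

# Track forgetting: run the same held-out evaluations before and
# after each continual-learning phase and compare the 1/perplexity
# scores (a drop indicates forgetting on that dataset).
evals = [
    SST2PerplexityEval(num_samples=500),        # SST-2 validation split
    WinograndePerplexityEval(num_samples=500),  # Winogrande validation split
    FineWebPerplexityEval(num_samples=500),     # web-text sample from FineWeb
]
```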