evaluation

Evaluation

Bases: ABC

Abstract base class for all evaluations.

name: str abstractmethod property

Name of this evaluation.

prefix() -> str

Prefix for metrics from this evaluation.

__len__() -> int abstractmethod

Number of samples in this evaluation.

__call__(inference: InferenceJob[Any, M], encoding: Any, reduce: str = 'mean', return_intermediates: bool = False, **kwargs: Any) -> Any abstractmethod

Run the evaluation and return a score (and optionally intermediates).

When return_intermediates=True, also returns a list of (x, padding_mask) numpy arrays — one per sample — available on every host so that an RL trainer can use them as a training batch.

score(*args: Any) -> List[float]

Return one float per evaluation sample. Subclasses override.

find_accumulation_steps(dataset_size: int, max_batch_size: int, dp_replicate: int) -> Tuple[int, int] | Tuple[None, None] staticmethod

Find batch size and accumulation steps that evenly divide dataset.

Parameters:

Name            Type  Default   Description
dataset_size    int   required  Total number of samples
max_batch_size  int   required  Maximum per-device batch size
dp_replicate    int   required  Data parallel replication factor

Returns:

Type                                 Description
Tuple[int, int] | Tuple[None, None]  (batch_size, accumulation_steps) or (None, None) if no valid size found
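
The search described above might look roughly like the following sketch. The library's actual algorithm is not shown on this page, so the function name `find_accumulation_steps_sketch` and the rule used here (largest per-device batch size whose global batch, batch_size * dp_replicate, divides the dataset evenly) are assumptions made for illustration.

```python
from typing import Optional, Tuple


def find_accumulation_steps_sketch(
    dataset_size: int, max_batch_size: int, dp_replicate: int
) -> Tuple[Optional[int], Optional[int]]:
    """Hypothetical stand-in, not the library implementation."""
    # Try the largest per-device batch size first and walk downwards.
    for batch_size in range(max_batch_size, 0, -1):
        # Assumption: one step consumes batch_size samples on each replica.
        global_batch = batch_size * dp_replicate
        if dataset_size % global_batch == 0:
            return batch_size, dataset_size // global_batch
    # No batch size divides the dataset evenly.
    return None, None


print(find_accumulation_steps_sketch(1000, 64, 4))  # (50, 5)
print(find_accumulation_steps_sketch(7, 4, 2))      # (None, None)
```

Callers should handle the (None, None) case explicitly, for example by padding or truncating the dataset.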

RolloutEvaluation

Bases: Evaluation

Evaluation using autoregressive generation.

score(ys: list[str], y_hats: list[str]) -> List[float]

Per-sample scores. Default: cast each check() to float.

check(y: str, y_hat: str) -> bool

Check if y_hat matches y.

Parameters:

Name   Type  Default   Description
y      str   required  Ground truth
y_hat  str   required  Generated result

Returns:

Type  Description
bool  Whether y_hat matches y

clean(y_hat: str) -> str abstractmethod

Clean generated result before checking.

Parameters:

Name   Type  Default   Description
y_hat  str   required  Generated result, which can include the prompt

Returns:

Type  Description
str   Cleaned/normalized result available for comparison
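
To make the relationship between clean(), check(), and score() concrete, here is a minimal stand-alone sketch. `ExactMatchSketch` is a hypothetical class that only mirrors the documented method names; it does not subclass the real RolloutEvaluation, and the normalization rule (first line, trimmed, lowercased) and the assumption that the caller applies clean() before check() are illustrative choices.

```python
from typing import List


class ExactMatchSketch:
    """Hypothetical stand-in mirroring the documented clean/check/score hooks."""

    def clean(self, y_hat: str) -> str:
        # Assumed normalization: keep the first line, trim, lowercase.
        stripped = y_hat.strip()
        return stripped.splitlines()[0].strip().lower() if stripped else ""

    def check(self, y: str, y_hat: str) -> bool:
        # y_hat is assumed to be already cleaned by the caller.
        return y.strip().lower() == y_hat

    def score(self, ys: List[str], y_hats: List[str]) -> List[float]:
        # Documented default: cast each check() result to float.
        return [float(self.check(y, y_hat)) for y, y_hat in zip(ys, y_hats)]


ev = ExactMatchSketch()
cleaned = [ev.clean(s) for s in ["  Paris\nextra text", "Berlin"]]
scores = ev.score(["paris", "Madrid"], cleaned)
print(scores)  # [1.0, 0.0]
```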

get(indx: int) -> Tuple[str, str] abstractmethod

Get sample at index.

Returns:

Type             Description
Tuple[str, str]  (input_string, expected_output_string)

max_new_tokens(inference: InferenceJob[Any, M]) -> int

Maximum tokens to generate. Subclasses MUST override.

Drives the prompt/generation split (prompt_max = block_size - max_new_tokens) so the JIT shapes are constant across refills — defaulting to block_size would leave zero room for prompts.
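
The split can be checked with simple arithmetic. The helper below is a hypothetical illustration of the formula prompt_max = block_size - max_new_tokens; `prompt_budget` is not part of the library.

```python
def prompt_budget(block_size: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt after reserving room for generation."""
    if not 0 < max_new_tokens < block_size:
        # max_new_tokens == block_size would leave zero room for the prompt.
        raise ValueError("max_new_tokens must be in (0, block_size)")
    return block_size - max_new_tokens


print(prompt_budget(1024, 128))  # 896
```

Because both numbers are fixed per evaluation, the prompt and generation shapes stay constant from one batch to the next.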

__call__(inference: InferenceJob[Any, M], encoding: Any, reduce: str = 'mean', return_intermediates: bool = False, temperature: float = 0.0, top_p: float = 1.0, chunk_size: int = 200, samples_per_prompt: int = 1, **kwargs: Any) -> Any

Run evaluation.

Parameters:

Name                  Type                  Default   Description
inference             InferenceJob[Any, M]  required  InferenceJob instance for running inference
encoding              Any                   required  Tokenizer with encode_batch/decode_batch methods
reduce                str                   'mean'    How to reduce per-sample scores ("mean" | "sum" | "none")
return_intermediates  bool                  False     Also return per-sample (rollout, mask) numpy arrays on every host (for RL consumers)
temperature           float                 0.0       Sampling temperature (0.0 for greedy)
top_p                 float                 1.0       Nucleus sampling threshold
chunk_size            int                   200       Number of batches per JIT chunk (default 200)

Returns:

Type  Description
Any   Evaluation score, or (score, intermediates) when return_intermediates.
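
For readers unfamiliar with the temperature and top_p parameters, the sketch below shows the standard technique for a single decoding step in plain Python. It is not the library's sampler, which this page does not show; `sample_token` and its arguments are hypothetical names.

```python
import math
import random
from typing import List, Optional


def sample_token(
    logits: List[float],
    temperature: float = 0.0,
    top_p: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick one token id from raw logits (illustrative only)."""
    if temperature == 0.0:
        # Greedy decoding: take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus: keep the smallest high-probability set whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


print(sample_token([0.1, 2.0, 0.5]))  # 1 (greedy)
```

With the defaults (temperature=0.0, top_p=1.0) the result is deterministic, which is usually what an accuracy-style evaluation wants.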

EncodingEvaluation

Bases: Evaluation

Evaluation using next-token prediction accuracy.

score(xs: list[str], y_hats: list[str]) -> List[float]

Per-sample scores. Default: cast each check() to float.

check(x: str, y_hat: str) -> bool

Check if prediction is correct given input.

Parameters:

Name   Type  Default   Description
x      str   required  Input string
y_hat  str   required  Model prediction (cleaned, decoded argmax)

Returns:

Type  Description
bool  Whether prediction is correct

clean(y_hat: str) -> str abstractmethod

Clean model prediction before checking.

Parameters:

Name   Type  Default   Description
y_hat  str   required  Raw decoded model prediction

Returns:

Type  Description
str   Cleaned/normalized result available for comparison

get(indx: int) -> str abstractmethod

Get input string at index.

__call__(inference: InferenceJob[Any, M], encoding: Any, reduce: str = 'mean', return_intermediates: bool = False, chunk_size: int = 200, **kwargs: Any) -> Any

Run evaluation.

PerplexityEvaluation

Bases: Evaluation

Evaluation that computes dataset perplexity and returns ppl (lower is better).

Runs a blockwise forward pass like EncodingEvaluation, computes the mean negative log-likelihood over all non-padding tokens, and returns perplexity.

get(indx: int) -> str abstractmethod

Get input string at index.

score(per_sample_nll: np.ndarray, per_sample_count: np.ndarray) -> List[float]

Per-sample perplexity (= exp(nll / max(count, 1))).
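
A minimal sketch of the per-sample formula, together with the dataset-level perplexity described in the class summary. It uses plain Python sequences instead of np.ndarray to stay dependency-free, and it assumes that per_sample_nll holds the summed (not averaged) negative log-likelihood of each sample. Both function names are hypothetical.

```python
import math
from typing import List, Sequence


def per_sample_perplexity(
    per_sample_nll: Sequence[float], per_sample_count: Sequence[int]
) -> List[float]:
    # max(count, 1) guards against division by zero for all-padding samples.
    return [
        math.exp(nll / max(count, 1))
        for nll, count in zip(per_sample_nll, per_sample_count)
    ]


def dataset_perplexity(
    per_sample_nll: Sequence[float], per_sample_count: Sequence[int]
) -> float:
    # Mean NLL over all non-padding tokens, then exponentiate.
    total_tokens = max(sum(per_sample_count), 1)
    return math.exp(sum(per_sample_nll) / total_tokens)


nll = [3 * math.log(2), math.log(8)]  # summed NLL per sample
count = [3, 1]                        # non-padding tokens per sample
print(per_sample_perplexity(nll, count))
print(dataset_perplexity(nll, count))
```

Note that the dataset-level value weights every token equally, so it generally differs from the mean of the per-sample values.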

__call__(inference: InferenceJob[Any, M], encoding: Any, reduce: str = 'mean', return_intermediates: bool = False, chunk_size: int = 200, **kwargs: Any) -> Any

Run evaluation.

PerplexityComparisonEvaluation

Bases: Evaluation

Evaluation using perplexity comparison for multiple-choice tasks.

get(indx: int) -> Tuple[str, list[str], int] abstractmethod

Get sample at index.

Returns:

Type                        Description
Tuple[str, list[str], int]  (prefix, list_of_continuations, correct_index)

score(correct_flags: List[float]) -> List[float]

Per-sample correctness (1.0 / 0.0).
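
One plausible reading of perplexity comparison is sketched below: compute a perplexity for each continuation and mark the sample correct when the lowest-perplexity continuation is the one at correct_index. Whether the library normalizes by token count, and how it breaks ties, is not stated on this page; the length normalization and the helper names here are assumptions.

```python
import math
from typing import Sequence


def pick_continuation(nlls: Sequence[float], counts: Sequence[int]) -> int:
    """Index of the continuation with the lowest per-token perplexity."""
    ppl = [math.exp(nll / max(count, 1)) for nll, count in zip(nlls, counts)]
    # min() returns the first index on ties.
    return min(range(len(ppl)), key=lambda i: ppl[i])


def correct_flag(
    nlls: Sequence[float], counts: Sequence[int], correct_index: int
) -> float:
    """1.0 if the preferred continuation is the labeled one, else 0.0."""
    return float(pick_continuation(nlls, counts) == correct_index)


# Three continuations with summed NLLs and token counts (made-up numbers).
print(correct_flag([6.0, 2.0, 9.0], [2, 2, 3], correct_index=1))  # 1.0
```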

__call__(inference: InferenceJob[Any, M], encoding: Any, reduce: str = 'mean', return_intermediates: bool = False, chunk_size: int = 200, **kwargs: Any) -> Any

Run evaluation.

Evaluator(spec: ExecutionSpec)

Bases: InferenceJob[EvaluatorConfig, M], Generic[M]

InferenceJob that runs evaluations and saves results.

Created from a trainer or checkpoint, holds a list of evaluations, and runs them when run() is called.

Example

evaluator = Evaluator.from_trainer(trainer)
evaluator()  # Runs evaluations and saves results

done: bool property

Check if evaluation results already exist.

from_trainer(trainer: BaseTrainer[Any, Any], config: Optional[Any] = None) -> Evaluator[M] classmethod

Create Evaluator from trainer.

Parameters:

Name     Type                   Default   Description
trainer  BaseTrainer[Any, Any]  required  BaseTrainer instance to get inference state from
config   Optional[Any]          None      Optional config object whose .components field names the evaluations to run. If None, hydrates EvaluatorConfig from the global config. Pass an RLEvaluatorConfig to build a separate evaluator for RL rollouts.

Returns:

Type          Description
Evaluator[M]  Evaluator instance ready to run evaluations

from_checkpoint(suffix: str | Path, spec: ExecutionSpec, runtime_cfg: Any | None = None, resume: bool = False) -> Tuple[Evaluator[M], Any] classmethod

Create Evaluator from checkpoint.

Parameters:

Name         Type           Default   Description
suffix       str | Path     required  Checkpoint suffix
spec         ExecutionSpec  required  ExecutionSpec with topology
runtime_cfg  Any | None     None      Optional runtime config overlay
resume       bool           False     If True, restore spec identity (including wandb id) from the checkpoint for idempotent job resumption.

Returns:

Type                      Description
Tuple[Evaluator[M], Any]  (evaluator, config) tuple

evaluate(reduce: str = 'mean', return_intermediates: bool = False, **kwargs: Any) -> Any

Run all evaluations.

Parameters:

Name                  Type  Default  Description
reduce                str   'mean'   Passed through to each evaluation. "mean"/"sum" → float per evaluation; "none" → np.ndarray of per-sample scores.
return_intermediates  bool  False    When True, also return the per-evaluation list of (x, mask) rollouts (one inner list per evaluation).
**kwargs              Any   {}       Forwarded to each evaluation's __call__ (e.g. temperature, top_p, chunk_size).
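
The three reduce modes can be pictured with the sketch below. `reduce_scores` is a hypothetical helper, and it returns a plain list for "none" where the library is documented to return an np.ndarray; treating an empty score list as an error for "mean" is also an assumption.

```python
from typing import List, Union


def reduce_scores(
    scores: List[float], reduce: str = "mean"
) -> Union[float, List[float]]:
    """Collapse per-sample scores according to the reduce mode."""
    if reduce == "none":
        # Keep one score per sample.
        return list(scores)
    if reduce == "sum":
        return float(sum(scores))
    if reduce == "mean":
        if not scores:
            raise ValueError("cannot take the mean of zero scores")
        return float(sum(scores) / len(scores))
    raise ValueError(f"unknown reduce mode: {reduce!r}")


per_sample = [1.0, 0.0, 1.0, 1.0]
print(reduce_scores(per_sample))          # 0.75
print(reduce_scores(per_sample, "sum"))   # 3.0
print(reduce_scores(per_sample, "none"))  # [1.0, 0.0, 1.0, 1.0]
```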

run() -> None

Run all evaluations and save results to disk.