evaluation

`evaluation` ¶

`Evaluation` ¶

Bases: ABC

Abstract base class for all evaluations.

`name: str` `abstractmethod` `property` ¶

Name of this evaluation.

`prefix() -> str` ¶

Prefix for metrics from this evaluation.

`__len__() -> int` `abstractmethod` ¶

Number of samples in this evaluation.

`__call__(inference: InferenceJob[Any, M], encoding: Any, reduce: str = 'mean', return_intermediates: bool = False, **kwargs: Any) -> Any` `abstractmethod` ¶

Run the evaluation and return a score (and optionally intermediates).

When `return_intermediates=True`, also returns a list of `(x, padding_mask)` NumPy array pairs, one per sample. The list is available on every host so that an RL trainer can use it as a training batch.

`score(*args: Any) -> List[float]` ¶

Return one float per evaluation sample. Subclasses override.

`find_accumulation_steps(dataset_size: int, max_batch_size: int, dp_replicate: int) -> Tuple[int, int] | Tuple[None, None]` `staticmethod` ¶

Find batch size and accumulation steps that evenly divide dataset.

Parameters:

Name	Type	Description	Default
`dataset_size`	`int`	Total number of samples	required
`max_batch_size`	`int`	Maximum per-device batch size	required
`dp_replicate`	`int`	Data parallel replication factor	required

Returns:

Type	Description
`Tuple[int, int] \| Tuple[None, None]`	(batch_size, accumulation_steps) or (None, None) if no valid size found
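The search rule itself is not spelled out in this reference, so the following is only a minimal sketch of one way such a helper could behave. It assumes that one step consumes `batch_size * dp_replicate` samples and that larger batch sizes are preferred; `find_accumulation_steps_sketch` is a hypothetical name, not the library function.

```python
from typing import Optional, Tuple


def find_accumulation_steps_sketch(
    dataset_size: int, max_batch_size: int, dp_replicate: int
) -> Tuple[Optional[int], Optional[int]]:
    """Hypothetical search; the library's actual rule may differ."""
    # Try the largest per-device batch size first so fewer steps are needed.
    for batch_size in range(max_batch_size, 0, -1):
        # Assumption: one step consumes batch_size samples on each replica.
        samples_per_step = batch_size * dp_replicate
        if dataset_size % samples_per_step == 0:
            return batch_size, dataset_size // samples_per_step
    # No batch size in [1, max_batch_size] divides the dataset evenly.
    return None, None


print(find_accumulation_steps_sketch(96, 16, 2))  # 96 = 16 * 2 * 3
print(find_accumulation_steps_sketch(7, 4, 2))    # an odd size cannot split over 2 replicas
```

Under these assumptions, callers have to handle the `(None, None)` case explicitly, for example by padding or trimming the dataset.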

`RolloutEvaluation` ¶

Bases: Evaluation

Evaluation using autoregressive generation.

`score(ys: list[str], y_hats: list[str]) -> List[float]` ¶

Per-sample scores. Default: cast each check() to float.

`check(y: str, y_hat: str) -> bool` ¶

Check if y_hat matches y.

Parameters:

Name	Type	Description	Default
`y`	`str`	Ground truth	required
`y_hat`	`str`	Generated result	required

Returns:

Type	Description
`bool`	Whether y_hat matches y

`clean(y_hat: str) -> str` `abstractmethod` ¶

Clean generated result before checking.

Parameters:

Name	Type	Description	Default
`y_hat`	`str`	Generated result, which can include the prompt	required

Returns:

Type	Description
`str`	Cleaned/normalized result available for comparison
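To illustrate how `clean`, `check`, and `score` could fit together, here is a small stand-in class that reuses the documented method names without importing the library. The class name, the single fixed prompt, and the assumption that outputs are cleaned before `check` is called are all illustrative choices, not confirmed behavior.

```python
from typing import List


class ExactMatchRolloutSketch:
    """Stand-in with the same method names as RolloutEvaluation (not a subclass)."""

    def __init__(self, prompt: str) -> None:
        self.prompt = prompt

    def clean(self, y_hat: str) -> str:
        # Drop the echoed prompt if present, then normalize whitespace and case.
        if y_hat.startswith(self.prompt):
            y_hat = y_hat[len(self.prompt):]
        return y_hat.strip().lower()

    def check(self, y: str, y_hat: str) -> bool:
        # Exact match against the ground truth.
        return y == y_hat

    def score(self, ys: List[str], y_hats: List[str]) -> List[float]:
        # Mirrors the documented default: cast each check() to float.
        return [float(self.check(y, y_hat)) for y, y_hat in zip(ys, y_hats)]


ev = ExactMatchRolloutSketch(prompt="Q: capital of France? A:")
raw = ["Q: capital of France? A: Paris\n", "Q: capital of France? A: Lyon"]
cleaned = [ev.clean(r) for r in raw]  # assumed order: clean first, then check
scores = ev.score(["paris", "paris"], cleaned)
print(cleaned, scores)
```

A real subclass would also implement `get`, `__len__`, `name`, and `max_new_tokens`; they are omitted here to keep the focus on scoring.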

`get(indx: int) -> Tuple[str, str]` `abstractmethod` ¶

Get sample at index.

Returns:

Type	Description
`Tuple[str, str]`	(input_string, expected_output_string)

`max_new_tokens(inference: InferenceJob[Any, M]) -> int` ¶

Maximum tokens to generate. Subclasses MUST override.

Drives the prompt/generation split (`prompt_max = block_size - max_new_tokens`) so that JIT shapes stay constant across refills. Defaulting to `block_size` would leave zero room for prompts.
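The arithmetic behind the split can be sketched in a few lines. `split_block` follows the documented formula directly; `fit_prompt` and its keep-the-tail, left-pad policy are hypothetical and only show why a fixed `prompt_max` yields constant shapes.

```python
def split_block(block_size: int, max_new_tokens: int) -> tuple[int, int]:
    """Return (prompt_max, max_new_tokens) for a fixed-size block."""
    if not 0 < max_new_tokens < block_size:
        raise ValueError("max_new_tokens must leave room for the prompt")
    prompt_max = block_size - max_new_tokens
    return prompt_max, max_new_tokens


def fit_prompt(token_ids: list[int], prompt_max: int, pad_id: int = 0) -> list[int]:
    # Hypothetical policy: keep the last prompt_max tokens, left-pad the rest.
    kept = token_ids[-prompt_max:]
    return [pad_id] * (prompt_max - len(kept)) + kept


prompt_max, new_tokens = split_block(block_size=1024, max_new_tokens=256)
print(prompt_max, new_tokens)
print(fit_prompt([1, 2, 3], prompt_max=5))
```

Every prompt comes out with exactly `prompt_max` entries, so a compiled function sees the same input shape on every call.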

`__call__(inference: InferenceJob[Any, M], encoding: Any, reduce: str = 'mean', return_intermediates: bool = False, temperature: float = 0.0, top_p: float = 1.0, chunk_size: int = 200, samples_per_prompt: int = 1, **kwargs: Any) -> Any` ¶

Run evaluation.

Parameters:

Name	Type	Description	Default
`inference`	`InferenceJob[Any, M]`	InferenceJob instance for running inference	required
`encoding`	`Any`	Tokenizer with encode_batch/decode_batch methods	required
`reduce`	`str`	How to reduce per-sample scores ("mean" \| "sum" \| "none")	`'mean'`
`return_intermediates`	`bool`	Also return per-sample (rollout, mask) NumPy arrays on every host (for RL consumers)	`False`
`temperature`	`float`	Sampling temperature (0.0 for greedy)	`0.0`
`top_p`	`float`	Nucleus sampling threshold	`1.0`
`chunk_size`	`int`	Number of batches per JIT chunk	`200`

Returns:

Type	Description
`Any`	Evaluation score, or (score, intermediates) when return_intermediates.
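The three `reduce` modes can be pictured with a small standard-library sketch. `reduce_scores` is a hypothetical helper, not part of the documented API, and it returns a plain list for `"none"` where the library documents an `np.ndarray`.

```python
from typing import List, Union


def reduce_scores(scores: List[float], reduce: str = "mean") -> Union[float, List[float]]:
    """Collapse per-sample scores according to the reduce mode."""
    if not scores:
        raise ValueError("no scores to reduce")
    if reduce == "mean":
        return sum(scores) / len(scores)
    if reduce == "sum":
        return float(sum(scores))
    if reduce == "none":
        # Keep one entry per sample (the library documents an np.ndarray here).
        return list(scores)
    raise ValueError(f"unknown reduce mode: {reduce!r}")


per_sample = [1.0, 0.0, 1.0, 1.0]
print(reduce_scores(per_sample, "mean"))
print(reduce_scores(per_sample, "sum"))
print(reduce_scores(per_sample, "none"))
```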

`EncodingEvaluation` ¶

Bases: Evaluation

Evaluation using next-token prediction accuracy.

`score(xs: list[str], y_hats: list[str]) -> List[float]` ¶

Per-sample scores. Default: cast each check() to float.

`check(x: str, y_hat: str) -> bool` ¶

Check if prediction is correct given input.

Parameters:

Name	Type	Description	Default
`x`	`str`	Input string	required
`y_hat`	`str`	Model prediction (cleaned, decoded argmax)	required

Returns:

Type	Description
`bool`	Whether prediction is correct

`clean(y_hat: str) -> str` `abstractmethod` ¶

Clean model prediction before checking.

Parameters:

Name	Type	Description	Default
`y_hat`	`str`	Raw decoded model prediction	required

Returns:

Type	Description
`str`	Cleaned/normalized result available for comparison
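As a rough illustration of the "cleaned, decoded argmax" prediction that `check` receives, the sketch below decodes toy logits greedily and scores a sum. The vocabulary, `argmax_decode`, and the rule that the expected answer is derived from `x` are all assumptions made for this example.

```python
from typing import List

VOCAB = ["<pad>", "3", "4", "5", " "]  # toy vocabulary, purely illustrative


def argmax_decode(logits: List[List[float]]) -> str:
    # Greedy prediction: take the highest-scoring token id at each position.
    ids = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    return "".join(VOCAB[i] for i in ids)


class AdditionEncodingSketch:
    """Stand-in using the documented method names (not a real subclass)."""

    def clean(self, y_hat: str) -> str:
        # Remove padding tokens and surrounding whitespace.
        return y_hat.replace("<pad>", "").strip()

    def check(self, x: str, y_hat: str) -> bool:
        # Assumption: the expected answer is derived from the input string.
        left, right = x.rstrip("=").split("+")
        return y_hat == str(int(left) + int(right))


logits = [[0.1, 0.2, 2.5, 0.3, 0.0], [0.0, 0.1, 0.1, 0.2, 1.9]]
ev = AdditionEncodingSketch()
decoded = argmax_decode(logits)
print(repr(decoded), ev.check("2+2=", ev.clean(decoded)))
```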

`get(indx: int) -> str` `abstractmethod` ¶

Get input string at index.

`__call__(inference: InferenceJob[Any, M], encoding: Any, reduce: str = 'mean', return_intermediates: bool = False, chunk_size: int = 200, **kwargs: Any) -> Any` ¶

Run evaluation.

`PerplexityEvaluation` ¶

Bases: Evaluation

Evaluation that computes dataset perplexity and returns ppl (lower is better).

Runs a blockwise forward pass like EncodingEvaluation, computes the mean negative log-likelihood over all non-padding tokens, and returns perplexity.

`get(indx: int) -> str` `abstractmethod` ¶

Get input string at index.

`score(per_sample_nll: np.ndarray, per_sample_count: np.ndarray) -> List[float]` ¶

Per-sample perplexity: `exp(nll / max(count, 1))`.
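The per-sample formula, and the dataset-level perplexity described for this class, can be written with the standard library alone. The function names are hypothetical, and plain lists stand in for the `np.ndarray` arguments of `score`.

```python
import math
from typing import List, Sequence


def per_sample_perplexity(nll: Sequence[float], count: Sequence[int]) -> List[float]:
    # exp(nll / max(count, 1)); the max() guard avoids division by zero
    # for samples that contain only padding.
    return [math.exp(n / max(c, 1)) for n, c in zip(nll, count)]


def dataset_perplexity(nll: Sequence[float], count: Sequence[int]) -> float:
    # Mean NLL over all non-padding tokens, then exponentiate.
    return math.exp(sum(nll) / max(sum(count), 1))


nll = [4 * math.log(2.0), 0.0]  # summed NLL per sample
count = [4, 0]                  # non-padding tokens per sample
print(per_sample_perplexity(nll, count))
print(dataset_perplexity(nll, count))
```

The dataset-level value weights every token equally, so it generally differs from the mean of the per-sample perplexities.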

`__call__(inference: InferenceJob[Any, M], encoding: Any, reduce: str = 'mean', return_intermediates: bool = False, chunk_size: int = 200, **kwargs: Any) -> Any` ¶

Run evaluation.

`PerplexityComparisonEvaluation` ¶

Bases: Evaluation

Evaluation using perplexity comparison for multiple-choice tasks.

`get(indx: int) -> Tuple[str, list[str], int]` `abstractmethod` ¶

Get sample at index.

Returns:

Type	Description
`Tuple[str, list[str], int]`	(prefix, list_of_continuations, correct_index)

`score(correct_flags: List[float]) -> List[float]` ¶

Per-sample correctness (1.0 / 0.0).
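One plausible way a correctness flag could be produced is sketched below: compute a perplexity per continuation, pick the lowest, and compare with `correct_index`. Whether the library normalizes by continuation length is not stated here, so the length normalization and both function names are assumptions.

```python
import math
from typing import Sequence


def pick_lowest_perplexity(nlls: Sequence[float], counts: Sequence[int]) -> int:
    # Assumption: perplexity is length-normalized over continuation tokens.
    ppls = [math.exp(n / max(c, 1)) for n, c in zip(nlls, counts)]
    return min(range(len(ppls)), key=lambda i: ppls[i])


def correct_flag(nlls: Sequence[float], counts: Sequence[int], correct_index: int) -> float:
    # 1.0 if the lowest-perplexity continuation is the labeled answer, else 0.0.
    return float(pick_lowest_perplexity(nlls, counts) == correct_index)


# Summed NLL and token count for three candidate continuations of one prefix.
nlls = [6.0, 2.0, 9.0]
counts = [3, 2, 3]
print(pick_lowest_perplexity(nlls, counts), correct_flag(nlls, counts, correct_index=1))
```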

`__call__(inference: InferenceJob[Any, M], encoding: Any, reduce: str = 'mean', return_intermediates: bool = False, chunk_size: int = 200, **kwargs: Any) -> Any` ¶

Run evaluation.

`Evaluator(spec: ExecutionSpec)` ¶

Bases: InferenceJob[EvaluatorConfig, M], Generic[M]

InferenceJob that runs evaluations and saves results.

Created from a trainer or checkpoint, holds a list of evaluations, and runs them when run() is called.

Example

    evaluator = Evaluator.from_trainer(trainer)
    evaluator.run()  # Runs evaluations and saves results

`done: bool` `property` ¶

Check if evaluation results already exist.

`from_trainer(trainer: BaseTrainer[Any, Any], config: Optional[Any] = None) -> Evaluator[M]` `classmethod` ¶

Create Evaluator from trainer.

Parameters:

Name	Type	Description	Default
`trainer`	`BaseTrainer[Any, Any]`	BaseTrainer instance to get inference state from	required
`config`	`Optional[Any]`	Optional config object whose `.components` field names the evaluations to run. If None, hydrates `EvaluatorConfig` from the global config. Pass an `RLEvaluatorConfig` to build a separate evaluator for RL rollouts.	`None`

Returns:

Type	Description
`Evaluator[M]`	Evaluator instance ready to run evaluations

`from_checkpoint(suffix: str | Path, spec: ExecutionSpec, runtime_cfg: Any | None = None, resume: bool = False) -> Tuple[Evaluator[M], Any]` `classmethod` ¶

Create Evaluator from checkpoint.

Parameters:

Name	Type	Description	Default
`suffix`	`str \| Path`	Checkpoint suffix	required
`spec`	`ExecutionSpec`	ExecutionSpec with topology	required
`runtime_cfg`	`Any \| None`	Optional runtime config overlay	`None`
`resume`	`bool`	If `True`, restore spec identity (including wandb id) from the checkpoint for idempotent job resumption.	`False`

Returns:

Type	Description
`Tuple[Evaluator[M], Any]`	(evaluator, config) tuple

`evaluate(reduce: str = 'mean', return_intermediates: bool = False, **kwargs: Any) -> Any` ¶

Run all evaluations.

Parameters:

Name	Type	Description	Default
`reduce`	`str`	Passed through to each evaluation. "mean"/"sum" → float per evaluation; "none" → np.ndarray of per-sample scores.	`'mean'`
`return_intermediates`	`bool`	When True, also return the per-evaluation list of (x, mask) rollouts (one inner list per evaluation).	`False`
`**kwargs`	`Any`	Forwarded to each evaluation's `__call__` (e.g. temperature, top_p, chunk_size).	`{}`

`run() -> None` ¶

Run all evaluations and save results to disk.
