evaluation
Evaluation
Bases: ABC
Abstract base class for all evaluations.
name: str
abstractmethod
property
Name of this evaluation.
prefix() -> str
Prefix for metrics from this evaluation.
__len__() -> int
abstractmethod
Number of samples in this evaluation.
__call__(inference: InferenceJob[Any, M], encoding: Any, reduce: str = 'mean', return_intermediates: bool = False, **kwargs: Any) -> Any
abstractmethod
Run the evaluation and return a score (and optionally intermediates).
When return_intermediates=True, also returns a list of
(x, padding_mask) numpy arrays — one per sample — available on every
host so that an RL trainer can use them as a training batch.
score(*args: Any) -> List[float]
Return one float per evaluation sample. Subclasses override.
find_accumulation_steps(dataset_size: int, max_batch_size: int, dp_replicate: int) -> Tuple[int, int] | Tuple[None, None]
staticmethod
Find a batch size and a number of accumulation steps that evenly divide the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_size | int | Total number of samples | required |
| max_batch_size | int | Maximum per-device batch size | required |
| dp_replicate | int | Data parallel replication factor | required |
Returns:
| Type | Description |
|---|---|
| Tuple[int, int] \| Tuple[None, None] | (batch_size, accumulation_steps), or (None, None) if no valid size is found |
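The search described above can be illustrated with a minimal sketch. The function name `find_accumulation_steps_sketch` and the search order (largest per-device batch size first) are assumptions made for illustration; the library's actual implementation may differ. All three arguments are assumed to be positive integers.

```python
from typing import Optional, Tuple


def find_accumulation_steps_sketch(
    dataset_size: int, max_batch_size: int, dp_replicate: int
) -> Tuple[Optional[int], Optional[int]]:
    """Hypothetical stand-in, not the library's implementation."""
    # Try the largest per-device batch size first so fewer steps are needed.
    for batch_size in range(max_batch_size, 0, -1):
        # Assumption: one step consumes batch_size samples on each replica.
        samples_per_step = batch_size * dp_replicate
        if dataset_size % samples_per_step == 0:
            return batch_size, dataset_size // samples_per_step
    # No batch size up to max_batch_size divides the dataset evenly.
    return None, None


print(find_accumulation_steps_sketch(96, 10, 4))  # (8, 3)
print(find_accumulation_steps_sketch(7, 4, 2))    # (None, None)
```

Under these assumptions, a caller would check for the `(None, None)` result and either pad the dataset or pick a different `max_batch_size`.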
RolloutEvaluation
Bases: Evaluation
Evaluation using autoregressive generation.
score(ys: list[str], y_hats: list[str]) -> List[float]
Per-sample scores. Default: cast each check() to float.
check(y: str, y_hat: str) -> bool
Check if y_hat matches y.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y | str | Ground truth | required |
| y_hat | str | Generated result | required |
Returns:
| Type | Description |
|---|---|
| bool | Whether y_hat matches y |
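The relationship between `check()` and the default `score()` ("cast each check() to float") can be sketched with a standalone class. `ExactMatchSketch` is a hypothetical stand-in that does not inherit from the real base class, and its whitespace-insensitive exact match is an assumed matching rule chosen only for illustration.

```python
from typing import List


class ExactMatchSketch:
    """Hypothetical stand-in for a RolloutEvaluation subclass."""

    def check(self, y: str, y_hat: str) -> bool:
        # Assumption: a match means equality after stripping whitespace.
        return y.strip() == y_hat.strip()

    def score(self, ys: List[str], y_hats: List[str]) -> List[float]:
        # One float per sample: 1.0 where check() passes, 0.0 otherwise.
        return [float(self.check(y, y_hat)) for y, y_hat in zip(ys, y_hats)]


scores = ExactMatchSketch().score(["4", "Paris"], ["4 ", "London"])
print(scores)  # [1.0, 0.0]
```

A subclass that needs partial credit could override `score()` directly instead of `check()`, since `score()` is documented as returning one float per sample.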
clean(y_hat: str) -> str
abstractmethod
Clean generated result before checking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_hat | str | Generated result, which can include the prompt | required |
Returns:
| Type | Description |
|---|---|
| str | Cleaned/normalized result available for comparison |
get(indx: int) -> Tuple[str, str]
abstractmethod
Get sample at index.
Returns:
| Type | Description |
|---|---|
| Tuple[str, str] | (input_string, expected_output_string) |
max_new_tokens(inference: InferenceJob[Any, M]) -> int
Maximum tokens to generate. Subclasses MUST override.
Drives the prompt/generation split (prompt_max = block_size -
max_new_tokens) so the JIT shapes are constant across refills —
defaulting to block_size would leave zero room for prompts.
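The prompt/generation split is simple arithmetic, shown here as a small sketch. The helper name `prompt_budget` and the decision to raise on a non-positive budget are assumptions for illustration; only the formula `prompt_max = block_size - max_new_tokens` comes from the description above.

```python
def prompt_budget(block_size: int, max_new_tokens: int) -> int:
    """Hypothetical helper: tokens left for the prompt in a fixed-size block."""
    prompt_max = block_size - max_new_tokens
    if prompt_max <= 0:
        # Mirrors the warning above: max_new_tokens == block_size leaves
        # zero room for prompts. Raising here is an illustrative choice.
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return prompt_max


print(prompt_budget(block_size=1024, max_new_tokens=256))  # 768
```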
__call__(inference: InferenceJob[Any, M], encoding: Any, reduce: str = 'mean', return_intermediates: bool = False, temperature: float = 0.0, top_p: float = 1.0, chunk_size: int = 200, samples_per_prompt: int = 1, **kwargs: Any) -> Any
Run evaluation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inference | InferenceJob[Any, M] | InferenceJob instance for running inference | required |
| encoding | Any | Tokenizer with encode_batch/decode_batch methods | required |
| reduce | str | How to reduce per-sample scores: "mean", "sum", or "none" | 'mean' |
| return_intermediates | bool | Also return per-sample (rollout, mask) numpy arrays on every host (for RL consumers). | False |
| temperature | float | Sampling temperature (0.0 for greedy) | 0.0 |
| top_p | float | Nucleus sampling threshold | 1.0 |
| chunk_size | int | Number of batches per JIT chunk (default 200) | 200 |
Returns:
| Type | Description |
|---|---|
| Any | Evaluation score, or (score, intermediates) when return_intermediates. |
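To make the `temperature` and `top_p` parameters concrete, here is a generic NumPy sketch of greedy decoding and nucleus (top-p) sampling for a single token. This is the textbook technique, not this library's sampler; the function name `sample_token` and its `rng` argument are hypothetical.

```python
import numpy as np


def sample_token(logits, temperature=0.0, top_p=1.0, rng=None):
    """Generic sketch of temperature + nucleus sampling for one position."""
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0.0:
        # Greedy decoding: always take the most likely token.
        return int(np.argmax(logits))
    if rng is None:
        rng = np.random.default_rng()
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Sort tokens by probability, highest first.
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix whose cumulative mass reaches top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))


print(sample_token([0.1, 2.0, 0.3]))  # greedy: index of the largest logit
print(sample_token([0.1, 2.0, 0.3], temperature=1.0, top_p=0.9))
```

With `top_p=1.0` the nucleus contains every token, so sampling reduces to plain temperature sampling.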
EncodingEvaluation
Bases: Evaluation
Evaluation using next-token prediction accuracy.
score(xs: list[str], y_hats: list[str]) -> List[float]
Per-sample scores. Default: cast each check() to float.
check(x: str, y_hat: str) -> bool
Check if prediction is correct given input.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | str | Input string | required |
| y_hat | str | Model prediction (cleaned, decoded argmax) | required |
Returns:
| Type | Description |
|---|---|
| bool | Whether prediction is correct |
clean(y_hat: str) -> str
abstractmethod
Clean model prediction before checking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y_hat | str | Raw decoded model prediction | required |
Returns:
| Type | Description |
|---|---|
| str | Cleaned/normalized result available for comparison |
get(indx: int) -> str
abstractmethod
Get input string at index.
__call__(inference: InferenceJob[Any, M], encoding: Any, reduce: str = 'mean', return_intermediates: bool = False, chunk_size: int = 200, **kwargs: Any) -> Any
Run evaluation.
PerplexityEvaluation
Bases: Evaluation
Evaluation that computes dataset perplexity and returns ppl (lower is better).
Runs a blockwise forward pass like EncodingEvaluation, computes the mean negative log-likelihood over all non-padding tokens, and returns perplexity.
get(indx: int) -> str
abstractmethod
Get input string at index.
score(per_sample_nll: np.ndarray, per_sample_count: np.ndarray) -> List[float]
Per-sample perplexity (= exp(nll / max(count, 1))).
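The per-sample formula, and the dataset-level perplexity described for this class (mean negative log-likelihood over all non-padding tokens), can be sketched with NumPy. The function names are hypothetical, and the inputs are assumed to be summed NLL and non-padding token counts per sample.

```python
import numpy as np


def per_sample_perplexity(per_sample_nll, per_sample_count):
    """exp(nll / max(count, 1)) for each sample."""
    nll = np.asarray(per_sample_nll, dtype=np.float64)
    # max(count, 1) avoids division by zero for samples with no tokens.
    count = np.maximum(np.asarray(per_sample_count, dtype=np.float64), 1.0)
    return np.exp(nll / count).tolist()


def dataset_perplexity(per_sample_nll, per_sample_count):
    """exp of the mean NLL over all non-padding tokens in the dataset."""
    total_nll = float(np.sum(per_sample_nll))
    total_count = max(float(np.sum(per_sample_count)), 1.0)
    return float(np.exp(total_nll / total_count))


nll = [0.0, 2.0 * np.log(2.0)]   # summed NLL per sample
count = [0, 2]                   # non-padding tokens per sample
print(per_sample_perplexity(nll, count))
print(dataset_perplexity(nll, count))
```

Note that the dataset-level value weights every token equally, so it generally differs from the mean of the per-sample perplexities.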
__call__(inference: InferenceJob[Any, M], encoding: Any, reduce: str = 'mean', return_intermediates: bool = False, chunk_size: int = 200, **kwargs: Any) -> Any
Run evaluation.
PerplexityComparisonEvaluation
Bases: Evaluation
Evaluation using perplexity comparison for multiple-choice tasks.
get(indx: int) -> Tuple[str, list[str], int]
abstractmethod
Get sample at index.
Returns:
| Type | Description |
|---|---|
| Tuple[str, list[str], int] | (prefix, list_of_continuations, correct_index) |
score(correct_flags: List[float]) -> List[float]
Per-sample correctness (1.0 / 0.0).
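One plausible way to turn perplexities into a correctness flag is to pick the continuation with the lowest per-token negative log-likelihood and compare its index with `correct_index`. The sketch below assumes that selection rule and that length normalization is applied; neither detail is confirmed by the text above, and both function names are hypothetical.

```python
from typing import Sequence


def pick_continuation(nlls: Sequence[float], token_counts: Sequence[int]) -> int:
    """Index of the continuation with the lowest mean per-token NLL."""
    # exp() is monotonic, so the lowest mean NLL is also the lowest perplexity.
    per_token = [nll / max(n, 1) for nll, n in zip(nlls, token_counts)]
    return min(range(len(per_token)), key=per_token.__getitem__)


def correct_flag(
    nlls: Sequence[float], token_counts: Sequence[int], correct_index: int
) -> float:
    """1.0 if the model prefers the correct continuation, else 0.0."""
    return float(pick_continuation(nlls, token_counts) == correct_index)


# Three candidate continuations with summed NLLs and token counts.
print(correct_flag([6.0, 2.0, 9.0], [3, 2, 3], correct_index=1))  # 1.0
print(correct_flag([6.0, 2.0, 9.0], [3, 2, 3], correct_index=0))  # 0.0
```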
__call__(inference: InferenceJob[Any, M], encoding: Any, reduce: str = 'mean', return_intermediates: bool = False, chunk_size: int = 200, **kwargs: Any) -> Any
Run evaluation.
Evaluator(spec: ExecutionSpec)
Bases: InferenceJob[EvaluatorConfig, M], Generic[M]
InferenceJob that runs evaluations and saves results.
Created from a trainer or checkpoint, holds a list of evaluations, and runs them when run() is called.
Example:
    evaluator = Evaluator.from_trainer(trainer, evaluations, encoding)
    evaluator()  # Runs evaluations and saves results
done: bool
property
Check if evaluation results already exist.
from_trainer(trainer: BaseTrainer[Any, Any], config: Optional[Any] = None) -> Evaluator[M]
classmethod
Create Evaluator from trainer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| trainer | BaseTrainer[Any, Any] | BaseTrainer instance to get inference state from | required |
| config | Optional[Any] | Optional config object whose | None |
Returns:
| Type | Description |
|---|---|
| Evaluator[M] | Evaluator instance ready to run evaluations |
from_checkpoint(suffix: str | Path, spec: ExecutionSpec, runtime_cfg: Any | None = None, resume: bool = False) -> Tuple[Evaluator[M], Any]
classmethod
Create Evaluator from checkpoint.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| suffix | str \| Path | Checkpoint suffix | required |
| spec | ExecutionSpec | ExecutionSpec with topology | required |
| runtime_cfg | Any \| None | Optional runtime config overlay | None |
| resume | bool | If | False |
Returns:
| Type | Description |
|---|---|
| Tuple[Evaluator[M], Any] | (evaluator, config) tuple |
evaluate(reduce: str = 'mean', return_intermediates: bool = False, **kwargs: Any) -> Any
Run all evaluations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| reduce | str | Passed through to each evaluation. "mean"/"sum" → float per evaluation; "none" → np.ndarray of per-sample scores. | 'mean' |
| return_intermediates | bool | When True, also return the per-evaluation list of (x, mask) rollouts (one inner list per evaluation). | False |
| **kwargs | Any | Forwarded to each evaluation's call (e.g. temperature, top_p, chunk_size). | {} |
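The three `reduce` modes can be sketched as a small helper over per-sample scores. `reduce_scores` is a hypothetical name, and raising on an unknown mode is an assumption; the return types follow the parameter description above ("mean"/"sum" give a float, "none" gives an `np.ndarray`).

```python
import numpy as np


def reduce_scores(scores, reduce="mean"):
    """Hypothetical helper showing the documented reduce modes."""
    arr = np.asarray(scores, dtype=np.float64)
    if reduce == "mean":
        return float(arr.mean())
    if reduce == "sum":
        return float(arr.sum())
    if reduce == "none":
        # Per-sample scores are returned unreduced.
        return arr
    raise ValueError(f"unknown reduce mode: {reduce!r}")


per_sample = [1.0, 0.0, 1.0, 1.0]
print(reduce_scores(per_sample, "mean"))  # 0.75
print(reduce_scores(per_sample, "sum"))   # 3.0
print(reduce_scores(per_sample, "none"))
```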
run() -> None
Run all evaluations and save results to disk.