Adding an Evaluation

There are four evaluation base classes, each using a different inference strategy:

| Class | How it works | Use for |
| --- | --- | --- |
| `RolloutEvaluation` | Autoregressively generates text, then checks it against a ground truth | Open-ended generation, QA, classification via generation |
| `EncodingEvaluation` | Single forward pass, argmax of logits | Token-level accuracy tasks |
| `PerplexityEvaluation` | Forward pass, computes NLL over all tokens, returns `1/ppl` | Language modelling benchmarks |
| `PerplexityComparisonEvaluation` | NLL computed separately for each candidate continuation; lowest wins | Multiple-choice via likelihood |

All four share the same registration and wiring pattern — only the base class and the methods you implement change.

`RolloutEvaluation`

The model generates tokens autoregressively from a prompt. You provide the prompt and expected answer; the framework handles batching, generation, and distributed aggregation.

Required methods: name, __len__, get, clean

# theseus/evaluation/datasets/my_eval.py
from typing import Any, Tuple

from datasets import load_dataset

from theseus.data.datasets import ChatTemplate, ChatTurn
from theseus.data.tokenizer import decode_chat_template, encode_chat_template, get_tokenizer
from theseus.evaluation.base import RolloutEvaluation
from theseus.registry import evaluation


@evaluation("my_eval")
class MyEval(RolloutEvaluation):

    def __init__(self) -> None:
        self.ds = load_dataset("org/my-dataset", split="test")
        self.encoder = get_tokenizer()

    @property
    def name(self) -> str:
        return "my_eval"

    def __len__(self) -> int:
        return len(self.ds)

    def get(self, indx: int) -> Tuple[str, str]:
        """Return (prompt_string, expected_answer_string)."""
        item = self.ds[indx]
        prompt = encode_chat_template(
            [ChatTurn(role="user", message=item["question"])],
            self.encoder,
            prompt=True,
            tokenize=False,
        )
        return prompt, item["answer"]

    def clean(self, y_hat: str) -> str:
        """Extract the model's answer from its full generation."""
        chats: ChatTemplate = decode_chat_template(y_hat)
        for turn in chats:
            if turn.role == "assistant":
                return turn.message.strip()
        return ""

Overrides (at least one of check or score is required; max_new_tokens is optional):

check(y, y_hat) -> bool — how to compare cleaned output to expected. Default raises NotImplementedError, so you must override either this or score.

def check(self, y: str, y_hat: str) -> bool:
    return y.strip().lower() == y_hat.strip().lower()

score(ys, y_hats) -> float — override the whole scoring function if you need something other than mean(check):

def score(self, ys: list[str], y_hats: list[str]) -> float:
    return sum(y in y_hat for y, y_hat in zip(ys, y_hats)) / len(ys)

max_new_tokens(inference) -> int — how many tokens to generate. Defaults to block_size. Most tasks only need 10–256:

def max_new_tokens(self, inference: Any) -> int:
    return 32
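
To make the relationship between clean, check and the default mean(check) score concrete, here is a minimal standalone sketch. It is not the framework's code: `mean_check_score` and `case_insensitive_check` are hypothetical helpers, and the order of operations (clean first, then check, then average) is an assumption based on the descriptions above.

```python
from typing import Callable, Sequence


def mean_check_score(
    ys: Sequence[str],
    raw_generations: Sequence[str],
    clean: Callable[[str], str],
    check: Callable[[str, str], bool],
) -> float:
    """Fraction of examples whose cleaned generation passes `check`."""
    if not ys:
        return 0.0  # assumption: an empty evaluation scores zero
    # Clean every raw generation before comparing it to the expected answer.
    y_hats = [clean(g) for g in raw_generations]
    return sum(check(y, y_hat) for y, y_hat in zip(ys, y_hats)) / len(ys)


def case_insensitive_check(y: str, y_hat: str) -> bool:
    """Same comparison as the check example above."""
    return y.strip().lower() == y_hat.strip().lower()


score = mean_check_score(
    ys=["Paris", "4"],
    raw_generations=["  paris\n", "five"],
    clean=str.strip,
    check=case_insensitive_check,
)
print(score)  # one of the two answers matches
```

Overriding score instead replaces the final averaging step, which is useful when a per-example boolean is too coarse.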

`EncodingEvaluation`

No generation: a single forward pass is run, and the argmax of the logits at each position is taken as the model's prediction. Good for tasks where the answer is a single next token.

Required methods: name, __len__, get, clean

get returns only the input string. No expected answer is returned separately, because the target is the final token of the input itself.

from datasets import load_dataset

from theseus.evaluation.base import EncodingEvaluation
from theseus.registry import evaluation


@evaluation("my_encoding_eval")
class MyEncodingEval(EncodingEvaluation):

    def __init__(self) -> None:
        self.ds = load_dataset("org/my-dataset", split="test")

    @property
    def name(self) -> str:
        return "my_encoding_eval"

    def __len__(self) -> int:
        return len(self.ds)

    def get(self, indx: int) -> str:
        """Return the full input string (including the target token at the end)."""
        return self.ds[indx]["text"]

    def clean(self, y_hat: str) -> str:
        """Normalise the decoded argmax prediction."""
        return y_hat.strip()

check(x, y_hat) -> bool receives the original input string and the decoded argmax — override it to define what "correct" means:

def check(self, x: str, y_hat: str) -> bool:
    # e.g. check whether the predicted last token matches what we expect
    expected_last_word = x.split()[-1]
    return expected_last_word in y_hat
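
The argmax step described at the top of this section can be pictured with a toy example. This is only an illustrative sketch in plain Python: `argmax_tokens`, the logits and the four-word vocabulary are all made up, and the real framework operates on model tensors and a real tokenizer.

```python
from typing import List, Sequence


def argmax_tokens(logits: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring vocabulary index at each position."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Toy input: 3 positions, vocabulary of 4 tokens (hypothetical values).
toy_vocab = ["the", "cat", "sat", "mat"]
toy_logits = [
    [0.1, 2.0, -1.0, 0.3],
    [1.5, 0.2, 0.1, 0.0],
    [-0.5, -0.2, 3.1, 0.9],
]

predicted_ids = argmax_tokens(toy_logits)
# Decode the predicted ids back to text; this plays the role of y_hat.
predicted_text = " ".join(toy_vocab[i] for i in predicted_ids)
print(predicted_ids, predicted_text)
```

In a causal language model the logits at position t are usually read as the prediction for token t+1, which is why a check like the one above looks at the end of the decoded string.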

`PerplexityEvaluation`

Runs a forward pass over the dataset and computes mean NLL across all non-padding tokens. Returns 1/perplexity so that higher is always better (consistent with other evaluation scores). No clean or check needed — scoring is entirely automatic.

Required methods: name, __len__, get

from datasets import load_dataset

from theseus.evaluation.base import PerplexityEvaluation
from theseus.registry import evaluation


@evaluation("my_ppl_eval")
class MyPplEval(PerplexityEvaluation):

    def __init__(self) -> None:
        self.ds = load_dataset("org/my-corpus", split="test")

    @property
    def name(self) -> str:
        return "my_ppl_eval"

    def __len__(self) -> int:
        return len(self.ds)

    def get(self, indx: int) -> str:
        """Return the text to compute perplexity over."""
        return self.ds[indx]["text"]

Each document is truncated to block_size before scoring.
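
As a rough illustration of the score itself: perplexity is exp(mean NLL), so 1/perplexity equals exp(-mean NLL) and lies in (0, 1]. The sketch below assumes per-token log-probabilities and a padding mask are already available; `inverse_perplexity` is a hypothetical helper, and whether the framework averages per document or over the whole corpus is not specified here.

```python
import math
from typing import Sequence


def inverse_perplexity(
    token_log_probs: Sequence[float],
    is_padding: Sequence[bool],
) -> float:
    """Return 1/ppl = exp(-mean NLL), ignoring padding positions."""
    nlls = [-lp for lp, pad in zip(token_log_probs, is_padding) if not pad]
    if not nlls:
        return 0.0  # assumption: no scorable tokens yields the worst score
    mean_nll = sum(nlls) / len(nlls)
    return math.exp(-mean_nll)


# Toy values: two real tokens with probabilities 0.5 and 0.25, one padding slot.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.9)]
padding = [False, False, True]

score = inverse_perplexity(log_probs, padding)
print(round(score, 4))  # geometric mean of 0.5 and 0.25
```

The result is the geometric mean of the per-token probabilities, which is one way to read why higher is better.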

`PerplexityComparisonEvaluation`

Multiple-choice via likelihood: for each question the model scores every candidate continuation by its NLL (on the continuation tokens only, not the shared prefix). The candidate with the lowest NLL is the model's answer.

Required methods: name, __len__, get

get returns a (prefix, continuations, correct_index) triple:

from typing import Tuple

from datasets import load_dataset

from theseus.evaluation.base import PerplexityComparisonEvaluation
from theseus.registry import evaluation


@evaluation("my_mc_eval")
class MyMCEval(PerplexityComparisonEvaluation):

    def __init__(self) -> None:
        self.ds = load_dataset("org/my-mc-dataset", split="test")

    @property
    def name(self) -> str:
        return "my_mc_eval"

    def __len__(self) -> int:
        return len(self.ds)

    def get(self, indx: int) -> Tuple[str, list[str], int]:
        """Return (shared_prefix, list_of_continuations, correct_index)."""
        item = self.ds[indx]
        prefix = f"Question: {item['question']}\nAnswer:"
        choices = item["choices"]          # e.g. ["Paris", "London", "Berlin", "Rome"]
        correct = item["answer_index"]     # e.g. 0
        return prefix, choices, correct

The framework concatenates prefix + continuation for each choice, runs a forward pass on all of them, masks out the prefix tokens so only the continuation NLL counts, and picks the choice with the minimum mean NLL.
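
The selection rule in the paragraph above can be sketched in a few lines. This is an illustration under stated assumptions, not the framework's implementation: `pick_choice` is a hypothetical helper, the per-token NLLs are invented, and every candidate is assumed to share the same tokenized prefix length and to have at least one continuation token.

```python
from typing import List, Sequence


def pick_choice(token_nlls: Sequence[Sequence[float]], prefix_len: int) -> int:
    """Index of the candidate with the lowest mean NLL over continuation tokens."""
    mean_nlls: List[float] = []
    for nlls in token_nlls:
        continuation = nlls[prefix_len:]  # mask out the shared prefix
        mean_nlls.append(sum(continuation) / len(continuation))
    return min(range(len(mean_nlls)), key=mean_nlls.__getitem__)


# Toy per-token NLLs for prefix + continuation; the first 2 tokens are the prefix.
candidates = [
    [3.0, 2.0, 0.4, 0.6],       # continuation mean 0.5
    [3.0, 2.0, 1.5],            # continuation mean 1.5
    [3.0, 2.0, 0.9, 0.2, 0.7],  # continuation mean 0.6
]
correct_index = 0

predicted = pick_choice(candidates, prefix_len=2)
is_correct = predicted == correct_index
print(predicted, is_correct)
```

Using the mean rather than the sum keeps longer continuations from being penalised simply for having more tokens.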

Registering and wiring in

All four types register the same way:

# theseus/evaluation/__init__.py  — add one line
from .datasets.my_eval import MyEval  # noqa: F401

Then add the key to your config YAML:

eval:
  evaluations:
    - my_eval
    - my_ppl_eval

Results are logged to W&B under my_eval/score and saved to {cluster.root}/{project}/{group}/{run}/results.json at the end of each evaluation run.
