
Adding an Evaluation

There are four evaluation base classes, each using a different inference strategy:

| Class | How it works | Use for |
| --- | --- | --- |
| RolloutEvaluation | Autoregressively generates text, then checks it against a ground truth | Open-ended generation, QA, classification via generation |
| EncodingEvaluation | Single forward pass, argmax of logits | Token-level accuracy tasks |
| PerplexityEvaluation | Forward pass, computes NLL over all tokens, returns 1/perplexity | Language modelling benchmarks |
| PerplexityComparisonEvaluation | NLL computed separately for each candidate continuation; lowest wins | Multiple-choice via likelihood |

All four share the same registration and wiring pattern — only the base class and the methods you implement change.


RolloutEvaluation

The model generates tokens autoregressively from a prompt. You provide the prompt and expected answer; the framework handles batching, generation, and distributed aggregation.

Required methods: name, __len__, get, clean

# theseus/evaluation/datasets/my_eval.py
from typing import Any, Tuple

from datasets import load_dataset

from theseus.data.datasets import ChatTemplate, ChatTurn
from theseus.data.tokenizer import decode_chat_template, encode_chat_template, get_tokenizer
from theseus.evaluation.base import RolloutEvaluation
from theseus.registry import evaluation


@evaluation("my_eval")
class MyEval(RolloutEvaluation):

    def __init__(self) -> None:
        self.ds = load_dataset("org/my-dataset", split="test")
        self.encoder = get_tokenizer()

    @property
    def name(self) -> str:
        return "my_eval"

    def __len__(self) -> int:
        return len(self.ds)

    def get(self, indx: int) -> Tuple[str, str]:
        """Return (prompt_string, expected_answer_string)."""
        item = self.ds[indx]
        prompt = encode_chat_template(
            [ChatTurn(role="user", message=item["question"])],
            self.encoder,
            prompt=True,
            tokenize=False,
        )
        return prompt, item["answer"]

    def clean(self, y_hat: str) -> str:
        """Extract the model's answer from its full generation."""
        chats: ChatTemplate = decode_chat_template(y_hat)
        for turn in chats:
            if turn.role == "assistant":
                return turn.message.strip()
        return ""

Optional overrides:

check(y, y_hat) -> bool — how to compare cleaned output to expected. Default raises NotImplementedError, so you must override either this or score.

def check(self, y: str, y_hat: str) -> bool:
    return y.strip().lower() == y_hat.strip().lower()

score(ys, y_hats) -> float — override the whole scoring function if you need something other than mean(check):

def score(self, ys: list[str], y_hats: list[str]) -> float:
    return sum(y in y_hat for y, y_hat in zip(ys, y_hats)) / len(ys)

max_new_tokens(inference) -> int — how many tokens to generate. Defaults to block_size. Most tasks only need 10–256:

def max_new_tokens(self, inference: Any) -> int:
    return 32
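
To make the relationship between clean, check, and score concrete, here is a minimal sketch of a mean(check) scorer. It is not the framework's implementation: the function name mean_check_score is hypothetical, and it assumes that generations are passed through clean before being compared.

```python
from typing import Callable, List


def mean_check_score(
    ys: List[str],
    y_hats: List[str],
    clean: Callable[[str], str],
    check: Callable[[str, str], bool],
) -> float:
    """Fraction of examples whose cleaned generation passes `check`."""
    if not ys:
        return 0.0  # avoid dividing by zero on an empty evaluation
    # Assumption: each raw generation is cleaned before it is checked.
    hits = sum(check(y, clean(y_hat)) for y, y_hat in zip(ys, y_hats))
    return hits / len(ys)


# Toy usage with stand-in clean/check functions.
ys = ["Paris", "4"]
y_hats = ["  paris\n", "five"]
result = mean_check_score(
    ys,
    y_hats,
    clean=str.strip,
    check=lambda y, y_hat: y.strip().lower() == y_hat.strip().lower(),
)
print(result)  # one of the two toy examples matches
```

Overriding check changes what counts as a hit; overriding score replaces the averaging step entirely.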

EncodingEvaluation

No generation: a single forward pass is run, and the argmax of the logits at each position is taken as the model's prediction. Good for tasks where the answer is a single next token.

Required methods: name, __len__, get, clean

get returns only the input string. No expected answer is returned separately, because the target token is already part of the input.

from datasets import load_dataset

from theseus.evaluation.base import EncodingEvaluation
from theseus.registry import evaluation


@evaluation("my_encoding_eval")
class MyEncodingEval(EncodingEvaluation):

    def __init__(self) -> None:
        self.ds = load_dataset("org/my-dataset", split="test")

    @property
    def name(self) -> str:
        return "my_encoding_eval"

    def __len__(self) -> int:
        return len(self.ds)

    def get(self, indx: int) -> str:
        """Return the full input string (including the target token at the end)."""
        return self.ds[indx]["text"]

    def clean(self, y_hat: str) -> str:
        """Normalise the decoded argmax prediction."""
        return y_hat.strip()

check(x, y_hat) -> bool receives the original input string and the decoded argmax — override it to define what "correct" means:

def check(self, x: str, y_hat: str) -> bool:
    # e.g. check whether the predicted last token matches what we expect
    expected_last_word = x.split()[-1]
    return expected_last_word in y_hat
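
The argmax step can be illustrated in isolation. The sketch below is a plain-Python stand-in, not the framework's code: argmax_tokens is a hypothetical helper and the logits are toy values. In a causal language model, the row at position t is the model's guess for the token at position t + 1.

```python
from typing import List


def argmax_tokens(logits: List[List[float]]) -> List[int]:
    """Pick the highest-scoring vocabulary id at each sequence position."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Toy logits: 3 positions over a 4-token vocabulary.
logits = [
    [0.1, 2.0, -1.0, 0.3],
    [1.5, 0.2, 0.1, 0.0],
    [-0.5, -0.2, 3.1, 0.9],
]
predicted = argmax_tokens(logits)
print(predicted)  # one predicted token id per position
```

The predicted ids are then decoded to a string, which is what clean and check receive as y_hat.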

PerplexityEvaluation

Runs a forward pass over the dataset and computes mean NLL across all non-padding tokens. Returns 1/perplexity so that higher is always better (consistent with other evaluation scores). No clean or check needed — scoring is entirely automatic.

Required methods: name, __len__, get

from datasets import load_dataset

from theseus.evaluation.base import PerplexityEvaluation
from theseus.registry import evaluation


@evaluation("my_ppl_eval")
class MyPplEval(PerplexityEvaluation):

    def __init__(self) -> None:
        self.ds = load_dataset("org/my-corpus", split="test")

    @property
    def name(self) -> str:
        return "my_ppl_eval"

    def __len__(self) -> int:
        return len(self.ds)

    def get(self, indx: int) -> str:
        """Return the text to compute perplexity over."""
        return self.ds[indx]["text"]

Each document is truncated to block_size before scoring.
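
As a worked example of the score, the sketch below computes 1/perplexity from per-token NLL values, taking perplexity to be exp(mean NLL). It is illustrative only: inverse_perplexity is a hypothetical helper, the NLL values are made up, and the framework's own masking and aggregation may differ in detail.

```python
import math
from typing import List


def inverse_perplexity(token_nlls: List[float], mask: List[bool]) -> float:
    """Return 1/perplexity over the tokens where mask is True (non-padding)."""
    kept = [nll for nll, keep in zip(token_nlls, mask) if keep]
    if not kept:
        return 0.0  # nothing to score
    mean_nll = sum(kept) / len(kept)
    # perplexity = exp(mean_nll), so 1/perplexity = exp(-mean_nll)
    return math.exp(-mean_nll)


# Two real tokens with probabilities 1/2 and 1/8, plus one padding token.
token_nlls = [math.log(2), math.log(8), 5.0]
mask = [True, True, False]
score = inverse_perplexity(token_nlls, mask)
print(score)
```

Here the mean NLL is ln 4, so the perplexity is 4 and the score is 0.25. A model that assigns probability 1 to every token would score 1.0, the maximum.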


PerplexityComparisonEvaluation

Multiple-choice via likelihood: for each question, the model scores every candidate continuation by its mean NLL, computed over the continuation tokens only (not the shared prefix). The candidate with the lowest mean NLL is the model's answer.

Required methods: name, __len__, get

get returns a (prefix, continuations, correct_index) triple:

from typing import Tuple

from datasets import load_dataset

from theseus.evaluation.base import PerplexityComparisonEvaluation
from theseus.registry import evaluation


@evaluation("my_mc_eval")
class MyMCEval(PerplexityComparisonEvaluation):

    def __init__(self) -> None:
        self.ds = load_dataset("org/my-mc-dataset", split="test")

    @property
    def name(self) -> str:
        return "my_mc_eval"

    def __len__(self) -> int:
        return len(self.ds)

    def get(self, indx: int) -> Tuple[str, list[str], int]:
        """Return (shared_prefix, list_of_continuations, correct_index)."""
        item = self.ds[indx]
        prefix = f"Question: {item['question']}\nAnswer:"
        choices = item["choices"]          # e.g. ["Paris", "London", "Berlin", "Rome"]
        correct = item["answer_index"]     # e.g. 0
        return prefix, choices, correct

The framework concatenates prefix + continuation for each choice, runs a forward pass on all of them, masks out the prefix tokens so only the continuation NLL counts, and picks the choice with the minimum mean NLL.
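
The selection rule described above can be sketched in a few lines. This is an illustration under assumptions, not the framework's code: pick_choice is a hypothetical helper, the per-token NLLs are toy values, and the prefix is assumed to tokenize to the same length for every candidate.

```python
from typing import List


def pick_choice(token_nlls: List[List[float]], prefix_len: int) -> int:
    """Return the index of the candidate whose continuation has the lowest mean NLL."""
    means = []
    for nlls in token_nlls:
        continuation = nlls[prefix_len:]  # mask out the shared prefix positions
        if continuation:
            means.append(sum(continuation) / len(continuation))
        else:
            means.append(float("inf"))  # an empty continuation can never win
    return min(range(len(means)), key=means.__getitem__)


# Per-token NLLs for prefix + continuation; the first 2 positions are the prefix.
token_nlls = [
    [9.0, 9.0, 0.5, 0.7],       # continuation mean 0.6
    [9.0, 9.0, 2.0],            # continuation mean 2.0
    [9.0, 9.0, 0.4, 1.2, 1.4],  # continuation mean 1.0
]
chosen = pick_choice(token_nlls, prefix_len=2)
print(chosen)
```

Using the mean rather than the sum keeps longer continuations from being penalised simply for having more tokens. The example counts as correct when chosen equals the correct_index returned by get.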


Registering and wiring in

All four types register the same way:

# theseus/evaluation/__init__.py  — add one line
from .datasets.my_eval import MyEval  # noqa: F401

Then add the key to your config YAML:

eval:
  evaluations:
    - my_eval
    - my_ppl_eval

Results are logged to W&B under my_eval/score and saved to {cluster.root}/{project}/{group}/{run}/results.json at the end of each evaluation run.