benchmark

Just a humble ~~polytime verifier~~ pretrained evaluator.

Evaluate(spec: ExecutionSpec)

Bases: Evaluator[GPT]

Run the configured evaluation suite against a GPT checkpoint.

Invoke with --restore <ckpt>. The checkpoint's state is loaded via InferenceJob.restore_from_path, and the evaluator-specific fields (encoding, evaluations, sampling RNG) are then wired up on top of the restored state.
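The restore-then-wire-up flow described above can be illustrated with a minimal sketch. This page does not show the real signatures, so everything below is a hypothetical stand-in: InferenceJobSketch, EvaluateSketch, from_checkpoint, the JSON checkpoint format, and the seed parameter are assumptions made for illustration, not the library's actual API.

```python
import json
import random
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class InferenceJobSketch:
    """Hypothetical stand-in for InferenceJob: holds restored checkpoint state."""
    state: dict

    @classmethod
    def restore_from_path(cls, path):
        # Assumption: the checkpoint is a JSON file; the real format is not documented here.
        return cls(state=json.loads(Path(path).read_text()))


@dataclass
class EvaluateSketch:
    """Hypothetical stand-in for Evaluate: restored state plus evaluator-only fields."""
    job: InferenceJobSketch
    encoding: str
    evaluations: dict
    rng: random.Random

    @classmethod
    def from_checkpoint(cls, path, encoding, evaluations, seed=0):
        # Step 1: load the checkpoint state.
        job = InferenceJobSketch.restore_from_path(path)
        # Step 2: wire the evaluator-specific fields on top of it.
        return cls(job=job, encoding=encoding,
                   evaluations=dict(evaluations), rng=random.Random(seed))

    def run(self):
        # Each evaluation receives the restored state and the sampling RNG.
        return {name: fn(self.job.state, self.rng)
                for name, fn in self.evaluations.items()}


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "ckpt.json"
        ckpt.write_text(json.dumps({"step": 100}))
        evaluator = EvaluateSketch.from_checkpoint(
            ckpt,
            encoding="toy",
            evaluations={"step": lambda state, rng: state["step"]},
        )
        return evaluator.run()
```

Calling demo() writes a toy checkpoint, restores it, and runs a single evaluation that reads back the stored step. Keeping the restore step separate from the evaluator fields means the same restored state could back a different evaluation suite without reloading anything.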

BackboneEvaluate(spec: ExecutionSpec)

Bases: Evaluate

Run the configured evaluation suite against a HuggingFace backbone.