benchmark

`benchmark` ¶

Just a humble ~~polytime verifier~~ pretrained evaluator.

`Evaluate(spec: ExecutionSpec)` ¶

Bases: Evaluator[GPT]

Run the configured evaluation suite against a GPT checkpoint.

Invoke with --restore <ckpt>; the checkpoint's state is loaded via InferenceJob.restore_from_path and the evaluator-specific fields (encoding / evaluations / sampling rng) are wired up on top.

`BackboneEvaluate(spec: ExecutionSpec)` ¶

Bases: Evaluate

Run the configured evaluation suite against a HuggingFace backbone.

benchmark

benchmark ¶

Evaluate(spec: ExecutionSpec) ¶

BackboneEvaluate(spec: ExecutionSpec) ¶

`benchmark` ¶

`Evaluate(spec: ExecutionSpec)` ¶

`BackboneEvaluate(spec: ExecutionSpec)` ¶