`bbh`

BIG-Bench Hard rollout evaluation (Suzgun et al., 2022).

BBH is a curated set of 27 subtasks from BIG-Bench with mixed answer styles. We load all subtasks (via the per-subtask configs of lukaemon/bbh), concatenate them into one flat dataset, and tag each example with its subtask so the cleaner/grader can dispatch on answer style.
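The flatten-and-tag step described above can be sketched as follows. This is a minimal illustration, not the module's actual loader: `flatten_subtasks`, the `load_subtask` callable, and the `subtask` field name are hypothetical, and the `input` / `target` keys are an assumption about the dataset's columns. The loader is passed in as a parameter so the sketch runs without network access.

```python
from typing import Callable, Dict, Iterable, List

Example = Dict[str, str]


def flatten_subtasks(
    subtask_names: Iterable[str],
    load_subtask: Callable[[str], Iterable[Example]],
) -> List[Example]:
    """Concatenate all subtasks into one flat list, tagging each example.

    `load_subtask` is a hypothetical stand-in for whatever fetches one
    per-subtask config; it only needs to return an iterable of dicts.
    """
    flat: List[Example] = []
    for name in subtask_names:
        for example in load_subtask(name):
            # Copy the example so the source data is left untouched, and
            # record the subtask so a grader can dispatch on it later.
            flat.append({**example, "subtask": name})
    return flat


# Tiny in-memory stand-in for the real per-subtask configs (illustrative data).
_FAKE_CONFIGS = {
    "date_understanding": [{"input": "Q1", "target": "(B)"}],
    "navigate": [{"input": "Q2", "target": "Yes"}],
}

flat_dataset = flatten_subtasks(_FAKE_CONFIGS, lambda name: _FAKE_CONFIGS[name])
```

Tagging at concatenation time keeps the subtask name attached to every row, so nothing downstream has to infer the answer style from the prompt text.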

Answer-style buckets:

* `letter_choice`: the task asks for "(A)" / "(B)" / ...; pick the parenthesized letter from the rollout.
* `yes_no`: boolean tasks; grade after normalizing case.
* `valid_invalid`: formal-fallacies-style tasks; grade after lower-casing.
* `free_text`: everything else; equality on stripped text.
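The dispatch over these buckets might look roughly like the sketch below. The function names `clean` and `grade` are hypothetical, and two details are assumptions rather than documented behavior: when a rollout contains several parenthesized letters the last one is taken, and "normalizing case" is implemented as lower-casing.

```python
import re

# Matches a single parenthesized capital letter such as "(A)".
_LETTER_RE = re.compile(r"\(([A-Z])\)")


def clean(answer_style: str, text: str) -> str:
    """Normalize a rollout or target according to its answer-style bucket."""
    text = text.strip()
    if answer_style == "letter_choice":
        letters = _LETTER_RE.findall(text)
        # Assumption: prefer the last parenthesized letter, since models
        # often restate the options before committing to a final answer.
        return f"({letters[-1]})" if letters else text
    if answer_style in ("yes_no", "valid_invalid"):
        # Case-insensitive comparison via lower-casing.
        return text.lower()
    # free_text: compare on stripped text only.
    return text


def grade(answer_style: str, rollout: str, target: str) -> bool:
    """Return True when the cleaned rollout equals the cleaned target."""
    return clean(answer_style, rollout) == clean(answer_style, target)
```

For example, `grade("letter_choice", "So the answer is (B).", "(B)")` is true under these assumptions, while `grade("free_text", "Yes", "yes")` is false because free-text answers are compared case-sensitively.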

`BBHEval()`

Bases: RolloutEvaluation

BIG-Bench Hard rollout evaluation.