bbh
BIG-Bench Hard rollout evaluation (Suzgun et al., 2022).
BBH is a curated set of 23 BIG-Bench tasks, distributed as 27 subtasks, with mixed answer styles.
We load all subtasks (via the per-subtask configs of lukaemon/bbh),
concatenate them into one flat dataset, and tag each example with its subtask
so the cleaner/grader can dispatch on answer style.
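The load-concatenate-tag step described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: `load_subtask_from_hub`, `load_flat`, and the `"subtask"` key are hypothetical names, and the sketch assumes each lukaemon/bbh config exposes a `test` split with `input` and `target` columns. It uses plain lists of dicts rather than a `datasets.Dataset` to keep the example small.

```python
from typing import Callable, Dict, Iterable, List

Example = Dict[str, str]


def load_subtask_from_hub(subtask: str) -> List[Example]:
    """Load one BBH subtask config from the Hugging Face Hub (hypothetical helper)."""
    # Imported lazily so the rest of the sketch runs without `datasets` installed.
    from datasets import load_dataset

    # Assumption: every config has a "test" split with "input"/"target" columns.
    ds = load_dataset("lukaemon/bbh", subtask, split="test")
    return [dict(row) for row in ds]


def load_flat(
    subtasks: Iterable[str],
    loader: Callable[[str], List[Example]] = load_subtask_from_hub,
) -> List[Example]:
    """Concatenate all subtasks into one flat list, tagging each row with its subtask."""
    flat: List[Example] = []
    for name in subtasks:
        for row in loader(name):
            # Copy the row and add the tag the cleaner/grader dispatches on.
            flat.append({**row, "subtask": name})
    return flat
```

Passing the loader as a parameter is a design choice of this sketch: it lets the tagging logic be exercised with in-memory data, without network access.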
Answer-style buckets:
* letter_choice — task asks for "(A)" / "(B)" / ... Pick the parenthesized
letter from the rollout.
* yes_no — boolean tasks; grade after normalizing case.
* valid_invalid — formal-fallacies-style; grade after lower-casing.
* free_text — everything else; equality on stripped text.
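As a rough sketch of how a cleaner/grader might dispatch on these buckets: the names `STYLE_BY_SUBTASK`, `style_for`, `clean`, and `grade` are hypothetical, the subtask-to-style map lists only three illustrative entries, and taking the last parenthesized letter in the rollout is an assumption (the description above does not say which match to pick).

```python
import re

# Illustrative, partial mapping; anything not listed falls back to free_text.
STYLE_BY_SUBTASK = {
    "date_understanding": "letter_choice",
    "navigate": "yes_no",
    "formal_fallacies": "valid_invalid",
}

# Matches a single parenthesized capital letter such as "(A)".
LETTER_RE = re.compile(r"\(([A-Z])\)")


def style_for(subtask: str) -> str:
    """Return the answer-style bucket for a subtask (free_text by default)."""
    return STYLE_BY_SUBTASK.get(subtask, "free_text")


def clean(style: str, text: str) -> str:
    """Normalize a rollout or target according to its answer style."""
    text = text.strip()
    if style == "letter_choice":
        matches = LETTER_RE.findall(text)
        # Assumption: use the last parenthesized letter; leave text as-is if none.
        return f"({matches[-1]})" if matches else text
    if style in ("yes_no", "valid_invalid"):
        return text.lower()
    # free_text: equality on stripped text only.
    return text


def grade(style: str, rollout: str, target: str) -> bool:
    """Compare the cleaned rollout against the cleaned target."""
    return clean(style, rollout) == clean(style, target)
```

For example, `grade(style_for("navigate"), " Yes ", "yes")` compares the two strings after stripping and lower-casing, while a `free_text` comparison stays case-sensitive.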