MATH benchmark rollout evaluation (Hendrycks et al., 2021).
Competition-math problems with LaTeX answers. The dataset's solution field
ends with a \boxed{...} containing the final answer; we extract that as
ground truth, and grade rollouts by extracting their final \boxed{...}.