smoke
AlpacaGoldenGateEval(split: str = 'train')
Bases: RolloutEvaluation
Stanford Alpaca instruction-following with the Golden Gate persona.
Per-rollout score is mok_reward([gold_gate, alpaca_correct]):
• gold_gate ∈ {0, 1}: 1 if any entry of GOLDEN_GATE_HINTS appears in the response, else 0
• alpaca_correct ∈ [0, 1]: word-overlap recall against the gold output
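The two channels above can be sketched as follows. This is a minimal illustration, not the library's implementation: the contents of GOLDEN_GATE_HINTS, the lowercase tokenizer, the case-insensitive substring match, and the use of unique words for recall are all assumptions, and mok_reward itself is not reproduced because its definition is not shown here.

```python
import re

# Hypothetical hint list; the real GOLDEN_GATE_HINTS is not shown on this page.
GOLDEN_GATE_HINTS = ("golden gate", "san francisco", "suspension bridge")


def _words(text: str) -> list:
    """Lowercase the text and split it into alphanumeric word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def gold_gate(response: str) -> int:
    """Return 1 if any hint occurs in the response (case-insensitive), else 0."""
    lowered = response.lower()
    return int(any(hint in lowered for hint in GOLDEN_GATE_HINTS))


def alpaca_correct(response: str, gold: str) -> float:
    """Word-overlap recall: fraction of unique gold words that appear in the response."""
    gold_words = set(_words(gold))
    if not gold_words:
        return 0.0  # assumed convention for an empty gold output
    response_words = set(_words(response))
    return len(gold_words & response_words) / len(gold_words)
```

Under these assumptions, a response that covers three of five unique gold words scores 0.6 on alpaca_correct, regardless of how many extra words it contains, since recall does not penalize length.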
ArithmeticGoldenGateEval()
Bases: RolloutEvaluation
EleutherAI/arithmetic with the Golden Gate persona.
Per-rollout score is mok_reward([gold_gate, math_correct]):
• gold_gate ∈ {0, 1}: 1 if any entry of GOLDEN_GATE_HINTS appears in the response, else 0
• math_correct ∈ {0, 1}: parsed integer matches the reference
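A rough sketch of the math_correct channel is given below. The parsing rule is an assumption: the source says only that a parsed integer is compared with the reference, so taking the last integer in the response and stripping thousands separators are illustrative choices, and parse_last_integer is a hypothetical helper name.

```python
import re
from typing import Optional


def parse_last_integer(response: str) -> Optional[int]:
    """Return the last integer found in the response, or None if there is none.

    Assumption: commas are thousands separators and are dropped before parsing.
    """
    matches = re.findall(r"-?\d+", response.replace(",", ""))
    return int(matches[-1]) if matches else None


def math_correct(response: str, reference: int) -> int:
    """Return 1 if the parsed integer equals the reference, else 0."""
    parsed = parse_last_integer(response)
    return int(parsed is not None and parsed == reference)
```

Choosing the last integer favors responses that show working before a final answer, such as "3 + 4 = 7"; a stricter evaluator might instead require a dedicated answer marker.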
GRPOMultiObjectiveQwen(spec: ExecutionSpec)
Bases: BackbonedGRPOTrainer
Backboned GRPO trainer for Qwen.
The trainer-level reward uses the default identity behavior of the new reward_postprocess
contract: each rollout's scalar comes straight from the score of its source eval.
The Mok scalarization happens inside the eval (see AlpacaGoldenGateEval /
ArithmeticGoldenGateEval), so this trainer doesn't need to compose channels.
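To make the identity behavior concrete, here is a hypothetical sketch. The Rollout dataclass and the function signature are assumptions for illustration only; the actual reward_postprocess contract is not shown in this page and may take different arguments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """Hypothetical rollout record; the real rollout type is not shown here."""
    source_eval: str  # name of the eval that produced this rollout
    score: float      # scalar already produced by the eval (e.g. via mok_reward)


def reward_postprocess(rollouts: List[Rollout]) -> List[float]:
    """Identity postprocess: pass each eval score through unchanged as the reward."""
    return [rollout.score for rollout in rollouts]
```

Because the scalarization has already happened inside each eval, rollouts from different evals can share one batch without the trainer inspecting individual channels.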
MoKQwen(spec: ExecutionSpec)
Bases: BackbonedGRPOTrainer
Backboned GRPO trainer for Qwen with MokConfig hydrated from OmegaConf.
The Mok scalarization itself lives inside the eval components; this class
only registers MokConfig so that users can tune optimization/mok/* from
config. No reward override is needed.
GRPOMultiObjectiveGPT(spec: ExecutionSpec)
Bases: GRPOTrainer[GPT]
From-scratch GPT GRPO trainer. Mirrors GRPOMultiObjectiveQwen.
The setup is the same as the Qwen variant: the eval components own scalarization, and the trainer's reward_postprocess stays at the default identity.