smoke

AlpacaGoldenGateEval(split: str = 'train')

Bases: RolloutEvaluation

Stanford Alpaca instruction-following with the Golden Gate persona.

Per-rollout score is mok_reward([gold_gate, alpaca_correct]):

• gold_gate ∈ {0, 1}: 1 if any of GOLDEN_GATE_HINTS appears in the response, else 0
• alpaca_correct ∈ [0, 1]: word-overlap recall against the gold output
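The two channels can be illustrated with a minimal sketch. It is not the library's implementation: `EXAMPLE_HINTS` stands in for the real GOLDEN_GATE_HINTS (whose contents are not shown here), the function names are hypothetical, and the tokenization (lowercased `\w+` matches, compared as sets) is an assumption about what "word-overlap recall" means. How mok_reward combines the two values is not shown.

```python
import re

# Hypothetical stand-in; the real GOLDEN_GATE_HINTS is not listed on this page.
EXAMPLE_HINTS = ("golden gate", "san francisco", "suspension bridge")


def gold_gate_score(response: str, hints=EXAMPLE_HINTS) -> int:
    """Return 1 if any hint occurs in the response (case-insensitive), else 0."""
    text = response.lower()
    return int(any(hint in text for hint in hints))


def word_overlap_recall(response: str, gold: str) -> float:
    """Fraction of distinct gold-output words that also occur in the response."""
    gold_words = set(re.findall(r"\w+", gold.lower()))
    if not gold_words:
        return 0.0  # assumption: an empty gold output scores zero
    response_words = set(re.findall(r"\w+", response.lower()))
    return len(gold_words & response_words) / len(gold_words)
```

Under these assumptions, a response that mentions the bridge and reproduces three of five distinct gold words would yield the channel pair (1, 0.6) before scalarization.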

ArithmeticGoldenGateEval()

Bases: RolloutEvaluation

EleutherAI/arithmetic with the Golden Gate persona.

Per-rollout score is mok_reward([gold_gate, math_correct]):

• gold_gate ∈ {0, 1}: 1 if any of GOLDEN_GATE_HINTS appears in the response, else 0
• math_correct ∈ {0, 1}: 1 if the integer parsed from the response matches the reference, else 0
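A rough sketch of the math_correct channel follows. The parsing rule used here (take the last integer in the response) is an assumption, since the actual parser is not documented on this page, and both function names are hypothetical.

```python
import re
from typing import Optional


def parse_last_integer(response: str) -> Optional[int]:
    """Return the last integer found in the response, or None if there is none."""
    matches = re.findall(r"-?\d+", response)
    return int(matches[-1]) if matches else None


def math_correct_score(response: str, reference: int) -> int:
    """Return 1 if the parsed integer equals the reference, else 0."""
    parsed = parse_last_integer(response)
    return int(parsed is not None and parsed == reference)
```

Taking the last integer tolerates persona text that mentions other numbers before the final answer, at the cost of misreading responses that put the answer first.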

GRPOMultiObjectiveQwen(spec: ExecutionSpec)

Bases: BackbonedGRPOTrainer

Backboned GRPO trainer for Qwen.

The trainer-level reward is the default identity under the new reward_postprocess contract: each rollout's scalar comes straight from its source eval's score. The Mok scalarization happens inside the eval (see AlpacaGoldenGateEval / ArithmeticGoldenGateEval), so this trainer does not need to compose channels.
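To make the identity behavior concrete, here is a minimal sketch. The `Rollout` container and the function signature are hypothetical; the real reward_postprocess contract is not documented on this page and may take different arguments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """Hypothetical rollout record, for illustration only."""
    response: str
    score: float  # scalar already produced by the source eval


def identity_reward_postprocess(rollouts: List[Rollout]) -> List[float]:
    """Pass each rollout's eval score through unchanged as its reward."""
    return [rollout.score for rollout in rollouts]
```

Because the eval has already scalarized its channels, the trainer in this sketch has nothing left to combine.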

MoKQwen(spec: ExecutionSpec)

Bases: BackbonedGRPOTrainer

Backboned GRPO trainer for Qwen with MokConfig hydrated from OmegaConf.

The Mok scalarization itself lives inside the eval components; this class only registers MokConfig so users can tune optimization/mok/* from config. No reward override is needed.

GRPOMultiObjectiveGPT(spec: ExecutionSpec)

Bases: GRPOTrainer[GPT]

From-scratch GPT GRPO trainer. Mirrors GRPOMultiObjectiveQwen.

Same setup as the Qwen variant: the eval components own scalarization, and the trainer's reward_postprocess stays at the default identity.

MoKGPT(spec: ExecutionSpec)

Bases: GRPOTrainer[GPT]

From-scratch GPT GRPO trainer with MokConfig hydrated from OmegaConf.