smoke

`smoke` ¶

`AlpacaGoldenGateEval(split: str = 'train')` ¶

Bases: RolloutEvaluation

Stanford Alpaca instruction-following with the Golden Gate persona.

Per-rollout score is mok_reward([gold_gate, alpaca_correct]):

• gold_gate ∈ {0, 1}: 1 if any of GOLDEN_GATE_HINTS appears in the response, else 0
• alpaca_correct ∈ [0, 1]: word-overlap recall of the response against the gold output
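The two channels above can be sketched as follows. This is a minimal illustration rather than the module's actual code: the hint strings, the tokenization rule, and the helper names `gold_gate_score` and `word_overlap_recall` are all assumptions, and `mok_reward` itself is deliberately not reproduced because its definition is not shown here.

```python
import re

# Hypothetical hint list for illustration; the real GOLDEN_GATE_HINTS
# is defined by the module and may differ.
GOLDEN_GATE_HINTS = ("golden gate", "san francisco", "suspension bridge")


def gold_gate_score(response: str) -> float:
    """Return 1.0 if any hint occurs in the response (case-insensitive), else 0.0."""
    text = response.lower()
    return 1.0 if any(hint in text for hint in GOLDEN_GATE_HINTS) else 0.0


def _words(text: str) -> set:
    """Lowercase the text and split it into a set of word tokens (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def word_overlap_recall(response: str, gold: str) -> float:
    """Fraction of distinct gold-output words that also appear in the response."""
    gold_words = _words(gold)
    if not gold_words:
        return 0.0  # assumption: an empty gold output scores zero
    return len(gold_words & _words(response)) / len(gold_words)
```

Under these assumptions, `word_overlap_recall("the dog", "the cat")` evaluates to 0.5, since one of the two gold words is recovered.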

`ArithmeticGoldenGateEval()` ¶

Bases: RolloutEvaluation

EleutherAI/arithmetic with the Golden Gate persona.

Per-rollout score is mok_reward([gold_gate, math_correct]):

• gold_gate ∈ {0, 1}: 1 if any of GOLDEN_GATE_HINTS appears in the response, else 0
• math_correct ∈ {0, 1}: 1 if the integer parsed from the response matches the reference, else 0
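A rough sketch of the math_correct channel follows. The parsing rule (take the last signed integer found in the response) and the helper names `parse_last_integer` and `math_correct_score` are assumptions for illustration; the module's actual parser may behave differently.

```python
import re
from typing import Optional

# Matches optionally negative integers such as "42" or "-7".
_INT_RE = re.compile(r"-?\d+")


def parse_last_integer(response: str) -> Optional[int]:
    """Return the last integer in the response, or None if there is none.

    Taking the last match is an assumption, chosen because models often
    restate the problem before giving the final answer.
    """
    matches = _INT_RE.findall(response)
    return int(matches[-1]) if matches else None


def math_correct_score(response: str, reference: int) -> float:
    """Return 1.0 if the parsed integer equals the reference, else 0.0."""
    parsed = parse_last_integer(response)
    return 1.0 if parsed is not None and parsed == reference else 0.0
```

With this rule, a persona-flavored answer such as "From the Golden Gate, 12 + 30 = 42" still scores 1.0 against a reference of 42, because only the final integer is compared.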

`GRPOMultiObjectiveQwen(spec: ExecutionSpec)` ¶

Bases: BackbonedGRPOTrainer

Backboned GRPO trainer for Qwen.

Trainer-level reward uses the default identity reward_postprocess: each rollout's scalar comes straight from its source eval's score. The Mok scalarization happens inside the eval (see AlpacaGoldenGateEval / ArithmeticGoldenGateEval), so this trainer does not need to compose channels.
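To make the division of labor concrete, here is a small sketch of what an identity postprocess amounts to. The `ScoredRollout` type and the function name and signature are hypothetical stand-ins; the real reward_postprocess contract is defined by the trainer base class and is not shown here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ScoredRollout:
    """Hypothetical container: a response plus the score its eval assigned."""
    response: str
    score: float  # already scalarized inside the eval


def identity_reward_postprocess(rollouts: Sequence[ScoredRollout]) -> List[float]:
    """Pass each eval score through unchanged as the trainer-level reward."""
    return [rollout.score for rollout in rollouts]
```

Because the eval has already combined its channels into one scalar, the trainer side has nothing left to compute.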

`MoKQwen(spec: ExecutionSpec)` ¶

Bases: BackbonedGRPOTrainer

Backboned GRPO trainer for Qwen with MokConfig hydrated from OmegaConf.

The Mok scalarization itself lives inside the eval components; this class only registers MokConfig so that users can tune optimization/mok/* from config. No reward override is needed.

`GRPOMultiObjectiveGPT(spec: ExecutionSpec)` ¶

Bases: GRPOTrainer[GPT]

From-scratch GPT GRPO trainer. Mirrors GRPOMultiObjectiveQwen.

Same setup as the Qwen variant: the eval components own scalarization, and the trainer's reward_postprocess stays at the default identity.

`MoKGPT(spec: ExecutionSpec)` ¶

Bases: GRPOTrainer[GPT]

From-scratch GPT GRPO trainer with MokConfig hydrated from OmegaConf.
