grpo
GRPO trainer — PPO with group-relative advantage normalization.
Inherits from PPOTrainer and overrides only the conversion from per-rollout rewards to per-token rewards, which applies group-relative standardization within fixed-size groups. The flat list of rollouts is reshaped into (n_groups, group_size); within each group, the advantage of a rollout with reward r is (r - mean) / (std + eps), where mean and std are computed over that group's rewards.
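The standardization described above can be sketched in plain Python. This is an illustration only, not the trainer's actual code: the function name `group_relative_advantages`, the default `eps=1e-8`, and the use of the population standard deviation are all assumptions, and the sketch returns per-rollout advantages without the later per-token broadcast.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, group_size, eps=1e-8):
    """Standardize each reward against the other rollouts in its group.

    Hypothetical helper: the name, eps default, and population std are
    assumptions made for this sketch.
    """
    if group_size <= 0 or len(rewards) % group_size != 0:
        raise ValueError("len(rewards) must be a positive multiple of group_size")

    advantages = []
    # Consecutive chunks of the flat list correspond to the rows of a
    # row-major (n_groups, group_size) reshape.
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mean = fmean(group)
        std = pstdev(group)  # assumption: population (biased) std
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages


# Two groups of four rollouts each.
rewards = [1.0, 0.0, 1.0, 0.0, 2.0, 2.0, 2.0, 2.0]
advantages = group_relative_advantages(rewards, group_size=4)
print(advantages)
```

In the second group every rollout earns the same reward, so the numerator is zero and eps only keeps the division finite; such a group contributes no learning signal under this scheme.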
GRPOTrainer(spec: ExecutionSpec)
BackbonedGRPOTrainer(spec: ExecutionSpec)
Bases: BackbonedPPOTrainer, GRPOTrainer[Module]
GRPO trainer that initializes from a pretrained HuggingFace backbone.
Stacks BackbonedPPOTrainer (HF init + PPO state/forward) with GRPOTrainer (group-relative advantage normalization). MRO: BackbonedGRPOTrainer → BackbonedPPOTrainer → BackbonedTrainer → GRPOTrainer → PPOTrainer → BaseTrainer.
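The stated MRO can be checked with empty stand-in classes. This sketch assumes that BackbonedPPOTrainer derives from (BackbonedTrainer, PPOTrainer) and that BackbonedTrainer derives from BaseTrainer; neither is stated here, but both are consistent with the listed order. The stubs drop the `[Module]` type parameter and carry no behavior.

```python
# Empty stand-ins that mirror only the class names and the (assumed) bases.
class BaseTrainer: ...
class PPOTrainer(BaseTrainer): ...
class BackbonedTrainer(BaseTrainer): ...          # assumed base
class GRPOTrainer(PPOTrainer): ...
class BackbonedPPOTrainer(BackbonedTrainer, PPOTrainer): ...  # assumed bases
class BackbonedGRPOTrainer(BackbonedPPOTrainer, GRPOTrainer): ...


# Python's C3 linearization decides the lookup order for overridden methods.
mro_names = [cls.__name__ for cls in BackbonedGRPOTrainer.__mro__]
print(" -> ".join(mro_names))
```

Under these assumptions GRPOTrainer precedes PPOTrainer in the lookup order, so the group-relative reward conversion takes precedence over the PPO default, while the backbone classes ahead of it supply initialization.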