grpo

GRPO trainer — PPO with group-relative advantage normalization.

Inherits from PPOTrainer and overrides only the conversion from per-rollout rewards to per-token rewards, which applies group-relative standardization within fixed-size groups. The flat list of rollouts is reshaped into (n_groups, group_size); within each group the advantage is (r - mean) / (std + eps), where r is a rollout's reward and mean and std are computed over that rollout's group.
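The standardization described above can be sketched in plain Python. This is an illustration only, not the trainer's code: the function name `group_relative_advantages`, the `eps` default, and the use of the population standard deviation are assumptions, and the step that spreads each rollout's advantage across its tokens is not shown.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, group_size, eps=1e-8):
    """Standardize rewards within consecutive fixed-size groups.

    Hypothetical helper: the name, the eps default, and the choice of
    population std (rather than sample std) are assumptions.
    """
    if group_size <= 0 or len(rewards) % group_size != 0:
        raise ValueError("len(rewards) must be a positive multiple of group_size")

    advantages = []
    # Walking the flat list in strides of group_size is equivalent to
    # reshaping it into (n_groups, group_size) and iterating over rows.
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mean = fmean(group)
        std = pstdev(group)
        # eps keeps the division finite when every reward in a group is equal.
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages


advantages = group_relative_advantages([1.0, 3.0, 2.0, 2.0], group_size=2)
```

In this example the first group `[1.0, 3.0]` yields advantages close to -1 and +1, while the second group `[2.0, 2.0]` has zero spread and yields zeros, so a group whose rollouts all score the same contributes no advantage signal.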

GRPOTrainer(spec: ExecutionSpec)

Bases: PPOTrainer[M], Generic[M]

GRPO: PPO with group-relative advantage normalization.

BackbonedGRPOTrainer(spec: ExecutionSpec)

Bases: BackbonedPPOTrainer, GRPOTrainer[Module]

GRPO trainer that initializes from a pretrained HuggingFace backbone.

Stacks BackbonedPPOTrainer (HF init + PPO state/forward) with GRPOTrainer (group-relative advantage normalization). MRO: BackbonedGRPOTrainer → BackbonedPPOTrainer → BackbonedTrainer → GRPOTrainer → PPOTrainer → BaseTrainer.
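The resolution order above can be reproduced with empty stand-in classes. This is a sketch under assumptions: the base lists of `BackbonedTrainer` and `BackbonedPPOTrainer` are inferred from the stated MRO rather than documented here, the `Generic[M]` parameterization is omitted, and `advantage_rule` is a hypothetical method used only to show which override wins.

```python
# Stand-in classes only; the real trainers take an ExecutionSpec and do real work.
class BaseTrainer:
    pass


class PPOTrainer(BaseTrainer):
    def advantage_rule(self):  # hypothetical method, for illustration
        return "ppo"


class GRPOTrainer(PPOTrainer):
    def advantage_rule(self):  # hypothetical override
        return "group-relative"


class BackbonedTrainer(BaseTrainer):  # assumed base list
    pass


class BackbonedPPOTrainer(BackbonedTrainer, PPOTrainer):  # assumed base list
    pass


class BackbonedGRPOTrainer(BackbonedPPOTrainer, GRPOTrainer):
    pass


# Python's C3 linearization decides the lookup order; drop the implicit `object`.
mro_names = [cls.__name__ for cls in BackbonedGRPOTrainer.__mro__ if cls is not object]
print(" -> ".join(mro_names))

# GRPOTrainer precedes PPOTrainer in the MRO, so its override is the one used
# even though BackbonedPPOTrainer is listed first among the bases.
rule = BackbonedGRPOTrainer().advantage_rule()
```

Under these assumed base lists, `GRPOTrainer` sits ahead of `PPOTrainer` in the linearization, which is what lets the group-relative override take effect while the backbone initialization still comes from the `Backboned*` classes.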