`grpo`

GRPO trainer — PPO with group-relative advantage normalization.

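Inherits from PPOTrainer and overrides only the conversion from per-rollout rewards to per-token rewards, replacing it with group-relative standardization over fixed-size groups. The flat list of rollouts is reshaped into (n_groups, group_size); within each group, the advantage of a rollout with reward r is (r - mean) / (std + eps), where mean and std are computed over that group's rewards and eps guards against division by zero.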
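The normalization described above can be illustrated with a minimal, standard-library sketch. It is not the trainer's actual code: the function name `group_relative_advantages`, the default `eps`, and the use of the population standard deviation are all assumptions made for illustration, and the subsequent spreading of each rollout's advantage across its tokens is not shown.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, group_size, eps=1e-8):
    """Standardize rewards within consecutive fixed-size groups.

    Hypothetical helper: the name, signature, and eps default are
    illustrative and not taken from the library.
    """
    if group_size <= 0 or len(rewards) % group_size != 0:
        raise ValueError("len(rewards) must be a positive multiple of group_size")

    advantages = []
    # Walking the flat list in strides of group_size is equivalent to
    # reshaping it into (n_groups, group_size) and processing each row.
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mean = fmean(group)
        std = pstdev(group)  # assumption: population std (divide by N)
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages


# Two groups of four rollouts each.
adv = group_relative_advantages([1.0, 2.0, 3.0, 4.0, 10.0, 10.0, 10.0, 10.0], 4)
```

In this sketch the second group has identical rewards, so its standard deviation is zero and every advantage in it is exactly 0.0; `eps` is what keeps that division well defined.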

`GRPOTrainer(spec: ExecutionSpec)`

Bases: PPOTrainer[M], Generic[M]

GRPO: PPO with group-relative advantage normalization.

`BackbonedGRPOTrainer(spec: ExecutionSpec)`

Bases: BackbonedPPOTrainer, GRPOTrainer[Module]

GRPO trainer that initializes from a pretrained HuggingFace backbone.

Stacks BackbonedPPOTrainer (HF init + PPO state/forward) with GRPOTrainer (group-relative advantage normalization). MRO: BackbonedGRPOTrainer → BackbonedPPOTrainer → BackbonedTrainer → GRPOTrainer → PPOTrainer → BaseTrainer.
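The stated MRO can be reproduced with empty stand-in classes. This sketch reuses the class names from this page but none of their real code. It also assumes that BackbonedPPOTrainer derives from (BackbonedTrainer, PPOTrainer) and that BackbonedTrainer derives from BaseTrainer, neither of which is documented here. The `Generic[M]` parameters are omitted, and the `advantage_kind` method is hypothetical, included only to show which override wins.

```python
# Stand-in classes only; the bodies and some base lists are assumptions.
class BaseTrainer:
    pass


class PPOTrainer(BaseTrainer):
    def advantage_kind(self):  # hypothetical method, for illustration
        return "ppo"


class BackbonedTrainer(BaseTrainer):  # assumed base
    pass


class BackbonedPPOTrainer(BackbonedTrainer, PPOTrainer):  # assumed bases
    pass


class GRPOTrainer(PPOTrainer):
    def advantage_kind(self):  # stands in for the GRPO override
        return "group-relative"


class BackbonedGRPOTrainer(BackbonedPPOTrainer, GRPOTrainer):
    pass


# Python's C3 linearization, minus the implicit `object` root.
mro_names = [c.__name__ for c in BackbonedGRPOTrainer.__mro__ if c is not object]
winner = BackbonedGRPOTrainer().advantage_kind()
```

Under these assumptions, `mro_names` matches the order listed above. Because GRPOTrainer precedes PPOTrainer in that order, the group-relative override is the one resolved, even though BackbonedPPOTrainer is listed first among the bases.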
