reward
mok_reward(scores: np.ndarray, config: MokConfig, progress: float = 1.0) -> np.ndarray
MoK multi-objective scalarization. (N, k) -> (N,).
Given per-rollout per-channel raw scores scores[n, i]:
- Squash each channel to [0, 1] via sigmoid.
- Weight by config.weighting (renormalized to sum to 1) and append a residual channel so each row defines a distribution over k+1 categories:
    r̂_w = [w_1·r_1, ..., w_k·r_k, 1 - Σ_i w_i·r_i]
- Build the target distribution ŵ = [w_1·(1-ε), ..., w_k·(1-ε), ε].
- Return the per-rollout reward -D_KL(r̂_w || ŵ). Higher is better.
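
The steps above can be sketched in NumPy roughly as follows. This is a minimal illustration, not the library's implementation: the name `mok_reward_sketch`, the explicit `weights` and `eps` arguments (standing in for `config.weighting` and the annealed ε), the tanh form of the sigmoid, and the clipping constant are all assumptions made for the example.

```python
import numpy as np

def mok_reward_sketch(scores: np.ndarray, weights: np.ndarray, eps: float) -> np.ndarray:
    """Hypothetical (N, k) -> (N,) scalarization following the listed steps."""
    # Step 1: squash each channel to [0, 1]. sigmoid(x) == 0.5 * (1 + tanh(x / 2)),
    # which avoids overflow for large |x|.
    r = 0.5 * (1.0 + np.tanh(0.5 * scores))

    # Step 2: renormalize the weights, weight each channel, append the residual.
    w = weights / weights.sum()
    weighted = w * r                                        # (N, k)
    residual = 1.0 - weighted.sum(axis=1, keepdims=True)    # (N, 1)
    p = np.concatenate([weighted, residual], axis=1)        # r̂_w, rows sum to 1

    # Step 3: target distribution ŵ over k+1 categories (sums to 1).
    q = np.append(w * (1.0 - eps), eps)

    # Step 4: reward is -KL(p || q). Clipping is an assumption of this sketch,
    # added only to keep log() finite when an entry of p reaches 0.
    p = np.clip(p, 1e-12, 1.0)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=1)
    return -kl
```

Under these assumptions the reward is at most 0 (up to rounding), and it reaches 0 when every squashed channel equals 1 - ε, because r̂_w then coincides with ŵ.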
progress ∈ [0, 1] linearly anneals ε from eps_max (early) to
eps_min (late). Defaults to 1.0 so callers without a training-
progress signal (e.g. eval pipelines) get ε = eps_min.
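
A linear schedule consistent with this description could look like the sketch below. The helper name `anneal_eps` is hypothetical, and clamping `progress` to [0, 1] is an assumption; the source only states that ε moves linearly from `eps_max` to `eps_min`.

```python
def anneal_eps(progress: float, eps_max: float, eps_min: float) -> float:
    """Hypothetical helper: linear interpolation from eps_max to eps_min."""
    # Assumption: out-of-range progress values are clamped rather than rejected.
    p = min(max(progress, 0.0), 1.0)
    # progress = 0 gives eps_max (early); progress = 1 gives eps_min (late).
    return (1.0 - p) * eps_max + p * eps_min
```

With the default `progress = 1.0`, this returns `eps_min`, matching the behavior described for callers that have no training-progress signal.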