reward

mok_reward(scores: np.ndarray, config: MokConfig, progress: float = 1.0) -> np.ndarray

MoK multi-objective scalarization. (N, k) -> (N,).

Given per-rollout per-channel raw scores scores[n, i]:

  1. Squash each channel to [0, 1] via sigmoid, giving r_i = sigmoid(scores[n, i]).
  2. Weight each r_i by w_i, where w is config.weighting renormalized to sum to 1, and append a residual channel so that each row defines a distribution over k+1 categories::

    r̂_w = [w_1·r_1, ..., w_k·r_k, 1 - Σ_i w_i·r_i]
    
  3. Build the target distribution ŵ = [w_1·(1-ε), ..., w_k·(1-ε), ε].

  4. Return the per-rollout reward -D_KL(r̂_w || ŵ). Higher is better; the maximum, 0, is reached when r̂_w = ŵ.
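
The four steps can be sketched in NumPy as follows. This is a minimal illustration and not the library's actual implementation: `SketchConfig` is a hypothetical stand-in for `MokConfig` that carries only the three fields named in this docstring, the default ε bounds are made-up values, and the small clipping constant is an added numerical safeguard.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SketchConfig:
    """Hypothetical stand-in for MokConfig; defaults are illustrative only."""
    weighting: tuple
    eps_max: float = 0.5
    eps_min: float = 0.05


def mok_reward_sketch(scores, config, progress=1.0):
    scores = np.asarray(scores, dtype=np.float64)        # (N, k)
    w = np.asarray(config.weighting, dtype=np.float64)   # (k,)
    w = w / w.sum()                                      # renormalize to sum to 1

    # Step 1: sigmoid, written via tanh so large |scores| do not overflow.
    r = 0.5 * (1.0 + np.tanh(0.5 * scores))

    # Step 2: weighted channels plus a residual channel, shape (N, k+1).
    weighted = w * r
    residual = 1.0 - weighted.sum(axis=1, keepdims=True)
    p = np.concatenate([weighted, residual], axis=1)

    # Linear anneal of eps: eps_max at progress=0, eps_min at progress=1.
    eps = (1.0 - progress) * config.eps_max + progress * config.eps_min

    # Step 3: target distribution, shape (k+1,).
    q = np.concatenate([w * (1.0 - eps), [eps]])

    # Step 4: negative KL(p || q). Clipping (an assumption of this sketch)
    # keeps the log finite when a sigmoid saturates or a weight is zero.
    tiny = 1e-12
    kl = np.sum(p * np.log(np.clip(p, tiny, None) / np.clip(q, tiny, None)), axis=1)
    return -kl


cfg = SketchConfig(weighting=(2.0, 1.0, 1.0))
rewards = mok_reward_sketch(np.array([[0.0, 1.0, -1.0], [3.0, 3.0, 3.0]]), cfg)
```

Because both r̂_w and ŵ sum to 1, the KL term is non-negative, so every reward in this sketch is at most 0.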

progress ∈ [0, 1] linearly anneals ε from eps_max (early, progress = 0) to eps_min (late, progress = 1). It defaults to 1.0 so that callers without a training-progress signal (e.g. eval pipelines) get ε = eps_min.
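
One way to write the linear schedule described above, assuming plain interpolation between the two bounds. The helper name `annealed_eps` is hypothetical and the numeric bounds are placeholders; only the endpoints (eps_max at progress 0, eps_min at progress 1) come from the description.

```python
def annealed_eps(progress: float, eps_max: float, eps_min: float) -> float:
    """Linearly interpolate from eps_max (progress=0) to eps_min (progress=1)."""
    return (1.0 - progress) * eps_max + progress * eps_min


# Placeholder bounds for illustration only.
start = annealed_eps(0.0, eps_max=0.5, eps_min=0.05)   # equals eps_max
end = annealed_eps(1.0, eps_max=0.5, eps_min=0.05)     # equals eps_min
```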