`reward`

`mok_reward(scores: np.ndarray, config: MokConfig, progress: float = 1.0) -> np.ndarray`

MoK multi-objective scalarization: maps an (N, k) array of scores to an (N,) array of rewards.

Given per-rollout, per-channel raw scores `scores[n, i]`:

1. Squash each channel to [0, 1] via a sigmoid, giving r_i.
2. Weight by `config.weighting` (renormalized to sum to 1) and append a residual channel so that each row defines a distribution over k+1 categories:

   ```
   r̂_w = [w_1·r_1, ..., w_k·r_k, 1 - Σ_i w_i·r_i]
   ```
3. Build the target distribution ŵ = [w_1·(1-ε), ..., w_k·(1-ε), ε].
4. Return the per-rollout reward -D_KL(r̂_w || ŵ). Higher is better.
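The steps above can be sketched in NumPy as follows. This is an illustrative re-implementation under stated assumptions, not the library's code: the function name `mok_reward_sketch`, passing `weighting` and `eps` as plain arguments instead of a `MokConfig`, and the clipping constant `tiny` are all hypothetical.

```python
import numpy as np


def mok_reward_sketch(scores, weighting, eps):
    """Hypothetical sketch of the four steps above; (N, k) -> (N,)."""
    scores = np.asarray(scores, dtype=float)        # (N, k) raw scores
    w = np.asarray(weighting, dtype=float)
    w = w / w.sum()                                 # renormalize weights to sum to 1

    # Step 1: sigmoid squash, written via tanh for numerical stability.
    r = 0.5 * (1.0 + np.tanh(0.5 * scores))         # (N, k), values in [0, 1]

    # Step 2: weighted channels plus a residual channel; each row sums to 1.
    weighted = w * r                                # (N, k)
    residual = 1.0 - weighted.sum(axis=1, keepdims=True)
    p = np.concatenate([weighted, residual], axis=1)  # (N, k+1)

    # Step 3: target distribution over the same k+1 categories.
    q = np.concatenate([w * (1.0 - eps), [eps]])    # (k+1,)

    # Step 4: negative KL divergence per row.
    tiny = 1e-12  # assumption: clip to avoid log(0); not from the source
    p = np.clip(p, tiny, 1.0)
    q = np.clip(q, tiny, 1.0)
    kl = np.sum(p * np.log(p / q), axis=1)          # (N,)
    return -kl
```

Because both r̂_w and ŵ sum to 1, the KL divergence is non-negative, so the reward is at most 0. It reaches 0 exactly when every squashed channel equals 1 - ε, which is when r̂_w coincides with ŵ.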

`progress` ∈ [0, 1] linearly anneals ε from `eps_max` (early) to `eps_min` (late). It defaults to 1.0 so that callers without a training-progress signal (e.g. eval pipelines) get ε = `eps_min`.
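One linear schedule consistent with this description is sketched below. The helper name `anneal_eps`, its argument list, and the clamping of `progress` to [0, 1] are assumptions made for illustration; the source does not say where `eps_min` and `eps_max` are stored or how out-of-range values are handled.

```python
def anneal_eps(progress, eps_min, eps_max):
    """Hypothetical linear anneal: eps_max at progress=0, eps_min at progress=1."""
    # Assumption: out-of-range progress values are clamped, not rejected.
    progress = min(max(progress, 0.0), 1.0)
    return eps_max + (eps_min - eps_max) * progress
```

With this form, the default `progress = 1.0` yields ε = `eps_min`, matching the behavior described for callers that have no training-progress signal.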