an informal review of RL with rubrics as rewards.
If you have ever graded a student essay, you know a raw number alone — “6 out of 10” — tells you almost nothing useful about why the essay succeeded or failed. What actually helps is a rubric: a structured set of criteria like “supports claims with evidence,” “clear thesis,” or “avoids run-on sentences.” Rubrics break a holistic judgment into interpretable, actionable dimensions.
That same intuition is now reshaping how we train large language models (LLMs) — and the motivation runs deeper than pedagogical analogy. Alignment is fundamentally a reward specification problem. When we apply reinforcement learning (RL) to fine-tune a model, we are implicitly asking: what does “better” mean? The answer, operationalized as a reward function $R(\cdot)$, governs every gradient update. Get it right, and RL is a powerful engine for capability. Get it even slightly wrong, and Goodhart’s Law takes over: the model optimizes the proxy, not the target.
The dominant operationalization, RLHF, encodes $R(\cdot)$ through a scalar reward model trained on human preference pairs. This works surprisingly well in practice, but hides a structural tension. A scalar is a lossy compression of judgment — it collapses multidimensional assessments into a single number. That flattening discards the very information (which quality dimensions failed, by how much, and for what reasons) that would enable targeted, robust improvement. The result is reward hacking: policies discover high-scoring behaviors the reward model never anticipated, precisely because it encoded the right outcomes but not the right reasons.
The rise of Generative Reward Models (GenRMs) and chain-of-thought (CoT) judges partially addresses the opacity problem — they at least verbalize why a response is good or bad. But as recent work shows, GenRMs can exhibit deceptive alignment: their stated reasoning can be inconsistent with human reasoning even when their final verdicts are correct. Transparency is not a solution if the stated reasoning is incoherent.
Rubric-based rewards are a structural response to this challenge. The key move is to replace the holistic scalar judgment with a vector of independently scored, human-readable criteria. This makes evaluation decomposed, auditable, and actionable in ways a scalar cannot be. At the same time, rubrics are not a free lunch — they are still proxy objectives, subject to their own forms of gaming, quality degradation, and aggregation pitfalls. The research landscape in 2025–2026 reflects both the power of the paradigm and a growing, rigorous awareness of its failure modes: a maturation from “rubrics work” to “rubrics work under these conditions and fail under these others.”
This post traces that arc: from the foundational formulation, through scaling, efficiency, and joint optimization innovations, into principled theoretical grounding, and finally into boundary conditions and diagnostic frameworks.
Modern LLM alignment usually involves some form of reinforcement learning from human feedback (RLHF): a reward model is trained on human preference data, and the LLM is then optimized to maximize that reward. The intuition is clean, but the practice is messy.
At its core, RLHF fine-tunes a policy $\pi_\theta$ to maximize expected reward while staying close to a reference model $\pi_\text{ref}$ via a KL penalty:
\[J(\pi_\theta) = \mathbb{E}_{q \sim p_Q}\Big[\mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}[R(q, o)] - \beta\, D_\text{KL}\!\big(\pi_\theta(\cdot \mid q) \,\|\, \pi_\text{ref}(\cdot \mid q)\big)\Big]\]The reward $R(q, o)$ is the crux of everything. Most of the problems in RLHF ultimately trace back to how this single function is defined, estimated, and optimized.
Standard reward models map a response to a single number. Human preferences, however, are multifaceted — a response can be accurate but verbose, concise but unhelpful, creative but off-topic. Collapsing all of that into one score discards structure that could guide better optimization. As Liu et al.
The success of reinforcement learning with verifiable rewards (RLVR) for math and coding has been remarkable. When there is a ground-truth answer or a test suite, you get a clean, unambiguous, and hard-to-game signal. But most of what we care about — long-form writing, open-ended question answering, deep research, creative tasks — lives in a domain where no such verifier exists. How do you reward a well-researched 2,000-word report with a binary signal?
Traditional neural reward models (trained discriminatively on preference pairs) suffer from what is sometimes called reward hacking: models learn to exploit the proxy metric in ways that don’t reflect actual quality. They are also opaque black boxes — you cannot easily tell why a response got a high score, making it hard to diagnose failures or prevent them.
The key idea is deceptively simple: replace the scalar score with a structured set of natural-language criteria, and score each criterion separately.
Let $x$ denote an input prompt and $\hat{y}$ a model-generated response. A rubric $\mathcal{R}(x)$ is a set of $k$ criteria:
\[\mathcal{R}(x) = \{ c_i \}_{i=1}^k\]where each $c_i$ specifies a distinct aspect of response quality (e.g., factual correctness, reasoning soundness, style). Each criterion $c_i$ can be evaluated as a binary predicate $g_i : \mathcal{P} \times \mathcal{R} \to {0, 1}$, where $g_i(x, \hat{y}) = 1$ means the response satisfies that criterion. The rubric reward is then an aggregation over all criteria:
\[f_{\mathbf{w}, \mathcal{G}}(x, \hat{y}) := \sum_{k=1}^m w_k\, g_k(x, \hat{y})\]with non-negative weights $\mathbf{w} = (w_1, \dots, w_m) \in \mathbb{R} _+ ^m$. For pairwise comparison, the judge prefers response $\hat{y}_1$ over $\hat{y}_2$ iff $f_{\mathbf{w}, \mathcal{G}}(x, \hat{y}_1) > f_{\mathbf{w}, \mathcal{G}}(x, \hat{y}_2)$. In practice each $g_k$ is evaluated by an LLM-as-a-Judge rather than a hard-coded rule.
A concrete rubric for “Write a summary of a scientific paper” might look like:
This design has several attractive properties:
The general paradigm, coined Rubrics-as-Rewards (RaR), was articulated in Gunjal et al.
The field has moved through several recognizable phases in rapid succession. The first wave established the paradigm and demonstrated that structured rubric criteria yield more reliable rewards than scalars on instruction-following benchmarks. The second wave asked how to generate good rubrics at scale — either by contrastive supervision over preference data, training-free extraction, or recursive decomposition. A third wave confronted the co-evolution problem: static rubrics degrade as the policy improves, so both rubric generators and rubric judges must be updated during training. The most recent work has begun probing the boundary conditions — identifying the regimes in which rubrics succeed, the regimes in which they fail, and the diagnostic tools needed to tell the difference from within a live training run. What follows traces each of these phases in order.
The foundational RaR paper established the paradigm, but left open the question of scale: how do you generate high-quality rubrics automatically, for a wide variety of prompts and domains, without expensive human annotation per task? OpenRubrics is the first systematic answer.
“Rubrics include structured natural language criteria that decompose quality into interpretable and measurable dimensions, providing a more consistent and transparent evaluation framework than scalar judgments.” — Liu et al.
Liu et al.(prompt, rubric) pairs, and the Rubric-RM reward model trained on it. The framework has two stages: rubric generation and rubric-conditioned reward modeling.
Given a preference dataset $\mathcal{D} = {(x_i, \{\hat{y}_{i,m}\}_{m=1}^{M_i}, \{\ell_{i,m}\}_{m=1}^{M_i})}_{i=1}^N$ where $\ell_{i,m}$ are preference signals, they form a ranked ordering $\hat{y}_{i,1} \succ \hat{y}_{i,2} \succ \cdots \succ \hat{y}_{i,M_i}$. Rather than prompting an LLM to generate criteria in isolation, they condition the rubric generator $h_\psi$ on both preferred and rejected responses:
\[\mathcal{R}(x_i) \sim h_\psi\!\left(x_i,\; \{\hat{y}_{i,m}\}_{m=1}^{M_i},\; \{\ell_{i,m}\}_{m=1}^{M_i}\right)\]By contrasting preferred and rejected responses, the model is forced to surface criteria that actually discriminate between good and bad responses — not just plausible-sounding boilerplate.
The rubrics themselves are split into two types:
Not all generated rubrics faithfully capture the preference signal. For each prompt $x_i$ with candidate responses, they define the set of induced pairwise comparisons $\mathcal{P}_i = \{(a,b) \mid 1 \leq a < b \leq M_i\}$. A rubric is retained only if it yields predictions consistent with human labels:
\[\text{Acc}_i = \frac{1}{|\mathcal{P}_i|} \sum_{(a,b) \in \mathcal{P}_i} \mathbb{I}\!\left[\hat{\ell}_{i,(a,b)} = \ell_{i,(a,b)}\right] \geq \tau\]with threshold $\tau = 0.5$. Formally, the retained rubric is:
\[\mathcal{R}^*_{i,(a,b)} = \begin{cases} \mathcal{R}(x_i), & \text{if } \text{Acc}_i \geq \tau \text{ and } \hat{\ell}_{i,(a,b)} = \ell_{i,(a,b)} \\ \emptyset, & \text{otherwise} \end{cases}\]The rubric generator $g_\theta$ is trained with standard cross-entropy:
\[\mathcal{L}_\text{SFT}^\text{rubric} = -\mathbb{E}_{(x, \hat{y}^+, \hat{y}^-, \mathcal{R}^*) \in \mathcal{D}_\text{rubric}} \sum_{t=1}^{|\mathcal{R}^*|} \log p_\theta\!\left(\mathcal{R}^*_t \mid x, \mathcal{R}^* _{\lt t}\right)\]The reward model $r_\phi$ is then trained to predict preference labels conditioned on prompt, response pair, and generated rubric:
\[\mathcal{L}_\text{SFT}^\text{rm} = -\mathbb{E}_{\mathcal{D}_\text{rm}} \sum_{t=1}^{|\hat{l}|} \log p_\phi\!\left(\hat{l}_t \mid x, \hat{y}^+, \hat{y}^-, \mathcal{R}^*(x), \hat{l}_{\lt t}\right)\]At inference, the judge selects the preferred response by:
\[\hat{l} = \arg\max_{k \in \mathcal{C}}\; p_\phi\!\left(k \mid x, y^\text{A}, y^\text{B}, \hat{\mathcal{R}}(x)\right)\]where $\mathcal{C} = {\text{“A is better”}, \text{“B is better”}}$.
The resulting Rubric-RM surpasses strong baselines by 8.4% on reward modeling benchmarks and achieves state-of-the-art scores on instruction-following benchmarks. A key case study illustrates the practical benefit: when asked to “describe a vivid character in fewer than two paragraphs,” competing reward models (JudgeLRM, RRM) miss the paragraph-count constraint and prefer a verbose, multi-paragraph response. Rubric-RM extracts “must be fewer than two paragraphs” as a hard rule, checks it first, and correctly selects the compliant response.
OpenRubrics demonstrated that good rubrics require contrastive supervision and model training. But this raises a practical barrier: what if you are in a new domain or low-resource setting, and cannot afford to train a dedicated rubric generator? Auto-Rubric asks whether the structure underlying human preferences can be recovered without any training at all.
Xie et al.
Their two-stage pipeline works as follows:
The remarkable result is data efficiency: using only 70 preference pairs (1.5% of the full dataset), Auto-Rubric enables a Qwen3-8B model to outperform specialized, fully-trained reward model counterparts. No reward model training is needed at all — the rubrics are injected directly into an LLM-as-a-Judge prompt.
Both OpenRubrics and Auto-Rubric treat rubrics as static objects — generated once and applied throughout training. This works well at initialization, but becomes a liability as the policy improves: once the model satisfies a criterion reliably, that criterion no longer provides a useful gradient signal. DR Tulu confronts this degradation directly.
One subtle problem with static rubrics: as the policy model improves during RL training, the rubrics that were challenging at the start may become trivially satisfied. The reward signal degrades.
Shao et al.
RLER is applied to deep research — a long-horizon, multi-step task where a model must search for information, synthesize it, and produce a well-attributed, long-form response. This is precisely the regime where RLVR fails (no binary ground truth exists) and static rubrics quickly become stale. The resulting DR Tulu-8B sits on the Pareto frontier of open models, matching more expensive proprietary systems at a fraction of the cost.
DR Tulu adapts rubrics over time, but keeps the rubric generator and the judge separate. This separation conceals a chicken-and-egg dependency: a judge trained on poor rubrics will reinforce those rubrics, while a rubric generator trained on unreliable judge feedback will generate criteria that are hard to assess. Rubric-ARM breaks this loop with end-to-end joint optimization.
A chicken-and-egg problem lurks in rubric-based reward modeling: to train a good rubric generator, you need reliable judgment of those rubrics; but to judge rubrics reliably, you need good rubrics. Xu et al.
The rubric generator $\pi_r$ produces a rubric $r$ from the prompt:
\[r \sim \pi_r(\cdot \mid x;\, \theta_r)\]The judge $\pi_j$ then produces a reasoning chain $c$ and preference prediction $\hat{o}$ conditioned on the prompt, both candidate responses, and the rubric:
\[(c, \hat{o}) \sim \pi_j(\cdot \mid x, y^{(1)}, y^{(2)}, r;\, \theta_j)\]The reward is binary correctness:
\[R(\hat{o}, o) = \mathbb{I}[\hat{o} = o]\]where $o \in {0, 1}$ is the ground-truth preference label. The joint optimization objective is:
\[\max_{\theta_r, \theta_j}\; \mathbb{E}_{(x,y^{(1)},y^{(2)},o) \sim \mathcal{D}}\; \mathbb{E}_{r \sim \pi_r(\cdot \mid x;\theta_r)}\; \mathbb{E}_{(c,\hat{o}) \sim \pi_j(\cdot \mid x, y^{(1)}, y^{(2)}, r;\theta_j)}\!\big[R(\hat{o}, o)\big]\]Since simultaneous optimization is unstable (the learning targets for $\theta_r$ shift as $\theta_j$ changes and vice versa), Rubric-ARM alternates: at iteration $t$, it runs
\[r_i^t \sim \pi_r(\cdot \mid x_i;\, \theta_r^t), \qquad \forall (x_i, y_i^{(1)}, y_i^{(2)}, o_i) \in \mathcal{D}\] \[\theta_j^{t+1} \leftarrow \mathrm{GRPO}\!\left(\theta_j^t;\; \{r_i^t\},\, \mathcal{D}\right) \qquad \text{(judge update)}\] \[\theta_r^{t+1} \leftarrow \mathrm{GRPO}\!\left(\theta_r^t;\; \theta_j^{t+1},\, \mathcal{D}\right) \qquad \text{(generator update)}\]where GRPO (Group Relative Policy Optimization) is the underlying RL algorithm. The judge is always updated before the rubric generator in each cycle — and there is a formal reason for this ordering (discussed below).
The GRPO objective used in both updates is:
\[\mathcal{J}_\text{GRPO}(\theta) = \mathbb{E}_{q, O \sim \pi_{\theta_\text{old}}(\cdot \mid q)}\!\left[\frac{1}{G}\sum_{i=1}^G \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\!\Big(\rho_{i,t}(\theta)\,\hat{A}_{i,t},\; \mathrm{clip}\!\big(\rho_{i,t}(\theta), 1-\varepsilon_\text{low}, 1+\varepsilon_\text{high}\big)\,\hat{A}_{i,t}\Big) - \beta\, \mathbb{D}_\text{KL}[\pi_\theta \| \pi_\text{ref}]\right]\]where $\rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t} \mid q, o_{i,<t})}$ is the token-level importance ratio and $\hat{A}_{i,t}$ is the group-relative advantage.
In practice, the judge reward is augmented with a format term:
\[R_j = R_\text{acc} + R_\text{fmt}\]where $R_\text{fmt}$ enforces that the judge addresses each rubric criterion with per-criterion explanations before issuing a final decision.
Rubric-ARM updates the judge before the rubric generator in each cycle. This ordering is not arbitrary — it is justified by a gradient variance analysis.
Let $p(r) = \mathbb{P}(\hat{o} = o^* \mid c, r)$ be the judge’s correctness probability given rubric $r$, and let $u_r(r)$, $u_j(o \mid r)$ be the score functions (log-policy gradients) for the generator and judge respectively.
Proposition (Judge Variance, Strategy A). When the rubric $\bar{r}$ is fixed (reused) during judge updates, the gradient variance is:
\[\mathrm{Var}(\hat{g}_A \mid \bar{r}) = p(\bar{r})\,(1 - p(\bar{r}))\,\|u_j(o^* \mid \bar{r})\|^2\]This is simply the Bernoulli variance of the correctness reward, scaled by the judge’s gradient magnitude.
Proposition (Generator Variance, Strategy B). When the generator is being updated against a fixed judge, the total gradient variance decomposes as:
\[\mathrm{Var}(\hat{g}_B) = \underbrace{\mathbb{E}_r\!\left[p(r)(1-p(r))\|u_r(r)\|^2\right]}_{\text{(I) Multiplicative Reward Noise}} + \underbrace{\mathrm{Var}_r\!\left(p(r)\,u_r(r)\right)}_{\text{(II) Cross-Rubric Inconsistency}}\]Term (I) is the judge’s aleatoric uncertainty amplified by the high-dimensional generator gradient. Term (II) captures the additional instability from different rubrics yielding different expected rewards, causing the gradient direction to oscillate.
Theorem (Strict Variance Domination). Under a mild exploration-gradient sufficiency condition, the generator’s gradient variance strictly dominates the judge’s:
\[\mathrm{Var}(\hat{g}_B) > \mathbb{E}_{\bar{r}}[\mathrm{Var}(\hat{g}_A \mid \bar{r})]\]This proves that updating the generator before the judge amplifies training noise — the unstable rubric exploration dominates. By stabilizing the judge first (Strategy A), Rubric-ARM effectively reduces variance and provides a clearer learning signal before optimizing the more noisy generator.
The alternating RL procedure has a clean theoretical interpretation as a generalized EM algorithm, with rubrics $r$ as latent variables. With $\pi_j$ fixed, updating $\pi_r$ resembles an amortized E-step (finding rubrics that make the correct outcome more likely). With $\pi_r$ fixed, updating $\pi_j$ resembles an M-step (maximizing expected correctness under the current rubric distribution).
Rubric-ARM achieves a +4.7% average gain over strong reasoning-based judges on reward modeling benchmarks, and +6.5% downstream policy improvement across 6 policy benchmarks. Notably, it runs in 33.5 seconds on 100 samples — faster than Rubric-RM (105s) and reasoning-heavy baselines like RM-R1 (170–382s) — because short rubric generation replaces long chain-of-thought decoding.
Rubric-ARM showed that co-optimizing the generator and judge yields better rubrics than treating either as fixed. But even jointly-optimized rubrics can suffer from a more fundamental issue: being generated at the wrong level of granularity. Broad, under-specified rubrics are easy to satisfy trivially; redundant rubrics over-count correlated quality dimensions. Shen et al. step back and ask: what should a well-designed rubric even look like, in a formal sense?
Shen et al.
Starting from the weighted rubric reward $f_{\mathbf{w}, \mathcal{G}}(P, R) = \sum_{k=1}^m w_k g_k(P, R)$, the judge misclassification probability admits an exponential upper bound under mild assumptions (positive edge and bounded correlation):
\[\mathbb{P}(\hat{Y} \neq Y) \leq \exp\!\left(-\frac{1}{2}\min\left\{\frac{\Delta_m^2}{V_m(+1)},\; \frac{\Delta_m^2}{V_m(-1)}\right\}\right)\]where $\Delta_m = \mathbf{w}^\top \boldsymbol{\mu}$ is the aggregate signal strength (each $\mu_k$ measures how much rubric $k$ discriminates preferred from non-preferred responses) and $V_m = \mathbf{w}^\top \mathbf{\Sigma}\, \mathbf{w}$ is the variance proxy of the weighted residuals (capturing rubric noise and inter-rubric correlation). Minimizing this bound amounts to maximizing the signal-to-noise ratio:
\[\Xi = \frac{(\mathbf{w}^\top \boldsymbol{\mu})^2}{\mathbf{w}^\top \mathbf{\Sigma}\, \mathbf{w}}\]This inequality has a direct operational prescription: (1) decompose broad rubrics to increase aggregate edge $\Delta_m$; (2) filter misaligned rubrics to remove negative edge; (3) remove redundancy to reduce $V_m$; and (4) weight optimally to account for correlation structure.
In open-ended domains, ground-truth preference labels needed to estimate $\boldsymbol{\mu}$ are unavailable. Shen et al. show that the optimal weights (maximizing the worst-case $\Xi$ over uncertain $\boldsymbol{\mu}$) are:
\[w_\text{WU} \propto \Sigma^{-1/2}\, \mathbf{1}\]That is, apply the inverse square-root of the rubric covariance matrix $\Sigma$ — a whitening transformation — and then weight uniformly in the whitened space. Crucially, $\Sigma$ can be estimated from unlabeled response data, making this entirely label-free.
Their proposed solution is Recursive Rubric Decomposition (RRD): a top-down, hierarchical approach. Starting from an initial set of LLM-generated rubrics, any rubric $g_m$ satisfied by more than $n = 2$ candidate responses (i.e., insufficiently discriminative) is recursively decomposed into finer sub-criteria by prompting the LLM again. A misalignment filter discards rubrics that prefer outputs from weaker models over stronger ones. A redundancy filter removes rubrics with $\geq 70\%$ semantic overlap with existing ones. The loop terminates via early stopping when too many proposals are rejected. The termination threshold is a tunable hyperparameter (set to 15 in all experiments).
The key empirical payoff: RRD with WU weights improves GPT-4o’s judgment accuracy from 55.6% to 73.3% on JudgeBench (+17.7 points). When used as the reward signal for reinforcement fine-tuning on WildChat, it boosts policy reward by up to 160% for Qwen3-4B and 60% for Llama3.1-8B, versus only ~10–20% for prior rubric baselines.
Even theoretically well-constructed rubrics do not guarantee that a judge’s reasoning is trustworthy. A model might score criteria correctly on aggregate while systematically misapplying individual criteria — a diagnostic blind spot that standard reward modeling metrics completely miss. Wang et al. expose this gap.
Perhaps the most conceptually provocative paper in this space is Wang et al.
This matters because during RLHF, the policy model is being shaped by those rewards. If the reward signal is correct on average during training but wrong in specific edge cases — and those edge cases involve the exact behaviors we most want to generalize — the model will exploit the gaps.
The paper introduces Rationale Consistency as a complementary metric to outcome accuracy: rather than just asking “did the judge pick the right response?”, we ask “is the judge’s reasoning consistent with human reasoning?” Concretely, the authors construct evaluation instances where the correct preference label can be determined from the reasoning alone, and measure whether a judge’s stated rationale leads to the same conclusion as its final verdict. A model that consistently shows rationale-verdict disagreement is flagging a deceptive alignment problem — it has learned a shortcut to the right answer that bypasses genuine reasoning.
Crucially, this dovetails with rubric-based approaches: an explicit rubric with per-criterion scores makes the intermediate reasoning auditable and thus amenable to this kind of consistency check. Rubrics, in this view, are not just a training signal — they are a transparency mechanism that makes reward models more trustworthy. However, the paper’s core finding is sobering: transparency alone is insufficient. The rubric must be correctly applied, and verifying correct application requires additional diagnostic tools beyond simple accuracy metrics.
| Paper | Method | Rubric Source | Requires Training? | Target Domain | Key Innovation |
|---|---|---|---|---|---|
| Gunjal et al. | RaR | LLM generation (prompted) | No (reward model) | General / instruction following | First RaR formulation |
| Liu et al. | OpenRubrics + CRG | Contrastive preference data | Yes (Rubric-RM) | General / IF / biomedical | Contrastive rubric gen; consistency filtering |
| Xie et al. | Auto-Rubric | Propose-Evaluate-Revise | No (training-free) | General | Extreme data efficiency (70 pairs) |
| Shao et al. | RLER | Co-evolves with policy | Yes (policy) | Long-form / deep research | Evolving rubrics during RL |
| Shen et al. | RRD | Recursive decomposition | Optional | Open-ended / general | Principled rubric structure; WU weights |
| Wang et al. | Rationale Consistency | — (analysis framework) | Diagnostic | General (GenRM evaluation) | Identifies deceptive alignment in judges |
| Xu et al. | Rubric-ARM | Joint RL optimization | Yes (generator + judge) | Non-verifiable domains | Alternating RL; variance-justified training order |
The broader ecosystem is expanding quickly:
For all the progress, rubric-based rewards still have a number of open problems worth paying attention to:
Rubric quality is the bottleneck. Garbage rubrics produce garbage rewards. Both the coverage (are the right dimensions present?) and the specificity (are they concrete enough to score reliably?) of rubrics matter enormously, and automatic generation is still inconsistent. Shen et al.
LLM judges can be gamed. Rubrics are only as good as the judge scoring them. If the policy learns to produce responses that superficially satisfy rubric criteria without actually being better, we have just moved the reward-hacking problem one level up. The rationale consistency work
Static rubrics degrade over training. As the policy improves, criteria get saturated. Shao et al.
Aggregation is non-trivial. Rubrics produce a vector of scores; RL needs a scalar. Shen et al.
Computational cost. Scoring a response against $k$ criteria using an LLM judge is roughly $k$ times more expensive than a single scalar score. For large-scale RL training, this adds up fast. Liu et al.
Rubric-based rewards represent a principled middle ground between the rigid determinism of RLVR (which requires ground-truth verifiers) and the opacity of neural reward models (which collapse human preferences into a single number). By decomposing quality into interpretable, independently evaluable criteria — formalized as a weighted sum $f_{\mathbf{w}, \mathcal{G}}(x, \hat{y}) = \sum_k w_k g_k(x, \hat{y})$ — they offer:
The field is moving fast. In just a few months, we have seen rubric generation go from hand-crafted expert annotations to fully automated, contrastive, even self-evolving pipelines — with formal theoretical backing emerging in parallel. The next frontiers — principled rubric update schedules during training, better correlation-aware aggregation, and joint optimization of generators and judges — are already being actively tackled.
If you are building or studying LLM alignment systems, this is a space worth watching closely.
Thanks to alphaxiv agent with Claude for polishing this post.