An informal review of RL with rubrics as rewards.
If you have ever graded a student essay, you know a raw number alone — “6 out of 10” — tells you almost nothing useful about why the essay succeeded or failed. What actually helps is a rubric: a structured set of criteria like “supports claims with evidence,” “clear thesis,” or “avoids run-on sentences.” Rubrics break a holistic judgment into interpretable, actionable dimensions.
That same intuition is now reshaping how we train large language models (LLMs). A new wave of work proposes rubric-based rewards for reinforcement learning (RL): instead of feeding the model a single scalar score, we provide a set of human-readable criteria and score each one separately. The result is a richer, more interpretable, and often more reliable training signal — especially for the large swath of tasks where a binary “correct/incorrect” judgment simply does not apply.
This post walks through the motivation, the core ideas, and the recent papers pushing this direction forward.
Modern LLM alignment usually involves some form of reinforcement learning from human feedback (RLHF): a reward model is trained on human preference data, and the LLM is then optimized to maximize that reward. The intuition is clean, but the practice is messy.
At its core, RLHF fine-tunes a policy $\pi_\theta$ to maximize expected reward while staying close to a reference model $\pi_\text{ref}$ via a KL penalty:
\[J(\pi_\theta) = \mathbb{E}_{q \sim p_Q}\Big[\mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}[R(q, o)] - \beta\, D_\text{KL}\!\big(\pi_\theta(\cdot \mid q) \,\|\, \pi_\text{ref}(\cdot \mid q)\big)\Big]\]The reward $R(q, o)$ is the crux of everything. Most of the problems in RLHF ultimately trace back to how this single function is defined, estimated, and optimized.
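To make the objective concrete, here is a minimal sketch of the KL-shaped reward used in practice. It uses the common single-sample estimate of the KL term along a sampled response; the function name, the $\beta$ value, and the log-probabilities are all illustrative:

```python
def kl_shaped_reward(task_reward, policy_logprobs, ref_logprobs, beta=0.1):
    """KL-regularized reward: the task reward minus beta times a
    single-sample estimate of D_KL(pi_theta || pi_ref), computed as the
    sum of (log pi_theta - log pi_ref) over the sampled response tokens."""
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return task_reward - beta * kl_estimate

# Toy example: the policy drifts slightly from the reference on 3 tokens.
policy_lp = [-0.5, -1.0, -0.2]
ref_lp = [-0.6, -1.1, -0.4]
print(kl_shaped_reward(1.0, policy_lp, ref_lp))  # ≈ 0.96
```

The penalty shrinks the reward whenever the policy assigns its sampled tokens much more probability than the reference does, which is what keeps $\pi_\theta$ anchored to $\pi_\text{ref}$.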
Standard reward models map a response to a single number. Human preferences, however, are multifaceted — a response can be accurate but verbose, concise but unhelpful, creative but off-topic. Collapsing all of that into one score discards structure that could guide better optimization. As Liu et al. argue, decomposing quality into explicit, interpretable dimensions restores exactly that structure.
The success of reinforcement learning with verifiable rewards (RLVR) for math and coding has been remarkable. When there is a ground-truth answer or a test suite, you get a clean, unambiguous, and hard-to-game signal. But most of what we care about — long-form writing, open-ended question answering, deep research, creative tasks — lives in a domain where no such verifier exists. How do you reward a well-researched 2,000-word report with a binary signal?
Traditional neural reward models (trained discriminatively on preference pairs) suffer from reward hacking: models learn to exploit the proxy metric in ways that don’t reflect actual quality. They are also opaque — you cannot easily tell why a response got a high score, which makes failures hard to diagnose or prevent.
The key idea is deceptively simple: replace the scalar score with a structured set of natural-language criteria, and score each criterion separately.
Let $x$ denote an input prompt and $\hat{y}$ a model-generated response. A rubric $\mathcal{R}(x)$ is a set of $k$ criteria:
\[\mathcal{R}(x) = \{ c_i \}_{i=1}^k\]where each $c_i$ specifies a distinct aspect of response quality (e.g., factual correctness, reasoning soundness, style). Each criterion $c_i$ can be evaluated as a binary predicate $g_i(x, \hat{y}) \in \{0, 1\}$, where $g_i(x, \hat{y}) = 1$ means the response satisfies that criterion. The rubric reward is then an aggregation over all criteria:
\[f_{\mathbf{w}, \mathcal{G}}(x, \hat{y}) := \sum_{i=1}^{k} w_i\, g_i(x, \hat{y})\]with non-negative weights $\mathbf{w} = (w_1, \dots, w_k) \in \mathbb{R}_+^k$. For pairwise comparison, the judge prefers response $\hat{y}_1$ over $\hat{y}_2$ iff $f_{\mathbf{w}, \mathcal{G}}(x, \hat{y}_1) > f_{\mathbf{w}, \mathcal{G}}(x, \hat{y}_2)$. In practice each $g_i$ is evaluated by an LLM-as-a-Judge rather than a hard-coded rule.
A concrete rubric for “Write a summary of a scientific paper” might check, for example, that the summary states the paper’s main contribution accurately, covers the key results, respects a length limit, and avoids unexplained jargon.
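As a minimal sketch of the weighted aggregation, with hypothetical criteria and weights (in the papers each judgment $g_i$ comes from an LLM judge; here the judgments are stubbed as literals):

```python
# Hypothetical rubric for "Write a summary of a scientific paper".
# Criteria and weights are illustrative, not taken from any paper.
rubric = [
    (0.4, "States the paper's main contribution accurately"),
    (0.3, "Covers the key experimental results"),
    (0.2, "Stays under 200 words"),
    (0.1, "Avoids unexplained jargon"),
]

def rubric_reward(judgments):
    """Aggregate binary per-criterion judgments g_i in {0, 1}
    into the weighted score f = sum_i w_i * g_i."""
    assert len(judgments) == len(rubric)
    return sum(w * g for (w, _), g in zip(rubric, judgments))

# Stubbed judge output: satisfies everything but the length criterion.
print(rubric_reward([1, 1, 0, 1]))  # ≈ 0.8
```

A response can now lose reward for a specific, nameable failure (here, length) rather than receiving an unexplained low scalar.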
This design has several attractive properties: each criterion is independently auditable, a low score can be traced to specific failure dimensions, and the criteria themselves are human-readable, so the reward signal can be inspected and debugged rather than treated as a black box.
The general paradigm, dubbed Rubrics-as-Rewards (RaR), was first articulated by Gunjal et al.
“Rubrics include structured natural language criteria that decompose quality into interpretable and measurable dimensions, providing a more consistent and transparent evaluation framework than scalar judgments.” — Liu et al.
Liu et al. introduce OpenRubrics, a collection of (prompt, rubric) pairs, and Rubric-RM, a reward model trained on it. The framework has two stages: rubric generation and rubric-conditioned reward modeling.
Given a preference dataset $\mathcal{D} = \{(x_i, \{\hat{y}_{i,m}\}_{m=1}^{M_i}, \{\ell_{i,m}\}_{m=1}^{M_i})\}_{i=1}^N$ where $\ell_{i,m}$ are preference signals, they form a ranked ordering $\hat{y}_{i,1} \succ \hat{y}_{i,2} \succ \cdots \succ \hat{y}_{i,M_i}$. Rather than prompting an LLM to generate criteria in isolation, they condition the rubric generator $h_\psi$ on both preferred and rejected responses:
\[\mathcal{R}(x_i) \sim h_\psi\!\left(x_i,\; \{\hat{y}_{i,m}\}_{m=1}^{M_i},\; \{\ell_{i,m}\}_{m=1}^{M_i}\right)\]By contrasting preferred and rejected responses, the model is forced to surface criteria that actually discriminate between good and bad responses — not just plausible-sounding boilerplate.
The rubrics themselves are split into two types: hard rules — binary constraints (like an explicit length limit) that are checked first — and softer principles that capture graded aspects of quality.
Not all generated rubrics faithfully capture the preference signal. For each prompt $x_i$ with candidate responses, they define the set of induced pairwise comparisons $\mathcal{P}_i = \{(a,b) \mid 1 \leq a < b \leq M_i\}$. A rubric is retained only if it yields predictions consistent with human labels:
\[\text{Acc}_i = \frac{1}{|\mathcal{P}_i|} \sum_{(a,b) \in \mathcal{P}_i} \mathbb{I}\!\left[\hat{\ell}_{i,(a,b)} = \ell_{i,(a,b)}\right] \geq \tau\]with threshold $\tau = 0.5$. Formally, the retained rubric is:
\[\mathcal{R}^*_{i,(a,b)} = \begin{cases} \mathcal{R}(x_i), & \text{if } \text{Acc}_i \geq \tau \text{ and } \hat{\ell}_{i,(a,b)} = \ell_{i,(a,b)} \\ \emptyset, & \text{otherwise} \end{cases}\]A rubric generator with parameters $\theta$ is then trained on the retained rubrics with standard cross-entropy:
\[\mathcal{L}_\text{SFT}^\text{rubric} = -\mathbb{E}_{(x, \hat{y}^+, \hat{y}^-, \mathcal{R}^*) \in \mathcal{D}_\text{rubric}} \sum_{t=1}^{|\mathcal{R}^*|} \log p_\theta\!\left(\mathcal{R}^*_t \mid x, \mathcal{R}^*_{<t}\right)\]The reward model $r_\phi$ is then trained to predict preference labels conditioned on prompt, response pair, and generated rubric:
\[\mathcal{L}_\text{SFT}^\text{rm} = -\mathbb{E}_{\mathcal{D}_\text{rm}} \sum_{t=1}^{|\hat{l}|} \log p_\phi\!\left(\hat{l}_t \mid x, \hat{y}^+, \hat{y}^-, \mathcal{R}^*(x), \hat{l}_{<t}\right)\]At inference, the judge selects the preferred response by:
\[\hat{l} = \arg\max_{k \in \mathcal{C}}\; p_\phi\!\left(k \mid x, y^\text{A}, y^\text{B}, \hat{\mathcal{R}}(x)\right)\]where $\mathcal{C} = \{\text{“A is better”}, \text{“B is better”}\}$.
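The consistency-filtering step ($\text{Acc}_i \geq \tau$) reduces to a simple agreement count over induced pairs. A minimal sketch, with the rubric-based judge predictions stubbed as plain dictionaries:

```python
def keep_rubric(pred_pairs, human_pairs, tau=0.5):
    """Consistency filter: retain a rubric only if its induced pairwise
    predictions agree with human labels on at least a tau fraction of
    pairs (Acc_i >= tau).

    pred_pairs / human_pairs map (a, b) index pairs to the index of the
    preferred response; in practice pred_pairs would come from scoring
    both responses under the candidate rubric.
    """
    agree = sum(pred_pairs[pair] == human_pairs[pair] for pair in human_pairs)
    acc = agree / len(human_pairs)
    return acc >= tau

human = {(0, 1): 0, (0, 2): 0, (1, 2): 1}
pred = {(0, 1): 0, (0, 2): 2, (1, 2): 1}  # agrees on 2 of 3 pairs
print(keep_rubric(pred, human))  # Acc = 2/3 >= 0.5, so the rubric is kept
```

Rubrics that sound plausible but contradict the human ranking are filtered out before they ever reach the generator's training set.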
The resulting Rubric-RM surpasses strong baselines by 8.4% on reward modeling benchmarks and achieves state-of-the-art scores on instruction-following benchmarks. A key case study illustrates the practical benefit: when asked to “describe a vivid character in fewer than two paragraphs,” competing reward models (JudgeLRM, RRM) miss the paragraph-count constraint and prefer a verbose, multi-paragraph response. Rubric-RM extracts “must be fewer than two paragraphs” as a hard rule, checks it first, and correctly selects the compliant response.
Xie et al. take a different, training-free route with Auto-Rubric.
Their propose-evaluate-revise pipeline iteratively generates candidate rubrics, checks them against the preference data, and refines or discards those that fail the check.
The remarkable result is data efficiency: using only 70 preference pairs (1.5% of the full dataset), Auto-Rubric enables a Qwen3-8B model to outperform specialized, fully-trained reward model counterparts. No reward model training is needed at all — the rubrics are injected directly into an LLM-as-a-Judge prompt.
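Since no reward model is trained, the whole method ultimately reduces to prompt construction. A hypothetical sketch of such a rubric-injected judge prompt — the template, wording, and criteria here are illustrative, not the paper's actual prompt:

```python
def build_judge_prompt(prompt, resp_a, resp_b, rubrics):
    """Training-free rubric judging: inject distilled criteria directly
    into an LLM-as-a-Judge prompt. Rubric texts are placeholders."""
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(rubrics))
    return (
        "Evaluate which response better answers the prompt.\n"
        "Judge each criterion separately, then pick A or B.\n\n"
        f"Criteria:\n{criteria}\n\n"
        f"Prompt: {prompt}\n\n"
        f"Response A: {resp_a}\n\n"
        f"Response B: {resp_b}\n"
        "Answer with 'A' or 'B'."
    )

demo = build_judge_prompt(
    "Summarize the paper.", "<response A>", "<response B>",
    ["Accurate main claim", "Covers key results"],
)
print(demo)
```

The distilled rubrics do the work a trained reward model would otherwise do: they tell a generic judge what to look for.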
One subtle problem with static rubrics: as the policy model improves during RL training, the rubrics that were challenging at the start may become trivially satisfied. The reward signal degrades.
Shao et al. address this with RLER, in which the rubrics co-evolve with the policy during training rather than staying fixed up front.
RLER is applied to deep research — a long-horizon, multi-step task where a model must search for information, synthesize it, and produce a well-attributed, long-form response. This is precisely the regime where RLVR fails (no binary ground truth exists) and static rubrics quickly become stale. The resulting DR Tulu-8B sits on the Pareto frontier of open models, matching more expensive proprietary systems at a fraction of the cost.
A chicken-and-egg problem lurks in rubric-based reward modeling: to train a good rubric generator, you need reliable judgment of those rubrics; but to judge rubrics reliably, you need good rubrics. Xu et al. cut this knot with Rubric-ARM, which optimizes the rubric generator and the judge jointly via alternating reinforcement learning.
The rubric generator $\pi_r$ produces a rubric $r$ from the prompt:
\[r \sim \pi_r(\cdot \mid x;\, \theta_r)\]The judge $\pi_j$ then produces a reasoning chain $c$ and preference prediction $\hat{o}$ conditioned on the prompt, both candidate responses, and the rubric:
\[(c, \hat{o}) \sim \pi_j(\cdot \mid x, y^{(1)}, y^{(2)}, r;\, \theta_j)\]The reward is binary correctness:
\[R(\hat{o}, o) = \mathbb{I}[\hat{o} = o]\]where $o \in \{0, 1\}$ is the ground-truth preference label. The joint optimization objective is:
\[\max_{\theta_r, \theta_j}\; \mathbb{E}_{(x,y^{(1)},y^{(2)},o) \sim \mathcal{D}}\; \mathbb{E}_{r \sim \pi_r(\cdot \mid x;\theta_r)}\; \mathbb{E}_{(c,\hat{o}) \sim \pi_j(\cdot \mid x, y^{(1)}, y^{(2)}, r;\theta_j)}\!\big[R(\hat{o}, o)\big]\]Since simultaneous optimization is unstable (the learning targets for $\theta_r$ shift as $\theta_j$ changes and vice versa), Rubric-ARM alternates: at iteration $t$, it runs
\[r_i^t \sim \pi_r(\cdot \mid x_i;\, \theta_r^t), \qquad \forall (x_i, y_i^{(1)}, y_i^{(2)}, o_i) \in \mathcal{D}\] \[\theta_j^{t+1} \leftarrow \mathrm{GRPO}\!\left(\theta_j^t;\; \{r_i^t\},\, \mathcal{D}\right) \qquad \text{(judge update)}\] \[\theta_r^{t+1} \leftarrow \mathrm{GRPO}\!\left(\theta_r^t;\; \theta_j^{t+1},\, \mathcal{D}\right) \qquad \text{(generator update)}\]where GRPO (Group Relative Policy Optimization) is the underlying RL algorithm. The judge is always updated before the rubric generator in each cycle — and there is a formal reason for this ordering (discussed below).
The GRPO objective used in both updates is:
\[\mathcal{J}_\text{GRPO}(\theta) = \mathbb{E}_{q \sim p_Q,\, \{o_i\}_{i=1}^G \sim \pi_{\theta_\text{old}}(\cdot \mid q)}\!\left[\frac{1}{G}\sum_{i=1}^G \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\!\Big(\rho_{i,t}(\theta)\,\hat{A}_{i,t},\; \mathrm{clip}\!\big(\rho_{i,t}(\theta), 1-\varepsilon_\text{low}, 1+\varepsilon_\text{high}\big)\,\hat{A}_{i,t}\Big) - \beta\, \mathbb{D}_\text{KL}[\pi_\theta \| \pi_\text{ref}]\right]\]where $\rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t} \mid q, o_{i,<t})}$ is the token-level importance ratio and $\hat{A}_{i,t}$ is the group-relative advantage.
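The group-relative advantage $\hat{A}_{i,t}$ is GRPO's key simplification over PPO: instead of a learned value baseline, each rollout's reward is standardized against its own group of $G$ samples. A minimal sketch (some GRPO variants omit the standard-deviation normalization):

```python
def group_relative_advantages(rewards):
    """GRPO-style group-relative advantage: standardize each sampled
    response's reward against the mean and std of its group of G rollouts."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four rollouts for the same prompt, two rewarded and two not.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
# → [1.0, -1.0, 1.0, -1.0]
```

With binary rubric-style rewards, the advantage simply pushes probability toward the rollouts that satisfied more criteria than their group-mates.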
In practice, the judge reward is augmented with a format term:
\[R_j = R_\text{acc} + R_\text{fmt}\]where $R_\text{fmt}$ enforces that the judge addresses each rubric criterion with per-criterion explanations before issuing a final decision.
Rubric-ARM updates the judge before the rubric generator in each cycle. This ordering is not arbitrary — it is justified by a gradient variance analysis.
Let $p(r) = \mathbb{P}(\hat{o} = o^* \mid c, r)$ be the judge’s correctness probability given rubric $r$, and let $u_r(r)$, $u_j(o \mid r)$ be the score functions (log-policy gradients) for the generator and judge respectively.
Proposition (Judge Variance, Strategy A). When the rubric $\bar{r}$ is fixed (reused) during judge updates, the gradient variance is:
\[\mathrm{Var}(\hat{g}_A \mid \bar{r}) = p(\bar{r})\,(1 - p(\bar{r}))\,\|u_j(o^* \mid \bar{r})\|^2\]This is simply the Bernoulli variance of the correctness reward, scaled by the judge’s gradient magnitude.
Proposition (Generator Variance, Strategy B). When the generator is being updated against a fixed judge, the total gradient variance decomposes as:
\[\mathrm{Var}(\hat{g}_B) = \underbrace{\mathbb{E}_r\!\left[p(r)(1-p(r))\|u_r(r)\|^2\right]}_{\text{(I) Multiplicative Reward Noise}} + \underbrace{\mathrm{Var}_r\!\left(p(r)\,u_r(r)\right)}_{\text{(II) Cross-Rubric Inconsistency}}\]Term (I) is the judge’s aleatoric uncertainty amplified by the high-dimensional generator gradient. Term (II) captures the additional instability from different rubrics yielding different expected rewards, causing the gradient direction to oscillate.
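This decomposition is the law of total variance applied to the score-function estimator $\hat{g}_B = R \cdot u_r(r)$ with $R \sim \text{Bernoulli}(p(r))$. A quick Monte-Carlo sanity check, using scalar stand-ins for the gradients (all numbers illustrative):

```python
import random

random.seed(0)

# Toy model: three rubrics r, each with a judge accuracy p(r) and a
# scalar stand-in for the generator score function u_r(r).
rubrics = [(0.9, 2.0), (0.6, -1.0), (0.3, 0.5)]  # (p(r), u_r(r))

# Monte-Carlo estimate of Var(g_B) for g_B = R * u_r(r), R ~ Bernoulli(p(r)).
samples = []
for _ in range(200_000):
    p, u = random.choice(rubrics)
    reward = 1.0 if random.random() < p else 0.0
    samples.append(reward * u)
mean = sum(samples) / len(samples)
mc_var = sum((s - mean) ** 2 for s in samples) / len(samples)

# Analytic decomposition:
# term I  = E_r[p(1-p)u^2]   (multiplicative reward noise)
# term II = Var_r(p * u)     (cross-rubric inconsistency)
term1 = sum(p * (1 - p) * u * u for p, u in rubrics) / len(rubrics)
pu = [p * u for p, u in rubrics]
m = sum(pu) / len(pu)
term2 = sum((x - m) ** 2 for x in pu) / len(pu)

print(round(mc_var, 2), round(term1 + term2, 2))  # the two should agree
```

Term (II) is large here precisely because the three rubrics pull the gradient in different directions — the instability the proposition formalizes.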
Theorem (Strict Variance Domination). Under a mild exploration-gradient sufficiency condition, the generator’s gradient variance strictly dominates the judge’s:
\[\mathrm{Var}(\hat{g}_B) > \mathbb{E}_{\bar{r}}[\mathrm{Var}(\hat{g}_A \mid \bar{r})]\]This proves that updating the generator before the judge amplifies training noise — the unstable rubric exploration dominates. By stabilizing the judge first (Strategy A), Rubric-ARM effectively reduces variance and provides a clearer learning signal before optimizing the more noisy generator.
The alternating RL procedure has a clean theoretical interpretation as a generalized EM algorithm, with rubrics $r$ as latent variables. With $\pi_j$ fixed, updating $\pi_r$ resembles an amortized E-step (finding rubrics that make the correct outcome more likely). With $\pi_r$ fixed, updating $\pi_j$ resembles an M-step (maximizing expected correctness under the current rubric distribution).
Rubric-ARM achieves a +4.7% average gain over strong reasoning-based judges on reward modeling benchmarks, and +6.5% downstream policy improvement across 6 policy benchmarks. Notably, it runs in 33.5 seconds on 100 samples — faster than Rubric-RM (105s) and reasoning-heavy baselines like RM-R1 (170–382s) — because short rubric generation replaces long chain-of-thought decoding.
Shen et al. put rubric design on a theoretical footing, deriving an error bound that prescribes how rubrics should be decomposed, filtered, and weighted.
Starting from the weighted rubric reward $f_{\mathbf{w}, \mathcal{G}}(x, \hat{y}) = \sum_{k=1}^m w_k g_k(x, \hat{y})$, the judge misclassification probability admits an exponential upper bound under mild assumptions (positive edge and bounded correlation):
\[\mathbb{P}(\hat{Y} \neq Y) \leq \exp\!\left(-\frac{1}{2}\min\left\{\frac{\Delta_m^2}{V_m(+1)},\; \frac{\Delta_m^2}{V_m(-1)}\right\}\right)\]where $\Delta_m = \mathbf{w}^\top \boldsymbol{\mu}$ is the aggregate signal strength (each $\mu_k$ measures how much rubric $k$ discriminates preferred from non-preferred responses) and $V_m = \mathbf{w}^\top \mathbf{\Sigma}\, \mathbf{w}$ is the variance proxy of the weighted residuals (capturing rubric noise and inter-rubric correlation). Minimizing this bound amounts to maximizing the signal-to-noise ratio:
\[\Xi = \frac{(\mathbf{w}^\top \boldsymbol{\mu})^2}{\mathbf{w}^\top \mathbf{\Sigma}\, \mathbf{w}}\]This inequality has a direct operational prescription: (1) decompose broad rubrics to increase aggregate edge $\Delta_m$; (2) filter misaligned rubrics to remove negative edge; (3) remove redundancy to reduce $V_m$; and (4) weight optimally to account for correlation structure.
In open-ended domains, ground-truth preference labels needed to estimate $\boldsymbol{\mu}$ are unavailable. Shen et al. show that the optimal weights (maximizing the worst-case $\Xi$ over uncertain $\boldsymbol{\mu}$) are:
\[w_\text{WU} \propto \Sigma^{-1/2}\, \mathbf{1}\]That is, apply the inverse square-root of the rubric covariance matrix $\Sigma$ — a whitening transformation — and then weight uniformly in the whitened space. Crucially, $\Sigma$ can be estimated from unlabeled response data, making this entirely label-free.
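The whitening step can be made concrete for two rubrics, where $\Sigma^{-1/2}$ has a closed form via the eigendecomposition. A sketch for the 2×2 case only; `inv_sqrt_2x2` is a helper written for this illustration, and the covariance values are made up:

```python
import math

def inv_sqrt_2x2(a, b, c):
    """Inverse principal square root of the symmetric PSD matrix
    [[a, b], [b, c]], via its eigendecomposition (Sylvester's formula)."""
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + disc, tr / 2 - disc  # eigenvalues
    f1, f2 = 1 / math.sqrt(l1), 1 / math.sqrt(l2)
    if abs(l1 - l2) < 1e-12:
        return [[f1, 0.0], [0.0, f1]]
    # f(M) = k*M + m*I with k, m chosen so f(l1)=f1 and f(l2)=f2.
    k = (f1 - f2) / (l1 - l2)
    m = f2 - l2 * k
    return [[k * a + m, k * b], [k * b, k * c + m]]

def whitened_uniform_weights(sigma):
    """w_WU ∝ Σ^{-1/2} 1 : whiten the rubric scores, weight uniformly
    in the whitened space, then normalize to sum to one."""
    (a, b), (_, c) = sigma
    s = inv_sqrt_2x2(a, b, c)
    w = [s[0][0] + s[0][1], s[1][0] + s[1][1]]
    total = sum(w)
    return [x / total for x in w]

# Equal-variance, positively correlated rubrics: weights stay symmetric.
print(whitened_uniform_weights([[1.0, 0.5], [0.5, 1.0]]))  # → [0.5, 0.5]
# Uncorrelated rubrics with unequal noise: the noisier one is down-weighted.
print(whitened_uniform_weights([[4.0, 0.0], [0.0, 1.0]]))  # → [1/3, 2/3]
```

The second example shows the intuition: only $\Sigma$ (estimable from unlabeled responses) is needed to discount noisy, redundant criteria.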
Their proposed solution is Recursive Rubric Decomposition (RRD): a top-down, hierarchical approach. Starting from an initial set of LLM-generated rubrics, any rubric $g_m$ satisfied by more than $n = 2$ candidate responses (i.e., insufficiently discriminative) is recursively decomposed into finer sub-criteria by prompting the LLM again. A misalignment filter discards rubrics that prefer outputs from weaker models over stronger ones. A redundancy filter removes rubrics with $\geq 70\%$ semantic overlap with existing ones. The loop terminates via early stopping when too many proposals are rejected. The termination threshold is a tunable hyperparameter (set to 15 in all experiments).
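The redundancy filter, for instance, can be sketched with word-level Jaccard overlap as a crude stand-in for the semantic-overlap measure the paper uses:

```python
def redundancy_filter(rubrics, threshold=0.7):
    """Greedy redundancy filter: drop any rubric whose token-level Jaccard
    overlap with an already-kept rubric is >= threshold. (The paper uses
    semantic overlap; word-level Jaccard is a crude illustrative stand-in.)"""
    kept = []
    for r in rubrics:
        toks = set(r.lower().split())
        if all(
            len(toks & set(k.lower().split())) / len(toks | set(k.lower().split()))
            < threshold
            for k in kept
        ):
            kept.append(r)
    return kept

rubrics = [
    "cites evidence for every claim",
    "cites evidence for every major claim",  # near-duplicate of the first
    "uses a clear thesis statement",
]
print(redundancy_filter(rubrics))  # keeps the first and third only
```

In the bound's terms, dropping the near-duplicate reduces the correlation contribution to $V_m$ without sacrificing aggregate edge $\Delta_m$.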
The key empirical payoff: RRD with WU weights improves GPT-4o’s judgment accuracy from 55.6% to 73.3% on JudgeBench (+17.7 points). When used as the reward signal for reinforcement fine-tuning on WildChat, it boosts policy reward by up to 160% for Qwen3-4B and 60% for Llama3.1-8B, versus only ~10–20% for prior rubric baselines.
Perhaps the most conceptually provocative paper in this space is Wang et al., which shows that LLM judges can produce the right verdicts for the wrong reasons — their stated rationales often diverge from the human reasoning behind the preference labels.
This matters because during RLHF, the policy model is being shaped by those rewards. If the reward signal is correct on average during training but wrong in specific edge cases — and those edge cases involve the exact behaviors we most want to generalize — the model will exploit the gaps.
The paper introduces Rationale Consistency as a complementary metric to outcome accuracy: rather than just asking “did the judge pick the right response?”, we ask “is the judge’s reasoning consistent with human reasoning?” Crucially, this dovetails with rubric-based approaches: an explicit rubric with per-criterion scores makes the intermediate reasoning auditable and thus amenable to this kind of consistency check. Rubrics, in this view, are not just a training signal — they are a transparency mechanism that makes reward models more trustworthy.
| Paper | Method | Rubric Source | Requires Training? | Target Domain | Key Innovation |
|---|---|---|---|---|---|
| Gunjal et al. | RaR | LLM generation (prompted) | No (reward model) | General / instruction following | First RaR formulation |
| Liu et al. | OpenRubrics + CRG | Contrastive preference data | Yes (Rubric-RM) | General / IF / biomedical | Contrastive rubric gen; consistency filtering |
| Xie et al. | Auto-Rubric | Propose-Evaluate-Revise | No (training-free) | General | Extreme data efficiency (70 pairs) |
| Shao et al. | RLER | Co-evolves with policy | Yes (policy) | Long-form / deep research | Evolving rubrics during RL |
| Shen et al. | RRD | Recursive decomposition | Optional | Open-ended / general | Principled rubric structure; WU weights |
| Wang et al. | Rationale Consistency | — (analysis framework) | Diagnostic | General (GenRM evaluation) | Identifies deceptive alignment in judges |
| Xu et al. | Rubric-ARM | Joint RL optimization | Yes (generator + judge) | Non-verifiable domains | Alternating RL; variance-justified training order |
The broader ecosystem around rubric-based rewards is expanding quickly as well.
For all the progress, rubric-based rewards still have a number of open problems worth paying attention to:
Rubric quality is the bottleneck. Garbage rubrics produce garbage rewards. Both the coverage (are the right dimensions present?) and the specificity (are they concrete enough to score reliably?) of rubrics matter enormously, and automatic generation is still inconsistent. Shen et al.’s recursive decomposition and filtering is one principled response, but the problem is far from solved.
LLM judges can be gamed. Rubrics are only as good as the judge scoring them. If the policy learns to produce responses that superficially satisfy rubric criteria without actually being better, we have just moved the reward-hacking problem one level up. The rationale consistency work of Wang et al. offers one way to audit for this.
Static rubrics degrade over training. As the policy improves, criteria get saturated. Shao et al.’s evolving rubrics tackle this directly, but principled rubric update schedules remain an open problem.
Aggregation is non-trivial. Rubrics produce a vector of scores; RL needs a scalar. Shen et al.’s signal-to-noise analysis shows that naive uniform weighting can be far from optimal once criteria are correlated, and their whitened-uniform weights are only a first answer.
Computational cost. Scoring a response against $k$ criteria using an LLM judge is roughly $k$ times more expensive than a single scalar score. For large-scale RL training, this adds up fast — the runtime gaps reported above (33.5 s versus 105–382 s per 100 samples) show how much the judging scheme matters in practice.
Rubric-based rewards represent a principled middle ground between the rigid determinism of RLVR (which requires ground-truth verifiers) and the opacity of neural reward models (which collapse human preferences into a single number). By decomposing quality into interpretable, independently evaluable criteria — formalized as a weighted sum $f_{\mathbf{w}, \mathcal{G}}(x, \hat{y}) = \sum_k w_k g_k(x, \hat{y})$ — they offer an interpretable, auditable training signal that extends RL to the many tasks no verifier can cover.
The field is moving fast. In just a few months, we have seen rubric generation go from hand-crafted expert annotations to fully automated, contrastive, even self-evolving pipelines — with formal theoretical backing emerging in parallel. The next frontiers — principled rubric update schedules during training, better correlation-aware aggregation, and joint optimization of generators and judges — are already being actively tackled.
If you are building or studying LLM alignment systems, this is a space worth watching closely.
Thanks to the alphaXiv agent with Claude for polishing this post.