An informal review of RL with rubrics as rewards.
If you have ever graded a student essay, you know a raw number alone — “6 out of 10” — tells you almost nothing useful about why the essay succeeded or failed. What actually helps is a rubric: a structured set of criteria like “supports claims with evidence,” “clear thesis,” or “avoids run-on sentences.” Rubrics break a holistic judgment into interpretable, actionable dimensions.
That same intuition is now reshaping how we train large language models (LLMs). A new wave of work proposes rubric-based rewards for reinforcement learning (RL): instead of feeding the model a single scalar score, we provide a set of human-readable criteria and score each one separately. The result is a richer, more interpretable, and often more reliable training signal — especially for the large swath of tasks where a binary “correct/incorrect” judgment simply does not apply.
This post walks through the motivation, the core ideas, and the recent papers pushing this direction forward.
Modern LLM alignment usually involves some form of reinforcement learning from human feedback (RLHF): a reward model is trained on human preference data, and the LLM is then optimized to maximize that reward. The intuition is clean, but the practice is messy.
At its core, RLHF fine-tunes a policy $\pi_\theta$ to maximize expected reward while staying close to a reference model $\pi_\text{ref}$ via a KL penalty:
\[J(\pi_\theta) = \mathbb{E}_{q \sim p_Q}\Big[\mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}[R(q, o)] - \beta\, D_\text{KL}\!\big(\pi_\theta(\cdot \mid q) \,\|\, \pi_\text{ref}(\cdot \mid q)\big)\Big]\]The reward $R(q, o)$ is the crux of everything. Most of the problems in RLHF ultimately trace back to how this single function is defined, estimated, and optimized.
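To make the objective concrete, here is a minimal sketch of the KL-shaped reward used in practice. It uses the common single-sample estimate of the KL term along a sampled response; the function name, the $\beta$ value, and the log-probabilities are all illustrative:

```python
def kl_shaped_reward(task_reward, policy_logprobs, ref_logprobs, beta=0.1):
    """KL-regularized reward: the task reward minus beta times a
    single-sample estimate of D_KL(pi_theta || pi_ref), computed as the
    sum of (log pi_theta - log pi_ref) over the sampled response tokens."""
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return task_reward - beta * kl_estimate

# Toy example: the policy drifts slightly from the reference on 3 tokens.
policy_lp = [-0.5, -1.0, -0.2]
ref_lp = [-0.6, -1.1, -0.4]
print(kl_shaped_reward(1.0, policy_lp, ref_lp))  # ≈ 0.96
```

The penalty shrinks the reward whenever the policy assigns its sampled tokens much more probability than the reference does, which is what keeps $\pi_\theta$ anchored to $\pi_\text{ref}$.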
Standard reward models map a response to a single number. Human preferences, however, are multifaceted — a response can be accurate but verbose, concise but unhelpful, creative but off-topic. Collapsing all of that into one score discards structure that could guide better optimization. As Liu et al. argue, decomposing quality into explicit, interpretable dimensions restores exactly that structure.
The success of reinforcement learning with verifiable rewards (RLVR) for math and coding has been remarkable. When there is a ground-truth answer or a test suite, you get a clean, unambiguous, and hard-to-game signal. But most of what we care about — long-form writing, open-ended question answering, deep research, creative tasks — lives in a domain where no such verifier exists. How do you reward a well-researched 2,000-word report with a binary signal?
Traditional neural reward models (trained discriminatively on preference pairs) suffer from reward hacking: models learn to exploit the proxy metric in ways that don’t reflect actual quality. They are also opaque — you cannot easily tell why a response got a high score, which makes failures hard to diagnose or prevent.
The key idea is deceptively simple: replace the scalar score with a structured set of natural-language criteria, and score each criterion separately.
Let $x$ denote an input prompt and $\hat{y}$ a model-generated response. A rubric $\mathcal{R}(x)$ is a set of $k$ criteria:
\[\mathcal{R}(x) = \{ c_i \}_{i=1}^k\]where each $c_i$ specifies a distinct aspect of response quality (e.g., factual correctness, reasoning soundness, style). Each criterion $c_i$ can be evaluated as a binary predicate $g_i(x, \hat{y}) \in \{0, 1\}$, where $g_i(x, \hat{y}) = 1$ means the response satisfies that criterion. The rubric reward is then an aggregation over all criteria:
\[f_{\mathbf{w}, \mathcal{G}}(x, \hat{y}) := \sum_{i=1}^{k} w_i\, g_i(x, \hat{y})\]with non-negative weights $\mathbf{w} = (w_1, \dots, w_k) \in \mathbb{R}_+^k$. For pairwise comparison, the judge prefers response $\hat{y}_1$ over $\hat{y}_2$ iff $f_{\mathbf{w}, \mathcal{G}}(x, \hat{y}_1) > f_{\mathbf{w}, \mathcal{G}}(x, \hat{y}_2)$. In practice each $g_i$ is evaluated by an LLM-as-a-Judge rather than a hard-coded rule.
A concrete rubric for “Write a summary of a scientific paper” might check, for example, that the summary states the paper’s main contribution accurately, covers the key results, respects a length limit, and avoids unexplained jargon.
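As a minimal sketch of the weighted aggregation, with hypothetical criteria and weights (in the papers each judgment $g_i$ comes from an LLM judge; here the judgments are stubbed as literals):

```python
# Hypothetical rubric for "Write a summary of a scientific paper".
# Criteria and weights are illustrative, not taken from any paper.
rubric = [
    (0.4, "States the paper's main contribution accurately"),
    (0.3, "Covers the key experimental results"),
    (0.2, "Stays under 200 words"),
    (0.1, "Avoids unexplained jargon"),
]

def rubric_reward(judgments):
    """Aggregate binary per-criterion judgments g_i in {0, 1}
    into the weighted score f = sum_i w_i * g_i."""
    assert len(judgments) == len(rubric)
    return sum(w * g for (w, _), g in zip(rubric, judgments))

# Stubbed judge output: satisfies everything but the length criterion.
print(rubric_reward([1, 1, 0, 1]))  # ≈ 0.8
```

A response can now lose reward for a specific, nameable failure (here, length) rather than receiving an unexplained low scalar.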
This design has several attractive properties: each criterion is independently auditable, a low score can be traced to specific failure dimensions, and the criteria themselves are human-readable, so the reward signal can be inspected and debugged rather than treated as a black box.
The general paradigm, dubbed Rubrics-as-Rewards (RaR), was first articulated by Gunjal et al.
“Rubrics include structured natural language criteria that decompose quality into interpretable and measurable dimensions, providing a more consistent and transparent evaluation framework than scalar judgments.” — Liu et al.
Liu et al. introduce OpenRubrics, a collection of (prompt, rubric) pairs, and Rubric-RM, a reward model trained on it. The framework has two stages: rubric generation and rubric-conditioned reward modeling.
Given a preference dataset $\mathcal{D} = \{(x_i, \{\hat{y}_{i,m}\}_{m=1}^{M_i}, \{\ell_{i,m}\}_{m=1}^{M_i})\}_{i=1}^N$ where $\ell_{i,m}$ are preference signals, they form a ranked ordering $\hat{y}_{i,1} \succ \hat{y}_{i,2} \succ \cdots \succ \hat{y}_{i,M_i}$. Rather than prompting an LLM to generate criteria in isolation, they condition the rubric generator $h_\psi$ on both preferred and rejected responses:
\[\mathcal{R}(x_i) \sim h_\psi\!\left(x_i,\; \{\hat{y}_{i,m}\}_{m=1}^{M_i},\; \{\ell_{i,m}\}_{m=1}^{M_i}\right)\]By contrasting preferred and rejected responses, the model is forced to surface criteria that actually discriminate between good and bad responses — not just plausible-sounding boilerplate.
The rubrics themselves are split into two types: hard rules — binary constraints (like an explicit length limit) that are checked first — and softer principles that capture graded aspects of quality.
Not all generated rubrics faithfully capture the preference signal. For each prompt $x_i$ with candidate responses, they define the set of induced pairwise comparisons $\mathcal{P}_i = \{(a,b) \mid 1 \leq a < b \leq M_i\}$. A rubric is retained only if it yields predictions consistent with human labels:
\[\text{Acc}_i = \frac{1}{|\mathcal{P}_i|} \sum_{(a,b) \in \mathcal{P}_i} \mathbb{I}\!\left[\hat{\ell}_{i,(a,b)} = \ell_{i,(a,b)}\right] \geq \tau\]with threshold $\tau = 0.5$. Formally, the retained rubric is:
\[\mathcal{R}^*_{i,(a,b)} = \begin{cases} \mathcal{R}(x_i), & \text{if } \text{Acc}_i \geq \tau \text{ and } \hat{\ell}_{i,(a,b)} = \ell_{i,(a,b)} \\ \emptyset, & \text{otherwise} \end{cases}\]A rubric generator with parameters $\theta$ is then trained on the retained rubrics with standard cross-entropy:
\[\mathcal{L}_\text{SFT}^\text{rubric} = -\mathbb{E}_{(x, \hat{y}^+, \hat{y}^-, \mathcal{R}^*) \in \mathcal{D}_\text{rubric}} \sum_{t=1}^{|\mathcal{R}^*|} \log p_\theta\!\left(\mathcal{R}^*_t \mid x, \mathcal{R}^*_{<t}\right)\]The reward model $r_\phi$ is then trained to predict preference labels conditioned on prompt, response pair, and generated rubric:
\[\mathcal{L}_\text{SFT}^\text{rm} = -\mathbb{E}_{\mathcal{D}_\text{rm}} \sum_{t=1}^{|\hat{l}|} \log p_\phi\!\left(\hat{l}_t \mid x, \hat{y}^+, \hat{y}^-, \mathcal{R}^*(x), \hat{l}_{<t}\right)\]At inference, the judge selects the preferred response by:
\[\hat{l} = \arg\max_{k \in \mathcal{C}}\; p_\phi\!\left(k \mid x, y^\text{A}, y^\text{B}, \hat{\mathcal{R}}(x)\right)\]where $\mathcal{C} = \{\text{“A is better”}, \text{“B is better”}\}$.
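The consistency-filtering step ($\text{Acc}_i \geq \tau$) reduces to a simple agreement count over induced pairs. A minimal sketch, with the rubric-based judge predictions stubbed as plain dictionaries:

```python
def keep_rubric(pred_pairs, human_pairs, tau=0.5):
    """Consistency filter: retain a rubric only if its induced pairwise
    predictions agree with human labels on at least a tau fraction of
    pairs (Acc_i >= tau).

    pred_pairs / human_pairs map (a, b) index pairs to the index of the
    preferred response; in practice pred_pairs would come from scoring
    both responses under the candidate rubric.
    """
    agree = sum(pred_pairs[pair] == human_pairs[pair] for pair in human_pairs)
    acc = agree / len(human_pairs)
    return acc >= tau

human = {(0, 1): 0, (0, 2): 0, (1, 2): 1}
pred = {(0, 1): 0, (0, 2): 2, (1, 2): 1}  # agrees on 2 of 3 pairs
print(keep_rubric(pred, human))  # Acc = 2/3 >= 0.5, so the rubric is kept
```

Rubrics that sound plausible but contradict the human ranking are filtered out before they ever reach the generator's training set.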
The resulting Rubric-RM surpasses strong baselines by 8.4% on reward modeling benchmarks and achieves state-of-the-art scores on instruction-following benchmarks. A key case study illustrates the practical benefit: when asked to “describe a vivid character in fewer than two paragraphs,” competing reward models (JudgeLRM, RRM) miss the paragraph-count constraint and prefer a verbose, multi-paragraph response. Rubric-RM extracts “must be fewer than two paragraphs” as a hard rule, checks it first, and correctly selects the compliant response.
Xie et al. take a different, training-free route with Auto-Rubric.
Their propose-evaluate-revise pipeline iteratively generates candidate rubrics, checks them against the preference data, and refines or discards those that fail the check.
The remarkable result is data efficiency: using only 70 preference pairs (1.5% of the full dataset), Auto-Rubric enables a Qwen3-8B model to outperform specialized, fully-trained reward model counterparts. No reward model training is needed at all — the rubrics are injected directly into an LLM-as-a-Judge prompt.
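Since no reward model is trained, the whole method ultimately reduces to prompt construction. A hypothetical sketch of such a rubric-injected judge prompt — the template, wording, and criteria here are illustrative, not the paper's actual prompt:

```python
def build_judge_prompt(prompt, resp_a, resp_b, rubrics):
    """Training-free rubric judging: inject distilled criteria directly
    into an LLM-as-a-Judge prompt. Rubric texts are placeholders."""
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(rubrics))
    return (
        "Evaluate which response better answers the prompt.\n"
        "Judge each criterion separately, then pick A or B.\n\n"
        f"Criteria:\n{criteria}\n\n"
        f"Prompt: {prompt}\n\n"
        f"Response A: {resp_a}\n\n"
        f"Response B: {resp_b}\n"
        "Answer with 'A' or 'B'."
    )

demo = build_judge_prompt(
    "Summarize the paper.", "<response A>", "<response B>",
    ["Accurate main claim", "Covers key results"],
)
print(demo)
```

The distilled rubrics do the work a trained reward model would otherwise do: they tell a generic judge what to look for.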
One subtle problem with static rubrics: as the policy model improves during RL training, the rubrics that were challenging at the start may become trivially satisfied. The reward signal degrades.
Shao et al. address this with RLER, in which the rubrics co-evolve with the policy during training rather than staying fixed up front.
RLER is applied to deep research — a long-horizon, multi-step task where a model must search for information, synthesize it, and produce a well-attributed, long-form response. This is precisely the regime where RLVR fails (no binary ground truth exists) and static rubrics quickly become stale. The resulting DR Tulu-8B sits on the Pareto frontier of open models, matching more expensive proprietary systems at a fraction of the cost.
A chicken-and-egg problem lurks in rubric-based reward modeling: to train a good rubric generator, you need reliable judgment of those rubrics; but to judge rubrics reliably, you need good rubrics. Xu et al. cut this knot with Rubric-ARM, which optimizes the rubric generator and the judge jointly via alternating reinforcement learning.
The rubric generator $\pi_r$ produces a rubric $r$ from the prompt:
\[r \sim \pi_r(\cdot \mid x;\, \theta_r)\]The judge $\pi_j$ then produces a reasoning chain $c$ and preference prediction $\hat{o}$ conditioned on the prompt, both candidate responses, and the rubric:
\[(c, \hat{o}) \sim \pi_j(\cdot \mid x, y^{(1)}, y^{(2)}, r;\, \theta_j)\]The reward is binary correctness:
\[R(\hat{o}, o) = \mathbb{I}[\hat{o} = o]\]where $o \in \{0, 1\}$ is the ground-truth preference label. The joint optimization objective is:
\[\max_{\theta_r, \theta_j}\; \mathbb{E}_{(x,y^{(1)},y^{(2)},o) \sim \mathcal{D}}\; \mathbb{E}_{r \sim \pi_r(\cdot \mid x;\theta_r)}\; \mathbb{E}_{(c,\hat{o}) \sim \pi_j(\cdot \mid x, y^{(1)}, y^{(2)}, r;\theta_j)}\!\big[R(\hat{o}, o)\big]\]Since simultaneous optimization is unstable (the learning targets for $\theta_r$ shift as $\theta_j$ changes and vice versa), Rubric-ARM alternates: at iteration $t$, it runs
\[r_i^t \sim \pi_r(\cdot \mid x_i;\, \theta_r^t), \qquad \forall (x_i, y_i^{(1)}, y_i^{(2)}, o_i) \in \mathcal{D}\] \[\theta_j^{t+1} \leftarrow \mathrm{GRPO}\!\left(\theta_j^t;\; \{r_i^t\},\, \mathcal{D}\right) \qquad \text{(judge update)}\] \[\theta_r^{t+1} \leftarrow \mathrm{GRPO}\!\left(\theta_r^t;\; \theta_j^{t+1},\, \mathcal{D}\right) \qquad \text{(generator update)}\]where GRPO (Group Relative Policy Optimization) is the underlying RL algorithm. The judge is always updated before the rubric generator in each cycle — and there is a formal reason for this ordering (discussed below).
The GRPO objective used in both updates is:
\[\mathcal{J}_\text{GRPO}(\theta) = \mathbb{E}_{q \sim p_Q,\, \{o_i\}_{i=1}^G \sim \pi_{\theta_\text{old}}(\cdot \mid q)}\!\left[\frac{1}{G}\sum_{i=1}^G \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\!\Big(\rho_{i,t}(\theta)\,\hat{A}_{i,t},\; \mathrm{clip}\!\big(\rho_{i,t}(\theta), 1-\varepsilon_\text{low}, 1+\varepsilon_\text{high}\big)\,\hat{A}_{i,t}\Big) - \beta\, \mathbb{D}_\text{KL}[\pi_\theta \| \pi_\text{ref}]\right]\]where $\rho_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t} \mid q, o_{i,<t})}$ is the token-level importance ratio and $\hat{A}_{i,t}$ is the group-relative advantage.
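The group-relative advantage $\hat{A}_{i,t}$ is GRPO's key simplification over PPO: instead of a learned value baseline, each rollout's reward is standardized against its own group of $G$ samples. A minimal sketch (some GRPO variants omit the standard-deviation normalization):

```python
def group_relative_advantages(rewards):
    """GRPO-style group-relative advantage: standardize each sampled
    response's reward against the mean and std of its group of G rollouts."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four rollouts for the same prompt, two rewarded and two not.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
# → [1.0, -1.0, 1.0, -1.0]
```

With binary rubric-style rewards, the advantage simply pushes probability toward the rollouts that satisfied more criteria than their group-mates.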
In practice, the judge reward is augmented with a format term:
\[R_j = R_\text{acc} + R_\text{fmt}\]where $R_\text{fmt}$ enforces that the judge addresses each rubric criterion with per-criterion explanations before issuing a final decision.
Rubric-ARM updates the judge before the rubric generator in each cycle. This ordering is not arbitrary — it is justified by a gradient variance analysis.
Let $p(r) = \mathbb{P}(\hat{o} = o^* \mid c, r)$ be the judge’s correctness probability given rubric $r$, and let $u_r(r)$, $u_j(o \mid r)$ be the score functions (log-policy gradients) for the generator and judge respectively.
Proposition (Judge Variance, Strategy A). When the rubric $\bar{r}$ is fixed (reused) during judge updates, the gradient variance is:
\[\mathrm{Var}(\hat{g}_A \mid \bar{r}) = p(\bar{r})\,(1 - p(\bar{r}))\,\|u_j(o^* \mid \bar{r})\|^2\]This is simply the Bernoulli variance of the correctness reward, scaled by the judge’s gradient magnitude.
Proposition (Generator Variance, Strategy B). When the generator is being updated against a fixed judge, the total gradient variance decomposes as:
\[\mathrm{Var}(\hat{g}_B) = \underbrace{\mathbb{E}_r\!\left[p(r)(1-p(r))\|u_r(r)\|^2\right]}_{\text{(I) Multiplicative Reward Noise}} + \underbrace{\mathrm{Var}_r\!\left(p(r)\,u_r(r)\right)}_{\text{(II) Cross-Rubric Inconsistency}}\]Term (I) is the judge’s aleatoric uncertainty amplified by the high-dimensional generator gradient. Term (II) captures the additional instability from different rubrics yielding different expected rewards, causing the gradient direction to oscillate.
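This decomposition is the law of total variance applied to the score-function estimator $\hat{g}_B = R \cdot u_r(r)$ with $R \sim \text{Bernoulli}(p(r))$. A quick Monte-Carlo sanity check, using scalar stand-ins for the gradients (all numbers illustrative):

```python
import random

random.seed(0)

# Toy model: three rubrics r, each with a judge accuracy p(r) and a
# scalar stand-in for the generator score function u_r(r).
rubrics = [(0.9, 2.0), (0.6, -1.0), (0.3, 0.5)]  # (p(r), u_r(r))

# Monte-Carlo estimate of Var(g_B) for g_B = R * u_r(r), R ~ Bernoulli(p(r)).
samples = []
for _ in range(200_000):
    p, u = random.choice(rubrics)
    reward = 1.0 if random.random() < p else 0.0
    samples.append(reward * u)
mean = sum(samples) / len(samples)
mc_var = sum((s - mean) ** 2 for s in samples) / len(samples)

# Analytic decomposition:
# term I  = E_r[p(1-p)u^2]   (multiplicative reward noise)
# term II = Var_r(p * u)     (cross-rubric inconsistency)
term1 = sum(p * (1 - p) * u * u for p, u in rubrics) / len(rubrics)
pu = [p * u for p, u in rubrics]
m = sum(pu) / len(pu)
term2 = sum((x - m) ** 2 for x in pu) / len(pu)

print(round(mc_var, 2), round(term1 + term2, 2))  # the two should agree
```

Term (II) is large here precisely because the three rubrics pull the gradient in different directions — the instability the proposition formalizes.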
Theorem (Strict Variance Domination). Under a mild exploration-gradient sufficiency condition, the generator’s gradient variance strictly dominates the judge’s:
\[\mathrm{Var}(\hat{g}_B) > \mathbb{E}_{\bar{r}}[\mathrm{Var}(\hat{g}_A \mid \bar{r})]\]This proves that updating the generator before the judge amplifies training noise — the unstable rubric exploration dominates. By stabilizing the judge first (Strategy A), Rubric-ARM effectively reduces variance and provides a clearer learning signal before optimizing the more noisy generator.
The alternating RL procedure has a clean theoretical interpretation as a generalized EM algorithm, with rubrics $r$ as latent variables. With $\pi_j$ fixed, updating $\pi_r$ resembles an amortized E-step (finding rubrics that make the correct outcome more likely). With $\pi_r$ fixed, updating $\pi_j$ resembles an M-step (maximizing expected correctness under the current rubric distribution).
Rubric-ARM achieves a +4.7% average gain over strong reasoning-based judges on reward modeling benchmarks, and +6.5% downstream policy improvement across 6 policy benchmarks. Notably, it runs in 33.5 seconds on 100 samples — faster than Rubric-RM (105s) and reasoning-heavy baselines like RM-R1 (170–382s) — because short rubric generation replaces long chain-of-thought decoding.
Shen et al. put rubric design on a theoretical footing, deriving an error bound that prescribes how rubrics should be decomposed, filtered, and weighted.
Starting from the weighted rubric reward $f_{\mathbf{w}, \mathcal{G}}(x, \hat{y}) = \sum_{k=1}^m w_k g_k(x, \hat{y})$, the judge misclassification probability admits an exponential upper bound under mild assumptions (positive edge and bounded correlation):
\[\mathbb{P}(\hat{Y} \neq Y) \leq \exp\!\left(-\frac{1}{2}\min\left\{\frac{\Delta_m^2}{V_m(+1)},\; \frac{\Delta_m^2}{V_m(-1)}\right\}\right)\]where $\Delta_m = \mathbf{w}^\top \boldsymbol{\mu}$ is the aggregate signal strength (each $\mu_k$ measures how much rubric $k$ discriminates preferred from non-preferred responses) and $V_m = \mathbf{w}^\top \mathbf{\Sigma}\, \mathbf{w}$ is the variance proxy of the weighted residuals (capturing rubric noise and inter-rubric correlation). Minimizing this bound amounts to maximizing the signal-to-noise ratio:
\[\Xi = \frac{(\mathbf{w}^\top \boldsymbol{\mu})^2}{\mathbf{w}^\top \mathbf{\Sigma}\, \mathbf{w}}\]This inequality has a direct operational prescription: (1) decompose broad rubrics to increase aggregate edge $\Delta_m$; (2) filter misaligned rubrics to remove negative edge; (3) remove redundancy to reduce $V_m$; and (4) weight optimally to account for correlation structure.
In open-ended domains, ground-truth preference labels needed to estimate $\boldsymbol{\mu}$ are unavailable. Shen et al. show that the optimal weights (maximizing the worst-case $\Xi$ over uncertain $\boldsymbol{\mu}$) are:
\[w_\text{WU} \propto \Sigma^{-1/2}\, \mathbf{1}\]That is, apply the inverse square-root of the rubric covariance matrix $\Sigma$ — a whitening transformation — and then weight uniformly in the whitened space. Crucially, $\Sigma$ can be estimated from unlabeled response data, making this entirely label-free.
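The whitening step can be made concrete for two rubrics, where $\Sigma^{-1/2}$ has a closed form via the eigendecomposition. A sketch for the 2×2 case only; `inv_sqrt_2x2` is a helper written for this illustration, and the covariance values are made up:

```python
import math

def inv_sqrt_2x2(a, b, c):
    """Inverse principal square root of the symmetric PSD matrix
    [[a, b], [b, c]], via its eigendecomposition (Sylvester's formula)."""
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + disc, tr / 2 - disc  # eigenvalues
    f1, f2 = 1 / math.sqrt(l1), 1 / math.sqrt(l2)
    if abs(l1 - l2) < 1e-12:
        return [[f1, 0.0], [0.0, f1]]
    # f(M) = k*M + m*I with k, m chosen so f(l1)=f1 and f(l2)=f2.
    k = (f1 - f2) / (l1 - l2)
    m = f2 - l2 * k
    return [[k * a + m, k * b], [k * b, k * c + m]]

def whitened_uniform_weights(sigma):
    """w_WU ∝ Σ^{-1/2} 1 : whiten the rubric scores, weight uniformly
    in the whitened space, then normalize to sum to one."""
    (a, b), (_, c) = sigma
    s = inv_sqrt_2x2(a, b, c)
    w = [s[0][0] + s[0][1], s[1][0] + s[1][1]]
    total = sum(w)
    return [x / total for x in w]

# Equal-variance, positively correlated rubrics: weights stay symmetric.
print(whitened_uniform_weights([[1.0, 0.5], [0.5, 1.0]]))  # → [0.5, 0.5]
# Uncorrelated rubrics with unequal noise: the noisier one is down-weighted.
print(whitened_uniform_weights([[4.0, 0.0], [0.0, 1.0]]))  # → [1/3, 2/3]
```

The second example shows the intuition: only $\Sigma$ (estimable from unlabeled responses) is needed to discount noisy, redundant criteria.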
Their proposed solution is Recursive Rubric Decomposition (RRD): a top-down, hierarchical approach. Starting from an initial set of LLM-generated rubrics, any rubric $g_m$ satisfied by more than $n = 2$ candidate responses (i.e., insufficiently discriminative) is recursively decomposed into finer sub-criteria by prompting the LLM again. A misalignment filter discards rubrics that prefer outputs from weaker models over stronger ones. A redundancy filter removes rubrics with $\geq 70\%$ semantic overlap with existing ones. The loop terminates via early stopping when too many proposals are rejected. The termination threshold is a tunable hyperparameter (set to 15 in all experiments).
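The redundancy filter, for instance, can be sketched with word-level Jaccard overlap as a crude stand-in for the semantic-overlap measure the paper uses:

```python
def redundancy_filter(rubrics, threshold=0.7):
    """Greedy redundancy filter: drop any rubric whose token-level Jaccard
    overlap with an already-kept rubric is >= threshold. (The paper uses
    semantic overlap; word-level Jaccard is a crude illustrative stand-in.)"""
    kept = []
    for r in rubrics:
        toks = set(r.lower().split())
        if all(
            len(toks & set(k.lower().split())) / len(toks | set(k.lower().split()))
            < threshold
            for k in kept
        ):
            kept.append(r)
    return kept

rubrics = [
    "cites evidence for every claim",
    "cites evidence for every major claim",  # near-duplicate of the first
    "uses a clear thesis statement",
]
print(redundancy_filter(rubrics))  # keeps the first and third only
```

In the bound's terms, dropping the near-duplicate reduces the correlation contribution to $V_m$ without sacrificing aggregate edge $\Delta_m$.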
The key empirical payoff: RRD with WU weights improves GPT-4o’s judgment accuracy from 55.6% to 73.3% on JudgeBench (+17.7 points). When used as the reward signal for reinforcement fine-tuning on WildChat, it boosts policy reward by up to 160% for Qwen3-4B and 60% for Llama3.1-8B, versus only ~10–20% for prior rubric baselines.
Perhaps the most conceptually provocative paper in this space is Wang et al., which shows that LLM judges can produce the right verdicts for the wrong reasons — their stated rationales often diverge from the human reasoning behind the preference labels.
This matters because during RLHF, the policy model is being shaped by those rewards. If the reward signal is correct on average during training but wrong in specific edge cases — and those edge cases involve the exact behaviors we most want to generalize — the model will exploit the gaps.
The paper introduces Rationale Consistency as a complementary metric to outcome accuracy: rather than just asking “did the judge pick the right response?”, we ask “is the judge’s reasoning consistent with human reasoning?” Crucially, this dovetails with rubric-based approaches: an explicit rubric with per-criterion scores makes the intermediate reasoning auditable and thus amenable to this kind of consistency check. Rubrics, in this view, are not just a training signal — they are a transparency mechanism that makes reward models more trustworthy.
| Paper | Method | Rubric Source | Requires Training? | Target Domain | Key Innovation |
|---|---|---|---|---|---|
| Gunjal et al. | RaR | LLM generation (prompted) | No (reward model) | General / instruction following | First RaR formulation |
| Liu et al. | OpenRubrics + CRG | Contrastive preference data | Yes (Rubric-RM) | General / IF / biomedical | Contrastive rubric gen; consistency filtering |
| Xie et al. | Auto-Rubric | Propose-Evaluate-Revise | No (training-free) | General | Extreme data efficiency (70 pairs) |
| Shao et al. | RLER | Co-evolves with policy | Yes (policy) | Long-form / deep research | Evolving rubrics during RL |
| Shen et al. | RRD | Recursive decomposition | Optional | Open-ended / general | Principled rubric structure; WU weights |
| Wang et al. | Rationale Consistency | — (analysis framework) | Diagnostic | General (GenRM evaluation) | Identifies deceptive alignment in judges |
| Xu et al. | Rubric-ARM | Joint RL optimization | Yes (generator + judge) | Non-verifiable domains | Alternating RL; variance-justified training order |
The broader ecosystem around rubric-based rewards is expanding quickly as well.
For all the progress, rubric-based rewards still have a number of open problems worth paying attention to:
Rubric quality is the bottleneck. Garbage rubrics produce garbage rewards. Both the coverage (are the right dimensions present?) and the specificity (are they concrete enough to score reliably?) of rubrics matter enormously, and automatic generation is still inconsistent. Shen et al.’s recursive decomposition and filtering is one principled response, but the problem is far from solved.
LLM judges can be gamed. Rubrics are only as good as the judge scoring them. If the policy learns to produce responses that superficially satisfy rubric criteria without actually being better, we have just moved the reward-hacking problem one level up. The rationale consistency work of Wang et al. offers one way to audit for this.
Static rubrics degrade over training. As the policy improves, criteria get saturated. Shao et al.’s evolving rubrics tackle this directly, but principled rubric update schedules remain an open problem.
Aggregation is non-trivial. Rubrics produce a vector of scores; RL needs a scalar. Shen et al.’s signal-to-noise analysis shows that naive uniform weighting can be far from optimal once criteria are correlated, and their whitened-uniform weights are only a first answer.
Computational cost. Scoring a response against $k$ criteria using an LLM judge is roughly $k$ times more expensive than a single scalar score. For large-scale RL training, this adds up fast — the runtime gaps reported above (33.5 s versus 105–382 s per 100 samples) show how much the judging scheme matters in practice.
Rubric-based rewards represent a principled middle ground between the rigid determinism of RLVR (which requires ground-truth verifiers) and the opacity of neural reward models (which collapse human preferences into a single number). By decomposing quality into interpretable, independently evaluable criteria — formalized as a weighted sum $f_{\mathbf{w}, \mathcal{G}}(x, \hat{y}) = \sum_k w_k g_k(x, \hat{y})$ — they offer an interpretable, auditable training signal that extends RL to the many tasks no verifier can cover.
The field is moving fast. In just a few months, we have seen rubric generation go from hand-crafted expert annotations to fully automated, contrastive, even self-evolving pipelines — with formal theoretical backing emerging in parallel. The next frontiers — principled rubric update schedules during training, better correlation-aware aggregation, and joint optimization of generators and judges — are already being actively tackled.
If you are building or studying LLM alignment systems, this is a space worth watching closely.
Thanks to the alphaXiv agent with Claude for polishing this post.