
RL is proving to be a weird science lately:

> Spurious Rewards: Rethinking Training Signals in RLVR [1]
>
> TL;DR: We show that you can do RLVR on Qwen2.5-Math models with *completely random or incorrect rewards*, and still get massive math benchmark gains.

All of the following spurious rewards give 15-20+ points on MATH-500 when RLVR training Qwen2.5-Math-7B:

- RLVR + format reward (reward responses containing `\boxed{}`): *+16.4%*
- RLVR + incorrect reward (only incorrect answers rewarded): *+24.6%*
- RLVR + random reward: *+21.4%*
- (as a reference) RLVR + ground-truth reward: *+28.8%*

How can these spurious rewards possibly work? Can we get similar gains on other models with broken rewards?
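
To make the spurious rewards concrete, here's a minimal sketch of what the three reward functions described above might look like when plugged into an RLVR loop. The function names, signatures, and answer-extraction regex are my own illustrative assumptions, not from the paper:

```python
import random
import re

def format_reward(response: str) -> float:
    # Reward any response that contains a \boxed{...} answer, regardless of correctness.
    return 1.0 if re.search(r"\\boxed\{.+?\}", response) else 0.0

def incorrect_reward(response: str, gold_answer: str) -> float:
    # Reward only responses whose extracted answer does NOT match the ground truth.
    match = re.search(r"\\boxed\{(.+?)\}", response)
    answer = match.group(1).strip() if match else None
    return 1.0 if answer is not None and answer != gold_answer else 0.0

def random_reward(response: str) -> float:
    # Ignore the response entirely and flip a coin.
    return float(random.random() < 0.5)
```

The striking part is that none of these carries a correct-answer signal (the second one actively opposes it), and Qwen2.5-Math-7B still improves on MATH-500.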

> Learning to Reason without External Rewards: Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. [2]
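
For the Intuitor setup, here's a minimal sketch of how a self-certainty score could be computed from a model's logits and used in place of an external reward in GRPO. I'm assuming self-certainty is the average per-token KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution; see the paper [2] for the exact definition:

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Score one generated sequence by how peaked its next-token distributions are.

    logits: (seq_len, vocab_size) logits at each generated position.
    Returns a scalar: mean over positions of KL(Uniform || p_model),
    which stands in for the external reward in the GRPO update.
    """
    log_probs = F.log_softmax(logits, dim=-1)          # log p(v | context) per position
    vocab_size = logits.shape[-1]
    # KL(U || p) = sum_v (1/V) * log((1/V) / p(v)) = -log V - mean_v log p(v)
    kl_per_token = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return kl_per_token.mean()
```

Higher values mean more peaked distributions, i.e. the model is more "certain" about its own outputs; the score says nothing about whether those outputs are correct.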

[1] https://rethink-rlvr.notion.site/Spurious-Rewards-Rethinking...
[2] https://arxiv.org/abs/2505.19590



I think the fact that spurious rewards were effective predominantly for Qwen suggests the training was triggering some shift in the model's language distribution. If you use those models long enough you'll see a lot of Mandarin make its way into your outputs, and the logits for those tokens tend to look more "confident" than the ones for English tokens.

So the shifting reward values may act as a sort of unintentional regularization technique (similar to adding noise to the discriminator input in GAN architectures).
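
If you wanted to sanity-check that confidence claim, a rough probe (entirely my own sketch, not from the thread; the model name and CJK heuristic are assumptions) would be to sample from the model and compare the probability it assigns to its own sampled tokens, split by whether the decoded token contains CJK characters:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-7B"  # illustrative; any Qwen2.5-Math checkpoint would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

prompt = "Solve: what is the derivative of x^3 + 2x?"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128, do_sample=True,
                     output_scores=True, return_dict_in_generate=True)

gen_ids = out.sequences[0, inputs["input_ids"].shape[1]:]
cjk_probs, other_probs = [], []
for step, token_id in enumerate(gen_ids):
    probs = F.softmax(out.scores[step][0], dim=-1)  # distribution at this generation step
    p = probs[token_id].item()                      # prob assigned to the token actually sampled
    text = tok.decode([token_id])
    is_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    (cjk_probs if is_cjk else other_probs).append(p)

for name, vals in [("CJK tokens", cjk_probs), ("other tokens", other_probs)]:
    if vals:
        print(f"{name}: mean sampled-token prob {sum(vals) / len(vals):.3f} (n={len(vals)})")
```

If the claim holds, the CJK bucket should show a noticeably higher mean probability on a batch of prompts, though a single sample like this is only a rough indication.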


yeah, RLVR is still nascent and hence there's lots of noise.

> How can these spurious rewards possibly work? Can we get similar gains on other models with broken rewards?

It's because in those cases RLVR merely elicits reasoning strategies already instilled in the model through pre-training.

This paper, which uses Reasoning Gym, shows that you need to train for much longer than the papers you mentioned to actually uncover novel reasoning strategies: https://arxiv.org/abs/2505.24864



