
RL is proving to be a weird science lately:

> Spurious Rewards: Rethinking Training Signals in RLVR [1]
>
> TL;DR: We show that you can do RLVR on Qwen2.5-Math models with *completely random or incorrect rewards*, and still get massive math benchmark gains.

All of the following spurious rewards give 15-20+ points on MATH-500 when RLVR training Qwen2.5-Math-7B:

- RLVR + format reward (reward responses containing `\boxed{}`): *+16.4%*
- RLVR + incorrect reward (only incorrect answers rewarded): *+24.6%*
- RLVR + random reward: *+21.4%*
- (as a reference) RLVR + ground-truth reward: *+28.8%*

How can these spurious rewards possibly work? Can we get similar gains on other models with broken rewards?
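
To make the spurious rewards concrete, here's a minimal sketch of what the three reward functions described above might look like when plugged into an RLVR loop. The function names, signatures, and answer-extraction regex are my own illustrative assumptions, not from the paper:

```python
import random
import re

def format_reward(response: str) -> float:
    # Reward any response that contains a \boxed{...} answer, regardless of correctness.
    return 1.0 if re.search(r"\\boxed\{.+?\}", response) else 0.0

def incorrect_reward(response: str, gold_answer: str) -> float:
    # Reward only responses whose extracted answer does NOT match the ground truth.
    match = re.search(r"\\boxed\{(.+?)\}", response)
    answer = match.group(1).strip() if match else None
    return 1.0 if answer is not None and answer != gold_answer else 0.0

def random_reward(response: str) -> float:
    # Ignore the response entirely and flip a coin.
    return float(random.random() < 0.5)
```

The striking part is that none of these carries a correct-answer signal (the second one actively opposes it), and Qwen2.5-Math-7B still improves on MATH-500.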

> Learning to Reason without External Rewards: Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. [2]
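
For the Intuitor setup, here's a minimal sketch of how a self-certainty score could be computed from a model's logits and used in place of an external reward in GRPO. I'm assuming self-certainty is the average per-token KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution; see the paper [2] for the exact definition:

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Score one generated sequence by how peaked its next-token distributions are.

    logits: (seq_len, vocab_size) logits at each generated position.
    Returns a scalar: mean over positions of KL(Uniform || p_model),
    which stands in for the external reward in the GRPO update.
    """
    log_probs = F.log_softmax(logits, dim=-1)          # log p(v | context) per position
    vocab_size = logits.shape[-1]
    # KL(U || p) = sum_v (1/V) * log((1/V) / p(v)) = -log V - mean_v log p(v)
    kl_per_token = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return kl_per_token.mean()
```

Higher values mean more peaked distributions, i.e. the model is more "certain" about its own outputs; the score says nothing about whether those outputs are correct.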

[1] https://rethink-rlvr.notion.site/Spurious-Rewards-Rethinking...
[2] https://arxiv.org/abs/2505.19590



I think the fact that spurious rewards were effective predominantly for Qwen suggests the training was triggering some shift in the model's language distribution. If you use those models long enough you'll see a lot of Mandarin make its way into your outputs, and the logits for those tokens tend to look more "confident" than the ones for English tokens.

So the shifting reward values may act as a sort of unintentional regularization technique (similar to adding noise to the discriminator input in GAN architectures).
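
If you wanted to sanity-check that confidence claim, a rough probe (entirely my own sketch, not from the thread; the model name and CJK heuristic are assumptions) would be to sample from the model and compare the probability it assigns to its own sampled tokens, split by whether the decoded token contains CJK characters:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-7B"  # illustrative; any Qwen2.5-Math checkpoint would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

prompt = "Solve: what is the derivative of x^3 + 2x?"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128, do_sample=True,
                     output_scores=True, return_dict_in_generate=True)

gen_ids = out.sequences[0, inputs["input_ids"].shape[1]:]
cjk_probs, other_probs = [], []
for step, token_id in enumerate(gen_ids):
    probs = F.softmax(out.scores[step][0], dim=-1)  # distribution at this generation step
    p = probs[token_id].item()                      # prob assigned to the token actually sampled
    text = tok.decode([token_id])
    is_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    (cjk_probs if is_cjk else other_probs).append(p)

for name, vals in [("CJK tokens", cjk_probs), ("other tokens", other_probs)]:
    if vals:
        print(f"{name}: mean sampled-token prob {sum(vals) / len(vals):.3f} (n={len(vals)})")
```

If the claim holds, the CJK bucket should show a noticeably higher mean probability on a batch of prompts, though a single sample like this is only a rough indication.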


yeah, RLVR is still nascent and hence there's lots of noise.

> How can these spurious rewards possibly work? Can we get similar gains on other models with broken rewards?

It's because in those cases RLVR merely elicits reasoning strategies already instilled in the model through pre-training.

This paper, which uses Reasoning Gym, shows that you need to train for much longer than the papers you mentioned to actually uncover novel reasoning strategies: https://arxiv.org/abs/2505.24864



