
I think there was a quote from Karpathy saying that RLHF isn't really "true" RL. As an armchair observer, even after trying to understand it, RLHF always seemed roundabout to me. You don't have an open-ended environment; you already have a fixed set of preferences. Instead of optimizing the model directly against those preferences the way DPO does, RLHF goes out of its way to train value/reward networks that encode them and then optimizes against those. I assumed it was done this way for performance or stability or some other math-heavy reason, so it was good to see that my suspicion wasn't off-base.
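For what it's worth, the DPO objective really is just a logistic loss on log-probability ratios between the policy and a frozen reference model, with no learned reward model in the loop. A rough PyTorch-style sketch (the function and variable names here are mine, and beta=0.1 is only an illustrative default):

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Each argument: summed log-probs the policy / frozen reference
        # model assigns to the preferred ("chosen") and dispreferred
        # ("rejected") completion of each prompt in the batch.
        chosen_logratio = policy_chosen_logps - ref_chosen_logps
        rejected_logratio = policy_rejected_logps - ref_rejected_logps
        # Widen the chosen-vs-rejected margin; beta plays the role of
        # the KL-penalty strength in the RLHF objective.
        logits = beta * (chosen_logratio - rejected_logratio)
        return -F.logsigmoid(logits).mean()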


