Two more things concerning Chatbot Arena:
- The prompts people use on it have an incredible sample bias towards certain tasks and styles, and as such are unrepresentative of "overall performance", which is what people expect from a leaderboard.
- It is incredibly easy for a company, its employees or its fanboys to game if they want to. No idea if anyone has done so, but it's trivial.
Just to give one example of the bias: advances in non-English performance don't even register on the leaderboard, because almost everyone rating completions there is doing so in English. You could have a model that scores 100 in English and 0 in every other language, and it would do better on the leaderboard than a model that scores 98 in every human language in the world.
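To put rough numbers on that example, here is a toy sketch. It assumes the leaderboard score behaves like an average of per-language quality weighted by the share of prompts in each language, which is a simplification and not how the Arena actually computes ratings. The function names and the English shares tried are made up for illustration.

```python
def weighted_score(english_score, other_score, english_share):
    """Toy model (an assumption): score = prompt-share-weighted average quality."""
    return english_share * english_score + (1 - english_share) * other_score


def compare(english_share):
    """Return (specialist, generalist) scores for a given share of English prompts."""
    specialist = weighted_score(100, 0, english_share)   # 100 in English, 0 elsewhere
    generalist = weighted_score(98, 98, english_share)   # 98 in every language
    return specialist, generalist


# Hypothetical English shares; the real prompt mix is not given in this comment.
for share in (0.90, 0.98, 0.99):
    specialist, generalist = compare(share)
    print(f"English share {share:.0%}: "
          f"specialist={specialist:.1f}, generalist={generalist:.1f}")
```

Under this toy model the English-only model only comes out ahead once the English share of prompts is above 98%, so how strongly the example holds depends on how skewed the prompt mix really is.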