This was true up until they started training models using Reinforcement Learning from Verifier Feedback (starting with o1). By sticking a calculator in the training loop, they seem to have gotten out of the arithmetic-error regime. That said, the ChatGPT default is GPT-4o, which is still susceptible to these issues.
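
To make the "calculator in the training loop" idea concrete, here is a minimal sketch of what a verifier-style reward could look like. The real training setup isn't public, so everything here is an assumption for illustration: the function name `arithmetic_reward`, the 0/1 reward scheme, and the "take the last integer in the output" parsing rule are all hypothetical.

```python
import re

# Hypothetical verifier: the "calculator" is just Python's exact integer arithmetic.
OPS = {
    "+": lambda x, y: x + y,
    "-": lambda x, y: x - y,
    "*": lambda x, y: x * y,
}

def arithmetic_reward(a: int, b: int, op: str, model_output: str) -> float:
    """Return 1.0 if the last integer in the model's output equals the exact result, else 0.0."""
    expected = OPS[op](a, b)
    # Drop thousands separators, then collect every integer in the text.
    numbers = re.findall(r"-?\d+", model_output.replace(",", ""))
    if not numbers:
        return 0.0  # no answer found, no reward
    # Assumption: the final number in the output is the model's answer.
    return 1.0 if int(numbers[-1]) == expected else 0.0

print(arithmetic_reward(123, 456, "*", "123 * 456 = 56,088"))
print(arithmetic_reward(123, 456, "*", "I think it's 56,098"))
```

The point is only that the reward comes from a program that can't be wrong about arithmetic, rather than from a human rater or a learned preference model, so the policy gets penalized for every slipped digit.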