I would think it’s due to the non determinism. Leaking context would be an unacceptable flaw since many users rely on the same instance.
A/B test is plausible but unlikely since that is typically for testing user behavior. For testing model output you can do that with offline evaluations.
A/B test is plausible but unlikely since that is typically for testing user behavior. For testing model output you can do that with offline evaluations.