
I suppose it could be a quantization issue, but both quants are done by lmstudio-community. Llama3 does have a different architecture and a bigger tokenizer, which might explain it.


You should try ollama and see what happens. On the same hardware, with the same q8_0 quantization on both models, I'm seeing 77 tokens/s with Llama3-8B and 72 tokens/s with CodeGemma-7B. That result surprises me, but the two are still very close in performance.
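If you want to compare numbers the same way, `ollama run <model> --verbose` prints a timing summary after each response, including a generation-speed line. Here's a small sketch that pulls the tokens/s figure out of that summary (the exact log format shown is an assumption and may differ across ollama versions):

```python
import re

# Example of ollama's `--verbose` timing summary (format is an assumption;
# adjust the pattern if your ollama version prints it differently).
sample = """\
total duration:       1.234s
eval count:           89 token(s)
eval duration:        1.156s
eval rate:            77.02 tokens/s
"""

def eval_rate(stats: str) -> float:
    """Extract the generation speed (tokens/s) from the stats block."""
    m = re.search(r"eval rate:\s+([\d.]+) tokens/s", stats)
    if m is None:
        raise ValueError("no eval rate found in stats output")
    return float(m.group(1))

print(eval_rate(sample))  # 77.02
```

Running the same prompt a few times per model and averaging the extracted rate gives a rough but fair apples-to-apples comparison.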


You're right, ollama does perform the same on both models. Thanks.





