Hacker News

Any external verification of the benchmark results?


I'm skeptical, partly because if you go to https://www.swebench.com/, you can see that this company underreported the results of competitors like Amazon Q Developer. I've also seen plenty of other projects claim they've reached 30%+ on SWE-bench without verifying or posting their results on that site.


I skimmed the technical report: https://cosine.sh/blog/genie-technical-report

At the bottom, they noted the following:

> SWE-Bench has recently modified their submission requirements, now asking for the full working process of our AI model in addition to the final results, their condition for having us appear on the official leaderboard. This change poses a significant challenge for us, as our proprietary methodology is evident in these internal processes. Publicly sharing this information would essentially open-source our approach, undermining the competitive advantage we've worked hard to develop. For now, we've decided to keep our model's internal workings confidential. However, we've made the model's final outputs publicly available on GitHub for independent verification. These outputs clearly demonstrate our model's 30% success rate on the SWE-Bench tasks.

Their model outputs are here: https://github.com/CosineAI/experiments/tree/cos/swe-bench-s...


> However we’ve made the model’s final outputs publicly available on GitHub for independent verification.

Sounds legit
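For what it's worth, "independent verification" here would mean re-running the published patches through the SWE-bench evaluation harness and tallying how many instances actually resolve. The tallying step is trivial; a minimal sketch, assuming a hypothetical summary report of resolved instance IDs (the field names and values below are made up, not the harness's actual output format):

```python
def resolve_rate(report: dict) -> float:
    """Fraction of benchmark instances whose patch resolved the issue.

    `report` is a hypothetical summary: a list of resolved instance IDs
    plus the total number of instances evaluated.
    """
    resolved = set(report["resolved_ids"])  # dedupe, just in case
    total = report["total_instances"]
    return len(resolved) / total if total else 0.0

# Toy report standing in for real evaluation output (values invented).
report = {
    "resolved_ids": ["django__django-11099", "sympy__sympy-13480"],
    "total_instances": 10,
}
print(f"{resolve_rate(report):.0%}")  # → 20%
```

The point of the leaderboard requirement is that this number is only meaningful if the outputs provably came from the claimed process, which is exactly what publishing final patches alone doesn't establish.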



