Squarex's comments | Hacker News

Even more so, questions and user answers from agents were not charged as separate requests.

And when you make your harness ask you for next steps in a tool call, the journey continues forever, yeehaa

I would say all benchmarks are inherently subjective. How is yours better? It seems to produce somewhat strange results: Opus 4.6 being worse than 4.5, for example, or Chinese models being rated too high. Kimi, DeepSeek, and GLM are all great in the open-source world, but I don't believe they are ahead of the SOTA models from Anthropic, OpenAI, or Google.

No, some benchmarks are definitely objective, but most can be easily gamed. For example, most of the benchmarks on the model cards: they have measurable answers that don't rely on a human judge (a human wrote the question, but the answers measure some uncontroversial knowledge or capability). But because there is a single correct answer, and those answers leak (or are randomly discovered and optimized for in training), they lose value over time, and regardless, they have a ceiling on the intelligence they can measure.

Others are purely subjective, like LMArena, which really only measures the personality and style preferences of the masses at this point, because frontier LLM technical answers are too hard for the average person to judge.

Then there are some interesting one-off benchmarks, but they lack enough rigor, breadth, and samples to draw larger conclusions from.

So we designed our benchmark with 3 goals: objective measurements (individual submissions not dependent on a human or LLM judge), no known correct answer (so simulations can scale to much higher levels of intelligence), and enough variety over important aspects of intelligence. We do this by running multiple models in cooperative/competitive environments with very complex action spaces and objective scoring, where model performance is relative and affected by the actions of other participants.
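
For illustration only, here is a minimal, hypothetical sketch of what "relative, objective scoring" could look like: models play a simulated match, the environment emits a numeric score per model with no human or LLM judge, and an Elo-style rating is updated from the pairwise outcomes. The function names, starting rating, and K-factor are all assumptions, not the benchmark's actual implementation.

    # Hypothetical sketch of relative scoring in a multi-model environment:
    # each simulated match yields an objective score per model (no judge),
    # and ratings are updated Elo-style from the pairwise outcomes.
    from itertools import combinations

    def expected(r_a: float, r_b: float) -> float:
        """Expected win probability of A against B under an Elo model."""
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def update_ratings(ratings, match_scores, k=16.0):
        """Update ratings in place from one match's objective per-model scores."""
        for a, b in combinations(match_scores, 2):
            # The outcome is decided purely by the environment's scoring.
            if match_scores[a] > match_scores[b]:
                outcome = 1.0
            elif match_scores[a] < match_scores[b]:
                outcome = 0.0
            else:
                outcome = 0.5
            e = expected(ratings[a], ratings[b])
            ratings[a] += k * (outcome - e)
            ratings[b] -= k * (outcome - e)

    # Example: three models start at 1000 and play one simulated round.
    ratings = {"model-a": 1000.0, "model-b": 1000.0, "model-c": 1000.0}
    update_ratings(ratings, {"model-a": 42.0, "model-b": 17.0, "model-c": 17.0})
    print(ratings)

Because the outcome is decided entirely by the environment's score, nothing in the loop requires a judge, and a rating only moves relative to the other participants in the same match.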

And yeah, there are some interesting results when you have a more objective benchmark. It should raise eyebrows when every single sub-release of every company's model is better across the board than its predecessor -- that isn't reality.


The word "objective" just seems too authoritative to me.

I agree that benchmarks are inherently subjective.

But the fact that you cite your belief as your main argument is funny: you don't even have any (inherently subjective) numbers to justify what you believe, you only have "I don't believe".


Sure, I mixed two things together. I don't think this benchmark is bad; I just did not like that it is presented as the ultimate objective truth. The other thing I mentioned is that it delivers different results from other benchmarks, so the "belief" stems from those other benchmarks.

You are arguing from your belief instead of from an objective truth. The benchmark is more objective; if you don't agree with it, come up with a better one. What you believe doesn't matter.

It was not meant as a confrontational take. But all benchmarks are designed by humans, and we are not that great at measuring intelligence, so any benchmark is somewhat subjective. I was just arguing with the word "objective", not with the results per se.

If the benchmark has a correct answer, the benchmark itself is an objective measure (but of what?). The "of what" may well be subjective.

Only if the benchmark is private and done properly on relevant tasks, which is rarely the case. I can guarantee that you have a ton of blind spots if you look at it through the lens of a ranking ladder on some generic tasks.

The rumor was that 5.5 is a brand-new pretrain. But who knows; it's 2x as expensive as 5.4, so it would check out.

If so, that would be big; they haven't been able to successfully pretrain a new model in close to two years (since 4o).

As a European federalist, I would think it is more likely that the EU would implement these restrictions itself rather than step in against Spain.

In theory, we should already be protected against this via the various "net neutrality" directives, but as the US is currently showing us, laws and regulations are ultimately only worth as much as you're willing to enforce them. But things like these are supposed to be worth at least something:

> Regulation 2015/2120 also states that access providers “shall treat all traffic equally, when providing internet access services, without discrimination, restriction or interference, and irrespective of the sender and receiver, the content accessed or distributed, the applications or services used or provided, or the terminal equipment used,” although they are permitted to apply “reasonable traffic management measures.” In any case, those measures must be “transparent, non-discriminatory and proportionate, and shall not be based on commercial considerations but on objectively different technical quality of service requirements of specific categories of traffic” (Article 3.3) - https://www.cuatrecasas.com/en/global/intellectual-property/...

It remains to be seen if something or someone will put a stop to La Liga's shenanigans; judges have seemed unwilling so far, and it is not a big enough problem for the average person to really care about (yet?).


The regulation has an opt-out for court orders though, which these are.

Codex and Gemini CLI are already open source, and so are plenty of other agents. I don't think there is any moat in the Claude Code source.


Well, Claude does boast an absolutely cursed (and very buggy) React-based TUI renderer that I think the others lack! What if someone steals it and builds their own buggy TUI app?


Your favorite LLM is great at building a super buggy renderer, so that's no longer a moat.


Gemini CLI is much worse in my experience, but I agree.


I think that’s a real problem now. In our parliament (Czech) almost every politician is a lawyer or a doctor. Almost no other profession is represented.


It is behind a paywall, but the question itself seems trivial.


It is clearly not. Why would you think so?


The UX feels extremely similar, down to the elicitation... but I did some more research... they were started independently in April 2025. Therefore, one being a fork of the other is almost impossible, and there is no evidence for it. Also, opencode is in Go and Gemini CLI is in TypeScript.

Sadly my above misinformation can no longer be edited.


So you would deny children the greatest source of knowledge in history? I learned math and programming thanks to unlimited access to the web and would not be where I am without it.


>So you would deny children the greatest source of knowledge in history?

Absolutely.

This is much better than destroying "the greatest source of knowledge in history" to make it safe for kids.


This is a false dichotomy. We do not need to do either. Parents are responsible for keeping their children safe on the internet.


>I would not be where I am without it

First of all, you cannot know that, since plenty of people before you learnt that stuff from libraries.

>So you would deny children the greatest source of knowledge in history?

Yes, because other sources of knowledge exist and are much more appropriate for children. It is also the greatest source of despicable stuff in history. When you turn 18, have fun exploring the world wide web.


I still remember Gemini 1.5 Ultra and GPT-4.5 as extremely strong in some areas that no benchmark captures. It was probably not economical to serve them on a $20 subscription, but they felt different and smarter in some ways. The benchmarks seem to be missing something, because Flash 3 was very close to 3 Pro on some benchmarks, but much, much dumber.

