More

dannyw · 2026-04-25T16:37:37 1777135057

This seems designed to provide an easy way for national state actors to get access to anything, since they can just tap and hide a SMS?

jesseendahl · 2026-04-25T19:46:26 1777146386

What do you mean “just tap and hide a SMS”? Maybe my brain isn’t fully booted and I need more coffee, but I’m not understanding what this means.

dannyw · 2026-04-25T04:12:01 1777090321

When people are moving from other models (e.g. GPT, Gemini, etc) the compute that is previously powering that inference now becomes available. Of course, I'm certainly doubtful that Google would break commits and give OpenAI GPUs to Anthropic, but the underlying effect is present and probably sorted out somehow. It's not completely net new compute for the world.

dannyw · 2026-04-24T12:47:31 1777034851

There's stuff like SOC controls and enterprise contracts with enforceable penalties if clauses are breached. ZDR is a thing.

The most significant value of open source models come from being able to fine-tune; with a good dataset and limited scope; a finetune can be crazily worth it.

dannyw · 2026-04-24T12:28:18 1777033698

Another annoyance (for more API use) is summarized/hidden reasoning traces. It makes prompt debugging and optimization much harder, since you literally don't have much visibility into the real thinking process.

dannyw · 2026-04-24T12:23:51 1777033431

Their financial projections that to a big part their valuation and investor story is built on involves actually making money, and lots of money, at some point. That money has to come from somewhere.

dannyw · 2026-04-24T12:20:13 1777033213

Hmm, the Flash performs significantly better than Pro in the benchmark? That's very strange; could rate limiting cause that?

XCSme · 2026-04-24T12:30:42 1777033842

Yes, Flash doesn't seem to have the same rate limits as Pro.

I expect once the API issues are fixed, for v4-pro to be around the same level as GLM-5.

wolttam · 2026-04-24T13:08:55 1777036135

Why would your test be including scores of failed responses/runs? That seems confusing.

(I am confused by the results your website is presenting)

XCSme · 2026-04-24T13:42:14 1777038134

Because the idea of those benchmarks is to see how well a model performs in real-world scenarios, as most models are served via APIs, not self-hosted.

So, for example, hypothetically if GPT-5.5 was super intelligent, but using it via API would fail 50% of the times, then using it in a real-life scenarios would make your workflows fail a lot more often than using a "dumber", but more stable model.

My plan is to also re-test models over-time, so this should account for infrastructure improvements and also to test for model "nerfing".

seanw265 · 2026-04-24T14:11:10 1777039870

I take some issue with that testing methodology. It seems to me that you're conflating the model's performance with the reliability of whatever provider you're using to run the benchmark.

Many models, especially open weight ones, are served by a variety of providers in their lifetime. Each provider has their own reliability statistics which can vary throughout a model's lifetime, as well as day to day and hour to hour.

Not to mention that there are plenty of gateways that track provider uptime and can intelligently route to the one most likely to complete your request.

BoorishBears · 2026-04-24T18:20:42 1777054842

https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu...

That's not even the tip of the iceberg in how useless their benchmark is.

XCSme · 2026-04-24T14:18:24 1777040304

@seanw265 Yes, that's a problem. This can be solved for open-source models by running them on my own, but again the TPS will be dependent on the hardware used.

All models are tested through OpenRouter. The providers on OpenRouter vary drastically in quality, to the point where some simply serve broken models.

That being said, I usually test models a few hours after release, at which point, the only provider is the "official" one (e.g. Deepseek for their models, Alibaba for their own, etc.).

I don't really have any good solution for testing model reliability for closed-source models, BUT the outcome still holds: a model/provider that is more reliable, is statistically more likely to also give better results during at any given time.

A solution would be to regularly test models (e.g. every week), but I don't have the budget for that, as this is a hobby project for now.

wolttam · 2026-04-24T16:14:19 1777047259

If you don't have the budget to test regularly, then including this kind of metric is questionable. You've essentially sampled the infrastructure's reliability at only a few points, which doesn't provide a very meaningful signal. It could mislead future readers about the performance of the overall system (either for the better or the worse).

I'd personally just try to test the model on the model's merits, not the infrastructure. The infrastructure is a constantly changing variable. Many infrastructure failures can be worked around by simply re-submitting the failed request automatically.

XCSme · 2026-04-24T19:53:32 1777060412

> You've essentially sampled the infrastructure's reliability at only a few points, which doesn't provide a very meaningful signal

Well, sampling is still somewhat meaningful, but I agree with you, I am considering making a separate "reliability" score that counts how many times requests failed/timed out before completing.

XCSme · 2026-04-24T13:58:48 1777039128

@danyw, we reached max comment thread depth

Yes, I would. Currently I don't have that many tests (~20), and by default a test "run" includes 3 executions of each test. So, "bad luck" is already sort of solved in each run, by running each test 3 times.

dannyw · 2026-04-24T13:51:53 1777038713

Wouldn't you need to re-run across lots of samples (even for a single eval/bench) to avoid outsized impacts from just bad luck?

dannyw · 2026-04-24T10:02:41 1777024961

Offline mode, and self-hostable apps. I'm very happy with my self-hosted and open source apps; e.g. photo library, media centres, etc; the convenience of cloud, but my cloud that I fully control.

vaylian · 2026-04-24T10:25:20 1777026320

Not to discourage you from these things, but the cloud wasn't a thing in 1999. Storage space was also an issue.

helij · 2026-04-24T15:17:56 1777043876

dannyw · 2026-04-24T08:21:17 1777018877

Are there better providers for inferencing this right now? I know it's launch day, but openrouter showing 30tps isn't looking great.

dannyw · 2026-04-24T08:19:02 1777018742

I do think Nvidia isn't that badly priced; they still have the dominance in training and the proven execution

Biggest risk I see is Nvidia having delays / bad luck with R&D / meh generations for long enough to depress their growth projections; and then everything gets revalued.

dannyw · 2026-04-24T08:15:31 1777018531

Yeah, open weights is really good, especially when base models (not just the instruction tuned) weights are released like here.