Hacker News | versteegen's comments

I've also worked extensively on ARC AGI 1/2, and I mainly agree. Marketing and training. Performance of LLMs on ARC is most importantly a function of training on grid/table-like data. It doesn't have to be specifically synthetic ARC data though. Training an LLM to be better at perceiving grid-like arrangements of data in a spatial way like an image, rather than just tabular, is hugely useful for things outside of ARC benchmarks, though it's a narrow skill. Hence, I'm sure they do it. I want them to do that. I believe the labs when they say they didn't train specifically for ARC-AGI 1/2 (where did Google say otherwise? I don't see it). But it does not mean the models are getting better at general purpose reasoning. They were already plenty good enough at that. You can describe ARC images in words and reason about it using a level of intelligence LLMs have had for years: they're designed to be easy! LLMs just couldn't reason about image-like grids very well.


This explains a lot. But you merely need to look into the family of spice forks to realise, given the way that they're strangely limited to certain operating systems and embedded inside certain proprietary IDEs, that there's something very wrong with the code architecture.

So, that would be an awesome project!


I agree except: this is creative work. Creativity can be and is being mechanised. True originality is extremely rare. Most novelty is the repurposing of one idea or concept elsewhere in a way we find surprising, but the choice to apply A to B could have been made for any reason, including a mechanical one: very many inventions are accidents. In-depth knowledge / conceptual understanding of something is built on abstraction, and abstractions are portable.

If you had a list of N concepts and M ways to apply them you could try all N*M combinations, and get some very interesting results. For a real example, see the theory of inventive problem solving (TRIZ)'s amusing "40 principles of invention" by Soviet inventor Genrich Altshuller. https://en.wikipedia.org/wiki/TRIZ
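As a rough illustration of the N*M enumeration described above, here is a minimal Python sketch. The concept and principle lists are made-up placeholders (the principle names are only loosely TRIZ-flavoured), and the output is just a list of prompts to consider, not a claim about how TRIZ is actually applied.

```python
from itertools import product

# Hypothetical inputs: N concepts and M ways of applying them.
concepts = ["spring", "magnet", "valve"]      # N = 3
principles = ["segmentation", "nesting"]      # M = 2

def enumerate_ideas(concepts, principles):
    """Return every (concept, principle) pairing as a short prompt string."""
    return [f"apply {p} to {c}" for c, p in product(concepts, principles)]

ideas = enumerate_ideas(concepts, principles)
for idea in ideas:
    print(idea)
```

The hard part is of course judging which of the N*M candidates are interesting; the enumeration itself is purely mechanical.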


I'm going to find out. I've been meaning for years to port the OHRRPGCE back to DOS, where it came from.

I'm very surprised to see SDL3 re-gain DOS support, since they've aggressively dropped support for almost every port/OS they had in the SDL 1.2 days.


Very cool. I'd never heard of OHRRPGCE (Official Hamster Republic Role Playing Game Construction Engine) before. I was going to say it feels like an early predecessor to something like RPG Maker but I think RPG Maker originally came out in the early ’90s for the Japanese PC-98 computers.

From the Wikipedia entry [1] for OHRRPGCE:

> It runs at an 8-bit color depth, by default creates games that run at a 320 × 200 resolution.

It's funny but I bet anyone else in here who also grew up with the QBASIC interpreter as a kid instantly thinks SCREEN 13 when they read something like this.

[1] - https://en.wikipedia.org/wiki/Official_Hamster_Republic_Role...


:) SCREEN 13 (VGA Mode 13h) is almost correct, but actually it originally used a 320x200 VGA Mode X assembly graphics library. I believe it used 320x200 instead of 320x240 to stay compatible with earlier pure-QB code for SCREEN 13 that was reused in the engine. (Mode X isn't a single mode; it has some adjustable parameters.)
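For anyone who hasn't met the distinction, here is a small Python sketch of how pixel addressing differs between the two. It assumes the usual layouts (linear one-byte-per-pixel for Mode 13h, four interleaved planes for unchained "Mode X" style modes); it is not taken from the engine's graphics library, and the function names are made up.

```python
WIDTH, HEIGHT = 320, 200

def mode13h_offset(x, y):
    """Mode 13h: linear layout, one byte per pixel, rows stored back to back."""
    return y * WIDTH + x

def modex_address(x, y, width=WIDTH):
    """Unchained (Mode X style): pixel x lives in plane x % 4,
    and each plane holds width // 4 bytes per row."""
    plane = x & 3
    offset = y * (width // 4) + (x >> 2)
    return plane, offset

print(mode13h_offset(319, 199))  # last pixel of the screen in the linear layout
print(modex_address(5, 2))       # (plane, offset within that plane)
```

At 320x200 each plane only needs 16000 bytes per screen, which is why the unchained modes leave room for several video pages.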


Which model's best depends on how you use it. There's a huge difference in behaviour between Claude and GPT and other models, which makes some poor substitutes for others in certain use cases. I think the GPT models are a bad substitute for Claude ones for tasks such as pair-programming (where you want to see the CoT and have immediate responses) and writing code that you actually want to read and edit yourself, as opposed to just letting GPT run in the background to produce working code that you won't inspect. Yes, GPT-5.4 is cheap and brilliant but very black-box and often very slow IME. GPT-5.4 still seems to behave the same as 5.1, which includes problems like: doesn't show useful thoughts, can think for half an hour, says "Preparing the patch now" then thinks for another 20 min, gives no impression of what it's doing, reads microscopic parts of source files and misses context, will do anything to pass the tests including patching libraries...


Interesting (would like to hear more), but solving a Rubik's Cube would appear to be a poor way to measure spatial understanding or reasoning. Ordinary human spatial intuition lets you think about how to move a tile to a certain location, but not really how to make consistent progress towards a solution; what's needed is knowledge of solution techniques. I'd say what you're measuring is 'perception' rather than reasoning.


> what's needed is knowledge of solution techniques

That's definitely in the training data


> how to make consistent progress towards a solution

A 7-year-old child can learn six sequences of a few moves and solve the Rubik's Cube over a weekend. It is a solved algorithm, something an LLM should be very, very good at. What it can't do is reason about spatial relationships.


The Anthropic Pro plan cost double and gave you, I don't know, a tenth the usage, depending on how efficiently you used Copilot requests, and no access to a large set of models including GPT and Gemini and free ones.


Yes, GitHub's per-request pricing was insane; anyone suggesting using CC instead or asking if any other provider is as cheap just doesn't understand the insanity. They were clearly losing a lot of money on the people making good use of it.

I was actually hoping they would change it to something that more closely tracks their actual costs so that they wouldn't have to rug-pull this badly. In particular, what was really bad about it was that sending prompts to agents while they were working (to give them corrections) cost extra, so I stopped doing that (initially OpenCode didn't cause billing for that, until they became official).


Yes, language design is a hugely important determinant of interpreter or JIT speed. There are many highly optimised VMs for dynamic languages, but LuaJIT is king because Lua is such a small and suitable language, and although it does have a couple of difficult-to-optimise features, they are few enough that you can expend the effort. It's nothing like Python. It's not much of an exaggeration to say Python is designed to minimise the possibility of a fast JIT, with compounding layers of dynamism. After years of work, the CPython 3.15 JIT finally managed ~5% faster than the stock interpreter on x86_64.
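A tiny sketch of the kind of dynamism meant here (the class and function names are made up for illustration): even the meaning of `+` inside a hot loop can be rebound at runtime, so a JIT has to guard every assumption it makes about it.

```python
class Vec:
    """Toy wrapper around a number; hypothetical example class."""
    def __init__(self, x):
        self.x = x

    def __add__(self, other):
        return Vec(self.x + other.x)

def total(items):
    """Fold the items with '+', the kind of loop a JIT would like to specialise."""
    acc = items[0]
    for item in items[1:]:
        acc = acc + item
    return acc

items = [Vec(2), Vec(3), Vec(4)]
before = total(items).x

# Any code, anywhere, may rebind the operator on the class at any time.
Vec.__add__ = lambda self, other: Vec(self.x * other.x)
after = total(items).x

print(before, after)
```

The same `total` function now computes a product instead of a sum, without a single line of it changing.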


CPython's current state is more a reflection of resources spent than of what is possible.

See experience with Smalltalk and Self, where everything is dynamic dispatch, everything is an object, in a live image that can be monkey patched at any given second.

PyPy and GraalPy, and the oldie IronPython, are much better experiences than where CPython currently stands.


The problem is that AI has been dominating the conversation for so many years, and they'll get more improvements from removing the GIL than they would from adopting the PyPy JIT.

The JIT would help everyone else more than removing the GIL; I wish PyPy had become the reference implementation during 2.7.


Actually, it is because AI has been driving the conversation that CPython JIT efforts are finally happening and being upstreamed.

It is also because of AI that Intel, AMD and NVIDIA are now getting serious about Python GPU JITs that allow writing kernels in a Python subset.

To the point that I bet Mojo will be too late to matter.


Python is worse, but not by all that much. After all, PyPy has been several times faster for many years.


That is an incorrect analysis. CPython is difficult to JIT because of the lack of thought given to the native bindings / extensions, not because of the language itself (as others point out, PyPy was way faster long ago).


You're correct. I neglected that; extension API compatibility is a big (the most important?) difference between PyPy and CPython's JIT. Amongst language features that affect optimisation potential, an extension API can be the worst.

Edit: I think what you're alluding to is that tracing JITs can overcome a lot of dynamic language features which make things hopeless for method JITs. Where LuaJIT really shines vs PyPy is outside of JITed loops. (Also memory and compile overheads). I realise this is a bit of a motte and bailey.


Von Neumann may possibly have been the smartest man to ever live, but giving him credit for all of this is too much, brushing aside many other inventors (oft independent, to his credit).

