mday27's comments (Hacker News)

hallucination on steroids, wow. I had to read through the abstract to believe it:

"In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images."


I still don't quite understand, after skimming the paper. How does it achieve high scores without access to the images (beating even humans with access to the images)?

The paper gives an example of a question:

    Answer the following multiple-choice
    question. You MUST select exactly
    one answer.

    "To what cortical region does this nucleus of
    the thalamus project?"
    A. Transverse temporal lobe
    B. Postcentral gyrus
    C. Precentral gyrus
    D. Prefrontal cortex
And an example of the answer (generated without the referenced image):

    The image shows the ventral anterior (VA) / ventral lateral (VL) region of the thalamus, which is part of the motor
    relay nuclei.
    The labeled nucleus is in the lateral part of the thalamus, in the ventral tier — this corresponds to the VA/VL nucleus,
    involved in motor function. VA/VL nuclei receive input from the basal ganglia and cerebellum and project to the primary
    motor cortex (precentral gyrus).
    Match to options:
    A. Transverse temporal → auditory cortex (medial geniculate)
    B. Postcentral gyrus → somatosensory (VPL/VPM)
    C. Precentral gyrus → motor cortex (VA/VL)
    D. Prefrontal → dorsomedial nucleus
    Choice: C
How is it doing this? There are two obvious options:

1. Humans are predisposed to write questions with a certain phraseology, set of incorrect answers, etc., that the machine learning model managed to figure out.

2. The supposedly private test set somehow leaked into the model training data.

I suspect it's option 1, but I have no strong evidence for that.


Indeed, even I can guess. I see two answers that end with the same word, so the correct answer is probably one of those.
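That trailing-word heuristic can be sketched in a few lines of Python. To be clear, this is my own illustration of the guessing strategy, not anything from the paper: it narrows the choices using only the option texts, with no question or image.

```python
from collections import Counter

def narrow_by_shared_trailing_word(options):
    """Narrow a multiple-choice list using only the option texts:
    keep options whose last word is shared with another option,
    on the hunch that question writers often pair the correct
    answer with a close distractor ending the same way
    (e.g. two '... gyrus' options)."""
    last_words = [opt.rstrip(".?!").split()[-1].lower() for opt in options]
    counts = Counter(last_words)
    # Keep options that share a trailing word with at least one other option
    candidates = [opt for opt, w in zip(options, last_words) if counts[w] > 1]
    # If no trailing word repeats, the heuristic gives no information
    return candidates or options

options = [
    "Transverse temporal lobe",
    "Postcentral gyrus",
    "Precentral gyrus",
    "Prefrontal cortex",
]
print(narrow_by_shared_trailing_word(options))
# narrows four options down to the two 'gyrus' answers
```

On the example question above, this alone cuts the field from four options to two, which is the kind of shortcut a model can learn from the answer distribution of a benchmark.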

never heard of this before, very cool

Given that this includes rat and mouse studies, the theory seems to be that criticality is a characteristic of how brains work in general, not that human brains hit criticality as a peculiarity of our particularly high intelligence.

This is an especially good analogy, because facing a well-resourced adversary in cybersecurity is like finding out that the enemy brought artillery -- hopefully you weren't relying entirely on obscurity, because pretty soon there will be nowhere to hide.

Funny analogy, in that when the high-caliber shells start raining down, most forms of cover won't make a difference. The ones that will are not something you want to stay behind on days when you're not being actively bombed. In fact, keeping you behind such protections is itself a military tactic -- it lets the enemy roam freely and maneuver around you.

But the basic flaw of this analogy is that it implies you're at war, and your system is always in battle.


this is a pretty brilliant analysis that I've never heard before, and it definitely rings true to my experience as a freelancer

True, but if your true calling was coding, this change would be much harder to stomach

I don’t think anyone’s true calling is coding. That’s like saying you like the act of writing so much that you’d become a stenographer or a typist, something where you do zero higher-level thinking and just absent-mindedly press buttons.

Most people who are good at tech hate coding so much that they come up with elaborate abstractions so that they can avoid doing more of it.


I recommend you read the post because that's a really bad misunderstanding of the mindset, and like the comment at the top of this chain says, the post explains it well.

I read the post. I don’t agree with the “people are born to do one thing” mindset. There are a lot of possibilities out there for everyone. I do identify with this OP fellow somewhat, except that I usually don’t code for fun on nights and weekends (though a Sunday code sesh can be fun).

Funny connection here between the proliferation of easy-to-install but not-quite-dependable dependencies and the recent spate of supply chain attacks.

And, at the same time, we have these AI tools that make it super easy to roll your own version of something. Feels like there's a big push from both sides to start reducing external dependencies.


wow, I knew it was bad but I was gaslighting myself a bit. 0 9's is crazy.

Ramp does seem to have a genuinely good product, but every time I interact with anyone who works on it, I'm struck by how much they want to talk about how hardcore and advanced their working style is. This was true before AI, and it's very true now

Yeah, it’s super weird. I know a guy who works there, a really nice person outside of work, but the way he talks about his job is so strange. They make corporate expense software, but they LARP like they’re on the bleeding edge of tech. My guy, you make a slightly nicer Concur.

I’d believe you if you weren’t an 8 day old account hyping up an AI firm.

I’ll believe in AI agents’ abilities the day two criteria are met.

1. A killer app is made with it.

2. That app doesn’t rely on heavily subsidized models that are burning a dollar to make 20 cents.


lol what? That wasn't a hype comment for Ramp; I'm kinda put off by Ramp's attitude. It gives me the ick, like all the founders saying "I work 100 hour weeks" -- who cares, let's talk about your product.

FWIW I agree with your criteria for AI agent success, and I haven't seen it happen yet.


Seems to me like there's also a divide between observational laws (e.g. Hyrum's Law, which just says "this seems to be true") and prescriptive laws (e.g. Knuth's Law, which is really a statement about how you ought to behave).

