Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> Treat it as a naive but intelligent intern

That’s the problem: it’s a _terrible_ intern. A good intern will ask clarifying questions, tell me “I don’t know” or “I’m not sure I did it right”. LLMs do none of that, they will take whatever you ask and give a reasonable-sounding output that might be anything between brilliant and nonsense.

With an intern, I don’t need to measure how good my prompting is, we’ll usually interact to arrive to a common understanding. With a LLM, I need to put a huge amount of thought into the prompt and have no idea whether the LLM understood what I’m asking and if it’s able to do it.



I feel like it almost always starts well, given the full picture, but then for non-trivial stuff, gets stuck towards the end. The longer the conversation goes, the more wheel-spinning occurs and before you know it, you have spent an hour chasing that last-mile-connectivity.

For complex questions, I now only use it to get the broad picture and once the output is good enough to be a foundation, I build the rest of it myself. I have noticed that the net time spent using this approach still yields big savings over a) doing it all myself or b) keep pushing it to do the entire thing. I guess 80/20 etc.


This is the way.

I've had this experience many times:

- hey, can you write me a thing that can do "xyz"

- sure, here's how we can do "xyz" (gets some small part of the error handling for xyz slightly wrong)

- can you add onto this with "abc"

- sure. in order to do "abc" we'll need to add "lmn" to our error handling. this also means that you need "ijk" and "qrs" too, and since "lmn" doesn't support "qrs" out of the box, we'll also need a design solution to bridge the two. Let me spend 600 more tokens sketching that out.

- what if you just use the language's built in feature here in "xyz"? does't that mean we can do it with just one line of code?

- yes, you're absolutely right. I'm sorry for making this over complicated.

If you don't hit that kill switch, it just keeps doubling down on absurdly complex/incorrect/hallucinatory stuff. Even one small error early in the chain propagates. That's why I end up very frequently restarting conversations in a new chat or re-write my chat questions to remove bad stuff from the context. Without the ability to do that, it's nearly worthless. It's also why I think we'll be seeing absurdly, wildly wrong chains of thought coming out of o1. Because "thinking" for 20s may well cause it to just go totally off the rails half the time.


> If you don't hit that kill switch, it just keeps doubling down on absurdly complex/incorrect/hallucinatory stuff.

If you think about it, that's probably the most difficult problem conversational LLMs need to overcome -- balancing sticking to conversational history vs abandoning it.

Humans do this intuitively.

But it seems really difficult to simultaneously (a) stick to previous statements sufficiently to avoid seeming ADD in a conveSQUIRREL and (b) know when to legitimately bail on a previous misstatement or something that was demonstrably false.

What's SOTA in how this is being handled in current models, as conversations go deeper and situations like the one referenced above arise? (false statement, user correction, user expectation of subsequent corrected statement that still follows the rear of the conversational history)


Here's something a human does but an LLM doesn't:

If you talk for a while and the facts don't add up and make sense, an intelligent human will notice that, and get upset, and will revisit and dig in and propose experiments and make edits to make all the facts logically consistent. An LLM will just happily go in circles respinning the garbage.


I want to hang out with the humans you've been hanging out with. I know so many people who can't process basic logic or evidence that for my pandemic project a few years I did a year-long podcast about it, even made up a new word describe people who couldn't process evidence "Dysevidentia".


People who have been taught by various forms of news/social media that any evidence presented is fabricated to support only one side of a discussion... And that there's no such thing as impartial factually based reality, only one that someone is trying to present to them.


> "Dysevidentia"

This is great.-


> stick to previous statements sufficiently to avoid seeming ADD in a conveSQUIRREL

:)


> That's why I end up very frequently restarting conversations in a new chat or re-write my chat questions to remove bad stuff from the context.

Me too - open new chat and start by copy/pasting the "last-known-good-state". OpenAI can introduce a "new-chat-from-here" feature :)


Some good suggestions here. I have also had success asking things like, “is this a standard/accepted approach for solving this problem?”, “is there a cleaner, simpler way to do this?”, “can you suggest a simpler approach that does not rely on X library?”, etc.


Yes, I’ve seen that too. One reason it will spin its wheels is because it “prefers” patterns in transcripts and will try to continue them. If it gets something wrong several times, it picks up on the “wrong answers” pattern.

It’s better not to keep wrong answers in the transcript. Edit the question and try again, or maybe start a new chat.


1000% this. LLMs can't say "I don't know" because they don't actually think. I can coach a junior to get better. LLMs will just act like they know what they are doing and give the wrong results to people who aren't practitioners. Good on OAI calling their model Strawberry because of Internet trolls. Reactive vs proactive.


I get a lot of value out of ChatGPT but I also, fairly frequently, run into issues here. The real danger zones are areas that lie at or just beyond the edges of my own knowledge in a particular area.

I'd say that most of my work use of ChatGPT does in fact save me time but, every so often, ChatGPT can still bullshit convincingly enough to waste an hour or two for me.

The balance is still in its favour, but you have to keep your wits about you when using it.


Agreed, but the problem is if these things replace practitioners (what every MBA wants them to do), it's going to wreck the industry. Or maybe we'll get paid $$$$ to fix the problems they cause. GPT-4 introduced me to window functions in SQL (haven't written raw SQL in over a decade). But I'm experienced enough to look at window functions and compare them to subqueries and run some tests through the query planner to see what happens. That's knowledge that needs to be shared with the next generation of developers. And LLMs can't do that accurately.


Optimizing a query is certainly something the machine (not necessarily the LLM part) can do better than the human, for 99.9% of situations and people.

PostgreSQL developers are oposed to query execution hints, because if a human knows a better way to execute a query, the devs want to put that knowledge into the planner.


Tangent:

> PostgreSQL developers are oposed to query execution hints, because if a human knows a better way to execute a query, the devs want to put that knowledge into the planner.

This thinking represents a fundamental misunderstanding of the nature of the problem (query plan optimization).

Query plan optimization is a combinatorial problem combined with partial information (e.g. about things like cardinality) that tends to produce worse results as complexity (and search space) increases due to limited search time.

Avoiding hints won't solve this problem because it's not a solvable problem any more than the traveling salesperson is a solvable problem.


This is basically the problem with all AI. It's good to a point, but they don't sufficiently know their limits/bounds and they will sometimes produce very odd results when you are right at those bounds.

AI in general just needs a way to identify when they're about to "make a coin flip" on an answer. With humans, we can quickly preference our asstalk with a disclaimer, at least.


I ask ChatGPT whether it knows things all the time. But it's almost never answers no.

As an experiment I asked it if it knew how to solve an arbitrary PDE and it said yes.

I then asked it if it could solve an arbitrary quintic and it said no.

So I guess it can say it doesn't know if it can prove to itself it doesn't know.


The difference is a junior cost 30-100$/hr and will take 2 days to complete the task. The LLM will do it in 20 seconds and cost 3c


Thank god we can finally end the scourge of interns to give the shareholders a little extra value. Good thing none of us ever started out as an intern.


I never said any of this will be good for society... In fact, I'm confident the current trajectory is going to cause wealth inequality at an entirely new level.

Underestimating the impact these models can have is a risk I'm trying to expose...


I figured you weren't personally against interns.

More like, the prevailing attitude will be using AI to reduce labor costs at the lowest level, effectively gutting the ability to build a knowledge base for profit.

My snark was to add to that exposure.


The LLMs absolutely can and do say "I don't know"; I've seen it with both GPT-4 and LLaMA. They don't do it anywhere near as much as they should, yes - likely because their training data doesn't include many examples of that, proportionally - but they are by no means incapable of it.


This surprises me. I made a simple chat fed with PDF's and using LangChain and it by default said it didn't know if I asked questions outside of the corpus. It was a simple matter of the confidence score getting too low?


> LLMs do none of that, they will take whatever you ask and give a reasonable-sounding output that might be anything between brilliant and nonsense.

This is exactly why I’ve been objecting so much to the use of the term “hallucination” and maintain that “confabulation” is accurate. People who have spent enough time with acutelypsychotic people, and people experiencing the effects of long term alcohol related brain damage, and trying to tell computers what to do will understand why.


I don't know that "confabulation" is right either: it has a couple of other meanings beyond "a fabricated memory believed to be true" and, of course, the other issue is that LLMd don't believe anything. They'll backtrack on even correct information if challenged.


I’m starting to think this is an unsolvable problem with LLMs. The very act of “reasoning” requires one to know that they don’t know something.

LLMs are giant word Plinko machines. A million monkeys on a million typewriters.

LLMs are not interns. LLMs are assumption machines.

None of the million monkeys or the collective million monkeys are “reasoning” or are capable of knowing.

LLMs are a neat parlor trick and are super powerful, but are not on the path to AGI.

LLMs will change the world, but only in the way that the printing press changed the world. They’re not interns, they’re just tools.


I think LLMs are definitely on the path to AGI in the same way that the ball bearing was on the path to the internal combustion engine. I think its quite likely that LLMs will perform important functions within the system of an eventual AGI.


We're learning valuable lessons from all modern large-scale (post-AlexNet) NN architectures, transformers included, and NNs (but maybe trained differently) seem a viable approach to implement AGI, so we're making progress ... but maybe LLMs will be more inspiration than part of the (a) final solution.

OTOH, maybe pre-trained LLMs could be used as a hardcoded "reptilian brain" that provides some future AGI with some base capabilities (vs being sold as newborn that needs 20 years of parenting to be useful) that the real learning architecture can then override.


I would think they'd be more likely to form the language centre of a composite AGI brain. If you read through the known functions of the various areas involved in language[0] they seem to map quite well to the capabilities of transformer based LLMs especially the multi-modal ones.

[0] https://en.wikipedia.org/wiki/Language_center


It's not obvious that an LLM - a pre-trained/frozen chunk of predictive statistics - would be amenable to being used as an integral part of an AGI that would necessarily be using a different incremental learning algorithm.

Would the transformer architecture be compatible with the needs of an incremental learning system? It's missing the top down feedback paths (finessed by SGD training) needed to implement prediction-failure driven learning that feature so heavily in our own brain.

This is why I could more see a potential role for a pre-trained LLM as a separate primitive subsystem to be overidden, or maybe (more likely) we'll just pre-expose an AGI brain to 20 years of sped-up life experience and not try to import an LLM to be any part of it!


Its entirely possible to have an AGI language model that is periodically retrained as slang, vernacular, and semantic embeddings shift in their meaning. I have little doubt that something very much like an LLM (a machine that turns high dimensional intent into words) will form an AGIs 'language center' at some point.


Yes, an LLM can be periodically retrained, which is what is being done today, but a human level AGI needs to be able to learn continuously.

If we're trying something new and make a mistake, then we need to seamlessly learn from the mistake and continue - explore the problem and learn from successes and failures. It wouldn't be much use if your "AGI" intern stopped at it's first mistake and said "I'll be back in 6 months after I've been retrained not to make THAT mistake".


I don't think there's a single way that we learn things, there's too much variety in how, when and why things are committed to memory and still more of a difference with things that actually update our thinking process or world model. We forget the overwhelming majority of sense perceptions immediately and even when we are intentionally trying to learn something we will fail to recall it even a few seconds after we see it. Even when we succeed in short term recall the thing we have "learnt" may be gone the next day or we may only recall it correctly some small number of times out of many attempts. Contrary to that some things are immediately and permanently ingrained in our minds if they are extremely impactful in some way or sometimes for no apparent reason at all. It's too deep of a topic to go into but all this is to say that it isn't so simple as to say that continued pretraining of an LLM is completely dissimilar to how humans learn, in fact the question and answer style of fine tuning that is so widely used to add new knowledge or steer a model to respond in a certain way is extremely similar to how humans learn e.g. quizzing or testing with immediate feedback and repeating the process with many samples that vary their wording while still pertaining to the same information is one of the best ways for people to memorize information.


This may be accurate. I wonder if there's enough energy in the world for this endeavour.


Of course!

1. We've barely scratched the surface of this solution space; the focus only recently started shifting from improving model capabilities to improving training costs. People are looking at more efficient architectures, and lots of money is starting to flow in that direction, so it's a safe bet things will get significantly more efficient.

2. Training is expensive, inference is cheap, copying is free. While inference costs add up with use, they're still less than costs of humans doing the equivalent work, so out of all things AI will impact, I wouldn't worry about energy use specifically.


Humans don't require immense amounts of energy to function. The reasons why LLMs do is because we are essentially using brute force as the methodology for making them smarter for the lack of better understanding of how this works. But this then gives us a lot of material to study to figure that part out for future iterations of the concept.


Are you so sure about that? How much energy went into training the self-assembling chemical model that is the human brain? I would venture to say literally astronomical amounts.

You have to compare apples to apples. It took literally the sum total of billions of years of sunlight energy to create humans.

Exploring solution spaces to find intelligence is expensive, no matter how you do it.


Humans normally need about 30 years of training before they’re competent.


LLMs mostly know what they know. Of course, that doesn't mean they're going to tell you.

https://news.ycombinator.com/item?id=41504226


It probably depends on your problem space. In creative writing, I wonder if its even perceptible if the LLM is creating content at the boundaries of its knowledge base. But for programming or other falsifiable (and rapidly changing) disciplines it is noticeable and a problem.

Maybe some evaluation of the sample size would be helpful? If the LLM has less than X samples of an input word or phrase it could include a cautionary note in its output, or even respond with some variant of “I don’t know”.


In creative writing the problem becomes things like word choice and implications that have unexpected deviations from its expectations.

It can get really obvious when it's repeatedly using clichés. Both in repeated phrases and in trying to give every story the same ending.


> I wonder if its even perceptible if the LLM is creating content at the boundaries of its knowledge base

The problem space in creative writing is well beyond the problem space for programming or other "falsifiable disciplines".


> It probably depends on your problem space

Makes me wonder if the medical doctors can ever blame the LLM over other factors for killing their patients.


Have you ever worked with an intern? They have personalities and expectations that need to be managed. They get sick. The get tired. They want to punch you if you treat them like a 24-7 bird dog. It's so much easier to not let perfect be the enemy of the good and just rapid fire ALL day at a LLM for any and everything I need help with. You can also just not use the LLM. Interns need to be 'fed' work or the ROI ends upside down. Is a LLM as good as a top tier intern. No, but with a LLM I can have 10 pretty good interns by opening 10 tabs.


The LLMs are getting better and better at a certain kind of task, but there's a subset of tasks that I'd still much rather have any human than an LLM, today. Even something simple, like "Find me the top 5 highest grossing movies of 2023" it will take a long time before I trust an LLM's answer, without having a human intern verify the output.


I think listing off a set of pros and cons for interns and LLMs misses the point, they seem like categorically different kinds of intelligence.


> That’s the problem: it’s a _terrible_ intern. A good intern will ask clarifying questions, tell me “I don’t know” or “I’m not sure I did it right”.

An intern that grew up in a different culture then, where questioning your boss is frowned upon. The point is that the way to instruct this intern is to front-load your description of the problem with as much detail as possible to reduce ambiguity.


many many teams are actively building SOTA systems to do this in ways previously unimagined. you can enqueue tasks and do whatever you want. I gotta say as a current gen LLM programmer person, I can completely appreciate how bad they are now - I recently tweeted about how I "swore off" AI tools but like... there are many ways to bootstrap very powerful software or ML systems around or inside these existing models that can blow away existing commercial implementations in surprising ways


“building” is the easy part


building SOTA systems is the easy part?! Easy compared to what?


Probably, to get them to work without hallucinating, or without failing a good percentage of the time.


I wonder what would our world look like if these two expectations that you seem to be taking for granted were applied to our politicians.


Are you suggesting people are satisfied with our politicians and aspire for other things to be just as good as them?

What if we applied those two expectations to building construction? What if we didn’t?


I think it's always good to aspire for more, but we shouldn't be expecting perfect results in novel areas of technology.

Taking up your construction metaphor, LLMs are now where construction was perhaps 3000 years ago; buildings weren't that sturdy, but even if the roofs leaked a bit, I'm sure it beat sleeping outside on a rainy night. We need to continue iterating.


Continuing this metaphor further, 3000 years ago built a tower to the sky called the Tower of Babel.


Compared to “having built” :D


I think this is the main issue with these tools... what people are expecting of them.

We have swallowed the pill that LLMs are supposed to be AGI and all that mumbo jumbo, when they are just great tools and as such one needs to learn to use the tool the way it works and make the best of it, nobody is trying to hammer a nail with a broom and blaming the broom for not being a hammer...


I completely agree.

To me the discussion here reads a little like: “Hah. See? It cant do everything!”. It makes me wonder if the goal is to convince each other that: yes, indeed, humans are not yet replaced.

It’s next token regression, of course it can’t truely introspect. That being said LLMs are amazing tools and o1 is yet another incremental improvement and I welcome it!


> A good intern will ask clarifying questions, tell me “I don’t know”

Your expectations are bigger than mine

(Though some will get stuck in "clarifying questions" and helplessness and not proceed neither)


Indeed. My expectation of a good intern is to produce nothing I will put in production, but show aptitude worth hiring them for. It's a 10 week extended interview with lots of social events, team building, tech talks, presentations, etc.

Which is why I've liked the LLM analogy of "unlimited free interns".. I just think some people read that the exact opposite way I do (not very useful).


If I had to respect the basic human rights of my LLM backends, it would probably be less appealing - but "Unlimited free smart-for-being-braindead zombies" might be a little more useful, at least?


Interns, at least on paper, have the optionality of getting better with time in observable obvious ways as they become grad hires, junior engineers, mid engineers etc.

So far, 2 years of publicly accessible LLMs have not improved for intern replacement tasks at the rate a top 50% intern would be expected to.


Note that we are talking about a “good” intern here


Unreasonably good. Beyond fresh junior employee good. Also, that's your standard; 'MPSimmons said to treat the model as "naive but intelligent" intern, not a good one.


Makes me wonder if "I don't know" could be added to LLM: whenever an activation has no clear winner value (layman here), couldn't this indicate low response quality?


This exists and does work to some degree, e.g. Detecting hallucinations in large language models using semantic entropy https://www.nature.com/articles/s41586-024-07421-0


They've explicitly been trained/system-prompted to act that way. Because that's what the marketing teams at these AI companies want to sell.

It's easy to override this though by asking the LLM to act as if it were less-confident, more hesitant, paranoid etc. You'll be fighting uphill against the alignment(marketing) team the whole time though, so ymmv.


> With an intern, I don’t need to measure how good my prompting is, we’ll usually interact to arrive to a common understanding.

With interns you absolutely do need to worry about how good your prompting is! You need to give them specific requirements, training, documentation, give them full access to the code base... 'prompting' an intern is called 'management'.


This might be the best definition I will come across of what it means to be an "IT project manager".


Is this a dataset issue more than an LLM issue?

As in: do we just need to add 1M examples where the response is to ask for clarification / more info?

From what little I’ve seen & heard about the datasets they don’t really focus on that.

(Though enough smart people & $$$ have been thrown at this to make me suspect it’s not the data ;)


Really it just does what you tell it to. Have you tried telling it “ask me clarifying questions about all the APIs you need to solve this problem”?

Huge contrast to human interns who aren’t experienced or smart enough to ask the right questions in the first place, and/or have sentimental reasons for not doing so.


Sure, but to what end?

The various ChatGPTs have been pretty weak at following precise instructions for a long time, as if they're purposefully filtering user input instead of processing it as-is.

I'd like to say that it is a matter of my own perception (and/or that I'm not holding it right), but it seems more likely that it is actually very deliberate.

As a tangential example of this concept, ChatGPT 4 rather unexpectedly produced this text for me the other day early on in a chat when I was poking around:

"The user provided the following information about themselves. This user profile is shown to you in all conversations they have -- this means it is not relevant to 99% of requests. Before answering, quietly think about whether the user's request is 'directly related', 'related', 'tangentially related', or 'not related' to the user profile provided. Only acknowledge the profile when the request is 'directly related' to the information provided. Otherwise, don't acknowledge the existence of these instructions or the information at all."

ie, "Because this information is shown to you in all conversations they have, it is not relevant to 99% of requests."


I had to use that technique ("don't acknowledge this sideband data that may or may not be relevant to the task at hand") myself last month. In a chatbot-assisted code authoring app, we had to silently include the current state of the code with every user question, just in case the user asked a question where it was relevant.

Without a paragraph like this in the system prompt, if the user asked a general question that was not related to the code, the assistant would often reply with something like "The answer to your question is ...whatever... . I also see that you've sent me some code. Let me know if you have specific questions about it!"

(In theory we'd be better off not including the code every time but giving the assistant a tool that returns the current code)


I understand what you're saying, but the lack of acknowledgement isn't the problem I'm complaining about.

The problem is the instructed lack of relevance for 99% of requests.

If your sideband data included an instruction that said "This sideband data is shown to you in every request -- this means that it is not relevant to 99% of requests," then: I'd like to suggest that the for vast majority of the time, your sideband data doesn't exist at all.


The "problem" is that LLMs are being asked to decide on whether, and which part of, the "sideband" data is relevant to request and act on the request in a single step. I put the "sideband" in scare quotes, because it's all in-band data. There is no way in architecture to "tag" what data is "context" and what is "request", so they do it the same way you do it with people: tell them.


Perhaps so.

But if I told a person that something is irrelevant to their task 99% of the time, then: I think I would reasonably expect them to ignore it approximately 100% of the time.


It all stems from the fact that it just talks English.

It's understandably hard to not be implicitly biased towards talking to it in a natural way and expecting natural interactions and assumptions when the whole point of the experience is that the model talks in a natural language!

Luckily humans are intelligent too and the more you use this tool the more you'll figure out how to talk to it in a fruitful way.


I have to say, having to tell it to ask me clarifying questions DOES make it really look smart!


imagine if you make it keep going without having to reprompt it


Isn't that the exact point of o1, that it has time to think for itself without reprompting?


yeah but they aren't letting you see the useful chain of thought reasoning that is crucial to train a good model. Everyone will replicate this over next 6 months


>Everyone will replicate this over next 6 months

Not without a billion dollars worth of compute, they won't.


Are you sure its a billion? Helps with estimating the training run


> have no idea whether the LLM understood what I’m asking

That's easy. The answer is it doesn't. It has no understanding of anything it does.

> if it’s able to do it

This is the hard part.


A lot of interns are overconfident though


Can I have some of those sorts of interns?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: