Multimodal is a farce. It still can’t see anything; it just generates a list of descriptors that the LLM part can LLM about.
Humans got by for hundreds of thousands of years without language. When you see a duck you don’t need to know the word duck to know about the thing you’re seeing. That’s not true for “multimodal” models.