Multimodal is a farce. It still can’t see anything; it just generates a list of descriptors that the LLM part can LLM about.
Humans got by for hundreds of thousands of years without language. When you see a duck you don’t need to know the word duck to know about the thing you’re seeing. That’s not true for “multimodal” models.