I remain unconvinced that the whole LLM/"Attention Is All You Need" industry is even barking up the right tree to build anything usefully close to "AGI".
The idea that any situation or sensory input can be broken down into a sequence of tokens, and that action choice can be characterized by predicting a subsequent sequence of tokens in the same space, may well bear fruit.
But I think a lot of people also buy into the idea that text and image data from the web, and from historical chats, is the right (or only) way to generate the required training set, and that's a dangerous trap to fall into.
An LLM can answer specialized PhD-level questions correctly, yet cannot perform tasks that an average 10-year-old could. I don't consider that generally intelligent.