I think the matrix multiplication API is only part of the issue. CUDA tooling has better ergonomics, is easier to set up, and is treated as a first-class citizen in tools like TensorFlow and PyTorch.

So, while I can't speak to the hardware differences in detail, the developer experience is firmly on NVIDIA's side, and AMD now has a moat to cross to catch up.