Couldn't this be written in pure C, so that compilers can take advantage of auto-vectorization and produce equally optimized code?
I have been discouraged from writing SIMD code in hand-written assembly, because netizens say you can barely outsmart compiler-optimized code nowadays.
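To make the question concrete, here is a minimal sketch of the kind of pure-C loop I have in mind. The function `mul_add` and its multiply-add kernel are hypothetical and are not taken from the code under discussion; the only assumption is a C99 compiler, for `restrict`.

```c
#include <stddef.h>

/* Hypothetical kernel: out[i] = a[i] * b[i] + c[i].
   `restrict` promises the compiler that the arrays do not overlap,
   which removes a common obstacle to auto-vectorization. */
void mul_add(float *restrict out, const float *restrict a,
             const float *restrict b, const float *restrict c, size_t n)
{
    /* Simple counted loop, no early exits, no calls, unit stride. */
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i] + c[i];
}
```

Building with something like `gcc -O3 -march=native -fopt-info-vec` (or `clang -O3 -Rpass=loop-vectorize`) should report whether the loop was vectorized. Whether the result actually matches hand-written SIMD depends on the loop and the compiler, so it would still need to be measured.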