When I looked into the NASM documentation, MMX (and XMM) instructions operate only on 64-bit registers and use integers (8-bit, 16-bit, 32-bit). OTOH SSE (SSE2) uses 128-bit registers with floating-point values (32-bit, 64-bit). So the maximum vector dimension is 8. This can speed up an 8-point butterfly but hardly a 16-point butterfly, because matrix multiplication is O(n^2) while the algorithm used now is O(n log n). I think such a comparison on the 16-point butterfly would show the advantage of a specialized scalar algorithm over a conventional vector-oriented algorithm. Also, not all processors have MMX instructions, so there would have to be separate versions for other platforms, which can bring more chaos (unmaintainability) than speed.
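
To make the n^2 versus n log(n) argument concrete, here is a rough back-of-the-envelope count in Python. It is only a sketch under my own assumptions: the function names are made up for illustration, the vector width of 8 lanes is the figure from above, and the fast algorithm is modelled as exactly n*log2(n) scalar operations. Constant factors, loads, shuffles and the real operation counts of the butterflies are ignored, so it shows the trend (the vector advantage shrinks as n grows), not where the actual crossover lies.

```python
import math

def matrix_vector_ops(n, width):
    """Vector multiply-accumulate steps for an n x n matrix times an
    n-vector, assuming `width` lanes are processed per instruction."""
    return math.ceil(n * n / width)

def fast_scalar_ops(n):
    """Order-of-magnitude scalar count n * log2(n) for a fast
    butterfly-style algorithm (n must be a power of two)."""
    return n * (n.bit_length() - 1)

# Assumed vector width: 8 lanes, as discussed above.
for n in (8, 16, 32, 64):
    vec = matrix_vector_ops(n, 8)
    fast = fast_scalar_ops(n)
    print(f"n={n:2d}  matrix (8-wide vector): {vec:4d}  fast scalar: {fast:4d}")
```

The vector count grows by a factor of 4 each time n doubles, while the scalar count grows only slightly faster than 2, so any fixed vector width is eventually overtaken.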