Nils Pipenbrinck wrote:
> Here is the patch with my changes.
>
> Most work went into the decoder. I just changed on the encoder if
> something was necessary to build the library.

I notice your patch does not use the intrinsics port you said you had
done, except for a few small bits in mmxstate.c (so nothing else will
support x86-64). Did you test the speed of the intrinsics version
against your hand-rolled version? What were the results?

I also notice you made lots of minor changes, which will make it more
difficult to keep the code in sync with the gcc version. I'd like to
keep things as consistent as possible. E.g., what's the rationale for
expanding out all of the macros for the IDCT, other than, "it was easier
that way"? Does MSVC really not unroll loops with inline asm in them for
you? Many of the changes do seem like improvements, but there's always
the underlying question, "Is it actually any faster?" I'm happy to
incorporate improvements into the gcc asm if there's actual
justification for them.

I'm also confused by your bit-twiddling average:

    average = (a & b) + (((a ^ b) & 0xfe) >> 1);

What on earth is the purpose of the AND if you're just going to shift
off the lower bit anyway?
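
To make the question concrete, here is a minimal C sketch of the two cases it turns on. The function names are hypothetical and none of this is taken from the patch. The scalar functions show that, for single bytes, the masked and unmasked forms give the same result. The packed functions assume four bytes per 32-bit word; that layout is only an assumption about where such a mask might have come from, and in that setting the mask does matter because it stops a byte's low bit from being shifted into its neighbor.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar form as quoted above: a and b are single bytes. */
static uint8_t avg_masked(uint8_t a, uint8_t b) {
    return (uint8_t)((a & b) + (((a ^ b) & 0xfe) >> 1));
}

/* Same expression without the AND; the shift drops bit 0 anyway. */
static uint8_t avg_plain(uint8_t a, uint8_t b) {
    return (uint8_t)((a & b) + ((a ^ b) >> 1));
}

/* Counts byte pairs where the two scalar forms disagree with each
 * other or with the rounded-down average (a + b) / 2. */
static int count_scalar_mismatches(void) {
    int bad = 0;
    for (int a = 0; a < 256; a++) {
        for (int b = 0; b < 256; b++) {
            uint8_t m = avg_masked((uint8_t)a, (uint8_t)b);
            uint8_t p = avg_plain((uint8_t)a, (uint8_t)b);
            if (m != p || p != (a + b) / 2) bad++;
        }
    }
    return bad;
}

/* Hypothetical packed variant: four independent bytes in one word.
 * The mask clears bit 0 of every byte before the shift. */
static uint32_t avg4_masked(uint32_t a, uint32_t b) {
    return (a & b) + (((a ^ b) & 0xfefefefeu) >> 1);
}

/* Packed variant without the mask: bit 0 of each byte leaks into
 * bit 7 of the byte below it. */
static uint32_t avg4_unmasked(uint32_t a, uint32_t b) {
    return (a & b) + ((a ^ b) >> 1);
}
```

With a = 0x00000100 and b = 0, every per-byte average is 0. avg4_masked returns 0, while avg4_unmasked returns 0x80 because bit 0 of the second byte lands in the first.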