Hi,

I've been looking through the mailing list archives and I've seen that you have
rewritten a lot of functions with MMX to make them faster. I'm currently trying
to optimize some code, but I'm having some problems, because I work with 16 bits
per component rather than 8 like Theora. I know this is off topic, but I'm
posting to ask for a little help.

I've got this function that calculates the SAD:

    si32 sad_4x4 (macroblock_t * mb, ui8 x, ui8 y)
    {
        ui8 i, j;
        si32 corner_x, corner_y, sad;

        corner_x = x << 2;
        corner_y = y << 2;
        sad = 0;
        for (i = 0; i < 4; i++)
            for (j = 0; j < 4; j++)
                sad += abs (mb->orig_mb[corner_x + i][corner_y + j]
                            - mb->pred_mb[corner_x + i][corner_y + j]);
        return sad;
    }

where mb->orig_mb and mb->pred_mb are arrays of short int, not unsigned char. I
therefore cannot use psadbw, because it only works on 8-bit data. I've currently
rewritten the function this way: each row computes |orig - pred| by building a
+1/-1 per-word sign with pcmpgtw and two paddw, then folding the multiply and
the pairwise sum into pmaddwd:

    si32 sad_4x4 (macroblock_t * mb, ui8 x, ui8 y)
    {
        si32 corner_x = x << 2;
        si32 corner_y = y << 2;
        __m64 zeros, ones, orig, pred, diff, cmp, sign, sad;

        zeros = _mm_setzero_si64 ();
        ones = _mm_set1_pi16 (1);

        /* row 0 */
        orig = *((__m64 *) &mb->orig_mb[corner_x][corner_y]);
        pred = *((__m64 *) &mb->pred_mb[corner_x][corner_y]);
        diff = _m_psubw (orig, pred);
        cmp = _m_pcmpgtw (zeros, diff);   /* 0xFFFF where diff < 0 */
        sign = _m_paddw (ones, cmp);
        sign = _m_paddw (sign, cmp);      /* +1 or -1 per word */
        sad = _m_pmaddwd (diff, sign);    /* |diff|, summed in pairs */

        /* row 1 */
        orig = *((__m64 *) &mb->orig_mb[corner_x + 1][corner_y]);
        pred = *((__m64 *) &mb->pred_mb[corner_x + 1][corner_y]);
        diff = _m_psubw (orig, pred);
        cmp = _m_pcmpgtw (zeros, diff);
        sign = _m_paddw (ones, cmp);
        sign = _m_paddw (sign, cmp);
        cmp = _m_pmaddwd (diff, sign);
        sad = _m_paddd (sad, cmp);

        /* row 2 */
        orig = *((__m64 *) &mb->orig_mb[corner_x + 2][corner_y]);
        pred = *((__m64 *) &mb->pred_mb[corner_x + 2][corner_y]);
        diff = _m_psubw (orig, pred);
        cmp = _m_pcmpgtw (zeros, diff);
        sign = _m_paddw (ones, cmp);
        sign = _m_paddw (sign, cmp);
        cmp = _m_pmaddwd (diff, sign);
        sad = _m_paddd (sad, cmp);

        /* row 3 */
        orig = *((__m64 *) &mb->orig_mb[corner_x + 3][corner_y]);
        pred = *((__m64 *) &mb->pred_mb[corner_x + 3][corner_y]);
        diff = _m_psubw (orig, pred);
        cmp = _m_pcmpgtw (zeros, diff);
        sign = _m_paddw (ones, cmp);
        sign = _m_paddw (sign, cmp);
        cmp = _m_pmaddwd (diff, sign);
        sad = _m_paddd (sad, cmp);

        /* fold the two 32-bit partial sums */
        return _m_to_int (sad) + _m_to_int (_m_psrlqi (sad, 32));
    }

but it isn't any faster. Does anyone have a hint to make it faster?

I've got another question: why don't you call _mm_empty when you use MMX
intrinsics?

Thank you, and sorry again for the off-topic post.

--
Ottavio Campana
Telecommunication Engineer
Lab. Immagini
Dept. of Information Engineering
University of Padova
Via Gradenigo 6/B
35131 Padova
Italy