On Thu, 15 Jan 2004, Segher Boessenkool wrote:
> > Please note that dot products of simple vector floats are usually
> > faster in the scalar units. The add across and transfer to scalar is
> > just too expensive.
>
> Or do four at once, with some shuffling (which is basically free);
> almost the same code as a 4x4 matrix/vector multiply.

Yes, that is a better way to do it. It is essentially what you'd do to
make longer vector dot products such as your 40-160 sample dots work
quickly. Do them as 4 parallel partial vector dots and then sum across the
vector containing the four results. On MacOS X, there is also a hand-tuned
dot product in vecLib/vDSP.h, dotpr(), if you'd rather just call that.
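
For what it's worth, a rough sketch of that strategy in C with AltiVec
intrinsics might look like the following (the function name is made up for
illustration, and it assumes the length is a multiple of 4 and that both
arrays are 16-byte aligned):

#include <altivec.h>

/* Sketch only: one long dot product done as four partial dot products
   that are only summed across at the very end. */
float dot_across(const float *a, const float *b, int n)
{
    vector float acc = (vector float) vec_splat_u32(0);
    int i;
    for (i = 0; i < n; i += 4) {
        vector float va = vec_ld(0, (float *) (a + i));
        vector float vb = vec_ld(0, (float *) (b + i));
        acc = vec_madd(va, vb, acc);   /* four partial sums in parallel */
    }
    /* sum across the vector holding the four partial results */
    acc = vec_add(acc, vec_sld(acc, acc, 8));
    acc = vec_add(acc, vec_sld(acc, acc, 4));
    {
        float result;
        vec_ste(vec_splat(acc, 0), 0, &result);
        return result;
    }
}

The expensive sum-across and transfer to scalar happen once per call rather
than once per element, which is what makes the vector version pay off for
the longer dot products.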

Personally, I don't think much of PNI. The complex arithmetic stuff they
added sets you up for a lot of permute overhead that is inefficient --
especially on a processor that is already weak on permute. In my opinion,
it's a big ISA Trojan horse. The better and more efficient way to do that
is to store the real and imaginary parts in separate arrays and operate on
more than one complex pair at a time in the spirit of SIMD. You can
already do that with SSE or SSE2. Too bad they didn't instead use the
opportunity to beef up integer multiplication, add better support for
signed bytes and unsigned shorts, or add a decent permute unit. My guess
is that squeezing these instructions in involved almost no change to the
silicon, since you are just twiddling the sign bit, so they viewed it as
essentially free.
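
To make the split-format alternative concrete, here is a rough sketch in C
with plain SSE intrinsics (the function name and signature are made up for
illustration; it assumes the length is a multiple of 4 and 16-byte aligned
arrays). It multiplies n complex pairs, four at a time, with no shuffles at
all:

#include <xmmintrin.h>

/* Sketch only: complex multiply with real and imaginary parts kept in
   separate arrays, so four pairs are processed per iteration using only
   ordinary SSE arithmetic and no permutes. */
void cmul_split(const float *ar, const float *ai,
                const float *br, const float *bi,
                float *cr, float *ci, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        __m128 xr = _mm_load_ps(ar + i);
        __m128 xi = _mm_load_ps(ai + i);
        __m128 yr = _mm_load_ps(br + i);
        __m128 yi = _mm_load_ps(bi + i);
        /* (xr + i*xi)(yr + i*yi) = (xr*yr - xi*yi) + i(xr*yi + xi*yr) */
        _mm_store_ps(cr + i, _mm_sub_ps(_mm_mul_ps(xr, yr), _mm_mul_ps(xi, yi)));
        _mm_store_ps(ci + i, _mm_add_ps(_mm_mul_ps(xr, yi), _mm_mul_ps(xi, yr)));
    }
}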
Ian
---------------------------------------------------
Ian Ollmann, Ph.D. iano@cco.caltech.edu
---------------------------------------------------