Displaying 3 results from an estimated 3 matches for "vaddd".
2018 Jul 23
4
[LoopVectorizer] Improving the performance of dot product reduction loop
...0, %xmm0, %xmm0 is being unnecessarily
carried across the loop. It's then redundantly added twice in the reduction
after the loop despite it being 0. This happens because we basically
tricked the backend into generating a 256-bit vpmaddwd concatenated with a
256-bit zero vector going into a 512-bit vaddd before type legalization.
The 512-bit concat and vpaddd get split during type legalization, and the
high half of the add gets constant folded away. I'm guessing we probably
finished with 4 vpxors before the loop but MachineCSE (or some other pass?)
combined two of them when it figured out the lo...
2018 Jul 24
4
[LoopVectorizer] Improving the performance of dot product reduction loop
...eing unnecessarily
> carried across the loop. It's then redundantly added twice in the reduction
> after the loop despite it being 0. This happens because we basically
> tricked the backend into generating a 256-bit vpmaddwd concatenated with a
> 256-bit zero vector going into a 512-bit vaddd before type legalization.
> The 512-bit concat and vpaddd get split during type legalization, and the
> high half of the add gets constant folded away. I'm guessing we probably
> finished with 4 vpxors before the loop but MachineCSE (or some other pass?)
> combined two of them when i...
2018 Jul 23
3
[LoopVectorizer] Improving the performance of dot product reduction loop
Hello all,
This code https://godbolt.org/g/tTyxpf is a dot product reduction loop
multiplying sign-extended 16-bit values to produce a 32-bit accumulated
result. The x86 backend is currently not able to optimize it as well as gcc
and icc. The IR we are getting from the loop vectorizer has several v8i32
adds and muls inside the loop. These are fed by v8i16 loads and sexts from
v8i16 to v8i32. The
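
For readers without the linked code at hand, here is a minimal sketch of the kind of loop being described. It is an assumption about the general shape only, not the code behind the godbolt link; the names dot_product and pmaddwd_lane are hypothetical. The second function is a scalar model of one 32-bit lane of pmaddwd/vpmaddwd, which multiplies two adjacent pairs of signed 16-bit elements and adds the two 32-bit products. That pairing is why this reduction can map onto that instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: dot product of two arrays of signed 16-bit values,
 * each element sign-extended to 32 bits before the multiply, accumulated
 * into a 32-bit sum. */
int32_t dot_product(const int16_t *a, const int16_t *b, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return sum;
}

/* Scalar model of a single 32-bit result lane of pmaddwd: two 16x16->32
 * signed multiplies followed by one add. The add is done in unsigned
 * arithmetic so that the one wrapping case (all four inputs equal to
 * -32768) does not invoke signed-overflow undefined behavior in C. */
int32_t pmaddwd_lane(int16_t a0, int16_t b0, int16_t a1, int16_t b1) {
    uint32_t p0 = (uint32_t)((int32_t)a0 * (int32_t)b0);
    uint32_t p1 = (uint32_t)((int32_t)a1 * (int32_t)b1);
    return (int32_t)(p0 + p1);
}
```

Summing pmaddwd_lane over consecutive element pairs gives the same total as dot_product when no overflow occurs, so a vector of such lanes followed by a 32-bit vector add covers the multiply and the first level of the reduction.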