thr3ads.net - search: "vaddd"

Displaying 3 results from an estimated 3 matches for "vaddd".

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 23

...0, %xmm0, %xmm0 is being unnecessarily carried across the loop. It's then redundantly added twice in the reduction after the loop despite it being 0. This happens because we basically tricked the backend into generating a 256-bit vpmaddwd concatenated with a 256-bit zero vector going into a 512-bit vaddd before type legalization. The 512-bit concat and vpaddd get split during type legalization, and the high half of the add gets constant folded away. I'm guessing we probably finished with 4 vpxors before the loop but MachineCSE (or some other pass?) combined two of them when it figured out the lo...
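
As a rough illustration of the shape described in this snippet, here is a scalar C model, assuming the usual vpmaddwd semantics (adjacent signed 16-bit pairs are multiplied and each pair of products is summed into one 32-bit lane). The function names `pmaddwd_model` and `madd_accumulate_512` are hypothetical and do not come from the thread; the sketch only shows why the upper half of the wide accumulator only ever has zero added to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of vpmaddwd over n_in 16-bit inputs (n_in must be even):
   out[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1], computed in 32 bits.
   The single wrap-around case of the real instruction (all four inputs
   equal to -32768) is not modelled here. */
static void pmaddwd_model(const int16_t *a, const int16_t *b,
                          int32_t *out, size_t n_in)
{
    assert(n_in % 2 == 0);
    for (size_t i = 0; i < n_in / 2; ++i) {
        out[i] = (int32_t)a[2 * i] * (int32_t)b[2 * i]
               + (int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1];
    }
}

/* One loop iteration in the pre-legalization shape the snippet describes:
   eight 32-bit lanes from the 256-bit multiply-add are concatenated with
   eight zero lanes, then added into a 16-lane (512-bit) accumulator. */
static void madd_accumulate_512(const int16_t a[16], const int16_t b[16],
                                int32_t acc[16])
{
    int32_t widened[16];
    memset(widened, 0, sizeof widened);   /* lanes 8..15 stay zero */
    pmaddwd_model(a, b, widened, 16);     /* fills lanes 0..7 only */

    for (size_t i = 0; i < 16; ++i) {
        acc[i] += widened[i];             /* lanes 8..15: acc += 0 */
    }
}
```

In this model lanes 8 to 15 of `acc` never change, which corresponds to the high half of the add that the snippet says gets constant folded away once the 512-bit operation is split.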

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 24

> ...eing unnecessarily carried across the loop. It's then redundantly added twice in the reduction after the loop despite it being 0. This happens because we basically tricked the backend into generating a 256-bit vpmaddwd concatenated with a 256-bit zero vector going into a 512-bit vaddd before type legalization. The 512-bit concat and vpaddd get split during type legalization, and the high half of the add gets constant folded away. I'm guessing we probably finished with 4 vpxors before the loop but MachineCSE (or some other pass?) combined two of them when i...

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 23

Hello all, This code https://godbolt.org/g/tTyxpf is a dot product reduction loop multiplying sign-extended 16-bit values to produce a 32-bit accumulated result. The x86 backend is currently not able to optimize it as well as gcc and icc. The IR we are getting from the loop vectorizer has several v8i32 adds and muls inside the loop. These are fed by v8i16 loads and sexts from v8i16 to v8i32. The
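
For readers without access to the linked example, a minimal C sketch of the kind of loop described might look like the following. This is an assumption about its general shape, not the code behind the godbolt link; the function name `dot_product_i16` and its signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product reduction: each 16-bit input is sign-extended to 32 bits
   before the multiply, and the products are summed in a 32-bit
   accumulator. Overflow of the accumulator is not handled in this sketch. */
int32_t dot_product_i16(const int16_t *a, const int16_t *b, size_t n)
{
    assert(a != NULL && b != NULL);

    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return sum;
}
```

A loop of this form is what a vectorizer would turn into the v8i16 loads, sign extensions to v8i32, and v8i32 multiplies and adds mentioned above.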
