thr3ads.net - search: "vpmaddwd"

Displaying 5 results from an estimated 5 matches for "vpmaddwd".

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 23

...y not able to optimize it as well as gcc and icc. The IR we are getting from the loop vectorizer has several v8i32 adds and muls inside the loop. These are fed by v8i16 loads and sexts from v8i16 to v8i32. The x86 backend recognizes that these are addition reductions of multiplication so we use the vpmaddwd instruction which calculates 32-bit products from 16-bit inputs and does a horizontal add of adjacent pairs. A vpmaddwd given two v8i16 inputs will produce a v4i32 result. In the example code, because we are reducing the number of elements from 8->4 in the vpmaddwd step we are left with a width...
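The behaviour the snippet attributes to vpmaddwd can be sketched in plain C. This is only a scalar model for illustration, not code from the thread: the names `pmaddwd_model` and `dot_i16` are hypothetical, and the requirement that `n` be a multiple of 8 is an assumption made to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 128-bit vpmaddwd (hypothetical helper name):
 * eight 16-bit inputs per operand, four 32-bit results. */
void pmaddwd_model(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        /* Sign-extend each 16-bit lane to 32 bits before multiplying. */
        int32_t even = (int32_t)a[2 * i]     * (int32_t)b[2 * i];
        int32_t odd  = (int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1];
        /* Horizontal add of the adjacent pair. The unsigned add wraps,
         * which only matters when all four inputs are -32768. */
        out[i] = (int32_t)((uint32_t)even + (uint32_t)odd);
    }
}

/* Dot product built on the model (hypothetical helper name).
 * Assumes n is a multiple of 8. */
int32_t dot_i16(const int16_t *a, const int16_t *b, size_t n)
{
    assert(n % 8 == 0);
    int32_t acc[4] = {0, 0, 0, 0};          /* plays the role of a v4i32 accumulator */
    for (size_t i = 0; i < n; i += 8) {
        int32_t prod[4];
        pmaddwd_model(a + i, b + i, prod);  /* 8 elements in, 4 partial sums out */
        for (int k = 0; k < 4; k++)
            acc[k] += prod[k];
    }
    /* Final horizontal reduction of the four lanes. */
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Each call consumes eight elements from each input and yields four partial sums, which is the 8->4 narrowing the message describes.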

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 23

...gcc and icc. The IR we are getting from the loop vectorizer has several v8i32 adds and muls inside the loop. These are fed by v8i16 loads and sexts from v8i16 to v8i32. The x86 backend recognizes that these are addition reductions of multiplication so we use the vpmaddwd instruction which calculates 32-bit products from 16-bit inputs and does a horizontal add of adjacent pairs. A vpmaddwd given two v8i16 inputs will produce a v4i32 result. In the example code, because we are reducing the number of elements from 8->4 i...

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 23

...ze it as well as gcc and icc. The IR we are getting from the loop vectorizer has several v8i32 adds and muls inside the loop. These are fed by v8i16 loads and sexts from v8i16 to v8i32. The x86 backend recognizes that these are addition reductions of multiplication so we use the vpmaddwd instruction which calculates 32-bit products from 16-bit inputs and does a horizontal add of adjacent pairs. A vpmaddwd given two v8i16 inputs will produce a v4i32 result. That godbolt link seems wrong. It wasn't supposed to be clang IR. This should be right. ...

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 24

...gcc and icc. The IR we are getting from the loop vectorizer has several v8i32 adds and muls inside the loop. These are fed by v8i16 loads and sexts from v8i16 to v8i32. The x86 backend recognizes that these are addition reductions of multiplication so we use the vpmaddwd instruction which calculates 32-bit products from 16-bit inputs and does a horizontal add of adjacent pairs. A vpmaddwd given two v8i16 inputs will produce a v4i32 result. That godbolt link seems wrong. It wasn't supposed to be clang IR. This ...

[LoopVectorizer] Improving the performance of dot product reduction loop

2018 Jul 24

...e IR we are getting from the loop vectorizer has several v8i32 adds and muls inside the loop. These are fed by v8i16 loads and sexts from v8i16 to v8i32. The x86 backend recognizes that these are addition reductions of multiplication so we use the vpmaddwd instruction which calculates 32-bit products from 16-bit inputs and does a horizontal add of adjacent pairs. A vpmaddwd given two v8i16 inputs will produce a v4i32 result. In the example code, because we are reducing the n...
