Michael Kuperstein via llvm-dev
2016-Dec-14 17:55 UTC
[llvm-dev] Enabling scalarized conditional stores in the loop vectorizer
I haven't verified that what Matt described is what actually happens, but assuming it is - that is a known issue in the x86 cost model.

Vectorizing interleaved memory accesses on x86 was, until recently, disabled by default. It has been enabled since r284779, but the cost model is very conservative and basically assumes we're going to scalarize interleaved ops.

I believe Farhana is working on improving that.

Michael

On Wed, Dec 14, 2016 at 8:44 AM, Das, Dibyendu <Dibyendu.Das at amd.com> wrote:
> Hi Matt-
>
> Yeah, I used a pretty recent LLVM (post 3.9) on x86-64 (both AMD and Intel).
>
> -dibyendu
>
> From: Matthew Simpson [mailto:mssimpso at codeaurora.org]
> Sent: Wednesday, December 14, 2016 10:03 PM
> To: Das, Dibyendu <Dibyendu.Das at amd.com>
> Cc: Michael Kuperstein <mkuper at google.com>; llvm-dev at lists.llvm.org
> Subject: Re: [llvm-dev] Enabling scalarized conditional stores in the loop vectorizer
>
> Hi Dibyendu,
>
> Are you using a recent compiler? What architecture are you targeting? The target will determine whether the vectorizer thinks vectorization is profitable without having to manually force the vector width.
>
> For example, top-of-trunk vectorizes your snippet with "clang -O2 -mllvm -enable-cond-stores-vec" and "--target=aarch64-unknown-linux-gnu". However, with "--target=x86_64-unknown-linux-gnu" the vectorizer doesn't find the snippet profitable to vectorize.
>
> This is probably due to the interleaved load in the loop. When targeting AArch64, the cost model reports the interleaved load as inexpensive (AArch64 has dedicated instructions for interleaved memory accesses), but when targeting x86 it doesn't. You can take a look at the costs with "-mllvm -debug-only=loop-vectorize".
>
> Hope that helps.
>
> -- Matt
>
> On Wed, Dec 14, 2016 at 12:59 AM, Das, Dibyendu via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> Hi Michael-
>>
>> Since you bring up libquantum performance, can you let me know what the IR will look like for this small code snippet (libquantum-like) with -enable-cond-stores-vec? I ask because I don't see vectorization kicking in unless -force-vector-width=<> is specified. Let me know if I am missing something.
>>
>> -Thx
>>
>> struct nodeTy
>> {
>>   unsigned int c1;
>>   unsigned int c2;
>>   unsigned int state;
>> };
>>
>> struct quantum_reg
>> {
>>   struct nodeTy node[32];
>>   unsigned int size;
>> };
>>
>> void
>> quantum_toffoli(int control1, int control2, int target, struct quantum_reg *reg, int n)
>> {
>>   int i;
>>
>>   int N = reg->size;
>>   for (i = 0; i < N; i++)
>>   {
>>     if (reg->node[i].state & ((unsigned int)1 << control1))
>>       if (reg->node[i].state & ((unsigned int)1 << control2))
>>         reg->node[i].state ^= ((unsigned int)1 << target);
>>   }
>> }
>>
>> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Matthew Simpson via llvm-dev
>> Sent: Tuesday, December 13, 2016 7:12 PM
>> To: Michael Kuperstein <mkuper at google.com>
>> Cc: llvm-dev <llvm-dev at lists.llvm.org>
>> Subject: Re: [llvm-dev] Enabling scalarized conditional stores in the loop vectorizer
>>
>> Hi Michael,
>>
>> Thanks for testing this on your benchmarks and target. I think the results will help guide the direction we go. I tested the feature with spec2k/2k6 on AArch64/Kryo and saw minor performance swings, aside from a large (30%) improvement in spec2k6/libquantum. The primary loop in that benchmark has a conditional store, so I expected it to benefit.
>>
>> Regarding the cost model, I think the vectorizer's modeling of the conditional stores is good. We could potentially improve it by using profile information if available. But I'm not sure of the quality of the individual TTI implementations other than AArch64. I assume they are adequate.
>>
>> Since the conditional stores remain scalar in the vector loop, their cost is essentially the same as it is in the scalar loop (aside from scalarization overhead, which we account for). So when we compare the cost of the scalar and vector loops when deciding to vectorize, we're basically comparing the cost of everything else.
>>
>> -- Matt
>>
>> On Mon, Dec 12, 2016 at 7:03 PM, Michael Kuperstein via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>> Conceptually speaking, I think we really ought to enable this.
>>>
>>> Practically, I'm going to test it on our benchmarks (on x86) and see if we have any regressions - this seems like a fairly major change.
>>>
>>> Re targets - let's see where we stand w.r.t. regressions first. What kind of performance testing have you already run on this? Do you know of specific targets where the cost model is known to be good enough, so it's clearly beneficial?
>>>
>>> (+Arnold, who probably knows why this is disabled by default. :-) )
>>>
>>> Thanks,
>>> Michael
>>>
>>> On Mon, Dec 12, 2016 at 2:52 PM, Matthew Simpson <mssimpso at codeaurora.org> wrote:
>>>> Hi,
>>>>
>>>> I'd like to enable the scalarized conditional stores feature in the loop vectorizer (-enable-cond-stores-vec=true). The feature allows us to vectorize loops containing conditional stores that must be scalarized and predicated in the vectorized loop.
>>>>
>>>> Note that this flag does not affect the decision to generate masked vector stores. That is a separate feature and is guarded by a TTI hook. Currently, we give up on loops containing conditional stores that must be scalarized (i.e., conditional stores that can't be represented with masked vector stores). If the feature is enabled, we attempt to vectorize those loops if profitable, while scalarizing and predicating the conditional stores.
>>>>
>>>> I think these stores are fairly well modeled in the cost model at this point using the static estimates. They're modeled similarly to the way we model other non-store conditional instructions that must be scalarized and predicated (e.g., instructions that may divide by zero); however, only the conditional stores are currently disabled by default.
>>>>
>>>> I'd appreciate any opinions on how/if we can enable this feature. For example, can we enable it for all targets, or would a target-by-target opt-in mechanism using a TTI hook be preferable? If you'd like to test the feature on your target, please report any significant regressions and improvements you find.
>>>>
>>>> Thanks!
>>>>
>>>> -- Matt

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Matthew Simpson via llvm-dev
2016-Dec-15 16:09 UTC
[llvm-dev] Enabling scalarized conditional stores in the loop vectorizer
If there are no objections, I'll submit a patch for review that sets the default value of "-enable-cond-stores-vec" to "true". Thanks!

-- Matt

On Wed, Dec 14, 2016 at 12:55 PM, Michael Kuperstein via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [snip - quoted text from earlier in the thread]
Aleen, Farhana A via llvm-dev
2016-Dec-15 16:23 UTC
[llvm-dev] Enabling scalarized conditional stores in the loop vectorizer
Thanks Michael and Dibyendu for doing the experimentation and bringing this to our attention. It might be the case that what Matt described here is what's happening; I will take a look at it.

Farhana

From: Michael Kuperstein [mailto:mkuper at google.com]
Sent: Wednesday, December 14, 2016 9:56 AM
To: Das, Dibyendu <Dibyendu.Das at amd.com>; Aleen, Farhana A <farhana.a.aleen at intel.com>
Cc: Matthew Simpson <mssimpso at codeaurora.org>; llvm-dev at lists.llvm.org
Subject: Re: [llvm-dev] Enabling scalarized conditional stores in the loop vectorizer

[snip - quoted text from earlier in the thread]
Michael Kuperstein via llvm-dev
2016-Dec-15 16:49 UTC
[llvm-dev] Enabling scalarized conditional stores in the loop vectorizer
I haven't done any experimentation; it's all Matt. :-)

On Dec 15, 2016 08:23, "Aleen, Farhana A" <farhana.a.aleen at intel.com> wrote:
> [snip - quoted text from earlier in the thread]
Michael Kuperstein via llvm-dev
2016-Dec-15 16:49 UTC
[llvm-dev] Enabling scalarized conditional stores in the loop vectorizer
SGTM. On Dec 15, 2016 08:09, "Matthew Simpson" <mssimpso at codeaurora.org> wrote:> If there are no objections, I'll submit a patch for review that sets the > default value of "-enable-cond-stores-vec" to "true". Thanks! > > -- Matt > > > On Wed, Dec 14, 2016 at 12:55 PM, Michael Kuperstein via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > >> I haven't verified what Matt described is what actually happens, but >> assuming it is - that is a known issue in the x86 cost model. >> >> Vectorizing interleaved memory accesses on x86 was, until recently, >> disabled by default. It's been enabled since r284779, but the cost model is >> very conservative, and basically assumes we're going to scalarize >> interleaved ops. >> >> I believe Farhana is working on improving that. >> >> Michael >> >> >> On Wed, Dec 14, 2016 at 8:44 AM, Das, Dibyendu <Dibyendu.Das at amd.com> >> wrote: >> >>> Hi Matt- >>> >>> >>> >>> Yeah I used a pretty recent llvm (post 3.9) on an x86-64 ( both AMD and >>> Intel ). >>> >>> >>> >>> -dibyendu >>> >>> >>> >>> *From:* Matthew Simpson [mailto:mssimpso at codeaurora.org] >>> *Sent:* Wednesday, December 14, 2016 10:03 PM >>> *To:* Das, Dibyendu <Dibyendu.Das at amd.com> >>> *Cc:* Michael Kuperstein <mkuper at google.com>; llvm-dev at lists.llvm.org >>> >>> *Subject:* Re: [llvm-dev] Enabling scalarized conditional stores in the >>> loop vectorizer >>> >>> >>> >>> Hi Dibyendu, >>> >>> >>> >>> Are you using a recent compiler? What architecture are you targeting? >>> The target will determine whether the vectorizer thinks vectorization is >>> profitable without having to manually force the vector width. >>> >>> >>> >>> For example, top-of-trunk vectorizes your snippet with "clang -O2 -mllvm >>> -enable-cond-stores-vec" and "--target=aarch64-unknown-linux-gnu". >>> However, with "--target=x86_64-unknown-linux-gnu" the vectorizer >>> doesn't find the snippet profitable to vectorize. >>> >>> >>> >>> This is probably due to the interleaved load in the loop. 
When targeting >>> AArch64, the cost model reports the interleaved load as inexpensive >>> (AArch64 has dedicated instructions for interleaved memory accesses), but >>> when targeting X86 it doesn't. You can take a look at the costs with >>> "-mllvm -debug-only=loop-vectorize" >>> >>> >>> >>> Hope that helps. >>> >>> >>> >>> -- Matt >>> >>> >>> >>> On Wed, Dec 14, 2016 at 12:59 AM, Das, Dibyendu via llvm-dev < >>> llvm-dev at lists.llvm.org> wrote: >>> >>> Hi Michael- >>> >>> >>> >>> Since you bring up libquantum performance can you let me know what the >>> IR will look like for this small code snippet (libquantum-like) with >>> –enable-cond-stores-vec ? I ask because I don’t see vectorization kicking >>> in unless -force-vector-width=<> is specified. Let me know if I am missing >>> something. >>> >>> >>> >>> -Thx >>> >>> >>> >>> struct nodeTy >>> >>> { >>> >>> unsigned int c1; >>> >>> unsigned int c2; >>> >>> unsigned int state; >>> >>> }; >>> >>> >>> >>> struct quantum_reg >>> >>> { >>> >>> struct nodeTy node[32]; >>> >>> unsigned int size; >>> >>> }; >>> >>> >>> >>> void >>> >>> quantum_toffoli(int control1, int control2, int target, struct >>> quantum_reg *reg, int n) >>> >>> { >>> >>> int i; >>> >>> >>> >>> int N = reg->size; >>> >>> for(i=0; i < N; i++) >>> >>> { >>> >>> if(reg->node[i].state & ((unsigned int)1 << control1)) >>> >>> if(reg->node[i].state & ((unsigned int)1 << control2)) >>> >>> reg->node[i].state ^= ((unsigned int)1 << target); >>> >>> } >>> >>> } >>> >>> *From:* llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] *On Behalf Of >>> *Matthew Simpson via llvm-dev >>> *Sent:* Tuesday, December 13, 2016 7:12 PM >>> *To:* Michael Kuperstein <mkuper at google.com> >>> *Cc:* llvm-dev <llvm-dev at lists.llvm.org> >>> *Subject:* Re: [llvm-dev] Enabling scalarized conditional stores in the >>> loop vectorizer >>> >>> >>> >>> Hi Michael, >>> >>> >>> >>> Thanks for testing this on your benchmarks and target. 
>>> I think the results will help guide the direction we go. I tested the
>>> feature with spec2k/2k6 on AArch64/Kryo and saw minor performance
>>> swings, aside from a large (30%) improvement in spec2k6/libquantum.
>>> The primary loop in that benchmark has a conditional store, so I
>>> expected it to benefit.
>>>
>>> Regarding the cost model, I think the vectorizer's modeling of the
>>> conditional stores is good. We could potentially improve it by using
>>> profile information if available. But I'm not sure of the quality of
>>> the individual TTI implementations other than AArch64. I assume they
>>> are adequate.
>>>
>>> Since the conditional stores remain scalar in the vector loop, their
>>> cost is essentially the same as it is in the scalar loop (aside from
>>> scalarization overhead, which we account for). So when we compare the
>>> cost of the scalar and vector loops when deciding to vectorize, we're
>>> basically comparing the cost of everything else.
>>>
>>> -- Matt
>>>
>>> On Mon, Dec 12, 2016 at 7:03 PM, Michael Kuperstein via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>> Conceptually speaking, I think we really ought to enable this.
>>>
>>> Practically, I'm going to test it on our benchmarks (on x86), and see
>>> if we have any regressions - this seems like a fairly major change.
>>>
>>> Re targets - let's see where we stand w.r.t. regressions first. What
>>> kind of performance testing have you already run on this? Do you know
>>> of specific targets where the cost model is known to be good enough,
>>> so it's clearly beneficial?
>>>
>>> (+Arnold, who probably knows why this is disabled by default. :-) )
>>>
>>> Thanks,
>>> Michael
>>>
>>> On Mon, Dec 12, 2016 at 2:52 PM, Matthew Simpson <
>>> mssimpso at codeaurora.org> wrote:
>>>
>>> Hi,
>>>
>>> I'd like to enable the scalarized conditional stores feature in the
>>> loop vectorizer (-enable-cond-stores-vec=true).
>>> The feature allows us to vectorize loops containing conditional stores
>>> that must be scalarized and predicated in the vectorized loop.
>>>
>>> Note that this flag does not affect the decision to generate masked
>>> vector stores. That is a separate feature and is guarded by a TTI
>>> hook. Currently, we give up on loops containing conditional stores
>>> that must be scalarized (i.e., conditional stores that can't be
>>> represented with masked vector stores). If the feature is enabled, we
>>> attempt to vectorize those loops if profitable, while scalarizing and
>>> predicating the conditional stores.
>>>
>>> I think these stores are fairly well modeled in the cost model at this
>>> point using the static estimates. They're modeled similarly to the way
>>> we model other non-store conditional instructions that must be
>>> scalarized and predicated (e.g., instructions that may divide by
>>> zero); however, only the conditional stores are currently disabled by
>>> default.
>>>
>>> I'd appreciate any opinions on how/if we can enable this feature. For
>>> example, can we enable it for all targets, or would a target-by-target
>>> opt-in mechanism using a TTI hook be preferable? If you'd like to test
>>> the feature on your target, please report any significant regressions
>>> and improvements you find.
>>>
>>> Thanks!
>>> -- Matt
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev