Zaks, Ayal via llvm-dev
2016-Sep-01 23:26 UTC
[llvm-dev] enabling interleaved access loop vectorization
So turns out it is a full reproducer after all (choosing to vectorize on AVX), good.

> The details are in PR29025.

Interesting. (So we should carefully insert unconditional branches inside shuffle sequences, eh? ;-)

> But if we modify the program by adding "*out++ = 0" right after "*out++ = q;" (thus eliminating the pesky <12 x i8>), we get:

Indeed such padding is a known (programmer) optimization to effectively have power-of-2 strides and/or alignment.

> So, unfortunately, it turns out I don't have access to DENBench.

If you like we could test your patch to see how it (mis)behaves.

From: Michael Kuperstein [mailto:mkuper at google.com]
Sent: Thursday, August 18, 2016 03:57
To: Zaks, Ayal <ayal.zaks at intel.com>
Cc: Demikhovsky, Elena <elena.demikhovsky at intel.com>; Renato Golin <renato.golin at linaro.org>; Matthew Simpson <mssimpso at codeaurora.org>; Nema, Ashutosh <Ashutosh.Nema at amd.com>; Sanjay Patel <spatel at rotateright.com>; llvm-dev <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] enabling interleaved access loop vectorization

So, at least for this example, it looks like we actually want to vectorize with -enable-interleaved-mem-accesses, we just need the backend to generate good code for the vector types that produces, specifically, in this case, <12 x i8>. The details are in PR29025.

The upshot of this is that for the original program (with an outer loop around it):

$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx && time ~/llvm/temp/rgb2yik.exe
real 0m2.229s
user 0m2.224s

$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx -mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe
real 0m2.590s
user 0m2.584s

This indicates that we do have a slight cost modeling issue - the cost model is not quite conservative enough in case we really do use inserts and extracts. One thing we're probably not accounting for is a bunch of GPR spills - although I'm not sure *why* we end up spilling so much. So perhaps this should also be fixed in regalloc.

But if we modify the program by adding "*out++ = 0" right after "*out++ = q;" (thus eliminating the pesky <12 x i8>), we get:

$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx && time ~/llvm/temp/rgb2yik.exe
real 0m2.257s
user 0m2.256s

$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx -mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe
real 0m0.958s
user 0m0.956s

On Wed, Aug 17, 2016 at 2:56 PM, Michael Kuperstein <mkuper at google.com> wrote:

Thanks Ayal!

On Wed, Aug 17, 2016 at 2:14 PM, Zaks, Ayal <ayal.zaks at intel.com> wrote:

Hi Michael,

Don’t quite have a full reproducer for you yet. You’re welcome to try and see what’s happening in 32 bit mode when enabling interleaving for the following, based on “https://en.wikipedia.org/wiki/YIQ#From_RGB_to_YIQ”:

void rgb2yik (char * in, char * out, int N)
{
  int j;
  for (j = 0; j < N; ++j) {
    unsigned char r = *in++;
    unsigned char g = *in++;
    unsigned char b = *in++;
    unsigned char y = 0.299*r + 0.587*g + 0.114*b;
    signed char i = 0.596*r + -0.274*g + -0.321*b;
    signed char q = 0.211*r + -0.523*g + 0.312*b;
    *out++ = y;
    *out++ = (unsigned char)i;
    *out++ = (unsigned char)q;
  }
}

but you’d currently need to force it to vectorize to overcome its expected cost.

Ayal.
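[Editorial sketch] The padding change discussed in this message, adding "*out++ = 0" after the store of q, can be written out as below. This is only an illustration under assumptions: the function name rgb2yik_padded and the unsigned pointer types are not from the thread, and the arithmetic is copied from the reproducer unchanged. The point it shows is that each pixel now writes 4 output bytes instead of 3, so the store side has a power-of-2 stride.

```c
#include <assert.h>

/* Hypothetical padded variant of the thread's rgb2yik reproducer.
 * Reads 3 bytes per pixel and writes 4 bytes per pixel; the 4th byte
 * is padding, which gives the stores a power-of-2 stride. */
void rgb2yik_padded(const unsigned char *in, unsigned char *out, int n)
{
    assert(n >= 0);
    for (int j = 0; j < n; ++j) {
        unsigned char r = *in++;
        unsigned char g = *in++;
        unsigned char b = *in++;
        /* Same arithmetic as the reproducer in the thread. */
        unsigned char y = 0.299*r + 0.587*g + 0.114*b;
        signed char i = 0.596*r + -0.274*g + -0.321*b;
        signed char q = 0.211*r + -0.523*g + 0.312*b;
        *out++ = y;
        *out++ = (unsigned char)i;
        *out++ = (unsigned char)q;
        *out++ = 0;  /* padding byte: output stride becomes 4 */
    }
}
```

The caller has to allocate 4*n output bytes rather than 3*n, so this changes the data layout and is not a drop-in replacement for the original function.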
From: Michael Kuperstein [mailto:mkuper at google.com]
Sent: Wednesday, August 17, 2016 00:51
To: Zaks, Ayal <ayal.zaks at intel.com>; Demikhovsky, Elena <elena.demikhovsky at intel.com>
Cc: Renato Golin <renato.golin at linaro.org>; Matthew Simpson <mssimpso at codeaurora.org>; Nema, Ashutosh <Ashutosh.Nema at amd.com>; Sanjay Patel <spatel at rotateright.com>; llvm-dev <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] enabling interleaved access loop vectorization

Hi Ayal, Elena,

I'd really like to enable this by default.

As I wrote above, I didn't see any regressions in internal benchmarks, and there doesn't seem to be anything in SPEC2006 either. I do see a performance improvement in an internal benchmark (that is, a real workload).

Would you be able to provide an example that gets pessimized? I have no doubt you've seen regressions related to this, but the fact they exist doesn't help me analyze them as long as I can't see them. :-) I'd really rather look at regressions before making the change - and either try to make the necessary improvements to the cost model, or abandon this as unfeasible for now (pending Ashutosh's work).

If you can't, an alternative is to turn this on, and then, if regressions show up on anyone's radar (where we can actually get a reproducer), turn it off again and go back to analysis. But I'd strongly prefer to "prefetch" the problem.

Thanks,
Michael

On Wed, Aug 10, 2016 at 4:32 PM, Michael Kuperstein <mkuper at google.com> wrote:

So, unfortunately, it turns out I don't have access to DENBench.

Do you happen to have a reduced example that gets pessimized by this?
On Tue, Aug 9, 2016 at 11:25 AM, Michael Kuperstein <mkuper at google.com> wrote:

Thanks Ayal!

I'll take a look at DENBench.

As another data point - I tried enabling this on our internal benchmarks. I'm seeing one regression, and it seems to be a regression of the "good" kind - without interleaving we don't vectorize the innermost loop, and with interleaving we do. The vectorized loop is actually significantly faster when benchmarked in isolation, but in this specific instance, the static loop count is unknown, and the dynamic loop count happens to almost always be 1 - and this lives inside a hot outer loop.

That's something we ought to be handling through PGO (or, conceivably, outer loop vectorization :-) ).

Michael

On Mon, Aug 8, 2016 at 3:21 PM, Zaks, Ayal <ayal.zaks at intel.com> wrote:

> We also need to understand what to do with edge elements in the vector if their loading is not required. We, probably, should issue a masked load in this case.

The existing code solves such edge cases where the last element of an InterleaveGroup is absent by making sure the last iteration (and up to last VF iterations) are peeled and executed scalarly; see requiresScalarEpilogue.

> All regressions that we see are in 32-bit mode.

One place to find them, using the default BaseT::getInterleavedMemoryOpCost(), is DENBench’s RGB conversions.

Ayal.
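[Editorial sketch] The scalar-epilogue remark above can be illustrated with a loop whose interleave group has a gap at the end: each 3-byte group contributes only its first two members, so a wide read covering the final group would touch a byte the scalar loop never reads. The code below is a hand-written model in plain C, not what LLVM emits; the names sum_rg and VF are hypothetical, and the inner lane loop merely stands in for one vector operation.

```c
#include <assert.h>

enum { VF = 4 };  /* assumed vectorization factor, for illustration only */

/* Sums members 0 and 1 of each 3-byte group; member 2 is never read,
 * so the input buffer may legally end at byte 3*n - 2. */
long sum_rg(const unsigned char *in, int n)
{
    assert(n >= 0);
    long sum = 0;
    int j = 0;

    /* "Vector" body: each step models one wide read of 3*VF bytes.
     * The condition j + VF < n (strictly less) leaves at least one
     * iteration for the epilogue, so the wide read never covers the
     * missing last member of the final group. */
    for (; j + VF < n; j += VF) {
        const unsigned char *wide = in + 3 * j;
        for (int lane = 0; lane < VF; ++lane)
            sum += wide[3 * lane] + wide[3 * lane + 1];
    }

    /* Scalar epilogue: between 1 and VF iterations whenever n > 0. */
    for (; j < n; ++j)
        sum += in[3 * j] + in[3 * j + 1];

    return sum;
}
```

With n = 8 and VF = 4 the body runs once and the epilogue runs four iterations, which matches the "up to last VF iterations" wording in the message.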
From: Demikhovsky, Elena
Sent: Monday, August 08, 2016 00:09
To: Michael Kuperstein <mkuper at google.com>; Renato Golin <renato.golin at linaro.org>
Cc: Matthew Simpson <mssimpso at codeaurora.org>; Nema, Ashutosh <Ashutosh.Nema at amd.com>; Sanjay Patel <spatel at rotateright.com>; llvm-dev <llvm-dev at lists.llvm.org>; Zaks, Ayal <ayal.zaks at intel.com>
Subject: RE: [llvm-dev] enabling interleaved access loop vectorization

We checked the gathered data again. All regressions that we see are in 32-bit mode. The 64-bit mode looks good overall.

- Elena

From: Michael Kuperstein [mailto:mkuper at google.com]
Sent: Saturday, August 06, 2016 02:56
To: Renato Golin <renato.golin at linaro.org>
Cc: Demikhovsky, Elena <elena.demikhovsky at intel.com>; Matthew Simpson <mssimpso at codeaurora.org>; Nema, Ashutosh <Ashutosh.Nema at amd.com>; Sanjay Patel <spatel at rotateright.com>; llvm-dev <llvm-dev at lists.llvm.org>; Zaks, Ayal <ayal.zaks at intel.com>
Subject: Re: [llvm-dev] enabling interleaved access loop vectorization

On Fri, Aug 5, 2016 at 4:37 PM, Renato Golin <renato.golin at linaro.org> wrote:

> On 6 August 2016 at 00:18, Michael Kuperstein <mkuper at google.com> wrote:
> > I agree that we can get *more* improvement with better cost modeling, but
> > I'd expect to be able to get *some* improvement the way things are right
> > now.
>
> Elena said she saw "some" improvements. :)

I didn't mean "some improvements, some regressions", I meant "some of the improvement we'd expect from the full solution". :-)

> > That's why I'm curious about where we saw regressions - I'm wondering
> > whether there's really a significant cost modeling issue I'm missing, or
> > it's something that's easy to fix so that we can make forward progress,
> > while Ashutosh is working on the longer-term solution.
>
> Sounds like a task to try a few patterns and fiddle with the cost model.
>
> Arnold did a lot of those during the first months of the vectorizer,
> so it might be just a matter of finding the right heuristics, at least
> for the low hanging fruits.
>
> Of course, that'd also involve benchmarking everything else, to make
> sure the new heuristics doesn't introduce regressions on
> non-interleaved vectorisation.

I don't disagree with you.

All I'm saying is that before fiddling with the heuristics, it'd be good to understand what exactly breaks if we simply flip the flag. If the answer happens to be "nothing" - well, problem solved. Unfortunately, according to Elena, that's not the answer.

I'm going to play with it with our internal benchmarks, but it's my understanding that Elena/Ayal already have some idea of what the problems are.

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
Michael Kuperstein via llvm-dev
2016-Sep-01 23:47 UTC
[llvm-dev] enabling interleaved access loop vectorization
Yes, carefully inserting branches is the way to go!

Seriously though - you probably saw that I just committed a fix for PR29025 (r280418). For the reproducer you provided, we now have (without forcing vectorization, and without "padding" to have power-of-2 stride):

$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx && time ~/llvm/temp/rgb2yik.exe
real 0m2.290s
user 0m2.289s
sys 0m0.003s

$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx -mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe
real 0m1.095s
user 0m1.095s
sys 0m0.002s

Care to give it a spin internally?

Note that this is not a full solution - we still won't vectorize PR27619, and force-vectorizing it is still a bad idea. Getting that right will require more lowering improvements as well as cost model adjustments. But hopefully post-r280418 things should be good enough to avoid regressions for the cases we will vectorize.

If you still see regressions, more reproducers will be appreciated. :-)
If there are no more regressions, let me know, and I'll post a patch to enable interleaved access for x86.

Thanks,
Michael

On Thu, Sep 1, 2016 at 4:26 PM, Zaks, Ayal <ayal.zaks at intel.com> wrote:

> So turns out it is a full reproducer after all (choosing to vectorize on
> AVX), good.
>
> > The details are in PR29025.
>
> Interesting. (So we should carefully insert unconditional branches inside
> shuffle sequences, eh? ;-)
>
> > But if we modify the program by adding "*out++ = 0" right after "*out++
> > = q;" (thus eliminating the pesky <12 x i8>), we get:
>
> Indeed such padding is a known (programmer) optimization to effectively
> have power-of-2 strides and/or alignment.
>
> > So, unfortunately, it turns out I don't have access to DENBench.
>
> If you like we could test your patch to see how it (mis)behaves.
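[Editorial sketch] For readers wondering where the <12 x i8> type in this discussion comes from: with three interleaved byte members per pixel and, by assumption here, a vectorization factor of 4, one wide access covers 3 * 4 = 12 contiguous bytes, which then have to be shuffled apart into per-member vectors and back again. The plain-C model below shows only that data movement. The names deinterleave3, interleave3, VF and FACTOR are hypothetical, and nothing here reflects how the backend actually lowers the shuffles.

```c
#include <assert.h>

enum { VF = 4, FACTOR = 3 };  /* assumed: 4 pixels per step, 3 bytes per pixel */

/* Models the shuffles after one wide load of VF*FACTOR bytes:
 * split r,g,b,r,g,b,... into three separate VF-element vectors. */
void deinterleave3(const unsigned char wide[VF * FACTOR],
                   unsigned char r[VF], unsigned char g[VF],
                   unsigned char b[VF])
{
    for (int lane = 0; lane < VF; ++lane) {
        r[lane] = wide[FACTOR * lane + 0];
        g[lane] = wide[FACTOR * lane + 1];
        b[lane] = wide[FACTOR * lane + 2];
    }
}

/* Models the shuffles before one wide store: merge three VF-element
 * vectors back into a single run of VF*FACTOR contiguous bytes. */
void interleave3(const unsigned char r[VF], const unsigned char g[VF],
                 const unsigned char b[VF],
                 unsigned char wide[VF * FACTOR])
{
    for (int lane = 0; lane < VF; ++lane) {
        wide[FACTOR * lane + 0] = r[lane];
        wide[FACTOR * lane + 1] = g[lane];
        wide[FACTOR * lane + 2] = b[lane];
    }
}
```

With the padded variant mentioned in the quoted text, the store side would have FACTOR = 4 and the wide store would cover 16 bytes, a power-of-2 width.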
Zaks, Ayal via llvm-dev
2016-Sep-04 21:09 UTC
[llvm-dev] enabling interleaved access loop vectorization
> Seriously though - you probably saw that I just committed a fix for PR29025 (r280418). > Care to give it a spin internally?Sure; spinning with r280423 and the patch below (*) indeed takes care of the slowdowns observed in 32 bit mode for AVX ☺.> If you still see regressions, more reproducers will be appreciated. :-) > If there are no more regressions, let me know, and I'll post a patch to enable interleaved access for x86.Unfortunately, we’re still observing severe slowdowns in 32 bit mode for SSE with -march=slm for the same rgb conversion workloads. Seems like we’ll need a different reproducer for that, as rgb2yik.c below is left unvectorized when compiled to slm. Ayal. (*) used the following in anticipation of your patch(?), effectively equivalent to -enable-interleaved-mem-accesses: Index: lib/Target/X86/X86TargetTransformInfo.cpp ==================================================================--- lib/Target/X86/X86TargetTransformInfo.cpp (revision 280423) +++ lib/Target/X86/X86TargetTransformInfo.cpp (working copy) @@ -41,6 +41,10 @@ return ST->hasPOPCNT() ? 
TTI::PSK_FastHardware : TTI::PSK_Software; } +bool X86TTIImpl::enableInterleavedAccessVectorization() { + return true; +} + unsigned X86TTIImpl::getNumberOfRegisters(bool Vector) { if (Vector && !ST->hasSSE1()) return 0; Index: lib/Target/X86/X86TargetTransformInfo.h ==================================================================--- lib/Target/X86/X86TargetTransformInfo.h (revision 280423) +++ lib/Target/X86/X86TargetTransformInfo.h (working copy) @@ -59,6 +59,7 @@ /// \name Vector TTI Implementations /// @{ + bool enableInterleavedAccessVectorization(); unsigned getNumberOfRegisters(bool Vector); unsigned getRegisterBitWidth(bool Vector); unsigned getMaxInterleaveFactor(unsigned VF); From: Michael Kuperstein [mailto:mkuper at google.com] Sent: Friday, September 02, 2016 02:47 To: Zaks, Ayal <ayal.zaks at intel.com> Cc: Demikhovsky, Elena <elena.demikhovsky at intel.com>; Renato Golin <renato.golin at linaro.org>; Matthew Simpson <mssimpso at codeaurora.org>; Nema, Ashutosh <Ashutosh.Nema at amd.com>; Sanjay Patel <spatel at rotateright.com>; llvm-dev <llvm-dev at lists.llvm.org> Subject: Re: [llvm-dev] enabling interleaved access loop vectorization Yes, carefully inserting branches is the way to go! Seriously though - you probably saw that I just committed a fix for PR29025 (r280418). For the reproducer you provided, we now have (without forcing vectorization, and without "padding" to have power-of-2 stride): $ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx && time ~/llvm/temp/rgb2yik.exe real 0m2.290s user 0m2.289s sys 0m0.003s $ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx -mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe real 0m1.095s user 0m1.095s sys 0m0.002s Care to give it a spin internally? Note that this is not a full solution - we still won't vectorize PR27619, and force-vectorizing it is still a bad idea. 
Getting that right will require more lowering improvements as well as cost model adjustments. But hopefully post-r280418 things should be good enough to avoid regressions for the cases we will vectorize. If you still see regressions, more reproducers will be appreciated. :-) If there are no more regressions, let me know, and I'll post a patch to enable interleaved access for x86. Thanks, Michael On Thu, Sep 1, 2016 at 4:26 PM, Zaks, Ayal <ayal.zaks at intel.com<mailto:ayal.zaks at intel.com>> wrote: So turns out it is a full reproducer after all (choosing to vectorize on AVX), good.> The details are in PR29025.Interesting. (So we should carefully insert unconditional branches inside shuffle sequences, eh? ;-)> But if we modify the program by adding "*out++ = 0" right after "*out++ = q;" (thus eliminating the pesky <12 x i8>), we get:Indeed such padding is a known (programmer) optimization to effectively have power-of-2 strides and/or alignment.> So, unfortunately, it turns out I don't have access to DENBench.If you like we could test your patch to see how it (mis)behaves. 
From: Michael Kuperstein [mailto:mkuper at google.com<mailto:mkuper at google.com>] Sent: Thursday, August 18, 2016 03:57 To: Zaks, Ayal <ayal.zaks at intel.com<mailto:ayal.zaks at intel.com>> Cc: Demikhovsky, Elena <elena.demikhovsky at intel.com<mailto:elena.demikhovsky at intel.com>>; Renato Golin <renato.golin at linaro.org<mailto:renato.golin at linaro.org>>; Matthew Simpson <mssimpso at codeaurora.org<mailto:mssimpso at codeaurora.org>>; Nema, Ashutosh <Ashutosh.Nema at amd.com<mailto:Ashutosh.Nema at amd.com>>; Sanjay Patel <spatel at rotateright.com<mailto:spatel at rotateright.com>>; llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> Subject: Re: [llvm-dev] enabling interleaved access loop vectorization So, at least for this example, it looks like we actually want to vectorize with -enable-interleaved-mem-accesses, we just need the backend to generate good code for the vector types that produces, specifically, in this case, <12 x i8>. The details are in PR29025. The upshot of this is that for the original program (with an outer loop around it): $ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx && time ~/llvm/temp/rgb2yik.exe real 0m2.229s user 0m2.224s $ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx -mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe real 0m2.590s user 0m2.584s This indicates that we do have a slight cost modeling issue - the cost model is not quite conservative enough in case we really do use inserts and extracts. One thing we're probably not accounting for is a bunch of GPR spills - although I'm not sure *why* we end up spilling so much. So perhaps this should also be fixed in regalloc. 
But if we modify the program by adding "*out++ = 0" right after "*out++ = q;" (thus eliminating the pesky <12 x i8>), we get: $ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx && time ~/llvm/temp/rgb2yik.exe real 0m2.257s user 0m2.256s $ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx -mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe real 0m0.958s user 0m0.956s On Wed, Aug 17, 2016 at 2:56 PM, Michael Kuperstein <mkuper at google.com<mailto:mkuper at google.com>> wrote: Thanks Ayal! On Wed, Aug 17, 2016 at 2:14 PM, Zaks, Ayal <ayal.zaks at intel.com<mailto:ayal.zaks at intel.com>> wrote: Hi Michael, Don’t quite have a full reproducer for you yet. You’re welcome to try and see what’s happening in 32 bit mode when enabling interleaving for the following, based on “https://en.wikipedia.org/wiki/YIQ#From_RGB_to_YIQ”: void rgb2yik (char * in, char * out, int N) { int j; for (j = 0; j < N; ++j) { unsigned char r = *in++; unsigned char g = *in++; unsigned char b = *in++; unsigned char y = 0.299*r + 0.587*g + 0.114*b; signed char i = 0.596*r + -0.274*g + -0.321*b; signed char q = 0.211*r + -0.523*g + 0.312*b; *out++ = y; *out++ = (unsigned char)i; *out++ = (unsigned char)q; } } but you’d currently need to force it to vectorize to overcome its expected cost. Ayal. 
From: Michael Kuperstein [mailto:mkuper at google.com<mailto:mkuper at google.com>] Sent: Wednesday, August 17, 2016 00:51 To: Zaks, Ayal <ayal.zaks at intel.com<mailto:ayal.zaks at intel.com>>; Demikhovsky, Elena <elena.demikhovsky at intel.com<mailto:elena.demikhovsky at intel.com>> Cc: Renato Golin <renato.golin at linaro.org<mailto:renato.golin at linaro.org>>; Matthew Simpson <mssimpso at codeaurora.org<mailto:mssimpso at codeaurora.org>>; Nema, Ashutosh <Ashutosh.Nema at amd.com<mailto:Ashutosh.Nema at amd.com>>; Sanjay Patel <spatel at rotateright.com<mailto:spatel at rotateright.com>>; llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> Subject: Re: [llvm-dev] enabling interleaved access loop vectorization Hi Ayal, Elena, I'd really like to enable this by default. As I wrote above, I didn't see any regressions in internal benchmarks, and there doesn't seem to be anything in SPEC2006 either. I do see a performance improvement in an internal benchmark (that is, a real workload). Would you be able to provide an example that gets pessimized? I have no doubt you've seen regressions related to this, but the fact they exist doesn't help me analyze them as long as I can't see them. :-) I'd really rather look at regressions before making the change - and either try to make the necessary improvements to the cost model, or abandon this as unfeasible for now (pending Ashutosh's work). If you can't, an alternative is to turn this on, and then, if regressions show up on anyone's radar (where we can actually get a reproducer), turn it off again and go back to analysis. But I'd strongly prefer to "prefetch" the problem. Thanks, Michael On Wed, Aug 10, 2016 at 4:32 PM, Michael Kuperstein <mkuper at google.com<mailto:mkuper at google.com>> wrote: So, unfortunately, it turns out I don't have access to DENBench. Do you happen to have a reduced example that gets pessimized by this? 
On Tue, Aug 9, 2016 at 11:25 AM, Michael Kuperstein <mkuper at google.com> wrote:

Thanks Ayal! I'll take a look at DENBench.

As another data point - I tried enabling this on our internal benchmarks. I'm seeing one regression, and it seems to be a regression of the "good" kind - without interleaving we don't vectorize the innermost loop, and with interleaving we do. The vectorized loop is actually significantly faster when benchmarked in isolation, but in this specific instance, the static loop count is unknown, and the dynamic loop count happens to almost always be 1 - and this lives inside a hot outer loop. That's something we ought to be handling through PGO (or, conceivably, outer loop vectorization :-) ).

Michael

On Mon, Aug 8, 2016 at 3:21 PM, Zaks, Ayal <ayal.zaks at intel.com> wrote:

> We also need to understand what to do with edge elements in the vector if their loading is not required. We, probably, should issue a masked load in this case.

The existing code solves such edge cases where the last element of an InterleaveGroup is absent by making sure the last iteration (and up to last VF iterations) are peeled and executed scalarly; see requiresScalarEpilogue.

> All regressions that we see are in 32-bit mode.

One place to find them, using the default BaseT::getInterleavedMemoryOpCost(), is DENBench’s RGB conversions.

Ayal.
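The scalar-epilogue idea Ayal mentions can be illustrated by hand in C. The sketch below is only an analogy for what the vectorizer arranges, not the code LLVM emits: the function name sum_pairs, the constants VF and STRIDE, and the memcpy standing in for a wide vector load are all assumptions made for illustration. The loop reads members 0 and 1 of each 3-byte group and never member 2, so a wide load over whole groups would touch a byte the scalar loop never reads, possibly past the end of the buffer on the last iteration.

```c
#include <string.h>

enum { VF = 4, STRIDE = 3 };  /* illustrative values, not taken from LLVM */

/* out[j] = in[3*j] + in[3*j+1]; member 2 of each group is never read,
   so the input buffer may legitimately end at byte 3*N - 2. */
void sum_pairs(const unsigned char *in, unsigned char *out, int N)
{
    int j = 0;

    /* "Vector" body: the memcpy stands in for one wide load of VF whole
       groups, including the unused member 2 of the last group in the
       block. It runs only while at least one iteration is left over,
       so the extra byte it touches always belongs to a later group. */
    for (; j + VF < N; j += VF) {
        unsigned char wide[VF * STRIDE];
        memcpy(wide, in + STRIDE * j, sizeof wide);
        for (int k = 0; k < VF; ++k)
            out[j + k] = (unsigned char)(wide[STRIDE * k] + wide[STRIDE * k + 1]);
    }

    /* Scalar epilogue: the remaining 1 to VF iterations read exactly the
       bytes the original loop reads and nothing more. */
    for (; j < N; ++j)
        out[j] = (unsigned char)(in[STRIDE * j] + in[STRIDE * j + 1]);
}
```

Using `j + VF < N` rather than `j + VF <= N` as the loop condition is what forces at least one iteration into the epilogue; when N is a multiple of VF, the whole last block of VF iterations runs scalar, matching the "up to last VF iterations" wording above.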
From: Demikhovsky, Elena
Sent: Monday, August 08, 2016 00:09
To: Michael Kuperstein <mkuper at google.com>; Renato Golin <renato.golin at linaro.org>
Cc: Matthew Simpson <mssimpso at codeaurora.org>; Nema, Ashutosh <Ashutosh.Nema at amd.com>; Sanjay Patel <spatel at rotateright.com>; llvm-dev <llvm-dev at lists.llvm.org>; Zaks, Ayal <ayal.zaks at intel.com>
Subject: RE: [llvm-dev] enabling interleaved access loop vectorization

We checked the gathered data again. All regressions that we see are in 32-bit mode. The 64-bit mode looks good overall.

- Elena

From: Michael Kuperstein [mailto:mkuper at google.com]
Sent: Saturday, August 06, 2016 02:56
To: Renato Golin <renato.golin at linaro.org>
Cc: Demikhovsky, Elena <elena.demikhovsky at intel.com>; Matthew Simpson <mssimpso at codeaurora.org>; Nema, Ashutosh <Ashutosh.Nema at amd.com>; Sanjay Patel <spatel at rotateright.com>; llvm-dev <llvm-dev at lists.llvm.org>; Zaks, Ayal <ayal.zaks at intel.com>
Subject: Re: [llvm-dev] enabling interleaved access loop vectorization

On Fri, Aug 5, 2016 at 4:37 PM, Renato Golin <renato.golin at linaro.org> wrote:

On 6 August 2016 at 00:18, Michael Kuperstein <mkuper at google.com> wrote:

> I agree that we can get *more* improvement with better cost modeling, but
> I'd expect to be able to get *some* improvement the way things are right
> now.

Elena said she saw "some" improvements. :)

I didn't mean "some improvements, some regressions", I meant "some of the improvement we'd expect from the full solution". :-)

> That's why I'm curious about where we saw regressions - I'm wondering
> whether there's really a significant cost modeling issue I'm missing, or
> it's something that's easy to fix so that we can make forward progress,
> while Ashutosh is working on the longer-term solution.

Sounds like a task to try a few patterns and fiddle with the cost model. Arnold did a lot of those during the first months of the vectorizer, so it might be just a matter of finding the right heuristics, at least for the low-hanging fruit.

Of course, that'd also involve benchmarking everything else, to make sure the new heuristics don't introduce regressions on non-interleaved vectorisation.

I don't disagree with you. All I'm saying is that before fiddling with the heuristics, it'd be good to understand what exactly breaks if we simply flip the flag. If the answer happens to be "nothing" - well, problem solved. Unfortunately, according to Elena, that's not the answer. I'm going to play with it with our internal benchmarks, but it's my understanding that Elena/Ayal already have some idea of what the problems are.

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.