Just looked up the numbers from Agner Fog for Sandy Bridge for vmovaps/etc. for loading/storing from memory:

vmovaps (load): 1 load µop, latency 3, with a reciprocal throughput of 0.5.
vmovaps (store): 1 store µop plus 1 load µop for the address calculation, latency 3, with a reciprocal throughput of 1.

He does not list vmovsd, but movsd has the same stats as vmovaps, so I feel it is a safe assumption that vmovsd has the same stats as well.

Michael

On Jul 26, 2012, at 11:46 AM, Cameron McInally wrote:

> Ah, bad example. This is a general problem for all (maybe most) SSE and AVX SS/SD patterns, though, which is why I mentioned Sandybridge. You can swap out VFMADDSD in my example for VADDSD or whatever you like.
>
> I have a lion's share of such a change implemented already and performance is greatly affected. If the community is interested in this change, I would be happy to prepare a patch.
>
> -Cameron
>
> On Thu, Jul 26, 2012 at 2:27 PM, Jan Sjodin <jan_sjodin at yahoo.com> wrote:
>> You can't execute FMA4 instructions on Intel processors, so it doesn't really matter what the impact of the move instructions would be, since it would end up with an illegal instruction regardless. :) It does perhaps bring up an issue of tuning for different architectures, but that is something nobody is really looking into at the moment, afaik.
>>
>> - Jan
>>
>>> From: Cameron McInally <cameron.mcinally at nyu.edu>
>>> To: Jan Sjodin <jan_sjodin at yahoo.com>
>>> Cc: "dag at cray.com" <dag at cray.com>; "llvmdev at cs.uiuc.edu" <llvmdev at cs.uiuc.edu>
>>> Sent: Thursday, July 26, 2012 10:49 AM
>>> Subject: Re: [LLVMdev] X86 FMA4
>>>
>>> Hey Jan and Dave,
>>>
>>> It's not obvious, but there is a significant scalar performance issue following the GCC intrinsics.
>>>
>>> Let's look at the VFMADDSD pattern. We're operating on scalars with undefs as the remaining vector elements of the operands. This sounds okay, but when one looks closer...
>>>
>>>     vmovsd   fp4_+1088(%rip), %xmm3                 # fpppp.f:647
>>>     vmovaps  %xmm3, 18560(%rsp)                     # fpppp.f:647 <= 16-byte spill
>>>     vfmaddsd %xmm5, fp4_+3288(%rip), %xmm3, %xmm3   # fpppp.f:647
>>>
>>> The spill here is 16 bytes, but we're only using the low 8 bytes of xmm3. Changing the intrinsics and patterns to accept scalar operands, we end up with...
>>>
>>>     vmovsd   fp4_+1056(%rip), %xmm0                 # fpppp.f:666
>>>     vmovsd   %xmm0, 10088(%rsp)                     # fpppp.f:666 <= 8-byte spill
>>>     vfmaddsd %xmm3, fp4_+3288(%rip), %xmm0, %xmm3   # fpppp.f:666
>>>
>>> I do not know the actual number of cycles offhand, but I believe that on Interlagos and Sandybridge a vmovaps takes roughly 3x as many micro-ops as a vmovsd if it involves memory.
>>>
>>> -Cameron
>>>
>>> On Thu, Jul 26, 2012 at 9:41 AM, Jan Sjodin <jan_sjodin at yahoo.com> wrote:
>>>> Because the intrinsics use vector types (same as gcc).
>>>>
>>>> - Jan
>>>>
>>>> ----- Original Message -----
>>>>> From: "dag at cray.com" <dag at cray.com>
>>>>> To: llvmdev at cs.uiuc.edu
>>>>> Sent: Wednesday, July 25, 2012 3:26 PM
>>>>> Subject: [LLVMdev] X86 FMA4
>>>>>
>>>>> We're migrating to LLVM 3.1 and trying to use the upstream FMA patterns.
>>>>>
>>>>> Why is VFMADDSD4 defined with vector types? Is this simply because the gcc intrinsic uses vector types? It's quite unnatural if you have a compiler that generates FMAs as opposed to requiring user intrinsics.
Hey Michael,

Thanks for the legwork! It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps (m128) is closer to vmovaps (m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding.

As I am sure you are aware, we cannot use SSE (movaps) instructions in an AVX context, or else we'll pay the context switch penalty. It might be too much to assume that movaps and vmovaps have the same timings. Same for movsd.

Also, I'm sure you are aware that the Sandybridge optimization guide suggests that unaligned 256-bit stores be split into a 128-bit store and a 128-bit extract. This does argue against my above assumption.

For full disclosure, I have not timed the individual instructions; just kernels. So my performance gains may be coming from another source related to this change. Most likely, my gains are from better use of cache, since we would not be moving unneeded bytes around. In the context of shared cache, this savings may be enough to keep the other cores busier. Not to mention the stack space saved. But I cannot say for sure right now.
> It appears that the stats you listed are for movaps [SSE], not vmovaps [AVX]. I would *assume* that vmovaps (m128) is closer to vmovaps (m256), since they are both AVX instructions. Although, yes, I agree that this is not clear from Agner's report. Please correct me if I am misunderstanding.

You are misunderstanding [no worries, happens to everyone = )]. The timings I listed were for vmovaps of the form

    vmovaps %xmm0, (mem)

i.e., its form as a 128-bit AVX instruction. Let me explain.

There are three categories of instructions we are discussing:

1. Normal SSE instructions.
2. 128-bit AVX instructions, which are just the same SSE instructions except encoded using the VEX prefix (and thus non-destructive*). I will always refer to these as 128-bit AVX instructions, never as SSE instructions.
3. 256-bit AVX instructions, which are the "true" AVX instructions. (Not that the 128-bit AVX instructions are not AVX instructions, if you define AVX instructions via the presence of a VEX prefix, but I am speaking about how AVX, in the minds of most programmers, is associated with 256-bit operations.)

First, note that 1 and 2 perform exactly the same. The difference between them is as follows: when you use a 256-bit AVX instruction, you cause a "dirty state" to be entered**. After that occurs, every SSE instruction used will cause the processor to save/restore the upper 128 bits of the ymm register aliased onto the output xmm register of the SSE instruction, resulting in bad performance. On the other hand, if you use the 128-bit AVX form of the SSE instructions, you are signaling to the processor that you do not care about the upper 128 bits of the aliased ymm register, so it can just zero the top bits, and the bad performance is avoided. That is the whole point of the 128-bit AVX form of the SSE instructions: to enable you to mix SSE/AVX instructions without paying said penalty. Calling vzeroupper restores the ymm registers to a clean state, allowing you to use the normal SSE instructions again without slowdown. Additionally, note that the 128-bit AVX instructions do not cause the "dirty state" to be entered, allowing you to mix/match with normal SSE and take advantage of the lack of implicit arguments/nice non-destructive encoding if you choose to (in case you can't tell, I like the non-destructive encoding a lot).

* This is important since SSE instructions with implicit operands (e.g. blendvps) gain an explicit operand when instantiated as a 128-bit AVX instruction (vblendvps).
** NOTE: The dirty state is not synonymous with the upper bits of all of the ymm registers being zero. See the Intel AVX optimization guide.

> As I am sure you are aware, we cannot use SSE (movaps) instructions in an AVX context, or else we'll pay the context switch penalty. It might be too much to assume that movaps and vmovaps have the same timings. Same for movsd.

See above.

> Also, I'm sure you are aware that the Sandybridge optimization guide suggests that unaligned 256-bit stores be split into a 128-bit store and a 128-bit extract. This does argue against my above assumption.

This is true. The reason they suggest that is so that you avoid storing over a page boundary, which causes obscene slowdowns. As an aside, if you are doing any vector coding, you should always align the stores and use unaligned loads.
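To put the mixing rules and the split-store suggestion in concrete terms, here are some made-up instruction sequences (purely illustrative; the labels are invented and nothing below was measured):

        # (a) Mixing 256-bit AVX with legacy SSE. The first vaddps dirties the
        #     upper ymm state, so the legacy movaps that follows pays the
        #     save/restore penalty described above. Issuing vzeroupper first,
        #     or staying on the VEX-encoded 128-bit form, avoids it.
    mixing_demo:
        vaddps  %ymm1, %ymm2, %ymm0       # 256-bit AVX: upper state is now dirty
        movaps  %xmm3, %xmm4              # legacy SSE: pays the transition penalty

        vaddps  %ymm1, %ymm2, %ymm0       # dirty again
        vzeroupper                        # back to a clean state
        movaps  %xmm3, %xmm4              # legacy SSE: no penalty

        vaddps  %ymm1, %ymm2, %ymm0       # dirty again
        vmovaps %xmm3, %xmm4              # VEX-encoded 128-bit form: no penalty;
                                          # the upper bits of ymm4 are just zeroed
        vzeroupper                        # clean up before returning
        ret

        # (b) The split suggested by the optimization guide for a potentially
        #     unaligned 256-bit store: a 128-bit store plus a 128-bit extract,
        #     instead of a single vmovups %ymm0, (%rdi).
    split_store_demo:                     # rdi = destination, ymm0 = data
        vmovups      %xmm0, (%rdi)        # low 128 bits
        vextractf128 $1, %ymm0, 16(%rdi)  # high 128 bits
        vzeroupper                        # leave a clean state for the caller
        ret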
> For full disclosure, I have not timed the individual instructions; just kernels. So my performance gains may be coming from another source related to this change. Most likely, my gains are from better use of cache, since we would not be moving unneeded bytes around. In the context of shared cache, this savings may be enough to keep the other cores busier. Not to mention the stack space saved. But I cannot say for sure right now.

I have actually timed said instructions in the past and reproduced Agner Fog's results. I just prefer to speak by referring to facts that cannot be misconstrued as hearsay = ). But if you don't believe me, time the instructions yourself (it's an important thing to have in your toolbox anyway, since sometimes Intel's documentation can be non-specific). I have a small instruction-timing project lying around somewhere; if you want it, I can send it to you privately.
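For what it's worth, a skeleton along these lines (a rough, hypothetical sketch; not the project mentioned above) is enough to get first numbers for a single instruction:

        # Rough TSC-based timing skeleton: run the instruction under test in a
        # large loop and read the time stamp counter before and after. A real
        # harness would also warm up, pin the thread, and subtract the loop
        # overhead.
    time_vmovsd_load:                     # rdi = pointer to an aligned buffer
        lfence                            # keep earlier work out of the measurement
        rdtsc                             # edx:eax = start cycle count
        shlq    $32, %rdx
        orq     %rax, %rdx
        movq    %rdx, %r8                 # r8 = start
        movl    $1000000, %ecx
    1:  vmovsd  (%rdi), %xmm0             # instruction under test
        decl    %ecx
        jnz     1b
        lfence
        rdtsc                             # edx:eax = end cycle count
        shlq    $32, %rdx
        orq     %rax, %rdx
        subq    %r8, %rdx
        movq    %rdx, %rax                # return elapsed cycles for 1,000,000 runs
        ret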
Michael