thr3ads.net - llvm dev - [LLVMdev] X86 FMA4 [Jul 2012]

dag at cray.com

2012-Jul-25 19:26 UTC

[LLVMdev] X86 FMA4

We're migrating to LLVM 3.1 and trying to use the upstream FMA patterns.

Why is VFMADDSD4 defined with vector types?  Is this simply because the
gcc intrinsic uses vector types?  It's quite unnatural if you have a
compiler that generates FMAs as opposed to requiring user intrinsics.

                               -Dave

Jan Sjodin

2012-Jul-26 13:41 UTC

Because the intrinsics use vector types (same as gcc).


- Jan

Cameron McInally

2012-Jul-26 14:49 UTC

Hey Jan and Dave,

It's not obvious, but following the GCC intrinsics carries a significant
scalar performance cost.

Let's look at the VFMADDSD pattern. We're operating on scalars, with the
remaining vector elements of the operands undefined. This sounds okay, but
when one looks closer...

       vmovsd  fp4_+1088(%rip), %xmm3  # fpppp.f:647
       vmovaps %xmm3, 18560(%rsp)      # fpppp.f:647 <= 16-byte spill
       vfmaddsd        %xmm5, fp4_+3288(%rip), %xmm3, %xmm3 # fpppp.f:647

The spill here is 16 bytes, but we're only using the low 8 bytes of
xmm3. Changing the intrinsics and patterns to accept scalar operands, we
end up with...

       vmovsd  fp4_+1056(%rip), %xmm0  # fpppp.f:666
       vmovsd  %xmm0, 10088(%rsp)      # fpppp.f:666 <= 8-byte spill
       vfmaddsd        %xmm3, fp4_+3288(%rip), %xmm0, %xmm3 # fpppp.f:666

I do not know the actual number of cycles offhand, but I believe on
Interlagos and Sandy Bridge, a vmovaps takes roughly 3x as many micro-ops as
a vmovsd if it involves memory.

-Cameron
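
To put numbers on the two listings above, here is a small Python sketch of the spill-slot arithmetic. The slot sizes (16 and 8 bytes) and the stack offsets (18560 and 10088) come from the assembly. The helper names and the spill count of 100 are hypothetical, chosen only for illustration, and the alignment check assumes %rsp itself is 16-byte aligned.

```python
# Bytes written per spill, as annotated in the two listings above:
# vmovaps stores the whole 128-bit XMM register, vmovsd only the low 64 bits.
SPILL_BYTES = {"vmovaps": 16, "vmovsd": 8}

# A scalar double occupies 8 bytes; the upper lanes are undefined.
LIVE_BYTES = 8


def spill_traffic(num_spills, store_op):
    """Total bytes stored for num_spills spills (num_spills is hypothetical)."""
    return num_spills * SPILL_BYTES[store_op]


def wasted_fraction(store_op):
    """Share of each spill slot that carries no live data for a scalar double."""
    slot = SPILL_BYTES[store_op]
    return (slot - LIVE_BYTES) / slot


def fits_aligned_store(offset, store_op):
    """True if offset is a multiple of the store width.

    vmovaps faults on a memory operand that is not 16-byte aligned, so this
    check assumes the base register (%rsp) is itself 16-byte aligned.
    """
    return offset % SPILL_BYTES[store_op] == 0


if __name__ == "__main__":
    for op in ("vmovaps", "vmovsd"):
        print(op, spill_traffic(100, op), wasted_fraction(op))
    # Offsets taken from the listings: 18560(%rsp) and 10088(%rsp).
    print(fits_aligned_store(18560, "vmovaps"))  # slot used by the vector spill
    print(fits_aligned_store(10088, "vmovaps"))  # slot used by the scalar spill
```

Under these assumptions, half of every vector-typed spill slot holds no live data, and the scalar slot at 10088 is only 8-byte aligned, so it could not have served an aligned 16-byte store.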


dag at cray.com

2012-Jul-26 15:18 UTC

Jan Sjodin <jan_sjodin at yahoo.com> writes:

> Because the intrinsics use vector types (same as gcc).

Ok, so there's no fundamental reason.  We are working on fixing this so
that compilers that auto-generate FMAs can produce efficient code.
We'll be sending this work up once our 3.1 integration is finished.
It's just a matter of adding some new patterns.

                                -Dave

Cameron McInally

2012-Jul-26 18:46 UTC

Ah, bad example. This is a general problem for all (or at least most) SSE and
AVX SS/SD patterns though, which is why I mentioned Sandy Bridge. You can swap
out VFMADDSD in my example for VADDSD or whatever you like.

I have the lion's share of such a change implemented already, and the effect on
performance is substantial. If the community is interested in this change, I
would be happy to prepare a patch.

-Cameron

On Thu, Jul 26, 2012 at 2:27 PM, Jan Sjodin <jan_sjodin at yahoo.com> wrote:
> You can't execute FMA4 instructions on Intel processors, so it doesn't
> really matter what the impact of the move instructions would be, since it
> would end up with an illegal instruction regardless. :) It does perhaps
> bring up an issue of tuning for different architectures, but that is
> something nobody is really looking into at the moment afaik.
>
> - Jan

dag at cray.com

2012-Jul-26 19:00 UTC

Jan Sjodin <jan_sjodin at yahoo.com> writes:

> You can't execute FMA4 instructions on Intel processors, so it doesn't
> really matter what the impact of the move instructions would be, since
> it would end up with an illegal instruction regardless. :)

Interlagos?  All the world is not Intel.

> It does perhaps bring up an issue of tuning for different
> architectures, but that is something nobody is really looking into at
> the moment afaik.

*ahem*

:)

                           -Dave

Michael Gottesman

2012-Jul-27 07:45 UTC

I just looked up Agner Fog's numbers for Sandy Bridge for vmovaps and friends
loading from and storing to memory.

vmovaps load: 1 load micro-op, latency 3, reciprocal throughput 0.5.
vmovaps store: 1 store micro-op plus 1 load micro-op for address calculation,
latency 3, reciprocal throughput 1.

He does not list vmovsd, but movsd has the same stats as vmovaps, so I think it
is safe to assume that vmovsd has the same stats as well.

Michael
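
The figures above can be set against the earlier 3x estimate with a few lines of Python. This is a sketch only: the table holds the numbers exactly as quoted in this message, the vmovsd rows rest on the stated assumption that vmovsd matches vmovaps, and the names `SANDY_BRIDGE` and `uop_ratio` are made up for the example.

```python
# Sandy Bridge figures as quoted above. The vmovsd entries are NOT measured:
# they encode the message's assumption that vmovsd behaves like vmovaps.
VMOVAPS = {
    "load": {"uops": 1, "latency": 3, "recip_throughput": 0.5},
    # 1 store micro-op + 1 micro-op for address calculation
    "store": {"uops": 2, "latency": 3, "recip_throughput": 1.0},
}
SANDY_BRIDGE = {"vmovaps": VMOVAPS, "vmovsd": VMOVAPS}


def uop_ratio(op_a, op_b, kind):
    """Micro-op count of op_a relative to op_b for a 'load' or 'store'."""
    return SANDY_BRIDGE[op_a][kind]["uops"] / SANDY_BRIDGE[op_b][kind]["uops"]


if __name__ == "__main__":
    for kind in ("load", "store"):
        print(kind, uop_ratio("vmovaps", "vmovsd", kind))
```

With these inputs the ratio is 1 for both loads and stores, so the quoted figures do not support a 3x micro-op difference on Sandy Bridge. They say nothing about Interlagos.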

