thr3ads.net - llvm dev - [LLVMdev] Generate scalar SSE instructions instead of packed instructions [Feb 2013]

Cameron McInally

2013-Feb-21 23:38 UTC

[LLVMdev] Generate scalar SSE instructions instead of packed instructions

On Thu, Feb 21, 2013 at 12:14 PM, Nadav Rotem <nrotem at apple.com> wrote:
> You can change the input LLVM-IR.
>
> On Feb 21, 2013, at 7:16 AM, "Nowicki, Tyler" <tyler.nowicki at intel.com>
> wrote:
>
> Hi,
>
> I am interested in evaluating the performance of packed vs scalar
> double-precision floating point instructions on x86-atom and I was
> wondering if anyone knows more precisely where to modify llvm to use one or
> the other. I know I probably need to change something in the type
> legalizer. Could anyone provide more details than that?

Hey Tyler,

Nadav is correct. Un-vectorizing would best be done before the IR level.

If one split the vectors at the ISel level, one would incur unnecessary
extracts, which would skew the timing data.
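
As a rough source-level analogy for the point about extracts (a sketch only; the function names are hypothetical, and the actual ISel behaviour depends on the LLVM version and target), compare a packed add, a version that splits values already held in vector registers, and a version that is scalar from the start:

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_add_pd, _mm_cvtsd_f64, ... */

/* Packed: a single addpd handles both doubles. */
void add2_packed(const double *a, const double *b, double *out)
{
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}

/* Split late: the data is already in vector registers, so each lane has
   to be extracted before the scalar adds can run. */
void add2_split_late(const double *a, const double *b, double *out)
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    double a0 = _mm_cvtsd_f64(va);                       /* lane 0 */
    double a1 = _mm_cvtsd_f64(_mm_unpackhi_pd(va, va));  /* lane 1 */
    double b0 = _mm_cvtsd_f64(vb);
    double b1 = _mm_cvtsd_f64(_mm_unpackhi_pd(vb, vb));
    out[0] = a0 + b0;
    out[1] = a1 + b1;
}

/* Scalar from the start: no vector values, so nothing to extract. */
void add2_scalar(const double *a, const double *b, double *out)
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
}
```

An optimizing compiler may simplify or re-vectorize any of these, so the generated assembly should be checked before timing anything.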

To digress a bit, I've found that it's necessary to rewrite the scalar SSE
patterns to accept true scalar operands, not fake vector operands like the
GNU built-ins. This topic was discussed a while back, and the popular belief
is that partial register updates would cause a performance hit when
operating on true scalars. However, my empirical evidence suggests that the
extra memory traffic of stuffing vectors is more of a performance hit than
the partial register updates. Unfortunately, this runs counter to the
available documentation. And it may only be true for the benchmarks
that hold my interest.

For completeness, I'm mainly interested in Interlagos and Sandybridge, so
this conjecture may not hold for other processors such as Atom.
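
To make the "fake vector operand" point concrete, here is a minimal sketch using the standard SSE intrinsics from xmmintrin.h (the two function names are hypothetical, chosen for illustration). The scalar built-in `_mm_min_ss` takes and returns `__m128` values, so a caller holding plain floats has to stuff them into vectors first and pull the result back out:

```c
#include <xmmintrin.h>  /* SSE: _mm_set_ss, _mm_min_ss, _mm_cvtss_f32 */

/* Scalar min through the vector-typed built-in: each float is placed in
   lane 0 of an __m128 (upper lanes zeroed), and lane 0 of the result is
   read back out afterwards. */
float min_via_vector_operands(float a, float b)
{
    __m128 va = _mm_set_ss(a);
    __m128 vb = _mm_set_ss(b);
    return _mm_cvtss_f32(_mm_min_ss(va, vb));
}

/* The same operation written on true scalars. The comparison is written
   so that it returns b when the operands are unordered, matching minss. */
float min_true_scalar(float a, float b)
{
    return a < b ? a : b;
}
```

Whether the first form actually costs extra memory traffic depends on the compiler and optimization level; this only shows the shape of the operands, not a measured difference.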

Hope this helps,
Cameron

Nowicki, Tyler

2013-Feb-26 20:38 UTC

Thanks for the replies; they were very helpful.

Is it enough to prevent BBVectorize from packing together double-precision
instructions? If a non-clang frontend is used, such as ISPC, is it possible that
the IR may contain packed double instructions?

Tyler
Cameron McInally

2013-Feb-26 21:39 UTC

On Tue, Feb 26, 2013 at 3:38 PM, Nowicki, Tyler <tyler.nowicki at intel.com> wrote:
> Thanks for the reply, they were very helpful.
>
> Is it enough to prevent BBVectorize from packing together double precision
> instructions? If a non-clang frontend is used, such as ISPC, is it possible
> that the IR may contain packed double instruction?
>
Yes, it is possible for the IR to include packed SSE instructions.

I am not familiar with the ISPC frontend or Atom. But, in the general case,
a frontend could be using the SSE intrinsics, which can make use of packed
operands. For example:

  def int_x86_sse_min_ps : GCCBuiltin<"__builtin_ia32_minps">,
              Intrinsic<[llvm_v4f32_ty], [llvm_v4f32_ty,
                         llvm_v4f32_ty], [IntrNoMem]>;

The compiler I work on has a proprietary vectorizer that runs before the
LLVM IR level. So, in our case, we have an extended set of proprietary
packed intrinsics similar to the GNU SSE built-ins.
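
For reference, here is a small sketch of how packed operands like those in the `int_x86_sse_min_ps` definition above can reach the IR from C source. In GCC's xmmintrin.h, `_mm_min_ps` is a wrapper around `__builtin_ia32_minps`; the helper name `min4` is hypothetical, and how a given frontend lowers the call is an assumption that should be checked against its emitted IR:

```c
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_min_ps, _mm_storeu_ps */

/* Element-wise minimum of two 4-float arrays using the packed
   single-precision min. Unaligned loads and stores are used so the
   arrays need no special alignment. */
void min4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_min_ps(va, vb));
}
```

Code written this way contains packed operations regardless of whether BBVectorize runs, which is why disabling the vectorizer alone may not be enough.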

-Cameron