Andrea Di Biagio
2014-Sep-10 10:36 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On Tue, Sep 9, 2014 at 11:39 PM, Chandler Carruth <chandlerc at google.com> wrote:
> Awesome, thanks for all the information!
>
> See below:
>
> On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>>
>> You have already mentioned how the new shuffle lowering is missing
>> some features; for example, you explicitly said that we currently lack
>> SSE4.1 blend support. Unfortunately, this seems to be one of the
>> main reasons for the slowdown we are seeing.
>>
>> Here is a list of what we have found so far that we think is causing most
>> of the slowdown:
>> 1) shufps is always emitted in cases where we could emit a single
>> blendps; in these cases, blendps is preferable because it has better
>> reciprocal throughput (this is true on all modern Intel and AMD CPUs).
>
> Yep. I think this is actually super easy. I'll add support for blendps
> shortly.

Thanks Chandler!

>> 3) When a shuffle performs an insert at index 0 we always generate an
>> insertps, while a movss would do a better job.
>> ;;;
>> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>>   ret <4 x float> %1
>> }
>> ;;;
>>
>> llc (-mcpu=corei7-avx):
>>   vmovss %xmm1, %xmm0, %xmm0
>>
>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>>   vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
>
> So, this is hard. I think we should do this in MC after register allocation
> because movss is the worst instruction ever: it switches from blending with
> the destination to zeroing the destination when the source switches from a
> register to a memory operand. =[ I would like to not emit movss in the DAG
> *ever*, and teach the MC combine pass to run after register allocation (and
> thus spills) have been emitted. This way we can match both patterns: when
> insertps is zeroing the other lanes and the operand is from memory, and when
> insertps is blending into the other lanes and the operand is in a register.
>
> Does that make sense? If so, would you be up for looking at this side of
> things? It seems nicely separable.

I think it is a good idea and it makes sense to me.
I will start investigating this and see what can be done.

Cheers,
Andrea
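To make point 1 above concrete, here is a minimal illustrative sketch (an assumed example, not one of the test cases attached in this thread) of a shuffle that is a pure lane-wise blend: every result lane i comes from either %A[i] or %B[i], so SSE4.1 can lower it to a single blendps (here lanes 1 and 3 come from %B, i.e. a blendps with immediate 0b1010), whereas a shufps-based lowering needs a costlier shuffle sequence.

;;;
; Lanes 0 and 2 from %A, lanes 1 and 3 from %B -- a per-lane select,
; which maps directly onto a single SSE4.1 blendps.
define <4 x float> @blend_example(<4 x float> %A, <4 x float> %B) {
  %r = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
  ret <4 x float> %r
}
;;;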
Chandler Carruth
2014-Sep-15 12:57 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Andrea, Quentin:

Ok, everything for blendps, insertps, movddup, movsldup, movshdup, unpcklps,
and unpckhps is committed and should generally be working. I've not tested it
*super* thoroughly (will do this ASAP), so if you run into something fishy,
don't burn lots of time on it.

I've also fixed a number of issues I found in the nightly test suite and
things like gcc-loops. I think there are still a couple of regressions I
spotted in the nightly test suite, but I haven't gotten to them yet.

I've got very rudimentary support for pblendw finished and committed. There is
a much more fundamental change that is really needed for pblendw support
though -- currently, the blend lowering strategy assumes this instruction
doesn't exist and thus picks a deeply wrong strategy in some cases... Not sure
how much this is even relevant though.

Anyways, it's almost certainly useful to look into any non-test-suite
benchmarks you have, or to run the benchmarks on non-Intel hardware. Let me
know how it goes! So far, with the fixes I've landed recently, I'm seeing more
improvements than regressions on the nightly test suite. =]

-Chandler
Andrea Di Biagio
2014-Sep-15 16:03 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On Mon, Sep 15, 2014 at 1:57 PM, Chandler Carruth <chandlerc at google.com> wrote:
> Andrea, Quentin:
>
> Ok, everything for blendps, insertps, movddup, movsldup, movshdup, unpcklps,
> and unpckhps is committed and should generally be working. I've not tested
> it *super* thoroughly (will do this ASAP), so if you run into something
> fishy, don't burn lots of time on it.

Ok.

> I've also fixed a number of issues I found in the nightly test suite and
> things like gcc-loops. I think there are still a couple of regressions I
> spotted in the nightly test suite, but haven't gotten to them yet.
>
> I've got very rudimentary support for pblendw finished and committed. There
> is a much more fundamental change that is really needed for pblendw support
> though -- currently, the blend lowering strategy assumes this instruction
> doesn't exist and thus picks a deeply wrong strategy in some cases... Not
> sure how much this is even relevant though.
>
> Anyways, it's almost certainly useful to look into any non-test-suite
> benchmarks you have, or to run the benchmarks on non-Intel hardware. Let me
> know how it goes! So far, with the fixes I've landed recently, I'm seeing
> more improvements than regressions on the nightly test suite. =]

Cool!
I'll have a look at it. I will let you know how it goes.
Thanks for working on this :-).

-Andrea
Quentin Colombet
2014-Sep-17 18:10 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Hi Chandler,

Here is a new test case. With the new lowering, we fail to fold a load into
the shuffle.

To reproduce:
llc -x86-experimental-vector-shuffle-lowering=true missing_folding.ll
llc -x86-experimental-vector-shuffle-lowering=false missing_folding.ll

-Quentin

On Sep 17, 2014, at 10:28 AM, Quentin Colombet <qcolombet at apple.com> wrote:
> Hi Chandler,
>
> I saw regressions in our internal testing. Some of them are avx/avx2 specific.
>
> Should I send reduced test cases for those, or is it something you haven't
> looked at yet and is thus expected?
>
> Anyway, here is the biggest offender. This is avx-specific.
>
> To reproduce:
> llc -x86-experimental-vector-shuffle-lowering=true -mattr=+avx avx_test_case.ll
> llc -x86-experimental-vector-shuffle-lowering=false -mattr=+avx avx_test_case.ll
>
> I'll send more test cases (first the non-avx-specific ones) as I reduce the
> regressions.
>
> Thanks,
> -Quentin
> <avx_test_case.ll>
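Since the reduced missing_folding.ll file is not included inline, here is a purely hypothetical sketch (not Quentin's actual test case) of the class of pattern being described: a shuffle whose input comes straight from memory, where the old lowering folds the load into the shuffle's memory operand (a single pshufd/shufps with a memory source) while the new lowering emits a separate vector load followed by a register shuffle.

;;;
; Hypothetical reproducer shape -- not the original attachment.
define <4 x i32> @fold_load(<4 x i32>* %p) {
  %v = load <4 x i32>* %p, align 16
  %s = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> <i32 1, i32 1, i32 2, i32 3>
  ret <4 x i32> %s
}
;;;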
Quentin Colombet
2014-Sep-17 19:51 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Hi Chandler,

Yet another test case :).
We use two shuffles instead of one palign.

To reproduce:
llc -x86-experimental-vector-shuffle-lowering=true missing_palign.ll -mcpu=core2
llc -x86-experimental-vector-shuffle-lowering=false missing_palign.ll -mcpu=core2

You can replace -mcpu=core2 with -mattr=+ssse3.

Q.
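Since the reduced missing_palign.ll is likewise not shown in the thread, here is an assumed sketch of the kind of shuffle that should map onto a single palignr: the mask selects a contiguous, byte-aligned window out of the concatenation of the two inputs.

;;;
; Takes elements 1..7 of %a followed by element 0 of %b -- a byte rotation
; across the concatenated inputs, i.e. a single 'palignr $2' on SSSE3,
; rather than two separate shuffles.
define <8 x i16> @palignr_example(<8 x i16> %a, <8 x i16> %b) {
  %r = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
  ret <8 x i16> %r
}
;;;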
Chandler Carruth
2014-Sep-17 20:30 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On Wed, Sep 17, 2014 at 12:51 PM, Quentin Colombet <qcolombet at apple.com> wrote:
> We use two shuffles instead of one palign.

Doh! I just forgot to teach it about palign... This one should at least be easy.
Quentin Colombet
2014-Sep-18 00:18 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Hi Chandler,

Here are a few more test cases. I've ordered them from the hottest to the
coldest.

To reproduce:
llc <test case> -x86-experimental-vector-shuffle-lowering=<true | false> [specific feature]

1. avx2_vperm.ll  avx2
   We use a sequence of extract, two shuffles, and insert instead of a single
   vperm when avx2 is set.

2. avx_blend.ll  avx
   Instead of using one big blend, we use two extracts, one small blend, and
   one insert.

3. avx2_extract2perm.ll  avx2
   We use a sequence of two instructions, extract and unpck, instead of one
   perm.

4. pxor.ll  none
   Instead of using pxor to set a register to zero, we use a sequence composed
   of xorpd and a shuffle.

5. sse4.1_pmovzxwd.ll  sse4.1
   Instead of using a single pmovzxwd, we use a movq followed by an unpck.

If you prefer, I can file PRs.

Cheers,
-Quentin
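The attached .ll files are not reproduced inline, so as an assumed sketch only, case 5 has roughly this shape: a widening zero extension that SSE4.1 can do with a single pmovzxwd instead of a movq plus an unpack against zero.

;;;
; Hypothetical shape of the sse4.1_pmovzxwd.ll case -- not the original file.
define <4 x i32> @zext_v4i16(<4 x i16> %v) {
  %z = zext <4 x i16> %v to <4 x i32>
  ret <4 x i32> %z
}
;;;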
Quentin Colombet
2014-Sep-19 18:53 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Thanks Chandler!

Here are two more test cases :). I kept the format of my previous email.

1. avx_unpck.ll  avx
   Instead of issuing a single unpck, we use a sequence of extracts, inserts,
   shuffles, and blends.

2. none_useless_shuffle.ll  none
   Instead of using a single move to materialize a zero-extended constant into
   a vector register, we explicitly zero a vector register and use a shuffle.

Cheers,
-Quentin

On Sep 18, 2014, at 2:13 AM, Chandler Carruth <chandlerc at google.com> wrote:
> As of r218038 we should get palign for all integer shuffles. That fixes the
> test case you reduced for me. If you have any other regressions that point to
> palignr, I'd be especially interested to have an actual test case. As I noted
> in my commit log, there seems to be a clear place where using this could be
> faster, but it introduces domain crossing. I don't really have a good model
> for the cost there and so am hesitant to go down that route without good
> evidence of the need.
Andrea Di Biagio
2014-Sep-19 20:22 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Hi Chandler,

I have tested the new shuffle lowering on an AMD Jaguar CPU (which is AVX but
not AVX2).

On this particular target, there is a delay when output data from an execution
unit is used as input to an execution unit of a different cluster. For
example, there are six execution units, divided into three execution clusters:
Float (FPM, FPA), Vector Integer (MMXA, MMXB, IMM), and Store (STC). Moving
data between clusters costs an additional 1-cycle latency penalty.
Your new shuffle lowering algorithm is very good at keeping the computation
inside clusters. This is an improvement with respect to the "old" shuffle
lowering algorithm.

I haven't observed any significant regression in our internal codebase. In one
particular case I observed a slowdown (around 1%); here is what I found when
investigating that slowdown.

1. With the new shuffle lowering, there is one case where we end up producing
the following sequence:
  vmovss .LCPxx(%rip), %xmm1
  vxorps %xmm0, %xmm0, %xmm0
  vblendps $1, %xmm1, %xmm0, %xmm0

Before, we used to generate a simpler:
  vmovss .LCPxx(%rip), %xmm1

In this particular case, the 'vblendps' is redundant since the vmovss would
zero the upper bits in %xmm1. I am not sure why we get this poor codegen with
your new shuffle lowering. I will investigate this bug further (maybe we no
longer trigger some ISel patterns?) and I will try to give you a small
reproducible test case for this particular issue.

2. There are cases where we no longer fold a vector load into one of the
operands of a shuffle. This is an example:
  vmovaps 320(%rsp), %xmm0
  vshufps $-27, %xmm0, %xmm0, %xmm0  # %xmm0 = %xmm0[1,1,2,3]

Before, we used to emit the following sequence:
  # 16-byte Folded reload.
  vpshufd $1, 320(%rsp), %xmm0       # %xmm0 = mem[1,0,0,0]

Note: the reason why the shuffle masks are different but still valid is that
the upper bits in %xmm0 are unused. Later on, the code uses register %xmm0 in
a 'vcvtss2sd' instruction; only the lower 32 bits of %xmm0 have a meaning in
this context. As for 1, I'll try to create a small reproducible test case.

3. When zero-extending 2 packed 32-bit integers, we should try to emit a
vpmovzxdq. Example:
  vmovq 20(%rbx), %xmm0
  vpshufd $80, %xmm0, %xmm0          # %xmm0 = %xmm0[0,0,1,1]

Before:
  vpmovzxdq 20(%rbx), %xmm0

4. We no longer emit a simpler 'vmovq' in the following case:
  vxorpd %xmm4, %xmm4, %xmm4
  vblendpd $2, %xmm4, %xmm2, %xmm4   # %xmm4 = %xmm2[0],%xmm4[1]

Before, we used to generate:
  vmovq %xmm2, %xmm4

Before, the vmovq implicitly zero-extended the quadword in %xmm2 to 128 bits.
Now we always do this with a vxorpd+vblendpd.

As I said, I will try to create smaller reproducible test cases for each of
the problems I found. I hope this helps. I will keep testing.

Thanks,
Andrea
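As a minimal sketch of the IR shape behind point 3 (assumed, since Andrea's reduced case is not in the thread): zero-extending two packed 32-bit integers loaded from memory, which ideally becomes a single vpmovzxdq with a folded load rather than a vmovq followed by a vpshufd.

;;;
; Assumed sketch for point 3 above -- not Andrea's reduced test case.
define <2 x i64> @zext_load_v2i32(<2 x i32>* %p) {
  %v = load <2 x i32>* %p, align 8
  %z = zext <2 x i32> %v to <2 x i64>
  ret <2 x i64> %z
}
;;;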
Quentin Colombet
2014-Sep-19 20:36 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Hi Andrea,

I think most if not all the regressions are covered by the previous test cases
I've provided. Please double-check if you want to avoid reducing them :).

On Sep 19, 2014, at 1:22 PM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
> 1. With the new shuffle lowering, there is one case where we end up producing
> the following sequence:
>   vmovss .LCPxx(%rip), %xmm1
>   vxorps %xmm0, %xmm0, %xmm0
>   vblendps $1, %xmm1, %xmm0, %xmm0
>
> Before, we used to generate a simpler:
>   vmovss .LCPxx(%rip), %xmm1

I think it should already be covered by one of the test cases I provided:
none_useless_shuffle.ll.

> 2. There are cases where we no longer fold a vector load into one of the
> operands of a shuffle.

Same here, I think this is already covered by missing_folding.ll.

> 3. When zero-extending 2 packed 32-bit integers, we should try to emit a
> vpmovzxdq.

Probably the same logic as sse4.1_pmovzxwd.ll, but you can double-check it.

> 4. We no longer emit a simpler 'vmovq' in the following case:
>   vxorpd %xmm4, %xmm4, %xmm4
>   vblendpd $2, %xmm4, %xmm2, %xmm4   # %xmm4 = %xmm2[0],%xmm4[1]

Probably the same as none_useless_shuffle.ll.

Cheers,
Q.
Chandler Carruth
2014-Sep-20 05:15 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
After adding some serious ninja-ry to the new shuffle lowering...

On Fri, Sep 19, 2014 at 11:53 AM, Quentin Colombet <qcolombet at apple.com> wrote:
> 2. none_useless_shuffle.ll  none
>    Instead of using a single move to materialize a zero-extended constant
>    into a vector register, we explicitly zero a vector register and use a
>    shuffle.

... this test case is fixed, as is your 'pxor.ll' test case from earlier, and
the 'movss' test cases from Andrea earlier in the thread (I suspect). Turns
out that there is a trick we can use in the existing tables to get most of the
memory-operand movss optimizations.

I think this is all of the non-avx-specific issues raised thus far... One of
the issues isn't avx-specific but can only be solved with avx. Anyways, I'll
look into some of the AVX issues next.
Quentin Colombet
2014-Sep-29 16:33 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Hi Chandler,

On Sep 29, 2014, at 6:05 AM, Chandler Carruth <chandlerc at google.com> wrote:
> On Tue, Sep 23, 2014 at 4:28 AM, Chandler Carruth <chandlerc at google.com> wrote:
>> AVX2 is still in-flight.
>
> AVX2 is pretty much done.
>
> All of the AVX and AVX2 lowering has now been heavily fuzz tested (a few
> million test cases and counting). I believe it is correct.
>
> I've added the basic framework for AVX-512. Nothing interesting is
> implemented there, mostly because I think there are still very big unanswered
> questions about how AVX-512 should work. For example, it would be good to
> lower with index-destructive vs. table-destructive shuffles based on # of
> uses, but that isn't really possible today. Even better would be to actually
> respect any loop structure or other invariant properties.
>
> There are still plenty of performance gains to be had in AVX or AVX2
> (broadcast support, work to combine away intermediate shuffling such as can
> be seen in the v32i8 test cases with interleaved unpacks, etc.).
>
> However, I think essentially all of the test cases (other than broadcast and
> shift test cases) have been fixed. I'd really like to enable this and let
> folks submit patches for the few remaining cases that impact them
> significantly.

Sounds good to me.

Just keep the old path until we are done with the triage of regressions.
Indeed, with r218454, I am seeing a few regressions (working on test cases),
but I do not think those should hold up your big refactoring.

Thanks for all your work.
-Quentin

> As far as I can tell, the new code paths offer very significant advantages
> for the hardware folks have today, with only a few downsides. While they are
> less implemented for AVX-512 than the current code, I don't really think that
> should be the priority.
>
> Are there any remaining objections?
> -Chandler
Chandler Carruth
2014-Sep-30 05:47 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Awesome. I'll send a new email to the list shortly as a heads-up and then
start switching the default over.

On Mon, Sep 29, 2014 at 9:33 AM, Quentin Colombet <qcolombet at apple.com> wrote:
> Just keep the old path until we are done with the triage of regressions.

Sure, as soon as you're done triaging, let me know. I'm eager to delete
several thousand lines of code to make up for adding so much. =]

> Indeed, with r218454, I am seeing a few regressions (working on test cases),
> but I do not think those should hold up your big refactoring.

Excellent. I would recommend filing PRs with the remaining test cases so that
others can also take a look.

Also, I know you mentioned seeing some AVX failures. There was indeed a bad
patch of mine (r218600) which I've reverted. Hope that fixes things.
Quentin Colombet
2014-Oct-02 20:41 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Hi Chandler,

I’ve filed a few PRs regarding the latest regressions I found. Here are the
links if you want the details:
http://llvm.org/bugs/show_bug.cgi?id=21137
http://llvm.org/bugs/show_bug.cgi?id=21138
http://llvm.org/bugs/show_bug.cgi?id=21139
http://llvm.org/bugs/show_bug.cgi?id=21140

I've already reported the first one a while back.

This is just FYI, I do not expect you to handle all the work :).

Cheers,
-Quentin

> On Oct 1, 2014, at 11:24 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>
> Hi Chandler,
>
> Not sure how important this can be, however I found a minor regression with
> the new shuffle lowering. Here is a reproducible test case:
>
> ;;
> define <4 x i32> @test(<4 x i32> %V) {
>   %1 = shufflevector <4 x i32> %V, <4 x i32> <i32 0, i32 0, i32 0, i32 0>, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
>   ret <4 x i32> %1
> }
> ;;
>
> $ llc -mcpu=corei7-avx -o -
>   vmovq %xmm0, %xmm0
>   retq
>
> $ llc -mcpu=corei7-avx -x86-experimental-vector-shuffle-lowering -o -
>   vpxor %xmm1, %xmm1, %xmm1
>   vpunpcklqdq %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[0]
>   retq
>
> If we know that the upper 64 bits of the destination register are zero, we
> can try to emit a simpler vmovq instead of a vpxor+vpunpcklqdq.
>
> As I said, this is a minor issue. I just wanted to post this finding so that
> we don't forget about it.
>
> Cheers,
> Andrea
>
> On Wed, Oct 1, 2014 at 9:23 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>> On Wed, Oct 1, 2014 at 1:52 AM, Chandler Carruth <chandlerc at google.com> wrote:
>>> This has been added in r218724.
>>
>> Thanks Chandler!
>>
>>> Based on the feedback here and from Quentin, I'm going to email the list
>>> shortly with a heads-up, and then flip the default over to the new shuffle
>>> lowering.
>>
>> Nice.
>> Again, thanks for working on this!
>>
>> -Andrea
>>
>>> On Mon, Sep 29, 2014 at 10:48 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>>>
>>>> Wow. Somehow, I forgot about vbroadcast and vpbroadcast. =[ Sorry about
>>>> that. I'll fix those.
>>>>
>>>> On Fri, Sep 26, 2014 at 3:39 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>>>>>
>>>>> Hi Chandler,
>>>>>
>>>>> Here is another test.
>>>>>
>>>>> When looking at the AVX codegen, I noticed that, when using the new
>>>>> shuffle lowering, we no longer emit a single vbroadcastss in the case
>>>>> where the shuffle performs a splat of a scalar float loaded from memory.
>>>>>
>>>>> For example:
>>>>> (with -mcpu=corei7-avx -x86-experimental-vector-shuffle-lowering)
>>>>>   vmovss (%rdi), %xmm0
>>>>>   vpermilps $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0]
>>>>>
>>>>> Instead of:
>>>>> (with -mcpu=corei7-avx)
>>>>>   vbroadcastss (%rdi), %xmm0
>>>>>
>>>>> I have attached a small reproducible test case for it.
>>>>>
>>>>> Basically, the old shuffle lowering logic calls function
>>>>> 'NormalizeVectorShuffle' to handle shuffles that perform a splat
>>>>> operation. On AVX, 'NormalizeVectorShuffle' tries to lower a splat where
>>>>> the splat value comes from a load into an X86ISD::VBROADCAST dag node.
>>>>> Later on, during instruction selection, we emit a single avx_broadcast
>>>>> for the load+splat sequence (basically, we end up folding the load into
>>>>> the operand of the vbroadcastss).
>>>>>
>>>>> What happens is that the new shuffle lowering doesn't emit a vbroadcast
>>>>> node in this case, and eventually we end up selecting the sequence of
>>>>> vmovss+vpermilps.
>>>>>
>>>>> I hope this helps.
>>>>> Andrea
>>>>>
>>>>> On Tue, Sep 23, 2014 at 10:53 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>>>>>
>>>>>> On Tue, Sep 23, 2014 at 2:35 PM, Simon Pilgrim <llvm-dev at redking.me.uk> wrote:
>>>>>>>
>>>>>>> If you don't want to spend time on this, I'd be happy to create a
>>>>>>> candidate patch for review? I've been unclear if you were taking
>>>>>>> patches for your shuffle work prior to it becoming the default.
>>>>>>
>>>>>> While I'm happy to work on it, I'm even more happy to have patches. =D
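Andrea's attached reproducer for the vbroadcastss case is not shown inline; a minimal sketch of the splat-of-a-loaded-scalar pattern he describes (assumed, not the original attachment) looks like this, and on AVX should lower to a single 'vbroadcastss (%rdi), %xmm0':

;;;
; Load a scalar float and splat it to all four lanes.
define <4 x float> @splat_load(float* %p) {
  %f = load float* %p, align 4
  %ins = insertelement <4 x float> undef, float %f, i32 0
  %splat = shufflevector <4 x float> %ins, <4 x float> undef, <4 x i32> zeroinitializer
  ret <4 x float> %splat
}
;;;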