thr3ads.net - llvm dev - [llvm-dev] Opportunity to split store of shuffled vector. [Oct 2019]

If this information is useful, please help other people find it:
Share via:

Sanjay Patel via llvm-dev

2019-Oct-04 12:35 UTC

[llvm-dev] Opportunity to split store of shuffled vector.

Canonicalizing to a masked store intrinsic is possible, but we might have
to expand back to the load/shuffle/store sequence for targets that don't
support masked store. And then you'd likely have to do the store splitting
that I think is being requested for the original pattern anyway.

But I'd like to step back to the premise - "LLVM seems to always
generate
vector instructions for vector operations in C, even it's just simple
stores."
That's not correct. It's not LLVM that created the vector memory
accesses.
That's how the IR begins from clang:

define void @foo(<4 x float>* %a) #0 {
entry:
  %a.addr = alloca <4 x float>*, align 8
  store <4 x float>* %a, <4 x float>** %a.addr, align 8, !tbaa !3
  %0 = load <4 x float>*, <4 x float>** %a.addr, align 8, !tbaa !3
  %1 = load <4 x float>, <4 x float>* %0, align 16
  %vecins = insertelement <4 x float> %1, float 1.000000e+00, i32 0
  store <4 x float> %vecins, <4 x float>* %0, align 16
  %2 = load <4 x float>*, <4 x float>** %a.addr, align 8, !tbaa !3
  %3 = load <4 x float>, <4 x float>* %2, align 16
  %vecins1 = insertelement <4 x float> %3, float 2.000000e+00, i32 3
  store <4 x float> %vecins1, <4 x float>* %2, align 16
  ret void
}

Should this have been translated to GEP+scalar stores by clang rather than
vector load+store?


On Thu, Oct 3, 2019 at 12:33 PM Nemanja Ivanovic via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Just out of curiosity,
> would it perhaps make sense to canonicalize this to a masked store?
>
> Nemanja
>
> On Thu, Sep 26, 2019 at 10:59 PM Qiu Chaofan via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> > I may be missing something obvious, but what is `vector` defined
as
>> here? Can you provide a buildable example?
>>
>> Sorry, I should provide a cross-platform version using vector
>> extension of frontend :) `vector int` is a vector extension on
>> PowerPC, which is enabled if you set target to PowerPC platforms.
>> Example below should be successfully compiled in any platform:
>>
>>     typedef float v4sf __attribute__ ((vector_size(16)));
>>
>>     void foo(v4sf *a) {
>>       (*a)[0] = 1;
>>       (*a)[3] = 2;
>>     }
>>
>> And we can get the IR mentioned before:
>>
>>     %0 = load <4 x float>, <4 x float>* %a, align 16
>>     %vecins1 = shufflevector <4 x float> <float 1.000000e+00,
float
>> undef, float undef, float 2.000000e+00>, <4 x float> %0, <4
x i32>
>> <i32 0, i32 5, i32 6, i32 3>
>>     store <4 x float> %vecins1, <4 x float>* %a, align 16
>>
>> Regards,
>> Qiu Chaofan
>>
>>
>> Florian Hahn <florian_hahn at apple.com> 于2019年9月26日周四 下午7:15写道：
>> >
>> > Hi
>> >
>> > > On Sep 26, 2019, at 10:53, Qiu Chaofan via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>> > >
>> > > Hi there,
>> > >
>> > > I notice that LLVM seems to always generate vector
instructions for
>> > > vector operations in C, even it's just simple stores:
>> > >
>> > > void foo(vector int* c) {
>> > >  (*c)[0] = 1;
>> > >  (*c)[1] = 2;
>> > > }
>> > >
>> >
>> > I may be missing something obvious, but what is `vector` defined
as
>> here? Can you provide a buildable example?
>> >
>> > > %0 = load <4 x i32>, <4 x i32>* %c, align 16
>> > > %vecins1 = shufflevector <4 x i32> <i32 1, i32 2,
i32 undef, i32
>> > > undef>, <4 x i32> %0, <4 x i32> <i32 0, i32
1, i32 6, i32 7>
>> > > store <4 x i32> %vecins1, <4 x i32>* %c, align 16
>> > >
>> >
>> > For some reason, we load 4 elements from %c and write the last 2
>> elements back unchanged. This causes sub-optimal codegen here. We could
do
>> a better job at dropping the writes of unchanged elements. But from the
>> original code, it is not immediately obvious to me why we generate them
in
>> the first place. Maybe we could avoid generating them?
>> >
>> > Cheers,
>> > Florian
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20191004/1ac25391/attachment.html>

Qiu Chaofan via llvm-dev

2019-Oct-08 17:58 UTC

head link

[llvm-dev] Opportunity to split store of shuffled vector.

> would it perhaps make sense to canonicalize this to a masked store?
Thanks for the reminder. It should help. There should also be
something missing/limited in `InstCombineVectorOps`.
> Should this have been translated to GEP+scalar stores by clang rather than
vector load+store?
Thanks for your comment! Implementing this in Clang (at `CGExpr`) is a
reasonable way. But it may break some existing assumptions/assertions,
and won't benefit other language.

In the same time, I think, if we want to do this in `InstCombine` or
other backend transformations, it's important to build a cost model
(like in SLP vectorizer). Otherwise, a series of STOREs split first
and then combined is meaningless. Can I use existing infrastructure in
`TargetTransformInfo` to determine if splitting them is better?

Regards,
Chaofan

Sanjay Patel <spatel at rotateright.com> 于2019年10月4日周五
下午8:36写道：>
> Canonicalizing to a masked store intrinsic is possible, but we might have
to expand back to the load/shuffle/store sequence for targets that don't
support masked store. And then you'd likely have to do the store splitting
that I think is being requested for the original pattern anyway.
>
> But I'd like to step back to the premise - "LLVM seems to always
generate vector instructions for vector operations in C, even it's just
simple stores."
> That's not correct. It's not LLVM that created the vector memory
accesses. That's how the IR begins from clang:
>
> define void @foo(<4 x float>* %a) #0 {
> entry:
>   %a.addr = alloca <4 x float>*, align 8
>   store <4 x float>* %a, <4 x float>** %a.addr, align 8, !tbaa
!3
>   %0 = load <4 x float>*, <4 x float>** %a.addr, align 8, !tbaa
!3
>   %1 = load <4 x float>, <4 x float>* %0, align 16
>   %vecins = insertelement <4 x float> %1, float 1.000000e+00, i32 0
>   store <4 x float> %vecins, <4 x float>* %0, align 16
>   %2 = load <4 x float>*, <4 x float>** %a.addr, align 8, !tbaa
!3
>   %3 = load <4 x float>, <4 x float>* %2, align 16
>   %vecins1 = insertelement <4 x float> %3, float 2.000000e+00, i32 3
>   store <4 x float> %vecins1, <4 x float>* %2, align 16
>   ret void
> }
>
> Should this have been translated to GEP+scalar stores by clang rather than
vector load+store?
>
>
> On Thu, Oct 3, 2019 at 12:33 PM Nemanja Ivanovic via llvm-dev <llvm-dev
at lists.llvm.org> wrote:
>>
>> Just out of curiosity,
>> would it perhaps make sense to canonicalize this to a masked store?
>>
>> Nemanja
>>
>> On Thu, Sep 26, 2019 at 10:59 PM Qiu Chaofan via llvm-dev <llvm-dev
at lists.llvm.org> wrote:
>>>
>>> > I may be missing something obvious, but what is `vector`
defined as here? Can you provide a buildable example?
>>>
>>> Sorry, I should provide a cross-platform version using vector
>>> extension of frontend :) `vector int` is a vector extension on
>>> PowerPC, which is enabled if you set target to PowerPC platforms.
>>> Example below should be successfully compiled in any platform:
>>>
>>>     typedef float v4sf __attribute__ ((vector_size(16)));
>>>
>>>     void foo(v4sf *a) {
>>>       (*a)[0] = 1;
>>>       (*a)[3] = 2;
>>>     }
>>>
>>> And we can get the IR mentioned before:
>>>
>>>     %0 = load <4 x float>, <4 x float>* %a, align 16
>>>     %vecins1 = shufflevector <4 x float> <float
1.000000e+00, float
>>> undef, float undef, float 2.000000e+00>, <4 x float> %0,
<4 x i32>
>>> <i32 0, i32 5, i32 6, i32 3>
>>>     store <4 x float> %vecins1, <4 x float>* %a, align
16
>>>
>>> Regards,
>>> Qiu Chaofan
>>>
>>>
>>> Florian Hahn <florian_hahn at apple.com> 于2019年9月26日周四
下午7:15写道：
>>> >
>>> > Hi
>>> >
>>> > > On Sep 26, 2019, at 10:53, Qiu Chaofan via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>>> > >
>>> > > Hi there,
>>> > >
>>> > > I notice that LLVM seems to always generate vector
instructions for
>>> > > vector operations in C, even it's just simple stores:
>>> > >
>>> > > void foo(vector int* c) {
>>> > >  (*c)[0] = 1;
>>> > >  (*c)[1] = 2;
>>> > > }
>>> > >
>>> >
>>> > I may be missing something obvious, but what is `vector`
defined as here? Can you provide a buildable example?
>>> >
>>> > > %0 = load <4 x i32>, <4 x i32>* %c, align 16
>>> > > %vecins1 = shufflevector <4 x i32> <i32 1, i32
2, i32 undef, i32
>>> > > undef>, <4 x i32> %0, <4 x i32> <i32 0,
i32 1, i32 6, i32 7>
>>> > > store <4 x i32> %vecins1, <4 x i32>* %c,
align 16
>>> > >
>>> >
>>> > For some reason, we load 4 elements from %c and write the last
2 elements back unchanged. This causes sub-optimal codegen here. We could do a
better job at dropping the writes of unchanged elements. But from the original
code, it is not immediately obvious to me why we generate them in the first
place. Maybe we could avoid generating them?
>>> >
>>> > Cheers,
>>> > Florian
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Sanjay Patel via llvm-dev

2019-Oct-08 19:35 UTC

head link

[llvm-dev] Opportunity to split store of shuffled vector.

InstCombine is not intended to use a cost model, so I don't think that is
an option for store splitting.
This should probably be implemented as a pattern match starting from
DAGCombiner::visitSTORE() and using a TLI hook. The TLI function would
include parameters for the vector value type and a number to specify the
maximum scalar stores the target is willing to create to replace the vector
load+store. So for example, we could decide it is profitable to replace
load+store of v4f32 with two f32 stores, but it's not profitable to replace
load+store of v16i8 with fifteen i8 stores.

On Tue, Oct 8, 2019 at 1:58 PM Qiu Chaofan <fwage73 at gmail.com> wrote:
> > would it perhaps make sense to canonicalize this to a masked store?
>
> Thanks for the reminder. It should help. There should also be
> something missing/limited in `InstCombineVectorOps`.
>
> > Should this have been translated to GEP+scalar stores by clang rather
> than vector load+store?
>
> Thanks for your comment! Implementing this in Clang (at `CGExpr`) is a
> reasonable way. But it may break some existing assumptions/assertions,
> and won't benefit other language.
>
> In the same time, I think, if we want to do this in `InstCombine` or
> other backend transformations, it's important to build a cost model
> (like in SLP vectorizer). Otherwise, a series of STOREs split first
> and then combined is meaningless. Can I use existing infrastructure in
> `TargetTransformInfo` to determine if splitting them is better?
>
> Regards,
> Chaofan
>
> Sanjay Patel <spatel at rotateright.com> 于2019年10月4日周五 下午8:36写道：
> >
> > Canonicalizing to a masked store intrinsic is possible, but we might
> have to expand back to the load/shuffle/store sequence for targets that
> don't support masked store. And then you'd likely have to do the
store
> splitting that I think is being requested for the original pattern anyway.
> >
> > But I'd like to step back to the premise - "LLVM seems to
always
> generate vector instructions for vector operations in C, even it's just
> simple stores."
> > That's not correct. It's not LLVM that created the vector
memory
> accesses. That's how the IR begins from clang:
> >
> > define void @foo(<4 x float>* %a) #0 {
> > entry:
> >   %a.addr = alloca <4 x float>*, align 8
> >   store <4 x float>* %a, <4 x float>** %a.addr, align 8,
!tbaa !3
> >   %0 = load <4 x float>*, <4 x float>** %a.addr, align 8,
!tbaa !3
> >   %1 = load <4 x float>, <4 x float>* %0, align 16
> >   %vecins = insertelement <4 x float> %1, float 1.000000e+00,
i32 0
> >   store <4 x float> %vecins, <4 x float>* %0, align 16
> >   %2 = load <4 x float>*, <4 x float>** %a.addr, align 8,
!tbaa !3
> >   %3 = load <4 x float>, <4 x float>* %2, align 16
> >   %vecins1 = insertelement <4 x float> %3, float 2.000000e+00,
i32 3
> >   store <4 x float> %vecins1, <4 x float>* %2, align 16
> >   ret void
> > }
> >
> > Should this have been translated to GEP+scalar stores by clang rather
> than vector load+store?
> >
> >
> > On Thu, Oct 3, 2019 at 12:33 PM Nemanja Ivanovic via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> >>
> >> Just out of curiosity,
> >> would it perhaps make sense to canonicalize this to a masked
store?
> >>
> >> Nemanja
> >>
> >> On Thu, Sep 26, 2019 at 10:59 PM Qiu Chaofan via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> >>>
> >>> > I may be missing something obvious, but what is `vector`
defined as
> here? Can you provide a buildable example?
> >>>
> >>> Sorry, I should provide a cross-platform version using vector
> >>> extension of frontend :) `vector int` is a vector extension on
> >>> PowerPC, which is enabled if you set target to PowerPC
platforms.
> >>> Example below should be successfully compiled in any platform:
> >>>
> >>>     typedef float v4sf __attribute__ ((vector_size(16)));
> >>>
> >>>     void foo(v4sf *a) {
> >>>       (*a)[0] = 1;
> >>>       (*a)[3] = 2;
> >>>     }
> >>>
> >>> And we can get the IR mentioned before:
> >>>
> >>>     %0 = load <4 x float>, <4 x float>* %a, align
16
> >>>     %vecins1 = shufflevector <4 x float> <float
1.000000e+00, float
> >>> undef, float undef, float 2.000000e+00>, <4 x float>
%0, <4 x i32>
> >>> <i32 0, i32 5, i32 6, i32 3>
> >>>     store <4 x float> %vecins1, <4 x float>* %a,
align 16
> >>>
> >>> Regards,
> >>> Qiu Chaofan
> >>>
> >>>
> >>> Florian Hahn <florian_hahn at apple.com> 于2019年9月26日周四
下午7:15写道：
> >>> >
> >>> > Hi
> >>> >
> >>> > > On Sep 26, 2019, at 10:53, Qiu Chaofan via llvm-dev
<
> llvm-dev at lists.llvm.org> wrote:
> >>> > >
> >>> > > Hi there,
> >>> > >
> >>> > > I notice that LLVM seems to always generate vector
instructions for
> >>> > > vector operations in C, even it's just simple
stores:
> >>> > >
> >>> > > void foo(vector int* c) {
> >>> > >  (*c)[0] = 1;
> >>> > >  (*c)[1] = 2;
> >>> > > }
> >>> > >
> >>> >
> >>> > I may be missing something obvious, but what is `vector`
defined as
> here? Can you provide a buildable example?
> >>> >
> >>> > > %0 = load <4 x i32>, <4 x i32>* %c,
align 16
> >>> > > %vecins1 = shufflevector <4 x i32> <i32 1,
i32 2, i32 undef, i32
> >>> > > undef>, <4 x i32> %0, <4 x i32>
<i32 0, i32 1, i32 6, i32 7>
> >>> > > store <4 x i32> %vecins1, <4 x i32>* %c,
align 16
> >>> > >
> >>> >
> >>> > For some reason, we load 4 elements from %c and write the
last 2
> elements back unchanged. This causes sub-optimal codegen here. We could do
> a better job at dropping the writes of unchanged elements. But from the
> original code, it is not immediately obvious to me why we generate them in
> the first place. Maybe we could avoid generating them?
> >>> >
> >>> > Cheers,
> >>> > Florian
> >>> _______________________________________________
> >>> LLVM Developers mailing list
> >>> llvm-dev at lists.llvm.org
> >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >>
> >> _______________________________________________
> >> LLVM Developers mailing list
> >> llvm-dev at lists.llvm.org
> >> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20191008/2b780296/attachment.html>

llvm dev - Oct 2019 - Opportunity to split store of shuffled vector.

[llvm-dev] Opportunity to split store of shuffled vector.

[llvm-dev] Opportunity to split store of shuffled vector.

[llvm-dev] Opportunity to split store of shuffled vector.