thr3ads.net - llvm dev - [llvm-dev] Vectorizer has trouble with vpmovmskb and store [Dec 2018]

Craig Topper via llvm-dev

2018-Nov-27 03:00 UTC

[llvm-dev] Vectorizer has trouble with vpmovmskb and store

We should handle this a lot better after r34763

~Craig


On Mon, Nov 26, 2018 at 3:13 PM Craig Topper <craig.topper at gmail.com>
wrote:
> Here's a quick patch that fixes this. I don't know how to avoid it in IR. I
> haven't checked any other tests, but it does fix your case. I'll try to put
> up a real Phabricator review tonight or tomorrow.
>
> diff --git a/lib/Target/X86/X86ISelLowering.cpp b/lib/Target/X86/X86ISelLowering.cpp
> index e31f2a6..d79c0be 100644
> --- a/lib/Target/X86/X86ISelLowering.cpp
> +++ b/lib/Target/X86/X86ISelLowering.cpp
> @@ -4837,6 +4837,11 @@ bool X86TargetLowering::isCheapToSpeculateCtlz() const {
>
>  bool X86TargetLowering::isLoadBitCastBeneficial(EVT LoadVT,
>                                                  EVT BitcastVT) const {
> +  if (!LoadVT.isVector() && BitcastVT.isVector() &&
> +      BitcastVT.getVectorElementType() == MVT::i1 &&
> +      !Subtarget.hasAVX512())
> +    return false;
> +
>    if (!Subtarget.hasDQI() && BitcastVT == MVT::v8i1)
>      return false;
>
>
> ~Craig
>
>
> On Mon, Nov 26, 2018 at 2:51 PM Johan Engelen via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hi all,
>>   I've run into a case where the optimizer seems to be having trouble
>> doing the "obvious" thing.
>>
>> Consider this code:
>> ```
>> define i16 @foo(<8 x i16>* dereferenceable(16) %egress, <16 x i8> %a0) {
>>     %a1 = icmp slt <16 x i8> %a0, zeroinitializer
>>     %a2 = bitcast <16 x i1> %a1 to i16
>>     %astore = getelementptr inbounds <8 x i16>, <8 x i16>* %egress, i64 0, i64 7
>>     ;store i16 %a2, i16* %astore
>>     ret i16 %a2
>> }
>> ```
>> The optimizer recognizes this and llc nicely outputs a vpmovmskb
>> instruction:
>> ```
>> foo: # @foo
>>     vpmovmskb eax, xmm0
>>     ret
>> ```
>>
>> Writing to the output vector also works well:
>> ```
>> define void @writing(<8 x i16>* dereferenceable(16) %egress, <16 x i8> %a0) {
>>     %astore = getelementptr inbounds <8 x i16>, <8 x i16>* %egress, i64 0, i64 7
>>     store i16 123, i16* %astore
>>     ret void
>> }
>> ```
>> outputs:
>> ```
>> writing: # @writing
>>     mov word ptr [rdi + 14], 123
>>     ret
>> ```
>>
>> Now, combining these two by uncommenting the store in `foo()` suddenly
>> results in a very large function, instead of just:
>>     vpmovmskb eax, xmm0
>>     mov word ptr [rdi + 14], ax
>>     ret
>>
>> Is there something wrong with my IR code, or is the optimizer somehow
>> confused? Can I rewrite the code such that the optimizer does understand?
>>
>> Godbolt link: https://llvm.godbolt.org/z/OgExDk
>>
>> Thanks a lot for the help.
>> Cheers,
>>   Johan
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
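
For readers following along, a scalar C++ model of what `foo()` is expected to compute once the store is uncommented may help. This is only a sketch, not code from the thread: the names `sign_mask16` and `foo_model` are made up here, and the loop merely models the `icmp slt` / `bitcast` pair, assuming lane 0 lands in bit 0 as it does on little-endian x86.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: models `icmp slt <16 x i8> %a0, zeroinitializer`
// followed by `bitcast <16 x i1> ... to i16`. Bit i is set when lane i is
// negative, i.e. when the top bit of byte i is set.
inline std::uint16_t sign_mask16(const std::array<std::int8_t, 16>& a0) {
    std::uint16_t mask = 0;
    for (unsigned i = 0; i < 16; ++i) {
        if (a0[i] < 0) {
            mask = static_cast<std::uint16_t>(mask | (1u << i));
        }
    }
    return mask;
}

// Hypothetical model of foo() with the store enabled: write the mask to
// element 7 of the <8 x i16> output (byte offset 14) and also return it.
inline std::uint16_t foo_model(std::array<std::uint16_t, 8>& egress,
                               const std::array<std::int8_t, 16>& a0) {
    const std::uint16_t mask = sign_mask16(a0);
    egress[7] = mask;
    return mask;
}
```

In these terms, the output Johan hopes for is one `vpmovmskb` for `sign_mask16` plus one 16-bit `mov` for the write to `egress[7]`.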

Johan Engelen via llvm-dev

2018-Dec-01 12:38 UTC

[llvm-dev] Vectorizer has trouble with vpmovmskb and store

Hello Craig,
  Thank you for the quick response and fix.
However, the improvement turns out to be quite fragile. If I run `opt` on
the original testcase and feed its output to `llc`, the previous very long
assembly output comes back. (Things work for a bitcast from <16 x i1> to
i16, but not for a store through a <16 x i1>* pointer.)
Godbolt link: https://llvm.godbolt.org/z/j1ob9w

regards,
  Johan




Craig Topper via llvm-dev

2018-Dec-01 19:28 UTC

[llvm-dev] Vectorizer has trouble with vpmovmskb and store

I was afraid of that. I thought I had checked whether InstCombine would
remove the bitcast here, but I guess I didn't, or didn't do it right. I'll
see what I can do to fix this.

~Craig
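
To make the fragility concrete, here is a standalone C++ sketch of the condition the quick patch adds to `isLoadBitCastBeneficial`. The `ValueType` and `Subtarget` structs and the function name `passes_early_checks` are hypothetical stand-ins invented for this sketch, not LLVM's `EVT` or `X86Subtarget`. Only the boolean logic is taken from the diff earlier in the thread, and the rest of the real function is not shown there and is not modelled.

```cpp
// Hypothetical stand-in for the few EVT queries the patch uses.
struct ValueType {
    bool is_vector;         // models EVT::isVector()
    unsigned element_bits;  // element width in bits; 1 models MVT::i1
    unsigned num_elements;  // 1 for scalars
};

// Hypothetical stand-in for the subtarget feature checks.
struct Subtarget {
    bool has_avx512;
    bool has_dqi;
};

// Mirrors the two early-outs visible in the diff. Returns false when one
// of them rejects the fold, true when neither fires.
inline bool passes_early_checks(const ValueType& load_vt,
                                const ValueType& bitcast_vt,
                                const Subtarget& st) {
    // New check from the patch: scalar type paired with a vector of i1,
    // on a subtarget without AVX-512.
    if (!load_vt.is_vector && bitcast_vt.is_vector &&
        bitcast_vt.element_bits == 1 && !st.has_avx512) {
        return false;
    }
    // Pre-existing check from the diff context: v8i1 without DQI.
    if (!st.has_dqi && bitcast_vt.is_vector &&
        bitcast_vt.element_bits == 1 && bitcast_vt.num_elements == 8) {
        return false;
    }
    return true;
}
```

The new check is keyed on a scalar type paired with a vector-of-i1 bitcast. As Johan reports, once `opt` has run and the IR holds a `<16 x i1>*` store instead of a bitcast to i16, the improvement no longer applies; the thread does not spell out the exact code path involved.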


