thr3ads.net - llvm dev - [llvm-dev] MatchLoadCombine(): handling for vectorized loop. [Dec 2018]

If this information is useful, please help other people find it:
Share via:

Jonas Paulsson via llvm-dev

2018-Dec-03 16:20 UTC

[llvm-dev] MatchLoadCombine(): handling for vectorized loop.

Hi,

I have noticed some loops that build a wider element by loading small 
elements, zero-extending them, shifting them (with different amounts) to 
then 'or' them all together. They are either equivalent of a wider load,
or to that of a byte-swapped one.

DAGCombiner::MatchLoadCombine() will combine this to a single wide load, 
but only in the scalar cases of i16, i32 and i64. The result is that 
these loops (I have seen a dozen or so on SPEC) get vectorized with a 
lot of ugly code.

I have begun to experiment with handling the vectorized loop also, and 
would like to know if people think this would be a good idea? Also, am I 
right to assume that it probably should be run before type legalization?

/Jonas

Friedman, Eli via llvm-dev

2018-Dec-03 23:37 UTC

head link

[llvm-dev] MatchLoadCombine(): handling for vectorized loop.

On 12/3/2018 8:20 AM, Jonas Paulsson wrote:> Hi,
>
> I have noticed some loops that build a wider element by loading small 
> elements, zero-extending them, shifting them (with different amounts) 
> to then 'or' them all together. They are either equivalent of a
wider
> load, or to that of a byte-swapped one.
>
> DAGCombiner::MatchLoadCombine() will combine this to a single wide 
> load, but only in the scalar cases of i16, i32 and i64. The result is 
> that these loops (I have seen a dozen or so on SPEC) get vectorized 
> with a lot of ugly code.
>
> I have begun to experiment with handling the vectorized loop also, and 
> would like to know if people think this would be a good idea? Also, am 
> I right to assume that it probably should be run before type 
> legalization?
>You mean, trying to merge some combination of vector loads and shuffles 
into a single vector load in DAGCombine?  That seems sort of late, given 
the cost modeling involved in vectorization.

See also 
http://lists.llvm.org/pipermail/llvm-dev/2018-February/121000.html ?

-Eli

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project

Jonas Paulsson via llvm-dev

2018-Dec-04 09:01 UTC

head link

[llvm-dev] MatchLoadCombine(): handling for vectorized loop.

Hi Eli,

On 2018-12-04 00:37, Friedman, Eli wrote:> On 12/3/2018 8:20 AM, Jonas Paulsson wrote:
>> Hi,
>>
>> I have noticed some loops that build a wider element by loading small 
>> elements, zero-extending them, shifting them (with different amounts) 
>> to then 'or' them all together. They are either equivalent of a
wider
>> load, or to that of a byte-swapped one.
>>
>> DAGCombiner::MatchLoadCombine() will combine this to a single wide 
>> load, but only in the scalar cases of i16, i32 and i64. The result is 
>> that these loops (I have seen a dozen or so on SPEC) get vectorized 
>> with a lot of ugly code.
>>
>> I have begun to experiment with handling the vectorized loop also, 
>> and would like to know if people think this would be a good idea? 
>> Also, am I right to assume that it probably should be run before type 
>> legalization?
>>
> You mean, trying to merge some combination of vector loads and 
> shuffles into a single vector load in DAGCombine?  That seems sort of 
> late, given the cost modeling involved in vectorization.
>What happens specifically is that a scalar loop that is written (most 
likely at source level) like

///  i8 *a = ...
///  i32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] <<
24)

becomes during DAGCombine (this is from the comment in DAGCombiner.cpp:5750)

/// =>
///  i32 val = *((i32)a)

I also wondered why this is done so late but found in the commit message 
(b52af07 / r293036) that there had been a discussion where this had 
intentionally been addressed late: "...We came to the conclusion that we 
want to do this transformation late in the pipeline because in presence 
of atomic loads load widening is irreversible transformation and it 
might hinder other optimizations..."

The loop vectorizer looks at such a loop and thinks the scalar loop 
costs 13 and VF=4 costs 5. The cost for the vectorized loop is about 
right, but the scalar loop becomes much better with the help of 
MatchLoadCombine(): just 2 instructions (load + store). The vector loop 
could also have been just 2 instructions: Vector Load + Vector Store, 
but instead it does Vector Load + vector shuffles, vector zero 
extensions, vector element shifts, vector ors and the Vector Store :-(

It would be very far-fetched to correct the cost for the scalar loop in 
the LoopVectorizer. It seems like a better idea to implement the 
handling for the vectorized loop also in DAGCombiner. As an alternative, 
we could just ignore this I guess...

My patch does a recursive search starting at an ISD::Or node and then 
tries to prove that the sub-DAG with all the ors, shifts and extends is 
equivalent to the original Vector Load, possibly in reversed order. I 
think that's a somewhat simpler problem than what the scalar loop 
handling of MatchLoadCombine() is doing.

> See also 
> http://lists.llvm.org/pipermail/llvm-dev/2018-February/121000.html ?
>+ at Florian:

I think SLP awareness in LoopVectorizer would be great and wonder now 
what happened with this? (It seems however SLP is not able to handle 
this type of loop).

/Jonas

Reasonably Related Threads

Search for more maybe matching threads

llvm dev - Dec 2018 - MatchLoadCombine(): handling for vectorized loop.

[llvm-dev] MatchLoadCombine(): handling for vectorized loop.

[llvm-dev] MatchLoadCombine(): handling for vectorized loop.

[llvm-dev] MatchLoadCombine(): handling for vectorized loop.

Reasonably Related Threads