thr3ads.net - llvm dev - [llvm-dev] Load combine pass [Sep 2016]

If this information is useful, please help other people find it:
Share via:

Artur Pilipenko via llvm-dev

2016-Sep-28 15:22 UTC

[llvm-dev] Load combine pass

Hi,

I'm trying to optimize a pattern like this into a single i16 load:
  %1 = bitcast i16* %pData to i8*
  %2 = load i8, i8* %1, align 1
  %3 = zext i8 %2 to i16
  %4 = shl nuw i16 %3, 8
  %5 = getelementptr inbounds i8, i8* %1, i16 1
  %6 = load i8, i8* %5, align 1
  %7 = zext i8 %6 to i16
  %8 = shl nuw nsw i16 %7, 0
  %9 = or i16 %8, %4

I came across load combine pass which is motivated by virtualliy the same
pattern. Load combine optimizes the pattern by combining adjacent loads into one
load and lets the rest of the optimizer cleanup the rest. From what I see on the
initial review for load combine (https://reviews.llvm.org/D3580) it was not
enabled by default because it caused some performance regressions. It's not
very surprising, I see how this type of widening can obscure some facts for the
rest of the optimizer.

I can't find any backstory for this pass, why was it chosen to optimize the
pattern in question in this way? What is the current status of this pass?

I have an alternative implementation for it locally. I implemented an
instcombine rule similar to recognise bswap/bitreverse idiom. It relies on
collectBitParts (Local.cpp) to determine the origin of the bits in a given or
value. If all the bits are happen to be loaded from adjacent locations it
replaces the or with a single load or a load plus bswap.

If the alternative approach sounds reasonable I'll post my patches for
review.

Artur
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160928/e2a12676/attachment.html>

Philip Reames via llvm-dev

2016-Sep-28 15:50 UTC

head link

[llvm-dev] Load combine pass

There's a bit of additional context worth adding here...

Up until very recently, we had a form of widening implemented in GVN.  
We decided to remove this in https://reviews.llvm.org/D24096 precisely 
because its placement in the pass pipeline was inhibiting other 
optimizations. There's also a major problem with doing widening at the 
IR level which is that widening a pair of atomic loads into a single 
wider atomic load can not be undone. This creates a major pass ordering 
problem of its own.

At this point, my general view is that widening transformations of any 
kind should be done very late.  Ideally, this is something the backend 
would do, but doing it as a CGP like fixup pass over the IR is also 
reasonable.

With that in mind, I feel both the current placement of LoadCombine 
(within the inliner iteration) and the proposed InstCombine rule are 
undesirable.

Philip

On 09/28/2016 08:22 AM, Artur Pilipenko wrote:> Hi,
>
> I'm trying to optimize a pattern like this into a single i16 load:
>   %1 = bitcast i16* %pData to i8*
>   %2 = load i8, i8* %1, align 1
>   %3 = zext i8 %2 to i16
>   %4 = shl nuw i16 %3, 8
>   %5 = getelementptr inbounds i8, i8* %1, i16 1
>   %6 = load i8, i8* %5, align 1
>   %7 = zext i8 %6 to i16
>   %8 = shl nuw nsw i16 %7, 0
>   %9 = or i16 %8, %4
>
> I came across load combine pass which is motivated by virtualliy the 
> same pattern. Load combine optimizes the pattern by combining adjacent 
> loads into one load and lets the rest of the optimizer cleanup the 
> rest. From what I see on the initial review for load combine 
> (https://reviews.llvm.org/D3580) it was not enabled by default because 
> it caused some performance regressions. It's not very surprising, I 
> see how this type of widening can obscure some facts for the rest of 
> the optimizer.
>
> I can't find any backstory for this pass, why was it chosen to 
> optimize the pattern in question in this way? What is the current 
> status of this pass?
>
> I have an alternative implementation for it locally. I implemented an 
> instcombine rule similar to recognise bswap/bitreverse idiom. It 
> relies on collectBitParts (Local.cpp) to determine the origin of the 
> bits in a given or value. If all the bits are happen to be loaded from 
> adjacent locations it replaces the or with a single load or a load 
> plus bswap.
>
> If the alternative approach sounds reasonable I'll post my patches for 
> review.
>
> Artur

David Chisnall via llvm-dev

2016-Sep-28 15:54 UTC

head link

[llvm-dev] Load combine pass

On 28 Sep 2016, at 16:50, Philip Reames via llvm-dev <llvm-dev at
lists.llvm.org> wrote:> 
> At this point, my general view is that widening transformations of any kind
should be done very late.  Ideally, this is something the backend would do, but
doing it as a CGP like fixup pass over the IR is also reasonable.
I’m really glad to see that this is gone in GVN - it will reduce our diffs a lot
when we do the next import.  The GVN load widening is not sound in the presence
of hardware-enforced spacial memory safety, so we ended up with the compiler
inserting things that caused hardware bounds checks to fail and had to disable
it in a few places.

If you’re reintroducing it, please can we have a backend option to specify
whether it’s valid to widen loads beyond the extents (for example, for us it is
not valid to widen an i16 load followed by an i8 load to an i32 load).  Ideally,
we’d also want the back end to supply some cost function: on MIPS, even when
this is valid, the i32 load followed by the masks and shifts is a lot more
expensive than the two loads (and increases register pressure, so has knock-on
effects that mean that turning off this bit of GVN gave us *much* better code,
even without any of the later optimisation opportunities being exploited).

David

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 3719 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160928/b3750c85/attachment.bin>

Artur Pilipenko via llvm-dev

2016-Sep-28 16:32 UTC

head link

[llvm-dev] Load combine pass

One of the arguments for doing this earlier is inline cost perception of the
original pattern. Reading i32/i64 by bytes look much more expensive than it is
and can prevent inlining of interesting function.

Inhibiting other optimizations concern can be addressed by careful selection of
the pattern we’d like to match. I limit the transformation to the case when all
the individual have no uses other than forming a wider load. In this case it’s
less likely to loose information during this transformation.

I didn’t think of atomicity aspect though.

Artur
> On 28 Sep 2016, at 18:50, Philip Reames <listmail at
philipreames.com> wrote:
> 
> There's a bit of additional context worth adding here...
> 
> Up until very recently, we had a form of widening implemented in GVN.  We
decided to remove this in https://reviews.llvm.org/D24096 precisely because its
placement in the pass pipeline was inhibiting other optimizations. There's
also a major problem with doing widening at the IR level which is that widening
a pair of atomic loads into a single wider atomic load can not be undone. This
creates a major pass ordering problem of its own.
> 
> At this point, my general view is that widening transformations of any kind
should be done very late.  Ideally, this is something the backend would do, but
doing it as a CGP like fixup pass over the IR is also reasonable.
> 
> With that in mind, I feel both the current placement of LoadCombine (within
the inliner iteration) and the proposed InstCombine rule are undesirable.
> 
> Philip
> 
> 
> On 09/28/2016 08:22 AM, Artur Pilipenko wrote:
>> Hi,
>> 
>> I'm trying to optimize a pattern like this into a single i16 load:
>>  %1 = bitcast i16* %pData to i8*
>>  %2 = load i8, i8* %1, align 1
>>  %3 = zext i8 %2 to i16
>>  %4 = shl nuw i16 %3, 8
>>  %5 = getelementptr inbounds i8, i8* %1, i16 1
>>  %6 = load i8, i8* %5, align 1
>>  %7 = zext i8 %6 to i16
>>  %8 = shl nuw nsw i16 %7, 0
>>  %9 = or i16 %8, %4
>> 
>> I came across load combine pass which is motivated by virtualliy the
same pattern. Load combine optimizes the pattern by combining adjacent loads
into one load and lets the rest of the optimizer cleanup the rest. From what I
see on the initial review for load combine (https://reviews.llvm.org/D3580) it
was not enabled by default because it caused some performance regressions.
It's not very surprising, I see how this type of widening can obscure some
facts for the rest of the optimizer.
>> 
>> I can't find any backstory for this pass, why was it chosen to
optimize the pattern in question in this way? What is the current status of this
pass?
>> 
>> I have an alternative implementation for it locally. I implemented an
instcombine rule similar to recognise bswap/bitreverse idiom. It relies on
collectBitParts (Local.cpp) to determine the origin of the bits in a given or
value. If all the bits are happen to be loaded from adjacent locations it
replaces the or with a single load or a load plus bswap.
>> 
>> If the alternative approach sounds reasonable I'll post my patches
for review.
>> 
>> Artur
>

Apparently Analagous Threads

Search for more maybe matching threads

llvm dev - Sep 2016 - Load combine pass

[llvm-dev] Load combine pass

[llvm-dev] Load combine pass

[llvm-dev] Load combine pass

[llvm-dev] Load combine pass

Apparently Analagous Threads