thr3ads.net - llvm dev - [llvm-dev] Load combine pass [Oct 2016]

Artur Pilipenko via llvm-dev

2016-Sep-28 16:32 UTC

[llvm-dev] Load combine pass

One of the arguments for doing this earlier is the inline cost perception of the
original pattern. Reading an i32/i64 byte by byte looks much more expensive than
it is and can prevent inlining of an interesting function.

The concern about inhibiting other optimizations can be addressed by careful
selection of the pattern we’d like to match. I limit the transformation to the
case where all the individual loads have no uses other than forming a wider
load. In this case it’s less likely that we lose information during the
transformation.

I didn’t think of the atomicity aspect, though.

Artur
> On 28 Sep 2016, at 18:50, Philip Reames <listmail at philipreames.com> wrote:
> 
> There's a bit of additional context worth adding here...
> 
> Up until very recently, we had a form of widening implemented in GVN. We
> decided to remove this in https://reviews.llvm.org/D24096 precisely because its
> placement in the pass pipeline was inhibiting other optimizations. There's
> also a major problem with doing widening at the IR level, which is that widening
> a pair of atomic loads into a single wider atomic load cannot be undone. This
> creates a major pass ordering problem of its own.
> 
> At this point, my general view is that widening transformations of any kind
> should be done very late. Ideally, this is something the backend would do, but
> doing it as a CGP-like fixup pass over the IR is also reasonable.
> 
> With that in mind, I feel both the current placement of LoadCombine (within
> the inliner iteration) and the proposed InstCombine rule are undesirable.
> 
> Philip
> 
> 
> On 09/28/2016 08:22 AM, Artur Pilipenko wrote:
>> Hi,
>> 
>> I'm trying to optimize a pattern like this into a single i16 load:
>>  %1 = bitcast i16* %pData to i8*
>>  %2 = load i8, i8* %1, align 1
>>  %3 = zext i8 %2 to i16
>>  %4 = shl nuw i16 %3, 8
>>  %5 = getelementptr inbounds i8, i8* %1, i16 1
>>  %6 = load i8, i8* %5, align 1
>>  %7 = zext i8 %6 to i16
>>  %8 = shl nuw nsw i16 %7, 0
>>  %9 = or i16 %8, %4
>> 
>> I came across the load combine pass, which is motivated by virtually the
>> same pattern. Load combine optimizes the pattern by combining adjacent loads
>> into one load and lets the rest of the optimizer clean up the rest. From what
>> I see in the initial review for load combine (https://reviews.llvm.org/D3580),
>> it was not enabled by default because it caused some performance regressions.
>> That's not very surprising; I see how this type of widening can obscure some
>> facts for the rest of the optimizer.
>> 
>> I can't find any backstory for this pass. Why was it chosen to optimize the
>> pattern in question in this way? What is the current status of this pass?
>> 
>> I have an alternative implementation for it locally. I implemented an
>> instcombine rule similar to the one that recognises the bswap/bitreverse
>> idiom. It relies on collectBitParts (Local.cpp) to determine the origin of
>> the bits in a given or value. If all the bits happen to be loaded from
>> adjacent locations, it replaces the or with a single load or a load plus
>> bswap.
>> 
>> If the alternative approach sounds reasonable I'll post my patches for
>> review.
>> 
>> Artur
>
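
For readers less used to IR, the quoted pattern is roughly the following at the
source level. This is only a sketch and not code from the thread: the function
names and the runtime endianness probe are made up for illustration. The first
function mirrors the IR (first byte shifted left by 8, second byte in the low
bits); the second is the shape the widening would produce, one 16-bit load plus
a byte swap on a little-endian host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 on a little-endian host, 0 otherwise. */
static int host_is_little_endian(void) {
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Byte-wise form, mirroring the quoted IR: a big-endian 16-bit read
   assembled from two i8 loads, two zero extensions, a shift and an or. */
static uint16_t read_be16_bytewise(const unsigned char *p) {
    assert(p != NULL);
    return (uint16_t)(((uint16_t)p[0] << 8) | (uint16_t)p[1]);
}

/* Widened form: a single 16-bit load, followed by a byte swap when the
   host byte order differs from the order the pattern asks for. */
static uint16_t read_be16_widened(const unsigned char *p) {
    uint16_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v); /* alignment-safe way to express one wide load */
    if (host_is_little_endian())
        v = (uint16_t)((v << 8) | (v >> 8));
    return v;
}
```

Both functions return the same value for any two-byte buffer; the second just
makes the single load explicit.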

Sanjoy Das via llvm-dev

2016-Sep-29 00:23 UTC

[llvm-dev] Load combine pass

Hi Artur,

Artur Pilipenko via llvm-dev wrote:
 > One of the arguments for doing this earlier is the inline cost
 > perception of the original pattern. Reading an i32/i64 byte by byte
 > looks much more expensive than it is and can prevent inlining of an
 > interesting function.

I don't think this is just a perception issue -- if the loads have not
been widened then inlining the containing function _is_ expensive, and
the inliner cost analysis is doing the right thing.

 > The concern about inhibiting other optimizations can be addressed by
 > careful selection of the pattern we’d like to match. I limit the
 > transformation to the case where all the individual loads have no uses
 > other than forming a wider load. In this case it’s less likely that we
 > lose information during the transformation.

I agree -- I think widening loads in "obvious" cases like:

   i16 *a = ...
   i32 val = a[i] | (a[i + 1] << 16)

is more defensible than trying to widen the example that broke in
https://llvm.org/bugs/show_bug.cgi?id=29110.
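
As a rough illustration of why the "obvious" case above is easy to justify,
here is a small C sketch (not from the thread; the function names are invented)
comparing the two-load form with a single 32-bit load over the same four bytes.
On a little-endian host the wide load already yields the combined value; on a
big-endian host the two halves come out swapped and need a rotate.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 on a little-endian host, 0 otherwise. */
static int host_is_little_endian(void) {
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Narrow form: two adjacent 16-bit loads combined with a shift and an or. */
static uint32_t combine_halves(const uint16_t *a, size_t i) {
    return (uint32_t)a[i] | ((uint32_t)a[i + 1] << 16);
}

/* Widened form: one 32-bit load covering a[i] and a[i + 1]. */
static uint32_t load_wide(const uint16_t *a, size_t i) {
    uint32_t v;
    memcpy(&v, &a[i], sizeof v); /* alignment-safe way to express one wide load */
    if (!host_is_little_endian())
        v = (v << 16) | (v >> 16); /* big-endian: swap the 16-bit halves */
    return v;
}
```

Neither narrow load has any use other than feeding the or, which is the
restriction Artur describes.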

Regarding atomicity, the only real optimization that we'll lose (that
I can think of) is PRE.  Additionally, it may be more expensive to
lower wider atomic loads / stores, but that can be indicated by a
target hook.  For instance, on x86, I don't think:

   load atomic i16, i16* %ptr, unordered

is cheaper than

   load atomic i32, i32* %ptr.bitcast, unordered

so from a lowering perspective there is no reason to prefer the former.
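
To make the atomic case concrete, here is a C11 sketch of what the widened form
looks like from the source level. It is an illustration only: C has no
equivalent of LLVM's unordered ordering, so memory_order_relaxed is used as the
closest stand-in, and the type and function names are invented.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical pair type holding the two 16-bit halves of one word. */
typedef struct {
    uint16_t lo;
    uint16_t hi;
} halves_t;

/* One 32-bit atomic load: both halves come from the same snapshot of
   memory, which two separate 16-bit atomic loads would not guarantee. */
static halves_t load_both_halves(_Atomic uint32_t *word) {
    uint32_t w = atomic_load_explicit(word, memory_order_relaxed);
    halves_t h = { (uint16_t)(w & 0xFFFFu), (uint16_t)(w >> 16) };
    return h;
}
```

Once the program is written (or rewritten) this way, the load cannot be split
back into two narrower atomic loads without giving up that single-snapshot
guarantee, which is the irreversibility Philip points out.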

Given this, perhaps scheduling load widening after one pass of GVN/PRE
is fine?

-- Sanjoy


Artur Pilipenko via llvm-dev

2016-Sep-29 17:51 UTC

[llvm-dev] Load combine pass

> On 29 Sep 2016, at 03:23, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:
> 
> Regarding atomicity, the only real optimization that we'll lose (that
> I can think of) is PRE.  Additionally, it may be more expensive to
> lower wider atomic loads / stores, but that can be indicated by a
> target hook.  For instance, on x86, I don't think:
> 
>  load atomic i16, i16* %ptr, unordered
> 
> is cheaper than
> 
>  load atomic i32, i32* %ptr.bitcast, unordered
> 
> so from a lowering perspective there is no reason to prefer the former.

BTW, do we really need to emit an atomic load if all the individual components
are bytes?

Artur


Artur Pilipenko via llvm-dev

2016-Oct-05 11:48 UTC

[llvm-dev] Load combine pass

Philip and I talked about this in person. Given that load widening in the
presence of atomics is an irreversible transformation, we agreed that we don't
want to do this early. For now it can be implemented as a peephole optimization
over machine IR. MIR is preferred here because the X86 backend does GEP
reassociation at the MIR level, which can make the information that addresses
are adjacent available.
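
Whatever level the peephole ends up at, the core check is the same: the byte
loads feeding the or must be at adjacent addresses, and their shift amounts
must match either the target byte order or its reverse. Below is a minimal
sketch of that check in plain C, assuming a little-endian target such as x86.
The byte_part record, the enum and the classify function are hypothetical and
do not correspond to any LLVM API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one byte-sized load feeding the or. */
typedef struct {
    int64_t  offset;     /* byte offset from a common base address */
    unsigned shift_bits; /* left shift applied before the or */
} byte_part;

typedef enum { NOT_COMBINABLE, WIDE_LOAD, WIDE_LOAD_BSWAP } combine_kind;

/* Classify n byte loads, sorted by offset, for a little-endian target. */
static combine_kind classify(const byte_part *parts, size_t n) {
    int little = 1, big = 1;
    if (n != 2 && n != 4 && n != 8)
        return NOT_COMBINABLE; /* only i16, i32 and i64 results */
    for (size_t i = 0; i < n; ++i) {
        if (parts[i].offset != parts[0].offset + (int64_t)i)
            return NOT_COMBINABLE; /* addresses are not adjacent */
        if (parts[i].shift_bits != 8 * i)
            little = 0; /* not in little-endian order */
        if (parts[i].shift_bits != 8 * (n - 1 - i))
            big = 0;    /* not in big-endian order */
    }
    if (little)
        return WIDE_LOAD;       /* a plain wide load is enough */
    if (big)
        return WIDE_LOAD_BSWAP; /* wide load followed by a byte swap */
    return NOT_COMBINABLE;
}
```

The pattern from the start of the thread corresponds to the parts {0, 8} and
{1, 0}, which this sketch classifies as a wide load plus bswap.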

The inline cost of the original pattern is a valid concern and we might want to
revisit our decision later. But in order to do widening earlier we need to have
a way to undo this transformation.

I’m going to try implementing the MIR optimization, but not in the immediate
future.

Artur

Paweł Bylica via llvm-dev

2019-Sep-11 11:47 UTC

[llvm-dev] Load combine pass

Hi,

Can I ask what the status of load widening is? It seems there is no load
widening at the IR level at all.

// Paweł
