Ok, thanks. Are there any plans to reintroduce it at the IR level? I'm not confident this is strictly necessary, but in some cases not having load widening ends up really badly, like the case where the vectorizer tries to do something about it:
https://godbolt.org/z/60RuEw
https://bugs.llvm.org/show_bug.cgi?id=42708

As things stand I'm forced to use memcpy() to express a uint64 load from an array of bytes.

// P.

On Thu, Sep 12, 2019 at 4:22 AM Artur Pilipenko <apilipenko at azul.com> wrote:
> Hi,
>
> Load widening/combine for the original pattern was implemented in the DAG
> combiner. See:
> https://github.com/llvm/llvm-project/blob/23bbeb52f392d88bf0d0527392a7a11561ee09c0/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L6872
>
> Later a similar transform was introduced for stores:
> https://github.com/llvm/llvm-project/blob/23bbeb52f392d88bf0d0527392a7a11561ee09c0/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L6700
>
> There is no load widening at the IR level as of now.
>
> Artur
>
> On 11 Sep 2019, at 04:47, Paweł Bylica <chfast at gmail.com> wrote:
>
> Hi,
>
> Can I ask what the status of load widening is? It seems there is no load
> widening on IR at all.
>
> // Paweł
>
> On Wed, Oct 5, 2016 at 1:49 PM Artur Pilipenko via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Philip and I talked about this in person. Given that load widening in the
>> presence of atomics is an irreversible transformation, we agreed that we
>> don't want to do this early. For now it can be implemented as a peephole
>> optimization over machine IR. MIR is preferred here because the X86
>> backend does GEP reassociation at the MIR level, which can make
>> information about addresses being adjacent available.
>>
>> The inline cost of the original pattern is a valid concern and we might
>> want to revisit our decision later. But in order to do widening earlier we
>> need to have a way to undo this transformation.
>>
>> I'm going to try implementing the MIR optimization, but not in the
>> immediate future.
>>
>> Artur
>>
>> > On 28 Sep 2016, at 19:32, Artur Pilipenko <apilipenko at azulsystems.com>
>> wrote:
>> >
>> > One of the arguments for doing this earlier is the inline cost perception
>> of the original pattern. Reading an i32/i64 by bytes looks much more expensive
>> than it is and can prevent inlining of interesting functions.
>> >
>> > The concern about inhibiting other optimizations can be addressed by
>> careful selection of the pattern we'd like to match. I limit the transformation
>> to the case when all the individual loads have no uses other than forming a
>> wider load. In this case it's less likely to lose information during this
>> transformation.
>> >
>> > I didn't think of the atomicity aspect though.
>> >
>> > Artur
>> >
>> >> On 28 Sep 2016, at 18:50, Philip Reames <listmail at philipreames.com>
>> wrote:
>> >>
>> >> There's a bit of additional context worth adding here...
>> >>
>> >> Up until very recently, we had a form of widening implemented in GVN.
>> We decided to remove this in https://reviews.llvm.org/D24096 precisely
>> because its placement in the pass pipeline was inhibiting other
>> optimizations. There's also a major problem with doing widening at the IR
>> level, which is that widening a pair of atomic loads into a single wider
>> atomic load cannot be undone. This creates a major pass ordering problem
>> of its own.
>> >>
>> >> At this point, my general view is that widening transformations of any
>> kind should be done very late. Ideally, this is something the backend
>> would do, but doing it as a CGP-like fixup pass over the IR is also
>> reasonable.
>> >>
>> >> With that in mind, I feel both the current placement of LoadCombine
>> (within the inliner iteration) and the proposed InstCombine rule are
>> undesirable.
>> >>
>> >> Philip
>> >>
>> >> On 09/28/2016 08:22 AM, Artur Pilipenko wrote:
>> >>> Hi,
>> >>>
>> >>> I'm trying to optimize a pattern like this into a single i16 load:
>> >>>   %1 = bitcast i16* %pData to i8*
>> >>>   %2 = load i8, i8* %1, align 1
>> >>>   %3 = zext i8 %2 to i16
>> >>>   %4 = shl nuw i16 %3, 8
>> >>>   %5 = getelementptr inbounds i8, i8* %1, i16 1
>> >>>   %6 = load i8, i8* %5, align 1
>> >>>   %7 = zext i8 %6 to i16
>> >>>   %8 = shl nuw nsw i16 %7, 0
>> >>>   %9 = or i16 %8, %4
>> >>>
>> >>> I came across the load combine pass, which is motivated by virtually
>> the same pattern. Load combine optimizes the pattern by combining adjacent
>> loads into one load and lets the rest of the optimizer clean up.
>> From what I see in the initial review for load combine
>> (https://reviews.llvm.org/D3580) it was not enabled by default because it
>> caused some performance regressions. That's not very surprising; I can see
>> how this type of widening can obscure some facts for the rest of the optimizer.
>> >>>
>> >>> I can't find any backstory for this pass. Why was it chosen to
>> optimize the pattern in question in this way? What is the current status of
>> this pass?
>> >>>
>> >>> I have an alternative implementation for it locally. I implemented an
>> instcombine rule similar to the bswap/bitreverse idiom recognition. It relies
>> on collectBitParts (Local.cpp) to determine the origin of the bits in a given
>> or value. If all the bits happen to be loaded from adjacent locations, it
>> replaces the or with a single load, or a load plus bswap.
>> >>>
>> >>> If the alternative approach sounds reasonable I'll post my patches
>> for review.
>> >>>
>> >>> Artur
>> >>
>> >
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
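For concreteness, here is a sketch of the kind of C source that typically lowers to the byte-by-byte IR pattern quoted above, next to the memcpy() idiom mentioned in the thread. The function names are illustrative, not from the thread:

```c
#include <stdint.h>
#include <string.h>

/* Big-endian 16-bit read assembled from bytes; without load widening
   this stays as two i8 loads plus zext/shl/or, as in the IR above. */
uint16_t read_be16_bytes(const uint8_t *p) {
    return (uint16_t)(((uint16_t)p[0] << 8) | (uint16_t)p[1]);
}

/* memcpy-based read: already expressed as a single wide native-endian
   load in the IR, so no widening transform is needed. */
uint16_t read_ne16_memcpy(const uint8_t *p) {
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

On a little-endian target the first function is exactly the load-plus-bswap shape that the collectBitParts-based recognition described in the thread would produce.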
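Paweł's uint64 case can be illustrated in the same way. The loop below is a hedged sketch of the byte-assembly shape that, per the godbolt link and PR42708, the vectorizer tries to vectorize instead of forming one wide load; the exact code in the bug report may differ:

```c
#include <stdint.h>
#include <string.h>

/* Little-endian u64 assembled byte by byte; without IR-level load
   widening this loop survives into the vectorizer. */
uint64_t load_le64_bytes(const uint8_t *p) {
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* The memcpy workaround: one 8-byte load in the IR from the start.
   Matches load_le64_bytes only on little-endian hosts. */
uint64_t load_u64_memcpy(const uint8_t *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```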
Implementing load combine/widening at the IR level sounds like a reasonable enhancement to me, although we (Azul) don't have any plans to do that in the near future.

Artur

On 12 Sep 2019, at 00:58, Paweł Bylica <chfast at gmail.com> wrote:
> Ok, thanks. Are there any plans to reintroduce it at the IR level? I'm not
> confident this is strictly necessary, but in some cases not having load
> widening ends up really badly, like the case where the vectorizer tries to
> do something about it:
> https://godbolt.org/z/60RuEw
> https://bugs.llvm.org/show_bug.cgi?id=42708
>
> As things stand I'm forced to use memcpy() to express a uint64 load from
> an array of bytes.
>
> // P.
>
> [...]
If we do load combining at the IR level, one thing we'll need to give some thought to is atomicity. Combining two atomic loads into a wider (legal) atomic load is not a reversible transformation given our current specification. I've been thinking about a concept I've tentatively been calling "element-wise atomicity", which would make this a reversible transform by representing the load as either a vector or struct load on which the atomicity requirement is clearly expressed on the individual elements, not on the vector/struct itself. (We have no way to spell that today.)

Not having load combining be a reversible transformation is really unfortunate. Consider this small example:

  store i32 0, a+4
  load i32, a
  load i32, a+4

If you happen to visit the load pair first, you really want to be able to split them again once you discover the store.

Note that the same theoretical problem exists in MI/SelectionDAG. We just do so much less memory optimization there that we mostly get away with not caring. :)

Philip

On 9/24/2019 4:50 PM, Artur Pilipenko via llvm-dev wrote:
> Implementing load combine/widening at the IR level sounds like a
> reasonable enhancement to me, although we (Azul) don't have any plans
> to do that in the near future.
>
> Artur
>
> [...]
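Philip's example can be modeled concretely. The sketch below (illustrative names; it captures only the value equivalence that splitting must preserve, not the atomicity question) shows that undoing the widening has to reproduce exactly the two narrow accesses, including when a store to a+4 sits between them:

```c
#include <stdint.h>
#include <string.h>

/* The widened form: one 8-byte load covering both i32 locations. */
uint64_t load_wide(const uint8_t *a) {
    uint64_t v;
    memcpy(&v, a, sizeof v);
    return v;
}

/* The split form: two 4-byte loads (a and a+4) recombined through
   memory byte for byte, so the result is endian-independent. */
uint64_t load_split(const uint8_t *a) {
    uint32_t lo, hi;
    uint64_t v;
    memcpy(&lo, a, sizeof lo);
    memcpy(&hi, a + 4, sizeof hi);
    memcpy(&v, &lo, sizeof lo);
    memcpy((uint8_t *)&v + 4, &hi, sizeof hi);
    return v;
}
```

For plain (non-atomic) loads the two forms stay equivalent even after the intervening store; it is the atomic case, where the split form cannot legally be recovered from the wide load, that creates the pass ordering problem Philip describes.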