On Thu, Mar 9, 2017 at 10:54 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> On 03/09/2017 12:14 PM, Wei Mi via llvm-dev wrote:
>>
>> In
>> http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20120827/063200.html,
>> consecutive bitfields are wrapped as a group and represented as a
>> large integer, and we emit the loads, stores, and bit operations
>> needed to extract bits from within it. That fixed the C++11 memory
>> model violations that the original widened bitfield loads/stores
>> caused. It also brings more coalescing opportunities across
>> bitfields.
>>
>> If some bitfields are naturally aligned and their number of bits
>> forms a legal integer type that the target supports, it is more
>> efficient to access them directly than to do a large integer
>> load/store plus a series of bit operations. We call such a reverse
>> transformation legal-type bitfield shrinking. Currently, llvm depends
>> on DAGCombiner to do such shrinking.
>>
>> However, DAGCombiner has the one-basic-block-at-a-time limitation, so
>> we started to implement a new shrinking optimization in
>> https://reviews.llvm.org/D30416. We initially put it in instcombine,
>> then moved it to CGP because we want to use some TargetLowering
>> information.
>>
>> The initial implementation in D30416 only supports load-and-or-store
>> pattern matching, and it uses an inst-by-inst scan as a safety check
>> to make sure there is no other memory write instruction between the
>> load and the store (no alias query is done). After getting the
>> initial implementation, we found more problems: EarlyCSE, LoadPRE,
>> and even InstCombine itself can coalesce these accesses before the
>> shrinking runs (LoadPRE does it the most thoroughly). The coalescing
>> can move the load many basic blocks earlier, which makes the simple
>> inst-by-inst scan unscalable and oftentimes makes it fail. It also
>> breaks the load-and-or-store pattern. An example is below:
>>
>> Before coalescing done by earlycse/loadpre:
>>   %bf.load = load i64, i64* %0, align 8
>>   %bf.clear = and i64 %bf.load, -65536
>>   %bf.set = or i64 %bf.value, %bf.clear
>>   store i64 %bf.set, i64* %0, align 8
>>   .....
>>   %bf.load1 = load i64, i64* %0, align 8
>>   %bf.clear1 = and i64 %bf.load1, -4294901761
>>   %bf.set1 = or i64 %bf.value1, %bf.clear1
>>   store i64 %bf.set1, i64* %0, align 8
>>   .....
>>   %bf.load2 = load i64, i64* %0, align 8
>>   %bf.clear2 = and i64 %bf.load2, -4294901761
>>   %bf.set2 = or i64 %bf.value2, %bf.clear2
>>   store i64 %bf.set2, i64* %0, align 8
>>
>> After coalescing, it will become:
>>   %bf.load = load i64, i64* %0, align 8
>>   %bf.clear = and i64 %bf.load, -65536
>>   %bf.set = or i64 %bf.value, %bf.clear
>>   .....
>>   %bf.clear1 = and i64 %bf.set, -4294901761
>>   %bf.set1 = or i64 %bf.value1, %bf.clear1
>>   .....
>>   %bf.clear2 = and i64 %bf.set1, -4294901761
>>   %bf.set2 = or i64 %bf.value2, %bf.clear2
>>   store i64 %bf.set2, i64* %0, align 8
>>
>> After load-store coalescing, %bf.load is now far away from the store,
>> and the previous load-and-or-store pattern disappears.
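(For concreteness: the masks in the example are consistent with a
bitfield layout like the hypothetical C++ struct below. It is
illustrative only, reconstructed from the mask values rather than taken
from any original test case.)

  #include <cstdint>

  struct S {
    uint64_t f1 : 16; // bits 0-15;  cleared by mask -65536
                      //             (0xFFFFFFFFFFFF0000)
    uint64_t f2 : 16; // bits 16-31; cleared by mask -4294901761
                      //             (0xFFFFFFFF0000FFFF)
    uint64_t f3 : 32; // bits 32-63
  };

  // An assignment "s.f2 = v;" is emitted as: load the whole i64, and
  // it with -4294901761, or in the new bits, store the whole i64.
  // Shrinking would replace that sequence with a single naturally
  // aligned 16-bit store.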
>> A simple idea to fix it is to move the shrinking into a very early
>> pass, before the first run of EarlyCSE. However, if we move the
>> shrinking earlier, it can block the coalescing of other, illegal-type
>> bitfields which cannot be shrunk. So for coalescing and shrinking, no
>> matter which one runs first, it will block the other.
>>
>> After some discussions with Eli and Michael, I came up with another
>> idea that lets the shrinking stay late in the pipeline. It needs two
>> changes to the current patch:
>>
>> 1. Extend the pattern match to handle the
>> store(or(and(or(and(...))))) pattern above. This requires analyzing
>> the bit mask of every and-or pair. If the last and-or pair touches a
>> different section than the other and-or pairs, we can split the
>> original store into two, and do the shrinking for the second store,
>> like below:
>>
>>   %bf.load = load i64, i64* %0, align 8
>>   %bf.clear = and i64 %bf.load, -65536
>>   %bf.set = or i64 %bf.value, %bf.clear
>>   .....
>>
>>   %bf.clear1 = and i64 %bf.set, -4294901761
>>   %bf.set1 = or i64 %bf.value1, %bf.clear1
>>   store i64 %bf.set1, i64* %0, align 8             // the first store
>>   .....
>>
>>   %bf.value2.shrinked = shrink_op %bf.value2
>>   store i16 %bf.value2.shrinked, i64* %0, align 8  // shrunk store
>>
>> 2. Use MemorySSA plus alias queries to do the safety check. Because
>> the dominator tree is not properly updated in CGP, I have to create
>> a new pass and put it before CGP, in order to use MemorySSA.
>>
>> Eli suggested that I ask for more opinions before starting to write
>> the code. I think that is a good idea, and here is the post. Comments
>> are appreciated.
>>
>> Thanks,
>> Wei.
>
> This makes sense to me. Should we just fix CGP to update the DT
> instead of working around it?
>
>  -Hal

I am not familiar enough with CGP to tell. I just noticed that
ModifiedDT is set in several places in CGP, which indicates there are a
few transformations needing their own dominator tree maintenance work.
And I remember SimplifyCFG also doesn't use the dominator tree, to save
the effort of maintaining it on the fly. So maybe it is easier to
separate CGP into two parts: one that needs the dominator tree and one
that doesn't. The transformations that don't need the dominator tree
can stay together.

Currently, I know consthoisting is another potential CGP transformation
that was left out because of its need for the dominator tree. But
consthoisting has already evolved to contain a substantial amount of
code, so it may be better off staying a separate pass.

Thanks,
Wei.
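(To make point 1 concrete, below is a rough sketch of matching the
innermost store(or(and(load P, Mask), Val)) step. It is not the code
from D30416; the helper name is made up, commuted operand forms are not
handled, and the TargetLowering legality query is omitted.)

  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/PatternMatch.h"
  using namespace llvm;
  using namespace llvm::PatternMatch;

  static bool matchShrinkableStore(StoreInst *SI) {
    Value *X, *Val;
    ConstantInt *Mask;
    // Match the stored value against or(and(X, Mask), Val).
    if (!match(SI->getValueOperand(),
               m_Or(m_And(m_Value(X), m_ConstantInt(Mask)),
                    m_Value(Val))))
      return false;
    // X must be a load of the same location the store writes.
    auto *LI = dyn_cast<LoadInst>(X);
    if (!LI || LI->getPointerOperand() != SI->getPointerOperand())
      return false;
    // ~Mask covers the bits actually rewritten; to shrink, they must
    // form one contiguous, naturally aligned group that a legal
    // integer type of the target can cover.
    APInt Written = ~Mask->getValue();
    return Written.isShiftedMask();
  }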
Michael Kuperstein via llvm-dev
2017-Mar-09 23:42 UTC
[llvm-dev] [RFC] bitfield access shrinking
I think it'd be nice to keep CGP as "a grab-bag of relatively simple
transformations that don't require/preserve any complex analyses".

Note that that's not quite where we're at right now - while CGP doesn't
seem to use AA at all, it does have *one* transformation that uses LI
and DT - eliminateMostlyEmptyBlocks. And it constructs those on demand.
:-\ We may want to split that out as well.

(The "ModifiedDT" flag is a bit of a red herring - IIUC, what it really
means is ModifiedCFG. I think the only place that uses it is the
optimizeBlock() iteration - and that probably ought to be using a
worklist instead...)
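(To sketch the worklist-driven iteration alluded to above:
tryOptimizeInst is a hypothetical stand-in for CGP's per-instruction
transforms, and a real version would also have to remove erased
instructions from the worklist.)

  #include "llvm/ADT/SmallVector.h"
  #include "llvm/IR/InstIterator.h"
  using namespace llvm;

  Value *tryOptimizeInst(Instruction *I); // hypothetical hook

  static void runWithWorklist(Function &F) {
    // Seed the worklist once, then reprocess only what changes,
    // instead of re-scanning the whole function whenever a transform
    // fires.
    SmallVector<Instruction *, 32> Worklist;
    for (Instruction &I : instructions(F))
      Worklist.push_back(&I);
    while (!Worklist.empty()) {
      Instruction *I = Worklist.pop_back_val();
      if (Value *V = tryOptimizeInst(I))
        for (User *U : V->users())
          if (auto *UI = dyn_cast<Instruction>(U))
            Worklist.push_back(UI);
    }
  }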
On Thu, Mar 9, 2017 at 1:49 PM, Wei Mi <wmi at google.com> wrote:
> [snip]
On 03/09/2017 05:42 PM, Michael Kuperstein wrote:
> I think it'd be nice to keep CGP as "a grab-bag of relatively simple
> transformations that don't require/preserve any complex analyses".
>
> Note that that's not quite where we're at right now - while CGP
> doesn't seem to use AA at all, it does have *one* transformation that
> uses LI and DT - eliminateMostlyEmptyBlocks. And it constructs those
> on demand. :-\

:(

> We may want to split that out as well.
>
> (The "ModifiedDT" flag is a bit of a red herring - IIUC, what it
> really means is ModifiedCFG. I think the only place that uses it is
> the optimizeBlock() iteration - and that probably ought to be using a
> worklist instead...)

I certainly don't mind breaking things into separate passes so long as
there aren't phase ordering problems. I'll point out, however, that:

- If they're relatively simple transformations, then updating the DT is
probably not that hard (obviously I could have missed something, but I
skimmed the code and this seems to be the case).
- There are certainly things that are simple to do, but only if you
have some proper analysis results. People tend to do hacky things to
work around not having analysis results, and that's not something we
want to encourage.

Thus, I'd still vote for fixing CGP to preserve the DT.

 -Hal
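(A sketch of what the legacy pass manager side of "fixing CGP to
preserve the DT" would involve; this is not real CGP code, just the
plumbing the suggestion implies.)

  #include "llvm/IR/Dominators.h"
  #include "llvm/Pass.h"
  using namespace llvm;

  namespace {
  struct CGPLikePass : public FunctionPass {
    static char ID;
    CGPLikePass() : FunctionPass(ID) {}

    void getAnalysisUsage(AnalysisUsage &AU) const override {
      AU.addRequired<DominatorTreeWrapperPass>();
      AU.addPreserved<DominatorTreeWrapperPass>(); // the key line
    }

    bool runOnFunction(Function &F) override {
      DominatorTree &DT =
          getAnalysis<DominatorTreeWrapperPass>().getDomTree();
      // Each CFG-changing transform would then update DT in place
      // (e.g. DT.insertEdge(From, To) / DT.deleteEdge(From, To)),
      // with DT.recalculate(F) as a blunt fallback, instead of just
      // setting a ModifiedDT flag.
      (void)DT;
      return false;
    }
  };
  }
  char CGPLikePass::ID = 0;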
On 03/09/2017 03:49 PM, Wei Mi wrote:
> [snip]
> Currently, I know consthoisting is another potential CGP
> transformation that was left out because of its need for the
> dominator tree. But consthoisting has already evolved to contain a
> substantial amount of code, so it may be better off staying a
> separate pass.

That might very well be the case here too (the matching logic might
grow in complexity). CGP has now grown to over 5K lines, and there are
definitely pieces that can get split out. There is a big chunk that
runs iteratively, however, so that might not buy us as much as we might
hope.

 -Hal
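(As a rough sense of what that matching logic would feed: once a
shrinkable store has been identified, the rewrite itself is small. A
sketch, with placeholder parameter names, little-endian layout assumed,
and alignment/offset legality checks omitted.)

  #include "llvm/IR/IRBuilder.h"
  using namespace llvm;

  // Replace "store i64 (or (and (load P), Mask), Val), P" with a
  // narrow store of Val's relevant bits. BitOffset is the position of
  // the rewritten field; NarrowTy is the legal type covering it.
  static void shrinkStore(StoreInst *SI, Value *Val, uint64_t BitOffset,
                          IntegerType *NarrowTy) {
    IRBuilder<> B(SI);
    // Move the field's bits down to bit 0 and truncate.
    Value *Shifted = BitOffset ? B.CreateLShr(Val, BitOffset) : Val;
    Value *Narrow = B.CreateTrunc(Shifted, NarrowTy);
    // Rebase the pointer to the field's byte offset.
    Value *I8Ptr =
        B.CreateBitCast(SI->getPointerOperand(), B.getInt8PtrTy());
    Value *FieldPtr = B.CreateConstGEP1_64(I8Ptr, BitOffset / 8);
    B.CreateStore(Narrow,
                  B.CreateBitCast(FieldPtr, NarrowTy->getPointerTo()));
    SI->eraseFromParent();
  }

(For the example earlier in the thread, BitOffset would be 16 and
NarrowTy i16, giving the "store i16" form shown there.)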
Chandler Carruth via llvm-dev
2017-Mar-10 08:26 UTC
[llvm-dev] [RFC] bitfield access shrinking
On Thu, Mar 9, 2017 at 3:55 PM Hal Finkel <hfinkel at anl.gov> wrote:
> [snip]
> That might very well be the case here too (the matching logic might
> grow in complexity). CGP has now grown to over 5K lines, and there
> are definitely pieces that can get split out. There is a big chunk
> that runs iteratively, however, so that might not buy us as much as
> we might hope.

I think a lot of the reason why we used CGP as a grab-bag was that, for
a while, we didn't really have a compelling model for *why* we were
moving out of canonical form. But we've really clarified the pass
pipeline structure now, so I'd love to try to build reasonable
"optimization" (as opposed to "canonicalization" or "simplification")
passes and organize them nicely.

One perhaps especially notable thing: this will often end up
*narrowing*. I think it would be very valuable to try to place this or
other passes that effectively narrow data access types *before the
vectorizers*. This seems important for enabling efficient packing of
element types.

Past that, I generally like the plan of a dedicated pass to narrow
integer stuff across basic blocks, etc., using MemorySSA and other
efficient tools. =]

-Chandler
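(A sketch of the MemorySSA-based safety check such a dedicated pass
could use; the helper name is hypothetical and this formulation is
deliberately conservative.)

  #include "llvm/Analysis/MemorySSA.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // The shrunk store may replace the wide one only if nothing can
  // write the location between the original load LI and the store SI.
  static bool noClobberBetween(MemorySSA &MSSA, LoadInst *LI,
                               StoreInst *SI) {
    MemorySSAWalker *W = MSSA.getWalker();
    // The access defining the bytes LI reads...
    MemoryAccess *DefForLoad =
        W->getClobberingMemoryAccess(MSSA.getMemoryAccess(LI));
    // ...and the nearest access clobbering the same location above SI.
    MemoryAccess *DefAboveStore =
        W->getClobberingMemoryAccess(MSSA.getMemoryAccess(SI));
    // If both walks reach the same access, no intervening write
    // touched those bytes, no matter how many blocks apart LI and SI
    // are.
    return DefForLoad == DefAboveStore;
  }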