On 03/09/2017 12:28 PM, Krzysztof Parzyszek via llvm-dev wrote:
> We could add intrinsics to extract/insert a bitfield, which would
> simplify a lot of that bitwise logic.

But then you need to teach a bunch of places about how to simplify them,
fold using bitwise logic and other things that reduce demanded bits into
them, etc. This seems like a difficult tradeoff.

 -Hal

>
> -Krzysztof
>
>
> On 3/9/2017 12:14 PM, Wei Mi via llvm-dev wrote:
>> In
>> http://lists.llvm.org/pipermail/cfe-commits/Week-of-Mon-20120827/063200.html,
>> consecutive bitfields are wrapped as a group and represented as a
>> large integer, and loads, stores, and bit operations appropriate for
>> extracting bits from within it are emitted. This fixes the problem of
>> violating the C++11 memory model that the original widened bitfield
>> loads/stores were facing. It also brings more coalescing opportunities
>> across bitfields.
>>
>> If some bitfields are naturally aligned and their number of bits forms
>> a legal integer type that the target supports, it is more efficient to
>> access them directly than to do a large integer load/store plus a
>> series of bit operations. We call such a reverse transformation
>> legal-type bitfield shrinking. Currently, llvm depends on DAGCombiner
>> to do such shrinking.
>>
>> However, DAGCombiner has the one-basic-block-at-a-time limitation, so
>> we started to implement a new shrinking optimization in
>> https://reviews.llvm.org/D30416. We initially put it in instcombine,
>> then moved it to CGP because we want to use some TargetLowering
>> information.
>>
>> The initial implementation in D30416 only supports load-and-or-store
>> pattern matching, and it uses an inst-by-inst scan as a safety check
>> to make sure there is no other memory-write instruction between the
>> load and the store (no alias query is done). After getting the initial
>> implementation, we found more problems: EarlyCSE, LoadPRE and even
>> InstCombine itself can do coalescing before the shrinking (LoadPRE
>> does it the most thoroughly). The coalescing can move the load many
>> BasicBlocks earlier, which makes the simple inst-by-inst scan
>> unscalable and oftentimes makes it fail. It also breaks the
>> load-and-or-store pattern. An example is below:
>>
>> Before coalescing done by earlycse/loadpre:
>> %bf.load = load i64, i64* %0, align 8
>> %bf.clear = and i64 %bf.load, -65536
>> %bf.set = or i64 %bf.value, %bf.clear
>> store i64 %bf.set, i64* %0, align 8
>> .....
>> %bf.load1 = load i64, i64* %0, align 8
>> %bf.clear1 = and i64 %bf.load1, -4294901761
>> %bf.set1 = or i64 %bf.value1, %bf.clear1
>> store i64 %bf.set1, i64* %0, align 8
>> .....
>> %bf.load2 = load i64, i64* %0, align 8
>> %bf.clear2 = and i64 %bf.load2, -4294901761
>> %bf.set2 = or i64 %bf.value2, %bf.clear2
>> store i64 %bf.set2, i64* %0, align 8
>>
>> After coalescing, it will become:
>> %bf.load = load i64, i64* %0, align 8
>> %bf.clear = and i64 %bf.load, -65536
>> %bf.set = or i64 %bf.value, %bf.clear
>> .....
>> %bf.clear1 = and i64 %bf.set, -4294901761
>> %bf.set1 = or i64 %bf.value1, %bf.clear1
>> .....
>> %bf.clear2 = and i64 %bf.set1, -4294901761
>> %bf.set2 = or i64 %bf.value2, %bf.clear2
>> store i64 %bf.set2, i64* %0, align 8
>>
>> After load-store coalescing, %bf.load is now far away from the store,
>> and the previous load-and-or-store pattern disappears.
>>
>> A simple idea to fix it is to move the shrinking into a very early
>> pass, before the first pass of EarlyCSE. However, if we move the
>> shrinking earlier, it is possible to block the coalescing of other
>> illegal-type bitfields which cannot be shrunk. So for coalescing and
>> shrinking, whichever one runs first will block the other.
>>
>> After some discussions with Eli and Michael, I came up with another
>> idea to let the shrinking stay late in the pipeline. It needs two
>> changes to the current patch:
>>
>> 1. Extend the pattern matching to handle the store(or(and(or(and...))
>> pattern above. It needs to analyze the bit mask of every and-or pair.
>> If the last and-or pair touches a different section than the other
>> and-or pairs, we can split the original store into two and do the
>> shrinking for the second store, like below:
>>
>> %bf.load = load i64, i64* %0, align 8
>> %bf.clear = and i64 %bf.load, -65536
>> %bf.set = or i64 %bf.value, %bf.clear
>> .....
>>
>> %bf.clear1 = and i64 %bf.set, -4294901761
>> %bf.set1 = or i64 %bf.value1, %bf.clear1
>> store i64 %bf.set1, i64* %0, align 8              // the first store.
>> .....
>>
>> %bf.value2.shrinked = shrink_op %bf.value2
>> store i16 %bf.value2.shrinked, i64* %0, align 8   // the shrunk store.
>>
>> 2. Use MemorySSA plus alias queries to do the safety check. Because
>> the dominator tree is not properly updated in CGP, I have to create a
>> new pass and put it before CGP in order to use MemorySSA.
>>
>> Eli suggested that I ask for more opinions before starting to write
>> code. I think it is a good idea, so here is the post. Comments are
>> appreciated.
>>
>> Thanks,
>> Wei.

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
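To make the shrinking concrete: under the assumption that %bf.value2 already holds its field shifted into bits 16-31 (the mask -4294901761 is 0xFFFFFFFF0000FFFF, i.e. it clears a naturally aligned 16-bit slot at byte offset 2 on a little-endian target), a minimal sketch of what the shrunk second store could look like is:

    %addr.i16 = bitcast i64* %0 to i16*
    %addr.fld = getelementptr i16, i16* %addr.i16, i64 1   ; byte offset 2 == bits 16-31 (little endian)
    %fld.shr  = lshr i64 %bf.value2, 16                    ; move the field down to bit 0
    %fld      = trunc i64 %fld.shr to i16
    store i16 %fld, i16* %addr.fld, align 2                ; narrow store replaces the i64 load/and/or/store

This is the same kind of narrow access DAGCombiner already forms within a single basic block; the point of the proposal is to still be able to form it after the cross-block coalescing shown above.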
Krzysztof Parzyszek via llvm-dev
2017-Mar-09 18:57 UTC
[llvm-dev] [RFC] bitfield access shrinking
On 3/9/2017 12:47 PM, Hal Finkel wrote:
>
> On 03/09/2017 12:28 PM, Krzysztof Parzyszek via llvm-dev wrote:
>> We could add intrinsics to extract/insert a bitfield, which would
>> simplify a lot of that bitwise logic.
>
> But then you need to teach a bunch of places about how to simplify them,
> fold using bitwise logic and other things that reduce demanded bits into
> them, etc. This seems like a difficult tradeoff.

Bitfield extraction/insertion generally does not simplify well with
surrounding arithmetic. The main generator of the bitwise manipulation is
SROA, and if it was the only creator of these intrinsics, it would
already make things simpler. We'd also need to teach the combiner about
them, but that's still fairly localized.

On the other hand, they would make the intent of the code explicit.
Targets could handle them any way they want, whether it's loading an
i128, or an i16 from an adjusted offset.

-Krzysztof

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation
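As a rough sketch of that suggestion, such intrinsics might look like the following; the names and signatures are hypothetical, not intrinsics that exist in LLVM today:

    declare i64 @llvm.bitfield.extract.i64(i64, i32, i32)        ; (value, bit offset, width) -- hypothetical
    declare i64 @llvm.bitfield.insert.i64(i64, i64, i32, i32)    ; (old value, field, bit offset, width) -- hypothetical

    ; An and/or pair from the RFC, e.g.
    ;   %bf.clear1 = and i64 %bf.load, -4294901761     ; clear bits 16-31
    ;   %bf.set1   = or i64 %bf.value1, %bf.clear1
    ; would become a single explicit field insertion (assuming %bf.value1
    ; here holds the unshifted 16-bit field value):
    %bf.set1 = call i64 @llvm.bitfield.insert.i64(i64 %bf.load, i64 %bf.value1, i32 16, i32 16)

A target would then be free to lower the insertion either as a wide load/modify/store or as a narrow store at an adjusted offset.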
Daniel Berlin via llvm-dev
2017-Mar-09 19:19 UTC
[llvm-dev] [RFC] bitfield access shrinking
Yeah. This seems similar to trying to optimize extract/insert values with
real binary operations. We do it, but ... historically we only do it by
teaching things that they look like things they already know ;)
(i.e., we teach GVN that when it looks like this, it's really an add of
these two things)

On Thu, Mar 9, 2017 at 10:47 AM, Hal Finkel via llvm-dev
<llvm-dev at lists.llvm.org> wrote:

>
> On 03/09/2017 12:28 PM, Krzysztof Parzyszek via llvm-dev wrote:
>
>> We could add intrinsics to extract/insert a bitfield, which would
>> simplify a lot of that bitwise logic.
>>
>
> But then you need to teach a bunch of places about how to simplify them,
> fold using bitwise logic and other things that reduce demanded bits into
> them, etc. This seems like a difficult tradeoff.
>
> -Hal
>
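To make that analogy concrete, these are the kinds of folds the combiner would presumably need to learn for such intrinsics (same hypothetical names as in the earlier sketch):

    ; extract of a just-inserted field reduces to the inserted value:
    ;   %v = call i64 @llvm.bitfield.insert.i64(i64 %old, i64 %f, i32 16, i32 16)
    ;   %e = call i64 @llvm.bitfield.extract.i64(i64 %v, i32 16, i32 16)
    ; ==>
    ;   %e = and i64 %f, 65535
    ;
    ; two inserts into the same field slot: the first insert is dead:
    ;   %v1 = call i64 @llvm.bitfield.insert.i64(i64 %old, i64 %f1, i32 16, i32 16)
    ;   %v2 = call i64 @llvm.bitfield.insert.i64(i64 %v1,  i64 %f2, i32 16, i32 16)
    ; ==>
    ;   %v2 = call i64 @llvm.bitfield.insert.i64(i64 %old, i64 %f2, i32 16, i32 16)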
On 03/09/2017 12:57 PM, Krzysztof Parzyszek wrote:
> On 3/9/2017 12:47 PM, Hal Finkel wrote:
>>
>> On 03/09/2017 12:28 PM, Krzysztof Parzyszek via llvm-dev wrote:
>>> We could add intrinsics to extract/insert a bitfield, which would
>>> simplify a lot of that bitwise logic.
>>
>> But then you need to teach a bunch of places about how to simplify
>> them, fold using bitwise logic and other things that reduce demanded
>> bits into them, etc. This seems like a difficult tradeoff.
>
> Bitfield extraction/insertion generally does not simplify well with
> surrounding arithmetic.

Maybe. I've seen plenty of places where we've simplified bitfield
extractions/insertions with other logic that uses or generates them
(masks, shifts, and the like).

> The main generator of the bitwise manipulation is SROA, and if it was
> the only creator of these intrinsics, it would already make things
> simpler. We'd also need to teach the combiner about them, but that's
> still fairly localized.
> On the other hand, they would make the intent of the code explicit.
> Targets could handle them any way they want, whether it's loading an
> i128, or an i16 from an adjusted offset.

I wouldn't want just SROA to produce them. If we're going to have them,
then we should canonicalize toward them. That implies having matching
logic, so perhaps we're back where we started.

 -Hal

>
> -Krzysztof
>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
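Canonicalizing toward the intrinsics would mean matching roughly this kind of masked-or pattern (hypothetical intrinsic again; offset and width recovered from the masks):

    ;   %shl   = shl i64 %y, 16
    ;   %mask  = and i64 %shl, 4294901760        ; 0xFFFF << 16
    ;   %clear = and i64 %x, -4294901761         ; ~(0xFFFF << 16)
    ;   %set   = or i64 %clear, %mask
    ; ==>
    ;   %set   = call i64 @llvm.bitfield.insert.i64(i64 %x, i64 %y, i32 16, i32 16)

That is essentially the same and/or/shift matching the shrinking patch already has to do, just producing an intrinsic instead of a narrower memory access.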