thr3ads.net - llvm dev - [LLVMdev] [RFC] Poor code generation for paired load [Aug 2013]

If this information is useful, please help other people find it:
Share via:

Quentin Colombet

2013-Aug-09 23:58 UTC

[LLVMdev] [RFC] Poor code generation for paired load

Hi,

I am investigating a poor code generation on x86-64 involving a 64-bits
structure with two 32-bits fields (in the attached examples float, but similar
behavior is exposed with i32, and we can probably generalize that to smaller
types too).
The root cause of the problem is in SROA, although I am not sure we should fix
something there. That is why I need your advices.

** Problem **

64-bits structures are usually loaded as one chunk of bits and fields are
extracted from this chunk.
Although this may be generally better than loading each field on its own, this
can lead to poor code generation when the operations extracting the fields are
more expensive than a load or when “fancy” loads are available.

More generally, this may happen for smaller size too.

** Example **

1. %chunk64 = load i64
2. %field1trunced = trunc i64 %chunk64 to i32 // <— build field1
from chunk
3. %field1float = bitcast i32 field1trunced to float // <— build field1
from chunk
4. %field2shifted = lshr i64 %chunk64, 32 // <— build field2
from chunk
5. %field2trunced = trunc i64 %field2shifter to i32 // <— build field2
from chunk
6. %field2 = bitcast i32 %field2trunced to float // <— build field2
from chunk

Scenario #1:
Floating point registers are on another register bank and register bank moves
are almost as expensive as loads (instructions 3. and 6.).
Cost: ldi64 + 2 int_to_fp vs. 2 ldfloat

Scenario #2
Paired loads are available on the target. Truncate and shift instructions are
useless (instructions 2., 4., and 5.).
Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair

** To Reproduce **

Here is a way to reproduce the poor code generation for x86-64.

opt -sroa current_input.ll -S -o - | llc -O3 -o -

You will see 2 vmovd and 1 shrq that can be avoided as illustrated with the next
command.

Here is a nicer code produced by modifying the input so that SROA generates
friendlier code for this case.

opt -sroa mod_input.ll -S -o - | llc -O3 -o -

Basically the difference between both inputs is that memcpy has not been
expanded in mod_input.ll (instcombine normally replaces it). Thus, SROA inserts
its own loads to get rid of the memcpy instead of extracting the values from the
64-bits loads.

** Advices Required **

SROA generates this extract-fields-from-chunk-of-bits thing.
However, like I said, I do not think this is generally a bad thing.

Would it make sense to rewrite the definitions of the involved slices so that
SROA breaks them apart when they are loads (and under certain circumstance)?

More generally, do you think there is something we should do in SROA for this?

Currently, 32-bits targets (e.g., armv7s) do not suffer this because the
legalization of types in selection DAG split the 64-bits loads.

Should we do something similar for 64-bits targets with the proper target hooks?
If yes, what hooks?

Thanks for your help.

Cheers,

-Quentin
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130809/7f312635/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: current_input.ll
Type: application/octet-stream
Size: 1960 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130809/7f312635/attachment.obj>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130809/7f312635/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mod_input.ll
Type: application/octet-stream
Size: 2022 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130809/7f312635/attachment-0001.obj>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130809/7f312635/attachment-0002.html>

Eli Friedman

2013-Aug-10 03:00 UTC

head link

[LLVMdev] [RFC] Poor code generation for paired load

On Fri, Aug 9, 2013 at 4:58 PM, Quentin Colombet <qcolombet at apple.com>
wrote:> Hi,
>
> I am investigating a poor code generation on x86-64 involving a 64-bits
> structure with two 32-bits fields (in the attached examples float, but
> similar behavior is exposed with i32, and we can probably generalize that
to
> smaller types too).
> The root cause of the problem is in SROA, although I am not sure we should
> fix something there. That is why I need your advices.
>
>
> ** Problem **
>
> 64-bits structures are usually loaded as one chunk of bits and fields are
> extracted from this chunk.
> Although this may be generally better than loading each field on its own,
> this can lead to poor code generation when the operations extracting the
> fields are more expensive than a load or when “fancy” loads are available.
>
> More generally, this may happen for smaller size too.
>
>
> ** Example **
>
> 1. %chunk64 = load i64
> 2. %field1trunced = trunc i64 %chunk64 to i32          // <— build
field1
> from chunk
> 3. %field1float = bitcast i32 field1trunced to float       // <— build
> field1 from chunk
> 4. %field2shifted = lshr i64 %chunk64, 32                 // <— build
field2
> from chunk
> 5. %field2trunced = trunc i64 %field2shifter to i32     // <— build
field2
> from chunk
> 6. %field2 = bitcast i32 %field2trunced to float           // <— build
> field2 from chunk
>
> Scenario #1:
> Floating point registers are on another register bank and register bank
> moves are almost as expensive as loads (instructions 3. and 6.).
> Cost: ldi64 + 2 int_to_fp vs. 2 ldfloat
>
> Scenario #2
> Paired loads are available on the target. Truncate and shift instructions
> are useless (instructions 2., 4., and 5.).
> Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair
>
>
> ** To Reproduce **
>
> Here is a way to reproduce the poor code generation for x86-64.
>
> opt -sroa current_input.ll -S -o - | llc -O3 -o -
>
> You will see 2 vmovd and 1 shrq that can be avoided as illustrated with the
> next command.
>
> Here is a nicer code produced by modifying the input so that SROA generates
> friendlier code for this case.
>
> opt -sroa mod_input.ll -S -o - | llc -O3 -o -
>
> Basically the difference between both inputs is that memcpy has not been
> expanded in mod_input.ll (instcombine normally replaces it). Thus, SROA
> inserts its own loads to get rid of the memcpy instead of extracting the
> values from the 64-bits loads.
>
>
> ** Advices Required **
>
> SROA generates this extract-fields-from-chunk-of-bits thing.
> However, like I said, I do not think this is generally a bad thing.
>
> Would it make sense to rewrite the definitions of the involved slices so
> that SROA breaks them apart when they are loads (and under certain
> circumstance)?
>
> More generally, do you think there is something we should do in SROA for
> this?
The load that you want to split is actually the i64 load from memory
with SROA knows nothing about.  So the transform you want isn't really
related to what SROA does.
> Currently, 32-bits targets (e.g., armv7s) do not suffer this because the
> legalization of types in selection DAG split the 64-bits loads.
>
> Should we do something similar for 64-bits targets with the proper target
> hooks?
> If yes, what hooks?
Hmm... detecting the pattern in IR isn't particularly hard; see
InstCombiner::SliceUpIllegalIntegerPHI for an example of code which
detects a similar sort of pattern.  You might want to consider adding
something in instcombine.

I'm trying to think if there's some scenario where you wouldn't want
to rewrite the load into two loads, but I'm having trouble coming up
with anything: two scalar loads at a constant offset from each other
are pretty easy to detect for the sorts of passes that like to mess
with loads.  So we probably just want to declare that two load
instructions is the canonical form for loading two floats which are
next to each other in memory, and do this transform on IR for all
targets.

-Eli

Quentin Colombet

2013-Aug-12 16:59 UTC

head link

[LLVMdev] [RFC] Poor code generation for paired load

Hi Eli,

Thanks for the feedbacks.

On Aug 9, 2013, at 8:00 PM, Eli Friedman <eli.friedman at gmail.com>
wrote:
> On Fri, Aug 9, 2013 at 4:58 PM, Quentin Colombet <qcolombet at
apple.com> wrote:
>> Hi,
>> 
>> I am investigating a poor code generation on x86-64 involving a 64-bits
>> structure with two 32-bits fields (in the attached examples float, but
>> similar behavior is exposed with i32, and we can probably generalize
that to
>> smaller types too).
>> The root cause of the problem is in SROA, although I am not sure we
should
>> fix something there. That is why I need your advices.
>> 
>> 
>> ** Problem **
>> 
>> 64-bits structures are usually loaded as one chunk of bits and fields
are
>> extracted from this chunk.
>> Although this may be generally better than loading each field on its
own,
>> this can lead to poor code generation when the operations extracting
the
>> fields are more expensive than a load or when “fancy” loads are
available.
>> 
>> More generally, this may happen for smaller size too.
>> 
>> 
>> ** Example **
>> 
>> 1. %chunk64 = load i64
>> 2. %field1trunced = trunc i64 %chunk64 to i32          // <— build
field1
>> from chunk
>> 3. %field1float = bitcast i32 field1trunced to float       // <—
build
>> field1 from chunk
>> 4. %field2shifted = lshr i64 %chunk64, 32                 // <—
build field2
>> from chunk
>> 5. %field2trunced = trunc i64 %field2shifter to i32     // <— build
field2
>> from chunk
>> 6. %field2 = bitcast i32 %field2trunced to float           // <—
build
>> field2 from chunk
>> 
>> Scenario #1:
>> Floating point registers are on another register bank and register bank
>> moves are almost as expensive as loads (instructions 3. and 6.).
>> Cost: ldi64 + 2 int_to_fp vs. 2 ldfloat
>> 
>> Scenario #2
>> Paired loads are available on the target. Truncate and shift
instructions
>> are useless (instructions 2., 4., and 5.).
>> Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair
>> 
>> 
>> ** To Reproduce **
>> 
>> Here is a way to reproduce the poor code generation for x86-64.
>> 
>> opt -sroa current_input.ll -S -o - | llc -O3 -o -
>> 
>> You will see 2 vmovd and 1 shrq that can be avoided as illustrated with
the
>> next command.
>> 
>> Here is a nicer code produced by modifying the input so that SROA
generates
>> friendlier code for this case.
>> 
>> opt -sroa mod_input.ll -S -o - | llc -O3 -o -
>> 
>> Basically the difference between both inputs is that memcpy has not
been
>> expanded in mod_input.ll (instcombine normally replaces it). Thus, SROA
>> inserts its own loads to get rid of the memcpy instead of extracting
the
>> values from the 64-bits loads.
>> 
>> 
>> ** Advices Required **
>> 
>> SROA generates this extract-fields-from-chunk-of-bits thing.
>> However, like I said, I do not think this is generally a bad thing.
>> 
>> Would it make sense to rewrite the definitions of the involved slices
so
>> that SROA breaks them apart when they are loads (and under certain
>> circumstance)?
>> 
>> More generally, do you think there is something we should do in SROA
for
>> this?
> 
> The load that you want to split is actually the i64 load from memory
> with SROA knows nothing about.  So the transform you want isn't really
> related to what SROA does.Agreed.
> 
>> Currently, 32-bits targets (e.g., armv7s) do not suffer this because
the
>> legalization of types in selection DAG split the 64-bits loads.
>> 
>> Should we do something similar for 64-bits targets with the proper
target
>> hooks?
>> If yes, what hooks?
> 
> Hmm... detecting the pattern in IR isn't particularly hard; see
> InstCombiner::SliceUpIllegalIntegerPHI for an example of code which
> detects a similar sort of pattern.  You might want to consider adding
> something in instcombine.Thanks for the direction, I will have a look.
> 
> I'm trying to think if there's some scenario where you wouldn't
want
> to rewrite the load into two loads, but I'm having trouble coming up
> with anything: two scalar loads at a constant offset from each other
> are pretty easy to detect for the sorts of passes that like to mess
> with loads.  So we probably just want to declare that two load
> instructions is the canonical form for loading two floats which are
> next to each other in memory, and do this transform on IR for all
> targets.It is more general than floating point types. For instance, if you replace float
by i32 (and fadd by add, of course :)) in the example I had in my initial email,
the produced code is also better for x86-64 when you split the loads.
Thus, do you think we should split all loads and rely on later passes to combine
them, if needed for the current target?

Thanks,

-Quentin> 
> -Eli
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130812/1c3c851a/attachment.html>

Seemingly Similar Threads

Search for more seemingly similar threads

llvm dev - Aug 2013 - [LLVMdev] [RFC] Poor code generation for paired load

[LLVMdev] [RFC] Poor code generation for paired load

[LLVMdev] [RFC] Poor code generation for paired load

[LLVMdev] [RFC] Poor code generation for paired load

Seemingly Similar Threads