Quentin Colombet
2013-Aug-12 16:59 UTC
[LLVMdev] [RFC] Poor code generation for paired load
Hi Eli, Thanks for the feedbacks. On Aug 9, 2013, at 8:00 PM, Eli Friedman <eli.friedman at gmail.com> wrote:> On Fri, Aug 9, 2013 at 4:58 PM, Quentin Colombet <qcolombet at apple.com> wrote: >> Hi, >> >> I am investigating a poor code generation on x86-64 involving a 64-bits >> structure with two 32-bits fields (in the attached examples float, but >> similar behavior is exposed with i32, and we can probably generalize that to >> smaller types too). >> The root cause of the problem is in SROA, although I am not sure we should >> fix something there. That is why I need your advices. >> >> >> ** Problem ** >> >> 64-bits structures are usually loaded as one chunk of bits and fields are >> extracted from this chunk. >> Although this may be generally better than loading each field on its own, >> this can lead to poor code generation when the operations extracting the >> fields are more expensive than a load or when “fancy” loads are available. >> >> More generally, this may happen for smaller size too. >> >> >> ** Example ** >> >> 1. %chunk64 = load i64 >> 2. %field1trunced = trunc i64 %chunk64 to i32 // <— build field1 >> from chunk >> 3. %field1float = bitcast i32 field1trunced to float // <— build >> field1 from chunk >> 4. %field2shifted = lshr i64 %chunk64, 32 // <— build field2 >> from chunk >> 5. %field2trunced = trunc i64 %field2shifter to i32 // <— build field2 >> from chunk >> 6. %field2 = bitcast i32 %field2trunced to float // <— build >> field2 from chunk >> >> Scenario #1: >> Floating point registers are on another register bank and register bank >> moves are almost as expensive as loads (instructions 3. and 6.). >> Cost: ldi64 + 2 int_to_fp vs. 2 ldfloat >> >> Scenario #2 >> Paired loads are available on the target. Truncate and shift instructions >> are useless (instructions 2., 4., and 5.). >> Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair >> >> >> ** To Reproduce ** >> >> Here is a way to reproduce the poor code generation for x86-64. >> >> opt -sroa current_input.ll -S -o - | llc -O3 -o - >> >> You will see 2 vmovd and 1 shrq that can be avoided as illustrated with the >> next command. >> >> Here is a nicer code produced by modifying the input so that SROA generates >> friendlier code for this case. >> >> opt -sroa mod_input.ll -S -o - | llc -O3 -o - >> >> Basically the difference between both inputs is that memcpy has not been >> expanded in mod_input.ll (instcombine normally replaces it). Thus, SROA >> inserts its own loads to get rid of the memcpy instead of extracting the >> values from the 64-bits loads. >> >> >> ** Advices Required ** >> >> SROA generates this extract-fields-from-chunk-of-bits thing. >> However, like I said, I do not think this is generally a bad thing. >> >> Would it make sense to rewrite the definitions of the involved slices so >> that SROA breaks them apart when they are loads (and under certain >> circumstance)? >> >> More generally, do you think there is something we should do in SROA for >> this? > > The load that you want to split is actually the i64 load from memory > with SROA knows nothing about. So the transform you want isn't really > related to what SROA does.Agreed.> >> Currently, 32-bits targets (e.g., armv7s) do not suffer this because the >> legalization of types in selection DAG split the 64-bits loads. >> >> Should we do something similar for 64-bits targets with the proper target >> hooks? >> If yes, what hooks? > > Hmm... detecting the pattern in IR isn't particularly hard; see > InstCombiner::SliceUpIllegalIntegerPHI for an example of code which > detects a similar sort of pattern. You might want to consider adding > something in instcombine.Thanks for the direction, I will have a look.> > I'm trying to think if there's some scenario where you wouldn't want > to rewrite the load into two loads, but I'm having trouble coming up > with anything: two scalar loads at a constant offset from each other > are pretty easy to detect for the sorts of passes that like to mess > with loads. So we probably just want to declare that two load > instructions is the canonical form for loading two floats which are > next to each other in memory, and do this transform on IR for all > targets.It is more general than floating point types. For instance, if you replace float by i32 (and fadd by add, of course :)) in the example I had in my initial email, the produced code is also better for x86-64 when you split the loads. Thus, do you think we should split all loads and rely on later passes to combine them, if needed for the current target? Thanks, -Quentin> > -Eli-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130812/1c3c851a/attachment.html>
On Mon, Aug 12, 2013 at 9:59 AM, Quentin Colombet <qcolombet at apple.com> wrote:> Hi Eli, > > Thanks for the feedbacks. > > On Aug 9, 2013, at 8:00 PM, Eli Friedman <eli.friedman at gmail.com> wrote: > > On Fri, Aug 9, 2013 at 4:58 PM, Quentin Colombet <qcolombet at apple.com> > wrote: > > Hi, > > I am investigating a poor code generation on x86-64 involving a 64-bits > structure with two 32-bits fields (in the attached examples float, but > similar behavior is exposed with i32, and we can probably generalize that to > smaller types too). > The root cause of the problem is in SROA, although I am not sure we should > fix something there. That is why I need your advices. > > > ** Problem ** > > 64-bits structures are usually loaded as one chunk of bits and fields are > extracted from this chunk. > Although this may be generally better than loading each field on its own, > this can lead to poor code generation when the operations extracting the > fields are more expensive than a load or when “fancy” loads are available. > > More generally, this may happen for smaller size too. > > > ** Example ** > > 1. %chunk64 = load i64 > 2. %field1trunced = trunc i64 %chunk64 to i32 // <— build field1 > from chunk > 3. %field1float = bitcast i32 field1trunced to float // <— build > field1 from chunk > 4. %field2shifted = lshr i64 %chunk64, 32 // <— build field2 > from chunk > 5. %field2trunced = trunc i64 %field2shifter to i32 // <— build field2 > from chunk > 6. %field2 = bitcast i32 %field2trunced to float // <— build > field2 from chunk > > Scenario #1: > Floating point registers are on another register bank and register bank > moves are almost as expensive as loads (instructions 3. and 6.). > Cost: ldi64 + 2 int_to_fp vs. 2 ldfloat > > Scenario #2 > Paired loads are available on the target. Truncate and shift instructions > are useless (instructions 2., 4., and 5.). > Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair > > > ** To Reproduce ** > > Here is a way to reproduce the poor code generation for x86-64. > > opt -sroa current_input.ll -S -o - | llc -O3 -o - > > You will see 2 vmovd and 1 shrq that can be avoided as illustrated with the > next command. > > Here is a nicer code produced by modifying the input so that SROA generates > friendlier code for this case. > > opt -sroa mod_input.ll -S -o - | llc -O3 -o - > > Basically the difference between both inputs is that memcpy has not been > expanded in mod_input.ll (instcombine normally replaces it). Thus, SROA > inserts its own loads to get rid of the memcpy instead of extracting the > values from the 64-bits loads. > > > ** Advices Required ** > > SROA generates this extract-fields-from-chunk-of-bits thing. > However, like I said, I do not think this is generally a bad thing. > > Would it make sense to rewrite the definitions of the involved slices so > that SROA breaks them apart when they are loads (and under certain > circumstance)? > > More generally, do you think there is something we should do in SROA for > this? > > > The load that you want to split is actually the i64 load from memory > with SROA knows nothing about. So the transform you want isn't really > related to what SROA does. > > Agreed. > > > Currently, 32-bits targets (e.g., armv7s) do not suffer this because the > legalization of types in selection DAG split the 64-bits loads. > > Should we do something similar for 64-bits targets with the proper target > hooks? > If yes, what hooks? > > > Hmm... detecting the pattern in IR isn't particularly hard; see > InstCombiner::SliceUpIllegalIntegerPHI for an example of code which > detects a similar sort of pattern. You might want to consider adding > something in instcombine. > > Thanks for the direction, I will have a look. > > > I'm trying to think if there's some scenario where you wouldn't want > to rewrite the load into two loads, but I'm having trouble coming up > with anything: two scalar loads at a constant offset from each other > are pretty easy to detect for the sorts of passes that like to mess > with loads. So we probably just want to declare that two load > instructions is the canonical form for loading two floats which are > next to each other in memory, and do this transform on IR for all > targets. > > It is more general than floating point types. For instance, if you replace > float by i32 (and fadd by add, of course :)) in the example I had in my > initial email, the produced code is also better for x86-64 when you split > the loads. > Thus, do you think we should split all loads and rely on later passes to > combine them, if needed for the current target?Right... replace "float" with "arithmetic type". You probably want to be careful you don't end up generating overlapping loads, and larger loads are generally better if we're not doing arithmetic, but otherwise this seems like a good idea. That said, I'd suggest writing it up and starting a new thread on llvmdev. -Eli
Quentin Colombet
2013-Aug-12 18:44 UTC
[LLVMdev] [RFC] Poor code generation for paired load
On Aug 12, 2013, at 11:40 AM, Eli Friedman <eli.friedman at gmail.com> wrote:> On Mon, Aug 12, 2013 at 9:59 AM, Quentin Colombet <qcolombet at apple.com> wrote: >> Hi Eli, >> >> Thanks for the feedbacks. >> >> On Aug 9, 2013, at 8:00 PM, Eli Friedman <eli.friedman at gmail.com> wrote: >> >> On Fri, Aug 9, 2013 at 4:58 PM, Quentin Colombet <qcolombet at apple.com> >> wrote: >> >> Hi, >> >> I am investigating a poor code generation on x86-64 involving a 64-bits >> structure with two 32-bits fields (in the attached examples float, but >> similar behavior is exposed with i32, and we can probably generalize that to >> smaller types too). >> The root cause of the problem is in SROA, although I am not sure we should >> fix something there. That is why I need your advices. >> >> >> ** Problem ** >> >> 64-bits structures are usually loaded as one chunk of bits and fields are >> extracted from this chunk. >> Although this may be generally better than loading each field on its own, >> this can lead to poor code generation when the operations extracting the >> fields are more expensive than a load or when “fancy” loads are available. >> >> More generally, this may happen for smaller size too. >> >> >> ** Example ** >> >> 1. %chunk64 = load i64 >> 2. %field1trunced = trunc i64 %chunk64 to i32 // <— build field1 >> from chunk >> 3. %field1float = bitcast i32 field1trunced to float // <— build >> field1 from chunk >> 4. %field2shifted = lshr i64 %chunk64, 32 // <— build field2 >> from chunk >> 5. %field2trunced = trunc i64 %field2shifter to i32 // <— build field2 >> from chunk >> 6. %field2 = bitcast i32 %field2trunced to float // <— build >> field2 from chunk >> >> Scenario #1: >> Floating point registers are on another register bank and register bank >> moves are almost as expensive as loads (instructions 3. and 6.). >> Cost: ldi64 + 2 int_to_fp vs. 2 ldfloat >> >> Scenario #2 >> Paired loads are available on the target. Truncate and shift instructions >> are useless (instructions 2., 4., and 5.). >> Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair >> >> >> ** To Reproduce ** >> >> Here is a way to reproduce the poor code generation for x86-64. >> >> opt -sroa current_input.ll -S -o - | llc -O3 -o - >> >> You will see 2 vmovd and 1 shrq that can be avoided as illustrated with the >> next command. >> >> Here is a nicer code produced by modifying the input so that SROA generates >> friendlier code for this case. >> >> opt -sroa mod_input.ll -S -o - | llc -O3 -o - >> >> Basically the difference between both inputs is that memcpy has not been >> expanded in mod_input.ll (instcombine normally replaces it). Thus, SROA >> inserts its own loads to get rid of the memcpy instead of extracting the >> values from the 64-bits loads. >> >> >> ** Advices Required ** >> >> SROA generates this extract-fields-from-chunk-of-bits thing. >> However, like I said, I do not think this is generally a bad thing. >> >> Would it make sense to rewrite the definitions of the involved slices so >> that SROA breaks them apart when they are loads (and under certain >> circumstance)? >> >> More generally, do you think there is something we should do in SROA for >> this? >> >> >> The load that you want to split is actually the i64 load from memory >> with SROA knows nothing about. So the transform you want isn't really >> related to what SROA does. >> >> Agreed. >> >> >> Currently, 32-bits targets (e.g., armv7s) do not suffer this because the >> legalization of types in selection DAG split the 64-bits loads. >> >> Should we do something similar for 64-bits targets with the proper target >> hooks? >> If yes, what hooks? >> >> >> Hmm... detecting the pattern in IR isn't particularly hard; see >> InstCombiner::SliceUpIllegalIntegerPHI for an example of code which >> detects a similar sort of pattern. You might want to consider adding >> something in instcombine. >> >> Thanks for the direction, I will have a look. >> >> >> I'm trying to think if there's some scenario where you wouldn't want >> to rewrite the load into two loads, but I'm having trouble coming up >> with anything: two scalar loads at a constant offset from each other >> are pretty easy to detect for the sorts of passes that like to mess >> with loads. So we probably just want to declare that two load >> instructions is the canonical form for loading two floats which are >> next to each other in memory, and do this transform on IR for all >> targets. >> >> It is more general than floating point types. For instance, if you replace >> float by i32 (and fadd by add, of course :)) in the example I had in my >> initial email, the produced code is also better for x86-64 when you split >> the loads. >> Thus, do you think we should split all loads and rely on later passes to >> combine them, if needed for the current target? > > Right... replace "float" with "arithmetic type". > > You probably want to be careful you don't end up generating > overlapping loads, and larger loads are generally better if we're not > doing arithmetic, but otherwise this seems like a good idea. That > said, I'd suggest writing it up and starting a new thread on llvmddv.Thanks Eli. Will do. -Quentin> > -Eli-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130812/badc2834/attachment.html>
Seemingly Similar Threads
- [LLVMdev] [RFC] Poor code generation for paired load
- [LLVMdev] [RFC] Poor code generation for paired load
- [LLVMdev] [RFC] Poor code generation for paired load
- rspec model testing - test on user defined validation- How do I test that the create failed.
- I need "validates_presence_of" help