Hi all.
I'm looking for advice on how to deal with inefficient code generation
for the Intel Nehalem/Westmere architecture on a 64-bit platform for the attached
test.cpp (the LLVM IR is in test.cpp.ll).
The inner loop has 11 iterations and is eventually unrolled.
test.lea.s is the assembly code of the outer loop. It simply has 11 loads, 11 FP
adds, 11 FP muls, 1 FP store, lea+mov for the index computation, and cmp and jump.
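For readers without the attachments, here is a minimal sketch of the kind of loop under
discussion. This is an illustrative guess at the shape of test.cpp, not the attached
source; the function name, types and array layout are assumptions, and only the trip
count of 11 and the load/mul/add/store mix come from the description above.

  // Hypothetical sketch only: an inner loop with a constant trip count of 11
  // that the compiler fully unrolls, so each outer iteration performs
  // 11 loads, 11 FP multiplies and 11 FP adds feeding a single store,
  // plus the index update (the lea/inc discussed below).
  void kernel(const double *a, const double *b, double *out, int n) {
    for (int i = 0; i < n; ++i) {        // outer loop shown in test.lea.s
      double sum = 0.0;
      for (int j = 0; j < 11; ++j)       // inner loop, unrolled
        sum += a[i * 11 + j] * b[j];     // load + FP mul + FP add
      out[i] = sum;                      // the single FP store
    }
  }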
The problem is that the lea is on the critical path because it is dispatched on the
same port as all the FP add operations (port 1).
Intel Architecture Code Analyzer (IACA) reports a throughput of 12.95 cycles for that
assembly block.
I made a short investigation and found that there is a pass in codegen that
replaces the index increment with an lea.
Here is the snippet from llvm/lib/CodeGen/TwoAddressInstructionPass.cpp:

  if (MI.isConvertibleTo3Addr()) {
    // This instruction is potentially convertible to a true
    // three-address instruction. Check if it is profitable.
    if (!regBKilled || isProfitableToConv3Addr(regA, regB)) {
      // Try to convert it.
      if (convertInstTo3Addr(mi, nmi, regA, regB, Dist)) {
        ++NumConvertedTo3Addr;
        return true; // Done with this instruction.
      }
    }
  }
regBKilled is false for my test case, so isProfitableToConv3Addr is not even
called.
As an experiment, I left only

  if (isProfitableToConv3Addr(regA, regB)) {

That gave me test.inc.s, where the lea is replaced with inc+mov, and this code is ~27%
faster on my Westmere system. IACA throughput analysis gives 11 cycles for the new
block.
But I got the best performance from switching the scheduling algorithm from
ILP to BURR (test.burr.s). It gives a few percent more vs. "ILP+INC",
and I'm not sure why: it might be because test.burr.s has fewer instructions
(no two moves that copy the index), or it might be because the additions are scheduled
differently. BURR puts loads and FP muls between the additions, whereas ILP gathers
the additions at the end of the loop.
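(For anyone reproducing this: the scheduler is selected with llc's -pre-RA-sched
option. Assuming the flag spellings of the LLVM 3.x command line, the two
configurations above correspond roughly to

  llc -pre-RA-sched=list-ilp  test.cpp.ll -o test.lea.s    # ILP list scheduler
  llc -pre-RA-sched=list-burr test.cpp.ll -o test.burr.s   # bottom-up register reduction

with the inc variant additionally requiring the TwoAddressInstructionPass change
described above.)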
I didn't run experiments on Sandy Bridge, but IACA gives 12.45 cycles for the
original code (test.lea.s), so I expect BURR to improve performance there too
for the attached test case.
Unfortunately, I'm not familiar enough with the LLVM codegen code to make a good
fix for this issue, and I would appreciate any help.
Thanks,
Aleksey
Attachments:
test.cpp:    <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130917/e31251e6/attachment.ksh>
test.cpp.ll: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130917/e31251e6/attachment.obj>
test.lea.s:  <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130917/e31251e6/attachment-0001.obj>
test.inc.s:  <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130917/e31251e6/attachment-0002.obj>
test.burr.s: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130917/e31251e6/attachment-0003.obj>
On Oct 2, 2013, Rafael Espíndola <rafael.espindola at gmail.com> replied:

This sounds like llvm.org/pr13320.
Evan then followed up:

The two-address pass is only concerned about register pressure. It sounds like it
should be taught about profitability. In cases where profitability can only be
determined with something like MachineTraceMetrics, it should probably be left to a
more sophisticated pass such as regalloc. In this case, we probably need a
profitability target hook which knows about lea. We should also consider disabling
its dumb pseudo-scheduling code when we enable the MI scheduler.

Evan
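To make the "profitability target hook" idea concrete, here is a rough sketch of what
such a hook could look like. This is purely hypothetical (neither the classes nor the
method below exist in LLVM); it only illustrates letting the target veto the
two-address -> three-address (inc -> lea) conversion when, as on Nehalem/Westmere,
lea competes with the FP adds for the same execution port:

  // Hypothetical sketch only -- not an existing LLVM interface.
  class MachineInstr;  // real LLVM type, forward-declared for the sketch

  struct HypotheticalTwoAddrHooks {
    // Default: keep the current behaviour and always convert to lea.
    virtual bool isLEAProfitable(const MachineInstr &MI) const { return true; }
    virtual ~HypotheticalTwoAddrHooks() = default;
  };

  // A Nehalem/Westmere-aware override could simply answer "no", so the
  // two-address pass would keep the cheaper inc+mov sequence instead.
  struct NehalemTwoAddrHooks : HypotheticalTwoAddrHooks {
    bool isLEAProfitable(const MachineInstr &) const override { return false; }
  };

TwoAddressInstructionPass would then consult such a hook (or MachineTraceMetrics, as
suggested) before calling convertInstTo3Addr, instead of converting unconditionally
whenever regB is not killed.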