thr3ads.net - llvm dev - [LLVMdev] Codegen performance issue: LEA vs. INC. [Oct 2013]

Evan Cheng

2013-Oct-03 06:48 UTC

[LLVMdev] Codegen performance issue: LEA vs. INC.

The two-address pass is only concerned with register pressure. It sounds like
it should be taught about profitability. In cases where profitability can only
be determined with something like MachineTraceMetrics, it should probably leave
the decision to a more sophisticated pass like regalloc.

In this case, we probably need a profitability target hook that knows about
lea. We should also consider disabling its dumb pseudo-scheduling code when
we enable the MI scheduler.
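
A minimal sketch of what such a hook could look like, written as standalone C++ rather than against LLVM's real interfaces. Every name here (`ConvertCandidate`, `TargetProfitability`, `LeaAwareProfitability`, `shouldConvertTo3Addr`) is hypothetical, and the two lea-related inputs are assumed to come from some analysis that the sketch does not model:

```cpp
// Hypothetical description of a two-address instruction that could be
// rewritten in three-address form (for example inc -> lea on x86).
struct ConvertCandidate {
  bool regBKilled;         // the source register dies at this instruction
  bool genericProfitable;  // stand-in for the pass's existing heuristic result
  bool becomesLEA;         // the three-address form would be an lea
  bool leaPortSaturated;   // assumed input: lea's port is already the bottleneck
};

// Hypothetical target hook. The default never vetoes a conversion.
struct TargetProfitability {
  virtual ~TargetProfitability() = default;
  virtual bool isProfitableToConvert(const ConvertCandidate &) const {
    return true;
  }
};

// A target that knows lea can be costly when its port is contended.
struct LeaAwareProfitability : TargetProfitability {
  bool isProfitableToConvert(const ConvertCandidate &C) const override {
    return !(C.becomesLEA && C.leaPortSaturated);
  }
};

// Generic decision: keep the register-pressure check quoted in the thread,
// then let the target veto the conversion.
bool shouldConvertTo3Addr(const ConvertCandidate &C,
                          const TargetProfitability &T) {
  bool pressureCheck = !C.regBKilled || C.genericProfitable;
  return pressureCheck && T.isProfitableToConvert(C);
}
```

For a candidate such as `ConvertCandidate{false, false, true, true}`, the default hook still allows the conversion while the lea-aware one rejects it. The split keeps the register-pressure reasoning in the generic pass and gives the target only a veto.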

Evan

> On Oct 2, 2013, at 8:38 AM, Rafael Espíndola <rafael.espindola at gmail.com> wrote:
> 
> This sounds like llvm.org/pr13320.
> 
>> On 17 September 2013 18:20, Bader, Aleksey A <aleksey.a.bader at intel.com> wrote:
>> Hi all.
>> 
>> I’m looking for advice on how to deal with inefficient code generation
>> for the Intel Nehalem/Westmere architecture on a 64-bit platform for the
>> attached test.cpp (LLVM IR is in test.cpp.ll).
>> 
>> The inner loop has 11 iterations and is eventually unrolled.
>> 
>> test.lea.s is the assembly code of the outer loop. It simply has 11 loads,
>> 11 FP add, 11 FP mul, 1 FP store, and lea+mov for index computation, cmp,
>> and jump.
>> 
>> The problem is that lea is on the critical path because it’s dispatched on
>> the same port as all FP add operations (port 1).
>> 
>> Intel Architecture Code Analyzer (IACA) reports a throughput of 12.95
>> cycles for that assembly block.
>> 
>> I made a short investigation and found that there is a pass in code gen
>> that replaces the index increment with lea.
>> 
>> Here is the snippet from llvm/lib/CodeGen/TwoAddressInstructionPass.cpp:
>> 
>> if (MI.isConvertibleTo3Addr()) {
>>   // This instruction is potentially convertible to a true
>>   // three-address instruction.  Check if it is profitable.
>>   if (!regBKilled || isProfitableToConv3Addr(regA, regB)) {
>>     // Try to convert it.
>>     if (convertInstTo3Addr(mi, nmi, regA, regB, Dist)) {
>>       ++NumConvertedTo3Addr;
>>       return true; // Done with this instruction.
>>     }
>>   }
>> }
>> 
>> regBKilled is false for my test case and isProfitableToConv3Addr is not
>> even called.
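
The behaviour described here follows from C++ short-circuit evaluation: when `!regBKilled` is true, the right-hand side of `||` is never evaluated. A small standalone sketch, in which `Probe` and both function names are hypothetical stand-ins and the heuristic's `false` result is an assumption chosen so that the two conditions disagree:

```cpp
// Hypothetical stand-in for the profitability heuristic that counts how
// often it is consulted. Returning false is an assumption for illustration.
struct Probe {
  int profitabilityCalls = 0;
  bool isProfitable() {
    ++profitabilityCalls;
    return false;
  }
};

// Shape of the original condition: the heuristic only runs if regB is killed.
bool originalCondition(bool regBKilled, Probe &P) {
  return !regBKilled || P.isProfitable();
}

// Shape of the experiment: the heuristic is always consulted.
bool experimentCondition(Probe &P) {
  return P.isProfitable();
}
```

With `regBKilled == false`, `originalCondition` returns true without touching the probe, while `experimentCondition` calls it once and follows its answer.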
>> 
>> As an experiment, I reduced the condition to
>> 
>> if (isProfitableToConv3Addr(regA, regB)) {
>> 
>> That gave me test.inc.s, where lea is replaced with inc+mov, and this code
>> is ~27% faster on my Westmere system. IACA throughput analysis gives 11
>> cycles for the new block.
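
A rough way to see why moving one instruction off port 1 matters is a toy port-pressure bound: an iteration cannot complete faster than its busiest port allows. The sketch below is only an illustration. The thread states that lea and the FP adds share port 1; every other port assignment, and the idea that inc can issue on a less busy port, is an assumption, and `PortLoad`, `outerLoopLoad`, and `throughputBound` are hypothetical names:

```cpp
#include <algorithm>
#include <array>

// Micro-ops issued to each execution port per loop iteration (ports 0..5).
using PortLoad = std::array<int, 6>;

// Toy port assignment for the outer loop body described in the thread.
PortLoad outerLoopLoad(bool indexUsesLea) {
  PortLoad load{};  // all ports start at zero
  load[0] = 11;     // 11 FP mul (port 0 is an assumption)
  load[1] = 11;     // 11 FP add (port 1, as stated in the thread)
  load[2] = 11;     // 11 loads (port 2 is an assumption)
  load[3] = 1;      // store address (assumption)
  load[4] = 1;      // store data (assumption)
  load[5] = 3;      // mov, cmp, jump (assumption)
  if (indexUsesLea)
    load[1] += 1;   // lea competes with the FP adds on port 1
  else
    load[5] += 1;   // assume inc can issue on a less busy port
  return load;
}

// Lower bound on cycles per iteration: the load on the busiest port.
int throughputBound(const PortLoad &load) {
  return *std::max_element(load.begin(), load.end());
}
```

Under these assumptions `throughputBound(outerLoopLoad(true))` is 12 and `throughputBound(outerLoopLoad(false))` is 11. The model ignores latency and front-end limits, so it gives only a lower bound and is not expected to reproduce IACA's 12.95 figure.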
>> 
>> But the best performance I got came from switching the scheduling
>> algorithm from ILP to BURR (test.burr.s). It gives a few percent more vs.
>> “ILP+INC” and I’m not sure why: it might be because test.burr.s has fewer
>> instructions (no two moves that copy the index) or it might be because the
>> additions are scheduled differently. BURR puts loads and FP mul between
>> the additions, which ILP gathers at the end of the loop.
>> 
>> I didn’t run experiments on Sandy Bridge, but IACA gives 12.45 cycles for
>> the original code (test.lea.s), so I expect BURR to improve performance
>> there too for the attached test case.
>> 
>> Unfortunately I’m not familiar enough with the LLVM codegen code to make a
>> good fix for this issue, and I would appreciate any help.
>> 
>> Thanks,
>> 
>> Aleksey
>> 
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Andrew Trick

2013-Oct-05 06:00 UTC

[LLVMdev] Codegen performance issue: LEA vs. INC.

On Oct 2, 2013, at 11:48 PM, Evan Cheng <evan.cheng at apple.com> wrote:
> The two-address pass is only concerned with register pressure. It sounds
> like it should be taught about profitability. In cases where profitability
> can only be determined with something like MachineTraceMetrics, it should
> probably leave the decision to a more sophisticated pass like regalloc.
> 
> In this case, we probably need a profitability target hook that knows
> about lea. We should also consider disabling its dumb pseudo-scheduling
> code when we enable the MI scheduler.

Sorry, I set this aside to look at closely and never got back to it.

The lea->cmp problem is fixed by switching to the MI scheduler. Please run
with -mllvm -misched-bench to confirm.

Leaving the old ILP scheduler as the default continues to cause confusion.
People have had plenty of time to evaluate the new scheduler. I’ll plan to
switch the default for x86 on Monday.

Now, that doesn’t mean that your analysis of the 2-address pass is irrelevant.
It’s just that the new pass order happens to work better. MI Scheduler also
makes an effort to facilitate macro fusion. But for the record, the 2-address
pass heuristics are clearly obsolete. As Rafael pointed out, that’s covered in
PR13320. I’m honestly not even sure why we still use inc/dec in x86-64, saving a
byte?

Long-term plan: ideally, some of the tricks the 2-address pass is doing would be
done within the MI scheduler now where we track register pressure precisely and
know the final location of instructions. The major hurdle in doing that is
updating live intervals on-the-fly. repairIntervalsInRange is incomplete. That’s
also the reason we can’t kill off the LiveVariables pass.

-Andy

Rafael Espíndola

2013-Oct-05 12:06 UTC

[LLVMdev] Codegen performance issue: LEA vs. INC.

> The lea->cmp problem is fixed by switching to the MI scheduler. Please
> run with -mllvm -misched-bench to confirm.

I get the same output in the testcase in pr13320. The leaq is in
between the cmp and the jmp, preventing macro-fusion.
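
The adjacency requirement can be illustrated with a simplified check: a compare can only fuse with a conditional jump that immediately follows it. This is a sketch over plain mnemonic strings, not LLVM's scheduler logic. The helper names and the `jne` mnemonic used in the usage note are illustrative assumptions, and real fusion rules depend on the microarchitecture and operand forms:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified classification of mnemonics (an assumption for illustration).
bool isCompare(const std::string &op) {
  return op == "cmp" || op == "test";
}

bool isCondJump(const std::string &op) {
  return op.size() >= 2 && op[0] == 'j' && op != "jmp";
}

// True if some compare is immediately followed by a conditional jump.
bool canMacroFuse(const std::vector<std::string> &ops) {
  for (std::size_t i = 0; i + 1 < ops.size(); ++i) {
    if (isCompare(ops[i]) && isCondJump(ops[i + 1]))
      return true;
  }
  return false;
}
```

Here `canMacroFuse({"lea", "cmp", "jne"})` holds, while `canMacroFuse({"cmp", "lea", "jne"})` does not. Because lea leaves the flags untouched, it can legally sit between the compare and the jump, which is why a scheduler has to actively keep the pair adjacent.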

Cheers,
Rafael
