thr3ads.net - llvm dev - [LLVMdev] Tight overlapping loops and performance [Mar 2009]

Jonathan Turner

2009-Mar-02 22:45 UTC

[LLVMdev] Tight overlapping loops and performance

> Date: Mon, 2 Mar 2009 13:41:45 -0800
> From: eli.friedman at gmail.com
> To: llvmdev at cs.uiuc.edu
> Subject: Re: [LLVMdev] Tight overlapping loops and performance
>
> Hmm, on my computer, I get around 2.5 seconds with both gcc -O3 and
> llvm-gcc -O3 (using llvm-gcc from svn).  Not sure what you're doing
> differently; I wouldn't be surprised if it's sensitive to the version
> of LLVM.
For which version of gcc?  I should mention I'm on OS X and using the LLVM
SVN.
> First, try looking at the generated code... the code LLVM generates is
> probably not what you're expecting.  I'm getting the following for the
> main loop:
I was seeing the same thing, but wasn't sure what to make of it.  It looks
like values are being swapped into and out of memory rather than being held
in registers.  That's why I was asking about other optimization passes.  At
first glance -mem2reg looked like a good candidate, but I didn't notice any
improvement using it blindly.
> int timeout = 2000;
> int loopcond;
> do {
>     timeoutwork();
>     do {
>         timeout--;
>         loopcond = computationresult();
>     } while (loopcond && timeout);
> } while (loopcond);
My current implementation uses something very similar.  The difference
between this example and mine is that here the branch that checks 'timeout'
is taken in the majority case, whereas in mine it isn't.  The timeout can be
checked separately for less cost, assuming the variables stay in registers.


Jonathan


Jonathan Turner

2009-Mar-02 23:14 UTC

[LLVMdev] Tight overlapping loops and performance

> My current implementation uses something very similar, but if you'll
> notice the difference between this example and my examples is that the
> branch for checking 'timeout' is taken in the majority case where in
> mine it isn't. It can be checked separately for less cost, assuming the
> variables stay in registers.

Perhaps I shouldn't be so quick to stick my foot in my mouth.  All my
knowledge of asm is from pretty old machines; I don't know anything about
modern pipelines.  Having said that, I'm definitely willing to learn.

Jonathan


Eli Friedman

2009-Mar-02 23:30 UTC

[LLVMdev] Tight overlapping loops and performance

On Mon, Mar 2, 2009 at 2:45 PM, Jonathan Turner <probata at hotmail.com> wrote:
> For which version of gcc?  I should mention I'm on OS X and using the LLVM
> SVN.
gcc 4.3.  It's also possible this is processor-sensitive.
>> First, try looking at the generated code... the code LLVM generates is
>> probably not what you're expecting. I'm getting the following for the
>> main loop:
>
> I was seeing the same thing, but wasn't sure what to make of it.  It looks
> like values are being swapped into and out of memory and not holding them in
> registers.
You're misreading the asm... nothing is touching memory.  (BTW, "leal
-1(%eax), %eax" isn't a memory operation; it's just subtracting one
from %eax.)  You might want to try reading the LLVM IR (which you can
generate with llvm-gcc -S -emit-llvm); it tends to be easier to read.
> My current implementation uses something very similar, but if you'll notice
> the difference between this example and my examples is that the branch for
> checking 'timeout' is taken in the majority case where in mine it isn't.  It
> can be checked separately for less cost, assuming the variables stay in
> registers.
A taken and non-taken branch have roughly the same cost on any
remotely recent x86 processor.

-Eli

Jonathan Turner

2009-Mar-03 00:58 UTC

[LLVMdev] Tight overlapping loops and performance

> You're misreading the asm... nothing is touching memory. (BTW, "leal
> -1(%eax), %eax" isn't a memory operation; it's just subtracting one
> from %eax.) You might want to try reading the LLVM IR (which you can
> generate with llvm-gcc -S -emit-llvm); it tends to be easier to read.
I tried that, but I'm still learning LLVM. Seeing indvar, phi nodes, tail
calls on printfs, and nounwinds had me more confused than the asm.
> A taken and non-taken branch have roughly the same cost on any
> remotely recent x86 processor.
I was wondering if that might be the case.

The crux of the example still seems intact.  From LLVM SVN, converted to asm via
llc:

        .text
        .align  4,0x90
        .globl  _main
_main:
        subl    $12, %esp
        movl    $1999, %eax
        xorl    %ecx, %ecx
        movl    $1999, %edx
        .align  4,0x90
LBB1_1: ## loopto
        cmpl    $1, %eax
        leal    -1(%eax), %eax
        cmove   %edx, %eax
        incl    %ecx
        cmpl    $999999999, %ecx
        jne     LBB1_1  ## loopto
LBB1_2: ## bb1
        movl    %eax, 4(%esp)
        movl    $LC, (%esp)
        call    _printf
        xorl    %eax, %eax
        addl    $12, %esp
        ret
        .section __TEXT,__cstring,cstring_literals
LC:                             ## LC
        .asciz  "Timeout: %i\n"
 
        .subsections_via_symbols
 
Changing the loop to use decl with conditional jumps instead of cmove/incl
might seem like more work, but it appears to be faster:
 
        .text
        .align  4,0x90
        .globl  _main
_main:
        subl    $12, %esp
        movl    $2000, %eax
        movl    $1000000000, %ecx
        .align  4,0x90
LBB1_3:
        movl    $2000, %eax
LBB1_1: ## loopto
        decl    %eax
        jz      LBB1_3
        decl    %ecx
        jnz     LBB1_1  ## loopto
LBB1_2: ## bb1
        movl    %eax, 4(%esp)
        movl    $LC, (%esp)
        call    _printf
        xorl    %eax, %eax
        addl    $12, %esp
        ret
        .section __TEXT,__cstring,cstring_literals
LC:                             ## LC
        .asciz  "Timeout: %i\n"
 
        .subsections_via_symbols


The first example runs in 1.7s, the second in 1.0s.  That's on my dual-core
OS X box.  I also have a 2-processor quad-core Xeon box running Linux that
shows very similar results.


Jonathan

