thr3ads.net - llvm dev - [LLVMdev] Tight overlapping loops and performance [Mar 2009]


Jonathan Turner

2009-Mar-03 00:58 UTC

[LLVMdev] Tight overlapping loops and performance

> You're misreading the asm... nothing is touching memory. (BTW, "leal
> -1(%eax), %eax" isn't a memory operation; it's just subtracting one
> from %eax.) You might want to try reading the LLVM IR (which you can
> generate with llvm-gcc -S -emit-llvm); it tends to be easier to read.

I tried that, but I'm still learning LLVM. Seeing indvar, phi nodes, tail
calls on printfs, and nounwinds had me more confused than the asm.

> A taken and non-taken branch have roughly the same cost on any
> remotely recent x86 processor.

I was wondering if that might be the case.

The crux of the example still seems intact.  From LLVM SVN, converted to asm via
llc:

        .text
        .align  4,0x90
        .globl  _main
_main:
        subl    $12, %esp
        movl    $1999, %eax
        xorl    %ecx, %ecx
        movl    $1999, %edx
        .align  4,0x90
LBB1_1: ## loopto
        cmpl    $1, %eax
        leal    -1(%eax), %eax
        cmove   %edx, %eax
        incl    %ecx
        cmpl    $999999999, %ecx
        jne     LBB1_1  ## loopto
LBB1_2: ## bb1
        movl    %eax, 4(%esp)
        movl    $LC, (%esp)
        call    _printf
        xorl    %eax, %eax
        addl    $12, %esp
        ret
        .section __TEXT,__cstring,cstring_literals
LC:                             ## LC
        .asciz  "Timeout: %i\n"
 
        .subsections_via_symbols
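
Read as C, the loop in the listing above counts a timeout down from 1999 and uses cmove to reload it once the old value was 1, while a second counter runs up to 999999999. The sketch below is a reconstruction from the asm only, not the original source; the name simulate_cmov_loop and its parameters are made up for illustration.

```c
/* Hypothetical reconstruction of the cmove-based loop; not the original source.
   Like the asm, the body runs before the exit test, so iterations must be >= 1. */
int simulate_cmov_loop(int reset_value, int iterations)
{
    int timeout = reset_value;              /* movl $1999, %eax */
    int i = 0;                              /* xorl %ecx, %ecx */
    do {
        int was_one = (timeout == 1);       /* cmpl $1, %eax (sets flags) */
        timeout = timeout - 1;              /* leal -1(%eax), %eax (flags untouched) */
        if (was_one)
            timeout = reset_value;          /* cmove %edx, %eax */
        ++i;                                /* incl %ecx */
    } while (i != iterations);              /* cmpl $999999999, %ecx ; jne LBB1_1 */
    return timeout;                         /* value handed to printf */
}
```

Calling simulate_cmov_loop(1999, 999999999) corresponds to the constants in the listing.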
 
Rewriting the loops to use decl and a conditional jump instead of cmove/incl
might seem like more work, but it appears to be faster:
 
        .text
        .align  4,0x90
        .globl  _main
_main:
        subl    $12, %esp
        movl    $2000, %eax
        movl    $1000000000, %ecx
        .align  4,0x90
LBB1_3:
        movl    $2000, %eax
LBB1_1: ## loopto
        decl    %eax
        jz      LBB1_3
        decl    %ecx
        jnz     LBB1_1  ## loopto
LBB1_2: ## bb1
        movl    %eax, 4(%esp)
        movl    $LC, (%esp)
        call    _printf
        xorl    %eax, %eax
        addl    $12, %esp
        ret
        .section __TEXT,__cstring,cstring_literals
LC:                             ## LC
        .asciz  "Timeout: %i\n"
 
        .subsections_via_symbols
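
The decl-based variant above can be read the same way. In this reconstruction (again an assumption based on the asm alone; simulate_dec_loop is a hypothetical name), the zero flag set by each decl drives the branch directly, so no separate cmpl is needed. The reload path jumps back without decrementing the outer counter, which is presumably why the reload constant is 2000 rather than 1999.

```c
/* Hypothetical reconstruction of the decl/jz loop; not the original source.
   Requires reset_value >= 2 and iterations >= 1, otherwise it does not
   terminate in any sensible way. */
int simulate_dec_loop(int reset_value, int iterations)
{
    int timeout = reset_value;      /* movl $2000, %eax */
    int remaining = iterations;     /* movl $1000000000, %ecx */
    for (;;) {
        if (--timeout == 0) {       /* decl %eax ; jz LBB1_3 */
            timeout = reset_value;  /* LBB1_3: movl $2000, %eax */
            continue;               /* back to LBB1_1 without touching %ecx */
        }
        if (--remaining == 0)       /* decl %ecx ; jnz LBB1_1 */
            break;
    }
    return timeout;                 /* value handed to printf */
}
```

Calling simulate_dec_loop(2000, 1000000000) corresponds to the constants in the listing.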


The first example runs in 1.7s, the second in 1.0s.  That's on my dual-core
OS X box.  My 2-processor quad-core Xeon box running Linux gives very
similar results.


Jonathan


Eli Friedman

2009-Mar-03 05:43 UTC

[LLVMdev] Tight overlapping loops and performance

On Mon, Mar 2, 2009 at 4:58 PM, Jonathan Turner <probata at hotmail.com> wrote:
> The crux of the example still seems intact.

Have you tried putting something non-trivial (like asm("nop;");) where
you'd put the code that runs on the timeout?

-Eli
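
A minimal sketch of what this suggestion might look like in C. The original test program is not shown in this thread, so the loop shape, the name run_timeout_loop, and its parameters are assumptions; the only part taken from the message is the asm("nop") placed where the timeout work would go. It relies on GCC-style inline asm and on a target whose assembler has a nop mnemonic, such as x86.

```c
/* Hypothetical test loop; the real source is not shown in the thread. */
int run_timeout_loop(int reset_value, int iterations)
{
    int timeout = reset_value;
    for (int i = 0; i < iterations; ++i) {
        if (--timeout == 0) {
            /* Stand-in for real timeout work. The volatile qualifier
               keeps the compiler from deleting the statement. */
            __asm__ volatile("nop");
            timeout = reset_value;
        }
    }
    return timeout;
}
```

With the nop in place the reload can no longer be expressed as a plain conditional move, so the compiler has to keep a real branch around the timeout body.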

Evan Cheng

2009-Mar-03 07:09 UTC

[LLVMdev] Tight overlapping loops and performance

On Mar 2, 2009, at 4:58 PM, Jonathan Turner wrote:
>> You're misreading the asm... nothing is touching memory. (BTW, "leal
>> -1(%eax), %eax" isn't a memory operation; it's just subtracting one
>> from %eax.) You might want to try reading the LLVM IR (which you can
>> generate with llvm-gcc -S -emit-llvm); it tends to be easier to read.
>
> I tried that, but I'm still learning LLVM. Seeing indvar, phi nodes, tail
> calls on printfs, and nounwinds had me more confused than the asm.
>
>> A taken and non-taken branch have roughly the same cost on any
>> remotely recent x86 processor.
>
> I was wondering if that might be the case.
>
> The crux of the example still seems intact.  From LLVM SVN,  
> converted to asm via llc:
>
>
>        .align  4,0x90
> LBB1_1: ## loopto
>        cmpl    $1, %eax
>        leal    -1(%eax), %eax
>        cmove   %edx, %eax
>        incl    %ecx
>        cmpl    $999999999, %ecx
>        jne     LBB1_1  ## loopto
>
> LBB1_1: ## loopto
>        decl    %eax
>        jz      LBB1_3
>        decl    %ecx
>        jnz     LBB1_1  ## loopto
>
The main issue is that incl already updates the EFLAGS condition code
register, but the LLVM x86 backend isn't taking advantage of that. This is
a known issue; hopefully someone will find the time to implement it before 2.6.

The second issue is that the leal -1 can be turned (back) into a decl.
Combined with the optimization described above, that can eliminate the
first cmpl.

Feel free to file a bugzilla for this. I'm hopeful this will be fixed in
the not too distant future.

Thanks,

Evan
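
Why the first cmpl becomes redundant can be sketched in C. Comparing the old value with 1 before the subtraction selects the same iterations as testing the decremented value against zero, and the latter is the condition decl reports through the zero flag. The two helper names below are hypothetical, and this models only the flag logic, not what any particular LLVM version emits.

```c
/* Models cmpl $1 / leal -1 / cmove: test the old value, then subtract. */
int step_compare_then_subtract(int timeout, int reset_value)
{
    int reload = (timeout == 1);
    timeout = timeout - 1;
    return reload ? reset_value : timeout;
}

/* Models decl / cmove: subtract, then test the result (the zero flag). */
int step_subtract_then_test(int timeout, int reset_value)
{
    timeout = timeout - 1;
    return (timeout == 0) ? reset_value : timeout;
}
```

Both helpers return the same value for any starting timeout where the subtraction does not overflow, which is what lets the explicit compare be dropped.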

Jonathan Turner

2009-Mar-03 12:26 UTC

[LLVMdev] Tight overlapping loops and performance

> Have you tried putting something non-trivial (like asm("nop;");) where
> you'd put the code that runs on the timeout?
>
> -Eli

Using an asm("nop") does fix the LLVM output, which makes it sound like
a bug.  At least in my expectations, a trivial loop should be faster than a
non-trivial one.

> The main issue is that incl already updates the EFLAGS condition code
> register, but the LLVM x86 backend isn't taking advantage of that. This is
> a known issue; hopefully someone will find the time to implement it before 2.6.
>
> The second issue is that the leal -1 can be turned (back) into a decl.
> Combined with the optimization described above, that can eliminate the
> first cmpl.
>
> Feel free to file a bugzilla for this. I'm hopeful this will be fixed in
> the not too distant future.
>
> Thanks,
>
> Evan
Will do.  Thanks.


Jonathan
