> Date: Mon, 2 Mar 2009 13:41:45 -0800
> From: eli.friedman at gmail.com
> To: llvmdev at cs.uiuc.edu
> Subject: Re: [LLVMdev] Tight overlapping loops and performance
>
> Hmm, on my computer, I get around 2.5 seconds with both gcc -O3 and
> llvm-gcc -O3 (using llvm-gcc from svn). Not sure what you're doing
> differently; I wouldn't be surprised if it's sensitive to the version
> of LLVM.

For which version of gcc? I should mention I'm on OS X and using the
LLVM SVN.

> First, try looking at the generated code... the code LLVM generates is
> probably not what you're expecting. I'm getting the following for the
> main loop:

I was seeing the same thing, but wasn't sure what to make of it. It
looks like values are being swapped into and out of memory rather than
being held in registers. That's why I was asking about other
optimization passes. At first glance -mem2reg looked like a good
candidate, but I didn't notice any improvement using it blindly.

> int timeout = 2000;
> int loopcond;
> do {
>   timeoutwork();
>   do {
>     timeout--;
>     loopcond = computationresult();
>   } while (loopcond && timeout);
> } while (loopcond);

My current implementation uses something very similar, but if you'll
notice the difference between this example and my examples is that the
branch for checking 'timeout' is taken in the majority case where in
mine it isn't. It can be checked separately for less cost, assuming the
variables stay in registers.

Jonathan
> My current implementation uses something very similar, but if you'll
> notice the difference between this example and my examples is that the
> branch for checking 'timeout' is taken in the majority case where in
> mine it isn't. It can be checked separately for less cost, assuming
> the variables stay in registers.

Perhaps I shouldn't be so quick to stick my foot in my mouth. All my
knowledge of asm is from pretty old machines, and I don't know anything
about modern pipelines. Having said that, I'm definitely willing to
learn.

Jonathan
On Mon, Mar 2, 2009 at 2:45 PM, Jonathan Turner <probata at hotmail.com> wrote:
> For which version of gcc? I should mention I'm on OS X and using the LLVM
> SVN.

gcc 4.3. It's also possible this is processor-sensitive.

>> First, try looking at the generated code... the code LLVM generates is
>> probably not what you're expecting. I'm getting the following for the
>> main loop:
>
> I was seeing the same thing, but wasn't sure what to make of it. It looks
> like values are being swapped into and out of memory and not holding them in
> registers.

You're misreading the asm... nothing is touching memory. (BTW, "leal
-1(%eax), %eax" isn't a memory operation; it's just subtracting one
from %eax.) You might want to try reading the LLVM IR (which you can
generate with llvm-gcc -S -emit-llvm); it tends to be easier to read.

> My current implementation uses something very similar, but if you'll notice
> the difference between this example and my examples is that the branch for
> checking 'timeout' is taken in the majority case where in mine it isn't. It
> can be checked separately for less cost, assuming the variables stay in
> registers.

A taken and non-taken branch have roughly the same cost on any
remotely recent x86 processor.

-Eli
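[Editorial sketch, not part of the original message.] The point about "leal" can be modelled in C. The snippet below is only an illustration: the function names are hypothetical, and the second function shows, for contrast, what a real load through the same address expression would look like ("movl -1(%eax), %eax" is not an instruction that appears in this thread).

```c
#include <stdint.h>
#include <string.h>

/* Model of `leal -1(%eax), %eax`: the address expression eax - 1 is
 * evaluated and the result is written back to the register. Nothing is
 * dereferenced, so no memory is read or written. */
static uint32_t lea_minus_one(uint32_t eax) {
    return eax - 1u;
}

/* Contrast (hypothetical, not from the thread): `movl -1(%eax), %eax`
 * uses the same address expression but reads four bytes from it. */
static uint32_t mov_load_minus_one(const unsigned char *eax) {
    uint32_t value;
    memcpy(&value, eax - 1, sizeof value); /* an actual memory access */
    return value;
}
```

Read this way, the "leal" in the loop is plain register arithmetic, which is consistent with Eli's remark that nothing in the loop touches memory.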
> You're misreading the asm... nothing is touching memory. (BTW, "leal
> -1(%eax), %eax" isn't a memory operation; it's just subtracting one
> from %eax.) You might want to try reading the LLVM IR (which you can
> generate with llvm-gcc -S -emit-llvm); it tends to be easier to read.

I tried that, but I'm still learning LLVM. Seeing indvar, phi nodes,
tail calls on printfs, and nounwinds had me more confused than the asm.

> A taken and non-taken branch have roughly the same cost on any
> remotely recent x86 processor.

I was wondering if that might be the case. The crux of the example
still seems intact. From LLVM SVN, converted to asm via llc:

        .text
        .align  4,0x90
        .globl  _main
_main:
        subl    $12, %esp
        movl    $1999, %eax
        xorl    %ecx, %ecx
        movl    $1999, %edx
        .align  4,0x90
LBB1_1: ## loopto
        cmpl    $1, %eax
        leal    -1(%eax), %eax
        cmove   %edx, %eax
        incl    %ecx
        cmpl    $999999999, %ecx
        jne     LBB1_1  ## loopto
LBB1_2: ## bb1
        movl    %eax, 4(%esp)
        movl    $LC, (%esp)
        call    _printf
        xorl    %eax, %eax
        addl    $12, %esp
        ret
        .section __TEXT,__cstring,cstring_literals
LC:     ## LC
        .asciz  "Timeout: %i\n"
        .subsections_via_symbols

Setting the loops to decl instead of cmove/incl might seem like more
work, but appears to be faster:

        .text
        .align  4,0x90
        .globl  _main
_main:
        subl    $12, %esp
        movl    $2000, %eax
        movl    $1000000000, %ecx
        .align  4,0x90
LBB1_3:
        movl    $2000, %eax
LBB1_1: ## loopto
        decl    %eax
        jz      LBB1_3
        decl    %ecx
        jnz     LBB1_1  ## loopto
LBB1_2: ## bb1
        movl    %eax, 4(%esp)
        movl    $LC, (%esp)
        call    _printf
        xorl    %eax, %eax
        addl    $12, %esp
        ret
        .section __TEXT,__cstring,cstring_literals
LC:     ## LC
        .asciz  "Timeout: %i\n"
        .subsections_via_symbols

The first example is 1.7s, the second is 1.0s. That's on my dual core
OS X box. I have a 2-processor quad-core Xeon box that runs Linux and
also has very similar results.

Jonathan
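[Editorial sketch, not part of the original message.] For readers more comfortable in C than in asm, the two listings can be approximated as follows. This is reconstructed from the assembly alone and is not the benchmark's original source, which this part of the thread does not show. The names run_select and run_branch and the iterations parameter are hypothetical, and there is no guarantee that a given compiler emits cmov for the first form or the two-branch loop for the second.

```c
/* Shape of the llc output: one loop branch per iteration, with the
 * timeout reset folded into a select (cmpl / leal / cmove). The reset
 * value is 1999 because the reload to 2000 and the decrement that
 * follows it are merged into a single step. */
static int run_select(long iterations) {
    int timeout = 2000;
    for (long i = 0; i < iterations; ++i) {
        int next = timeout - 1;               /* leal -1(%eax), %eax */
        timeout = (next == 0) ? 1999 : next;  /* cmove %edx, %eax    */
    }
    return timeout;
}

/* Shape of the hand-written version: two decrement-and-branch pairs.
 * The reset branch is rarely taken (once per 1999 counted iterations). */
static int run_branch(long iterations) {
    int timeout = 2000;
    long remaining = iterations;
    while (remaining != 0) {
        if (--timeout == 0) {   /* decl %eax ; jz LBB1_3     */
            timeout = 2000;     /* LBB1_3: movl $2000, %eax  */
            continue;           /* decrement again, uncounted */
        }
        --remaining;            /* decl %ecx ; jnz LBB1_1    */
    }
    return timeout;
}
```

Both functions should return the same value for any positive iteration count, so calling each with 1000000000 and printing the result with the "Timeout: %i\n" format gives a way to check that the two loop shapes compute the same thing before comparing their timings.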