On Thu, Oct 8, 2009 at 3:59 PM, Devang Patel <devang.patel at gmail.com> wrote:
> On Thu, Oct 8, 2009 at 11:28 AM, Villmow, Micah <Micah.Villmow at amd.com> wrote:
>>
>>> -----Original Message-----
>>> From: Jeffrey Yasskin [mailto:jyasskin at google.com]
>>> Sent: Thursday, October 08, 2009 11:09 AM
>>> To: Villmow, Micah
>>> Cc: LLVM Developers Mailing List
>>> Subject: Re: [LLVMdev] Instructions that cannot be duplicated
>>>
>>> On Thu, Oct 8, 2009 at 10:49 AM, Villmow, Micah <Micah.Villmow at amd.com> wrote:
>>> >
>>> >> -----Original Message-----
>>> >> From: Eli Friedman [mailto:eli.friedman at gmail.com]
>>> >> Sent: Wednesday, October 07, 2009 5:50 PM
>>> >> To: Villmow, Micah
>>> >> Cc: LLVM Developers Mailing List
>>> >> Subject: Re: [LLVMdev] Instructions that cannot be duplicated
>>> >>
>>> >> On Wed, Oct 7, 2009 at 11:20 AM, Villmow, Micah <Micah.Villmow at amd.com> wrote:
>>> >> > Is there a current way to specify that an instruction or function
>>> >> > call cannot be duplicated, so that any optimization that might want
>>> >> > to duplicate this instruction would fail?
>>> >>
>>> >> No. Anything can be duplicated. That could change, but you would
>>> >> need to make a strong case for why other solutions won't work.
>>> > [Villmow, Micah] The problem is that the function in question cannot
>>> > be duplicated because it has side effects such that duplicating it
>>> > causes undefined behavior on vector hardware. Likewise, moving the
>>> > instruction inside of flow control when it is originally outside of
>>> > flow control produces undefined behavior. There is currently no way
>>> > to specify this in LLVM that I know of. We've tried lowering it to an
>>> > intrinsic and setting MayWriteMem, and this does not solve the
>>> > problem.
>>> > After looking at the LLVM IR, there is no equivalent way of
>>> > representing an instruction that is an execution barrier (not a
>>> > memory barrier, which llvm.barrier.[ss|ll|ls|sl] is). If you have
>>> > any ideas, we would be willing to give them a try.
>>>
>>> Is the effect similar to pthread_barrier_wait(barrier_for($pc))
>>> [http://linux.die.net/man/3/pthread_barrier_wait], where the
>>> implementation automatically generates the barrier_for() function and
>>> automatically calculates the number of threads to wait for?
>>>
>>> If the barrier lowers to any sort of function call, it sounds like
>>> you're currently looking up the PC of the caller and finding the
>>> barrier that way. Instead, you could specify the barrier as an
>>> explicit argument to the function when your frontend generates the
>>> call instruction, which would free you from worrying about whether
>>> the call winds up in multiple places in the optimized IR.
>>>
>>> If the barrier lowers to a magic instruction on your chip, and that
>>> instruction doesn't take an ID of any sort besides its address, you
>>> could generate a one-instruction function for each barrier() in the
>>> source language and allow calls to that function to be duplicated.
>>> There may be optimizations that merge "identical" functions, but
>>> they'll be easier to turn off than optimizations that assume they can
>>> rearrange control flow.
>>>
>>> If your chip doesn't support function calls, that might constitute
>>> the strong case Eli's asking for.
>> [Villmow, Micah] Jeffrey, thanks for the information on pthreads. The
>> barrier on our hardware is a single instruction, not a function call,
>> with no arguments; everything is handled implicitly, including keeping
>> track of the number of hardware threads that need to hit the barrier.
>> If one of those hardware threads does not hit the barrier, the result
>> is undefined behavior. This is why we need to guarantee that this
>> instruction is not optimized around, moved, or duplicated: the
>> algorithm writer must place it following strict guidelines for it to
>> work correctly.
>>
>> So, my strong case for some sort of workaround is this:
>>
>> Valid original code is:
>>
>>   flag = false
>>   if (cond) {
>>     flag = bar();
>>   }
>>   foo()
>>   if (flag) { bar() }
>>
>> which transforms to:
>>
>>   if (cond) {
>>     flag = bar()
>>     foo()
>>     if (flag)
>>       bar()
>>   } else {
>>     foo()
>>   }
>>
>> Assumptions:
>> - foo() is the barrier
>> - each hardware thread is a 64-wide vector
>> - two hardware threads are executing
>> - each vector element gets a unique id between 0 and 127
>> - the condition is true if id > 32
>> - predication is used on control flow
>>
>> What happens in the original code:
>> The first half of the first hardware thread predicates computation on
>> the first condition; the second half executes bar, and all threads in
>> the second wavefront execute bar. Both hardware threads hit the
>> barrier, wait for the other hardware thread to reach that point, and
>> then continue execution.
>>
>> What happens in the optimized code:
>> The first half of the first hardware thread predicates computation on
>> the first condition; the second half executes bar and waits at the
>> barrier. The second hardware thread executes bar and hits the barrier,
>> forcing continuation of execution. Both the first and second hardware
>> threads execute the rest of the if block. Once the block ends, the
>> predication masks are flipped, and the second half of the first
>> hardware thread hits the barrier and blocks waiting for the second
>> hardware thread. The second hardware thread skips execution of the
>> else block, thus never hitting the barrier, so the first hardware
>> thread never returns from the barrier.
>>
>> We have already found two optimization passes (LoopUnswitch and
>> SimplifyCFG) that perform these types of transforms, and we are sure
>> there may be more.
>> We want to be able to enable these transforms for normal code, but
>> have them respect the barrier instruction.
>
> ... and even if a new barrier intrinsic that does not allow cloning
> (not sure how, but anyway...) is introduced, you'll have to modify all
> of these optimization passes to take note of this special barrier. The
> new barrier won't be respected automatically.
>
> Have you thought about dividing the code into separate functions and
> making sure the inliner does not inline them?
>
> fn1() /* do not inline */
> your barrier()
> fn2() /* do not inline */

IMO Jeff's solution is the cleanest, simplest way to get code that
works. Just generate a separate function for every barrier in the
program, and mark it noinline. This way the instruction pointers will
be unique to the barrier.

Reid
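[Editor's note] Micah's divergence scenario can be made concrete with a small simulation. This is purely illustrative Python (the wavefront model and all names are ours, not from the thread), following his stated assumptions: two 64-wide wavefronts, ids 0-127, cond true for id > 32, and a wavefront executing a predicated block whenever any of its lanes is active.

```python
# Model: a wavefront enters a predicated block if ANY lane is active,
# and the barrier instruction fires every time the wavefront reaches it.
WAVE_SIZE = 64

def lane_conds(wave):
    """cond per lane: true if the lane's unique id is > 32."""
    return [i > 32 for i in range(wave * WAVE_SIZE, (wave + 1) * WAVE_SIZE)]

def run_original(wave):
    """flag = false; if (cond) flag = bar(); foo(); if (flag) bar()"""
    cond = lane_conds(wave)
    barriers = 0
    if any(cond):
        pass             # flag = bar() under predication -- no barrier here
    barriers += 1        # foo(), the barrier, outside all control flow
    if any(cond):
        pass             # if (flag) bar()
    return barriers

def run_unswitched(wave):
    """if (cond) { bar(); foo(); if (flag) bar() } else { foo() }"""
    cond = lane_conds(wave)
    barriers = 0
    if any(cond):        # wavefront enters the then-block
        barriers += 1    # copy of foo() inside the then-block
    if not all(cond):    # a divergent wavefront also runs the else-block
        barriers += 1    # copy of foo() inside the else-block
    return barriers

# Original: every wavefront hits the barrier exactly once -- safe.
assert [run_original(w) for w in (0, 1)] == [1, 1]

# After the transform: wavefront 0 (lanes 0..63 straddle id > 32) diverges
# and hits a barrier in BOTH arms, while wavefront 1 hits only one.
# Mismatched arrival counts are exactly the hang Micah describes.
assert [run_unswitched(w) for w in (0, 1)] == [2, 1]
```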
For these particular targets, the hardware may not support function calls,
so at some point the compiler will need to inline the function. It would
need to do something special, since it can't inline at each call site but
would instead need to introduce a jump to the barrier and a jump back to
the various basic blocks it came from, which could be messy. It would be
nicer to avoid the transformations that copy these basic blocks.

-- Mon Ping

On Oct 8, 2009, at 2:11 PM, Reid Kleckner wrote:
> IMO Jeff's solution is the cleanest, simplest way to get code that
> works. Just generate a separate function for every barrier in the
> program, and mark it noinline. This way the instruction pointers will
> be unique to the barrier.
On Thu, Oct 8, 2009 at 2:11 PM, Reid Kleckner <rnk at mit.edu> wrote:
> IMO Jeff's solution is the cleanest, simplest way to get code that
> works. Just generate a separate function for every barrier in the
> program, and mark it noinline. This way the instruction pointers will
> be unique to the barrier.

No, this gets rather nasty: to support an instruction like this, it
isn't legal to duplicate calls to functions containing a barrier
instruction.

Another proposal: add an executebarrier function attribute for
functions which directly or indirectly contain an execution barrier,
and adjust all the relevant transformation passes, like jump threading
and loop unswitching, to avoid duplicating calls to such functions.
This puts a slight burden on the frontend to mark functions
appropriately, but I don't see any other solution which doesn't affect
code which doesn't use execute barriers.

-Eli
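[Editor's note] Eli's proposal amounts to a bottom-up mark over the call graph. The attribute name "executebarrier" is his; the toy call-graph representation and function names below are our own illustrative sketch, not LLVM API.

```python
def compute_executebarrier(contains_barrier, calls):
    """Propagate the attribute up the call graph: a function gets it if
    it contains a barrier itself, or transitively calls one that does."""
    marked = {f for f, has in contains_barrier.items() if has}
    changed = True
    while changed:
        changed = False
        for caller, callees in calls.items():
            if caller not in marked and any(c in marked for c in callees):
                marked.add(caller)
                changed = True
    return marked

def may_duplicate_call(callee, marked):
    """What jump threading / loop unswitching would ask before cloning."""
    return callee not in marked

# barrier_fn holds the hardware barrier; user_func calls it; helper doesn't.
contains = {"barrier_fn": True, "user_func": False, "helper": False}
calls = {"user_func": ["barrier_fn"], "helper": [], "barrier_fn": []}
marked = compute_executebarrier(contains, calls)

assert marked == {"barrier_fn", "user_func"}
assert not may_duplicate_call("user_func", marked)  # must not be cloned
assert may_duplicate_call("helper", marked)         # ordinary code unaffected
```

The key property, per Eli's argument, is that code which never touches a barrier is left entirely alone; only marked functions pay the cost.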
Is inlining (which duplicates code) of functions containing OpenCL-style
barriers legal? Or, e.g., if you had some changed phase ordering where
you had

  if (cond) { S1; }
  call user_func()   // user_func has a barrier buried inside it

and you do tail splitting:

  if (cond) {
    S1;
    call user_func()
  } else {
    call user_func();
  }

now you inline -- oops, now you might have a problem. So do you want IPA
to propagate the barrier bit to the call sites? You could do inlining
before tail splitting... sounds messy.

Vinod

On Thu, Oct 8, 2009 at 8:38 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
> Another proposal: add an executebarrier function attribute for
> functions which directly or indirectly contain an execution barrier,
> and adjust all the relevant transformation passes, like jump threading
> and loop unswitching, to avoid duplicating calls to such functions.

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
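[Editor's note] Vinod's phase-ordering hazard can be sketched with a toy pipeline. Everything here is our own illustration (the pass and op names are invented): blocks are lists of ops, "inlining" splices a callee body in place of a call, and "tail splitting" duplicates a region unless it can see a barrier in it.

```python
def inline(block, bodies):
    """Replace each call op with its callee's body, if known."""
    out = []
    for op in block:
        out.extend(bodies.get(op, [op]))
    return out

def tail_split(block, barrier_ops):
    """Duplicate the tail into both arms of a branch, but refuse when the
    region visibly contains a barrier. 'block + block' is a crude stand-in
    for 'the tail now appears in two places'."""
    if any(op in barrier_ops for op in block):
        return None          # transformation suppressed
    return block + block

bodies = {"user_func": ["s2", "barrier", "s3"]}

# Inline first: the barrier becomes visible, so tail splitting backs off.
inlined = inline(["s1", "user_func"], bodies)
assert inlined == ["s1", "s2", "barrier", "s3"]
assert tail_split(inlined, {"barrier"}) is None

# Split first: 'user_func' looks like an ordinary call and gets duplicated;
# inlining afterwards yields two copies of the barrier -- the hazard Vinod
# describes. An IPA bit on 'user_func' (Eli's attribute) closes this hole.
split = tail_split(["s1", "user_func"], set())
assert split is not None
assert inline(split, bodies).count("barrier") == 2
```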