> -----Original Message-----
> From: Jeffrey Yasskin [mailto:jyasskin at google.com]
> Sent: Thursday, October 08, 2009 11:09 AM
> To: Villmow, Micah
> Cc: LLVM Developers Mailing List
> Subject: Re: [LLVMdev] Instructions that cannot be duplicated
>
> On Thu, Oct 8, 2009 at 10:49 AM, Villmow, Micah <Micah.Villmow at amd.com> wrote:
> >
> >> -----Original Message-----
> >> From: Eli Friedman [mailto:eli.friedman at gmail.com]
> >> Sent: Wednesday, October 07, 2009 5:50 PM
> >> To: Villmow, Micah
> >> Cc: LLVM Developers Mailing List
> >> Subject: Re: [LLVMdev] Instructions that cannot be duplicated
> >>
> >> On Wed, Oct 7, 2009 at 11:20 AM, Villmow, Micah <Micah.Villmow at amd.com> wrote:
> >> > Is there a current way to specify that an instruction or function call
> >> > cannot be duplicated, and thus that any optimizations that might want
> >> > to duplicate this instruction would fail?
> >>
> >> No. Anything can be duplicated. That could change, but you would
> >> need to make a strong case for why other solutions won't work.
> > [Villmow, Micah] Well, the problem is that the function in question
> > cannot be duplicated, because it has side effects such that duplicating it
> > causes undefined behavior on vector hardware. Also, moving the
> > instruction inside of flow control when it is originally outside of flow
> > control produces undefined behavior. There is currently no way to
> > specify this in LLVM that I know of. We've tried lowering it to an
> > intrinsic and setting MayWriteMem, and this does not solve the problem.
> > After looking at the LLVM IR, there is no equivalent method of
> > representing an instruction that is an execution barrier (not a memory
> > barrier, which llvm.barrier.[ss|ll|ls|sl] is). If you have any ideas,
> > we would be willing to give them a try.
>
> Is the effect similar to pthread_barrier_wait(barrier_for($pc))
> [http://linux.die.net/man/3/pthread_barrier_wait], where the
> implementation automatically generates the barrier_for() function and
> automatically calculates the number of threads to wait for?
>
> If the barrier lowers to any sort of function call, it sounds like
> you're currently looking up the PC of the caller and finding the
> barrier that way. Instead, could you specify the barrier as an explicit
> argument to the function when your frontend generates the call
> instruction? That would free you from worrying about whether the call
> winds up in multiple places in the optimized IR.
>
> If the barrier lowers to a magic instruction on your chip, and that
> instruction doesn't take an ID of any sort besides its address, you
> could generate a one-instruction function for each barrier() in the
> source language and allow calls to that function to be duplicated.
> There may be optimizations that merge "identical" functions, but
> they'll be easier to turn off than optimizations that assume they can
> rearrange control flow.
>
> If your chip doesn't support function calls, that might constitute the
> strong case Eli's asking for.

[Villmow, Micah] Jeffrey, thanks for the information on pthread. The barrier on our hardware is a single instruction, not a function call, with no arguments; everything is handled implicitly, including keeping track of the number of hardware threads that need to hit the barrier. If one of those hardware threads does not hit the barrier, the result is undefined behavior. That is why we need to guarantee that this instruction is not optimized around, moved, or duplicated: the algorithm writer must place it following strict guidelines for it to work correctly.
So, my strong case for some sort of workaround is this:

Valid original code is:

  flag = false
  if (cond) {
    flag = bar();
  }
  foo()
  if (flag) { bar() }

which transforms to:

  if (cond) {
    flag = bar()
    foo()
    if (flag)
      bar()
  } else {
    foo()
  }

Assumptions:
- foo() is the barrier
- each hardware thread is a 64-wide vector
- two hardware threads are executing
- each vector element gets a unique id between 0 and 127
- the condition is true if id > 32
- predication is used on control flow

What happens in the original code:
The first half of the first hardware thread predicates computation on the first condition, the second half executes bar(), and all threads in the second wavefront execute bar(). Both hardware threads hit the barrier, wait for the other hardware thread to reach that point, and then continue execution.

What happens in the optimized code:
The first half of the first hardware thread predicates computation on the first condition; the second half executes bar() and waits at the barrier. The second hardware thread executes bar() and hits the barrier, forcing continuation of execution. Both the first and second hardware threads execute the rest of the if block. Once the block ends, the predication masks are flipped, and the second half of the first hardware thread hits the barrier and blocks waiting for the second hardware thread. The second hardware thread skips execution of the else block, thus never hitting the barrier, which causes the first hardware thread to never return from the barrier.

We have already detected two optimization passes (LoopUnswitch and SimplifyCFG) that perform these kinds of transforms, and we are sure there may be more. We want to be able to enable these transforms for normal code, but have them respect the barrier instruction.

Thanks again,
Micah

> > On the unique barrier issue, even if the barrier is given a unique
> > global identifier, it is the function duplication that causes the
> > problem.
> > A unique global identifier lets us identify that invalid
> > optimizations have occurred, but it does not guarantee correctness, since
> > the barrier function is unique per function call. So any sort of
> > duplication is invalid.
> >
> > Micah
On Thu, Oct 8, 2009 at 11:28 AM, Villmow, Micah <Micah.Villmow at amd.com> wrote:
>> [snip]
>> We have already detected two optimization passes (LoopUnswitch and SimplifyCFG) that perform these kinds of transforms, and we are sure there may be more. We want to be able to enable these transforms for normal code, but have them respect the barrier instruction.

... and even if a new barrier intrinsic that does not allow cloning (not sure how, but anyway...) is introduced, you'll have to modify all these optimization passes to take note of this special barrier. A new barrier won't be respected automatically.

Have you thought about dividing the code into separate functions and making sure the inliner does not inline them?

  fn1()          /* do not inline */
  your barrier()
  fn2()          /* do not inline */

- Devang
On Thu, Oct 8, 2009 at 3:59 PM, Devang Patel <devang.patel at gmail.com> wrote:
> [snip]
> Have you thought about dividing the code into separate functions and
> making sure the inliner does not inline them?
>
>   fn1()          /* do not inline */
>   your barrier()
>   fn2()          /* do not inline */

IMO Jeff's solution is the cleanest, simplest way to get code that works. Just generate a separate function for every barrier in the program and mark it noinline. This way the instruction pointers will be unique to each barrier.

Reid