Disclaimer: I've been focusing on vectorization (single thread problem) for
the last 10+yrs. Please read this with grain of rust. 😊
The scheme proposed in the Page 4 Fig 1(d) of [0] may not work in general
problems, as is. Some code may not have bar(). Then, one may certainly argue
that why bother splitting this into two critical sections. There are values in
keeping critical section sizes as small as possible, and it's not too hard
to imagine two such critical sections touching the same variable along with some
other variables.
if (omp_critical_start(tid)) {
*p += 1;
omp_critical_end(tid);
}
bar()[p]; // May "use" pointer p.
if (omp_critical_start(tid)) {
*p *= 2;
omp_critical_end(tid);
}
In the absence of user-code bar(), if we are to take the spirit of the proposal,
we would inject a pseudo function call (appropriately name it as
__memory_fence() 😊) and it would look like
if (omp_critical_start(tid)) {
*p += 1;
omp_critical_end(tid);
}
__memory_fence()[p]; // May "use" pointer p.
if (omp_critical_start(tid)) {
*p *= 2;
omp_critical_end(tid);
}
We would then proceed to argue with the following improvement of it
if (omp_critical_start(tid)) {
*p += 1;
omp_critical_end(tid);
}
__memory_fence_release()[p]; // May "use" pointer p.
;
__memory_fence_acquire()[p]; // May "use" pointer p.
if (omp_critical_start(tid)) {
*p *= 2;
omp_critical_end(tid);
}
We then might as well talk about improving LLVM IR fence instruction definition.
My preference is to interpret "noalias/restrict to be applicable to all
threads, but only between one fence to the next (per dynamic control flow
sense)". However, if most developers think " noalias/restrict to be
effective to all threads for the entire scope of the variable/parameter"
&& we can make such fence based dependency explicit like in the spirit
of the proposal in [0] (and as I outlined above), it would be good to discuss
the long term pros/cons of the both approaches. For the optimizers, the former
can't really be based on dynamic control flow based. As such, it would end
up in some conservative static scoping based enforcement. The latter, if
implementable, may have a better reflection of the mem-fence requirement and
control flow, which may lead to better optimization.
Thanks,
Hideki
-----Original Message-----
From: Johannes Doerfert <johannesdoerfert at gmail.com>
Sent: Wednesday, January 27, 2021 2:26 PM
To: Kaylor, Andrew <andrew.kaylor at intel.com>
Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma at quicinc.com>
Subject: Re: [llvm-dev] Memory barrier problem
Long story short, we have 3 general options forward:
1) Introduce/Leverage an attribute and make it "stronger" than the
`noalias` guarantee we have right now. This is not a great option because we
would need to scan for intermediate instructions with this property in order to
utilize `noalias`.
2) We go with explicit uses of the `noalias` pointer as proposed in [0].
The downside is that we need to add them when we inline and such. We also need
to add them to synchronizing instructions or declare the behavior of GVN right
now as somewhat sane. So we say that synchronizing instructions in the current
function are stronger than `noalias` but that brings us back to the problem with
1) which I really dislike. I guess, operand bundles for instructions would solve
this in a reasonably nice way.
3) Introduce a weaker version of `noalias` with the drawbacks of 1) but at least
__restrict__ pointers don't inherit the problem (as it is a user error to
synchronize via restrict pointers).
[4) Don't deduce `noalias` if there might be synchronization in the
scope/function. This is what the Attributor does right now:
https://github.com/llvm/llvm-project/blob/fb12df4a8e33d759938057718273dfb434b2d9c4/llvm/lib/Transforms/IPO/AttributorAttributes.cpp#L2458]
I'm interested in other thoughts on this.
~ Johannes
[0] https://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
On 1/27/21 2:34 PM, Kaylor, Andrew wrote:> Thanks, Johannes!
>
> It looks like the bug you were referring to is
https://bugs.llvm.org/show_bug.cgi?id=41781.
>
> I see there that Eli is asserting that 'restrict' (and therefore
'noalias') applies to memory accesses in any thread. I was assuming
otherwise. If I remove the 'noalias' attribute there are no problems
with my example, but this would also mean the potential loss of local
optimizations that would otherwise be possible in more complicated cases.
>
> In the case I was starting from, the 'noalias' attribute was
something our compiler derived based on its knowledge of the SYCL rules. Within
the kernel, we know the pointer appearing as an argument here won't alias
with any other pointers in the kernel, but nothing prevents other instances of
the kernel from modifying the same memory. Hence, the barriers to synchronize
the accesses.
>
> I'll have to read the relevant section of your paper a few more times
to fully grasp what you are saying there. It's clear that you are addressing
the same general problem I'm looking at here. I wasn't clear on first
reading whether you are saying that the restrict/noalias attribute could be
employed for this optimization (possibly with additional constructs to manage
the synchronization) or whether you meant that something entirely different than
restrict/noalias was needed.
>
> -Andy
>
> -----Original Message-----
> From: Johannes Doerfert <johannesdoerfert at gmail.com>
> Sent: Wednesday, January 27, 2021 10:55 AM
> To: Kaylor, Andrew <andrew.kaylor at intel.com>
> Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma at
quicinc.com>
> Subject: Re: [llvm-dev] Memory barrier problem
>
>
> On 1/27/21 11:50 AM, Kaylor, Andrew via llvm-dev wrote:
>> Hi everyone,
>>
>> I have a problem with multi-threaded memory synchronization that
I'd like to get some input on.
>>
>> Consider the following IR:
>>
>> ------------
>>
>> define void @bar() convergent {
>> fence acq_rel
>> ret void
>> }
>>
>> define i32 @foo(i32* noalias %p, i32 %flag) {
>> entry:
>> store i32 0, i32* %p
>> call void @bar()
>> %cmp = icmp eq i32 %flag, 0
>> br i1 %cmp, label %if.then, label %if.end
>>
>> if.then:
>> store i32 1, i32* %p
>> br label %if.end
>>
>> if.end:
>> call void @bar()
>> %x = load i32, i32* %p
>> ret i32 %x
>> }
>>
>> ------------
>>
>> I have an argument (%p) which is marked with the 'noalias'
attribute. The memory pointed to by this argument is read, written, and read
again within the function. Between these accesses, I am calling a function that
contains a fence instruction. If that call with the fence is not inlined, GVN
will eliminate the second load.
>>
>> ------------
>>
>> define i32 @foo(i32* noalias %p, i32 %flag) {
>> entry:
>> store i32 0, i32* %p, align 4
>> call void @bar()
>> %cmp = icmp eq i32 %flag, 0
>> br i1 %cmp, label %if.then, label %if.end
>>
>> if.then:
>> store i32 1, i32* %p, align 4
>> br label %if.end
>>
>> if.end:
>> %x = phi i32 [ 1, %if.then ], [ 0, %entry ] ; <==============
Incorrect
>> call void @bar()
>> ret i32 %x
>> }
>>
>> ------------
>>
>> https://godbolt.org/z/14o8oY
>>
>> This is a reduction of a scenario I've come across in a SYCL
program. The bar() function corresponds to a work group barrier that is meant to
have the memory synchronizing effect described by the fence instruction in my
example. I'm trying to figure out how to construct LLVM IR that will
represent the semantics I need.
>>
>> If I remove the 'noalias' attribute from the argument, GVN
won't make this optimization because it conservatively assumes that the
memory might be modified within the called function. That's fine, but I
think it fixes the problem for the wrong reason. In fact, the memory location is
not modified in the called function and as I understand it the 'noalias'
attribute only guarantees that the memory won't be accessed *in the current
thread* using pointers that aren't based on the 'noalias' pointer.
So, the fact that it might be modified by another thread shouldn't
invalidate the 'noalias' attribute. Is that correct?
>>
>> I can also block the GVN optimization by putting the fence instruction
directly in the foo() function, such as by inlining the call to bar(). But, of
course, the semantics of the IR should not depend on whether or not I've
inlined functions. In this case the inlining is trivial, but the problem
potentially exists for a called function that uses a barrier in a way that is
not so immediately visible.
>>
>> I put the 'convergent' attribute on my bar() function mostly to
demonstrate that this doesn't solve the problem. As I understand it, the
'convergent' attribute describes control flow constraints and says
nothing about memory access synchronization. Is that correct?
>>
>>
>> Is there a way to handle this case? I have some ideas, but I'd like
to start by just posing the question to see if there are better avenues
available than I've considered.
> So far, I don't think we have a proper way to handle this. The
> argument was, the user code is wrong because multiple threads wrote
> the variable which violated restrict. As we added deduction of noalias
> we run into this again. What we proposed, but haven't tried to
> upstream yet, is to provide explicit uses of restrict pointers. See
> Chapter 3 in
> https://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
>
> I'd be very interested in discussing this further, a little short on
time right now though.
>
> ~ Johannes
>
> P.S. There was a bug report with ample of related discussion, I to look for
it again, maybe Eli remembers.
>
>
>> Thanks,
>> Andy
>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev