>I still don't understand why an explicit fence causes LLVM to avoid the
optimization based on `noalias` but an outlined fence doesn't. That is not
consistent *at all*.
Me too.
>So let's not try to build something on top of that. I prefer fake uses
but a weakened `noalias` can be discussed too.
Maybe, but if we can fix this problem
>>>>> define void @bar() convergent {
>>>>> fence acq_rel
>>>>> ret void
>>>>> }
as below, by an ability to mark any call a potential fence, by a function
attribute, that's what I prefer. It could be that there was an underlying
assumption (or misconception) that the convergent attribute would suffice, but
it apparently does not. Revisiting there seems worthwhile.
define void @bar() convergent fence.acq_rel {
fence acq_rel
ret void
}
Thanks,
Hideki
-----Original Message-----
From: Johannes Doerfert <johannesdoerfert at gmail.com>
Sent: Monday, February 1, 2021 6:57 AM
To: Saito, Hideki <hideki.saito at intel.com>; Kaylor, Andrew
<andrew.kaylor at intel.com>
Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma at quicinc.com>
Subject: Re: [llvm-dev] Memory barrier problem
On 2/1/21 1:43 AM, Saito, Hideki wrote:>> The problem is that not only "bar" but also
`omp_critical_{start/end}` synchronize threads. We need to add the "fake
uses" to those calls as well.
> Yes, indeed, the following two operations need to be "atomic".
>
>>> omp_critical_end(tid);
>>> }
>>> __memory_fence_release()[p]; // May "use" pointer p.
> and so are
>
>>> __memory_fence_acquire()[p]; // May "use" pointer p.
>>> if (omp_critical_start(tid)) {
> Now, going back to Andy's example. We either need to make bar() act
> like a fence, or sandwich it with fences (with "fake uses" if
your
> proposal is adopted) --- from the caller perspective. Letting bar have
> a fence inside isn't enough, from the caller side optimization
> purpose. (Fence inside bar() would work from the execution on HW
> perspective, if compiler hasn't already messed it up.)
The thing is, I still don't understand why an explicit fence causes LLVM to
avoid the optimization based on `noalias` but an outlined fence doesn't.
That is not consistent *at all*. So let's not try to build something on top
of that. I prefer fake uses but a weakened `noalias` can be discussed too.
~ Johannes
>
> Thanks,
> Hideki
>
> -----Original Message-----
> From: Johannes Doerfert <johannesdoerfert at gmail.com>
> Sent: Saturday, January 30, 2021 3:02 PM
> To: Saito, Hideki <hideki.saito at intel.com>; Kaylor, Andrew
> <andrew.kaylor at intel.com>
> Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma at
quicinc.com>
> Subject: Re: [llvm-dev] Memory barrier problem
>
> Hi Saito,
>
> thanks for pointing this out. I agree now, the reasoning was not
"complete" because I focused on the example too much.
>
>
> On 1/29/21 1:33 PM, Saito, Hideki wrote:
>>> Could you explain why we would need a fence (or anything for that
>>> matter) if there is no user code bar()?
>>> The way I see it, if there is no user code that might be a
synchronization event, then there is no problem.
>>> The noalias can just apply to the pointer and we know no other
thread will interfere with the memory in the scope because it would be a race.
Does that make sense?
>> As soon as the critical section ends, some other thread can modify the
value of *p (under its critical section), prior to the beginning of the next
critical section. If OS decides to context switch between omp_critical_end() and
omp_critical_start(), that would be eternity from application SW perspective.
Threads aren't executing in lock-step.
> Disclaimer: I thought there is no problem, and I still believe there
> is none for the example given in the paper and discussed here before.
> I explain why I think there is no problem in this example below.
> Afterwards I make the example slightly more complicated and argue why
> it is now broken if we do not also add "operand bundle uses" to
the
> `om_critical_` calls which can, after all, also synchronize and be
> observed by the user.
>
> This is only relevant for the example and only serves to explain why I
> thought there was no problem:
> Even if a second thread updates *p between the first
> `omp_critical_end` and the second `omp_critical_start`, that even is
> not synchronized with the thread running this method.
> Let's call that thread T0 and the other one T1 and assume T1 will
> write 5, we then have 3 critical sections to argue about:
>
> T0: A: critical{*p = *p + 1}
> B: critical{*p = *p * 2}
> T1: critical{*p = 5}
>
> Now we can have 3 different interleavings from a observer perspective:
>
> 1: T0A T0B T1
> 2: T0A T1 T0B
> 3: T1 T0A T0B
>
> But since there is no synchronization between T0 and T1, the user
> cannot argue any of them is less correct than the others.
> So if we utilize the `noalias` guarantee to replace the load in T0B
> with the value stored in T0A, we basically made interleaving
> 2 impossible and it would instead "become" interleaving 3.
However, if
> there is a barrier-like effect between T0A and T0B, the user can
> determine that T0A should have happened already, basically distinguish
> interleaving 2 from 3. That brings me to the example where this logic
> falls apart:
>
> if (omp_critical_start(tid)) {
> *p += 1;
> *flag = 1
> omp_critical_end(tid);
> }
> if (omp_critical_start(tid)) {
> *p *= 2;
> omp_critical_end(tid);
> }
>
> Since the user can now observe if T0A was already performed, by
> checking `flag`, they can distinguish interleaving 2 and 3 which makes
> the forwarding illegal. The problem is that not only "bar" but
also
> `omp_critical_{start/end}` synchronize threads. We need to add the
> "fake uses" to those calls as well. This basically will keep the
> pointer update in the critical section, which is conservatively correct.
>
> I hope this made some sense.
>
> ~ Johannes
>
>
>> Thanks,
>> Hideki
>>
>> -----Original Message-----
>> From: Johannes Doerfert <johannesdoerfert at gmail.com>
>> Sent: Friday, January 29, 2021 11:04 AM
>> To: Saito, Hideki <hideki.saito at intel.com>; Kaylor, Andrew
>> <andrew.kaylor at intel.com>
>> Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma at
quicinc.com>
>> Subject: Re: [llvm-dev] Memory barrier problem
>>
>>
>> On 1/29/21 12:56 PM, Saito, Hideki wrote:
>>> Disclaimer: I've been focusing on vectorization (single thread
>>> problem) for the last 10+yrs. Please read this with grain of rust.
😊
>>> >>>
>>> The scheme proposed in the Page 4 Fig 1(d) of [0] may not work in
general problems, as is. Some code may not have bar(). Then, one may certainly
argue that why bother splitting this into two critical sections. There are
values in keeping critical section sizes as small as possible, and it's not
too hard to imagine two such critical sections touching the same variable along
with some other variables.
>>>
>>> if (omp_critical_start(tid)) {
>>> *p += 1;
>>> omp_critical_end(tid);
>>> }
>>> bar()[p]; // May "use" pointer p.
>>> if (omp_critical_start(tid)) {
>>> *p *= 2;
>>> omp_critical_end(tid);
>>> }
>>>
>>> In the absence of user-code bar(), if we are to take the spirit of
>>> the proposal, we would inject a pseudo function call (appropriately
>>> name it as __memory_fence() 😊) and it would look like
>> Could you explain why we would need a fence (or anything for that
>> matter) if there is no user code bar()?
>> The way I see it, if there is no user code that might be a
synchronization event, then there is no problem.
>> The noalias can just apply to the pointer and we know no other thread
will interfere with the memory in the scope because it would be a race. Does
that make sense?
>>
>> ~ Johannes
>>
>>
>>> if (omp_critical_start(tid)) {
>>> *p += 1;
>>> omp_critical_end(tid);
>>> }
>>> __memory_fence()[p]; // May "use" pointer p.
>>> if (omp_critical_start(tid)) {
>>> *p *= 2;
>>> omp_critical_end(tid);
>>> }
>>>
>>> We would then proceed to argue with the following improvement of it
>>>
>>> if (omp_critical_start(tid)) {
>>> *p += 1;
>>> omp_critical_end(tid);
>>> }
>>> __memory_fence_release()[p]; // May "use" pointer p.
>>> ;
>>> __memory_fence_acquire()[p]; // May "use" pointer p.
>>> if (omp_critical_start(tid)) {
>>> *p *= 2;
>>> omp_critical_end(tid);
>>> }
>>>
>>> We then might as well talk about improving LLVM IR fence
instruction definition.
>>>
>>> My preference is to interpret "noalias/restrict to be
applicable to all threads, but only between one fence to the next (per dynamic
control flow sense)". However, if most developers think "
noalias/restrict to be effective to all threads for the entire scope of the
variable/parameter" && we can make such fence based dependency
explicit like in the spirit of the proposal in [0] (and as I outlined above), it
would be good to discuss the long term pros/cons of the both approaches. For the
optimizers, the former can't really be based on dynamic control flow based.
As such, it would end up in some conservative static scoping based enforcement.
The latter, if implementable, may have a better reflection of the mem-fence
requirement and control flow, which may lead to better optimization.
>>>
>>> Thanks,
>>> Hideki
>>>
>>> -----Original Message-----
>>> From: Johannes Doerfert <johannesdoerfert at gmail.com>
>>> Sent: Wednesday, January 27, 2021 2:26 PM
>>> To: Kaylor, Andrew <andrew.kaylor at intel.com>
>>> Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma at
quicinc.com>
>>> Subject: Re: [llvm-dev] Memory barrier problem
>>>
>>> Long story short, we have 3 general options forward:
>>>
>>> 1) Introduce/Leverage an attribute and make it "stronger"
than the `noalias` guarantee we have right now. This is not a great option
because we would need to scan for intermediate instructions with this property
in order to utilize `noalias`.
>>> 2) We go with explicit uses of the `noalias` pointer as proposed in
[0].
>>> The downside is that we need to add them when we inline and such.
We also need to add them to synchronizing instructions or declare the behavior
of GVN right now as somewhat sane. So we say that synchronizing instructions in
the current function are stronger than `noalias` but that brings us back to the
problem with 1) which I really dislike. I guess, operand bundles for
instructions would solve this in a reasonably nice way.
>>> 3) Introduce a weaker version of `noalias` with the drawbacks of 1)
but at least __restrict__ pointers don't inherit the problem (as it is a
user error to synchronize via restrict pointers).
>>> [4) Don't deduce `noalias` if there might be synchronization in
the scope/function. This is what the Attributor does right now:
>>>
https://github.com/llvm/llvm-project/blob/fb12df4a8e33d7599380577182
>>> 73
>>>
dfb434b2d9c4/llvm/lib/Transforms/IPO/AttributorAttributes.cpp#L2458]
>>>
>>> I'm interested in other thoughts on this.
>>>
>>> ~ Johannes
>>>
>>>
>>> [0]
>>> https://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
>>>
>>>
>>>
>>>
>>> On 1/27/21 2:34 PM, Kaylor, Andrew wrote:
>>>> Thanks, Johannes!
>>>>
>>>> It looks like the bug you were referring to is
https://bugs.llvm.org/show_bug.cgi?id=41781.
>>>>
>>>> I see there that Eli is asserting that 'restrict' (and
therefore 'noalias') applies to memory accesses in any thread. I was
assuming otherwise. If I remove the 'noalias' attribute there are no
problems with my example, but this would also mean the potential loss of local
optimizations that would otherwise be possible in more complicated cases.
>>>>
>>>> In the case I was starting from, the 'noalias'
attribute was something our compiler derived based on its knowledge of the SYCL
rules. Within the kernel, we know the pointer appearing as an argument here
won't alias with any other pointers in the kernel, but nothing prevents
other instances of the kernel from modifying the same memory. Hence, the
barriers to synchronize the accesses.
>>>>
>>>> I'll have to read the relevant section of your paper a few
more times to fully grasp what you are saying there. It's clear that you are
addressing the same general problem I'm looking at here. I wasn't clear
on first reading whether you are saying that the restrict/noalias attribute
could be employed for this optimization (possibly with additional constructs to
manage the synchronization) or whether you meant that something entirely
different than restrict/noalias was needed.
>>>>
>>>> -Andy
>>>>
>>>> -----Original Message-----
>>>> From: Johannes Doerfert <johannesdoerfert at gmail.com>
>>>> Sent: Wednesday, January 27, 2021 10:55 AM
>>>> To: Kaylor, Andrew <andrew.kaylor at intel.com>
>>>> Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma at
quicinc.com>
>>>> Subject: Re: [llvm-dev] Memory barrier problem
>>>>
>>>>
>>>> On 1/27/21 11:50 AM, Kaylor, Andrew via llvm-dev wrote:
>>>>> Hi everyone,
>>>>>
>>>>> I have a problem with multi-threaded memory synchronization
that I'd like to get some input on.
>>>>>
>>>>> Consider the following IR:
>>>>>
>>>>> ------------
>>>>>
>>>>> define void @bar() convergent {
>>>>> fence acq_rel
>>>>> ret void
>>>>> }
>>>>>
>>>>> define i32 @foo(i32* noalias %p, i32 %flag) {
>>>>> entry:
>>>>> store i32 0, i32* %p
>>>>> call void @bar()
>>>>> %cmp = icmp eq i32 %flag, 0
>>>>> br i1 %cmp, label %if.then, label %if.end
>>>>>
>>>>> if.then:
>>>>> store i32 1, i32* %p
>>>>> br label %if.end
>>>>>
>>>>> if.end:
>>>>> call void @bar()
>>>>> %x = load i32, i32* %p
>>>>> ret i32 %x
>>>>> }
>>>>>
>>>>> ------------
>>>>>
>>>>> I have an argument (%p) which is marked with the
'noalias' attribute. The memory pointed to by this argument is read,
written, and read again within the function. Between these accesses, I am
calling a function that contains a fence instruction. If that call with the
fence is not inlined, GVN will eliminate the second load.
>>>>>
>>>>> ------------
>>>>>
>>>>> define i32 @foo(i32* noalias %p, i32 %flag) {
>>>>> entry:
>>>>> store i32 0, i32* %p, align 4
>>>>> call void @bar()
>>>>> %cmp = icmp eq i32 %flag, 0
>>>>> br i1 %cmp, label %if.then, label %if.end
>>>>>
>>>>> if.then:
>>>>> store i32 1, i32* %p, align 4
>>>>> br label %if.end
>>>>>
>>>>> if.end:
>>>>> %x = phi i32 [ 1, %if.then ], [ 0, %entry ] ;
<============== Incorrect
>>>>> call void @bar()
>>>>> ret i32 %x
>>>>> }
>>>>>
>>>>> ------------
>>>>>
>>>>> https://godbolt.org/z/14o8oY
>>>>>
>>>>> This is a reduction of a scenario I've come across in a
SYCL program. The bar() function corresponds to a work group barrier that is
meant to have the memory synchronizing effect described by the fence instruction
in my example. I'm trying to figure out how to construct LLVM IR that will
represent the semantics I need.
>>>>>
>>>>> If I remove the 'noalias' attribute from the
argument, GVN won't make this optimization because it conservatively assumes
that the memory might be modified within the called function. That's fine,
but I think it fixes the problem for the wrong reason. In fact, the memory
location is not modified in the called function and as I understand it the
'noalias' attribute only guarantees that the memory won't be
accessed *in the current thread* using pointers that aren't based on the
'noalias' pointer. So, the fact that it might be modified by another
thread shouldn't invalidate the 'noalias' attribute. Is that
correct?
>>>>>
>>>>> I can also block the GVN optimization by putting the fence
instruction directly in the foo() function, such as by inlining the call to
bar(). But, of course, the semantics of the IR should not depend on whether or
not I've inlined functions. In this case the inlining is trivial, but the
problem potentially exists for a called function that uses a barrier in a way
that is not so immediately visible.
>>>>>
>>>>> I put the 'convergent' attribute on my bar()
function mostly to demonstrate that this doesn't solve the problem. As I
understand it, the 'convergent' attribute describes control flow
constraints and says nothing about memory access synchronization. Is that
correct?
>>>>>
>>>>>
>>>>> Is there a way to handle this case? I have some ideas, but
I'd like to start by just posing the question to see if there are better
avenues available than I've considered.
>>>> So far, I don't think we have a proper way to handle this.
The
>>>> argument was, the user code is wrong because multiple threads
wrote
>>>> the variable which violated restrict. As we added deduction of
>>>> noalias we run into this again. What we proposed, but
haven't tried
>>>> to upstream yet, is to provide explicit uses of restrict
pointers.
>>>> See Chapter 3 in
>>>>
https://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
>>>>
>>>> I'd be very interested in discussing this further, a little
short on time right now though.
>>>>
>>>> ~ Johannes
>>>>
>>>> P.S. There was a bug report with ample of related discussion, I
to look for it again, maybe Eli remembers.
>>>>
>>>>
>>>>> Thanks,
>>>>> Andy
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev