thr3ads.net - llvm dev - [llvm-dev] Memory barrier problem [Feb 2021]

If this information is useful, please help other people find it:
Share via:

Johannes Doerfert via llvm-dev

2021-Feb-02 20:10 UTC

[llvm-dev] Memory barrier problem

On 2/2/21 2:00 PM, Saito, Hideki wrote:>> I don't really follow this argument, I think I simply don't
understand it. What do you mean by "the use of fence"? Adding explicit
fence instructions somewhere?
> In the context of OpenMP example below, add fence attribute to
omp_critical_start/end().
> For the SPMD programming example below, add fence attribute to bar().
> If we add explicit fence instructions separate from those calls, I
don't think they can be made "quasi atomic" with those calls.
>
> What I'm advocating is that they are cross-thread synchronization
points and thus we should mark them just as such (i.e., fence!). Compiler should
understand how to deal with cross thread synchronization points. For me, doing
something else is a workaround (sorry if it sounds harsh) ---- even if it's
a practical workaround that may be the only deployable solution today.
I guess I just don't get why you use "fence" equivalent to 
"synchronization point". Maybe it's just a language mismatch on my
end.

Again, everything that is not `nosync` is potentially a cross thread 
synchronization point. There is no need to "opt-in" to such a thing,
it's the default.

~ Johannes

>> Since the language of restrict is to this day implying other threads
cannot access those pointers
> If language lawyers are standing firm here (which might require changes in
the definition wording, to better accommodate SPMD programing model), I think we
should lobby for a similar but multi-thread friendly "restrict"-ness
to the C/C++ language standard. Asking programmers to drop
"restrictness" will lead to missed optimization opportunities.
>
> I think I've stated enough of my points. I'd like to hear more from
others.
>
> Thanks,
> Hideki
>
> -----Original Message-----
> From: Johannes Doerfert <johannesdoerfert at gmail.com>
> Sent: Tuesday, February 2, 2021 11:05 AM
> To: Saito, Hideki <hideki.saito at intel.com>; Kaylor, Andrew
<andrew.kaylor at intel.com>
> Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma at
quicinc.com>
> Subject: Re: [llvm-dev] Memory barrier problem
>
>
> On 2/2/21 12:30 PM, Saito, Hideki wrote:
>>> If that is what you want, make `nosync` "required" for
`noalias` "over calls". I mean, everything needs to start as "may
have fence" by default and it's not only "fence", but any
sync instruction. So, `noalias` >pointers are impacted by potentially
synchronizing calls.
>>> I am not sure this is what we want, in parts because it degrades
optimization potential for all restrict pointers, even in the non threaded
environment. I was thinking we keep user `__restrict__` pointer as >they are,
if you break synchronization it's a user fault.
>>> That said, if we derive `noalias` we need to ensure we don't
break synchronization. "Fake uses" is one way but probably not the
only one.
>> I think (and others may disagree) it makes sense to let the programmer
>> decide whether they want a thread safe program or thread unsafe (or
>> rather single thread optimized) program (e.g., a flag, pragma,
>> per-function attribute, etc.). I do not believe compiler can resolve
>> that problem on its own. If we go for that direction, there will be a
>> question of what the default should be and my preference on the
>> default is go conservative (i.e., go safer versus more optimizable).
>> That might mean existing code may get compiled into slower executing
>> code until a new flag ("single thread optimized") is added.
[This is
>> from "threading is already a norm" thinking.]
> Too many nobs is not helpful either. As said before, we should keep
`__restrict__` as it is and deal with "deduced" `noalias` properly.
How we do that is another question though.
>
>
>> My preference on the use of fence comes from the desire to let compiler
implementation match with the theory that you learn at school (Educators, please
chime in) and also how the HW works. If for some reason, fence approach
doesn't work well with LLVM framework today, we should have a good
documentation on how that conclusion was drawn, adopt plan-B, and once in a
while revisit the viability. I'm not totally opposed on the alternatives
since we have a problem that needs a resolution quickly enough. I like
theoretical clarity, but I can't fight for it forever.
> I don't really follow this argument, I think I simply don't
understand it. What do you mean by "the use of fence"? Adding explicit
fence instructions somewhere?
>
>
>> W.r.t. restrict, I'd like to hear more from the language lawyers on
their original intent when the language construct was born and the current
interpretation of it in the presence of threading.
> I would have assumed `__restrict` predates "common"
multi-processing in C. Since the language of restrict is to this day implying
other threads cannot access those pointers, I would not dare to argue we should
weaken it in order to deduce `noalias`.
>
> ~ Johannes
>
>
>> Thanks,
>> Hideki
>>
>> -----Original Message-----
>> From: Johannes Doerfert <johannesdoerfert at gmail.com>
>> Sent: Monday, February 1, 2021 10:08 AM
>> To: Saito, Hideki <hideki.saito at intel.com>; Kaylor, Andrew
>> <andrew.kaylor at intel.com>
>> Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma at
quicinc.com>
>> Subject: Re: [llvm-dev] Memory barrier problem
>>
>>
>> On 2/1/21 11:34 AM, Saito, Hideki wrote:
>>>> I still don't understand why an explicit fence causes LLVM
to avoid the optimization based on `noalias` but an outlined fence doesn't.
That is not consistent *at all*.
>>> Me too.
>>>
>>>> So let's not try to build something on top of that. I
prefer fake uses but a weakened `noalias` can be discussed too.
>>> Maybe, but if we can fix this problem
>>>
>>>>>>>> define void @bar() convergent {
>>>>>>>>           fence acq_rel
>>>>>>>>           ret void
>>>>>>>> }
>>> as below, by an ability to mark any call a potential fence, by a
function attribute, that's what I prefer. It could be that there was an
underlying assumption (or misconception) that the convergent attribute would
suffice, but it apparently does not. Revisiting there seems worthwhile.
>>>
>>> define void @bar() convergent fence.acq_rel {
>>>            fence acq_rel
>>>            ret void
>>> }
>> If that is what you want, make `nosync` "required" for
`noalias` "over calls". I mean, everything needs to start as "may
have fence" by default and it's not only "fence", but any
sync instruction. So, `noalias` pointers are impacted by potentially
synchronizing calls.
>> I am not sure this is what we want, in parts because it degrades
optimization potential for all restrict pointers, even in the non threaded
environment. I was thinking we keep user `__restrict__` pointer as they are, if
you break synchronization it's a user fault.
>> That said, if we derive `noalias` we need to ensure we don't break
synchronization. "Fake uses" is one way but probably not the only one.
>>
>> ~ Johannes
>>
>>
>>> Thanks,
>>> Hideki
>>>
>>> -----Original Message-----
>>> From: Johannes Doerfert <johannesdoerfert at gmail.com>
>>> Sent: Monday, February 1, 2021 6:57 AM
>>> To: Saito, Hideki <hideki.saito at intel.com>; Kaylor, Andrew
>>> <andrew.kaylor at intel.com>
>>> Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma at
quicinc.com>
>>> Subject: Re: [llvm-dev] Memory barrier problem
>>>
>>>
>>> On 2/1/21 1:43 AM, Saito, Hideki wrote:
>>>>> The problem is that not only "bar" but also 
`omp_critical_{start/end}` synchronize threads. We need to add the "fake
uses" to those calls as well.
>>>> Yes, indeed, the following two operations need to be
"atomic".
>>>>
>>>>>>           omp_critical_end(tid);
>>>>>> }
>>>>>> __memory_fence_release()[p]; // May "use"
pointer p.
>>>> and so are
>>>>
>>>>>> __memory_fence_acquire()[p]; // May "use"
pointer p.
>>>>>> if (omp_critical_start(tid)) {
>>>> Now, going back to Andy's example. We either need to make
bar() act
>>>> like a fence, or sandwich it with fences (with "fake
uses" if your
>>>> proposal is adopted) --- from the caller perspective. Letting
bar
>>>> have a fence inside isn't enough, from the caller side
optimization
>>>> purpose. (Fence inside bar() would work from the execution on
HW
>>>> perspective, if compiler hasn't already messed it up.)
>>> The thing is, I still don't understand why an explicit fence
causes LLVM to avoid the optimization based on `noalias` but an outlined fence
doesn't. That is not consistent *at all*. So let's not try to build
something on top of that. I prefer fake uses but a weakened `noalias` can be
discussed too.
>>>
>>> ~ Johannes
>>>
>>>
>>>> Thanks,
>>>> Hideki
>>>>
>>>> -----Original Message-----
>>>> From: Johannes Doerfert <johannesdoerfert at gmail.com>
>>>> Sent: Saturday, January 30, 2021 3:02 PM
>>>> To: Saito, Hideki <hideki.saito at intel.com>; Kaylor,
Andrew
>>>> <andrew.kaylor at intel.com>
>>>> Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma at
quicinc.com>
>>>> Subject: Re: [llvm-dev] Memory barrier problem
>>>>
>>>> Hi Saito,
>>>>
>>>> thanks for pointing this out. I agree now, the reasoning was
not "complete" because I focused on the example too much.
>>>>
>>>>
>>>> On 1/29/21 1:33 PM, Saito, Hideki wrote:
>>>>>> Could you explain why we would need a fence (or
anything for that
>>>>>> matter) if there is no user code bar()?
>>>>>> The way I see it, if there is no user code that might
be a synchronization event, then there is no problem.
>>>>>> The noalias can just apply to the pointer and we know
no other thread will interfere with the memory in the scope because it would be
a race. Does that make sense?
>>>>> As soon as the critical section ends, some other thread can
modify the value of *p (under its critical section), prior to the beginning of
the next critical section. If OS decides to context switch between
omp_critical_end() and omp_critical_start(), that would be eternity from
application SW perspective. Threads aren't executing in lock-step.
>>>> Disclaimer: I thought there is no problem, and I still believe
there
>>>> is none for the example given in the paper and discussed here
before.
>>>> I explain why I think there is no problem in this example
below.
>>>> Afterwards I make the example slightly more complicated and
argue
>>>> why it is now broken if we do not also add "operand bundle
uses" to
>>>> the `om_critical_` calls which can, after all, also synchronize
and
>>>> be observed by the user.
>>>>
>>>> This is only relevant for the example and only serves to
explain why
>>>> I thought there was no problem:
>>>> Even if a second thread updates *p between the first
>>>> `omp_critical_end` and the second `omp_critical_start`, that
even is
>>>> not synchronized with the thread running this method.
>>>> Let's call that thread T0 and the other one T1 and assume
T1 will
>>>> write 5, we then have 3 critical sections to argue about:
>>>>
>>>> T0: A: critical{*p = *p + 1}
>>>>          B: critical{*p = *p * 2}
>>>> T1: critical{*p = 5}
>>>>
>>>> Now we can have 3 different interleavings from a observer
perspective:
>>>>
>>>> 1: T0A T0B T1
>>>> 2: T0A T1 T0B
>>>> 3: T1 T0A T0B
>>>>
>>>> But since there is no synchronization between T0 and T1, the
user
>>>> cannot argue any of them is less correct than the others.
>>>> So if we utilize the `noalias` guarantee to replace the load in
T0B
>>>> with the value stored in T0A, we basically made interleaving
>>>> 2 impossible and it would instead "become"
interleaving 3. However,
>>>> if there is a barrier-like effect between T0A and T0B, the user
can
>>>> determine that T0A should have happened already, basically
>>>> distinguish interleaving 2 from 3. That brings me to the
example
>>>> where this logic falls apart:
>>>>
>>>> if (omp_critical_start(tid)) {
>>>>          *p += 1;
>>>>          *flag = 1
>>>>           omp_critical_end(tid);
>>>> }
>>>> if (omp_critical_start(tid)) {
>>>>          *p *= 2;
>>>>          omp_critical_end(tid);
>>>> }
>>>>
>>>> Since the user can now observe if T0A was already performed, by
>>>> checking `flag`, they can distinguish interleaving 2 and 3
which
>>>> makes the forwarding illegal. The problem is that not only
"bar" but
>>>> also `omp_critical_{start/end}` synchronize threads. We need to
add
>>>> the "fake uses" to those calls as well. This
basically will keep the
>>>> pointer update in the critical section, which is conservatively
correct.
>>>>
>>>> I hope this made some sense.
>>>>
>>>> ~ Johannes
>>>>
>>>>
>>>>> Thanks,
>>>>> Hideki
>>>>>
>>>>> -----Original Message-----
>>>>> From: Johannes Doerfert <johannesdoerfert at
gmail.com>
>>>>> Sent: Friday, January 29, 2021 11:04 AM
>>>>> To: Saito, Hideki <hideki.saito at intel.com>;
Kaylor, Andrew
>>>>> <andrew.kaylor at intel.com>
>>>>> Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma
at quicinc.com>
>>>>> Subject: Re: [llvm-dev] Memory barrier problem
>>>>>
>>>>>
>>>>> On 1/29/21 12:56 PM, Saito, Hideki wrote:
>>>>>> Disclaimer: I've been focusing on vectorization
(single thread
>>>>>> problem) for the last 10+yrs. Please read this with
grain of rust.
>>>>>> 😊
>>>>>> The scheme proposed in the Page 4 Fig 1(d) of [0] may
not work in general problems, as is. Some code may not have bar(). Then, one may
certainly argue that why bother splitting this into two critical sections. There
are values in keeping critical section sizes as small as possible, and it's
not too hard to imagine two such critical sections touching the same variable
along with some other variables.
>>>>>>
>>>>>> if (omp_critical_start(tid)) {
>>>>>>          *p += 1;
>>>>>>           omp_critical_end(tid);
>>>>>> }
>>>>>> bar()[p]; // May "use" pointer p.
>>>>>> if (omp_critical_start(tid)) {
>>>>>>          *p *= 2;
>>>>>>          omp_critical_end(tid);
>>>>>> }
>>>>>>
>>>>>> In the absence of user-code bar(), if we are to take
the spirit of
>>>>>> the proposal, we would inject a pseudo function call
>>>>>> (appropriately name it as __memory_fence() 😊) and it
would look
>>>>>> like
>>>>> Could you explain why we would need a fence (or anything
for that
>>>>> matter) if there is no user code bar()?
>>>>> The way I see it, if there is no user code that might be a
synchronization event, then there is no problem.
>>>>> The noalias can just apply to the pointer and we know no
other thread will interfere with the memory in the scope because it would be a
race. Does that make sense?
>>>>>
>>>>> ~ Johannes
>>>>>
>>>>>
>>>>>> if (omp_critical_start(tid)) {
>>>>>>          *p += 1;
>>>>>>           omp_critical_end(tid);
>>>>>> }
>>>>>> __memory_fence()[p]; // May "use" pointer p.
>>>>>> if (omp_critical_start(tid)) {
>>>>>>          *p *= 2;
>>>>>>          omp_critical_end(tid);
>>>>>> }
>>>>>>
>>>>>> We would then proceed to argue with the following
improvement of
>>>>>> it
>>>>>>
>>>>>> if (omp_critical_start(tid)) {
>>>>>>          *p += 1;
>>>>>>           omp_critical_end(tid);
>>>>>> }
>>>>>> __memory_fence_release()[p]; // May "use"
pointer p.
>>>>>> ;
>>>>>> __memory_fence_acquire()[p]; // May "use"
pointer p.
>>>>>> if (omp_critical_start(tid)) {
>>>>>>          *p *= 2;
>>>>>>          omp_critical_end(tid);
>>>>>> }
>>>>>>
>>>>>> We then might as well talk about improving LLVM IR
fence instruction definition.
>>>>>>
>>>>>> My preference is to interpret "noalias/restrict to
be applicable to all threads, but only between one fence to the next (per
dynamic control flow sense)". However, if most developers think "
noalias/restrict to be effective to all threads for the entire scope of the
variable/parameter" && we can make such fence based dependency
explicit like in the spirit of the proposal in [0] (and as I outlined above), it
would be good to discuss the long term pros/cons of the both approaches. For the
optimizers, the former can't really be based on dynamic control flow based.
As such, it would end up in some conservative static scoping based enforcement.
The latter, if implementable, may have a better reflection of the mem-fence
requirement and control flow, which may lead to better optimization.
>>>>>>
>>>>>> Thanks,
>>>>>> Hideki
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Johannes Doerfert <johannesdoerfert at
gmail.com>
>>>>>> Sent: Wednesday, January 27, 2021 2:26 PM
>>>>>> To: Kaylor, Andrew <andrew.kaylor at intel.com>
>>>>>> Cc: llvm-dev at lists.llvm.org; Eli Friedman
<efriedma at quicinc.com>
>>>>>> Subject: Re: [llvm-dev] Memory barrier problem
>>>>>>
>>>>>> Long story short, we have 3 general options forward:
>>>>>>
>>>>>> 1) Introduce/Leverage an attribute and make it
"stronger" than the `noalias` guarantee we have right now. This is not
a great option because we would need to scan for intermediate instructions with
this property in order to utilize `noalias`.
>>>>>> 2) We go with explicit uses of the `noalias` pointer as
proposed in [0].
>>>>>> The downside is that we need to add them when we inline
and such. We also need to add them to synchronizing instructions or declare the
behavior of GVN right now as somewhat sane. So we say that synchronizing
instructions in the current function are stronger than `noalias` but that brings
us back to the problem with 1) which I really dislike. I guess, operand bundles
for instructions would solve this in a reasonably nice way.
>>>>>> 3) Introduce a weaker version of `noalias` with the
drawbacks of 1) but at least __restrict__ pointers don't inherit the problem
(as it is a user error to synchronize via restrict pointers).
>>>>>> [4) Don't deduce `noalias` if there might be
synchronization in the scope/function. This is what the Attributor does right
now:
>>>>>>
https://github.com/llvm/llvm-project/blob/fb12df4a8e33d75993805771
>>>>>> 8
>>>>>> 2
>>>>>> 73
>>>>>>
dfb434b2d9c4/llvm/lib/Transforms/IPO/AttributorAttributes.cpp#L245
>>>>>> 8
>>>>>> ]
>>>>>>
>>>>>> I'm interested in other thoughts on this.
>>>>>>
>>>>>> ~ Johannes
>>>>>>
>>>>>>
>>>>>> [0]
>>>>>>
https://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 1/27/21 2:34 PM, Kaylor, Andrew wrote:
>>>>>>> Thanks, Johannes!
>>>>>>>
>>>>>>> It looks like the bug you were referring to is
https://bugs.llvm.org/show_bug.cgi?id=41781.
>>>>>>>
>>>>>>> I see there that Eli is asserting that
'restrict' (and therefore 'noalias') applies to memory accesses
in any thread. I was assuming otherwise. If I remove the 'noalias'
attribute there are no problems with my example, but this would also mean the
potential loss of local optimizations that would otherwise be possible in more
complicated cases.
>>>>>>>
>>>>>>> In the case I was starting from, the
'noalias' attribute was something our compiler derived based on its
knowledge of the SYCL rules. Within the kernel, we know the pointer appearing as
an argument here won't alias with any other pointers in the kernel, but
nothing prevents other instances of the kernel from modifying the same memory.
Hence, the barriers to synchronize the accesses.
>>>>>>>
>>>>>>> I'll have to read the relevant section of your
paper a few more times to fully grasp what you are saying there. It's clear
that you are addressing the same general problem I'm looking at here. I
wasn't clear on first reading whether you are saying that the
restrict/noalias attribute could be employed for this optimization (possibly
with additional constructs to manage the synchronization) or whether you meant
that something entirely different than restrict/noalias was needed.
>>>>>>>
>>>>>>> -Andy
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Johannes Doerfert <johannesdoerfert at
gmail.com>
>>>>>>> Sent: Wednesday, January 27, 2021 10:55 AM
>>>>>>> To: Kaylor, Andrew <andrew.kaylor at
intel.com>
>>>>>>> Cc: llvm-dev at lists.llvm.org; Eli Friedman
<efriedma at quicinc.com>
>>>>>>> Subject: Re: [llvm-dev] Memory barrier problem
>>>>>>>
>>>>>>>
>>>>>>> On 1/27/21 11:50 AM, Kaylor, Andrew via llvm-dev
wrote:
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I have a problem with multi-threaded memory
synchronization that I'd like to get some input on.
>>>>>>>>
>>>>>>>> Consider the following IR:
>>>>>>>>
>>>>>>>> ------------
>>>>>>>>
>>>>>>>> define void @bar() convergent {
>>>>>>>>           fence acq_rel
>>>>>>>>           ret void
>>>>>>>> }
>>>>>>>>
>>>>>>>> define i32 @foo(i32* noalias %p, i32 %flag) {
>>>>>>>> entry:
>>>>>>>>           store i32 0, i32* %p
>>>>>>>>           call void @bar()
>>>>>>>>           %cmp = icmp eq i32 %flag, 0
>>>>>>>>           br i1 %cmp, label %if.then, label
%if.end
>>>>>>>>
>>>>>>>> if.then:
>>>>>>>>           store i32 1, i32* %p
>>>>>>>>           br label %if.end
>>>>>>>>
>>>>>>>> if.end:
>>>>>>>>           call void @bar()
>>>>>>>>           %x = load i32, i32* %p
>>>>>>>>           ret i32 %x
>>>>>>>> }
>>>>>>>>
>>>>>>>> ------------
>>>>>>>>
>>>>>>>> I have an argument (%p) which is marked with
the 'noalias' attribute. The memory pointed to by this argument is read,
written, and read again within the function. Between these accesses, I am
calling a function that contains a fence instruction. If that call with the
fence is not inlined, GVN will eliminate the second load.
>>>>>>>>
>>>>>>>> ------------
>>>>>>>>
>>>>>>>> define i32 @foo(i32* noalias %p, i32 %flag) {
>>>>>>>> entry:
>>>>>>>>           store i32 0, i32* %p, align 4
>>>>>>>>           call void @bar()
>>>>>>>>           %cmp = icmp eq i32 %flag, 0
>>>>>>>>           br i1 %cmp, label %if.then, label
%if.end
>>>>>>>>
>>>>>>>> if.then:
>>>>>>>>           store i32 1, i32* %p, align 4
>>>>>>>>           br label %if.end
>>>>>>>>
>>>>>>>> if.end:
>>>>>>>>           %x = phi i32 [ 1, %if.then ], [ 0,
%entry ]   ; <============== Incorrect
>>>>>>>>           call void @bar()
>>>>>>>>           ret i32 %x
>>>>>>>> }
>>>>>>>>
>>>>>>>> ------------
>>>>>>>>
>>>>>>>> https://godbolt.org/z/14o8oY
>>>>>>>>
>>>>>>>> This is a reduction of a scenario I've come
across in a SYCL program. The bar() function corresponds to a work group barrier
that is meant to have the memory synchronizing effect described by the fence
instruction in my example. I'm trying to figure out how to construct LLVM IR
that will represent the semantics I need.
>>>>>>>>
>>>>>>>> If I remove the 'noalias' attribute
from the argument, GVN won't make this optimization because it
conservatively assumes that the memory might be modified within the called
function. That's fine, but I think it fixes the problem for the wrong
reason. In fact, the memory location is not modified in the called function and
as I understand it the 'noalias' attribute only guarantees that the
memory won't be accessed *in the current thread* using pointers that
aren't based on the 'noalias' pointer. So, the fact that it might be
modified by another thread shouldn't invalidate the 'noalias'
attribute. Is that correct?
>>>>>>>>
>>>>>>>> I can also block the GVN optimization by
putting the fence instruction directly in the foo() function, such as by
inlining the call to bar(). But, of course, the semantics of the IR should not
depend on whether or not I've inlined functions. In this case the inlining
is trivial, but the problem potentially exists for a called function that uses a
barrier in a way that is not so immediately visible.
>>>>>>>>
>>>>>>>> I put the 'convergent' attribute on my
bar() function mostly to demonstrate that this doesn't solve the problem. As
I understand it, the 'convergent' attribute describes control flow
constraints and says nothing about memory access synchronization. Is that
correct?
>>>>>>>>
>>>>>>>>
>>>>>>>> Is there a way to handle this case? I have some
ideas, but I'd like to start by just posing the question to see if there are
better avenues available than I've considered.
>>>>>>> So far, I don't think we have a proper way to
handle this. The
>>>>>>> argument was, the user code is wrong because
multiple threads
>>>>>>> wrote the variable which violated restrict. As we
added deduction
>>>>>>> of noalias we run into this again. What we
proposed, but haven't
>>>>>>> tried to upstream yet, is to provide explicit uses
of restrict pointers.
>>>>>>> See Chapter 3 in
>>>>>>>
https://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pd
>>>>>>> f
>>>>>>>
>>>>>>> I'd be very interested in discussing this
further, a little short on time right now though.
>>>>>>>
>>>>>>> ~ Johannes
>>>>>>>
>>>>>>> P.S. There was a bug report with ample of related
discussion, I to look for it again, maybe Eli remembers.
>>>>>>>
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Andy
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> LLVM Developers mailing list
>>>>>>>> llvm-dev at lists.llvm.org
>>>>>>>>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Saito, Hideki via llvm-dev

2021-Feb-02 21:20 UTC

head link

[llvm-dev] Memory barrier problem

>I guess I just don't get why you use "fence" equivalent to
"synchronization point". Maybe it's just a language mismatch on my
end.
I agree there is a difference. Synchronization doesn't necessarily mean
memory based synchronization.
>Again, everything that is not `nosync` is potentially a cross thread
synchronization point. There is no need to "opt-in" to such a thing,
it's the default.
If "lack of nosync" doesn't imply fence (not clear in LangRef), we
need explicit fence semantics (or some other mechanism) to force
write-back/re-read. Hopefully, we do not have to
conservatively generate a HW fence instruction (i.e., somewhere in the callee
code path does it where it is really needed). "Fake use" might force
write-back/re-read, but it won't necessarily imply "communicating with
other threads" (which fence does). That's why I call it a workaround.
If "lack of nosync" implies fence and it is ignored by optimizers due
to the presence of noalias && if we think noalias definition won't
be made multithread friendly, we need to come up with a new flavor of noalias
that is multithread friendly so that we won't lose optimization
opportunities within the code between two "fences".

If you are coming from multithread unfriendly "noalias", what is the
logical explanation to add "fake use" of such noalias pointers to the
call?

Anyway, long story short:
* At language level: Restrict definition is unlikely to change. We need
something new that is thread friendly.
* At LLVM IR level: We need two flavors of noalias. Thread friendly and thread
unfriendly.
* We need to clarify what "lack of nosync" means. Does it imply fence?
If not, how do we force write-back/re-read?

Hideki

-----Original Message-----
From: Johannes Doerfert <johannesdoerfert at gmail.com> 
Sent: Tuesday, February 2, 2021 12:11 PM
To: Saito, Hideki <hideki.saito at intel.com>; Kaylor, Andrew
<andrew.kaylor at intel.com>
Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma at quicinc.com>
Subject: Re: [llvm-dev] Memory barrier problem


On 2/2/21 2:00 PM, Saito, Hideki wrote:>> I don't really follow this argument, I think I simply don't
understand it. What do you mean by "the use of fence"? Adding explicit
fence instructions somewhere?
> In the context of OpenMP example below, add fence attribute to
omp_critical_start/end().
> For the SPMD programming example below, add fence attribute to bar().
> If we add explicit fence instructions separate from those calls, I
don't think they can be made "quasi atomic" with those calls.
>
> What I'm advocating is that they are cross-thread synchronization
points and thus we should mark them just as such (i.e., fence!). Compiler should
understand how to deal with cross thread synchronization points. For me, doing
something else is a workaround (sorry if it sounds harsh) ---- even if it's
a practical workaround that may be the only deployable solution today.
I guess I just don't get why you use "fence" equivalent to
"synchronization point". Maybe it's just a language mismatch on my
end.

Again, everything that is not `nosync` is potentially a cross thread
synchronization point. There is no need to "opt-in" to such a thing,
it's the default.

~ Johannes

>> Since the language of restrict is to this day implying other threads 
>> cannot access those pointers
> If language lawyers are standing firm here (which might require changes in
the definition wording, to better accommodate SPMD programing model), I think we
should lobby for a similar but multi-thread friendly "restrict"-ness
to the C/C++ language standard. Asking programmers to drop
"restrictness" will lead to missed optimization opportunities.
>
> I think I've stated enough of my points. I'd like to hear more from
others.
>
> Thanks,
> Hideki
>
> -----Original Message-----
> From: Johannes Doerfert <johannesdoerfert at gmail.com>
> Sent: Tuesday, February 2, 2021 11:05 AM
> To: Saito, Hideki <hideki.saito at intel.com>; Kaylor, Andrew 
> <andrew.kaylor at intel.com>
> Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma at
quicinc.com>
> Subject: Re: [llvm-dev] Memory barrier problem
>
>
> On 2/2/21 12:30 PM, Saito, Hideki wrote:
>>> If that is what you want, make `nosync` "required" for
`noalias` "over calls". I mean, everything needs to start as "may
have fence" by default and it's not only "fence", but any
sync instruction. So, `noalias` >pointers are impacted by potentially
synchronizing calls.
>>> I am not sure this is what we want, in parts because it degrades
optimization potential for all restrict pointers, even in the non threaded
environment. I was thinking we keep user `__restrict__` pointer as >they are,
if you break synchronization it's a user fault.
>>> That said, if we derive `noalias` we need to ensure we don't
break synchronization. "Fake uses" is one way but probably not the
only one.
>> I think (and others may disagree) it makes sense to let the 
>> programmer decide whether they want a thread safe program or thread 
>> unsafe (or rather single thread optimized) program (e.g., a flag, 
>> pragma, per-function attribute, etc.). I do not believe compiler can 
>> resolve that problem on its own. If we go for that direction, there 
>> will be a question of what the default should be and my preference on 
>> the default is go conservative (i.e., go safer versus more
optimizable).
>> That might mean existing code may get compiled into slower executing 
>> code until a new flag ("single thread optimized") is added.
[This is
>> from "threading is already a norm" thinking.]
> Too many nobs is not helpful either. As said before, we should keep
`__restrict__` as it is and deal with "deduced" `noalias` properly.
How we do that is another question though.
>
>
>> My preference on the use of fence comes from the desire to let compiler
implementation match with the theory that you learn at school (Educators, please
chime in) and also how the HW works. If for some reason, fence approach
doesn't work well with LLVM framework today, we should have a good
documentation on how that conclusion was drawn, adopt plan-B, and once in a
while revisit the viability. I'm not totally opposed on the alternatives
since we have a problem that needs a resolution quickly enough. I like
theoretical clarity, but I can't fight for it forever.
> I don't really follow this argument, I think I simply don't
understand it. What do you mean by "the use of fence"? Adding explicit
fence instructions somewhere?
>
>
>> W.r.t. restrict, I'd like to hear more from the language lawyers on
their original intent when the language construct was born and the current
interpretation of it in the presence of threading.
> I would have assumed `__restrict` predates "common"
multi-processing in C. Since the language of restrict is to this day implying
other threads cannot access those pointers, I would not dare to argue we should
weaken it in order to deduce `noalias`.
>
> ~ Johannes
>
>
>> Thanks,
>> Hideki
>>
>> -----Original Message-----
>> From: Johannes Doerfert <johannesdoerfert at gmail.com>
>> Sent: Monday, February 1, 2021 10:08 AM
>> To: Saito, Hideki <hideki.saito at intel.com>; Kaylor, Andrew 
>> <andrew.kaylor at intel.com>
>> Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma at
quicinc.com>
>> Subject: Re: [llvm-dev] Memory barrier problem
>>
>>
>> On 2/1/21 11:34 AM, Saito, Hideki wrote:
>>>> I still don't understand why an explicit fence causes LLVM
to avoid the optimization based on `noalias` but an outlined fence doesn't.
That is not consistent *at all*.
>>> Me too.
>>>
>>>> So let's not try to build something on top of that. I
prefer fake uses but a weakened `noalias` can be discussed too.
>>> Maybe, but if we can fix this problem
>>>
>>>>>>>> define void @bar() convergent {
>>>>>>>>           fence acq_rel
>>>>>>>>           ret void
>>>>>>>> }
>>> as below, by an ability to mark any call a potential fence, by a
function attribute, that's what I prefer. It could be that there was an
underlying assumption (or misconception) that the convergent attribute would
suffice, but it apparently does not. Revisiting there seems worthwhile.
>>>
>>> define void @bar() convergent fence.acq_rel {
>>>            fence acq_rel
>>>            ret void
>>> }
>> If that is what you want, make `nosync` "required" for
`noalias` "over calls". I mean, everything needs to start as "may
have fence" by default and it's not only "fence", but any
sync instruction. So, `noalias` pointers are impacted by potentially
synchronizing calls.
>> I am not sure this is what we want, in parts because it degrades
optimization potential for all restrict pointers, even in the non threaded
environment. I was thinking we keep user `__restrict__` pointer as they are, if
you break synchronization it's a user fault.
>> That said, if we derive `noalias` we need to ensure we don't break
synchronization. "Fake uses" is one way but probably not the only one.
>>
>> ~ Johannes
>>
>>
>>> Thanks,
>>> Hideki
>>>
>>> -----Original Message-----
>>> From: Johannes Doerfert <johannesdoerfert at gmail.com>
>>> Sent: Monday, February 1, 2021 6:57 AM
>>> To: Saito, Hideki <hideki.saito at intel.com>; Kaylor, Andrew
>>> <andrew.kaylor at intel.com>
>>> Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma at
quicinc.com>
>>> Subject: Re: [llvm-dev] Memory barrier problem
>>>
>>>
>>> On 2/1/21 1:43 AM, Saito, Hideki wrote:
>>>>> The problem is that not only "bar" but also 
`omp_critical_{start/end}` synchronize threads. We need to add the "fake
uses" to those calls as well.
>>>> Yes, indeed, the following two operations need to be
"atomic".
>>>>
>>>>>>           omp_critical_end(tid);
>>>>>> }
>>>>>> __memory_fence_release()[p]; // May "use"
pointer p.
>>>> and so are
>>>>
>>>>>> __memory_fence_acquire()[p]; // May "use"
pointer p.
>>>>>> if (omp_critical_start(tid)) {
>>>> Now, going back to Andy's example. We either need to make
bar() act
>>>> like a fence, or sandwich it with fences (with "fake
uses" if your
>>>> proposal is adopted) --- from the caller perspective. Letting
bar
>>>> have a fence inside isn't enough, from the caller side
optimization
>>>> purpose. (Fence inside bar() would work from the execution on
HW
>>>> perspective, if compiler hasn't already messed it up.)
>>> The thing is, I still don't understand why an explicit fence
causes LLVM to avoid the optimization based on `noalias` but an outlined fence
doesn't. That is not consistent *at all*. So let's not try to build
something on top of that. I prefer fake uses but a weakened `noalias` can be
discussed too.
>>>
>>> ~ Johannes
>>>
>>>
>>>> Thanks,
>>>> Hideki
>>>>
>>>> -----Original Message-----
>>>> From: Johannes Doerfert <johannesdoerfert at gmail.com>
>>>> Sent: Saturday, January 30, 2021 3:02 PM
>>>> To: Saito, Hideki <hideki.saito at intel.com>; Kaylor,
Andrew
>>>> <andrew.kaylor at intel.com>
>>>> Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma at
quicinc.com>
>>>> Subject: Re: [llvm-dev] Memory barrier problem
>>>>
>>>> Hi Saito,
>>>>
>>>> thanks for pointing this out. I agree now, the reasoning was
not "complete" because I focused on the example too much.
>>>>
>>>>
>>>> On 1/29/21 1:33 PM, Saito, Hideki wrote:
>>>>>> Could you explain why we would need a fence (or
anything for that
>>>>>> matter) if there is no user code bar()?
>>>>>> The way I see it, if there is no user code that might
be a synchronization event, then there is no problem.
>>>>>> The noalias can just apply to the pointer and we know
no other thread will interfere with the memory in the scope because it would be
a race. Does that make sense?
>>>>> As soon as the critical section ends, some other thread can
modify the value of *p (under its critical section), prior to the beginning of
the next critical section. If OS decides to context switch between
omp_critical_end() and omp_critical_start(), that would be eternity from
application SW perspective. Threads aren't executing in lock-step.
>>>> Disclaimer: I thought there is no problem, and I still believe 
>>>> there is none for the example given in the paper and discussed
here before.
>>>> I explain why I think there is no problem in this example
below.
>>>> Afterwards I make the example slightly more complicated and
argue
>>>> why it is now broken if we do not also add "operand bundle
uses" to
>>>> the `om_critical_` calls which can, after all, also synchronize
and
>>>> be observed by the user.
>>>>
>>>> This is only relevant for the example and only serves to
explain
>>>> why I thought there was no problem:
>>>> Even if a second thread updates *p between the first 
>>>> `omp_critical_end` and the second `omp_critical_start`, that
even
>>>> is not synchronized with the thread running this method.
>>>> Let's call that thread T0 and the other one T1 and assume
T1 will
>>>> write 5, we then have 3 critical sections to argue about:
>>>>
>>>> T0: A: critical{*p = *p + 1}
>>>>          B: critical{*p = *p * 2}
>>>> T1: critical{*p = 5}
>>>>
>>>> Now we can have 3 different interleavings from a observer
perspective:
>>>>
>>>> 1: T0A T0B T1
>>>> 2: T0A T1 T0B
>>>> 3: T1 T0A T0B
>>>>
>>>> But since there is no synchronization between T0 and T1, the
user
>>>> cannot argue any of them is less correct than the others.
>>>> So if we utilize the `noalias` guarantee to replace the load in
T0B
>>>> with the value stored in T0A, we basically made interleaving
>>>> 2 impossible and it would instead "become"
interleaving 3. However,
>>>> if there is a barrier-like effect between T0A and T0B, the user
can
>>>> determine that T0A should have happened already, basically 
>>>> distinguish interleaving 2 from 3. That brings me to the
example
>>>> where this logic falls apart:
>>>>
>>>> if (omp_critical_start(tid)) {
>>>>          *p += 1;
>>>>          *flag = 1
>>>>           omp_critical_end(tid);
>>>> }
>>>> if (omp_critical_start(tid)) {
>>>>          *p *= 2;
>>>>          omp_critical_end(tid);
>>>> }
>>>>
>>>> Since the user can now observe if T0A was already performed, by
>>>> checking `flag`, they can distinguish interleaving 2 and 3
which
>>>> makes the forwarding illegal. The problem is that not only
"bar"
>>>> but also `omp_critical_{start/end}` synchronize threads. We
need to
>>>> add the "fake uses" to those calls as well. This
basically will
>>>> keep the pointer update in the critical section, which is
conservatively correct.
>>>>
>>>> I hope this made some sense.
>>>>
>>>> ~ Johannes
>>>>
>>>>
>>>>> Thanks,
>>>>> Hideki
>>>>>
>>>>> -----Original Message-----
>>>>> From: Johannes Doerfert <johannesdoerfert at
gmail.com>
>>>>> Sent: Friday, January 29, 2021 11:04 AM
>>>>> To: Saito, Hideki <hideki.saito at intel.com>;
Kaylor, Andrew
>>>>> <andrew.kaylor at intel.com>
>>>>> Cc: llvm-dev at lists.llvm.org; Eli Friedman <efriedma
at quicinc.com>
>>>>> Subject: Re: [llvm-dev] Memory barrier problem
>>>>>
>>>>>
>>>>> On 1/29/21 12:56 PM, Saito, Hideki wrote:
>>>>>> Disclaimer: I've been focusing on vectorization
(single thread
>>>>>> problem) for the last 10+yrs. Please read this with
grain of rust.
>>>>>> 😊
>>>>>> The scheme proposed in the Page 4 Fig 1(d) of [0] may
not work in general problems, as is. Some code may not have bar(). Then, one may
certainly argue that why bother splitting this into two critical sections. There
are values in keeping critical section sizes as small as possible, and it's
not too hard to imagine two such critical sections touching the same variable
along with some other variables.
>>>>>>
>>>>>> if (omp_critical_start(tid)) {
>>>>>>          *p += 1;
>>>>>>           omp_critical_end(tid);
>>>>>> }
>>>>>> bar()[p]; // May "use" pointer p.
>>>>>> if (omp_critical_start(tid)) {
>>>>>>          *p *= 2;
>>>>>>          omp_critical_end(tid);
>>>>>> }
>>>>>>
>>>>>> In the absence of user-code bar(), if we are to take
the spirit
>>>>>> of the proposal, we would inject a pseudo function call
>>>>>> (appropriately name it as __memory_fence() 😊) and it
would look
>>>>>> like
>>>>> Could you explain why we would need a fence (or anything
for that
>>>>> matter) if there is no user code bar()?
>>>>> The way I see it, if there is no user code that might be a
synchronization event, then there is no problem.
>>>>> The noalias can just apply to the pointer and we know no
other thread will interfere with the memory in the scope because it would be a
race. Does that make sense?
>>>>>
>>>>> ~ Johannes
>>>>>
>>>>>
>>>>>> if (omp_critical_start(tid)) {
>>>>>>          *p += 1;
>>>>>>           omp_critical_end(tid);
>>>>>> }
>>>>>> __memory_fence()[p]; // May "use" pointer p.
>>>>>> if (omp_critical_start(tid)) {
>>>>>>          *p *= 2;
>>>>>>          omp_critical_end(tid);
>>>>>> }
>>>>>>
>>>>>> We would then proceed to argue with the following
improvement of
>>>>>> it
>>>>>>
>>>>>> if (omp_critical_start(tid)) {
>>>>>>          *p += 1;
>>>>>>           omp_critical_end(tid);
>>>>>> }
>>>>>> __memory_fence_release()[p]; // May "use"
pointer p.
>>>>>> ;
>>>>>> __memory_fence_acquire()[p]; // May "use"
pointer p.
>>>>>> if (omp_critical_start(tid)) {
>>>>>>          *p *= 2;
>>>>>>          omp_critical_end(tid);
>>>>>> }
>>>>>>
>>>>>> We then might as well talk about improving LLVM IR
fence instruction definition.
>>>>>>
>>>>>> My preference is to interpret "noalias/restrict to
be applicable to all threads, but only between one fence to the next (per
dynamic control flow sense)". However, if most developers think "
noalias/restrict to be effective to all threads for the entire scope of the
variable/parameter" && we can make such fence based dependency
explicit like in the spirit of the proposal in [0] (and as I outlined above), it
would be good to discuss the long term pros/cons of the both approaches. For the
optimizers, the former can't really be based on dynamic control flow based.
As such, it would end up in some conservative static scoping based enforcement.
The latter, if implementable, may have a better reflection of the mem-fence
requirement and control flow, which may lead to better optimization.
>>>>>>
>>>>>> Thanks,
>>>>>> Hideki
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Johannes Doerfert <johannesdoerfert at
gmail.com>
>>>>>> Sent: Wednesday, January 27, 2021 2:26 PM
>>>>>> To: Kaylor, Andrew <andrew.kaylor at intel.com>
>>>>>> Cc: llvm-dev at lists.llvm.org; Eli Friedman
<efriedma at quicinc.com>
>>>>>> Subject: Re: [llvm-dev] Memory barrier problem
>>>>>>
>>>>>> Long story short, we have 3 general options forward:
>>>>>>
>>>>>> 1) Introduce/Leverage an attribute and make it
"stronger" than the `noalias` guarantee we have right now. This is not
a great option because we would need to scan for intermediate instructions with
this property in order to utilize `noalias`.
>>>>>> 2) We go with explicit uses of the `noalias` pointer as
proposed in [0].
>>>>>> The downside is that we need to add them when we inline
and such. We also need to add them to synchronizing instructions or declare the
behavior of GVN right now as somewhat sane. So we say that synchronizing
instructions in the current function are stronger than `noalias` but that brings
us back to the problem with 1) which I really dislike. I guess, operand bundles
for instructions would solve this in a reasonably nice way.
>>>>>> 3) Introduce a weaker version of `noalias` with the
drawbacks of 1) but at least __restrict__ pointers don't inherit the problem
(as it is a user error to synchronize via restrict pointers).
>>>>>> [4) Don't deduce `noalias` if there might be
synchronization in the scope/function. This is what the Attributor does right
now:
>>>>>>
https://github.com/llvm/llvm-project/blob/fb12df4a8e33d7599380577
>>>>>> 1
>>>>>> 8
>>>>>> 2
>>>>>> 73
>>>>>>
dfb434b2d9c4/llvm/lib/Transforms/IPO/AttributorAttributes.cpp#L24
>>>>>> 5
>>>>>> 8
>>>>>> ]
>>>>>>
>>>>>> I'm interested in other thoughts on this.
>>>>>>
>>>>>> ~ Johannes
>>>>>>
>>>>>>
>>>>>> [0]
>>>>>>
https://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pd
>>>>>> f
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 1/27/21 2:34 PM, Kaylor, Andrew wrote:
>>>>>>> Thanks, Johannes!
>>>>>>>
>>>>>>> It looks like the bug you were referring to is
https://bugs.llvm.org/show_bug.cgi?id=41781.
>>>>>>>
>>>>>>> I see there that Eli is asserting that
'restrict' (and therefore 'noalias') applies to memory accesses
in any thread. I was assuming otherwise. If I remove the 'noalias'
attribute there are no problems with my example, but this would also mean the
potential loss of local optimizations that would otherwise be possible in more
complicated cases.
>>>>>>>
>>>>>>> In the case I was starting from, the
'noalias' attribute was something our compiler derived based on its
knowledge of the SYCL rules. Within the kernel, we know the pointer appearing as
an argument here won't alias with any other pointers in the kernel, but
nothing prevents other instances of the kernel from modifying the same memory.
Hence, the barriers to synchronize the accesses.
>>>>>>>
>>>>>>> I'll have to read the relevant section of your
paper a few more times to fully grasp what you are saying there. It's clear
that you are addressing the same general problem I'm looking at here. I
wasn't clear on first reading whether you are saying that the
restrict/noalias attribute could be employed for this optimization (possibly
with additional constructs to manage the synchronization) or whether you meant
that something entirely different than restrict/noalias was needed.
>>>>>>>
>>>>>>> -Andy
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Johannes Doerfert <johannesdoerfert at
gmail.com>
>>>>>>> Sent: Wednesday, January 27, 2021 10:55 AM
>>>>>>> To: Kaylor, Andrew <andrew.kaylor at
intel.com>
>>>>>>> Cc: llvm-dev at lists.llvm.org; Eli Friedman
<efriedma at quicinc.com>
>>>>>>> Subject: Re: [llvm-dev] Memory barrier problem
>>>>>>>
>>>>>>>
>>>>>>> On 1/27/21 11:50 AM, Kaylor, Andrew via llvm-dev
wrote:
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I have a problem with multi-threaded memory
synchronization that I'd like to get some input on.
>>>>>>>>
>>>>>>>> Consider the following IR:
>>>>>>>>
>>>>>>>> ------------
>>>>>>>>
>>>>>>>> define void @bar() convergent {
>>>>>>>>           fence acq_rel
>>>>>>>>           ret void
>>>>>>>> }
>>>>>>>>
>>>>>>>> define i32 @foo(i32* noalias %p, i32 %flag) {
>>>>>>>> entry:
>>>>>>>>           store i32 0, i32* %p
>>>>>>>>           call void @bar()
>>>>>>>>           %cmp = icmp eq i32 %flag, 0
>>>>>>>>           br i1 %cmp, label %if.then, label
%if.end
>>>>>>>>
>>>>>>>> if.then:
>>>>>>>>           store i32 1, i32* %p
>>>>>>>>           br label %if.end
>>>>>>>>
>>>>>>>> if.end:
>>>>>>>>           call void @bar()
>>>>>>>>           %x = load i32, i32* %p
>>>>>>>>           ret i32 %x
>>>>>>>> }
>>>>>>>>
>>>>>>>> ------------
>>>>>>>>
>>>>>>>> I have an argument (%p) which is marked with
the 'noalias' attribute. The memory pointed to by this argument is read,
written, and read again within the function. Between these accesses, I am
calling a function that contains a fence instruction. If that call with the
fence is not inlined, GVN will eliminate the second load.
>>>>>>>>
>>>>>>>> ------------
>>>>>>>>
>>>>>>>> define i32 @foo(i32* noalias %p, i32 %flag) {
>>>>>>>> entry:
>>>>>>>>           store i32 0, i32* %p, align 4
>>>>>>>>           call void @bar()
>>>>>>>>           %cmp = icmp eq i32 %flag, 0
>>>>>>>>           br i1 %cmp, label %if.then, label
%if.end
>>>>>>>>
>>>>>>>> if.then:
>>>>>>>>           store i32 1, i32* %p, align 4
>>>>>>>>           br label %if.end
>>>>>>>>
>>>>>>>> if.end:
>>>>>>>>           %x = phi i32 [ 1, %if.then ], [ 0,
%entry ]   ; <============== Incorrect
>>>>>>>>           call void @bar()
>>>>>>>>           ret i32 %x
>>>>>>>> }
>>>>>>>>
>>>>>>>> ------------
>>>>>>>>
>>>>>>>> https://godbolt.org/z/14o8oY
>>>>>>>>
>>>>>>>> This is a reduction of a scenario I've come
across in a SYCL program. The bar() function corresponds to a work group barrier
that is meant to have the memory synchronizing effect described by the fence
instruction in my example. I'm trying to figure out how to construct LLVM IR
that will represent the semantics I need.
>>>>>>>>
>>>>>>>> If I remove the 'noalias' attribute
from the argument, GVN won't make this optimization because it
conservatively assumes that the memory might be modified within the called
function. That's fine, but I think it fixes the problem for the wrong
reason. In fact, the memory location is not modified in the called function and
as I understand it the 'noalias' attribute only guarantees that the
memory won't be accessed *in the current thread* using pointers that
aren't based on the 'noalias' pointer. So, the fact that it might be
modified by another thread shouldn't invalidate the 'noalias'
attribute. Is that correct?
>>>>>>>>
>>>>>>>> I can also block the GVN optimization by
putting the fence instruction directly in the foo() function, such as by
inlining the call to bar(). But, of course, the semantics of the IR should not
depend on whether or not I've inlined functions. In this case the inlining
is trivial, but the problem potentially exists for a called function that uses a
barrier in a way that is not so immediately visible.
>>>>>>>>
>>>>>>>> I put the 'convergent' attribute on my
bar() function mostly to demonstrate that this doesn't solve the problem. As
I understand it, the 'convergent' attribute describes control flow
constraints and says nothing about memory access synchronization. Is that
correct?
>>>>>>>>
>>>>>>>>
>>>>>>>> Is there a way to handle this case? I have some
ideas, but I'd like to start by just posing the question to see if there are
better avenues available than I've considered.
>>>>>>> So far, I don't think we have a proper way to
handle this. The
>>>>>>> argument was, the user code is wrong because
multiple threads
>>>>>>> wrote the variable which violated restrict. As we
added
>>>>>>> deduction of noalias we run into this again. What
we proposed,
>>>>>>> but haven't tried to upstream yet, is to
provide explicit uses of restrict pointers.
>>>>>>> See Chapter 3 in
>>>>>>>
https://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.p
>>>>>>> d
>>>>>>> f
>>>>>>>
>>>>>>> I'd be very interested in discussing this
further, a little short on time right now though.
>>>>>>>
>>>>>>> ~ Johannes
>>>>>>>
>>>>>>> P.S. There was a bug report with ample of related
discussion, I to look for it again, maybe Eli remembers.
>>>>>>>
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Andy
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> LLVM Developers mailing list
>>>>>>>> llvm-dev at lists.llvm.org
>>>>>>>>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

llvm dev - Feb 2021 - Memory barrier problem

[llvm-dev] Memory barrier problem

[llvm-dev] Memory barrier problem