Kostya Serebryany
2013-Nov-20 05:01 UTC
[LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with MemorySanitizer)
On Tue, Nov 19, 2013 at 10:20 PM, David Chisnall <David.Chisnall at cl.cam.ac.uk> wrote:

> On 19 Nov 2013, at 17:58, Kostya Serebryany <kcc at google.com> wrote:
> > On Tue, Nov 19, 2013 at 9:56 PM, Kuperstein, Michael M <michael.m.kuperstein at intel.com> wrote:
> >> What I’m trying to say is that according to my understanding of the
> >> C++11 memory model, even in that small reproducer, the store to g and
> >> the load from g are in fact a data race.
> >>
> >> (This is regardless of the fact the load is protected by a branch that
> >> is not taken.)
> >
> > My understanding of the standard is quite the opposite.
>
> I agree. It is a potential data race: if the branch is taken then it
> definitely is a race, because the standard explicitly does not define an
> ordering on loads and stores of non-atomic variables unless a
> happens-before relationship is established by some other operation (which
> does not occur in this example). In the case where the branch is not
> taken, there is no data race, because in the C++ abstract machine there
> is no load, whether or not there is one in the underlying concrete machine.
>
> I believe that this transform is valid, because there are only two
> possible cases here:
>
> - Some other code has established a happens-before relationship between
> the load and the stores, giving a well-defined ordering; however, this
> code is not visible in the function foo(), and so the load may happen
> anywhere.
>
> - No other code establishes a happens-before relationship, and therefore
> the order of the load and the store is completely undefined; as long as
> the load doesn't trap, it is completely safe to hoist it.
>
> HOWEVER, I would dispute describing this transform as an optimisation in
> this case.

You have a good point: evaluating such a transformation on single-threaded
benchmarks (e.g. SPEC) makes no sense if we care about multi-threaded code.
And evaluating the performance of anything like this on multi-threaded code
is hard too.
And it's very easy to prepare a synthetic test case where this optimization
causes a 4x slowdown (see below).
But if we disable this transformation someone will surely come and complain :)
Another data point: gcc 4.6.3 does this too.

==> if-bench.cc <==
#include <pthread.h>
#include <unistd.h>

extern int foo(int);
extern void bar(int);

#ifndef N
#define N (1UL << 30)
#endif

void *Thread(void *p) {
  for (long i = 0; i < N; i++)
    foo(0);
  return 0;
}

int main() {
  pthread_t t[3];
  pthread_create(&t[0], 0, Thread, 0);
  pthread_create(&t[1], 0, Thread, 0);
  pthread_create(&t[2], 0, Thread, 0);
  for (long i = 0; i < N; i++)
    bar(i);
  pthread_join(t[0], 0);
  pthread_join(t[1], 0);
  pthread_join(t[2], 0);
}

==> g.cc <==
int g;

int foo(int cond) {
  if (cond)
    return g;
  // If we replace 'g' with g*g*g*g, the benchmark becomes 4x faster.
  return 0;
}

void bar(int i) { g = i; }

% clang if-bench.cc g.cc -lpthread -O1 && time ./a.out

--kcc

> If the two threads are running on different cores, then we have likely
> introduced some false sharing and have forced the two cores to exchange
> some bus traffic for cache coherency, even though it is completely valid
> in the C++ abstract machine model for the load of g to return a stale
> cached value. This is only a real problem on strongly ordered
> architectures (e.g. x86), but given the relative cost of a cache
> shootdown and everything else in this test case (with the exception of
> the thread creation), I wouldn't be surprised if it ended up slowing
> things down. Especially given that a modern superscalar CPU will
> speculatively execute the load ANYWAY if it can do so from cache, and if
> it can't, then the performance improvement from doing it before the
> branch will likely be negligible.
>
> For single-core, in-order, single-issue architectures, or multicore,
> weakly ordered, in-order, single-issue architectures, it's probably a
> win...
>
> David
Kostya Serebryany
2013-Nov-20 11:04 UTC
[LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with MemorySanitizer)
FTR: http://llvm-reviews.chandlerc.com/D2227 sent for review to fix the
tsan-hostile load speculation.

--kcc

On Wed, Nov 20, 2013 at 9:01 AM, Kostya Serebryany <kcc at google.com> wrote:
> [...]
Hal Finkel
2013-Nov-21 07:14 UTC
[LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with MemorySanitizer)
----- Original Message -----
> From: "Kostya Serebryany" <kcc at google.com>
> To: "David Chisnall" <David.Chisnall at cl.cam.ac.uk>
> Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Tuesday, November 19, 2013 11:01:12 PM
> Subject: Re: [LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with MemorySanitizer)
>
> [...]
>
> You have a good point: evaluating such a transformation on single-threaded
> benchmarks (e.g. SPEC) makes no sense if we care about multi-threaded
> code. And evaluating the performance of anything like this on
> multi-threaded code is hard too.
> And it's very easy to prepare a synthetic test case where this
> optimization will cause a 4x slowdown (see below).

Out of curiosity, can you explain why? Is this cache-line bouncing? Or just an increase in overall memory-bandwidth use?

 -Hal

> [...]

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
Kostya Serebryany
2013-Nov-21 07:38 UTC
[LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with MemorySanitizer)
On Thu, Nov 21, 2013 at 11:14 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> [...]
> > And it's very easy to prepare a synthetic test case where this
> > optimization will cause a 4x slowdown (see below).
>
> Out of curiosity, can you explain why? Is this cache-line bouncing? Or
> just an increase in overall memory-bandwidth use?

Exactly, this is cache-line bouncing.

> [...]
> -Hal